data cleaning - Regex that searches for sentences that exclude one word


Hi guys,

I'm creating a corpus composed of tweets that contain the keyword "catastrophic", in XML format. Each tweet is embedded like this:

<tweet>"catastrophic loss" @ tennessee's zoo knoxville 33 reptiles found dead </tweet> <tweet>overcoming catastrophic forgetting incremental moment matching, lee et al.</tweet>

After trimming tons of unnecessary data, there are still 200+ tweets that don't contain the keyword at all. I'd like to delete them and tried a regex like this, but it didn't work:

<tweet>^.*(?!catastrophic).*$</tweet> 

Does anyone have an idea?

I'm not sure which programming language or other toolset you're using.

But a quite simple approach might be to rewrite the file (or whatever kind of input it is) using a filter that only writes the entries that contain catastrophic:

Assuming the file has 1 line per tweet (just to illustrate the idea):

egrep '<tweet>.*catastrophic.*</tweet>' originalfile > newfile 
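
If the tweets are not one per line (as in the sample from the question), the same filter could be sketched in Python. This is only a rough sketch under assumptions: the function name `keep_matching_tweets` and the sample string are made up for illustration, and it assumes tweets never contain a nested `<tweet>` tag. It also hints at why the regex from the question matches everything: in `^.*(?!catastrophic).*$` the first `.*` can swallow the whole line, and at the end of the line the lookahead trivially succeeds. Putting the lookahead right after the anchor, as in `^(?!.*catastrophic).*$`, is the usual way to say "does not contain".

```python
import re

# Matches one <tweet>...</tweet> element; non-greedy so adjacent tweets stay separate.
TWEET_RE = re.compile(r"<tweet>(.*?)</tweet>", re.DOTALL)

# "Does not contain the keyword": the negative lookahead is checked once, at the start.
NO_KEYWORD_RE = re.compile(r"^(?!.*catastrophic).*$", re.DOTALL)


def keep_matching_tweets(text):
    """Return the <tweet> elements whose body contains 'catastrophic' (hypothetical helper)."""
    kept = []
    for match in TWEET_RE.finditer(text):
        body = match.group(1)
        # Drop the tweet when the "does not contain" pattern matches its body.
        if not NO_KEYWORD_RE.match(body):
            kept.append(match.group(0))
    return kept


# Made-up sample input: three tweets on one line, the middle one lacks the keyword.
sample = (
    '<tweet>"catastrophic loss" @ tennessee\'s zoo knoxville</tweet> '
    "<tweet>nothing relevant here</tweet> "
    "<tweet>overcoming catastrophic forgetting</tweet>"
)
filtered = keep_matching_tweets(sample)
print("\n".join(filtered))
```

Like the egrep line, this match is case-sensitive. A plain `"catastrophic" in body` test would do the same job here; the lookahead is only kept to show the corrected form of the original regex.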
