data cleaning - Regex that searches for sentences that exclude one word


Hi guys,

I'm creating a corpus composed of tweets that contain the keyword "catastrophic", in XML format. Each tweet is embedded like this:

<tweet>"catastrophic loss" @ tennessee's zoo knoxville 33 reptiles found dead </tweet> <tweet>overcoming catastrophic forgetting incremental moment matching, lee et al.</tweet>

After trimming tons of unnecessary data, there are still 200+ tweets that don't contain the keyword at all. I'd like to delete them, and I tried a regex like this, but it didn't work:

<tweet>^.*(?!catastrophic).*$</tweet> 

Does anyone have an idea?

I'm not sure which programming language or other toolset you are using.

But a quite simple approach might be to re-write the file (or whatever kind of input it is) using a filter that only writes the entries that contain "catastrophic".

Assuming a file with 1 line per tweet (just to illustrate the idea):

egrep '<tweet>.*catastrophic.*</tweet>' originalfile > newfile 
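
If the tweets are not one per line (in the sample from the question, two sit on the same line), a short script may be easier than grep. The following is only a sketch in Python, which the question does not mention; the name `keep_keyword_tweets`, the `DROP_RE` pattern and the "unrelated" tweet in the sample are made up for illustration. It also shows a tempered negative lookahead, which is probably what the original pattern was aiming for: a bare `(?!catastrophic)` only checks a single position, so the `.*` on either side can still swallow the word.

```python
import re

# One <tweet>...</tweet> element; non-greedy so that neighbouring tweets
# on the same line are matched one at a time.
TWEET_RE = re.compile(r"<tweet>(.*?)</tweet>", re.DOTALL)

# Tempered negative lookahead: consume one character at a time, but only
# while neither the keyword nor the closing tag starts at that position.
# This matches exactly the tweets that do NOT contain the keyword.
DROP_RE = re.compile(
    r"<tweet>(?:(?!catastrophic|</tweet>).)*</tweet>\s*",
    re.DOTALL | re.IGNORECASE,
)


def keep_keyword_tweets(text, keyword="catastrophic"):
    """Return the <tweet> elements whose body contains keyword (case-insensitive)."""
    kept = []
    for match in TWEET_RE.finditer(text):
        if keyword.lower() in match.group(1).lower():
            kept.append(match.group(0))
    return kept


# Two tweets from the question plus one invented tweet without the keyword.
sample = (
    '<tweet>"catastrophic loss" @ tennessee\'s zoo knoxville 33 reptiles found dead </tweet> '
    "<tweet>some unrelated tweet</tweet> "
    "<tweet>overcoming catastrophic forgetting incremental moment matching, lee et al.</tweet>"
)

# Option 1: collect the tweets to keep.
filtered = keep_keyword_tweets(sample)

# Option 2: delete the unwanted tweets in place with the lookahead pattern.
cleaned = DROP_RE.sub("", sample)

print("\n".join(filtered))
```

Option 1 is usually easier to read and to adapt; option 2 is closer to a search-and-replace you could paste into an editor that supports lookaheads. For a large or irregular file, a real XML parser would be the more robust choice.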
