Data cleaning - Regex that searches for sentences that exclude one word
Hi guys,
I'm creating a corpus composed of tweets that contain the keyword "catastrophic", in XML format. Each tweet is embedded like this:
<tweet>"catastrophic loss" @ tennessee's zoo knoxville 33 reptiles found dead </tweet> <tweet>overcoming catastrophic forgetting incremental moment matching, lee et al.</tweet>
After trimming tons of unnecessary data, there are still 200+ tweets that don't contain the keyword at all. I'd like to delete them and tried a regex like this, but it didn't work:
<tweet>^.*(?!catastrophic).*$</tweet>
Does anyone have an idea?
I'm not sure which programming language or other toolset you are using.
But a quite simple approach might be to re-write the file (or whatever kind of input it is) using a filter that writes only the entries that contain catastrophic:
Assuming the file has 1 line per tweet (just to illustrate the idea):
egrep '<tweet>.*catastrophic.*</tweet>' originalfile > newfile
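If you would rather delete the unwanted tweets with a regex directly, here is a minimal Python sketch. The reason the pattern in the question matches everything is that (?!catastrophic) only checks the single position where it sits, and the surrounding .* is free to swallow the keyword anyway. Repeating the lookahead before every character avoids that. The function name and the sample tweets below are made up for illustration, and the sketch assumes the keyword is always lowercase and that tweets are not nested.

```python
import re

KEYWORD = "catastrophic"  # assumption: keyword appears in lowercase only

# At every character inside the element, the negative lookahead checks that
# neither the keyword nor a closing tag starts at that position, so the
# match can never contain the keyword or run into the next tweet.
NO_KEYWORD_TWEET = re.compile(
    r"<tweet>(?:(?!%s|</tweet>).)*</tweet>\s*" % re.escape(KEYWORD),
    re.DOTALL,
)


def drop_tweets_without_keyword(xml_text):
    """Return xml_text with every keyword-free <tweet> element removed."""
    return NO_KEYWORD_TWEET.sub("", xml_text)


# Hypothetical sample data, only to show the behaviour.
sample = (
    '<tweet>"catastrophic loss" @ tennessee\'s zoo knoxville</tweet>\n'
    "<tweet>nothing relevant in this one</tweet>\n"
    "<tweet>overcoming catastrophic forgetting</tweet>\n"
)

print(drop_tweets_without_keyword(sample))
```

Unlike the line-based filter, this also works when several tweets share one line, since it removes whole <tweet> elements rather than whole lines. For anything beyond a quick cleanup, a real XML parser is probably the safer choice.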