machine learning - Python sklearn n-gram accuracy decreases as n-gram length increases
I've got a hate speech dataset containing 10k labeled tweets. It looks like this:
tweet | class
hi | not offensive
ugly muppet | offensive not hate speech
**** jew | hate speech
Now I'm trying to use the MultinomialNB classifier from Python's sklearn library. Here's the code.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

data = pd.read_excel('myfile', encoding="utf-8")
data = data.sample(frac=1)

training_base = 0
training_bounds = 10000
test_base = training_bounds + 1
test_bounds = 12000

tweets_train = data['tweet'][training_base:training_bounds]
tweets_test = data['tweet'][test_base:test_bounds]
class_train = data['class'][training_base:training_bounds]
class_test = data['class'][test_base:test_bounds]

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1))
train_counts = vectorizer.fit_transform(tweets_train.values)

classifier = MultinomialNB()
train_targets = class_train.values
classifier.fit(train_counts, train_targets)

example_counts = vectorizer.transform(tweets_test.values)
predictions = classifier.predict(example_counts)
accuracy = np.mean(predictions == class_test.values)
print(accuracy)
The accuracy when using ngram_range=(1,1) is approximately 75%, but as I go from (2,2) up to (8,8) it decreases: 75, 72, 67, ..., 55%. Why is this? What am I missing?
You make the problem increasingly sparse: finding an exact 8-word sequence from the training set again in the test set is very unlikely, hence the worse accuracy.
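You can see the sparsity directly by counting how many distinct word n-grams a single short tweet even produces as n grows (the example tweet here is made up for illustration; CountVectorizer raises a ValueError when a document yields no n-grams at all):

```python
from sklearn.feature_extraction.text import CountVectorizer

tweet = ["you are an ugly muppet"]  # 5 tokens, hypothetical example

for n in (1, 2, 5, 8):
    vec = CountVectorizer(analyzer='word', ngram_range=(n, n))
    try:
        # number of distinct n-grams extracted from the tweet
        print(n, len(vec.fit(tweet).vocabulary_))
    except ValueError:
        # fewer tokens than n: no n-grams can be formed at all
        print(n, 0)
```

A 5-token tweet gives 5 unigrams, 4 bigrams, a single 5-gram, and zero 8-grams, and that lone 5-gram is extremely unlikely to reappear verbatim in a test tweet.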
I recommend mixing different word n-gram lengths (that's why there are two parameters), e.g. (1, 3) seems a reasonable choice for short tweets. There might also be hidden information in character n-grams, which naturally encode more linguistic features; add them to the feature space.
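A minimal sketch of that idea: fit one vectorizer for word 1- to 3-grams and one for character n-grams, and stack the two sparse count matrices side by side before fitting the classifier. The toy texts and labels below are hypothetical stand-ins for the tweet dataset:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the tweet data (hypothetical, for illustration only)
train_texts = ["hi there", "ugly muppet", "you ugly muppet", "hello friend"]
train_labels = ["not offensive", "offensive", "offensive", "not offensive"]
test_texts = ["hi friend", "ugly clown"]

# Word unigrams through trigrams, as suggested for short tweets
word_vec = CountVectorizer(analyzer='word', ngram_range=(1, 3))
# Character 2- to 4-grams; 'char_wb' keeps n-grams inside word boundaries
char_vec = CountVectorizer(analyzer='char_wb', ngram_range=(2, 4))

# Stack both feature spaces into one sparse matrix
X_train = hstack([word_vec.fit_transform(train_texts),
                  char_vec.fit_transform(train_texts)])
X_test = hstack([word_vec.transform(test_texts),
                 char_vec.transform(test_texts)])

clf = MultinomialNB().fit(X_train, train_labels)
print(clf.predict(X_test))
```

The same combination can also be expressed with sklearn's FeatureUnion inside a Pipeline; the manual hstack just keeps the sketch short.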