machine learning - Python's sklearn n-gram accuracy decreases as n-gram length increases


I've got a hate speech dataset containing 10k labeled tweets. It looks like this:

tweet | class
hi | not offensive
ugly muppet | offensive not hate speech
**** jew | hate speech

Now I'm trying to use the MultinomialNB classifier from Python's sklearn library, and here's my code:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

data = pd.read_excel('myfile', encoding="utf-8")
data = data.sample(frac=1)  # shuffle the rows

# slices are half-open, so the test set starts where the training set ends
training_base = 0
training_bounds = 10000
test_base = training_bounds
test_bounds = 12000

tweets_train = data['tweet'][training_base:training_bounds]
tweets_test = data['tweet'][test_base:test_bounds]
class_train = data['class'][training_base:training_bounds]
class_test = data['class'][test_base:test_bounds]

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1))
train_counts = vectorizer.fit_transform(tweets_train.values)

classifier = MultinomialNB()
classifier.fit(train_counts, class_train.values)

example_counts = vectorizer.transform(tweets_test.values)
predictions = classifier.predict(example_counts)

accuracy = np.mean(predictions == class_test.values)
print(accuracy)
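As an aside, the manual index slicing above is easy to get wrong (off-by-one errors, and the ranges must match the file's actual length). A minimal sketch of a less error-prone split, using sklearn's `train_test_split` on a toy DataFrame that stands in for the real tweet file:

```python
# Sketch: shuffle and split in one step with train_test_split instead of
# manual slicing. The DataFrame here is a small stand-in for the real data.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    "tweet": ["hi", "ugly muppet", "bad words here", "hello friend"] * 25,
    "class": ["not offensive", "offensive not hate speech",
              "hate speech", "not offensive"] * 25,
})

tweets_train, tweets_test, class_train, class_test = train_test_split(
    data["tweet"], data["class"],
    test_size=0.2, random_state=42,
    stratify=data["class"],  # keep class proportions equal in both splits
)
print(len(tweets_train), len(tweets_test))  # 80 20
```

`stratify` matters for skewed datasets like this one, where "hate speech" is usually a small minority class.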

The accuracy with ngram_range=(1,1) is approximately 75%, but as I go from (2,2) up to (8,8) it decreases: 75, 72, 67, ..., 55%. Why is this? What am I missing?

You are making the problem increasingly sparse: finding the exact same 8-word sequence from the training set again in the test set is unlikely, hence the worse accuracy.
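You can see the sparsity effect directly on a toy corpus: as n grows, fewer and fewer of the test document's n-grams were ever seen in training, so the classifier has less and less evidence to work with.

```python
# Toy demonstration of the sparsity argument: count how many n-gram
# features training produces, and how many n-grams of the test sentence
# survive the transform (i.e. were also seen in training).
from sklearn.feature_extraction.text import CountVectorizer

train = ["the cat sat on the mat", "the dog sat on the rug"]
test = ["the cat sat on the rug"]

for n in (1, 2, 5):
    vec = CountVectorizer(analyzer="word", ngram_range=(n, n))
    X_train = vec.fit_transform(train)
    X_test = vec.transform(test)
    # vocabulary size, and total count of test n-grams found in training
    print(n, X_train.shape[1], X_test.sum())
```

With unigrams every test word is known; with 5-grams only one of the test sentence's two 5-grams ("the cat sat on the") ever appeared in training, and "cat sat on the rug" is simply dropped.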

I recommend mixing different word n-gram lengths (that's why the range takes two parameters); e.g. (1, 3) seems a reasonable choice for short tweets. There might also be hidden information in character n-grams, which naturally encode more linguistic features, so add them to the feature space as well.
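The suggestion above can be sketched with a `FeatureUnion` that stacks word (1,3) n-gram counts and character n-gram counts into one feature space before the Naive Bayes step. The tweets and labels here are a toy stand-in for the real DataFrame:

```python
# Sketch: combine word n-grams (1,3) with character n-grams (2,4) in one
# feature space, then feed the stacked counts to MultinomialNB.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

model = Pipeline([
    ("features", FeatureUnion([
        ("word", CountVectorizer(analyzer="word", ngram_range=(1, 3))),
        # char_wb builds character n-grams only inside word boundaries
        ("char", CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ])),
    ("clf", MultinomialNB()),
])

tweets = ["hi", "ugly muppet", "you are great", "nasty insult"]
labels = ["not offensive", "offensive", "not offensive", "offensive"]
model.fit(tweets, labels)
print(model.predict(["hi muppet"]))
```

The character n-grams also give the model partial credit for misspelled or obfuscated slurs that an exact word match would miss.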

