Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK


I am a Machine Learning (ML) and Natural Language Processing enthusiast. For my university dissertation I created a realtime sentiment analysis classifier for Twitter. My talk is about that experience and the lessons learned. I will explain how to build a scalable machine learning software-as-a-service, consumable through a REST API. The purpose of this talk is not to dig into the mathematics behind machine learning (I do not have that background), but to show how easy it can be to build an ML SaaS using some of the amazing libraries, such as NLTK, ZMQ and MrJob, that helped me throughout development. The talk offers several benefits: users with no ML background will get a solid introduction to the subject and will be able to replicate my project at home, while more experienced users will gain new ideas to put into practice and (most) probably build a better system than mine! Finally, I will share a GitHub project with the slides and a finished product.



  1. 1. MACHINE LEARNING AS A SERVICE MAKING SENTIMENT PREDICTIONS IN REALTIME WITH ZMQ AND NLTK
  2. 2. ABOUT ME
  4. 4. DISSERTATION Let's make something cool!
  4. 4. SOCIAL MEDIA + MACHINE LEARNING + API
  5. 5. SENTIMENT ANALYSIS AS A SERVICE A STEP-BY-STEP GUIDE
  6. 6. Fundamental Topics: Machine Learning, Natural Language Processing, Overview of the platform. The process: Prepare, Analyze, Train, Use, Scale.
  7. 7. MACHINE LEARNING WHAT IS MACHINE LEARNING? A method of teaching computers to make and improve predictions or behaviors based on some data. It allows computers to evolve behaviors based on empirical data. Data can be anything: stock market prices, sensors and motors, email metadata.
  8. 8.–15. SUPERVISED MACHINE LEARNING: SPAM OR HAM (a worked spam-or-ham example, built up step by step across slides 8–15; see the sketch below)
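A minimal sketch of the spam-or-ham idea from these slides, using NLTK's NaiveBayesClassifier; the toy dataset and the word_feats helper are hypothetical, for illustration only:

from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    # Bag-of-words: every word becomes a True-valued feature
    return dict((word, True) for word in words)

train = [
    (word_feats(["win", "cash", "now"]), "spam"),
    (word_feats(["cheap", "pills", "now"]), "spam"),
    (word_feats(["meeting", "at", "noon"]), "ham"),
    (word_feats(["lunch", "with", "mom"]), "ham"),
]

classifier = NaiveBayesClassifier.train(train)
print classifier.classify(word_feats(["win", "cheap", "pills"]))  # -> 'spam'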
  16. 16. NATURAL LANGUAGE PROCESSING WHAT IS NATURAL LANGUAGE PROCESSING? Interactions between computers and human languages. Extract information from text. Some NLTK features: Bigrams, Part-of-speech tagging, Tokenization, Stemming, WordNet lookup (a short tour follows below).
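A quick tour of the NLTK features named on this slide; the example sentence is arbitrary and the commented outputs are abbreviated:

from nltk import word_tokenize, bigrams, pos_tag
from nltk.stem.porter import PorterStemmer
from nltk.corpus import wordnet

tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
print list(bigrams(tokens))          # [('The', 'quick'), ('quick', 'brown'), ...]
print pos_tag(tokens)                # [('The', 'DT'), ('quick', 'JJ'), ...]
print PorterStemmer().stem("jumps")  # 'jump'
print wordnet.synsets("dog")[0]      # Synset('dog.n.01')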
  17. 17. NATURAL LANGUAGE PROCESSING SOME NLTK FEATURES Tokenization and Stopword Removal:
>>> phrase = "I wish to buy specified products or service"
>>> phrase = nlp.tokenize(phrase)
>>> phrase
['I', 'wish', 'to', 'buy', 'specified', 'products', 'or', 'service']
>>> phrase = nlp.remove_stopwords(phrase)
>>> phrase
['I', 'wish', 'buy', 'specified', 'products', 'service']
  18. 18. SENTIMENT ANALYSIS
  19. 19. CLASSIFYING TWITTER SENTIMENT IS HARD Improper language use, spelling mistakes, only 140 characters to express sentiment, different types of English (US, UK, Pidgin). Example tweet (Donnie McClurkin, @Donnieradio, 21 Apr 2014): "Gr8 picutre..God bless u RT @WhatsNextInGosp: Resurrection Sunday Service @PFCNY with @Donnieradio pic.twitter.com/nOgz65cpY5"
  20. 20. BACK TO BUILDING OUR API .. FINALLY!
  21. 21. CLASSIFIER 3 STEPS
  22. 22. THE DATASET SENTIMENT140 1,600,000 labelled tweets in CSV format. Polarity of the tweet (0 = negative, 2 = neutral, 4 = positive) and the text of the tweet ("Lyx is cool").
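A minimal sketch of loading the dataset; the column layout (polarity first, text last) is Sentiment140's documented CSV format, and the file name is the one the dataset ships with:

import csv

labels = {"0": "neg", "2": "neutral", "4": "pos"}
tweets = []
with open("training.1600000.processed.noemoticon.csv") as f:
    for row in csv.reader(f):
        polarity, text = row[0], row[5]
        tweets.append((text, labels[polarity]))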
  23. 23. FEATURE EXTRACTION How are we going to find features from a phrase? "Bag of Words" representation:
my_phrase = "Today was such a rainy and horrible day"
In [12]: from nltk import word_tokenize
In [13]: word_tokenize(my_phrase)
Out[13]: ['Today', 'was', 'such', 'a', 'rainy', 'and', 'horrible', 'day']
  24. 24. FEATURE EXTRACTION CREATE A PIPELINE OF FEATURE EXTRACTORS
FORMATTER = formatting.FormatterPipeline(
    formatting.make_lowercase,
    formatting.strip_urls,
    formatting.strip_hashtags,
    formatting.strip_names,
    formatting.remove_repetitons,
    formatting.replace_html_entities,
    formatting.strip_nonchars,
    functools.partial(
        formatting.remove_noise,
        stopwords=stopwords.words('english') + ['rt']
    ),
    functools.partial(
        formatting.stem_words,
        stemmer=nltk.stem.porter.PorterStemmer()
    )
)
  25. 25. FEATURE EXTRACTION PASS THE REPRESENTATION DOWN THE PIPELINE
In [11]: feature_extractor.extract("Today was such a rainy and horrible day")
Out[11]: {'day': True, 'horribl': True, 'raini': True, 'today': True}
The result is a dictionary of variable length, with the features as keys and True as every value.
  26. 26. DIMENSIONALITY REDUCTION Remove features that are common across all classes (noise). Increase performance of the classifier. Decrease the size of the model: less memory usage and more speed.
  27. 27.–31. DIMENSIONALITY REDUCTION: CHI-SQUARE TEST (introduced step by step across slides 27–31)
  32. 32. DIMENSIONALITY REDUCTION CHI-SQUARE TEST NLTK gives us BigramAssocMeasures.chi_sq
# Calculate the number of words for each class
pos_word_count = label_word_fd['pos'].N()
neg_word_count = label_word_fd['neg'].N()
total_word_count = pos_word_count + neg_word_count
# For each word and its total occurrence
for word, freq in word_fd.iteritems():
    # Calculate a score for the positive class
    pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
                                           (freq, pos_word_count), total_word_count)
    # Calculate a score for the negative class
    neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
                                           (freq, neg_word_count), total_word_count)
    # The sum of the two gives its total score
    word_scores[word] = pos_score + neg_score
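A sketch of how the word_scores computed above can become the best_features set and the reduce_features helper used on the later slides; the 10,000-word cutoff is an arbitrary choice for illustration:

# Keep the 10,000 highest-scoring words
best = sorted(word_scores.iteritems(), key=lambda item: item[1], reverse=True)[:10000]
best_features = set(word for word, score in best)

def reduce_features(feature_vector, best_features):
    # Drop any feature that did not survive the chi-square cut
    return dict((w, v) for w, v in feature_vector.iteritems() if w in best_features)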
  33. 33. TRAINING Now that we can extract features from text, we can train a classifier. The simplest and most flexible learning algorithm for text classification is Naive Bayes: P(label|features) = P(label) * P(features|label) / P(features). Simple to compute = fast. Assumes feature independence = easy to update. Supports multiclass = scalable.
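The train_feats used on the next slide is a list of (feature_dict, label) pairs, the format NaiveBayesClassifier.train expects. A hedged sketch, assuming the labelled tweets and the pipeline from the earlier slides:

train_feats = [
    (reduce_features(feature_extractor.extract(text), best_features), label)
    for text, label in tweets
]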
  34. 34. TRAINING NLTK provides built-in components. 1. Train the classifier 2. Serialize the classifier for later use 3. Train once, use as much as you want
>>> from nltk.classify import NaiveBayesClassifier
>>> nb_classifier = NaiveBayesClassifier.train(train_feats)
... wait a lot of time
>>> nb_classifier.labels()
['neg', 'pos']
>>> serializer.dump(nb_classifier, file_handle)
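The serializer above is the speaker's own wrapper; since the next slide loads the model with plain pickle, here is a plain-pickle sketch of the same train-once, load-many idea:

import pickle

with open("classifier.pickle", "wb") as f:
    pickle.dump(nb_classifier, f)

with open("classifier.pickle", "rb") as f:
    nb_classifier = pickle.load(f)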
  35. 35. USING THE CLASSIFIER
# Load the classifier from the serialized file
classifier = pickle.loads(classifier_file.read())
# Pick a new phrase
new_phrase = "At Pycon Italy! Love the food and this speaker is so amazing"
# 1) Preprocessing
feature_vector = feature_extractor.extract(new_phrase)
# 2) Dimensionality reduction, best_features is our set of best words
reduced_feature_vector = reduce_features(feature_vector, best_features)
# 3) Classify!
print classifier.classify(reduced_feature_vector)
# Output: "pos"
  36. 36. BUILDING A CLASSIFICATION API The classifier is slow, no matter how much optimization is done. The classifier is a blocking process, so the API must be event-driven.
  37. 37. BUILDING A CLASSIFICATION API SCALING TOWARDS INFINITY AND BEYOND
  38. 38. BUILDING A CLASSIFICATION API ZEROMQ Fast, uses native sockets. Promotes horizontal scalability. Language-agnostic framework.
  39. 39. BUILDING A CLASSIFICATION API ZEROMQ
...
socket = context.socket(zmq.REP)
...
while True:
    message = socket.recv()
    phrase = json.loads(message)["text"]
    # 1) Feature extraction
    feature_vector = feature_extractor.extract(phrase)
    # 2) Dimensionality reduction, best_features is our set of best words
    reduced_feature_vector = reduce_features(feature_vector, best_features)
    # 3) Classify!
    result = classifier.classify(reduced_feature_vector)
    socket.send(json.dumps(result))
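A minimal client-side sketch to pair with the REP worker above: a REQ socket sends the phrase and waits for the prediction, which is how an event-driven API frontend can farm classification out to one or more worker processes. The endpoint address and port are assumptions:

import json
import zmq

context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.connect("tcp://localhost:5555")

socket.send(json.dumps({"text": "At Pycon Italy! Love the food"}))
print json.loads(socket.recv())  # e.g. "pos"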
  40. 40. DEMO
  41. 41. POST-MORTEM Real-time sentiment analysis APIs can be implemented, and they can scale. What if we used Redis instead of serialized classifiers? Deep learning is giving very good results in NLP; let's try it!
  42. 42. FIN QUESTIONS
