
Bot or Not

Like many Internet giants, Twitter makes money by selling ads, but it has an insidious infestation eroding its advertising credibility: bots. More than 23 million of them. Twitter bots are automatons living in the Twittersphere, ranging wildly in capability. At their simplest, they follow you, maybe fav-ing or retweeting your statuses. At their most complex, they troll and, ironically, troll trolls, using speech patterns that can at times fool humans. But when advertisers pay for engagement, they aren't interested in a four-hour flame war between a Gamergate bot and a Kanye bot. When advertisers analyze social data, they want to be sure their findings are the result of human activity. In Bot or Not I describe an end-to-end data analysis in Python, building a classifier to separate bots from humans.



  1. Bot or Not? @erinshellman, PyData Seattle, July 26, 2015. End-to-end data analysis in Python.
  2. PySpark Workshop @ Tune, August 27, 6-8pm. Starting a new career in software @ Moz, October 22, 6-8pm.
  3. Q: Why? Bots are fun. Q: How? Python.
  4. In 2009, 24% of tweets were generated by bots.
  5. Last year Twitter disclosed that 23 million of its active users were bots.
  6. Hypothesis: Bot behavior is differentiable from human behavior.
  7. Experimental Design • Ingest data • Clean and process data • Create a classifier
  8. Experimental Design • Ingest data: python-twitter • Clean and process data: pandas, NLTK, Seaborn, IPython Notebooks • Create a classifier: scikit-learn
  9. Step 1: Get data.
  10. lollollollol
  11. Connecting to Twitter: https://github.com/bear/python-twitter
  12. def get_friends(self, screen_name, count=5000):
          ''' GET friends/ids, i.e. the people you follow.
              Returns a list of JSON blobs. '''
          friends = self.api.GetFriendIDs(screen_name=screen_name, count=count)
          return friends
  13. # break query into bite-size chunks 🍔
      def blow_chunks(self, data, max_chunk_size):
          for i in range(0, len(data), max_chunk_size):
              yield data[i:i + max_chunk_size]
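Pulled out of the class, the generator above can be exercised on its own. The 250 fake IDs and the 100-per-call limit below are assumptions for the demo, not values from the talk:

```python
# Standalone sketch of the chunking generator from the slide above.
def blow_chunks(data, max_chunk_size):
    for i in range(0, len(data), max_chunk_size):
        yield data[i:i + max_chunk_size]

user_ids = list(range(250))           # stand-in for real Twitter user IDs
chunks = list(blow_chunks(user_ids, max_chunk_size=100))
print([len(chunk) for chunk in chunks])  # [100, 100, 50]
```

Because it yields slices lazily, the full ID list is never copied; each API call consumes one chunk at a time.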
  14. if len(user_ids) > max_query_size:
          chunks = self.blow_chunks(user_ids, max_chunk_size=max_query_size)
          while True:
              try:
                  current_chunk = next(chunks)
                  for user in current_chunk:
                      try:
                          user_data = self.api.GetUser(user_id=str(user))
                          results.append(user_data.AsDict())
                      except twitter.TwitterError:
                          print("got a twitter error! D:")
                  # back off between chunks to respect the rate limit
                  print("nap time. ZzZzZzzzzz...")
                  time.sleep(60 * 16)
              except StopIteration:
                  break
  18. Example user JSON (truncated):
      {
        "name": "Twitter API",
        "location": "San Francisco, CA",
        "created_at": "Wed May 23 06:01:13 +0000 2007",
        "default_profile": true,
        "favourites_count": 24,
        "url": "http://dev.twitter.com",
        "id": 6253282,
        "profile_use_background_image": true,
        "listed_count": 10713,
        "profile_text_color": "333333",
        "lang": "en",
        "followers_count": 1198334,
        "protected": false,
        "geo_enabled": true,
        "description": "The Real Twitter API.",
        "verified": true,
        "notifications": false,
        "time_zone": "Pacific Time (US & Canada)",
        "statuses_count": 3331,
        "status": {
          "coordinates": null,
          "created_at": "Fri Aug 24 16:15:49 +0000 2012",
          "favorited": false,
          "truncated": false,
  19. Sample size = 8,509 accounts.
  20. Step 2: Preprocessing.
  21. Who's ready to clean? 1. "Flatten" the JSON into one row per user. 2. Recode variables, e.g. consistently denoting missing values, True/False into 1/0. 3. Select only the desired features for modeling.
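Those three cleaning steps might look like this in pandas. The user dicts below are made up, with field names borrowed from the user JSON on the earlier slide:

```python
import pandas as pd

# Hypothetical user blobs, shaped like python-twitter's AsDict() output
users = [
    {"screen_name": "a", "followers_count": 10, "verified": True,  "time_zone": None},
    {"screen_name": "b", "followers_count": 25, "verified": False, "time_zone": "PST"},
]

# 1. "Flatten" the JSON into one row per user
df = pd.DataFrame(users)

# 2. Recode variables: True/False -> 1/0, consistent missing values
df["verified"] = df["verified"].astype(int)
df["time_zone"] = df["time_zone"].fillna("unknown")

# 3. Select only the desired features for modeling
features = ["followers_count", "verified"]
model_data = df[["screen_name"] + features]
print(model_data["verified"].tolist())  # [1, 0]
```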
  22. How to make data with this?
  23. e.g. Lexical Diversity • A token is a sequence of characters that we want to treat as a group. • For instance, lol, #blessed, or 💉🔪💇 • Lexical diversity is the ratio of unique tokens to total tokens.
  24. def lexical_diversity(text):
          if len(text) == 0:
              diversity = 0
          else:
              diversity = float(len(set(text))) / len(text)
          return diversity
  25. # Easily compute summaries for each user!
      grouped = tweets.groupby('screen_name')
      diversity = grouped.apply(lexical_diversity)
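Put together on toy data (the screen names and tokens below are invented, and the tweets are assumed to be pre-tokenized), the per-user computation from the last two slides looks like:

```python
import pandas as pd

def lexical_diversity(text):
    # ratio of unique tokens to total tokens; 0 for empty input
    if len(text) == 0:
        return 0
    return float(len(set(text))) / len(text)

# Toy data: pre-tokenized tweets, two per user
tweets = pd.DataFrame({
    "screen_name": ["human", "human", "bot", "bot"],
    "tokens": [["lol", "omg", "wow"], ["cool", "story"],
               ["buy", "now"], ["buy", "now"]],
})

# Pool each user's tokens, then score lexical diversity per user
diversity = tweets.groupby("screen_name")["tokens"].apply(
    lambda lists: lexical_diversity([tok for toks in lists for tok in toks]))
print(diversity["bot"], diversity["human"])  # 0.5 1.0
```

The repetitive bot scores 0.5 (two unique tokens out of four), while the human's five distinct tokens score a full 1.0.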
  26. Step 3: Classification.
  27. # Naive Bayes
      bayes = GaussianNB().fit(train[features], y)
      bayes_predict = bayes.predict(test[features])

      # Logistic regression
      logistic = LogisticRegression().fit(train[features], y)
      logistic_predict = logistic.predict(test[features])

      # Random Forest
      rf = RandomForestClassifier().fit(train[features], y)
      rf_predict = rf.predict(test[features])

      # Classification metrics
      print(metrics.classification_report(test.bot, bayes_predict))
      print(metrics.classification_report(test.bot, logistic_predict))
      print(metrics.classification_report(test.bot, rf_predict))
  28. Naive Bayes
                   precision  recall  f1-score
      0.0          0.97       0.27    0.42
      1.0          0.20       0.95    0.33
      avg / total  0.84       0.38    0.41

      Logistic Regression
                   precision  recall  f1-score
      0.0          0.85       1.00    0.92
      1.0          0.94       0.14    0.12
      avg / total  0.87       0.85    0.79

      Random Forest
                   precision  recall  f1-score
      0.0          0.91       0.98    0.95
      1.0          0.86       0.51    0.64
      avg / total  0.90       0.91    0.90
  29. # construct parameter grid
      param_grid = {'max_depth': [1, 3, 6, 9, 12, 15, None],
                    'max_features': [1, 3, 6, 9, 12],
                    'min_samples_split': [1, 3, 6, 9, 12, 15],
                    'min_samples_leaf': [1, 3, 6, 9, 12, 15],
                    'bootstrap': [True, False],
                    'criterion': ['gini', 'entropy']}

      # fit best classifier
      grid_search = GridSearchCV(RandomForestClassifier(),
                                 param_grid=param_grid).fit(train[features], y)

      # assess predictive accuracy
      predict = grid_search.predict(test[features])
      print(metrics.classification_report(test.bot, predict))
  30. Best parameter set for the random forest:
      print(grid_search.best_params_)
      {'bootstrap': True, 'min_samples_leaf': 15, 'min_samples_split': 9,
       'criterion': 'entropy', 'max_features': 6, 'max_depth': 9}
  31. Tuned Random Forest
                   precision  recall  f1-score
      0.0          0.93       0.99    0.96
      1.0          0.89       0.59    0.71
      avg / total  0.92       0.93    0.92

      Default Random Forest
                   precision  recall  f1-score
      0.0          0.91       0.98    0.95
      1.0          0.86       0.51    0.64
      avg / total  0.90       0.91    0.90
  32. Iterative model development in scikit-learn is laborious.
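One way to trim that repetition, sketched here rather than taken from the talk, is scikit-learn's Pipeline, which bundles preprocessing and a model into a single object that can be fit, tuned, or swapped as a unit. The features and labels below are random placeholders, not the talk's data:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the account statistics used above
# (statuses, friends, followers); labels are random and purely illustrative
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = rng.randint(0, 2, 200)

# Preprocessing + model as one estimator, akin to caret's preProcess argument
model = Pipeline([
    ("scale", StandardScaler()),      # center and scale the features
    ("clf", LogisticRegression()),
])
model.fit(X, y)
print(model.score(X, y))              # training accuracy on the toy data
```

Swapping in a different classifier then means changing one line, and the same pipeline object plugs directly into GridSearchCV.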
  33. logistic_model = train(bot ~ statuses_count + friends_count + followers_count,
                             data = train,
                             method = 'glm',
                             family = binomial,
                             preProcess = c('center', 'scale'))
  34. > confusionMatrix(logistic_predictions, test$bot)
      Confusion Matrix and Statistics

                Reference
      Prediction   0   1
               0 394  22
               1 144  70

                     Accuracy : 0.7365
                       95% CI : (0.7003, 0.7705)
          No Information Rate : 0.854
          P-Value [Acc > NIR] : 1
                        Kappa : 0.3183
       McNemar's Test P-Value : <2e-16
                  Sensitivity : 0.7323
                  Specificity : 0.7609
               Pos Pred Value : 0.9471
               Neg Pred Value : 0.3271
                   Prevalence : 0.8540
               Detection Rate : 0.6254
         Detection Prevalence : 0.6603
            Balanced Accuracy : 0.7466
             'Positive' Class : 0
  35. > summary(logistic_model)
      Call: NULL
      Deviance Residuals:
          Min      1Q  Median      3Q     Max
      -1.2620 -0.6323 -0.4834 -0.0610  6.0228

      Coefficients:
                       Estimate Std. Error z value Pr(>|z|)
      (Intercept)       -5.7136     0.7293  -7.835 4.71e-15 ***
      statuses_count    -2.4120     0.5026  -4.799 1.59e-06 ***
      friends_count     30.8238     3.2536   9.474  < 2e-16 ***
      followers_count  -69.4496    10.7190  -6.479 9.22e-11 ***
      ---
      (Dispersion parameter for binomial family taken to be 1)
      Null deviance: 2172.3 on 2521 degrees of freedom
      Residual deviance: 1858.3 on 2518 degrees of freedom
      AIC: 1866.3
      Number of Fisher Scoring iterations: 13
  36. # compare models
      results = resamples(list(tree_model = tree_model,
                               bagged_model = bagged_model,
                               boost_model = boost_model))
      # plot results
      dotplot(results)
  37. Step 5: Pontificate.
  38. Python rules! • The Python language is an incredibly powerful tool for end-to-end data analysis. • Even so, some tasks are more work than they should be.
  39. Lame bots
  40. And now…
  41. the bots.
  42. Clicks
      • Twitter Is Plagued With 23 Million Automated Accounts: http://valleywag.gawker.com/twitter-is-riddled-with-23-million-bots-1620466086
      • How Twitter Bots Fool You Into Thinking They Are Real People: http://www.fastcompany.com/3031500/how-twitter-bots-fool-you-into-thinking-they-are-real-people
      • Rise of the Twitter bots: Social network admits 23 MILLION of its users tweet automatically without human input: http://www.dailymail.co.uk/sciencetech/article-2722677/Rise-Twitter-bots-Social-network-admits-23-MILLION-users-tweet-automatically-without-human-input.html
      • Twitter Zombies: 24% of Tweets Created by Bots: http://mashable.com/2009/08/06/twitter-bots/
      • How bots are taking over the world: http://www.theguardian.com/commentisfree/2012/mar/30/how-bots-are-taking-over-the-world
      • That Time 2 Bots Were Talking, and Bank of America Butted In: http://www.theatlantic.com/technology/archive/2014/07/that-time-2-bots-were-talking-and-bank-of-america-butted-in/374023/
      • The Rise of Twitter Bots: http://www.newyorker.com/tech/elements/the-rise-of-twitter-bots
      • Olivia Taters, Robot Teenager: http://www.onthemedia.org/story/29-olivia-taters-robot-teenager/
