Bot or Not?
End-to-end data analysis in Python
@erinshellman
PyData Seattle, July 26, 2015

PySpark Workshop @Tune, August 27, 6-8pm
Starting a new career in software @Moz, October 22, 6-8pm
Q: Why?
Bots are fun.
Q: How?
Python.
In 2009, 24% of tweets
were generated by bots.
In 2014, Twitter disclosed
that 23 million of its active
users were bots.
Hypothesis:
Bot behavior is
differentiable from
human behavior.
Experimental Design
• Ingest data
• Clean and process data
• Create a classifier
Experimental Design
• Ingest data
• python-twitter
• Clean and process data
• Pandas, NLTK, Seaborn, IPython Notebooks
• Create a classifier
• Scikit-learn
Step 1:
Get data.
lol lol lol lol
Connecting to Twitter
https://github.com/bear/python-twitter
def get_friends(self, screen_name, count = 5000):
    '''
    GET friends/ids, i.e. the accounts this user follows.
    Returns a list of user IDs.
    '''
    friends = self.api.GetFriendIDs(screen_name = screen_name,
                                    count = count)
    return friends
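These helpers assume an authenticated python-twitter client stored on self.api. A minimal setup sketch, with placeholder credentials from your Twitter app:

import twitter  # python-twitter, https://github.com/bear/python-twitter

# placeholder credentials from your Twitter app settings
api = twitter.Api(consumer_key='YOUR_CONSUMER_KEY',
                  consumer_secret='YOUR_CONSUMER_SECRET',
                  access_token_key='YOUR_ACCESS_TOKEN',
                  access_token_secret='YOUR_ACCESS_TOKEN_SECRET')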
# break query into bite-size chunks 🍔
def blow_chunks(self, data, max_chunk_size):
    for i in range(0, len(data), max_chunk_size):
        yield data[i:i + max_chunk_size]
if len(user_ids) > max_query_size:
    chunks = self.blow_chunks(user_ids, max_chunk_size = max_query_size)
    while True:
        try:
            current_chunk = chunks.next()
            for user in current_chunk:
                try:
                    user_data = self.api.GetUser(user_id = str(user))
                    results.append(user_data.AsDict())
                except:
                    print "got a twitter error! D:"
                    pass
            print "nap time. ZzZzZzzzzz..."
            time.sleep(60 * 16)
            continue
        except StopIteration:
            break
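The same chunked lookup in Python 3 syntax, as a standalone sketch: api comes from the setup sketch above, the example screen_name and the 100-ID batch size are placeholders, and iterating the generator with a for loop replaces the explicit chunks.next() / StopIteration handling.

import time

def blow_chunks(data, max_chunk_size):
    # module-level version of the generator above
    for i in range(0, len(data), max_chunk_size):
        yield data[i:i + max_chunk_size]

max_query_size = 100                                 # assumed batch size between naps
user_ids = api.GetFriendIDs(screen_name='twitter')   # example account
results = []

if len(user_ids) > max_query_size:
    for current_chunk in blow_chunks(user_ids, max_chunk_size=max_query_size):
        for user in current_chunk:
            try:
                user_data = api.GetUser(user_id=str(user))
                results.append(user_data.AsDict())
            except Exception:                        # e.g. a Twitter rate-limit error
                print("got a twitter error! D:")
        print("nap time. ZzZzZzzzzz...")
        time.sleep(60 * 16)                          # wait out the 15-minute rate-limit window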
{
"name": "Twitter API",
"location": "San Francisco, CA",
"created_at": "Wed May 23 06:01:13 +0000 2007",
"default_profile": true,
"favourites_count": 24,
"url": "http://dev.twitter.com",
"id": 6253282,
"profile_use_background_image": true,
"listed_count": 10713,
"profile_text_color": "333333",
"lang": "en",
"followers_count": 1198334,
"protected": false,
"geo_enabled": true,
"description": "The Real Twitter API.”,
"verified": true,
"notifications": false,
"time_zone": "Pacific Time (US & Canada)",
"statuses_count": 3331,
"status": {
"coordinates": null,
"created_at": "Fri Aug 24 16:15:49 +0000 2012",
"favorited": false,
"truncated": false,
sample size = 8509 accounts
Step 2:
Preprocessing.
Who's ready to clean?
1. "Flatten" the JSON into one row per user.
2. Variable recodes, e.g. consistently denoting missing values, True/False into 1/0.
3. Select only desired features for modeling.
How to make data with this?
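A minimal pandas sketch of the flattening and recoding steps, assuming results is the list of user dicts collected above (the column names come from the user JSON shown earlier):

import pandas as pd

# one row per user: nested keys become dotted column names
users = pd.json_normalize(results)

# recode booleans to 1/0 and standardize missing values
bool_cols = ['default_profile', 'protected', 'geo_enabled', 'verified']
users[bool_cols] = users[bool_cols].fillna(False).astype(int)
users['description'] = users['description'].fillna('')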
e.g. Lexical Diversity
• A token is a sequence of characters that we want
to treat as a group.
• For instance, lol, #blessed, or 💉🔪💇
• Lexical diversity is the ratio of unique tokens to
total tokens.
def lexical_diversity(text):
    if len(text) == 0:
        diversity = 0
    else:
        diversity = float(len(set(text))) / len(text)
    return diversity
# Easily compute summaries for each user!
grouped = tweets.groupby('screen_name')
diversity = grouped.apply(lexical_diversity)
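In practice the per-user text needs to be tokenized before computing the ratio. A sketch of one way to do that, assuming a tweets DataFrame with screen_name and text columns and NLTK's tweet tokenizer:

from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()

# join each user's tweets and tokenize them
tokens = (tweets.groupby('screen_name')['text']
                .apply(lambda texts: tokenizer.tokenize(' '.join(texts))))

# ratio of unique tokens to total tokens, per user
diversity = tokens.apply(lexical_diversity)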
Step 3:
Classification.
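The snippets that follow assume a feature list, train/test splits, and a 0/1 bot label on each account. A minimal setup sketch under those assumptions (the feature names are examples, not the full set used in the talk, and users is the flattened DataFrame from Step 2):

from sklearn.model_selection import train_test_split

# example feature set
features = ['statuses_count', 'friends_count', 'followers_count',
            'favourites_count', 'listed_count']

train, test = train_test_split(users, test_size=0.3, random_state=42)
y = train.bot   # 1 = bot, 0 = human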
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

# Naive Bayes
bayes = GaussianNB().fit(train[features], y)
bayes_predict = bayes.predict(test[features])

# Logistic regression
logistic = LogisticRegression().fit(train[features], y)
logistic_predict = logistic.predict(test[features])

# Random Forest
rf = RandomForestClassifier().fit(train[features], y)
rf_predict = rf.predict(test[features])

# Classification metrics
print(metrics.classification_report(test.bot, bayes_predict))
print(metrics.classification_report(test.bot, logistic_predict))
print(metrics.classification_report(test.bot, rf_predict))
Naive Bayes
              precision  recall  f1-score
0.0           0.97       0.27    0.42
1.0           0.20       0.95    0.33
avg / total   0.84       0.38    0.41

Logistic Regression
              precision  recall  f1-score
0.0           0.85       1.00    0.92
1.0           0.94       0.14    0.12
avg / total   0.87       0.85    0.79

Random Forest
              precision  recall  f1-score
0.0           0.91       0.98    0.95
1.0           0.86       0.51    0.64
avg / total   0.90       0.91    0.90
from sklearn.model_selection import GridSearchCV

# construct parameter grid
param_grid = {'max_depth': [1, 3, 6, 9, 12, 15, None],
              'max_features': [1, 3, 6, 9, 12],
              'min_samples_split': [1, 3, 6, 9, 12, 15],
              'min_samples_leaf': [1, 3, 6, 9, 12, 15],
              'bootstrap': [True, False],
              'criterion': ['gini', 'entropy']}

# fit best classifier
grid_search = GridSearchCV(RandomForestClassifier(), param_grid = param_grid).fit(train[features], y)

# assess predictive accuracy
predict = grid_search.predict(test[features])
print(metrics.classification_report(test.bot, predict))
print(grid_search.best_params_)
{'bootstrap': True,
'min_samples_leaf': 15,
'min_samples_split': 9,
'criterion': 'entropy',
'max_features': 6,
'max_depth': 9}
Best parameter set
for random forest
Tuned Random Forest
              precision  recall  f1-score
0.0           0.93       0.99    0.96
1.0           0.89       0.59    0.71
avg / total   0.92       0.93    0.92

Default Random Forest
              precision  recall  f1-score
0.0           0.91       0.98    0.95
1.0           0.86       0.51    0.64
avg / total   0.90       0.91    0.90
Iterative model development
in Scikit-learn is laborious.
# R: the caret package's train() interface for the same model
logistic_model = train(bot ~ statuses_count + friends_count + followers_count,
                       data = train,
                       method = 'glm',
                       family = binomial,
                       preProcess = c('center', 'scale'))
> confusionMatrix(logistic_predictions, test$bot)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 394 22
1 144 70
Accuracy : 0.7365
95% CI : (0.7003, 0.7705)
No Information Rate : 0.854
P-Value [Acc > NIR] : 1
Kappa : 0.3183
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.7323
Specificity : 0.7609
Pos Pred Value : 0.9471
Neg Pred Value : 0.3271
Prevalence : 0.8540
Detection Rate : 0.6254
Detection Prevalence : 0.6603
Balanced Accuracy : 0.7466
'Positive' Class : 0
> summary(logistic_model)
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2620 -0.6323 -0.4834 -0.0610 6.0228
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.7136 0.7293 -7.835 4.71e-15 ***
statuses_count -2.4120 0.5026 -4.799 1.59e-06 ***
friends_count 30.8238 3.2536 9.474 < 2e-16 ***
followers_count -69.4496 10.7190 -6.479 9.22e-11 ***
---
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2172.3 on 2521 degrees of freedom
Residual deviance: 1858.3 on 2518 degrees of freedom
AIC: 1866.3
Number of Fisher Scoring iterations: 13
# R: compare resampled performance across caret models
results = resamples(list(tree_model = tree_model,
                         bagged_model = bagged_model,
                         boost_model = boost_model))

# plot the resampling distributions
dotplot(results)
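For comparison, a rough scikit-learn counterpart to caret's resamples() is to cross-validate each candidate model and compare the scores. A sketch, reusing the classifiers imported above (the cv and scoring choices are placeholders):

from sklearn.model_selection import cross_val_score

models = {'naive_bayes': GaussianNB(),
          'logistic': LogisticRegression(),
          'random_forest': RandomForestClassifier()}

# cross-validated F1 for each model on the training data
for name, model in models.items():
    scores = cross_val_score(model, train[features], train.bot, cv=5, scoring='f1')
    print(name, scores.mean(), scores.std())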
Step 5:
Pontificate.
Python rules!
• The Python language is an incredibly powerful tool for end-to-end data analysis.
• Even so, some tasks are more work than they should be.
Lame bots
And now…
the bots.
Clicks
• Twitter Is Plagued With 23 Million Automated Accounts: http://valleywag.gawker.com/twitter-is-riddled-with-23-million-bots-1620466086
• How Twitter Bots Fool You into Thinking They Are Real People: http://www.fastcompany.com/3031500/how-twitter-bots-fool-you-into-thinking-they-are-real-people
• Rise of the Twitter bots: Social network admits 23 MILLION of its users tweet automatically without human input: http://www.dailymail.co.uk/sciencetech/article-2722677/Rise-Twitter-bots-Social-network-admits-23-MILLION-users-tweet-automatically-without-human-input.html
• Twitter Zombies: 24% of Tweets Created by Bots: http://mashable.com/2009/08/06/twitter-bots/
• How bots are taking over the world: http://www.theguardian.com/commentisfree/2012/mar/30/how-bots-are-taking-over-the-world
• That Time 2 Bots Were Talking, and Bank of America Butted In: http://www.theatlantic.com/technology/archive/2014/07/that-time-2-bots-were-talking-and-bank-of-america-butted-in/374023/
• The Rise of Twitter Bots: http://www.newyorker.com/tech/elements/the-rise-of-twitter-bots
• Olivia Taters, Robot Teenager: http://www.onthemedia.org/story/29-olivia-taters-robot-teenager/