This document discusses using Apache Spark and MLlib to predict how viral a tweet will get. It begins with introductions and an overview of Spark and MLlib. It then describes using random forest regression with features such as follower count, hashtags, mentions, and media to predict a tweet's retweets. The results showed an RMSE of 2.084, with text length and hashtags being the most important features. Next steps include predicting the virality of unseen tweets and analyzing more features, such as the tweet text.
4. Asim Jalis
Galvanize/Zipfian, Data Engineering Lead Instructor
Previously
Cloudera, Senior Technical Instructor
Salesforce, Microsoft, Hewlett-Packard, Senior
Software Engineer
MS in Computer Science from University of Virginia
https://www.linkedin.com/in/asimjalis
5. WHAT IS GALVANIZE’S DATA
ENGINEERING PROGRAM?
12-week Data Engineering Immersive
Capstone Project
20. Using Spark you can process datasets larger than
what can fit on a single machine.
You can process the data in parallel.
The code is executed on the machine where the data
is stored.
24. Spark has a clean, elegant API.
Spark can keep the intermediate data in memory
between stages.
This dramatically speeds up Machine Learning.
25. WHY DOES THIS SPEED UP
MACHINE LEARNING?
Machine Learning algorithms are frequently iterative.
They have this flavor:
Start with a guess
Calculate the error
Tweak the model
Try again
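The guess, error, tweak, repeat loop can be sketched in plain Scala. This is a minimal illustration and not code from the talk: the object name IterativeFit, the learning rate, and the one-parameter model y ≈ w * x are all assumptions chosen for brevity.

```scala
object IterativeFit {
  // Fit y ≈ w * x by repeating: guess, measure the error, tweak, try again.
  // Assumes xs and ys are non-empty and have the same length.
  def fit(xs: Seq[Double], ys: Seq[Double], steps: Int, rate: Double): Double = {
    var w = 0.0 // start with a guess
    for (_ <- 1 to steps) {
      // Gradient of the mean squared error with respect to w.
      val grad = xs.zip(ys).map { case (x, y) => 2 * (w * x - y) * x }.sum / xs.size
      w -= rate * grad // tweak the model, then loop to try again
    }
    w
  }
}
```

For example, IterativeFit.fit(Seq(1.0, 2.0, 3.0), Seq(2.0, 4.0, 6.0), steps = 200, rate = 0.05) converges to a value very close to 2.0. Every pass reads the same xs and ys again, which is why keeping the data in memory between passes helps.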
42. WHAT IS MLLIB?
Library for Machine Learning.
Builds on top of Spark RDDs.
Provides RDDs for Machine Learning.
Implements common Machine Learning algorithms.
45. WHAT ARE CLASSIFICATION
AND REGRESSION?
Classification and Regression are both forms of supervised
learning.
You build a model from examples whose features have known labels.
Then you use it to predict labels for examples whose labels are unknown.
46. WHAT ARE FEATURES AND
LABELS?
Features are input variables.
Labels are output variables.
These terms are useful when you use MLlib.
47. WHAT IS THE DIFFERENCE
BETWEEN REGRESSION
AND CLASSIFICATION?
In Regression the labels are continuous.
In Classification they are discrete.
48. WHAT ARE SOME EXAMPLES
OF REGRESSION?
Predict sales of an item in a store based on:
Day of week
Weather forecast
Season
Population density
49. WHAT ARE SOME EXAMPLES
OF REGRESSION?
Predict how many retweets a tweet will get.
51. WHAT PROBLEM ARE WE
TRYING TO SOLVE?
Predict how viral a tweet will get.
52. WHY IS THIS INTERESTING?
What features play the biggest role in this?
How well can we predict the virality of a tweet?
What is the best way to build a model for virality?
54. WHAT FEATURES ARE WE
GOING TO USE?
t.getUser.getFollowersCount
t.getMediaEntities.size
t.getUserMentionEntities.size
t.getHashtagEntities.size
t.getText.length
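As a rough sketch, the five values above can be packed into a numeric feature vector, with the retweet count as the label. SimpleTweet is a hypothetical stand-in for the twitter4j object t, and the field names and the TweetFeatures object are assumptions made for illustration only.

```scala
// Hypothetical stand-in for the fields read from a tweet; not a twitter4j class.
case class SimpleTweet(
  followers: Int,
  media: Int,
  mentions: Int,
  hashtags: Int,
  text: String,
  retweets: Int)

object TweetFeatures {
  // Features, in the same order as the list above.
  def features(t: SimpleTweet): Array[Double] =
    Array(t.followers, t.media, t.mentions, t.hashtags, t.text.length).map(_.toDouble)

  // Label: the value the model is trained to predict.
  def label(t: SimpleTweet): Double = t.retweets.toDouble
}
```

Keeping the feature order fixed matters, because feature importances are later reported by position in this vector.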
55. DO THESE FEATURES
MULTIPLY OR ADD?
Suppose your followers double.
Is that going to add a constant number to your
retweets?
Or will it multiply them by 2?
56. USING LOG FEATURES
Intuitively the features have a multiplicative effect.
So let's apply log to them.
This turns multiplication into addition.
57. WHY IS THIS A GOOD IDEA?
Regression models handle additive effects more easily.
Taking logs converts multiplication into addition.
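A small sketch of the log transform follows. It uses log(1 + x) rather than log(x) so that a count of zero maps to 0 instead of negative infinity. That choice, and the names LogFeatures, logFeature, and logFeatures, are assumptions; the slides only say to apply log.

```scala
object LogFeatures {
  // log(1 + x): an assumed variant that keeps zero counts finite.
  def logFeature(count: Double): Double = java.lang.Math.log1p(count)

  // Apply the transform to every entry of a raw feature vector.
  def logFeatures(raw: Array[Double]): Array[Double] = raw.map(logFeature)
}
```

The underlying identity is log(a * b) = log(a) + log(b). Doubling the follower count therefore adds the constant log(2) to the plain log feature, instead of multiplying it.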
63. WHAT IS THE BASIC IDEA?
Random Forests are an ensemble method.
A random forest is made up of a collection of decision trees.
Each decision tree looks only at a subset of the features.
For regression, the final value is the average of the predictions of all the trees.
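The averaging step can be sketched as below. Each tree is modeled as a plain function from a feature vector to a number. TinyForest and Tree are hypothetical names, and this is not MLlib's implementation.

```scala
object TinyForest {
  // A "tree" here is just a function from features to a predicted value.
  type Tree = Array[Double] => Double

  // Regression forest prediction: the mean of the individual tree predictions.
  // Assumes trees is non-empty.
  def predict(trees: Seq[Tree], features: Array[Double]): Double =
    trees.map(tree => tree(features)).sum / trees.size
}
```

In this sketch a tree that "looks only at a subset of features" is simply a function that reads some positions of the array and ignores the rest.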
64. WHAT ARE DECISION
TREES?
Decision trees play 20 questions on your data.
At each point they ask which feature separates the
labels best.
The model consists of these features and their
cutpoints.
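One question in this game of 20 questions can be sketched as a search over features and cutpoints. The code below picks the pair whose two sides have the smallest total squared error around their own means. The names Stump, sse, and bestSplit are hypothetical, and a real tree learner repeats this step recursively on each side.

```scala
object Stump {
  // Sum of squared deviations from the mean; 0 for an empty group.
  def sse(ys: Seq[Double]): Double =
    if (ys.isEmpty) 0.0
    else {
      val mean = ys.sum / ys.size
      ys.map(y => (y - mean) * (y - mean)).sum
    }

  // rows holds (features, label) pairs and is assumed to be non-empty.
  // Returns the (feature index, cutpoint) that separates the labels best.
  def bestSplit(rows: Seq[(Array[Double], Double)]): (Int, Double) = {
    val candidates = for {
      f <- rows.head._1.indices
      cut <- rows.map { case (x, _) => x(f) }.distinct
    } yield (f, cut)
    candidates.minBy { case (f, cut) =>
      val (left, right) = rows.partition { case (x, _) => x(f) <= cut }
      sse(left.map(_._2)) + sse(right.map(_._2))
    }
  }
}
```

The returned pair is what the slide calls a feature and its cutpoint: rows with a value at or below the cutpoint go left, and the rest go right.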
65. WHAT ARE FEATURE
IMPORTANCES?
Random Forests calculate feature importances.
The values show, relative to one another, how much removing
each feature would hurt the error.
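One way to make "how much removing a feature hurts" concrete is to blank out a feature and measure how much the error grows. The sketch below replaces a feature with its column mean and reports the increase in RMSE. This is a stand-in technique for illustration and is not necessarily how MLlib computes its importances. AblationImportance, Model, and Row are hypothetical names.

```scala
object AblationImportance {
  type Model = Array[Double] => Double
  type Row = (Array[Double], Double) // (features, label)

  // Root mean squared error of the model on the given rows (assumed non-empty).
  def rmse(model: Model, rows: Seq[Row]): Double =
    math.sqrt(rows.map { case (x, y) => math.pow(model(x) - y, 2) }.sum / rows.size)

  // Increase in RMSE when feature f is replaced by its column mean,
  // which removes the information that feature carried.
  def importance(model: Model, rows: Seq[Row], f: Int): Double = {
    val mean = rows.map { case (x, _) => x(f) }.sum / rows.size
    val ablated = rows.map { case (x, y) => (x.updated(f, mean), y) }
    rmse(model, ablated) - rmse(model, rows)
  }
}
```

A feature the model never reads gets an importance of 0 under this scheme, and larger values mean the model relies on that feature more.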
71. START SPARK SHELL IN
SAME DIRECTORY AS THIS
FILE
# Join the Maven coordinates below into one comma-separated list
SPARK_PKGS=$(cat << END | xargs echo | sed 's/ /,/g'
org.apache.hadoop:hadoop-aws:2.7.1
com.amazonaws:aws-java-sdk-s3:1.10.30
com.databricks:spark-csv_2.10:1.3.0
com.databricks:spark-avro_2.10:2.0.1
org.apache.spark:spark-streaming-twitter_2.10:1.5.2
org.twitter4j:twitter4j-core:4.0.4
END
)
# Start the Spark shell with those packages on the classpath
spark-shell --packages "$SPARK_PKGS"
72. SPARK MLLIB CODE
Grab Scala code for Spark MLlib from GitHub
Paste into Spark Shell
https://gist.github.com/asimjalis/965bd44657b90aeab887
79. WHAT ARE NEXT STEPS?
Predict an unpublished tweet’s virality.
Analyze more features.
For example, the text of the tweet.
Analyze features of most viral tweets within a user’s
timeline.