Spark MLlib and Viral Tweets

BY ASIM JALIS
GALVANIZE

Asim Jalis
Galvanize/Zipfian, Data Engineering Lead Instructor
Previously
Cloudera, Senior Technical Instructor
Salesforce, Microsoft, Hewlett-Packard, Senior
Software Engineer
MS in Computer Science from University of Virginia
https://www.linkedin.com/in/asimjalis

WHAT IS GALVANIZE’S DATA
ENGINEERING PROGRAM?
12-week Data Engineering Immersive
Capstone Project

How can we predict how viral a tweet will get using
Spark and MLlib?
Example of how to use Spark’s Machine Learning
libraries.

HOW MANY PEOPLE HERE
ARE FAMILIAR WITH
APACHE SPARK?

ARE FAMILIAR WITH MLLIB?

ARE FAMILIAR WITH
MACHINE LEARNING?

ARE FAMILIAR WITH
RANDOM FORESTS?

Framework for dividing up data
and processing it across a cluster
in a fault-tolerant way

Using Spark you can process datasets larger than
what can fit on a single machine.
You can process the data in parallel.
The code is executed on the machine where the data
is stored.

WHAT ARE THE
ALTERNATIVES TO SPARK?

WHY USE SPARK INSTEAD
OF MAPREDUCE?

Spark has a clean elegant API.
Spark can keep the intermediate data in memory
between stages.
This dramatically speeds up Machine Learning.

WHY DOES THIS SPEED UP
MACHINE LEARNING?
Machine Learning algorithms are frequently iterative.
They have this flavor:
Start with a guess
Calculate the error
Tweak the model
Try again

WHAT IS THE SPARK SHELL?
Lets you interact with your data and code.
Provides a REPL (Read-Eval-Print-Loop).
Great development environment.

WHY SPARK SHELL IS NEAT
Lets you include dependencies on command line.
No pom.xml or build.sbt files required.

HOW DO I START SPARK
WITH ALL DEPENDENCIES?

SPARK_PKGS=$(cat << END | xargs echo | sed 's/ /,/g'
org.apache.hadoop:hadoop-aws:2.7.1
com.amazonaws:aws-java-sdk-s3:1.10.30
com.databricks:spark-csv_2.10:1.3.0
com.databricks:spark-avro_2.10:2.0.1
org.apache.spark:spark-streaming-twitter_2.10:1.5.2
org.twitter4j:twitter4j-core:4.0.4
END
)
spark-shell --packages $SPARK_PKGS

WHAT ARE RDDS?
RDDs are Resilient Distributed Datasets.
They are the fundamental particles of Spark.

WHAT ARE RDDS
INTUITIVELY?
Sequence of records.
Managed from Driver.
Distributed across executors.

WHAT WILL THIS OUTPUT?
sc.parallelize(Array(1,2,3,4)).
map(x => x + 1).
filter(x => x % 2 == 0).
collect

WHERE IS IT EXECUTING?
The driver is defining what is to be done.
The data operations happen on the Executors.
+ and % are executing on Executors.

WHAT’S HAPPENING UNDER
THE HOOD

WHAT IS MLLIB?
Library for Machine Learning.
Builds on top of Spark RDDs.
Provides RDDs for Machine Learning.
Implements common Machine Learning algorithms.

WHAT ARE CLASSIFICATION
AND REGRESSION?
Classification and Regression are forms of supervised
learning.
You build a model using labeled features.
Then you use it to predict labels for unlabeled features.

WHAT ARE FEATURES AND
LABELS?
Features are input variables.
Labels are output variables.
These terms are useful when you use MLlib.
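In the RDD-based MLlib API, a label and its features travel together as a `LabeledPoint`. The sketch below is meant to be pasted into spark-shell; the object name `LabeledTweet` and all the numbers are hypothetical, and the talk's own code (linked later) may be organized differently.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object LabeledTweet {
  // Hypothetical numbers for one tweet. The label (12.0) stands for its
  // retweet count; the features stand for followers, media, mentions,
  // hashtags, and text length, in that order.
  val point: LabeledPoint =
    LabeledPoint(12.0, Vectors.dense(350.0, 1.0, 0.0, 2.0, 97.0))
}
```

`LabeledTweet.point.label` is the output variable and `LabeledTweet.point.features` holds the input variables.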

WHAT IS THE DIFFERENCE
BETWEEN REGRESSION
AND CLASSIFICATION?
In Regression the labels are continuous.
In Classification they are discrete.

WHAT ARE SOME EXAMPLES
OF REGRESSION?
Predict sales of an item in a store based on:
Day of week
Weather forecast
Season
Population density

WHAT ARE SOME EXAMPLES
OF REGRESSION?
Predict how many retweets a tweet will get.

WHAT PROBLEM ARE WE
TRYING TO SOLVE?
Predict how viral a tweet will get.

WHY IS THIS INTERESTING?
What features play the biggest role in this?
How well can we predict the virality of a tweet?
What is the best way to build a model for virality?

WHY RETWEETS?
Retweets feed into themselves.
Unlike favs, retweets can trigger a cascade of
retweets.

WHAT FEATURES ARE WE
GOING TO USE?
t.getUser.getFollowersCount
t.getMediaEntities.size
t.getUserMentionEntities.size
t.getHashtagEntities.size
t.getText.length

DO THESE FEATURES
MULTIPLY OR ADD?
Suppose your followers double.
Is that going to add a constant number to your
retweets?
Or will it multiply them by 2?

USING LOG FEATURES
Intuitively the features have a multiplicative effect.
So let's apply log to them.
This turns multiplication to addition.

WHY IS THIS A GOOD IDEA?
Regression works better with addition.
Using log lets us convert multiplication to addition.
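A minimal sketch of the log transform, in plain Scala. The case class `TweetCounts` is a hypothetical stand-in for the five values read from a tweet, and the use of `log1p` (log of 1 + x) is an assumption made here so that a count of zero does not turn into negative infinity; the talk does not say which variant it uses.

```scala
object LogFeatures {
  // Hypothetical container mirroring the five counts read from a tweet.
  case class TweetCounts(followers: Int, media: Int, mentions: Int,
                         hashtags: Int, textLength: Int)

  // log1p(x) = log(1 + x), so a count of zero maps to 0.0
  // instead of -Infinity.
  def logFeatures(t: TweetCounts): Array[Double] =
    Array(t.followers, t.media, t.mentions, t.hashtags, t.textLength)
      .map(x => math.log1p(x.toDouble))
}
```

The identity behind the transform is log(a * b) = log(a) + log(b): doubling the followers adds the constant log(2) to the log of the follower count, whatever the starting value.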

WHAT ARE RANDOM
FORESTS?
Random Forests are a Classification and Regression
algorithm.

WHO INVENTED THEM?
Random Forests were invented by Leo Breiman and
Adele Cutler.
At Berkeley in 2001.

WHAT IS THE BASIC IDEA?
Random Forests are an ensemble method.
A forest is made up of a collection of decision trees.
Each decision tree looks only at a subset of features.
For regression, the final value is the average of all decision trees.
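The averaging step can be sketched with a toy forest in plain Scala. The three "trees" below are hypothetical hand-written rules, not trained models, and `TinyForest` is an illustrative name; the point is only that each tree consults a subset of the features and the forest averages their answers.

```scala
object TinyForest {
  // A "tree" here is just a function from a feature vector to a prediction.
  type Tree = Array[Double] => Double

  // Three hypothetical hand-written stumps, each looking at one feature.
  val trees: Seq[Tree] = Seq[Tree](
    f => if (f(0) > 5.0) 20.0 else 2.0,  // splits on feature 0
    f => if (f(1) > 0.5) 12.0 else 4.0,  // splits on feature 1
    f => if (f(0) > 8.0) 30.0 else 6.0   // splits on feature 0 again
  )

  // The forest's regression prediction is the mean of the trees' predictions.
  def predict(features: Array[Double]): Double =
    trees.map(tree => tree(features)).sum / trees.size
}
```

For the input `Array(1.0, 0.0)` the three stumps answer 2.0, 4.0, and 6.0, so `TinyForest.predict` returns their mean, 4.0.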

WHAT ARE DECISION
TREES?
Decision trees play 20 questions on your data.
At each point they ask which feature separates the
labels best.
The model consists of these features and their
cutpoints.
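One way to picture a single "question" is a search for the best cutpoint on one feature. The sketch below is plain Scala and makes several assumptions: it handles a single feature only, it scores a split by the total squared error of the two sides, and the names `BestSplit`, `squaredError`, and `bestCutpoint` are made up for this example. A real tree repeats this search across features and then recurses into each side.

```scala
object BestSplit {
  // Sum of squared deviations from the mean; 0.0 for an empty group.
  def squaredError(labels: Seq[Double]): Double =
    if (labels.isEmpty) 0.0
    else {
      val mean = labels.sum / labels.size
      labels.map(y => math.pow(y - mean, 2)).sum
    }

  // points holds (feature value, label) pairs and must be non-empty.
  // Try each observed feature value as a cutpoint and keep the one whose
  // two sides (feature <= cut, feature > cut) have the lowest total error.
  def bestCutpoint(points: Seq[(Double, Double)]): Double =
    points.map(_._1).distinct.minBy { cut =>
      val (left, right) = points.partition(_._1 <= cut)
      squaredError(left.map(_._2)) + squaredError(right.map(_._2))
    }
}
```

With the points `(1.0, 1.0), (2.0, 1.0), (3.0, 10.0), (4.0, 10.0)`, cutting at 2.0 separates the low labels from the high labels exactly, so `bestCutpoint` returns 2.0.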

WHAT ARE FEATURE
IMPORTANCES?
Random Forests calculate feature importance.
The values show, in relative terms, how much removing
that feature would hurt the error.

GET TWITTER CREDENTIALS
Go to https://apps.twitter.com/
Click on Create New App.
For Website use your GitHub account.
Leave Callback URL blank.

SAVE THESE 4 TWITTER
KEYS
Consumer Key (API Key)
Consumer Secret (API Secret)
Access Token
Access Token Secret

CREATE
TWITTER4J.PROPERTIES
FILE
debug=true
http.prettyDebug=true
oauth.consumerKey=xxxx
oauth.consumerSecret=xxxx
oauth.accessToken=xxxx
oauth.accessTokenSecret=xxxx

START SPARK SHELL IN
SAME DIRECTORY AS THIS
FILE
SPARK_PKGS=$(cat << END | xargs echo | sed 's/ /,/g'
org.apache.hadoop:hadoop-aws:2.7.1
com.amazonaws:aws-java-sdk-s3:1.10.30
com.databricks:spark-csv_2.10:1.3.0
com.databricks:spark-avro_2.10:2.0.1
org.apache.spark:spark-streaming-twitter_2.10:1.5.2
org.twitter4j:twitter4j-core:4.0.4
END
)
spark-shell --packages $SPARK_PKGS

SPARK MLLIB CODE
Grab Scala code for Spark MLlib from GitHub
Paste into Spark Shell
https://gist.github.com/asimjalis/965bd44657b90aeab887

WHAT WAS THE ROOT MEAN
SQUARE ERROR?
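The value itself is not recorded on this slide. For reference, root mean square error can be computed from (prediction, label) pairs as in the plain Scala sketch below; the object name `Rmse` and the choice of an in-memory `Seq` (rather than an RDD) are assumptions made for illustration.

```scala
object Rmse {
  // Square each prediction error, average the squares, take the square root.
  // The input must be non-empty.
  def rmse(predictionsAndLabels: Seq[(Double, Double)]): Double = {
    val squaredErrors =
      predictionsAndLabels.map { case (p, y) => math.pow(p - y, 2) }
    math.sqrt(squaredErrors.sum / squaredErrors.size)
  }
}
```

For example, `Rmse.rmse(Seq((3.0, 1.0), (1.0, 3.0)))` is 2.0, since both predictions are off by 2.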

WHAT WERE THE FEATURE
IMPORTANCES?

Feature                          Importance
t.getText.length                 0.340
t.getHashtagEntities.size        0.328
t.getUserMentionEntities.size    0.168
t.getUser.getFollowersCount      0.085
t.getMediaEntities.size          0.076

WHAT ARE NEXT STEPS?
Predict an unpublished tweet’s virality.
Analyze more features.
For example, the text of the tweet.
Analyze features of most viral tweets within a user’s
timeline.

WANT MORE INFORMATION
ON GALVANIZE’S DATA
ENGINEERING IMMERSIVE?

Talk to me.
We are also hiring for faculty positions.
