3. o Intro & Setup [8:00-8:20)
• Goals/non-goals
o Spark & Data Science DevOps [8:20-8:40)
• Spark in the context of Data Science
o Where exactly is Apache Spark headed? [8:40-9:30)
• Spark Yesterday, Today & Tomorrow
• Spark Stack
o Break [9:30-10:00)
o DataFrames for the Data Scientist [10:00-11:30)
• pySpark Classes
• Walkthrough of DataFrames
• Hands-on Notebooks
o [15] Discussions/Slack (11:30-11:45)
Agenda : Introduction To Spark
http://globalbigdataconference.com/52/santa-clara/big-data-developer-conference/schedule.html
4. o Review (2:00-2:30)
• 004-Orders-Homework-Solution
• MLlib Statistical Toolbox
• Summary, Correlations
o [20] Linear Regression (2:30-2:45)
o [20] “Mood Of the Union” (2:45-3:15)
• State of the Union w/ Washington, Lincoln, FDR, JFK, Clinton, Bush & Obama
• Map-reduce, parse text
o Break (3:15-3:30)
o [60] Predicting Survivors with Classification (3:30-4:30)
• Decision Trees
• Naive Bayes (Titanic data set)
o Break (4:30-4:45)
o [20] Clustering (4:45-5:05)
• K-means for GallacticHoppers!
o [20] Recommendation Engine (5:05-5:25)
• Collaborative Filtering w/ MovieLens
o [15] Discussions/Slack (5:45-6:00)
Agenda : Data Wrangling w/ DataFrames & MLlib
http://globalbigdataconference.com/52/santa-clara/big-data-developer-conference/schedule.html
5. Goals & non-goals
Goals
¤ Understand how to program Machine Learning with Spark & Python
¤ Focus on programming & ML application
¤ Give you focused time to work through the examples
§ Work with me. I will wait if you want to catch up
¤ Less theory, more usage - let us see if this works
¤ As straightforward as possible
§ The programs can be optimized
Non-goals
¡ Go deep into the algorithms
• We don't have sufficient time. The topic could easily be a 5-day tutorial!
¡ Dive into Spark internals
• That is for another day
¡ The underlying computation, communication, constraints & distribution is a fascinating subject
• Paco does a good job explaining them
¡ A passive talk
• Nope. Interactive & hands-on
6. About Me
o Data Scientist
• Decision Data Science & Product Data Science
• Insights = Intelligence + Inference + Interface [https://goo.gl/s2KB6L]
• Predicting NFL with Elo like Nate Silver & 538 [NFL : http://goo.gl/Q2OgeJ, NBA’15 : https://goo.gl/aUhdo3]
o Have been speaking at OSCON [http://goo.gl/1MJLu], PyCon, PyData [http://vimeo.com/63270513, http://www.slideshare.net/ksankar/pydata-19] …
o Full-day Spark workshop “Advanced Data Science w/ Spark” at Spark Summit East ’15 [https://goo.gl/7SBKTC]
o Co-author : “Fast Data Processing with Spark”, Packt Publishing [http://goo.gl/eNtXpT]
o Reviewer : “Machine Learning with Spark”, Packt Publishing
o Have done lots of things:
• Big Data (Retail, Bioinformatics, Financial, AdTech); starting MS-CFRM, University of WA
• Written books (Web 2.0, Wireless, Java, …), standards, some work in AI
• Guest Lecturer at Naval PG School, …
o Volunteer as Robotics Judge at FIRST Lego League World Competitions
o @ksankar, doubleclix.wordpress.com
7. Close Encounters
— 1st
◦ This Tutorial
— 2nd
◦ Do More Hands-on Walkthroughs
— 3rd
◦ Listen To Lectures
◦ More competitions …
8. Spark Installation
o Install Spark 1.4.1 on your local machine
• https://spark.apache.org/downloads.html
• Pre-built for Hadoop 2.6 is fine
• Download & uncompress
• Remember the path & use it wherever you see /usr/local/spark/
• I have downloaded it in /usr/local & have a softlink spark pointing to the latest version
o Install IPython
9. Tutorial Materials
o GitHub : https://github.com/xsankar/global-bd-conf
• Clone or download the zip
o Open a terminal
o cd ~/global-bd-conf
o IPYTHON=1 IPYTHON_OPTS="notebook" /usr/local/spark/bin/pyspark --packages com.databricks:spark-csv_2.11:1.0.3
o Notes :
• I have a soft link “spark” in my /usr/local that points to the Spark version that I use. For example: ln -s spark-1.4.1/ spark
o Click on the IPython dashboard
o Run 000-PreFlightCheck.ipynb
o Run 001-TestSparkCSV.ipynb
o Now you are ready for the workshop !
10. Spark & Data Science DevOps
8:20
11. Spark in the context of data science
12. Data Science :
The art of building a model with known knowns, which when let loose, works with unknown unknowns!
Donald Rumsfeld is an armchair Data Scientist !
http://smartorg.com/2013/07/valuepoint19/
[Quadrant diagram, The World vs. You:]
o Known Knowns : What we do
o Known Unknowns : Potential facts & outcomes we are aware of, but not with certainty; stochastic processes, probabilities
o Unknown Knowns : Others know, you don’t
o Unknown Unknowns : Facts, outcomes or scenarios we have not encountered, nor considered; “black swans”, outliers, long tails of probability distributions; lack of experience, imagination
“There are Known Knowns: there are things we know that we know. There are Known Unknowns: that is to say, there are things that we now know we don't know. But there are also Unknown Unknowns: there are things we do not know we don't know.”
13. The curious case of the Data Scientist
o Data Scientist is multi-faceted & contextual
o Data Scientist should be building Data Products
o Data Scientist should tell a story
http://doubleclix.wordpress.com/2014/01/25/the-curious-case-of-the-data-scientist-profession/
“Large is hard; Infinite is much easier !” – Titus Brown
14. Data Science - Context
Data Management
o Collect
• Volume, Velocity
• Streaming Data
o Store
• Canonical form
• Data catalog
• Data Fabric across the organization
• Access to multiple sources of data
• Think Hybrid – Big Data Apps, Appliances & Infrastructure
o Transform
• Metadata
• Monitor counters & metrics
• Structured vs. multi-structured
Data Science
o Reason
• Flexible & selectable
§ Data subsets
§ Attribute sets
o Model
• Refine model with
§ Extended data subsets
§ Engineered attribute sets
• Validation run across a larger data set
o Deploy
• Scalable model deployment
• Big Data automation & purpose-built appliances (soft/hard)
• Manage SLAs & response times
o Visualize / Explore / Recommend / Predict
• Dynamic data sets
• 2-way key-value tagging of datasets
• Extended attribute sets
• Advanced analytics
• Performance, scalability, refresh latency, in-memory analytics
• Advanced visualization, interactive dashboards, map overlays, infographics
¤ Bytes to Business, a.k.a. build the full stack
¤ Find relevant data for the business
¤ Connect the dots
15. Data Science - Context
o Volume, Velocity, Variety : “Data of unusual size” that can't be brute-forced
o Context, Connectedness
o Intelligence, Inference, Interface : the Three Amigos
o Interface = Cognition
o Intelligence = Compute (CPU) & Computational (GPU)
o Infer Significance & Causality
16. Day in the life of a (super) Model
[Diagram linking:] Intelligence · Inference · Interface · Data Representation · Algorithms · Parameters · Attributes · Data (Scoring) · Model Selection · Reason & Learn · Models · Visualize, Recommend, Explore · Model Assessment · Feature Selection · Dimensionality Reduction
17. Data Science Maturity Model & Spark
(Columns: Isolated Analytics | Integrated Analytics | Aggregated Analytics | Automated Analytics)
Data : Small Data | Larger Data set | Big Data | Big Data Factory Model
Context : Local | Domain | Cross-domain + External | Cross-domain + External
Model, Reason & Deploy :
• Isolated – Single set of boxes, usually owned by the Model Builders; Departmental
• Integrated – Deploy on Central Analytics Infrastructure; Models still owned & operated by Modelers; Partly enterprise-wide
• Aggregated – Central Analytics Infrastructure; Model & Reason – by Model Builders; Deploy, Operate – by ops; Residuals and other metrics monitored by modelers; Enterprise-wide
• Automated – Distributed Analytics Infrastructure; AI-augmented models; Model & Reason – by Model Builders; Deploy, Operate – by ops; Data as a monetized service, extending to ecosystem partners
Outputs : Reports | Dashboards | Dashboards + some APIs | Dashboards + well-defined APIs + programming models
Type : Descriptive & Reactive | + Predictive | + Adaptive | Adaptive
Datasets : All in the same box | Fixed data sets, usually in temp data spaces | Flexible data & attribute sets | Dynamic datasets with well-defined refresh policies
Workload : Skunk works | Business-relevant apps with approx. SLAs | High-performance appliance clusters | Appliances and clusters for multiple workloads including real-time apps; infrastructure for emerging technologies
Strategy : Informal definitions | Data definitions buried in the analytics models | Some data definitions | Data catalogue, metadata & annotations; Big Data MDM strategy
18. The Sense & Sensibility of a Data Scientist DevOps
Factory = Operational
Lab = Investigative
http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/
19. Where exactly is Apache Spark headed ?
Spark Yesterday, Today & Tomorrow …
“Unified engine across diverse data sources, workloads & environments”
8:40
20. http://free-stock-illustration.com/winding+road+clip+art
Spark 1.x
• Fast engine for big data processing
• Fast to run code & fast to write code
• In-memory computation graphs, compatible with the Hadoop ecosystem, with very usable APIs
• Iterative & interactive apps that operate on data multiple times, which are not a good use case for Hadoop
Spark 1.3 & beyond
has been the catalyst
for a renaissance in
Data Science !
Spark 1.4+
• Multi-pass analytics - ML pipelines, GraphX
• Ad-hoc queries - DataFrames
• Real-time stream processing – Spark Streaming
• Parallel Machine Learning Algorithms beyond the
basic RDDs
• More types of data sources as input & output
• More integration with R to span statistical computing
beyond “single-node tools”
• More integration with apps like visualization
dashboards
• More performance with even larger datasets &
complex applications – Project Tungsten
Spark Yesterday, Today & Tomorrow …
21. Spark Directions
o Data Science
• DataFrames
• ML Pipelines
• SparkR
o Platform APIs
• Growing the ecosystem
§ Data Sources - uniform access to diverse data sources
§ Pluggable “smart” DataSource API for reading/writing DataFrames while minimizing I/O
§ Spark Packages
§ Deployment utilities for Google Compute, Azure & Job Server
o Streaming, DAG Visualization & Debugging
• Spark Streaming flow control & optimized state management
o Execution Optimization (Project Tungsten)
• Focus on CPU efficiency
• Run-time code generation
• Cache locality & cache-aware data structures
• Binary format for aggregations
• Spark-managed memory
• Off-heap memory management
23. RDD – The workhorse of Core Spark
o Resilient Distributed Datasets
• Collection that can be operated in parallel
o Transformations – create RDDs
• Map, Filter,…
o Actions – Get values
• Collect, Take,…
o We will apply these operations during this tutorial
24. [Spark stack diagram]
DataFrame API
Spark SQL | Spark Streaming | SparkR | MLlib | GraphX | Packages
ML Pipelines, Advanced Analytics : Neural Networks, Deep Learning, Parameter Server
Languages : R, Scala, Java, Python
Catalyst Optimizer – optimizes the execution plan
Data Sources : Parquet, Hadoop, Cassandra, JSON, CSV, JDBC, …
Tungsten Execution
Spark Core / RDD
26. Spark DataFrames for the Data Scientist
“A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with.”
- From The Hitchhiker's Guide to the Galaxy, by Douglas Adams.
DataFrames ! The most massively useful thing a Data Scientist can have …
10:00
27. Data Science “folk knowledge” (1 of A)
o "If you torture the data long enough, it will confess to anything." – Hal Varian,
Computer Mediated Transactions
o Learning = Representation + Evaluation + Optimization
o It’s Generalization that counts
• The fundamental goal of machine learning is to generalize beyond the
examples in the training set
o Data alone is not enough
• Induction not deduction - Every learner should embody some knowledge
or assumptions beyond the data it is given in order to generalize beyond it
o Machine Learning is not magic – one cannot get something from nothing
• In order to infer, one needs the knobs & the dials
• One also needs a rich expressive dataset
A few useful things to know about machine learning - by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755
32. 4. Columns & Rows (1 of 3)
o Select a column
• by the df(“<columnName>”) notation or the df.<columnName> notation
• The recommended way is df(“<columnName>”), because a column name can collide with a DataFrame method if we use df.<columnName>
o Column-wise operations like +, -, *, /, % (modulo), &&, ||, <, <=, > and >=
• df(“total”) = df(“price”) * df(“qty”)
• The inequality operator is !==, the usual equality operator is ===, and <=> is an equality test that is safe for null values
o Meta operations – type conversion (cast), alias, not null, …
• df_cars.mpg.cast("double").alias('mpg')
o Run arbitrary UDFs on a column (see next page)
33. 4. Columns & Rows (2 of 3)
Run arbitrary UDFs on a column
34. 4. Columns & Rows (3 of 3)
Interesting operations …
Adding a column …
35. 5. DataFrame : RDD-like Operations
Function | Description
df.sort(<sort expression>) or df.orderBy(<sort expression>) | Returns a sorted DataFrame. There are multiple ways of specifying the sort expression. Use of orderBy is recommended (as the syntax is closer to SQL), for example:
df_orders_1.groupBy("CustomerID","Year").count().orderBy('count', ascending=False).show()
df.filter(<condition>) or df.where(<condition>) | Returns a new DataFrame after applying the <condition>. The condition is usually based on a column. Use of the where form is recommended (as the syntax is closer to the SQL world), for example:
df_orders.where(df_orders['ShipCountry'] == 'France')
df.coalesce(n) | Returns a DataFrame with n partitions, same as the coalesce(n) method of RDD
df.foreach(<function>) | Applies a function to all the rows of a DataFrame
df.map(lambda r: …) | Applies the function to all the rows and returns the resulting object
df.flatMap(lambda r: …) | Returns an RDD, flattened, after applying the function to all the rows of the DataFrame
df.rdd | Returns the DataFrame as an RDD of Row objects
df.na.replace([<values to be replaced>], [<replacing values>], subset=[<columns>]) (also DataFrame.replace() / DataFrameNaFunctions.replace()) | An interesting function, very useful and a little strange syntax-wise. The recommended form is df.na.replace(), even though the .na namespace throws one off a little bit. Use subset= for the column names. The syntax is different from the Scala syntax.
36. 6. DataFrame : Action
o cache()
o collect(), collectAsList()
o count()
o describe()
o first(), head(), show(), take()
o …
38. 8. DataFrame : Statistical Functions
The pair-wise frequency (contingency table) of transmission type and number of speeds shows interesting observations:
• All automatic cars in the dataset are 3-speed, while most of the manual-transmission cars have 4 or 5 speeds
• Almost all the manual cars have 2 barrels, while the automatic cars have 2 and 4 barrels
39. 9. DataFrame : Aggregate Functions
o The pyspark.sql.functions module (org.apache.spark.sql.functions for Scala) contains the aggregation functions
o There are two types of aggregations: one on column values, and the other on subsets of column values, i.e. grouped by the values of some other columns
• pyspark.sql.functions.avg(“sales”)
• df.groupBy(“year”).agg({“sales”: “avg”})
o count(), countDistinct()
o first(), last()
40. 10. DataFrame : na
o One of the tenets of big data and data science is that data is never fully clean - while we can handle types, formats et al, missing values are always challenging
o One easy solution is to drop the rows that have missing values, but then we would lose valuable data in the columns that do have values
o A better solution is to impute data based on some criteria. It is true that data cannot be created out of thin air, but data can be inferred with some success - it is better than dropping the rows
• We can replace null with 0
• A better solution is to replace numerical values with the average of the rest of the valid values; for categorical values, replacing with the most common value is a good strategy
• We could use mode or median instead of mean
• Another good strategy is to infer the missing value from other attributes, i.e. “evidence from multiple fields”
• For example, the Titanic data has a name field; for imputing the missing age field, we could use the Mr., Master, Mrs. & Miss designations from the name and then fill in the average of the age field from the corresponding designation. So a row with a missing age and “Master.” in the name would get the average age of all records with “Master.”
• There are also fields for number of siblings and number of spouses. We could average the age based on the values of those fields
• We could even average the ages from the different strategies
46. Algorithm spectrum (from Machine Learning through “Cute Math” to Artificial Intelligence)
o Regression, Logit, CART, Ensemble : Random Forest
o Clustering, KNN
o Genetic Alg, Simulated Annealing
o Collab Filtering, SVM, Kernels, SVD
o NNet, Boltzmann Machine, Feature Learning
50. Linear Regression - API
LabeledPoint | The features and label of a data point
LinearModel | weights, intercept
LinearRegressionModelBase | predict()
LinearRegressionModel |
LinearRegressionWithSGD | train(cls, data, iterations=100, step=1.0, miniBatchFraction=1.0, initialWeights=None, regParam=1.0, regType=None, intercept=False)
LassoModel | Least-squares fit with an l_1 penalty term
LassoWithSGD | train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)
RidgeRegressionModel | Least-squares fit with an l_2 penalty term
RidgeRegressionWithSGD | train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)
57. “Mood Of the Union” with TF-IDF
2:45
58. Scenario – Mood Of the Union
o It has been said that the State of the Union speech by the President of the USA reflects the social challenges faced by the country
o If so, can we infer the mood of the country by analyzing SOTU ?
o If we embark on this line of thought, how would we do it with Spark & Python ?
o Is it different from Hadoop MapReduce ?
o Is it better ?
59. POA (Plan Of Action)
o Collect the State of the Union speeches by George Washington, Abe Lincoln, FDR, JFK, Bill Clinton, GW Bush & Barack Obama
o Read the 7 SOTU speeches from the 7 presidents into 7 RDDs
o Create word vectors
o Transform into word-frequency vectors
o Remove stock common words
o Inspect the top n words to see if they reflect the sentiment of the time
o Compute set differences and see how new words have cropped up
o Compute TF-IDF (homework!)
60. Look out for these interesting Spark features
o RDD Map-Reduce
o How to parse input
o Removing common words
o Sorting an RDD by value
61. Read & Create word vector
iPython notebook at https://github.com/xsankar/cloaked-ironman
62. Remove Common Words – 1 of 3
iPython notebook at https://github.com/xsankar/cloaked-ironman
67. GWB vs Abe Lincoln as reflected by SOTU
68. Epilogue
o Interesting exercise
o Highlights
• Map-reduce in a couple of lines !
• But it is not exactly the same as Hadoop MapReduce (see the excellent blog by Sean Owen¹)
• Set differences using subtractByKey
• Ability to sort a map by values (or any arbitrary function, for that matter)
o To explore as homework:
• TF-IDF in http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
1. http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
71. Data Science “folk knowledge” (Wisdom of Kaggle)
Jeremy’s Axioms
o Iteratively explore data
o Tools
• Excel format, Perl, Perl Book, Spark !
o Get your head around the data
• Pivot tables
o Don’t over-complicate
o If people give you data, don’t assume that you need to use all of it
o Look at pictures !
o Keep a tab on the history of your submissions
o Don’t be afraid to submit simple solutions
• We will do this during this workshop
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-science-talk-by-jeremy-howard/
72. Classification - Scenario
Titanic Passenger Metadata
• Small
• 3 predictors : Class, Sex, Age
• Label : Survived?
o This is a knowledge exercise
o Classify survival from the Titanic data
o Gives us a quick dataset to run & test classification
iPython notebook at https://github.com/xsankar/cloaked-ironman
75. Classification - Spark API
o Logistic Regression
o SVMWithSGD
o DecisionTrees
o Data as LabeledPoint (we will see in a moment)
o DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
o Impurity – “entropy” or “gini”
o maxBins = control to throttle communication at the expense of accuracy
• Larger = higher accuracy
• Smaller = less communication (as the # of bins approaches the number of instances)
o Data-adaptive – i.e. the decision tree samples on the driver and figures out the bin spacing, i.e. the places you slice for binning
o An intelligent framework – you need this for scale
76. Look out for these interesting Spark features
o Concept of LabeledPoint & how to create an RDD of LPs
o Print the tree
o Calculate Accuracy & MSE from RDDs
77. Read data & extract features
iPython notebook at https://github.com/xsankar/cloaked-ironman
82. Decision Tree – Best Practices
maxDepth | Tune with data/model selection
maxBins | Set low, monitor communications, increase if needed
# RDD partitions | Set to # of cores
• Usually the recommendation is that RDD partitions should be over-partitioned, i.e. “more partitions than cores”: because tasks take different times, we need to utilize the compute power, and in the end they average out
• But for machine learning, especially trees, all tasks are approximately equally computationally intensive, so over-partitioning doesn’t help
• Joe Bradley’s talk (reference below) has interesting insights
https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup
DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
83. Future …
o Actually we should split the data into training & test sets
o Then use different feature sets to see if we can increase the accuracy
o Left as homework
o In 1.2 …
o Random Forest
• Bagging
• PR for Random Forest
o Boosting
o Alpine Lab Sequoia Forest : coordinating merge
o Model Selection Pipeline ; Design Doc
84. Boosting
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
◦ “Output of weak classifiers into a powerful committee”
◦ Final prediction = weighted majority vote
◦ Later classifiers get the misclassified points with higher weight, so they are forced to concentrate on them
◦ AdaBoost (Adaptive Boosting)
◦ Boosting vs Bagging
– Bagging – independent trees <- Spark shines here
– Boosting – successively weighted
85. Random Forests+
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
◦ Builds a large collection of de-correlated trees & averages them
◦ Improves Bagging by selecting i.i.d* random variables for splitting
◦ Simpler to train & tune
◦ “Do remarkably well, with very little tuning required” – ESLII
◦ Less susceptible to overfitting (than boosting)
◦ Many RF implementations
– Original version - Fortran-77 ! By Breiman/Cutler
– Python, R, Mahout, Weka, Milk (ML toolkit for py), matlab
* i.i.d – independent identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
86. Ensemble Methods
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
◦ Two steps
– Develop a set of learners
– Combine the results to develop a composite predictor
◦ Ensemble methods can take the form of:
– Using different algorithms
– Using the same algorithm with different settings
– Assigning different parts of the dataset to different classifiers
◦ Bagging & Random Forests are examples of ensemble methods
Ref: Machine Learning In Action
87. Random Forests
o While Boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables
o Simpler because it requires only two variables – no. of predictors (typically √k) & no. of trees (500 for large datasets, 150 for smaller)
o Error prediction
• For each iteration, predict for the dataset that is not in the sample (OOB data)
• Aggregate the OOB predictions
• Calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate
• Can use this to search for the optimal # of predictors
• We will see how close this is to the actual error in the Heritage Health Prize
o Assumes equal cost for mis-prediction. Can add a cost function
o Proximity matrix & applications like adding missing data, dropping outliers
Ref: R News Vol 2/3, Dec 2002
Statistical Learning from a Regression Perspective : Berk
A Brief Overview of RF by Dan Steinberg
88. Why didn’t RF do better ? Bias/Variance
o High Bias
• Due to underfitting
• Add more features
• More sophisticated model
• Quadratic terms, complex equations, …
• Decrease regularization
o High Variance
• Due to overfitting
• Use fewer features
• Use a larger training sample
• Increase regularization
[Learning curve figure: prediction error vs. training error. High bias: need more features or a more complex model to improve. High variance: need more data to improve.]
Ref: Strata 2013 Tutorial by Olivier Grisel
'Bias is a learner’s tendency to consistently learn the same wrong thing.' -- Pedro Domingos
http://www.slideshare.net/ksankar/data-science-folk-knowledge
91. Data Science “folk knowledge” (3 of A)
o More data beats a cleverer algorithm
• Or conversely, select algorithms that improve with data
• Don’t optimize prematurely without getting more data
o Learn many models, not just one
• Ensembles ! – Change the hypothesis space
• Netflix prize
• E.g. Bagging, Boosting, Stacking
o Simplicity does not necessarily imply accuracy
o Representable does not imply learnable
• Just because a function can be represented does not mean it can be learned
o Correlation does not imply causation
o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o A few useful things to know about machine learning - by Pedro Domingos
§ http://dl.acm.org/citation.cfm?id=2347755
92. Scenario – Clustering with Spark
o InterGallactic Airlines have the GallacticHoppers frequent-flyer program & have data about the customers who participate in the program
o The airline execs have a feeling that other airlines will poach their customers if they do not keep their loyal customers happy
o So the business wants to customize promotions for the frequent-flyer program
o Can they just have one type of promotion ?
o Should they have different types of incentives ?
o Who exactly are the customers in the GallacticHoppers program ?
o Recently they have deployed an infrastructure with Spark
o Can Spark help with this business problem ?
93. Clustering - Theory
o Clustering is unsupervised learning
o While computers can dissect a dataset into “similar” clusters, it still needs human direction & domain knowledge to interpret & guide
o Two types:
• Centroid-based clustering – k-means clustering
• Tree-based clustering – hierarchical clustering
o Spark implements Scalable K-Means++
• Paper : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
94. Look out for these interesting Spark features
o Application of the Statistics toolbox
o Center & scale RDDs
o Filter RDDs
95. Clustering - API
o from pyspark.mllib.clustering import KMeans
o KMeans.train
o train(cls, data, k, maxIterations=100, runs=1, initializationMode="k-means||")
o k = number of clusters to create, default = 2
o initializationMode = the initialization algorithm. This can be either "random", to choose random points as initial cluster centers, or "k-means||", to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: "k-means||"
o KMeansModel.predict
o Maps a point to a cluster
103. Let us map the clusters to our data
104. Interpretation
C# | AVG | Interpretation
1 | |
2 | |
3 | |
4 | |
5 | |
Note :
• This is just a sample interpretation
• In real life we would “noodle” over the clusters & tweak them to be useful, interpretable and distinguishable
• Maybe 3 is more suited to create targeted promotions
105. Epilogue
o KMeans in Spark has enough controls
o It does a decent job
o We were able to control the clusters based on our experience (2 clusters is too low, 10 is too high, 5 seems to be right)
o We can see that Scalable KMeans has control over runs, parallelism et al. (Homework : explore the scalability)
o We were able to interpret the results with domain knowledge and arrive at a scheme to solve the business opportunity
o Naturally we would tweak the clusters to fit business viability. 20 clusters with corresponding promotion schemes are unwieldy, even if the WSSE is the minimum.
107. Recommendation & Personalization - SparkRecommendation & Personalization - Spark
Automated Analytics-‐ Let Data tell story
Feature Learning, AI, Deep Learning
Learning Models -‐ fit parameters
as it gets more data
Dynamic Models – model
selection based on context
o Knowledge Based
o Demographic Based
o Content Based
o Collaborative Filtering
o Item Based
o User Based
o Latent Factor based
o User Rating
o Purchased
o Looked/Not purchased
Spark implements the user-based ALS collaborative filtering.
Ref:
ALS - Collaborative Filtering for Implicit Feedback Datasets; Yifan Hu, AT&T Labs, Florham Park, NJ; Koren, Y.; Volinsky, C.
ALS-WR - Large-Scale Parallel Collaborative Filtering for the Netflix Prize; Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, Rong Pan
108. Spark Collaborative Filtering API
o ALS.train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1)
o ALS.trainImplicit(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1,
alpha=0.01)
o MatrixFactorizationModel.predict(self, user, product)
o MatrixFactorizationModel.predictAll(self, usersProducts)
112. Epilogue
o We explored interesting APIs in Spark
o ALS - Collaborative Filtering
o RDD Operations
• Join (HashJoin)
• In-memory, Grace, Recursive hash join
http://technet.microsoft.com/en-us/library/ms189313(v=sql.105).aspx
114. Reference
1. SF Scala & SF Bay Area Machine Learning, Joseph Bradley: Decision Trees on Spark
   http://functional.tv/post/98342564544/sfscala-sfbaml-joseph-bradley-decision-trees-on-spark
2. http://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering
3. http://stats.stackexchange.com/questions/19216/variables-are-often-adjusted-e-g-standardised-before-making-a-model-when-is
4. http://funny-pictures.picphotos.net/tongue-out-smiley-face/smile-day.net*wp-content*uploads*2012*01*Tongue-Out-Smiley-Face1.jpg/
5. https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup
6. http://www.rosebt.com/1/post/2011/10/big-data-analytics-maturity-model.html
7. http://blogs.gartner.com/matthew-davis/
115. Essential Reading List
o A few useful things to know about machine learning - by Pedro Domingos
• http://dl.acm.org/citation.cfm?id=2347755
o The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert
• http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o http://www.no-free-lunch.org/
o Controlling the false discovery rate: a practical and powerful approach to multiple testing, Benjamini, Y. and Hochberg, Y.
• http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
• http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o Avoid these three mistakes, James Faghmous
• https://medium.com/about-data/73258b3848a4
o Leakage in Data Mining: Formulation, Detection, and Avoidance
• http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
116. For your reading & viewing pleasure … An ordered List
① An Introduction to Statistical Learning
• http://www-bcf.usc.edu/~gareth/ISL/
② ISL Class Stanford/Hastie/Tibshirani at their best - Statistical Learning
• http://online.stanford.edu/course/statistical-learning-winter-2014
③ Prof. Pedro Domingos
• https://class.coursera.org/machlearning-001/lecture/preview
④ Prof. Andrew Ng
• https://class.coursera.org/ml-003/lecture/preview
⑤ Prof. Abu-Mostafa, CaltechX: CS1156x: Learning From Data
• https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥ Mathematicalmonk @ YouTube
• https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦ The Elements Of Statistical Learning
• http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
117. References:
o An Introduction to scikit-learn, PyCon 2013, Jake Vanderplas
• http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning
o Advanced Machine Learning with scikit-learn, PyCon 2013, Strata 2014, Olivier Grisel
• http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn
o Just The Basics, Strata 2013, William Cukierski & Ben Hamner
• http://strataconf.com/strata2013/public/schedule/detail/27291
o The Problem of Multiple Testing
• http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf
o Thanks to Ana Crisan for the Titanic inset. Picture courtesy http://emileeid.com/2012/02/11/titanic-3d-exclusive-posters/
118. The Beginning As The End
How did we do ?
4:45