Data Science with Spark

Slides for the Spark workshop - github https://github.com/xsankar/global-bd-conf
1. In Apache Spark: Foundations of Data Science with Spark
July 16, 2015
@ksankar // doubleclix.wordpress.com
2. www.globalbigdataconference.com
Twitter: @bigdataconf
3. Agenda: Introduction To Spark
http://globalbigdataconference.com/52/santa-clara/big-data-developer-conference/schedule.html
o Intro & Setup [8:00-8:20)
• Goals/non-goals
o Spark & Data Science DevOps [8:20-8:40)
• Spark in the context of Data Science
o Where exactly is Apache Spark headed? [8:40-9:30)
• Spark Yesterday, Today & Tomorrow
• Spark Stack
o Break [9:30-10:00)
o DataFrames for the Data Scientist [10:00-11:30)
• pySpark Classes
• Walkthru DataFrames
• Hands-on Notebooks
o [15] Discussions/Slack (11:30-11:45)
4. Agenda: Data Wrangling w/ DataFrames & MLlib
http://globalbigdataconference.com/52/santa-clara/big-data-developer-conference/schedule.html
o Review (2:00-2:30)
• 004-Orders-Homework-Solution
• MLlib Statistical Toolbox
• Summary, Correlations
o [20] Linear Regression (2:30-2:45)
o [20] "Mood Of the Union" (2:45-3:15)
• State of the Union w/ Washington, Lincoln, FDR, JFK, Clinton, Bush & Obama
• Map reduce, parse text
o Break (3:15-3:30)
o [60] Predicting Survivors with Classification (3:30-4:30)
• Decision Trees
• NaiveBayes (Titanic data set)
o Break (4:30-4:45)
o [20] Clustering (4:45-5:05)
• K-means for GallacticHoppers!
o [20] Recommendation Engine (5:05-5:25)
• Collab Filtering w/ MovieLens
o [15] Discussions/Slack (5:45-6:00)
5. Goals & non-goals
Goals
¤ Understand how to program Machine Learning with Spark & Python
¤ Focus on programming & ML application
¤ Give you a focused time to work thru examples
§ Work with me. I will wait if you want to catch up
¤ Less theory, more usage - let us see if this works
¤ As straightforward as possible
§ The programs can be optimized
Non-goals
¡ Go deep into the algorithms
• We don't have sufficient time. The topic could easily be a 5-day tutorial!
¡ Dive into Spark internals
• That is for another day
¡ The underlying computation, communication, constraints & distribution is a fascinating subject
• Paco does a good job explaining them
¡ A passive talk
• Nope. Interactive & hands-on
6. About Me
o Data Scientist
• Decision Data Science & Product Data Science
• Insights = Intelligence + Inference + Interface [https://goo.gl/s2KB6L]
• Predicting NFL with Elo like Nate Silver & 538 [NFL: http://goo.gl/Q2OgeJ, NBA'15: https://goo.gl/aUhdo3]
o Have been speaking at OSCON [http://goo.gl/1MJLu], PyCon, PyData [http://vimeo.com/63270513, http://www.slideshare.net/ksankar/pydata-19] …
o Full-day Spark workshop "Advanced Data Science w/ Spark" / Spark Summit-E'15 [https://goo.gl/7SBKTC]
o Co-author: "Fast Data Processing with Spark", Packt Publishing [http://goo.gl/eNtXpT]
o Reviewer: "Machine Learning with Spark", Packt Publishing
o Have done lots of things:
• Big Data (Retail, Bioinformatics, Financial, AdTech), starting MS-CFRM, University of WA
• Written books (Web 2.0, Wireless, Java, …), standards, some work in AI
• Guest Lecturer at Naval PG School, …
o Volunteer as Robotics Judge at FIRST Lego League World Competitions
o @ksankar, doubleclix.wordpress.com
7. Close Encounters
— 1st
◦ This Tutorial
— 2nd
◦ Do More Hands-on Walkthroughs
— 3rd
◦ Listen To Lectures
◦ More competitions …
8. Spark Installation
o Install Spark 1.4.1 on your local machine
• https://spark.apache.org/downloads.html
• Pre-built For Hadoop 2.6 is fine
• Download & uncompress
• Remember the path & use it wherever you see /usr/local/spark/
• I have downloaded it in /usr/local & have a softlink spark to the latest version
o Install iPython
9. Tutorial Materials
o Github: https://github.com/xsankar/global-bd-conf
• Clone or download zip
o Open a terminal
o cd ~/global-bd-conf
o IPYTHON=1 IPYTHON_OPTS="notebook" /usr/local/spark/bin/pyspark --packages com.databricks:spark-csv_2.11:1.0.3
o Notes:
• I have a soft link "spark" in my /usr/local that points to the Spark version that I use. For example: ln -s spark-1.4.1/ spark
o Click on the iPython dashboard
o Run 000-PreFlightCheck.ipynb
o Run 001-TestSparkCSV.ipynb
o Now you are ready for the workshop!
10. Spark & Data Science DevOps
8:20
11. Spark in the context of data science
12. Data Science: the art of building a model with known knowns, which when let loose, works with unknown unknowns!
Donald Rumsfeld is an armchair Data Scientist! http://smartorg.com/2013/07/valuepoint19/
The World vs. You - Knowns & Unknowns:
o Known Knowns
• There are things we know that we know
• What we do
o Known Unknowns
• That is to say, there are things that we now know we don't know
• Potential facts, outcomes we are aware of, but not with certainty
• Stochastic processes, probabilities
o Unknown Knowns
• Others know, you don't
o Unknown Unknowns
• There are things we do not know we don't know
• Facts, outcomes or scenarios we have not encountered, nor considered
• "Black swans", outliers, long tails of probability distributions
• Lack of experience, imagination
13. The curious case of the Data Scientist
o Data Scientist is multi-faceted & contextual
o Data Scientist should be building Data Products
o Data Scientist should tell a story
http://doubleclix.wordpress.com/2014/01/25/the-curious-case-of-the-data-scientist-profession/
"Large is hard; Infinite is much easier!" - Titus Brown
14. Data Science - Context
Data Management - Collect, Store, Transform:
o Volume, velocity, streaming data
o Canonical form, data catalog
o Data fabric across the organization; access to multiple sources of data
o Think hybrid - Big Data apps, appliances & infrastructure
o Metadata; monitor counters & metrics
o Structured vs. multi-structured
Data Science - Reason, Model, Deploy:
o Flexible & selectable data subsets & attribute sets
o Refine the model with extended data subsets & engineered attribute sets; validation run across a larger data set
o Scalable model deployment
o Big Data automation & purpose-built appliances (soft/hard)
o Manage SLAs & response times
Explore, Visualize, Recommend, Predict:
o Dynamic data sets; 2-way key-value tagging of datasets; extended attribute sets; advanced analytics
o Performance, scalability, refresh latency, in-memory analytics
o Advanced visualization, interactive dashboards, map overlay, infographics
¤ Bytes to Business a.k.a. build the full stack
¤ Find relevant data for business
¤ Connect the dots
15. Data Science - Context
o Volume, Velocity, Variety - "data of unusual size" that can't be brute-forced
o Context, Connectedness
o Three Amigos: Intelligence, Inference, Interface
• Interface = Cognition
• Intelligence = Compute (CPU) & Computational (GPU)
• Infer significance & causality
16. Day in the life of a (super) Model
o Intelligence: Algorithms, Model Selection, Reason & Learn, Models
o Inference: Model Assessment, Data (Scoring)
o Data Representation: Attributes, Feature Selection, Dimensionality Reduction, Parameters
o Interface: Visualize, Recommend, Explore
17. Data Science Maturity Model & Spark
Stages: Isolated Analytics → Integrated Analytics → Aggregated Analytics → Automated Analytics
o Data: Small Data → Larger Data set → Big Data → Big Data Factory Model
o Context: Local → Domain → Cross-domain + External → Cross-domain + External
o Model, Reason & Deploy:
• Isolated: single set of boxes, usually owned by the model builders; departmental
• Integrated: deploy on central analytics infrastructure; models still owned & operated by modelers; partly enterprise-wide
• Aggregated: central analytics infrastructure; model & reason by model builders; deploy & operate by ops; residuals and other metrics monitored by modelers; enterprise-wide
• Automated: distributed analytics infrastructure; AI-augmented models; model & reason by model builders; deploy & operate by ops; data as a monetized service, extending to ecosystem partners
o Reports → Dashboards → Dashboards + some APIs → Dashboards + well-defined APIs + programming models
o Type: Descriptive & Reactive → + Predictive → + Adaptive → Adaptive
o Datasets: all in the same box → fixed data sets, usually in temp data spaces → flexible data & attribute sets → dynamic datasets with well-defined refresh policies
o Workload: skunk works → business-relevant apps with approx. SLAs → high-performance appliance clusters → appliances and clusters for multiple workloads including real-time apps; infrastructure for emerging technologies
o Strategy: informal definitions; data definitions buried in the analytics models → some data definitions → data catalogue, metadata & annotations → Big Data MDM strategy
18. The Sense & Sensibility of a DataScientist DevOps
Factory = Operational; Lab = Investigative
http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/
19. Where exactly is Apache Spark headed?
Spark Yesterday, Today & Tomorrow …
"Unified engine across diverse data sources, workloads & environments"
8:40
20. Spark Yesterday, Today & Tomorrow …
http://free-stock-illustration.com/winding+road+clip+art
Spark 1.x
• Fast engine for big data processing
• Fast to run code & fast to write code
• In-memory computation graphs, compatibility with the Hadoop eco system, and very usable APIs
• Iterative & interactive apps that operate on data multiple times, which are not a good use case for Hadoop
Spark 1.3 & beyond has been the catalyst for a renaissance in Data Science!
Spark 1.4+
• Multi-pass analytics - ML pipelines, GraphX
• Ad-hoc queries - DataFrames
• Real-time stream processing - Spark Streaming
• Parallel machine learning algorithms beyond the basic RDDs
• More types of data sources as input & output
• More integration with R to span statistical computing beyond "single-node tools"
• More integration with apps like visualization dashboards
• More performance with even larger datasets & complex applications - Project Tungsten
21. Spark Directions
o Data Science
• DataFrames
• ML Pipelines
• SparkR
o Platform APIs
• Growing the eco system
§ Data Sources - uniform access to diverse data sources
§ Pluggable "smart" DataSource API for reading/writing DataFrames while minimizing I/O
§ Spark Packages
§ Deployment utilities for Google Compute, Azure & Job Server
o Streaming, DAG Visualization & Debugging
• Spark Streaming flow control & optimized state management
o Execution Optimization (Project Tungsten)
• Focus on CPU efficiency
• Run-time code generation
• Cache locality & cache-aware data structures
• Binary format for aggregations
• Spark-managed memory; off-heap memory management
22. Spark - The (simple) Stack
23. RDD - The workhorse of Core Spark
o Resilient Distributed Datasets
• Collection that can be operated on in parallel
o Transformations - create RDDs
• map, filter, …
o Actions - get values
• collect, take, …
o We will apply these operations during this tutorial
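A minimal sketch of the transformation/action split, assuming the `sc` SparkContext that the pyspark shell provides:

    nums = sc.parallelize(range(1, 11))                   # build an RDD from a local collection
    squares = nums.map(lambda x: x * x)                   # transformation - lazy, returns a new RDD
    even_squares = squares.filter(lambda x: x % 2 == 0)   # another transformation, still lazy
    print(even_squares.collect())                         # action - [4, 16, 36, 64, 100]
    print(even_squares.take(2))                           # action - first two elements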
24. The Spark stack
o Languages: R, Scala, Java, Python
o DataFrame API; Spark SQL, Spark Streaming, SparkR, MLlib, GraphX, Packages
o ML Pipelines; Advanced Analytics: Neural Networks, Deep Learning, Parameter Server
o Catalyst Optimizer - optimizes the execution plan
o Data Sources - Parquet, Hadoop, Cassandra, JSON, CSV, JDBC, …
o Tungsten Execution; RDD; Spark Core
25. Query Optimization-Execution pipeline (Ref: Spark SQL paper)
SQL Query / DataFrame → Unresolved Logical Plan → [Analysis, using the Catalog] → Logical Plan → [Logical Optimization] → Optimized Logical Plan → [Physical Planning] → Physical Plans → [Cost Model] → Selected Physical Plan → [Code Generation] → RDDs
26. Spark DataFrames for the Data Scientist
"A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with." - From The Hitchhiker's Guide to the Galaxy, by Douglas Adams
DataFrames! The most massively useful thing a Data Scientist can have …
10:00
27. Data Science "folk knowledge" (1 of A)
o "If you torture the data long enough, it will confess to anything." - Hal Varian, Computer Mediated Transactions
o Learning = Representation + Evaluation + Optimization
o It's generalization that counts
• The fundamental goal of machine learning is to generalize beyond the examples in the training set
o Data alone is not enough
• Induction, not deduction - every learner should embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it
o Machine learning is not magic - one cannot get something from nothing
• In order to infer, one needs the knobs & the dials
• One also needs a rich, expressive dataset
"A Few Useful Things to Know About Machine Learning" - by Pedro Domingos: http://dl.acm.org/citation.cfm?id=2347755
28. pyspark classes
o pyspark: SparkContext(), SparkConf(), RDD(), Broadcast(), Accumulator(), SparkFiles(), StorageLevel()
o pyspark.streaming: StreamingContext(), DStream(), kafka (e.g. kafka.Broker()), …
o pyspark.sql: SQLContext(), DataFrame(), DataFrameNaFunctions(), DataFrameStatFunctions(), DataFrameReader(), DataFrameWriter(), Column(), Row(), functions(), types(), Window(), WindowSpec(), GroupedData(), HiveContext()
o pyspark.mllib: classification, clustering, evaluation, feature, fpm, linalg, random, recommendation, regression, stat, tree, util
o pyspark.ml (ML Pipeline APIs): Transformer, Estimator, Model, Pipeline, PipelineModel, param, feature, classification, recommendation, regression, tuning, evaluation
29. 1. SparkContext()
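Creating the context by hand looks roughly like this (the pyspark shell already provides `sc` for you; the app name is arbitrary):

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    conf = SparkConf().setAppName("global-bd-conf").setMaster("local[*]")
    sc = SparkContext(conf=conf)   # entry point for RDDs
    sqlContext = SQLContext(sc)    # entry point for DataFrames in Spark 1.4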
30. 2. Read/Write
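A hedged sketch of DataFrame read/write in Spark 1.4, using the spark-csv package loaded in the setup slide (the file paths are hypothetical, and spark-csv 1.0.3 reads all columns as strings, so cast as needed):

    df_cars = sqlContext.read.format("com.databricks.spark.csv") \
        .options(header="true") \
        .load("car-data/car-milage.csv")                 # hypothetical path
    df_cars = df_cars.withColumn("mpg", df_cars.mpg.cast("double"))  # cast string columns
    df_cars.write.format("json").save("car-milage-json")             # write back out as JSON
    df_back = sqlContext.read.json("car-milage-json")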
31. 3. Convert
o pyspark.sql.DataFrame ↔ table:
• sqlContext.registerDataFrameAsTable(df, "aTable")
• df2 = sqlContext.table("aTable")
o pyspark.sql.DataFrame ↔ pandas.DataFrame:
• df = sqlContext.createDataFrame(pandas_df)
• p_df = df.toPandas()
32. 4. Columns & Rows (1 of 3)
o Select a column
• by the df("<columnName>") notation or the df.<columnName> notation
• The recommended way is df("<columnName>"), the reason being that a column name can collide with a DataFrame method if we use df.<columnName>
o Column-wise operations like +, -, *, /, % (modulo), &&, ||, <, <=, > and >=
• df("total") = df("price") * df("qty")
• The inequality operator is !==, the usual equality operator is ===, and <=> is an equality test that is safe for null values
o Meta operations - type conversion (cast), alias, not null, …
• df_cars.mpg.cast("double").alias('mpg')
o Run arbitrary udfs on a column (see next page)
33. 4. Columns & Rows (2 of 3)
Run arbitrary udfs on a column
34. 4. Columns & Rows (3 of 3)
Interesting operations … adding a column … (see the sketch below)
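A small sketch of a udf and of adding columns, assuming the df_cars DataFrame with a numeric mpg column (the bucketing logic and column names are made up for illustration):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    mpg_class = udf(lambda mpg: "high" if mpg > 25.0 else "low", StringType())  # arbitrary Python logic
    df_cars = df_cars.withColumn("mpgClass", mpg_class(df_cars.mpg))  # add a derived column via udf
    df_cars = df_cars.withColumn("kpl", df_cars.mpg * 0.425)          # plain column arithmetic
    df_cars.select("mpg", "mpgClass", "kpl").show(5)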
35. 5. DataFrame: RDD-like Operations
o df.sort(<sort expression>) or df.orderBy(<sort expression>)
• Returns a sorted DataFrame. There are multiple ways of specifying the sort expression. Use of orderBy is recommended (as the syntax is closer to SQL), for example:
df_orders_1.groupBy("CustomerID", "Year").count().orderBy('count', ascending=False).show()
o df.filter(<condition>) or df.where(<condition>)
• Returns a new DataFrame after applying the <condition>. The condition is usually based on a column. Use of the where form is recommended (as the syntax is closer to the SQL world), for example:
df_orders.where(df_orders['ShipCountry'] == 'France')
o df.coalesce(n)
• Returns a DataFrame with n partitions, same as the coalesce(n) method of RDD
o df.foreach(<function>)
• Applies a function to all the rows of a DataFrame
o df.map(lambda r: …)
• Applies the function to all the rows and returns the resulting object
o df.flatMap(lambda r: …)
• Returns an RDD, flattened, after applying the function to all the rows of the DataFrame
o df.rdd()
• Returns the DataFrame as an RDD of Row objects
o df.na.replace([<list of values to be replaced>], [list of replacing values], subset=[list of columns]) or DataFrame.replace() or DataFrameNaFunctions.replace()
• An interesting function, very useful and a little strange syntax-wise. The recommended form is df.na.replace(), even though the .na namespace throws one a little bit. Use subset= for column names. The syntax is different from the Scala syntax.
36. 6. DataFrame: Actions
o cache()
o collect(), collectAsList()
o count()
o describe()
o first(), head(), show(), take()
o …
37. 7. DataFrame: Scientific Functions
38. 8. DataFrame: Statistical Functions
The pair-wise frequency (contingency table) of transmission type and number of speeds shows interesting observations:
• All automatic cars in the dataset are 3-speed, while most of the manual-transmission cars have 4 or 5 speeds
• Almost all the manual cars have 2 barrels, while the automatic cars have 2 and 4 barrels
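The contingency table above comes from DataFrameStatFunctions; a sketch assuming mtcars-style column names am (transmission) and gear:

    df_cars.stat.crosstab("am", "gear").show()   # pair-wise frequency / contingency table
    df_cars.describe("mpg", "hp").show()         # count, mean, stddev, min, max per column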
39. 9. DataFrame: Aggregate Functions
o The pyspark.sql.functions class (and org.apache.spark.sql.functions for Scala) contains the aggregation functions
o There are two types of aggregations: one on whole column values and the other on subsets of column values, i.e. grouped by the values of some other columns
• pyspark.sql.functions.avg("sales")
• df.groupBy("year").agg({"sales": "avg"})
o count(), countDistinct()
o first(), last()
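A runnable version of the two aggregation styles (the year/sales/product column names are hypothetical):

    from pyspark.sql import functions as F

    df.agg(F.avg("sales"), F.countDistinct("product")).show()   # aggregate over whole columns
    df.groupBy("year").agg({"sales": "avg"}).show()             # grouped aggregation, dict form
    df.groupBy("year").agg(F.avg("sales"), F.first("product")).show()  # grouped, function form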
40. 10. DataFrame: na
o One of the tenets of big data and data science is that data is never fully clean - while we can handle types, formats et al., missing values are always challenging
o One easy solution is to drop the rows that have missing values, but then we would lose valuable data in the columns that do have values
o A better solution is to impute data based on some criteria. It is true that data cannot be created out of thin air, but data can be inferred with some success - it is better than dropping the rows
• We can replace null with 0
• A better solution is to replace numerical values with the average of the rest of the valid values; for categorical values, replacing with the most common value is a good strategy
• We could use the mode or median instead of the mean
• Another good strategy is to infer the missing value from other attributes, i.e. "evidence from multiple fields"
§ For example, the Titanic data has a name field; to impute a missing age, we could use the Mr., Master, Mrs., Miss designation from the name and then fill in the average age for the corresponding designation. So a row with a missing age and "Master." in the name would get the average age of all records with "Master."
§ There are also fields for number of siblings and number of spouses; we could average the age based on the values of those fields
§ We could even average the ages from the different strategies
41. 10. DataFrame: na (continued)
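The strategies above map onto the na API roughly like this (the imputation values and Titanic-style column names are made up for illustration):

    df1 = df.na.drop()                                  # drop rows with any missing value
    df2 = df.na.fill(0)                                 # replace nulls in numeric columns with 0
    df3 = df.na.fill({"Age": 29.7, "Embarked": "S"})    # per-column imputation (e.g. mean / most common)
    df4 = df.na.replace(["?"], ["unknown"], subset=["Cabin"])  # normalize sentinel values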
42. 11. Joins/Set Operations a.k.a. Language Integrated Queries
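A hedged sketch of the language-integrated joins and set operations (the DataFrame and column names are hypothetical):

    inner = df_orders.join(df_customers,
                           df_orders.CustomerID == df_customers.CustomerID, "inner")
    left = df_orders.join(df_returns,
                          df_orders.OrderID == df_returns.OrderID, "left_outer")
    all_rows = df_2014.unionAll(df_2015)    # set union (keeps duplicates)
    common = df_2014.intersect(df_2015)     # set intersection
    only_14 = df_2014.subtract(df_2015)     # set difference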
43. 12. SQL on Tables
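A minimal sketch of registering a DataFrame as a table and querying it with SQL (table and column names assumed):

    sqlContext.registerDataFrameAsTable(df_orders, "orders")
    top10 = sqlContext.sql(
        "SELECT CustomerID, COUNT(*) AS cnt FROM orders "
        "GROUP BY CustomerID ORDER BY cnt DESC LIMIT 10")
    top10.show()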
44. Hands-On
o 003-DataFrame-For-DS
• Understand and run the iPython notebook
o 004-Orders
• Homework - we will go thru the solution when we meet in the afternoon
45. Data Wrangling with Spark
2:00
46. Algorithm spectrum (from Machine Learning through Cute Math to Artificial Intelligence)
o Machine Learning: Regression, Logit, CART, Ensemble: Random Forest, Clustering, KNN, Genetic Alg, Simulated Annealing
o Cute Math: Collab Filtering, SVM, Kernels, SVD
o Artificial Intelligence: NNet, Boltzmann Machine, Feature Learning
47. Statistical Toolbox
o Sample data: car mileage data
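The MLlib statistical toolbox applied to car-mileage-style data would look roughly like this (column names assumed, and assumed already cast to numbers):

    from pyspark.mllib.stat import Statistics

    rows = df_cars.map(lambda r: [r.mpg, r.hp, r.weight])   # DataFrame rows -> RDD of numeric vectors
    summary = Statistics.colStats(rows)                     # column-wise summary statistics
    print(summary.mean(), summary.variance(), summary.numNonzeros())
    print(Statistics.corr(rows, method="pearson"))          # correlation matrix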
48. Linear Regression
2:30
49. Linear Regression - API
o LabeledPoint - the features and label of a data point
o LinearModel - weights, intercept
o LinearRegressionModelBase - predict()
o LinearRegressionModel
o LinearRegressionWithSGD - train(cls, data, iterations=100, step=1.0, miniBatchFraction=1.0, initialWeights=None, regParam=1.0, regType=None, intercept=False)
o LassoModel - least-squares fit with an L1 penalty term
o LassoWithSGD - train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)
o RidgeRegressionModel - least-squares fit with an L2 penalty term
o RidgeRegressionWithSGD - train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)
50. Basic Linear Regression
51. Use the LR model for prediction & calculate MSE
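A condensed sketch of this flow (the feature choice and step size are illustrative - as the next slide warns, a large step can make SGD diverge):

    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

    data = df_cars.map(lambda r: LabeledPoint(r.mpg, [r.hp, r.weight]))  # assumed columns
    model = LinearRegressionWithSGD.train(data, iterations=100, step=0.0001)
    preds = data.map(lambda p: (p.label, model.predict(p.features)))
    mse = preds.map(lambda lp: (lp[0] - lp[1]) ** 2).mean()
    print("MSE = %f" % mse)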
52. Step size is important - the model can diverge!
53. Interesting step size
54. "Mood Of the Union" with TF-IDF
2:45
55. Scenario - Mood Of the Union
o It has been said that the State of the Union speech by the President of the USA reflects the social challenges faced by the country
o If so, can we infer the mood of the country by analyzing SOTU?
o If we embark on this line of thought, how would we do it with Spark & Python?
o Is it different from Hadoop MapReduce?
o Is it better?
56. POA (Plan Of Action)
o Collect the State of the Union speeches by George Washington, Abe Lincoln, FDR, JFK, Bill Clinton, GW Bush & Barack Obama
o Read the 7 SOTUs from the 7 presidents into 7 RDDs
o Create word vectors
o Transform into word-frequency vectors
o Remove stock common words
o Inspect the top n words to see if they reflect the sentiment of the time
o Compute set differences and see how new words have cropped up
o Compute TF-IDF (homework!)
57. Look out for these interesting Spark features
o RDD map-reduce
o How to parse input
o Removing common words
o Sorting an RDD by value
58. Read & Create word vector
iPython notebook at https://github.com/xsankar/cloaked-ironman
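The core word-frequency pipeline is a couple of lines (the file path is hypothetical):

    lines = sc.textFile("sotu/1961-Kennedy.txt")          # hypothetical path
    words = lines.flatMap(lambda line: line.split()) \
                 .map(lambda w: w.strip().lower()) \
                 .filter(lambda w: w != "")
    freq = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)  # map-reduce word count
    print(freq.sortBy(lambda kv: kv[1], ascending=False).take(25))      # sort an RDD by value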
59. Remove Common Words - 1 of 3
iPython notebook at https://github.com/xsankar/cloaked-ironman
60. Remove Common Words - 2 of 3
61. Remove Common Words - 3 of 3
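The idea reduces to filtering against a stop list and, for comparing speeches, subtractByKey (the stop list here is abbreviated and the RDD names are hypothetical):

    stop_words = set(["the", "and", "of", "to", "a", "in", "that", "we", "is", "for"])
    content = freq.filter(lambda kv: kv[0] not in stop_words)   # drop stock common words
    # set difference: words FDR used that Obama did not
    fdr_only = freq_fdr.subtractByKey(freq_obama)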
62. FDR vs. Barack Obama as reflected by SOTU
63. Barack Obama vs. Bill Clinton
64. GWB vs. Abe Lincoln as reflected by SOTU
65. Epilogue
o Interesting exercise
o Highlights
• Map-reduce in a couple of lines!
• But it is not exactly the same as Hadoop MapReduce (see the excellent blog by Sean Owen: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/)
• Set differences using subtractByKey
• Ability to sort a map by values (or any arbitrary function, for that matter)
o To explore as homework:
• TF-IDF in http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
66. Break
3:15
67. Predicting Survivors with Classification
3:30
68. Data Science "folk knowledge" (Wisdom of Kaggle) - Jeremy's Axioms
o Iteratively explore data
o Tools
• Excel format, Perl, Perl book, Spark!
o Get your head around data
• Pivot table
o Don't over-complicate
o If people give you data, don't assume that you need to use all of it
o Look at pictures!
o History of your submissions - keep a tab
o Don't be afraid to submit simple solutions
• We will do this during this workshop
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
69. Classification - Scenario
Titanic Passenger Metadata
• Small
• 3 predictors: Class, Sex, Age
• Survived?
o This is a knowledge exercise
o Classify survival from the Titanic data
o Gives us a quick dataset to run & test classification
iPython notebook at https://github.com/xsankar/cloaked-ironman
70. Classifying Classifiers
o Statistical: Regression, Logistic Regression (Max Entropy Classifier), Naïve Bayes, Bayesian Networks
o Structural:
• Rule-based: Production Rules, Decision Trees
• Distance-based
§ Functional: Linear, Spectral, Wavelet
§ Nearest Neighbor: kNN, Learning Vector Quantization
• Neural Networks: Multi-layer Perceptron
o Ensemble: Random Forests, Boosting; SVM
Ref: Algorithms of the Intelligent Web, Marmanis & Babenko
71. Classifiers (overview diagram)
o Regression - continuous variables
o Decision Trees, k-NN (Nearest Neighbors) - categorical variables
o Trade-offs: bias vs. variance, model complexity, over-fitting
o Bagging, Boosting, CART
72. Classification - Spark API
o Logistic Regression
o SVMWithSGD
o DecisionTrees
o Data as LabeledPoint (we will see in a moment)
o DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
o Impurity - "entropy" or "gini"
o maxBins = control to throttle communication at the expense of accuracy
• Larger = higher accuracy
• Smaller = less communication (as the # of bins approaches the number of instances)
o Data-adaptive - i.e. the decision tree samples on the driver and figures out the bin spacing, i.e. the places you slice for binning
o Intelligent framework - need this for scale
73. Look out for these interesting Spark features
o Concept of LabeledPoint & how to create an RDD of LPs
o Print the tree
o Calculate Accuracy & MSE from RDDs
74. Read data & extract features
iPython notebook at https://github.com/xsankar/cloaked-ironman
75. Create the model
76. Extract labels & features
77. Calculate Accuracy & MSE
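A sketch of the end-to-end flow, assuming a `parsed` RDD of (survived, class, sex, age) tuples with the categorical features already integer-encoded (3 classes, 2 sexes - these encodings are assumptions):

    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import DecisionTree

    data = parsed.map(lambda r: LabeledPoint(r[0], r[1:]))   # label + feature vector
    model = DecisionTree.trainClassifier(data, numClasses=2,
                                         categoricalFeaturesInfo={0: 3, 1: 2},
                                         impurity="gini", maxDepth=4, maxBins=100)
    print(model.toDebugString())                             # print the tree
    preds = model.predict(data.map(lambda p: p.features))    # predict on an RDD of features
    labels_preds = data.map(lambda p: p.label).zip(preds)
    accuracy = labels_preds.filter(lambda lp: lp[0] == lp[1]).count() / float(data.count())
    mse = labels_preds.map(lambda lp: (lp[0] - lp[1]) ** 2).mean()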
78. Use the NaiveBayes algorithm
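For NaiveBayes the API is parallel; a sketch against the same LabeledPoint RDD (features must be non-negative):

    from pyspark.mllib.classification import NaiveBayes

    nb_model = NaiveBayes.train(data, lambda_=1.0)   # lambda_ is the Laplace smoothing parameter
    nb_preds = data.map(lambda p: (p.label, nb_model.predict(p.features)))
    nb_accuracy = nb_preds.filter(lambda lp: lp[0] == lp[1]).count() / float(data.count())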
79. Decision Tree - Best Practices
DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
o maxDepth: tune with data/model selection
o maxBins: set low, monitor communications, increase if needed
o # RDD partitions: set to # of cores
• Usually the recommendation is that the RDD partitions should be over-partitioned, i.e. "more partitions than cores"; because tasks take different times, we need to utilize the compute power, and in the end they average out
• But for machine learning, especially trees, all tasks are approximately equally computationally intensive, so over-partitioning doesn't help
• Joe Bradley's talk (reference below) has interesting insights
https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup
80. Future …
o Actually we should split the data into training & test sets
o Then use different feature sets to see if we can increase the accuracy
o Leave it as homework
o In 1.2 …
o Random Forest
• Bagging
• PR for Random Forest
o Boosting
o Alpine Lab Sequoia Forest: coordinating merge
o Model selection pipeline; design doc
81. Boosting
— Goal
◦ Model complexity (-)
◦ Variance (-)
◦ Prediction accuracy (+)
◦ "Output of weak classifiers into a powerful committee"
◦ Final prediction = weighted majority vote
◦ Later classifiers get misclassified points with higher weight, so they are forced to concentrate on them
◦ AdaBoost (Adaptive Boosting)
◦ Boosting vs. Bagging
– Bagging - independent trees <- Spark shines here
– Boosting - successively weighted
82. Random Forests+
— Goal
◦ Model complexity (-)
◦ Variance (-)
◦ Prediction accuracy (+)
◦ Builds a large collection of de-correlated trees & averages them
◦ Improves bagging by selecting i.i.d.* random variables for splitting
◦ Simpler to train & tune
◦ "Do remarkably well, with very little tuning required" - ESLII
◦ Less susceptible to overfitting (than boosting)
◦ Many RF implementations
– Original version - Fortran-77! By Breiman/Cutler
– Python, R, Mahout, Weka, Milk (ML toolkit for py), MATLAB
* i.i.d. - independent, identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
83. Ensemble Methods
— Goal
◦ Model complexity (-)
◦ Variance (-)
◦ Prediction accuracy (+)
◦ Two steps:
– Develop a set of learners
– Combine the results to develop a composite predictor
◦ Ensemble methods can take the form of:
– Using different algorithms
– Using the same algorithm with different settings
– Assigning different parts of the dataset to different classifiers
◦ Bagging & Random Forests are examples of ensemble methods
Ref: Machine Learning In Action
84. Random Forests
o While boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables
o Simpler because it requires only two variables - the no. of predictors (typically √k) & the no. of trees (500 for large datasets, 150 for smaller ones)
o Error prediction
• For each iteration, predict for the data that is not in the sample (OOB data)
• Aggregate the OOB predictions
• Calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate
• Can use this to search for the optimal # of predictors
• We will see how close this is to the actual error in the Heritage Health Prize
o Assumes equal cost for mis-prediction; can add a cost function
o Proximity matrix & applications like adding missing data, dropping outliers
Ref: R News Vol 2/3, Dec 2002; Statistical Learning from a Regression Perspective, Berk; A Brief Overview of RF, Dan Steinberg
85. Why didn't RF do better? Bias/Variance
o High bias
• Due to underfitting
• Add more features
• Use a more sophisticated model - quadratic terms, complex equations, …
• Decrease regularization
o High variance
• Due to overfitting
• Use fewer features
• Use more training samples
• Increase regularization
Learning curve (prediction error vs. training error): need more features or a more complex model to improve vs. need more data to improve
"Bias is a learner's tendency to consistently learn the same wrong thing." - Pedro Domingos
Ref: Strata 2013 tutorial by Olivier Grisel; http://www.slideshare.net/ksankar/data-science-folk-knowledge
86. Break
4:30
87. Clustering
4:45
88. Data Science "folk knowledge" (3 of A)
o More data beats a cleverer algorithm
• Or conversely, select algorithms that improve with data
• Don't optimize prematurely without getting more data
o Learn many models, not just one
• Ensembles! - change the hypothesis space
• Netflix prize
• E.g. bagging, boosting, stacking
o Simplicity does not necessarily imply accuracy
o Representable does not imply learnable
• Just because a function can be represented does not mean it can be learned
o Correlation does not imply causation
o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o "A Few Useful Things to Know About Machine Learning" - by Pedro Domingos
• http://dl.acm.org/citation.cfm?id=2347755
89. Scenario - Clustering with Spark
o InterGallactic Airlines run the GallacticHoppers frequent flyer program & have data about the customers who participate in it
o The airline execs have a feeling that other airlines will poach their customers if they do not keep their loyal customers happy
o So the business wants to customize promotions for the frequent flyer program
o Can they just have one type of promotion?
o Should they have different types of incentives?
o Who exactly are the customers in the GallacticHoppers program?
o Recently they have deployed an infrastructure with Spark
o Can Spark help with this business problem?
90. Clustering - Theory
o Clustering is unsupervised learning
o While computers can dissect a dataset into "similar" clusters, it still takes human direction & domain knowledge to interpret & guide
o Two types:
• Centroid-based clustering - k-means clustering
• Tree-based clustering - hierarchical clustering
o Spark implements Scalable K-Means++
• Paper: http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
91. Look out for these interesting Spark features
o Application of the statistics toolbox
o Center & scale an RDD
o Filter RDDs
92. Clustering - API
o from pyspark.mllib.clustering import KMeans
o KMeans.train
• train(cls, data, k, maxIterations=100, runs=1, initializationMode="k-means||")
• k = number of clusters to create, default 2
• initializationMode = the initialization algorithm: either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: "k-means||"
o KMeansModel.predict
• Maps a point to a cluster
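A condensed sketch of the train/predict/error flow on slides 93-96 (the CSV of numeric frequent-flyer attributes is hypothetical):

    from numpy import array
    from math import sqrt
    from pyspark.mllib.clustering import KMeans

    points = sc.textFile("gallactic-hoppers.csv") \
               .map(lambda line: array([float(x) for x in line.split(",")]))
    model = KMeans.train(points, k=2, maxIterations=100, initializationMode="k-means||")

    def error(point):  # distance from a point to its assigned cluster center
        center = model.clusterCenters[model.predict(point)]
        return sqrt(sum((point - center) ** 2))

    wssse = points.map(error).reduce(lambda a, b: a + b)
    print("Within Set Sum of Squared Errors = %f" % wssse)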
93. Data
iPython notebook at https://github.com/xsankar/cloaked-ironman
94. Read Data & Create RDD
95. Train & Predict
96. Calculate error
97. But the data is not even
98. So let us center & scale the data and try again
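Centering and scaling can be done with MLlib's StandardScaler; a sketch against the same points RDD:

    from pyspark.mllib.feature import StandardScaler

    scaler = StandardScaler(withMean=True, withStd=True).fit(points)  # computes column means/stddevs
    scaled = scaler.transform(points)                                 # z-scored feature vectors
    model5 = KMeans.train(scaled, k=5, maxIterations=100)             # retry with 5 clusters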
99. Looks good. Let us try with 5 clusters
100. Let us map the clusters to our data
101. Interpretation
(table of cluster # / average values / interpretation, filled in during the workshop)
Note:
• This is just a sample interpretation
• In real life we would "noodle" over the clusters & tweak them to be useful, interpretable and distinguishable
• Maybe 3 is more suited to create targeted promotions
102. Epilogue
o KMeans in Spark has enough controls
o It does a decent job
o We were able to control the clusters based on our experience (2 clusters is too low, 10 is too high, 5 seems to be right)
o We can see that Scalable KMeans has control over runs, parallelism et al. (homework: explore the scalability)
o We were able to interpret the results with domain knowledge and arrive at a scheme to solve the business opportunity
o Naturally we would tweak the clusters to fit business viability; 20 clusters with corresponding promotion schemes are unwieldy, even if the WSSSE is the minimum
103. Recommendation Engine
5:05
104. Recommendation & Personalization - Spark
o Automated analytics - let data tell the story
• Feature learning, AI, deep learning
• Learning models - fit parameters as more data arrives
• Dynamic models - model selection based on context
o Approaches:
• Knowledge-based
• Demographic-based
• Content-based
• Collaborative filtering
§ Item-based
§ User-based
§ Latent-factor-based
o Signals: user rating, purchased, looked/not purchased
o Spark implements the user-based ALS collaborative filtering algorithm
Ref:
• ALS - Collaborative Filtering for Implicit Feedback Datasets; Yifan Hu (AT&T Labs, Florham Park, NJ), Koren, Y., Volinsky, C.
• ALS-WR - Large-Scale Parallel Collaborative Filtering for the Netflix Prize; Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, Rong Pan
105. Spark Collaborative Filtering API
o ALS.train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1)
o ALS.trainImplicit(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, alpha=0.01)
o MatrixFactorizationModel.predict(self, user, product)
o MatrixFactorizationModel.predictAll(self, usersProducts)
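The read-split-train-evaluate flow of slides 106-108, condensed into a sketch on MovieLens-style "user::movie::rating::timestamp" data (the path is hypothetical):

    from pyspark.mllib.recommendation import ALS, Rating

    ratings = sc.textFile("ml-1m/ratings.dat") \
                .map(lambda l: l.split("::")) \
                .map(lambda t: Rating(int(t[0]), int(t[1]), float(t[2])))
    train, test = ratings.randomSplit([0.8, 0.2], seed=17)
    model = ALS.train(train, rank=10, iterations=5, lambda_=0.01)

    user_products = test.map(lambda r: (r.user, r.product))
    preds = model.predictAll(user_products).map(lambda r: ((r.user, r.product), r.rating))
    truth = test.map(lambda r: ((r.user, r.product), r.rating))
    mse = truth.join(preds).map(lambda x: (x[1][0] - x[1][1]) ** 2).mean()  # pair-RDD join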
106. Read & Parse
107. Split & Train
108. Evaluate
109. Epilogue
o We explored interesting APIs in Spark
o ALS collaborative filtering
o RDD operations
• Join (hash join)
• In-memory, grace, recursive hash join: http://technet.microsoft.com/en-us/library/ms189313(v=sql.105).aspx
110. Questions?
4:45
111. Reference
1. SF Scala & SF Bay Area Machine Learning, Joseph Bradley: Decision Trees on Spark - http://functional.tv/post/98342564544/sfscala-sfbaml-joseph-bradley-decision-trees-on-spark
2. http://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering
3. http://stats.stackexchange.com/questions/19216/variables-are-often-adjusted-e-g-standardised-before-making-a-model-when-is
4. http://funny-pictures.picphotos.net/tongue-out-smiley-face/smile-day.net*wp-content*uploads*2012*01*Tongue-Out-Smiley-Face1.jpg/
5. https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup
6. http://www.rosebt.com/1/post/2011/10/big-data-analytics-maturity-model.html
7. http://blogs.gartner.com/matthew-davis/
112. Essential Reading List
o A Few Useful Things to Know About Machine Learning - by Pedro Domingos
• http://dl.acm.org/citation.cfm?id=2347755
o The Lack of A Priori Distinctions Between Learning Algorithms - by David H. Wolpert
• http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o http://www.no-free-lunch.org/
o Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing - Benjamini, Y. and Hochberg, Y.
• http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
• http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o Avoid These Three Mistakes - James Faghmous
• https://medium.com/about-data/73258b3848a4
o Leakage in Data Mining: Formulation, Detection, and Avoidance
• http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
113. For your reading & viewing pleasure … an ordered list
① An Introduction to Statistical Learning
• http://www-bcf.usc.edu/~gareth/ISL/
② ISL class - Stanford/Hastie/Tibshirani at their best: Statistical Learning
• http://online.stanford.edu/course/statistical-learning-winter-2014
③ Prof. Pedro Domingos
• https://class.coursera.org/machlearning-001/lecture/preview
④ Prof. Andrew Ng
• https://class.coursera.org/ml-003/lecture/preview
⑤ Prof. Abu-Mostafa, CaltechX: CS1156x: Learning From Data
• https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥ mathematicalmonk @ YouTube
• https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦ The Elements of Statistical Learning
• http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
114. References:
o An Introduction to scikit-learn, PyCon 2013, Jake Vanderplas
• http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning
o Advanced Machine Learning with scikit-learn, PyCon 2013 / Strata 2014, Olivier Grisel
• http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn
o Just The Basics, Strata 2013, William Cukierski & Ben Hamner
• http://strataconf.com/strata2013/public/schedule/detail/27291
o The Problem of Multiple Testing
• http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf
o Thanks to Ana Crisan for the Titanic inset. Picture courtesy of http://emileeid.com/2012/02/11/titanic-3d-exclusive-posters/
115. The Beginning As The End
How did we do?
4:45
