Foundations of Data Science
with Spark
July 16, 2015
@ksankar // doubleclix.wordpress.com
www.globalbigdataconference.com
Twitter : @bigdataconf
o Intro & Setup [8:00-8:20)
• Goals/non-goals
o Spark & Data Science DevOps [8:20-8:40)
• Spark in the context of Data
Science
o Where Exactly is Apache Spark headed ?
[8:40-9:30)
• Spark Yesterday, Today &
Tomorrow
• Spark Stack
o Break [9:30-10:00)
o DataFrames for the Data Scientist
[10:00-11:30)
• pySpark Classes
• Walkthru DataFrames
• Hands-on Notebooks
o [15] Discussions/Slack (11:30 - 11:45)
Agenda : Introduction To Spark
http://globalbigdataconference.com/52/santa-clara/big-data-developer-conference/schedule.html
o Review (2:00-2:30)
• 004-Orders-Homework-
Solution
• MLlib StatisticalToolbox
• Summary, Correlations
o [20] Linear Regression (2:30-2:45)
o [20] “Mood Of the Union” (2:45-3:15)
• State of the Union w/
Washington, Lincoln, FDR, JFK,
Clinton, Bush & Obama
• Map reduce, parse text
o Break (3:15-3:30)
o [60] Predicting Survivors with
Classification (3:30-4:30)
• Decision Trees
• NaiveBayes (Titanic data set)
o Break (4:30-4:45)
o [20] Clustering(4:45-5:05)
• K-means for Gallactic Hoppers!
o [20] Recommendation Engine (5:05-5:25)
• Collab Filtering w/movie lens
o [15] Discussions/Slack (5:45-6:00)
Agenda : Data Wrangling w/ DataFrames & MLlib
Goals & non-goals
Goals
¤Understand how to program
Machine Learning with Spark &
Python
¤Focus on programming & ML
application
¤Give you a focused time to work
thru examples
§ Work with me. I will wait
if you want to catch-up
¤Less theory, more usage - let us
see if this works
¤As straightforward as possible
§ The programs can be
optimized
Non-goals
¡Go deep into the algorithms
• We don’t have sufficient
time. The topic can be
easily a 5 day tutorial !
¡Dive into spark internals
• That is for another day
¡The underlying computation,
communication, constraints &
distribution is a fascinating subject
• Paco does a good job
explaining them
¡A passive talk
• Nope. Interactive &
hands-on
About Me
o Data Scientist
• Decision Data Science & Product Data Science
• Insights = Intelligence + Inference + Interface [https://goo.gl/s2KB6L]
• Predicting NFL with Elo like Nate Silver & 538 [NFL : http://goo.gl/Q2OgeJ, NBA’15 : https://goo.gl/aUhdo3]
o Have been speaking at OSCON [http://goo.gl/1MJLu], PyCon, Pydata [http://vimeo.com/63270513,
http://www.slideshare.net/ksankar/pydata-19] …
o Full-day Spark workshop “Advanced Data Science w/ Spark” / Spark Summit-E’15[https://goo.gl/7SBKTC]
o Co-author : “Fast Data Processing with Spark”, Packt Publishing [http://goo.gl/eNtXpT]
o Reviewer : “Machine Learning with Spark” Packt Publishing
o Have done lots of things:
• Big Data (Retail, Bioinformatics, Financial, AdTech), Starting MS-CFRM, University of WA
• Written books (Web 2.0, Wireless, Java, …), standards, some work in AI,
• Guest Lecturer at Naval PG School,…
o Volunteer as Robotics Judge at First Lego league World Competitions
o @ksankar, doubleclix.wordpress.com
Close Encounters
— 1st
◦ This Tutorial
— 2nd
◦ Do More Hands-on Walkthrough
— 3rd
◦ Listen To Lectures
◦ More competitions …
Spark Installation
o Install Spark 1.4.1 on your local machine
• https://spark.apache.org/downloads.html
• Pre-built For Hadoop 2.6 is fine
• Download & uncompress
• Remember the path & use it wherever you see /usr/local/spark/
• I have downloaded in /usr/local & have a softlink spark to the latest version
o Install iPython
Tutorial Materials
o Github : https://github.com/xsankar/global-bd-conf
• Clone or download zip
o Open terminal
o cd ~/global-bd-conf
o IPYTHON=1 IPYTHON_OPTS="notebook" /usr/local/spark/bin/pyspark --packages com.databricks:spark-csv_2.11:1.0.3
o Notes :
• I have a soft link “spark” in my /usr/local that points to the spark version
that I use. For example ln -s spark-1.4.1/ spark
o Click on ipython dashboard
o Run 000-PreFlightCheck.ipynb
o Run 001-TestSparkCSV.ipynb
o Now you are ready for the workshop !
Spark & Data Science DevOps
8:20
Spark in the context of data science
Data Science :
The art of building a model with known knowns, which when let loose, works with unknown
unknowns!
Donald Rumsfeld is an armchair Data Scientist !
http://smartorg.com/2013/07/valuepoint19/
The quadrant: The World (Knowns / Unknowns) vs. You (Knowns / Unknowns)
o Known Known : What we do
o Unknown Known : Others know, you don't
o Known Unknown : Potential facts, outcomes we are aware of, but not with certainty; stochastic processes, probabilities
o Unknown Unknown : Facts, outcomes or scenarios we have not encountered, nor considered; "Black swans", outliers, long tails of probability distributions; lack of experience, imagination
o Known Knowns
  • There are things we know that we know
o Known Unknowns
  • That is to say, there are things that we now know we don't know
o But there are also Unknown Unknowns
  • There are things we do not know we don't know
The curious case of the Data Scientist
o Data Scientist is multi-faceted & Contextual
o Data Scientist should be building Data Products
o Data Scientist should tell a story
http://doubleclix.wordpress.com/2014/01/25/the-­‐curious-­‐case-­‐of-­‐the-­‐data-­‐scientist-­‐profession/
Large is hard; Infinite is much easier !
– Titus Brown
Data Science - Context

Collect | Store | Transform
o Canonical form; Data catalog; Data Fabric across the organization
o Access to multiple sources of data
o Think Hybrid – Big Data Apps, Appliances & Infrastructure
o Volume; Velocity; Streaming Data
o Metadata; Monitor counters & Metrics; Structured vs. Multi-structured

Reason | Model | Deploy
o Flexible & Selectable : Data Subsets, Attribute sets
o Refine model with Extended Data subsets & Engineered Attribute sets; Validation run across a larger data set
o Scalable Model Deployment; Big Data automation & purpose-built appliances (soft/hard); Manage SLAs & response times

Data Management / Data Science
Explore | Visualize | Recommend | Predict
o Dynamic Data Sets; 2-way key-value tagging of datasets; Extended attribute sets; Advanced Analytics
o Performance; Scalability; Refresh Latency; In-memory Analytics
o Advanced Visualization; Interactive Dashboards; Map Overlay; Infographics
¤ Bytes to Business a.k.a. Build the full stack
¤ Find Relevant Data For Business
¤ Connect the Dots
Volume | Velocity | Variety

Data Science - Context
Context | Connectedness | Intelligence | Interface | Inference
"Data of unusual size" that can't be brute forced
o Three Amigos
o Interface = Cognition
o Intelligence = Compute (CPU) & Computational (GPU)
o Infer Significance & Causality
Day in the life of a (super) Model
o Intelligence : Algorithms; Models
o Inference : Reason & Learn; Model Selection; Model Assessment
o Data Representation : Attributes; Parameters; Feature Selection; Dimensionality Reduction; Data (Scoring)
o Interface : Visualize, Recommend, Explore
Data Science Maturity Model & Spark
Levels : Isolated Analytics | Integrated Analytics | Aggregated Analytics | Automated Analytics

Data : Small Data | Larger Data set | Big Data | Big Data Factory Model
Context : Local | Domain | Cross-domain + External | Cross-domain + External
Model, Reason & Deploy :
• Isolated : Single set of boxes, usually owned by the Model Builders; Departmental
• Integrated : Deploy on Central Analytics Infrastructure; Models still owned & operated by Modelers; Partly enterprise-wide
• Aggregated : Central Analytics Infrastructure; Model & Reason – by Model Builders; Deploy, Operate – by ops; Residuals and other metrics monitored by modelers; Enterprise-wide
• Automated : Distributed Analytics Infrastructure; AI-augmented models; Model & Reason – by Model Builders; Deploy, Operate – by ops; Data as a monetized service, extending to ecosystem partners
Interface : Reports | Dashboards | Dashboards + some APIs | Dashboards + well-defined APIs + programming models
Type : Descriptive & Reactive | + Predictive | + Adaptive | Adaptive
Datasets : All in the same box | Fixed data sets, usually in temp data spaces | Flexible Data & Attribute sets | Dynamic datasets with well-defined refresh policies
Workload : Skunk works | Business-relevant apps with approx. SLAs | High-performance appliance clusters | Appliances and clusters for multiple workloads including real-time apps; infrastructure for emerging technologies
Strategy : Informal definitions | Data definitions buried in the analytics models | Some data definitions | Data catalogue, metadata & annotations; Big Data MDM strategy
The Sense & Sensibility of a DataScientist DevOps
Factory = Operational
Lab = Investigative
http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/
Where exactly is Apache Spark headed ?
Spark Yesterday, Today & Tomorrow …
"Unified engine across diverse data sources, workloads & environments"
8:40
http://free-stock-illustration.com/winding+road+clip+art
Spark 1.x
• Fast engine for big data processing
• Fast to run code & fast to write code
• In-memory computation graphs, compatibility with the Hadoop eco system,
and interesting, very usable APIs
• Iterative & interactive apps that operate on data multiple times, which
are not a good use case for Hadoop.
Spark 1.3 & beyond
has been the catalyst
for a renaissance in
Data Science !
Spark 1.4+
• Multi-pass analytics - ML pipelines, GraphX
• Ad-hoc queries - DataFrames
• Real-time stream processing – Spark Streaming
• Parallel Machine Learning Algorithms beyond the
basic RDDs
• More types of data sources as input & output
• More integration with R to span statistical computing
beyond “single-node tools”
• More integration with apps like visualization
dashboards
• More performance with even larger datasets &
complex applications – Project Tungsten
Spark Yesterday, Today &
Tomorrow …
Spark Directions
Data Science
o DataFrames
o ML Pipelines
o SparkR
Platform APIs
o Growing the eco system
  § Data Sources - uniform access to diverse data sources
  § Pluggable "smart" DataSource API for reading/writing DataFrames while minimizing I/O
  § Spark Packages
  § Deployment utilities for Google Compute, Azure & Job Server
Streaming, DAG Visualization & Debugging
o Spark Streaming flow control & optimized state management
Execution Optimization (Project Tungsten)
o Focus on CPU efficiency
o Run-time code generation
o Cache locality & cache-aware data structures
o Binary format for aggregations
o Spark-managed memory
o Off-heap memory management
Spark - The (simple) Stack
RDD – The workhorse of Core Spark
o Resilient Distributed Datasets
• Collection that can be operated in parallel
o Transformations – create RDDs
• Map, Filter,…
o Actions – Get values
• Collect, Take,…
o We will apply these operations during this tutorial
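The transformation/action split above can be seen without a cluster: transformations build a lazy recipe, actions force evaluation. A minimal local sketch using plain Python generators in place of RDDs (the data and names here are illustrative, not Spark's API):

```python
# Local sketch of RDD semantics: transformations are lazy,
# actions materialize results. Generators give the same laziness.
data = [1, 2, 3, 4, 5]

# "Transformations" build a lazy pipeline (like rdd.map / rdd.filter)
mapped = (x * 10 for x in data)           # map
filtered = (x for x in mapped if x > 20)  # filter

# "Actions" force evaluation (like rdd.collect / rdd.take)
result = list(filtered)                   # collect
print(result)  # [30, 40, 50]
```

In Spark the same pipeline would be `sc.parallelize(data).map(...).filter(...).collect()`, with the work distributed across partitions.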
DataFrame API
Spark Core
Spark SQL
Spark
Streaming
Spark
R
MLlib GraphX Packages
ML Pipelines Advanced Analytics
Neural
Networks
Deep
Learning
Parameter
Server
R | Scala | Java | Python
Catalyst Optimizer – optimize execution plan
Data Sources - Parquet, Hadoop, Cassandra, JSON, CSV, JDBC,…
Tungsten Execution
RDD
Query Optimization-Execution pipeline:
SQL Query / DataFrame
→ Unresolved Logical Plan
→ (Analysis, using the Catalog)
→ Logical Plan
→ (Logical Optimization)
→ Optimized Logical Plan
→ (Physical Planning)
→ Physical Plans
→ (Cost Model)
→ Selected Physical Plan
→ (Code Generation)
→ RDDs
Ref: Spark SQL paper
Spark DataFrames for the
Data Scientist
“A towel is about the most massively useful thing an interstellar hitchhiker can have … any
man who can hitch the length and breadth of the Galaxy, rough it … win through, and still
know where his towel is, is clearly a man to be reckoned with.”
- From The Hitchhiker's Guide to the Galaxy, by Douglas Adams.
DataFrames ! The Most Massively useful thing a Data Scientist can have …
10:00
Data Science “folk knowledge” (1 of A)
o "If you torture the data long enough, it will confess to anything." – Hal Varian,
Computer Mediated Transactions
o Learning = Representation + Evaluation + Optimization
o It’s Generalization that counts
• The fundamental goal of machine learning is to generalize beyond the
examples in the training set
o Data alone is not enough
• Induction not deduction - Every learner should embody some knowledge
or assumptions beyond the data it is given in order to generalize beyond it
o Machine Learning is not magic – one cannot get something from nothing
• In order to infer, one needs the knobs & the dials
• One also needs a rich expressive dataset
A few useful things to know about machine learning- by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755
pyspark
pyspark.SparkContext()
pyspark.SparkConf()
pyspark.RDD()
pyspark.Broadcast()
pyspark.Accumulator()
pyspark.SparkFiles()
pyspark.StorageLevel()
pyspark.sql | pyspark.streaming | pyspark.mllib | pyspark.ml
pyspark.streaming.StreamingContext()
pyspark.streaming.Dstream()
pyspark.streaming.kafka
pyspark.streaming.kafka.Broker()
…
pyspark.sql.SQLContext()
pyspark.sql.DataFrame()
pyspark.sql.DataFrameNaFunctions()
pyspark.sql.DataFrameStatFunctions()
pyspark.sql.DataFrameReader()
pyspark.sql.DataFrameWriter()
pyspark.sql.Column()
pyspark.sql.Row()
pyspark.sql.functions()
pyspark.sql.types()
pyspark.sql.Window()
pyspark.sql.WindowSpec()
pyspark.sql.GroupedData()
pyspark.sql.HiveContext()
pyspark.mllib.classification
pyspark.mllib.clustering
pyspark.mllib.evaluation
pyspark.mllib.feature
pyspark.mllib.fpm
pyspark.mllib.linalg
pyspark.mllib.random
pyspark.mllib.recommendation
pyspark.mllib.regression
pyspark.mllib.stat
pyspark.mllib.tree
pyspark.mllib.util
ML Pipeline APIs
pyspark.ml.Transformer
pyspark.ml.Estimator
pyspark.ml.Model
pyspark.ml.Pipeline
pyspark.ml.PipelineModel
pyspark.ml.param
pyspark.ml.feature
pyspark.ml.classification
pyspark.ml.recommendation
pyspark.ml.regression
pyspark.ml.tuning
pyspark.ml.evaluation
1. SparkContext()
2. Read/Write
3. Convert
pyspark.sql.DataFrame ↔ table ↔ pandas.DataFrame
sqlContext.registerDataFrameAsTable(df, "aTable")
df2 = sqlContext.table("aTable")
df = createDataFrame(pandas.DataFrame)
p_df = df.toPandas()
4. Columns & Rows (1 of 3)
o Select a column
  • by the df("<columnName>") notation or the df.<columnName> notation
  • The recommended way is df("<columnName>"), the reason being that a column
    name can collide with a DataFrame method if we use df.<columnName>
o Column-wise operations like +, -, *, /, % (modulo), &&, ||, <, <=, > and >=
  • df("total") = df("price") * df("qty")
  • The inequality operator is !==, the usual equality operator is === and there
    is an equality test that is safe for null values, <=>
o Meta operations – type conversion (cast), alias, not null, …
  • df_cars.mpg.cast("double").alias('mpg')
o Run arbitrary udfs on a column (see next page)
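Outside Spark, the column expression df("total") = df("price") * df("qty") is just element-wise arithmetic over parallel columns, and a udf is just a function mapped over a column. A local sketch with plain lists (the column names and values are made up):

```python
# Columns as parallel lists; a column-wise * is an element-wise product.
price = [10.0, 2.5, 4.0]
qty   = [3,    4,   2]

total = [p * q for p, q in zip(price, qty)]
print(total)  # [30.0, 10.0, 8.0]

# A "udf" applied to a column is a function mapped over every cell,
# like the cast("double") example above.
def to_double(s):
    return float(s)

mpg = [to_double(s) for s in ["21.0", "22.8", "18.7"]]
print(mpg)  # [21.0, 22.8, 18.7]
```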
4. Columns & Rows (2 of 3)
Run arbitrary udfs on a column
4. Columns & Rows (3 of 3)
Interesting Operations …
Adding a column…
5. DataFrame : RDD-like Operations

Function : Description

df.sort(<sort expression>) or df.orderBy(<sort expression>) : Returns a sorted DataFrame. There are multiple ways of specifying the sort expression. Use of orderBy is recommended (as the syntax is closer to SQL), for example:
df_orders_1.groupBy("CustomerID","Year").count().orderBy('count', ascending=False).show()

df.filter(<condition>) or df.where(<condition>) : Returns a new DataFrame after applying the <condition>. The condition is usually based on a column. Use of the where form is recommended (as the syntax is closer to the SQL world), for example:
df_orders.where(df_orders['ShipCountry'] == 'France')

df.coalesce(n) : Returns a DataFrame with n partitions, same as the coalesce(n) method of RDD

df.foreach(<function>) : Applies a function on all the rows of a DataFrame

df.map(lambda r: ...) : Applies the function on all the rows and returns the resulting object

df.flatMap(lambda r: ...) : Returns an RDD, flattened, after applying the function on all the rows of the DataFrame

df.rdd : Returns the DataFrame as an RDD of Row objects

df.na.replace([<list of values to be replaced>], [<list of replacing values>], subset=[<list of columns>]), also available as DataFrame.replace() or DataFrameNaFunctions.replace() : An interesting and very useful function with slightly strange syntax. The recommended form is df.na.replace(), even though the .na namespace throws one off a little. Use subset= for the column names. The syntax is different from the Scala syntax.
6. DataFrame : Action
o cache()
o collect(), collectAsList()
o count()
o describe(),
o first(), head(), show(), take()
o …
7. DataFrame : Scientific Functions
8. DataFrame : Statistical Functions
The pair-wise frequency (contingency table) of transmission type and number of speeds shows interesting observations:
• All automatic cars in the dataset are 3-speed, while most of the manual transmission cars have 4 or 5 speeds
• Almost all the manual cars have 2 barrels, while the automatic cars have 2 and 4 barrels
9. DataFrame : Aggregate Functions
o The pyspark.sql.functions class (and the org.apache.spark.sql.functions for Scala)
contains the aggregation functions
o There are two types of aggregations: on whole column values, and on
subsets of column values, i.e. grouped values of some other columns
• pyspark.sql.functions.avg("sales")
• df.groupBy("year").agg({"sales": "avg"})
o count(), countDistinct()
o first(), last()
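The two aggregation styles reduce to this logic; a local sketch of avg("sales") and groupBy("year").agg({"sales": "avg"}) using a plain dict (the toy rows and values are invented):

```python
from collections import defaultdict

rows = [
    {"year": 2014, "sales": 100.0},
    {"year": 2014, "sales": 300.0},
    {"year": 2015, "sales": 200.0},
]

# Aggregation over a whole column: avg("sales")
overall_avg = sum(r["sales"] for r in rows) / len(rows)
print(overall_avg)  # 200.0

# Grouped aggregation: groupBy("year").agg({"sales": "avg"})
groups = defaultdict(list)
for r in rows:
    groups[r["year"]].append(r["sales"])
by_year = {year: sum(v) / len(v) for year, v in groups.items()}
print(by_year)  # {2014: 200.0, 2015: 200.0}
```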
10. DataFrame : na
o One of the tenets of big data and data science is that data is never fully clean; while we
can handle types, formats et al, missing values are always challenging
o One easy solution is to drop the rows that have missing values, but then we would lose
valuable data in the columns that do have values
o A better solution is to impute data based on some criteria. It is true that data cannot be
created out of thin air, but data can be inferred with some success – it is better than
dropping the rows
• We can replace null with 0
• A better solution is to replace numerical values with the average of the rest of the valid
values; for categorical values, replacing with the most common value is a good strategy
• We could use the mode or median instead of the mean
• Another good strategy is to infer the missing value from other attributes, i.e. "evidence
from multiple fields"
• For example, the Titanic data has a name field, and for imputing the missing age field we
could use the Mr., Master., Mrs., Miss. designation from the name and then fill in the
average of the age field for the corresponding designation. So a row with a missing age
and "Master." in the name would get the average age of all records with "Master."
• There is also a field for the number of siblings and the number of spouses. We could
average the age based on the value of that field
• We could even average the ages from the different strategies
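The Titanic strategy above (fill a missing age with the average age of rows sharing the same title) can be sketched without Spark; the tiny dataset and the title parser here are invented for illustration:

```python
import re
from collections import defaultdict

# Toy rows: (name, age); None marks a missing age.
rows = [
    ("Allen, Mr. William", 35.0),
    ("Moran, Mr. James", None),
    ("Heikkinen, Miss. Laina", 26.0),
    ("Palsson, Master. Gosta", 2.0),
    ("Rice, Master. Eugene", None),
    ("Doe, Mr. John", 45.0),
]

def title_of(name):
    # Pull the "Mr.", "Mrs.", "Miss.", "Master." designation out of the name.
    m = re.search(r"(Mr|Mrs|Miss|Master)\.", name)
    return m.group(1) if m else "Unknown"

# Average the known ages per title.
ages = defaultdict(list)
for name, age in rows:
    if age is not None:
        ages[title_of(name)].append(age)
avg_age = {t: sum(v) / len(v) for t, v in ages.items()}

# Impute: a missing age becomes the average for that row's title.
imputed = [(n, a if a is not None else avg_age[title_of(n)]) for n, a in rows]
print(imputed[1])  # ('Moran, Mr. James', 40.0) -- average of the Mr. ages
print(imputed[4])  # ('Rice, Master. Eugene', 2.0)
```

In Spark this per-group fill would typically combine groupBy().agg() with na.fill() or a join back onto the original DataFrame.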
10. DataFrame : na
11. Joins/Set Operations a.k.a. Language Integrated Queries
12. SQL on Tables
Hands-On
o 003-DataFrame-For-DS
• Understand and run the iPython Notebook
o 004-Orders
• Homework – we will go thru the solution when we meet in the afternoon
Data Wrangling with Spark
2:00
Algorithm spectrum
o Regression
o Logit
o CART
o Ensemble
:Random
Forest
o Clustering
o KNN
o Genetic Alg
o Simulated
Annealing
o Collab
Filtering
o SVM
o Kernels
o SVD
o NNet
o Boltzman
Machine
o Feature
Learning
Machine Learning | Cute Math | Artificial Intelligence
Statistical Toolbox
o Sample data : Car mileage data
Linear Regression
2:30
Linear Regression - API
LabeledPoint : The features and labels of a data point
LinearModel : weights, intercept
LinearRegressionModelBase : predict()
LinearRegressionModel
LinearRegressionWithSGD : train(cls, data, iterations=100, step=1.0, miniBatchFraction=1.0, initialWeights=None, regParam=1.0, regType=None, intercept=False)
LassoModel : Least-squares fit with an l_1 penalty term
LassoWithSGD : train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)
RidgeRegressionModel : Least-squares fit with an l_2 penalty term
RidgeRegressionWithSGD : train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)
Basic Linear Regression
Use LR model for prediction & calculate MSE
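Whatever the model, the MSE step is the same arithmetic: mean of the squared differences between labels and predictions. A sketch with made-up values:

```python
# Mean Squared Error between labels and model predictions.
labels = [3.0, 5.0, 7.0]
preds  = [2.5, 5.5, 6.0]

mse = sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels)
print(mse)  # 0.5
```

In the notebook the pairs come from zipping the label RDD with model.predict() over the feature vectors.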
Step size is important, the model can diverge !
Interesting step size
“Mood Of the Union” with TF-IDF
2:45
Scenario – Mood Of the Union
o It has been said that the State of the Union speech by the President of the USA
reflects the social challenges faced by the country
o If so, can we infer the mood of the country by analyzing SOTU ?
o If we embark on this line of thought, how would we do it with Spark & python ?
o Is it different from Hadoop-MapReduce ?
o Is it better ?
POA (Plan Of Action)
o Collect State of the Union speech by George Washington, Abe Lincoln, FDR,
JFK, Bill Clinton, GW Bush & Barack Obama
o Read the 7 SOTU from the 7 presidents into 7 RDDs
o Create word vectors
o Transform into word frequency vectors
o Remove stock common words
o Inspect the top n words to see if they reflect the sentiment of the time
o Compute set difference and see how new words have cropped up
o Compute TF-IDF (homework!)
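The core of the plan (word vectors, frequency counts, removing stock common words, sorting by value, set difference) fits in a few lines of plain Python; the two toy "speeches" are invented stand-ins for the SOTU texts:

```python
from collections import Counter

stop_words = {"the", "of", "and", "a", "to"}

def word_freq(text):
    # Parse to lower-case words, drop stock common words, count.
    words = (w for w in text.lower().split() if w not in stop_words)
    return Counter(words)

fdr = word_freq("the fear of fear and the depression")
obama = word_freq("the economy and the jobs and jobs")

# Top-n words, i.e. sort the (word, count) map by value, descending.
print(obama.most_common(2))  # [('jobs', 2), ('economy', 1)]

# Set difference: words in one speech but not the other
# (the Spark version uses subtractByKey on (word, count) RDDs).
print(sorted(set(obama) - set(fdr)))  # ['economy', 'jobs']
```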
Lookout for these interesting Spark features
o RDD Map-Reduce
o How to parse input
o Removing common words
o Sort rdd by value
Read & Create word vector
iPython notebook at https://github.com/xsankar/cloaked-ironman
Remove Common Words – 1 of 3
Remove Common Words – 2 of 3
Remove Common Words – 3 of 3
FDR vs. Barack Obama as reflected by SOTU
Barack Obama vs. Bill Clinton
GWB vs. Abe Lincoln as reflected by SOTU
Epilogue
o Interesting Exercise
o Highlights
• Map-reduce in a couple of lines !
• But it is not exactly the same as Hadoop MapReduce (see the excellent blog by Sean Owen1)
• Set differences using subtractByKey
• Ability to sort a map by values (or any arbitrary function, for that matter)
o To Explore as homework:
• TF-IDF in http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
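For the TF-IDF homework, the formula itself is small: tf(t, d) × log(N / df(t)). A stdlib sketch over three toy documents (MLlib's HashingTF/IDF computes the same idea at scale, with a smoothed IDF):

```python
import math
from collections import Counter

docs = [
    "war war economy".split(),
    "economy jobs".split(),
    "jobs jobs jobs".split(),
]
N = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(t for d in docs for t in set(d))

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)          # term frequency in this doc
    idf = math.log(N / df[term])             # rarer across docs -> higher idf
    return tf * idf

# "war" is concentrated in doc 0, so it scores high there,
# while "economy" appears in 2 of 3 docs, so its idf is lower.
print(tf_idf("war", docs[0]) > tf_idf("economy", docs[0]))  # True
```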
Break
3:15
Predicting Survivors with Classification
3:30
Data Science “folk knowledge” (Wisdom of Kaggle)
Jeremy’s Axioms
o Iteratively explore data
o Tools
• Excel Format, Perl, Perl Book, Spark !
o Get your head around data
• Pivot Table
o Don’t over-complicate
o If people give you data, don’t assume that you
need to use all of it
o Look at pictures !
o History of your submissions – keep a tab
o Don’t be afraid to submit simple solutions
• We will do this during this workshop
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
Titanic Passenger Metadata
• Small
• 3 Predictors
  • Class
  • Sex
  • Age
• Survived?
Classification - Scenario
o This is a knowledge exercise
o Classify survival from the titanic data
o Gives us a quick dataset to run & test classification
Classifying Classifiers
Statistical : Regression; Logistic Regression1; Naïve Bayes; Bayesian Networks
Structural :
  Rule-based : Production Rules; Decision Trees
  Distance-based :
    Functional : Linear; Spectral Wavelet
    Nearest Neighbor : kNN; Learning Vector Quantization
  Neural Networks : Multi-layer Perceptron
Ensemble : Random Forests; Boosting; SVM
1 Max Entropy Classifier
Ref: Algorithms of the Intelligent Web, Marmanis & Babenko
Classifiers : Regression (Continuous Variables) vs. Categorical Variables
Decision Trees; k-NN (Nearest Neighbors); CART
Bias vs. Variance; Model Complexity; Over-fitting
Boosting; Bagging
Classification - Spark API
o Logistic Regression
o SVMWIthSGD
o DecisionTrees
o Data as LabeledPoint (we will see in a moment)
o DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini",
maxDepth=4, maxBins=100)
o Impurity – “entropy” or “gini”
o maxBins = control to throttle communication at the expense of accuracy
• Larger = Higher Accuracy
• Smaller = less communication (as # of bins = number of instances)
o Data adaptive – i.e. the decision tree samples on the driver and figures out the bin
spacing, i.e. the places you slice for binning
o Intelligent framework – needed for scale
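The impurity="gini" choice above is a concrete formula: Gini = 1 - Σ pᵢ², over the class proportions pᵢ in a node. A quick sketch (toy labels) to show why a pure node scores 0 and a 50/50 split scores 0.5:

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini(["survived"] * 10))                     # 0.0 (pure node)
print(gini(["survived"] * 5 + ["died"] * 5))       # 0.5 (worst 2-class case)
print(gini(["survived"] * 8 + ["died"] * 2))       # ~0.32
```

A split is chosen to maximize the drop in (weighted) impurity from parent to children; "entropy" substitutes -Σ pᵢ log pᵢ for the same role.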
Lookout for these interesting Spark features
o Concept of Labeled Point & how to create an RDD of LPs
o Print the tree
o Calculate Accuracy & MSE from RDDs
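The accuracy computation in the notebook boils down to comparing a labels RDD with a predictions RDD; locally it is just a zip and a fraction (the 0/1 labels here are invented):

```python
# 0/1 labels vs. model predictions; accuracy = fraction matched.
labels = [1, 0, 1, 1, 0, 1]
preds  = [1, 0, 0, 1, 0, 1]

correct = sum(1 for y, p in zip(labels, preds) if y == p)
accuracy = correct / len(labels)
print(accuracy)  # 5 of 6 correct

# For 0/1 outcomes, MSE equals 1 - accuracy: each miss costs (y - p)^2 = 1.
mse = sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels)
```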
Read data & extract features
Create the model
Extract labels & features
Calculate Accuracy & MSE
Use NaiveBayes Algorithm
Decision Tree – Best Practices
DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
maxDepth : Tune with Data/Model Selection
maxBins : Set low, monitor communications, increase if needed
# RDD partitions : Set to # of cores
• Usually the recommendation is that RDD partitions should be over-partitioned, i.e.
"more partitions than cores": because tasks take different times, we need to utilize
the compute power, and in the end they average out
• But for Machine Learning, especially trees, all tasks are approximately equally
computationally intensive, so over-partitioning doesn't help
• Joe Bradley's talk (reference below) has interesting insights
https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup
Future …
o Actually we should split the data to training & test sets
o Then use different feature sets to see if we can increase the accuracy
o Leave it as Homework
o In 1.2 …
o Random Forest
• Bagging
• PR for Random Forest
o Boosting
o Alpine lab sequoia Forest: coordinating merge
o Model Selection Pipeline ; Design Doc
Boosting
◦ “Output of weak classifiers into a powerful committee”
◦ Final Prediction = weighted majority vote
◦ Later classifiers get misclassified points
  – with higher weight,
  – so they are forced
  – to concentrate on them
◦ AdaBoost (Adaptive Boosting)
◦ Boosting vs. Bagging
  – Bagging – independent trees <- Spark shines here
  – Boosting – successively weighted
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
◦ Builds	
  large	
  collection	
  of	
  de-­‐correlated	
  trees	
  &	
  averages	
  them
◦ Improves	
  Bagging	
  by	
  selecting	
  i.i.d*	
  random	
  variables	
  for	
  
splitting
◦ Simpler	
  to	
  train	
  &	
  tune
◦ “Do	
  remarkably	
  well,	
  with	
  very	
  little	
  tuning	
  required”	
  – ESLII
◦ Less	
  susceptible	
  to	
  over	
  fitting	
  (than	
  boosting)
◦ Many	
  RF	
  implementations
– Original	
  version	
  -­‐ Fortran-­‐77	
  !	
  By	
  Breiman/Cutler
– Python,	
  R,	
  Mahout,	
  Weka,	
  Milk	
  (ML	
  toolkit	
  for	
  py),	
  matlab
* i.i.d – independent identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Random Forests+
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
◦ Two Step
– Develop a set of learners
– Combine the results to develop a composite predictor
◦ Ensemble methods can take the form of:
– Using different algorithms
– Using the same algorithm with different settings
– Assigning different parts of the dataset to different classifiers
◦ Bagging & Random Forests are examples of ensemble methods
Ref: Machine Learning In Action
Ensemble Methods
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
Random Forests
o While Boosting splits based on the best among all variables, RF splits based on the best among
randomly chosen variables
o Simpler because it requires only two parameters – no. of predictors (typically √k) & no. of trees
(500 for a large dataset, 150 for a smaller one)
o Error prediction
• For each iteration, predict for dataset that is not in the sample
(OOB data)
• Aggregate OOB predictions
• Calculate Prediction Error for the aggregate, which is
basically the OOB estimate of error rate
• Can use this to search for optimal # of predictors
• We will see how close this is to the actual error in the
Heritage Health Prize
o Assumes equal cost for mis-prediction. Can add a cost function
o Proximity matrix & applications like adding missing data, dropping outliers
Ref: R News Vol 2/3, Dec 2002
Statistical Learning from a Regression Perspective : Berk
A Brief Overview of RF by Dan Steinberg
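Spark 1.2's RandomForest API (pyspark.mllib.tree) can be exercised along these lines - a sketch only; the √k heuristic from the slide is shown as a plain helper, and the tree count and depth are illustrative defaults, not tuned values:

```python
import math

def default_num_predictors(k):
    # Slide heuristic: try sqrt(k) candidate predictors per split.
    return max(1, int(math.sqrt(k)))

def train_random_forest(labeled_points, num_classes=2, num_trees=150):
    # labeled_points: RDD[LabeledPoint]; requires Spark >= 1.2.
    from pyspark.mllib.tree import RandomForest
    return RandomForest.trainClassifier(
        labeled_points, numClasses=num_classes, categoricalFeaturesInfo={},
        numTrees=num_trees,               # 500 for a large dataset, 150 for smaller
        featureSubsetStrategy="auto",     # "auto" picks sqrt(#features) for classification
        impurity="gini", maxDepth=4, maxBins=32, seed=42)
```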
Why didn’t RF do better ? Bias/Variance
o High Bias
• Due to Underfitting
• Add more features
• More sophisticated model
• Quadratic Terms, complex equations,…
• Decrease regularization
o High Variance
• Due to Overfitting
• Use fewer features
• Use more training samples
• Increase regularization
[Learning Curve figure (Ref: Strata 2013 Tutorial by Olivier Grisel): plots Prediction Error vs. Training Error. High bias: need more features or a more complex model to improve. High variance: need more data to improve.]
'Bias is a learner’s tendency to consistently learn the same wrong thing.' -- Pedro Domingos
http://www.slideshare.net/ksankar/data-science-folk-knowledge
Break
4:30
Clustering
4:45
Data Science “folk knowledge” (3 of A)
o More Data Beats a Cleverer Algorithm
• Or conversely select algorithms that improve with data
• Don’t optimize prematurely without getting more data
o Learn many models, not Just One
• Ensembles ! – Change the hypothesis space
• Netflix prize
• E.g. Bagging, Boosting, Stacking
o Simplicity Does not necessarily imply Accuracy
o Representable Does not imply Learnable
• Just because a function can be represented does not
mean it can be learned
o Correlation Does not imply Causation
o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o A few useful things to know about machine learning - by Pedro Domingos
§ http://dl.acm.org/citation.cfm?id=2347755
Scenario – Clustering with Spark
o InterGallactic Airlines have the GallacticHoppers frequent flyer program & have
data about their customers who participate in the program.
o The airlines execs have a feeling that other airlines will poach their customers if
they do not keep their loyal customers happy.
o So the business wants to customize promotions for their frequent flier program.
o Can they just have one type of promotion ?
o Should they have different types of incentives ?
o Who exactly are the customers in their GallacticHoppers program ?
o Recently they have deployed an infrastructure with Spark
o Can Spark help in this business problem ?
Clustering - Theory
o Clustering is unsupervised learning
o While the computers can dissect a dataset into “similar” clusters, it still needs
human direction & domain knowledge to interpret & guide
o Two types:
• Centroid based clustering – k-means clustering
• Tree based Clustering – hierarchical clustering
o Spark implements Scalable K-Means++
• Paper : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
Lookout for these interesting Spark features
o Application of Statistics toolbox
o Center & Scale RDD
o Filter RDDs
Clustering - API
o from pyspark.mllib.clustering import KMeans
o KMeans.train
o train(cls, data, k, maxIterations=100, runs=1, initializationMode="k-means||")
o K = number of clusters to create, default=2
o initializationMode = The initialization algorithm. This can be either "random" to
choose random points as initial cluster centers, or "k-means||" to use a parallel
variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012). Default:
k-means||
o KMeansModel.predict
o Maps a point to a cluster
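Putting the API together, a minimal sketch - the CSV layout, the path argument, and k=5 are assumptions for illustration, and a live SparkContext `sc` is assumed:

```python
import math

def parse_point(line):
    # "35000,120,2" -> [35000.0, 120.0, 2.0]
    return [float(x) for x in line.strip().split(",")]

def distance(point, center):
    # Euclidean distance between a point and its cluster center.
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(point, center)))

def run_kmeans(sc, path, k=5):
    # Train a model and compute WSSSE (Within Set Sum of Squared Errors).
    from pyspark.mllib.clustering import KMeans
    data = sc.textFile(path).map(parse_point).cache()
    model = KMeans.train(data, k, maxIterations=100, runs=10,
                         initializationMode="k-means||")
    wssse = data.map(
        lambda p: distance(p, model.clusterCenters[model.predict(p)]) ** 2).sum()
    return model, wssse
```

KMeansModel.predict maps each point to its nearest center, which is what makes the WSSSE computation above a one-line RDD map.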
Data
iPython notebook at https://github.com/xsankar/cloaked-ironman
Read Data & Create RDD
Train & Predict
Calculate error
But Data is not even
So let us center & scale the data and try again
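One way to do that centering & scaling - a sketch, assuming Spark >= 1.2 which ships StandardScaler in pyspark.mllib.feature, and dense input vectors (withMean=True requires dense data):

```python
def standardize(x, mean, std):
    # Per-feature z-score; guard against a zero standard deviation.
    return 0.0 if std == 0 else (x - mean) / std

def center_and_scale(vectors_rdd):
    # vectors_rdd: RDD of dense feature vectors.
    from pyspark.mllib.feature import StandardScaler
    scaler = StandardScaler(withMean=True, withStd=True).fit(vectors_rdd)
    return scaler.transform(vectors_rdd)
```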
Looks Good
Let us try with 5 clusters
Let us map the cluster to our data
Interpretation
[Table: C# | AVG | Interpretation – one row per cluster 1 to 5]
Note:
• This is just a sample interpretation.
• In real life we would “noodle” over the clusters & tweak them to be useful, interpretable and distinguishable.
• Maybe 3 is more suited to create targeted promotions
Epilogue
o KMeans in Spark has enough controls
o It does a decent job
o We were able to control the clusters based on our experience (2 clusters is too
low, 10 is too high, 5 seems to be right)
o We can see that the Scalable K-Means has control over runs, parallelism, etc.
(Homework: explore the scalability)
o We were able to interpret the results with domain knowledge and arrive at a
scheme to solve the business opportunity
o Naturally we would tweak the clusters to fit the business viability. 20 clusters
with corresponding promotion schemes are unwieldy, even if the WSSSE is the
minimum.
Recommendation Engine
5:05
Recommendation & Personalization - Spark
Automated Analytics - Let Data tell the story
Feature Learning, AI, Deep Learning
Learning Models - fit parameters as it gets more data
Dynamic Models - model selection based on context
o Knowledge Based
o Demographic Based
o Content Based
o Collaborative Filtering
o Item Based
o User Based
o Latent Factor based
o User Rating
o Purchased
o Looked/Not purchased
Spark implements the user based ALS collaborative filtering
Ref:
ALS - Collaborative Filtering for Implicit Feedback Datasets, Yifan Hu; AT&T Labs, Florham Park, NJ; Koren, Y.; Volinsky, C.
ALS-WR - Large-Scale Parallel Collaborative Filtering for the Netflix Prize, Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, Rong Pan
Spark Collaborative Filtering API
o ALS.train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1)
o ALS.trainImplicit(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1,
alpha=0.01)
o MatrixFactorizationModel.predict(self, user, product)
o MatrixFactorizationModel.predictAll(self, usersProducts)
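A sketch of the end-to-end flow with this API - the MovieLens "::" line format and the hyperparameters (rank=10, lambda_=0.01) are illustrative assumptions, and a live SparkContext is assumed behind the ratings RDD:

```python
def parse_rating(line):
    # MovieLens "userID::movieID::rating::timestamp" -> (user, product, rating)
    user, product, rating = line.split("::")[:3]
    return (int(user), int(product), float(rating))

def train_and_score(ratings):
    # ratings: RDD of (user, product, rating) tuples.
    from pyspark.mllib.recommendation import ALS
    train, test = ratings.randomSplit([0.8, 0.2], seed=17)
    model = ALS.train(train, rank=10, iterations=5, lambda_=0.01)
    # Predict for every (user, product) pair held out in the test set
    users_products = test.map(lambda r: (r[0], r[1]))
    predictions = model.predictAll(users_products).map(
        lambda r: ((r[0], r[1]), r[2]))
    actuals = test.map(lambda r: ((r[0], r[1]), r[2]))
    # Mean squared error between held-out ratings and predictions
    mse = actuals.join(predictions).map(
        lambda kv: (kv[1][0] - kv[1][1]) ** 2).mean()
    return model, mse
```

The join keyed on (user, product) is where the HashJoin noted in the Epilogue shows up.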
Read & Parse
Split & Train
Evaluate
Epilogue
o We explored interesting APIs in Spark
o ALS-Collab Filtering
o RDD Operations
• Join (HashJoin)
• In memory, Grace, Recursive hash join
http://technet.microsoft.com/en-us/library/ms189313(v=sql.105).aspx
Questions ?
4:45
Reference
1. SF Scala & SF Bay Area Machine Learning, Joseph Bradley: Decision Trees on Spark
http://functional.tv/post/98342564544/sfscala-sfbaml-joseph-bradley-decision-trees-on-spark
2. http://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering
3. http://stats.stackexchange.com/questions/19216/variables-are-often-adjusted-e-g-standardised-before-making-a-model-when-is
4. http://funny-pictures.picphotos.net/tongue-out-smiley-face/smile-day.net*wp-content*uploads*2012*01*Tongue-Out-Smiley-Face1.jpg/
5. https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup
6. http://www.rosebt.com/1/post/2011/10/big-data-analytics-maturity-model.html
7. http://blogs.gartner.com/matthew-davis/
Essential Reading List
o A few useful things to know about machine learning - by Pedro Domingos
• http://dl.acm.org/citation.cfm?id=2347755
o The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert
• http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o http://www.no-free-lunch.org/
o Controlling the false discovery rate: a practical and powerful approach to multiple testing, Benjamini, Y. and Hochberg, Y.
• http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
• http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o Avoid these three mistakes, James Faghmous
• https://medium.com/about-data/73258b3848a4
o Leakage in Data Mining: Formulation, Detection, and Avoidance
• http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
For your reading & viewing pleasure … An ordered List
① An Introduction to Statistical Learning
• http://www-bcf.usc.edu/~gareth/ISL/
② ISL Class Stanford/Hastie/Tibshirani at their best - Statistical Learning
• http://online.stanford.edu/course/statistical-learning-winter-2014
③ Prof. Pedro Domingos
• https://class.coursera.org/machlearning-001/lecture/preview
④ Prof. Andrew Ng
• https://class.coursera.org/ml-003/lecture/preview
⑤ Prof. Abu-Mostafa, CaltechX: CS1156x: Learning From Data
• https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥ Mathematicalmonk @ YouTube
• https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦ The Elements Of Statistical Learning
• http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
References:
o An Introduction to scikit-learn, pycon 2013, Jake Vanderplas
• http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning
o Advanced Machine Learning with scikit-learn, pycon 2013, Strata 2014, Olivier Grisel
• http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn
o Just The Basics, Strata 2013, William Cukierski & Ben Hamner
• http://strataconf.com/strata2013/public/schedule/detail/27291
o The Problem of Multiple Testing
• http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf
o Thanks to Ana Crisan for the Titanic inset. Picture courtesy http://emileeid.com/2012/02/11/titanic-3d-exclusive-posters/
The Beginning As The End
How did we do ?
4:45
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
 
R, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science CompetitionsR, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science Competitions
 
Data Science Folk Knowledge
Data Science Folk KnowledgeData Science Folk Knowledge
Data Science Folk Knowledge
 
Data Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsData Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science Competitions
 
Bayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive BayesBayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive Bayes
 
AWS VPC distilled for MongoDB devOps
AWS VPC distilled for MongoDB devOpsAWS VPC distilled for MongoDB devOps
AWS VPC distilled for MongoDB devOps
 
The Art of Social Media Analysis with Twitter & Python
The Art of Social Media Analysis with Twitter & PythonThe Art of Social Media Analysis with Twitter & Python
The Art of Social Media Analysis with Twitter & Python
 
Big Data Engineering - Top 10 Pragmatics
Big Data Engineering - Top 10 PragmaticsBig Data Engineering - Top 10 Pragmatics
Big Data Engineering - Top 10 Pragmatics
 
Scrum debrief to team
Scrum debrief to team Scrum debrief to team
Scrum debrief to team
 
The Art of Big Data
The Art of Big DataThe Art of Big Data
The Art of Big Data
 
Precision Time Synchronization
Precision Time SynchronizationPrecision Time Synchronization
Precision Time Synchronization
 
The Hitchhiker’s Guide to Kaggle
The Hitchhiker’s Guide to KaggleThe Hitchhiker’s Guide to Kaggle
The Hitchhiker’s Guide to Kaggle
 
Nosql hands on handout 04
Nosql hands on handout 04Nosql hands on handout 04
Nosql hands on handout 04
 
Cloud Interoperability Demo at OGF29
Cloud Interoperability Demo at OGF29Cloud Interoperability Demo at OGF29
Cloud Interoperability Demo at OGF29
 
A Hitchhiker's Guide to NOSQL v1.0
A Hitchhiker's Guide to NOSQL v1.0A Hitchhiker's Guide to NOSQL v1.0
A Hitchhiker's Guide to NOSQL v1.0
 

Recently uploaded

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 

Recently uploaded (20)

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 

Data Science with Spark

  • 1. In Apache Spark: Foundations of Data Science with Spark. July 16, 2015. @ksankar // doubleclix.wordpress.com
  • 3. o Intro & Setup [8:00-8:20) • Goals/non-goals o Spark & Data Science DevOps [8:20-8:40) • Spark in the context of Data Science o Where Exactly is Apache Spark headed ? [8:40-9:30) • Spark Yesterday, Today & Tomorrow • Spark Stack o Break [9:30-10:00) o DataFrames for the Data Scientist [10:00-11:30) • pySpark Classes • Walkthru DataFrames • Hands-on Notebooks o [15] Discussions/Slack (11:30 - 11:45) Agenda : Introduction To Spark http://globalbigdataconference.com/52/santa-clara/big-data-developer-conference/schedule.html
  • 4. o Review (2:00-2:30) • 004-Orders-Homework-Solution • MLlib Statistical Toolbox • Summary, Correlations o [20] Linear Regression (2:30-2:45) o [20] “Mood Of the Union” (2:45-3:15) • State of the Union w/ Washington, Lincoln, FDR, JFK, Clinton, Bush & Obama • Map reduce, parse text o Break (3:15-3:30) o [60] Predicting Survivors with Classification (3:30-4:30) • Decision Trees • Naive Bayes (Titanic data set) o Break (4:30-4:45) o [20] Clustering (4:45-5:05) • K-means for Gallactic Hoppers! o [20] Recommendation Engine (5:05-5:25) • Collab Filtering w/ movie lens o [15] Discussions/Slack (5:45-6:00) Agenda : Data Wrangling w/ DataFrames & MLlib http://globalbigdataconference.com/52/santa-clara/big-data-developer-conference/schedule.html
  • 5. Goals & non-goals. Goals: ¤ Understand how to program Machine Learning with Spark & Python ¤ Focus on programming & ML application ¤ Give you a focused time to work thru examples § Work with me. I will wait if you want to catch-up ¤ Less theory, more usage - let us see if this works ¤ As straightforward as possible § The programs can be optimized. Non-goals: ¡ Go deep into the algorithms • We don’t have sufficient time. The topic could easily be a 5-day tutorial ! ¡ Dive into Spark internals • That is for another day ¡ The underlying computation, communication, constraints & distribution is a fascinating subject • Paco does a good job explaining them ¡ A passive talk • Nope. Interactive & hands-on
  • 6. About Me o Data Scientist • Decision Data Science & Product Data Science • Insights = Intelligence + Inference + Interface [https://goo.gl/s2KB6L] • Predicting NFL with Elo like Nate Silver & 538 [NFL : http://goo.gl/Q2OgeJ, NBA’15 : https://goo.gl/aUhdo3] o Have been speaking at OSCON [http://goo.gl/1MJLu], PyCon, PyData [http://vimeo.com/63270513, http://www.slideshare.net/ksankar/pydata-19] … o Full-day Spark workshop “Advanced Data Science w/ Spark” / Spark Summit-E’15 [https://goo.gl/7SBKTC] o Co-author : “Fast Data Processing with Spark”, Packt Publishing [http://goo.gl/eNtXpT] o Reviewer : “Machine Learning with Spark”, Packt Publishing o Have done lots of things: • Big Data (Retail, Bioinformatics, Financial, AdTech), Starting MS-CFRM, University of WA • Written Books (Web 2.0, Wireless, Java, …), Standards, some work in AI • Guest Lecturer at Naval PG School, … o Volunteer as Robotics Judge at FIRST Lego League World Competitions o @ksankar, doubleclix.wordpress.com
  • 7. Close Encounters — 1st ◦ This Tutorial — 2nd ◦ Do More Hands-on Walkthrough — 3rd ◦ Listen To Lectures ◦ More competitions …
  • 8. Spark Installation o Install Spark 1.4.1 in local Machine • https://spark.apache.org/downloads.html • Pre-built For Hadoop 2.6 is fine • Download & uncompress • Remember the path & use it wherever you see /usr/local/spark/ • I have downloaded in /usr/local & have a softlink spark to the latest version o Install iPython
  • 9. Tutorial Materials o Github : https://github.com/xsankar/global-bd-conf • Clone or download zip o Open terminal o cd ~/global-bd-conf o IPYTHON=1 IPYTHON_OPTS="notebook" /usr/local/spark/bin/pyspark --packages com.databricks:spark-csv_2.11:1.0.3 o Notes : • I have a soft link “spark” in my /usr/local that points to the spark version that I use. For example ln -s spark-1.4.1/ spark o Click on the ipython dashboard o Run 000-PreFlightCheck.ipynb o Run 001-TestSparkCSV.ipynb o Now you are ready for the workshop !
  • 10. Spark & Data Science DevOps 8:20
  • 11. Spark in the context of data science
  • 12. Data Science : The art of building a model with known knowns, which when let loose, works with unknown unknowns! Donald Rumsfeld is an armchair Data Scientist ! http://smartorg.com/2013/07/valuepoint19/ [2x2 quadrant: The World (Known/Unknown) vs. You (Known/Unknown)] o Others know, you don’t o What we do o Facts, outcomes or scenarios we have not encountered, nor considered o “Black swans”, outliers, long tails of probability distributions o Lack of experience, imagination o Potential facts, outcomes we are aware of, but not with certainty o Stochastic processes, Probabilities o Known Knowns o There are things we know that we know o Known Unknowns o That is to say, there are things that we now know we don't know o But there are also Unknown Unknowns o There are things we do not know we don't know
  • 13. The curious case of the Data Scientist o Data Scientist is multi-faceted & Contextual o Data Scientist should be building Data Products o Data Scientist should tell a story http://doubleclix.wordpress.com/2014/01/25/the-curious-case-of-the-data-scientist-profession/ Large is hard; Infinite is much easier ! – Titus Brown
  • 14. Data Science - Context o Scalable Model Deployment o Big Data automation & purpose-built appliances (soft/hard) o Manage SLAs & response times o Volume o Velocity o Streaming Data o Canonical form o Data catalog o Data Fabric across the organization o Access to multiple sources of data o Think Hybrid – Big Data Apps, Appliances & Infrastructure. Collect Store Transform o Metadata o Monitor counters & Metrics o Structured vs. Multi-structured o Flexible & Selectable § Data Subsets § Attribute sets o Refine model with § Extended Data subsets § Engineered Attribute sets o Validation run across a larger data set. Reason Model Deploy. Data Management Data Science o Dynamic Data Sets o 2-way key-value tagging of datasets o Extended attribute sets o Advanced Analytics. Explore Visualize Recommend Predict o Performance o Scalability o Refresh Latency o In-memory Analytics o Advanced Visualization o Interactive Dashboards o Map Overlay o Infographics ¤ Bytes to Business a.k.a. Build the full stack ¤ Find Relevant Data For Business ¤ Connect the Dots
  • 15. Volume Velocity Variety. Data Science - Context. Context Connectedness Intelligence Interface Inference “Data of unusual size” that can't be brute forced o Three Amigos o Interface = Cognition o Intelligence = Compute (CPU) & Computational (GPU) o Infer Significance & Causality
  • 16. Day in the life of a (super) Model. Intelligence Inference Data Representation Interface. Algorithms Parameters Attributes Data (Scoring) Model Selection Reason & Learn Models Visualize, Recommend, Explore Model Assessment Feature Selection Dimensionality Reduction
  • 17. Data Science Maturity Model & Spark [table, columns: Isolated Analytics / Integrated Analytics / Aggregated Analytics / Automated Analytics] Data: Small Data; Larger Data set; Big Data; Big Data Factory Model. Context: Local; Domain; Cross-domain + External; Cross-domain + External. Model, Reason & Deploy: • Single set of boxes, usually owned by the Model Builders • Departmental; • Deploy - Central Analytics Infrastructure • Models still owned & operated by Modelers • Partly Enterprise-wide; • Central Analytics Infrastructure • Model & Reason – by Model Builders • Deploy, Operate – by ops • Residuals and other metrics monitored by modelers • Enterprise-wide; • Distributed Analytics Infrastructure • AI Augmented models • Model & Reason – by Model Builders • Deploy, Operate – by ops • Data as a monetized service, extending to ecosystem partners. • Reports; • Dashboards; • Dashboards + some APIs; • Dashboards + Well-defined APIs + programming models. Type: • Descriptive & Reactive; • + Predictive; • + Adaptive; • Adaptive. Datasets: • All in the same box • Fixed data sets, usually in temp data spaces; • Flexible Data & Attribute sets; • Dynamic datasets with well-defined refresh policies. Workload: • Skunk works; • Business relevant apps with approx SLAs; • High performance appliance clusters; • Appliances and clusters for multiple workloads including real-time apps • Infrastructure for emerging technologies. Strategy: • Informal definitions • Data definitions buried in the analytics models; • Some data definitions; • Data catalogue, metadata & Annotations; • Big Data MDM Strategy
  • 18. The Sense & Sensibility of a DataScientist DevOps. Factory = Operational. Lab = Investigative. http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/
  • 19. Where exactly is Apache Spark headed ? Spark Yesterday, Today & Tomorrow … “Unified engine across diverse data sources, workloads & environments” 8:40
  • 20. http://free-stock-illustration.com/winding+road+clip+art Spark 1.x • Fast engine for big data processing • Fast to run code & fast to write code • In-memory computation graphs with compatibility with the Hadoop eco system and an interesting, very usable API • Iterative & interactive apps that operate on data multiple times, which are not a good use case for Hadoop. Spark 1.3 & beyond has been the catalyst for a renaissance in Data Science ! Spark 1.4+ • Multi-pass analytics - ML pipelines, GraphX • Ad-hoc queries - DataFrames • Real-time stream processing – Spark Streaming • Parallel Machine Learning Algorithms beyond the basic RDDs • More types of data sources as input & output • More integration with R to span statistical computing beyond “single-node tools” • More integration with apps like visualization dashboards • More performance with even larger datasets & complex applications – Project Tungsten. Spark Yesterday, Today & Tomorrow …
  • 21. Spark Directions. Data Science / Platform APIs / Streaming, DAG Visualization & Debugging / Execution Optimization (Project Tungsten) o DataFrames o ML Pipelines o SparkR o Growing the eco system § Data Sources - Uniform access to diverse data sources § Pluggable “smart” DataSource API for reading/writing DataFrame while minimizing I/O § Spark Packages § Deployment utilities for Google Compute, Azure & Job Server o Focus on CPU Efficiency o Run-Time Code Generation o Cache Locality & cache-aware data structures o Binary Format for aggregations o Spark-managed Memory o Off-heap memory management o Spark Streaming flow control & optimized state management
  • 23. RDD – The workhorse of Core Spark o Resilient Distributed Datasets • Collection that can be operated on in parallel o Transformations – create RDDs • Map, Filter, … o Actions – Get values • Collect, Take, … o We will apply these operations during this tutorial
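The transformation vs. action split above can be felt even without a cluster: like Spark transformations, Python generator pipelines are lazy, and nothing actually runs until a terminal operation (the analogue of an action) consumes them. A minimal local sketch, plain Python only, not Spark code:

```python
# Lazy "transformations": build a pipeline, nothing is computed yet.
data = range(1, 11)                           # stand-in for an RDD's data
mapped = (x * x for x in data)                # like rdd.map(lambda x: x * x)
evens = (x for x in mapped if x % 2 == 0)     # like rdd.filter(lambda x: x % 2 == 0)

# Eager "action": only now does the whole pipeline execute.
result = list(evens)                          # like rdd.collect()
print(result)                                 # [4, 16, 36, 64, 100]
```

The design choice mirrored here is the same one Spark makes: deferring work until an action lets the engine see the whole chain and optimize (or, in Spark's case, distribute) it.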
  • 24. DataFrame API. Spark Core / Spark SQL / Spark Streaming / Spark R / MLlib / GraphX / Packages / ML Pipelines. Advanced Analytics: Neural Networks, Deep Learning, Parameter Server. Languages: R, Scala, Java, Python. Catalyst Optimizer – optimize execution plan. Data Sources - Parquet, Hadoop, Cassandra, JSON, CSV, JDBC, … Tungsten Execution. RDD
  • 25. SQL Query / DataFrame -> Unresolved Logical Plan -> Logical Plan -> Optimized Logical Plan -> Physical Plans -> Cost Model -> Selected Physical Plan -> RDDs. Stages: Catalog, Analysis, Logical Optimization, Physical Planning, Code Generation. Query Optimization-Execution pipeline. Ref: Spark SQL paper
  • 26. Spark DataFrames for the Data Scientist. “A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with.” - From The Hitchhiker's Guide to the Galaxy, by Douglas Adams. DataFrames ! The Most Massively useful thing a Data Scientist can have … 10:00
  • 27. Data Science “folk knowledge” (1 of A) o "If you torture the data long enough, it will confess to anything." – Hal Varian, Computer Mediated Transactions o Learning = Representation + Evaluation + Optimization o It’s Generalization that counts • The fundamental goal of machine learning is to generalize beyond the examples in the training set o Data alone is not enough • Induction not deduction - Every learner should embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it o Machine Learning is not magic – one cannot get something from nothing • In order to infer, one needs the knobs & the dials • One also needs a rich expressive dataset. A few useful things to know about machine learning - by Pedro Domingos http://dl.acm.org/citation.cfm?id=2347755
  • 28. pyspark: pyspark.SparkContext() pyspark.SparkConf() pyspark.RDD() pyspark.Broadcast() pyspark.Accumulator() pyspark.SparkFiles() pyspark.StorageLevel(). Sub-packages: pyspark.sql, pyspark.streaming, pyspark.mllib, pyspark.ml. pyspark.streaming: pyspark.streaming.StreamingContext() pyspark.streaming.DStream() pyspark.streaming.kafka pyspark.streaming.kafka.Broker() … pyspark.sql: pyspark.sql.SQLContext() pyspark.sql.DataFrame() pyspark.sql.DataFrameNaFunctions() pyspark.sql.DataFrameStatFunctions() pyspark.sql.DataFrameReader() pyspark.sql.DataFrameWriter() pyspark.sql.Column() pyspark.sql.Row() pyspark.sql.functions() pyspark.sql.types() pyspark.sql.Window() pyspark.sql.WindowSpec() pyspark.sql.GroupedData() pyspark.sql.HiveContext(). pyspark.mllib: pyspark.mllib.classification pyspark.mllib.clustering pyspark.mllib.evaluation pyspark.mllib.feature pyspark.mllib.fpm pyspark.mllib.linalg pyspark.mllib.random pyspark.mllib.recommendation pyspark.mllib.regression pyspark.mllib.stat pyspark.mllib.tree pyspark.mllib.util. ML Pipeline APIs: pyspark.ml.Transformer pyspark.ml.Estimator pyspark.ml.Model pyspark.ml.Pipeline pyspark.ml.PipelineModel pyspark.ml.param pyspark.ml.feature pyspark.ml.classification pyspark.ml.recommendation pyspark.ml.regression pyspark.ml.tuning pyspark.ml.evaluation
  • 31. 3. Convert: pyspark.sql.DataFrame <-> table <-> pandas.DataFrame. sqlContext.registerDataFrameAsTable(df, "aTable"); df2 = sqlContext.table("aTable"); df = sqlContext.createDataFrame(pandas_df); p_df = df.toPandas()
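The toPandas()/createDataFrame() round-trip above is, at its core, a change of orientation: a pandas DataFrame is column-oriented, while a Spark DataFrame behaves like a collection of Rows. A plain-Python sketch of that round-trip, with no Spark or pandas involved (the helper names to_rows/to_columns are invented for illustration):

```python
def to_rows(columns):
    """Column-oriented dict of lists -> row-oriented list of dicts
    (the pandas -> Spark direction)."""
    names = list(columns)
    length = len(columns[names[0]])
    return [{n: columns[n][i] for n in names} for i in range(length)]

def to_columns(rows):
    """Row-oriented list of dicts -> column-oriented dict of lists
    (the Spark -> pandas direction)."""
    names = list(rows[0])
    return {n: [r[n] for r in rows] for n in names}

cols = {"price": [10.0, 2.5], "qty": [3, 4]}
rows = to_rows(cols)              # [{'price': 10.0, 'qty': 3}, {'price': 2.5, 'qty': 4}]
assert to_columns(rows) == cols   # the round-trip is lossless
```

The real toPandas() does the same reshaping, but note it also collects the whole distributed DataFrame to the driver, so it only makes sense for data that fits in one machine's memory.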
  • 32. 4. Columns & Rows (1 of 3) o Select a column • by the df(“<columnName>”) notation or the df.<columnName> notation • The recommended way is df(“<columnName>”), the reason being that a column name can collide with a DataFrame method if we use the df.<columnName> form o Column-wise operations like +, -, *, / , % (modulo), &&, ||, <, <=, > and >= • df(“total”) = df(“price”) * df(“qty”) • the inequality operator is !==, the usual equalTo operator is === and there is an equality test that is safe for null values, <=> o Meta operations – type conversion (cast), alias, not null, … • df_cars.mpg.cast("double").alias('mpg') o Run arbitrary udfs on a column (see next page)
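The column expressions above (df(“total”) = df(“price”) * df(“qty”), and the null-safe <=> comparison) operate elementwise over whole columns, with nulls propagating through arithmetic and ordinary equality. A local sketch of those semantics using plain Python lists standing in for columns (illustrative only, not the pySpark API):

```python
price = [10.0, 2.5, None]
qty = [3, 4, 5]

# Elementwise arithmetic, like df("total") = df("price") * df("qty");
# a None (null) in either operand yields None, as in Spark SQL.
total = [p * q if p is not None and q is not None else None
         for p, q in zip(price, qty)]
assert total == [30.0, 10.0, None]

def eq(a, b):
    """Like ===: null in, null out (three-valued logic)."""
    return None if a is None or b is None else a == b

def null_safe_eq(a, b):
    """Like <=>: never returns null; two nulls compare equal."""
    return a == b

assert eq(None, None) is None
assert null_safe_eq(None, None) is True
```

This is why <=> exists at all: a where clause built on === silently drops rows where either side is null, which is rarely what a filter on "same value, null included" intends.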
  • 33. 4. Columns & Rows (2 of 3) : Run arbitrary UDFs on a column
  • 34. 4. Columns & Rows (3 of 3) : Interesting operations … Adding a column …
  • 35. 5. DataFrame : RDD-like Operations
o df.sort(<sort expression>) or df.orderBy(<sort expression>) : returns a sorted DataFrame. There are multiple ways of specifying the sort expression. Use of orderBy is recommended (as the syntax is closer to SQL), for example:
df_orders_1.groupBy("CustomerID","Year").count().orderBy('count', ascending=False).show()
o df.filter(<condition>) or df.where(<condition>) : returns a new DataFrame after applying the <condition>, which is usually based on a column. Use of the where form is recommended (closer to the SQL world), for example:
df_orders.where(df_orders['shipCountry'] == 'France')
o df.coalesce(n) : returns a DataFrame with n partitions, same as the coalesce(n) method of RDD
o df.foreach(<function>) : applies a function to all the rows of a DataFrame
o df.map(lambda r: …) : applies the function to all the rows and returns the resulting objects
o df.flatMap(lambda r: …) : returns an RDD, flattened, after applying the function to all the rows of the DataFrame
o df.rdd : returns the DataFrame as an RDD of Row objects
o df.na.replace([<values to be replaced>], [<replacing values>], subset=[<columns>]), also available as DataFrame.replace() or DataFrameNaFunctions.replace() : an interesting function, very useful and a little strange syntax-wise. The recommended form is df.na.replace(), even though the .na namespace throws one off a little bit. Use subset= for column names. The syntax is different from the Scala syntax.
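The semantics of the operations above can be sketched in plain Python over a list of dicts, without a Spark cluster. The rows and column names below are made up for illustration; pySpark runs the same logic distributed over partitions.

```python
# Plain-Python sketch of filter/where, map, flatMap and groupBy+orderBy
# semantics, on a toy in-memory "table" (a list of dicts).
from collections import Counter

orders = [
    {"CustomerID": "C1", "Year": 2014, "shipCountry": "France"},
    {"CustomerID": "C1", "Year": 2014, "shipCountry": "Germany"},
    {"CustomerID": "C2", "Year": 2015, "shipCountry": "France"},
]

# where / filter : keep rows matching a condition
france = [r for r in orders if r["shipCountry"] == "France"]

# map : exactly one output per input row
years = [r["Year"] for r in orders]

# flatMap : zero-or-more outputs per input row, flattened into one sequence
chars = [c for r in orders for c in r["CustomerID"]]

# groupBy("CustomerID","Year").count() then orderBy count descending
counts = Counter((r["CustomerID"], r["Year"]) for r in orders)
by_count = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
```

The last two lines mirror the `groupBy(...).count().orderBy('count', ascending=False)` example on the slide: group keys become dict keys, and the sort by value is the `orderBy`.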
  • 36. 6. DataFrame : Actions
o cache()
o collect(), collectAsList()
o count()
o describe()
o first(), head(), show(), take()
o …
  • 37. 7. DataFrame : Scientific Functions
  • 38. 8. DataFrame : Statistical Functions
The pair-wise frequency (contingency table) of transmission type and number of speeds shows interesting observations:
• All automatic cars in the dataset are 3-speed, while most of the manual transmission cars have 4 or 5 speeds
• Almost all the manual cars have 2 barrels, while the automatic cars have 2 and 4 barrels
  • 39. 9. DataFrame : Aggregate Functions
o The pyspark.sql.functions class (org.apache.spark.sql.functions for Scala) contains the aggregation functions
o There are two types of aggregations : over column values directly, and over subsets of column values, i.e. values grouped by some other columns
• pyspark.sql.functions.avg("sales")
• df.groupBy("year").agg({"sales": "avg"})
o count(), countDistinct()
o first(), last()
  • 40. 10. DataFrame : na
o One of the tenets of big data and data science is that data is never fully clean. While we can handle types, formats et al, missing values are always challenging
o One easy solution is to drop the rows that have missing values, but then we would lose valuable data in the columns that do have values
o A better solution is to impute data based on some criteria. It is true that data cannot be created out of thin air, but data can be inferred with some success, which is better than dropping the rows:
• We can replace null with 0
• A better solution is to replace numerical values with the average of the rest of the valid values; for categorical values, replacing with the most common value is a good strategy
• We could use the mode or median instead of the mean
• Another good strategy is to infer the missing value from other attributes, i.e. "evidence from multiple fields"
• For example, the Titanic data has a name field; to impute a missing age, we could use the Mr., Master., Mrs. or Miss. designation in the name and fill in the average age for the corresponding designation. A row with a missing age and "Master." in the name would get the average age of all records with "Master."
• There are also fields for number of siblings and number of spouses; we could average the age based on the values of those fields
• We could even average the ages obtained from the different strategies
  • 41. 10. DataFrame : na (continued)
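The title-based imputation strategy above can be sketched in plain Python. The rows, names and ages below are invented for illustration, not the real Titanic data; only the strategy (average age per title, then fill in the gaps) is taken from the slide.

```python
# Hedged sketch: impute missing ages from the title embedded in the name.
from statistics import mean

rows = [
    {"name": "Smith, Mr. John",  "age": 35.0},
    {"name": "Jones, Mr. Alan",  "age": 41.0},
    {"name": "Day, Master. Tom", "age": 4.0},
    {"name": "Low, Master. Sam", "age": None},   # missing -> to be imputed
]

def title_of(name):
    # crude title extractor: the first token ending with '.'
    return next(t for t in name.split() if t.endswith("."))

# average age per title, over the rows that do have an age
by_title = {}
for r in rows:
    if r["age"] is not None:
        by_title.setdefault(title_of(r["name"]), []).append(r["age"])
avg = {t: mean(v) for t, v in by_title.items()}

# fill in each missing age with the average for that row's title
for r in rows:
    if r["age"] is None:
        r["age"] = avg[title_of(r["name"])]
```

In pySpark the same idea would be expressed with a groupBy/agg to get the per-title averages and `df.na.fill()` / a udf to apply them.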
  • 42. 11. Joins/Set Operations a.k.a. Language Integrated Queries
  • 43. 12. SQL on Tables
  • 44. Hands-On
o 003-DataFrame-For-DS
• Understand and run the iPython Notebook
o 004-Orders
• Homework – we will go thru the solution when we meet in the afternoon
  • 45. Data Wrangling with Spark 2:00
  • 46. Algorithm Spectrum (a spectrum running from Machine Learning through "Cute Math" to Artificial Intelligence)
o Regression o Logit o CART o Ensemble : Random Forest o Clustering o KNN o Genetic Alg o Simulated Annealing o Collab Filtering o SVM o Kernels o SVD o NNet o Boltzmann Machine o Feature Learning
  • 47. Statistical Toolbox
o Sample data : car mileage data
  • 50. Linear Regression - API
o LabeledPoint : the features and label of a data point
o LinearModel : weights, intercept
o LinearRegressionModelBase : predict()
o LinearRegressionModel
o LinearRegressionWithSGD : train(cls, data, iterations=100, step=1.0, miniBatchFraction=1.0, initialWeights=None, regParam=1.0, regType=None, intercept=False)
o LassoModel : least-squares fit with an l_1 penalty term
o LassoWithSGD : train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)
o RidgeRegressionModel : least-squares fit with an l_2 penalty term
o RidgeRegressionWithSGD : train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)
  • 51. Basic Linear Regression
  • 52. Use LR model for prediction & calculate MSE
  • 53. Step size is important, the model can diverge !
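Why the step size matters can be shown with a tiny gradient-descent sketch in plain Python: on a perfect line y = 2x, a sensible step converges to the true weight while a large step overshoots on every update and diverges. This is a one-weight, squared-error toy, not the MLlib SGD implementation; the data and step values are made up.

```python
# Gradient descent on y = 2x with squared-error loss, one weight, no intercept.
data = [(x, 2.0 * x) for x in range(1, 6)]   # perfect line y = 2x

def gd(step, iterations=50):
    w = 0.0
    n = len(data)
    for _ in range(iterations):
        # gradient of mean squared error wrt w
        grad = sum(2 * (w * x - y) * x for x, y in data) / n
        w -= step * grad
    return w

def mse(w):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

w_good = gd(step=0.05)   # converges near the true weight w = 2
w_bad = gd(step=1.0)     # overshoots every update and blows up
```

With step=0.05 each update shrinks the error toward w = 2; with step=1.0 each update multiplies the error, which is the divergence the slide warns about.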
  • 57. “Mood Of the Union” with TF-IDF 2:45
  • 58. Scenario – Mood Of the Union
o It has been said that the State of the Union speech by the President of the USA reflects the social challenges faced by the country
o If so, can we infer the mood of the country by analyzing SOTU ?
o If we embark on this line of thought, how would we do it with Spark & Python ?
o Is it different from Hadoop MapReduce ?
o Is it better ?
  • 59. POA (Plan Of Action)
o Collect the State of the Union speeches by George Washington, Abe Lincoln, FDR, JFK, Bill Clinton, GW Bush & Barack Obama
o Read the 7 SOTUs from the 7 presidents into 7 RDDs
o Create word vectors
o Transform into word frequency vectors
o Remove stock common words
o Inspect the top n words to see if they reflect the sentiment of the time
o Compute set differences and see how new words have cropped up
o Compute TF-IDF (homework!)
  • 60. Look out for these interesting Spark features
o RDD map-reduce
o How to parse input
o Removing common words
o Sorting an RDD by value
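The plan above, sketched in plain Python: tokenize, drop common words, count, sort by frequency, take set differences. The two "speeches" and the stop-word list below are toy stand-ins, not the real SOTU texts; in pySpark the same steps become map/reduceByKey, a broadcast stop-word set, sortBy on the value, and subtractByKey.

```python
# Word-frequency sketch of the SOTU plan, on toy strings.
from collections import Counter

stop = {"the", "of", "and", "to", "a", "in"}   # stand-in common-word list

def word_freq(text):
    words = [w.strip(".,;").lower() for w in text.split()]
    return Counter(w for w in words if w and w not in stop)

fdr = word_freq("The war and the depression test the nation.")
obama = word_freq("The economy and jobs and innovation drive the nation.")

# top words, sorted by value (count) descending
top_obama = [w for w, _ in obama.most_common(3)]

# set difference: words in one speech that do not appear in the other
new_words = set(obama) - set(fdr)
```

`most_common` is the "sort a map by values" trick, and the set difference is the analogue of `subtractByKey` used later in the epilogue.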
  • 61. Read & Create word vector (iPython notebook at https://github.com/xsankar/cloaked-ironman)
  • 62. Remove Common Words – 1 of 3 (iPython notebook at https://github.com/xsankar/cloaked-ironman)
  • 63. Remove Common Words – 2 of 3
  • 64. Remove Common Words – 3 of 3
  • 65. FDR vs. Barack Obama as reflected by SOTU
  • 66. Barack Obama vs. Bill Clinton
  • 67. GWB vs. Abe Lincoln as reflected by SOTU
  • 68. Epilogue
o Interesting exercise
o Highlights
• Map-reduce in a couple of lines !
• But it is not exactly the same as Hadoop MapReduce (see the excellent blog by Sean Owen: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/)
• Set differences using subtractByKey
• Ability to sort a map by values (or any arbitrary function, for that matter)
o To explore as homework:
• TF-IDF in http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
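The TF-IDF homework can be sketched from the usual definitions: tf is the term count in a document, and idf = log(N / df), where df is the number of documents containing the term. This mirrors the idea only; MLlib's HashingTF/IDF uses feature hashing and a smoothed idf, and the documents below are toy token lists.

```python
# TF-IDF from first principles on toy documents.
import math

docs = [
    ["war", "peace", "nation"],
    ["economy", "jobs", "nation"],
    ["war", "economy", "nation"],
]
N = len(docs)

def idf(term):
    # inverse document frequency: rarer across documents -> larger weight
    df = sum(1 for d in docs if term in d)
    return math.log(N / df)

def tfidf(term, doc):
    return doc.count(term) * idf(term)

# "nation" appears in every document -> idf 0, so it carries no signal,
# which is exactly why TF-IDF downweights stock common words.
```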
  • 70. Predicting Survivors with Classification 3:30
  • 71. Data Science “folk knowledge” (Wisdom of Kaggle) – Jeremy’s Axioms
o Iteratively explore data
o Tools
• Excel format, Perl, Perl book, Spark !
o Get your head around data
• Pivot table
o Don’t over-complicate
o If people give you data, don’t assume that you need to use all of it
o Look at pictures !
o History of your submissions – keep a tab
o Don’t be afraid to submit simple solutions
• We will do this during this workshop
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
  • 72. Classification - Scenario
Titanic passenger metadata : small; 3 predictors (class, sex, age); label : survived?
o This is a knowledge exercise
o Classify survival from the Titanic data
o Gives us a quick dataset to run & test classification
(iPython notebook at https://github.com/xsankar/cloaked-ironman)
  • 73. Classifying Classifiers
o Statistical
• Regression : Logistic Regression (a.k.a. Max Entropy Classifier)
• Naïve Bayes
• Bayesian Networks
o Structural
• Rule-based : Production Rules, Decision Trees
• Distance-based
– Functional : Linear, Spectral Wavelet
– Nearest Neighbor : kNN, Learning Vector Quantization
– Ensemble : Random Forests, Boosting, SVM
• Neural Networks : Multi-layer Perceptron
Ref: Algorithms of the Intelligent Web, Marmanis & Babenko
  • 75. Classification - Spark API
o Logistic Regression
o SVMWithSGD
o DecisionTrees
o Data as LabeledPoint (we will see in a moment)
o DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
o impurity – "entropy" or "gini"
o maxBins – a control to throttle communication at the expense of accuracy
• Larger = higher accuracy
• Smaller = less communication (as # of bins = number of instances)
o data adaptive – i.e. the decision tree samples on the driver and figures out the bin spacing, i.e. the places you slice for binning
o intelligent framework – we need this for scale
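The "gini" impurity named in the `impurity` parameter is a standard quantity the tree minimizes at each split: gini = 1 - Σ p_k², over the class proportions p_k at a node. A plain-Python sketch of that formula (the toy label lists are made up; MLlib computes the same quantity internally):

```python
# Gini impurity: 0 for a pure node, maximal when classes are evenly mixed.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(k) / n) ** 2 for k in set(labels))

pure = [1, 1, 1, 1]    # one class only -> impurity 0
mixed = [0, 1, 0, 1]   # 50/50 split -> maximum two-class impurity, 0.5
```

A split is good when it sends rows into children whose impurity, weighted by size, is lower than the parent's; "entropy" is the same idea with -Σ p_k log p_k instead.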
  • 76. Look out for these interesting Spark features
o Concept of LabeledPoint & how to create an RDD of LPs
o Print the tree
o Calculate accuracy & MSE from RDDs
  • 77. Read data & extract features (iPython notebook at https://github.com/xsankar/cloaked-ironman)
  • 79. Extract labels & features
  • 80. Calculate Accuracy & MSE
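How the notebook's accuracy and MSE numbers fall out of (label, prediction) pairs can be shown over a plain list instead of an RDD. The pairs below are made up; note that for 0/1 labels the MSE is just the misclassification rate, so the two metrics are complementary here.

```python
# Accuracy and MSE from (label, prediction) pairs.
pairs = [(1, 1), (0, 0), (1, 0), (0, 0)]   # (label, prediction), toy values

accuracy = sum(1 for y, p in pairs if y == p) / len(pairs)
mse = sum((y - p) ** 2 for y, p in pairs) / len(pairs)
```

In pySpark the same computations are a `filter(...).count() / total` and a `map(lambda p: (p[0]-p[1])**2).mean()` over the zipped labels-and-predictions RDD.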
  • 81. Use the NaiveBayes Algorithm
  • 82. Decision Tree – Best Practices
o maxDepth : tune with data/model selection
o maxBins : set low, monitor communications, increase if needed
o # RDD partitions : set to # of cores
• Usually the recommendation is that RDD partitions should be over-partitioned, i.e. "more partitions than cores" : tasks take different times, so over-partitioning utilizes the compute power and in the end the times average out
• But for machine learning, especially trees, all tasks are approximately equally compute-intensive, so over-partitioning doesn't help
• The Joe Bradley talk (reference below) has interesting insights: https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup
DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
  • 83. Future …
o Actually we should split the data into training & test sets
o Then use different feature sets to see if we can increase the accuracy
o Left as homework
o In 1.2 …
o Random Forest
• Bagging
• PR for Random Forest
o Boosting
o Alpine Lab Sequoia Forest : coordinating merge
o Model Selection Pipeline; Design Doc
  • 84. Boosting
Goal : model complexity (-), variance (-), prediction accuracy (+)
o "Output of weak classifiers into a powerful committee"
o Final prediction = weighted majority vote
o Later classifiers get misclassified points with higher weight, so they are forced to concentrate on them
o AdaBoost (Adaptive Boosting)
o Boosting vs. Bagging
• Bagging – independent trees <- Spark shines here
• Boosting – successively weighted
  • 85. Random Forests+
Goal : model complexity (-), variance (-), prediction accuracy (+)
o Builds a large collection of de-correlated trees & averages them
o Improves Bagging by selecting i.i.d* random variables for splitting
o Simpler to train & tune
o "Do remarkably well, with very little tuning required" – ESLII
o Less susceptible to overfitting (than boosting)
o Many RF implementations
• Original version – Fortran-77 ! By Breiman/Cutler
• Python, R, Mahout, Weka, Milk (ML toolkit for py), MATLAB
* i.i.d – independent and identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
  • 86. Ensemble Methods
Goal : model complexity (-), variance (-), prediction accuracy (+)
o Two steps
• Develop a set of learners
• Combine the results to develop a composite predictor
o Ensemble methods can take the form of:
• Using different algorithms
• Using the same algorithm with different settings
• Assigning different parts of the dataset to different classifiers
o Bagging & Random Forests are examples of ensemble methods
Ref: Machine Learning In Action
  • 87. Random Forests
o While boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables
o Simpler because it requires only two variables – no. of predictors (typically √k) & no. of trees (500 for a large dataset, 150 for smaller)
o Error prediction
• For each iteration, predict for the data that is not in the sample (OOB data)
• Aggregate the OOB predictions
• Calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate
• Can use this to search for the optimal # of predictors
• We will see how close this is to the actual error in the Heritage Health Prize
o Assumes equal cost for mis-prediction; can add a cost function
o Proximity matrix & applications like adding missing data, dropping outliers
Ref: R News Vol 2/3, Dec 2002; Statistical Learning from a Regression Perspective : Berk; A Brief Overview of RF by Dan Steinberg
  • 88. Why didn’t RF do better ? Bias/Variance
o High bias
• Due to underfitting
• Add more features
• More sophisticated model (quadratic terms, complex equations, …)
• Decrease regularization
o High variance
• Due to overfitting
• Use fewer features
• Use more training samples
• Increase regularization
(Learning-curve figure : prediction error vs. training error; "need more features or a more complex model to improve" vs. "need more data to improve". Ref: Strata 2013 tutorial by Olivier Grisel)
'Bias is a learner’s tendency to consistently learn the same wrong thing.' -- Pedro Domingos
http://www.slideshare.net/ksankar/data-science-folk-knowledge
  • 91. Data Science “folk knowledge” (3 of A)
o More data beats a cleverer algorithm
• Or conversely, select algorithms that improve with data
• Don’t optimize prematurely without getting more data
o Learn many models, not just one
• Ensembles ! – change the hypothesis space
• Netflix prize
• E.g. Bagging, Boosting, Stacking
o Simplicity does not necessarily imply accuracy
o Representable does not imply learnable
• Just because a function can be represented does not mean it can be learned
o Correlation does not imply causation
o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o A few useful things to know about machine learning - by Pedro Domingos
• http://dl.acm.org/citation.cfm?id=2347755
  • 92. Scenario – Clustering with Spark
o InterGallactic Airlines run the GallacticHoppers frequent flyer program & have data about the customers who participate in it
o The airline execs have a feeling that other airlines will poach their customers if they do not keep their loyal customers happy
o So the business wants to customize promotions for the frequent flyer program
o Can they just have one type of promotion ?
o Should they have different types of incentives ?
o Who exactly are the customers in the GallacticHoppers program ?
o Recently they have deployed an infrastructure with Spark
o Can Spark help with this business problem ?
  • 93. Clustering - Theory
o Clustering is unsupervised learning
o While computers can dissect a dataset into “similar” clusters, it still needs human direction & domain knowledge to interpret & guide
o Two types:
• Centroid-based clustering – k-means clustering
• Tree-based clustering – hierarchical clustering
o Spark implements the Scalable KMeans++
• Paper : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
  • 94. Look out for these interesting Spark features
o Application of the statistics toolbox
o Center & scale an RDD
o Filter RDDs
  • 95. Clustering - API
o from pyspark.mllib.clustering import KMeans
o KMeans.train
o train(cls, data, k, maxIterations=100, runs=1, initializationMode="k-means||")
o k = number of clusters to create, default=2
o initializationMode = the initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: "k-means||"
o KMeansModel.predict
• Maps a point to a cluster
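What `KMeansModel.predict` does once training is done can be sketched in a few lines of plain Python: assign a point to the nearest cluster center by Euclidean distance. The two centers below are made up for illustration; in MLlib they come out of `KMeans.train`.

```python
# Nearest-center assignment, the core of KMeansModel.predict.
import math

centers = [(0.0, 0.0), (10.0, 10.0)]   # toy cluster centers

def predict(point):
    # index of the center closest to the point (Euclidean distance)
    dists = [math.dist(point, c) for c in centers]
    return dists.index(min(dists))
```

Training alternates exactly this assignment step with recomputing each center as the mean of its assigned points, until the centers stop moving.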
  • 96. Data (iPython notebook at https://github.com/xsankar/cloaked-ironman)
  • 97. Read Data & Create RDD
  • 98. Train & Predict
  • 100. But the data is not even
  • 101. So let us center & scale the data and try again
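The center-and-scale step is plain z-score standardization: subtract the mean and divide by the standard deviation, per feature, so that k-means distances are not dominated by large-unit columns (e.g. miles flown vs. trip counts). A sketch in plain Python with made-up mileage values:

```python
# Standardize one feature column: mean 0, unit (population) std deviation.
from statistics import mean, pstdev

def center_scale(values):
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

miles = [10000.0, 20000.0, 30000.0]   # toy frequent-flyer mileage column
scaled = center_scale(miles)
```

In the notebook the same transform is applied per column of the RDD; MLlib also provides this as a feature-scaling utility.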
  • 102. Looks good ! Let us try with 5 clusters
  • 103. Let us map the clusters to our data
  • 104. Interpretation
(Table : cluster # | averages | interpretation, for clusters 1-5)
Note:
• This is just a sample interpretation
• In real life we would “noodle” over the clusters & tweak them to be useful, interpretable and distinguishable
• Maybe 3 is more suited to create targeted promotions
  • 105. Epilogue
o KMeans in Spark has enough controls
o It does a decent job
o We were able to control the clusters based on our experience (2 clusters is too low, 10 is too high, 5 seems to be right)
o We can see that the Scalable KMeans has control over runs, parallelism et al. (homework : explore the scalability)
o We were able to interpret the results with domain knowledge and arrive at a scheme to solve the business opportunity
o Naturally we would tweak the clusters to fit business viability; 20 clusters with corresponding promotion schemes are unwieldy, even if the WSSE is the minimum
  • 107. Recommendation & Personalization - Spark
o Automated analytics : let data tell the story
o Feature learning, AI, deep learning
o Learning models : fit parameters as they get more data
o Dynamic models : model selection based on context
o Recommender types : knowledge based, demographic based, content based, collaborative filtering (item based, user based, latent factor based)
o Signals : user rating; purchased; looked/not purchased
o Spark implements ALS (latent factor based) collaborative filtering
Ref: ALS - Collaborative Filtering for Implicit Feedback Datasets, Yifan Hu (AT&T Labs, Florham Park, NJ), Koren, Y., Volinsky, C.
ALS-WR - Large-Scale Parallel Collaborative Filtering for the Netflix Prize, Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, Rong Pan
  • 108. Spark Collaborative Filtering API
o ALS.train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1)
o ALS.trainImplicit(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, alpha=0.01)
o MatrixFactorizationModel.predict(self, user, product)
o MatrixFactorizationModel.predictAll(self, usersProducts)
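Once ALS has learned the latent factors (the `rank`-dimensional vectors), `MatrixFactorizationModel.predict` reduces to a dot product between a user's vector and a product's vector. The rank-2 factors below are made up for illustration; in practice ALS alternates least-squares solves to learn them from the ratings.

```python
# Predicted rating = dot(user latent vector, product latent vector).
user_factors = {"u1": [0.9, 0.1]}       # toy: likes genre A, not genre B
movie_factors = {"mA": [1.0, 0.0],      # toy: pure genre A movie
                 "mB": [0.0, 1.0]}      # toy: pure genre B movie

def predict(user, movie):
    u, m = user_factors[user], movie_factors[movie]
    return sum(a * b for a, b in zip(u, m))
```

So u1 scores mA far above mB, which is the whole mechanism behind the MovieLens recommendations in the hands-on.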
  • 109. Read & Parse
  • 110. Split & Train
  • 112. Epilogue
o We explored interesting APIs in Spark
o ALS collaborative filtering
o RDD operations
• Join (hash join)
• In-memory, grace, recursive hash join : http://technet.microsoft.com/en-us/library/ms189313(v=sql.105).aspx
  • 114. Reference
1. SF Scala & SF Bay Area Machine Learning, Joseph Bradley: Decision Trees on Spark – http://functional.tv/post/98342564544/sfscala-sfbaml-joseph-bradley-decision-trees-on-spark
2. http://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering
3. http://stats.stackexchange.com/questions/19216/variables-are-often-adjusted-e-g-standardised-before-making-a-model-when-is
4. http://funny-pictures.picphotos.net/tongue-out-smiley-face/smile-day.net*wp-content*uploads*2012*01*Tongue-Out-Smiley-Face1.jpg/
5. https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup
6. http://www.rosebt.com/1/post/2011/10/big-data-analytics-maturity-model.html
7. http://blogs.gartner.com/matthew-davis/
  • 115. Essential Reading List
o A few useful things to know about machine learning - by Pedro Domingos
• http://dl.acm.org/citation.cfm?id=2347755
o The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert
• http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o http://www.no-free-lunch.org/
o Controlling the false discovery rate: a practical and powerful approach to multiple testing, Benjamini, Y. and Hochberg, Y.
• http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
• http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o Avoid these three mistakes, James Faghmous
• https://medium.com/about-data/73258b3848a4
o Leakage in Data Mining: Formulation, Detection, and Avoidance
• http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
  • 116. For your reading & viewing pleasure … an ordered list
① An Introduction to Statistical Learning
• http://www-bcf.usc.edu/~gareth/ISL/
② ISL class, Stanford/Hastie/Tibshirani at their best - Statistical Learning
• http://online.stanford.edu/course/statistical-learning-winter-2014
③ Prof. Pedro Domingos
• https://class.coursera.org/machlearning-001/lecture/preview
④ Prof. Andrew Ng
• https://class.coursera.org/ml-003/lecture/preview
⑤ Prof. Abu-Mostafa, CaltechX: CS1156x: Learning From Data
• https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥ mathematicalmonk @ YouTube
• https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦ The Elements of Statistical Learning
• http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
  • 117. References:
o An Introduction to scikit-learn, PyCon 2013, Jake Vanderplas
• http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning
o Advanced Machine Learning with scikit-learn, PyCon 2013 / Strata 2014, Olivier Grisel
• http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn
o Just The Basics, Strata 2013, William Cukierski & Ben Hamner
• http://strataconf.com/strata2013/public/schedule/detail/27291
o The Problem of Multiple Testing
• http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf
o Thanks to Ana Crisan for the Titanic inset. Picture courtesy http://emileeid.com/2012/02/11/titanic-3d-exclusive-posters/
  • 118. The Beginning As The End
How did we do ? 4:45