Data Science Warsaw, 2015.10.13
As simple as Apache Spark
About me
● At ICM for 5 years
● Knowledge Discovery in Documents
○ Object disambiguation, document classification, document similarity, etc.
● Big enough to use Big Data ecosystems
○ Hadoop since 2012
○ Spark since 2013 (2014 for real)
We still have about 19 minutes...
Obligatory word count example
■ Task: to count the number of occurrences of each word in a text
■ Frequently used when introducing the MapReduce paradigm
Tell me you already know it!
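The whole paradigm fits in a few lines. Below is a minimal plain-Python sketch of the map/reduce flow (not Spark code): the "map" phase emits a (word, 1) pair per word, the "reduce" phase sums the ones per key. In PySpark the same shape would be `flatMap` + `map` + `reduceByKey`.

```python
from collections import defaultdict

def word_count(text):
    # "map" phase: emit a (word, 1) pair for every word in the text
    pairs = [(word, 1) for word in text.split()]
    # "reduce" phase: sum the ones, grouped by word
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

print(word_count("to be or not to be"))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```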
All rights reserved, © 2015 ICM UW
Hadoop has a rich set of libraries
MapReduce — good for batch
Pig — scripts
Oozie — workflows
Mahout — machine learning
Hive — SQL queries
Impala — ad-hoc queries
Storm — real-time streaming
Giraph — graphs
Hadoop ecosystem is (too) large
■ Using multiple libraries results in
● long deployments, costly support, the burden of administering numerous configuration files
● lots of glue code between libraries
Let’s walk into Big Data like a Boss with Spark.
Spark ecosystem is versatile yet seamless
Ecosystem of high-level tools for various use-cases
Spark Core
Spark SQL
Spark Streaming — near real-time
MLlib — machine learning
GraphX — graph processing
SparkR — R on Spark
Spark ecosystem is versatile yet seamless
"One to rule them all"
Example: versatile yet seamless
1. Select positions from historic tweets.
2. Train a model of 10 clusters of neighbouring points.
3. Classify real-time tweets from the last 20 sec. every 3 sec. and count them for each cluster.
points = sc.runSql[Double, Double]("SELECT latitude, longitude FROM historic_tweets")
model = KMeans.train(points, 10)
sc.twitterStream(...)
.map(lambda t: (model.closestCenter(t.location), 1))
.reduceByKeyAndWindow(lambda x,y: x+y , Seconds(20), Seconds(3))
Source: The State of Spark, and Where We’re Going Next, presentation by M. Zaharia, 2013
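The snippet above is illustrative pseudocode from the cited talk (e.g. `sc.runSql` and `sc.twitterStream` are not literal Spark APIs). What `reduceByKeyAndWindow` computes in step 3 can be sketched in plain Python: every slide interval, count the events per key that arrived within the window. The event stream below (arrival_time_s, cluster_id) is made up for illustration.

```python
# Sketch of a 20 s window evaluated at a given instant: count the tweets
# per cluster whose arrival time falls inside the last 20 seconds.
events = [(1, "c0"), (2, "c1"), (5, "c0"), (19, "c0"), (21, "c1")]

def counts_in_window(events, now, window=20):
    counts = {}
    for t, cluster in events:
        if now - window < t <= now:  # event falls inside the window
            counts[cluster] = counts.get(cluster, 0) + 1
    return counts

print(counts_in_window(events, now=21))  # {'c1': 2, 'c0': 2}
```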
How to start?
First: download
Second: ./bin/pyspark
Third: Code much and often
from pyspark.mllib.recommendation import ALS, Rating

# Transform strings "user_id,movie_id,rating" into Rating(user_id: int, movie_id: int, rating: float)
data = sc.textFile("path/to/data.csv")
ratings = data.map(lambda s: s.split(',')).map(lambda arr: Rating(int(arr[0]), int(arr[1]), float(arr[2])))

# Build the recommendation model using ALS
# Factor the rating matrix A=[n,m] into B=[n,f] and C=[f,m], where A ~= B x C
numFeatures = 10; numIterations = 20
model = ALS.train(ratings, numFeatures, numIterations, 0.01)
Collaborative Filtering: Problem Statement
● We know which products are preferred by a particular user
● Having this information about preferences, recommend to a particular user a product which she or he is likely to purchase
● Iterative method
[Figure: the rating matrix expressed as a product of two low-rank matrices]
Collaborative Filtering: Problem Statement
[Figure: USER×MOVIE rating matrix = USER×~TASTE "demand" matrix X ~TASTE×MOVIE "supply" matrix]
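Numerically, the diagram says: a user×movie rating matrix A is approximated by a user×taste "demand" matrix B times a taste×movie "supply" matrix C. A tiny hand-made sketch in plain Python (all numbers invented for illustration):

```python
# A (3 users x 4 movies) ~= B (3 users x 2 tastes) x C (2 tastes x 4 movies)
B = [[1.0, 0.0],                # user 0 has taste #0 only
     [0.0, 1.0],                # user 1 has taste #1 only
     [1.0, 1.0]]                # user 2 has both tastes
C = [[5.0, 4.0, 0.0, 0.0],     # taste #0 "supplies" movies 0 and 1
     [0.0, 0.0, 4.0, 5.0]]     # taste #1 "supplies" movies 2 and 3

def matmul(X, Y):
    # Plain-Python matrix product: predicted rating for every (user, movie)
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

A = matmul(B, C)
print(A[2])  # user 2's predicted ratings for all four movies: [5.0, 4.0, 4.0, 5.0]
```

ALS learns B and C from the observed entries of A; the predicted rating for an unseen (user, movie) pair is simply the corresponding dot product.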
Demo time!
Tweets exploration
Demo time!
Spark ecosystem is versatile yet seamless
Ecosystem of high-level tools for various use-cases
Spark Core
Spark SQL
Spark Streaming — near real-time
MLlib — machine learning
GraphX — graph processing
SparkR — R on Spark
What next? Training!
Thank you!
Piotr Dendek
pdendek@icm.edu.pl
@pjden
