Data Science Warsaw, 2015.10.13
As simple as Apache Spark
About me
● At ICM for 5 years
● Knowledge Discovery in Documents
○ Object disambiguation, document classification, document similarity, etc.
● Big enough to use Big Data ecosystems
○ Hadoop since 2012
○ Spark since 2013 (2014 for real)
We still have about 19 minutes...
Obligatory word count example
■ Task: to count the number of occurrences of each word in a text
■ Frequently used when introducing the MapReduce paradigm
Tell me you already know it!
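The whole paradigm fits in a few lines. Below is a minimal plain-Python sketch of the map/reduce flow (not Spark code): the "map" phase emits a (word, 1) pair per word, the "reduce" phase sums the ones per key. In PySpark the same shape would be `flatMap` + `map` + `reduceByKey`.

```python
from collections import defaultdict

def word_count(text):
    # "map" phase: emit a (word, 1) pair for every word in the text
    pairs = [(word, 1) for word in text.split()]
    # "reduce" phase: sum the ones, grouped by word
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

print(word_count("to be or not to be"))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```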
All rights reserved, © 2015 ICM UW
Hadoop has a rich set of libraries
MapReduce — good for batch
Pig — scripts
Oozie — workflows
Mahout — machine learning
Hive — SQL queries
Impala — ad-hoc queries
Storm — real-time streaming
Giraph — graphs
Hadoop ecosystem is (too) large
■ Using multiple libraries results in
● long deployments, costly support, the burden of administering numerous configuration files
● lots of glue code between libraries
Let’s walk into Big Data like a Boss with Spark.
Spark ecosystem is versatile yet seamless
Ecosystem of high-level tools for various use-cases
Spark Core
Spark SQL
Spark Streaming — near real-time
MLlib — machine learning
GraphX — graph processing
SparkR — R on Spark
Spark ecosystem is versatile yet seamless
"One to rule them all"
Example: versatile yet seamless
1. Select positions from historic tweets.
2. Train a model of 10 clusters of neighbouring points.
3. Classify real-time tweets from the last 20 sec. every 3 sec. and count them for each cluster.
points = sc.runSql[Double, Double]("SELECT latitude, longitude FROM historic_tweets")
model = KMeans.train(points, 10)
sc.twitterStream(...)
.map(lambda t: (model.closestCenter(t.location), 1))
.reduceByKeyAndWindow(lambda x,y: x+y , Seconds(20), Seconds(3))
Source: The State of Spark, and Where We’re Going Next, presentation by M. Zaharia, 2013
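The snippet above is illustrative pseudocode from the cited talk (e.g. `sc.runSql` and `sc.twitterStream` are not literal Spark APIs). What `reduceByKeyAndWindow` computes in step 3 can be sketched in plain Python: every slide interval, count the events per key that arrived within the window. The event stream below (arrival_time_s, cluster_id) is made up for illustration.

```python
# Sketch of a 20 s window evaluated at a given instant: count the tweets
# per cluster whose arrival time falls inside the last 20 seconds.
events = [(1, "c0"), (2, "c1"), (5, "c0"), (19, "c0"), (21, "c1")]

def counts_in_window(events, now, window=20):
    counts = {}
    for t, cluster in events:
        if now - window < t <= now:  # event falls inside the window
            counts[cluster] = counts.get(cluster, 0) + 1
    return counts

print(counts_in_window(events, now=21))  # {'c1': 2, 'c0': 2}
```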
How to start?
First: download
Second: ./bin/pyspark
Third: Code much and often
from pyspark.mllib.recommendation import ALS, Rating

# Transform strings "user_id,movie_id,rating" into Rating(user_id: int, movie_id: int, rating: float)
data = sc.textFile("path/to/data.csv")
ratings = data.map(lambda s: s.split(',')).map(lambda arr: Rating(int(arr[0]), int(arr[1]), float(arr[2])))

# Build the recommendation model using ALS
# Factor the rating matrix A=[n,m] into B=[n,f] and C=[f,m], where A ~= B x C
numFeatures = 10; numIterations = 20
model = ALS.train(ratings, numFeatures, numIterations, 0.01)
Collaborative Filtering: Problem Statement
● We know which products are preferred by a particular user
● Having this information about preferences, recommend to a particular user a product which she or he is likely to purchase
● Iterative method
[Figure: the rating matrix expressed as a product of two low-rank matrices]
Collaborative Filtering: Problem Statement
[Figure: USER×MOVIE rating matrix = USER×~TASTE "demand" matrix X ~TASTE×MOVIE "supply" matrix]
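Numerically, the diagram says: a user×movie rating matrix A is approximated by a user×taste "demand" matrix B times a taste×movie "supply" matrix C. A tiny hand-made sketch in plain Python (all numbers invented for illustration):

```python
# A (3 users x 4 movies) ~= B (3 users x 2 tastes) x C (2 tastes x 4 movies)
B = [[1.0, 0.0],                # user 0 has taste #0 only
     [0.0, 1.0],                # user 1 has taste #1 only
     [1.0, 1.0]]                # user 2 has both tastes
C = [[5.0, 4.0, 0.0, 0.0],     # taste #0 "supplies" movies 0 and 1
     [0.0, 0.0, 4.0, 5.0]]     # taste #1 "supplies" movies 2 and 3

def matmul(X, Y):
    # Plain-Python matrix product: predicted rating for every (user, movie)
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

A = matmul(B, C)
print(A[2])  # user 2's predicted ratings for all four movies: [5.0, 4.0, 4.0, 5.0]
```

ALS learns B and C from the observed entries of A; the predicted rating for an unseen (user, movie) pair is simply the corresponding dot product.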
Demo time!
Tweets exploration
Demo time!
Spark ecosystem is versatile yet seamless
Ecosystem of high-level tools for various use-cases
Spark Core
Spark SQL
Spark Streaming — near real-time
MLlib — machine learning
GraphX — graph processing
SparkR — R on Spark
What next? Training!
Thank you!
Piotr Dendek
pdendek@icm.edu.pl
@pjden
