Page1 © Hortonworks Inc. 2014
Advanced Analytics with Apache Spark
and Apache Zeppelin in HDP
Hortonworks. We do Hadoop.
Alex Zeltov
Solutions Engineer
@azeltov
Page2 © Hortonworks Inc. 2014
In this workshop
• Introduction to HDP and Spark
• Build a data analytics application:
- Spark Programming: Scala, Python, R
- Core Spark: working with RDDs, DataFrames
- Spark SQL: structured data access
- Spark MLlib: predictive analytics
- Spark Streaming: real-time data processing
• Develop a recommendation engine using the "Collaborative
Filtering" method
• Conclusion and Q/A
Page3 © Hortonworks Inc. 2014
Introduction to HDP and Spark
http://hortonworks.com/hadoop/spark/
Page4 © Hortonworks Inc. 2014
Spark is certified as YARN Ready and is a part of HDP.
Hortonworks Data Platform 2.3 (architecture diagram):
• Batch, interactive & real-time data access: MapReduce, Apache Hive, Apache Pig, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm, ISV engines
• YARN: Data Operating System (cluster resource management)
• HDFS (Hadoop Distributed File System)
• Governance: Apache Falcon, Apache Sqoop, Apache Flume, Apache Kafka, Apache Atlas
• Operations: Apache Ambari, Apache ZooKeeper, Apache Oozie, Cloudbreak
• Security: Apache Ranger, Apache Knox, Apache Atlas, HDFS Encryption
• Deployment choice: Linux, Windows, on-premises, cloud
Page5 © Hortonworks Inc. 2014
Spark Components
Spark allows you to do data processing, ETL, machine learning,
stream processing, and SQL querying from one framework
Page6 © Hortonworks Inc. 2014
Emerging Spark Patterns
• Spark as a query federation engine
  - Bring data from multiple sources to join/query in Spark
• Use multiple Spark libraries together
  - Common to see Core, ML & SQL used together
• Use Spark with various Hadoop ecosystem projects
  - Use Spark & Hive together
  - Spark & HBase together
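To illustrate the first pattern (query federation), here is a minimal sketch; the MySQL table, credentials, and JSON path are hypothetical, and the load()/jsonFile() calls follow the Spark 1.x API used elsewhere in this deck:
// Hypothetical sketch: join a JDBC table with a JSON file through one SQLContext
val orders = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:mysql://localhost:3306/db1?user=root&password=xxx",
  "dbtable" -> "orders"))
val customers = sqlContext.jsonFile("/tmp/customers.json")
orders.registerTempTable("orders")
customers.registerTempTable("customers")
// one SQL query spanning both sources
sqlContext.sql("""SELECT c.name, SUM(o.amount) AS total
                  FROM orders o JOIN customers c ON o.customer_id = c.id
                  GROUP BY c.name""").show()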
Page7 © Hortonworks Inc. 2014
More Data Sources APIs
18/03/2016
Page8 © Hortonworks Inc. 2014
Spark Deployment Modes
• Spark Standalone Cluster:
– For developing Spark apps against a local Spark (similar to
developing/deploying in an IDE)
• Spark on YARN, in two modes:
– Spark driver (SparkContext) in local (yarn-client): the Spark driver runs in the
client process outside of the YARN cluster, and the ApplicationMaster is only used to
negotiate resources from the ResourceManager
– Spark driver (SparkContext) in the YARN AM (yarn-cluster): the Spark driver runs in
the ApplicationMaster spawned by a NodeManager on a slave node
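As an illustration of the two YARN modes, spark-submit invocations might look like the following sketch (the application class and jar are placeholders; the flags follow the Spark 1.x spark-submit syntax):
# yarn-client: driver runs in the client process, the AM only negotiates resources
./bin/spark-submit --master yarn-client --class com.example.MyApp my-app.jar
# yarn-cluster: driver runs inside the YARN ApplicationMaster on the cluster
./bin/spark-submit --master yarn-cluster --class com.example.MyApp my-app.jar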
Page9 © Hortonworks Inc. 2014
Spark on YARN
(diagram: YARN ResourceManager, Spark ApplicationMaster, monitoring UI)
Page10 © Hortonworks Inc. 2014
Spark UI
Page11 © Hortonworks Inc. 2014
Interacting with Spark
Page12 © Hortonworks Inc. 2014
Interacting with Spark
• Spark's interactive REPL shell (in Python or Scala)
• Web-based notebooks:
• Zeppelin: a web-based notebook that enables interactive data
analytics.
• Jupyter: evolved from the IPython project
• SparkNotebook: forked from scala-notebook
Page13 © Hortonworks Inc. 2014
Apache Zeppelin
• A web-based notebook that enables interactive data
analytics.
• Multiple language backends
• Multi-purpose notebook: the place for all your needs
  - Data Ingestion
  - Data Discovery
  - Data Analytics
  - Data Visualization
  - Collaboration
Page14 © Hortonworks Inc. 2014
Zeppelin - Multiple Language Backends
Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown and Shell.
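For example, one Zeppelin note can mix paragraphs for different interpreters by prefixing each paragraph (a sketch; the table and path are placeholders, not from the original deck):
%md
## Notes for this workshop (Markdown)

%sh
hadoop fs -ls /tmp

%pyspark
df = sqlContext.jsonFile("/tmp/people.json")

%sql
select age, count(*) from people group by age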
Page15 © Hortonworks Inc. 2014
Zeppelin - Dependency Management
• Load libraries recursively from a Maven repository
• Load libraries from the local filesystem
%dep
// add maven repository
z.addRepo("RepoName").url("RepoURL")
// add artifact from filesystem
z.load("/path/to.jar")
// add artifact from maven repository, with no dependency
z.load("groupId:artifactId:version").excludeAll()
Page16 © Hortonworks Inc. 2014
Spark & Zeppelin Pace of Innovation (timeline)
Spark tech previews: Spark 1.3.1 TP (5/2015), Spark 1.4.1 TP (8/2015), Spark 1.5.1 TP (Nov 2015), Spark 1.6 TP (Jan 2016)
Spark GA in HDP: HDP 2.2.4 - Spark 1.2.1 GA, HDP 2.3.0 - Spark 1.3.1 GA, HDP 2.3.2 - Spark 1.4.1 GA, HDP 2.3.4 - Spark 1.5.2* GA (Dec 2015), HDP 2.4.0 - Spark 1.6 GA (March 1st 2016), HDP 2.5.x - Spark 1.6.1* GA
Apache Zeppelin: TP (Oct 2015), TP refresh (March 1st 2016), GA planned (Q1 2016)
Page17 © Hortonworks Inc. 2014
Spark in HDP customer base - 2015
(chart: unique number of customers filing Spark tickets per quarter, Q1-Q4 2015)
132 customers filed Spark tickets in 2015
Page18 © Hortonworks Inc. 2014
Programming Spark
Page19 © Hortonworks Inc. 2014
How Does Spark Work?
• RDD
• Your data is loaded in parallel into structured collections
• Actions
• Manipulate the state of the working model by forming new RDDs
and performing calculations upon them
• Persistence
• Long-term storage of an RDD's state
Page20 © Hortonworks Inc. 2014
Resilient Distributed Datasets
• The primary abstraction in Spark
» Immutable once constructed
» Track lineage information to efficiently recompute lost data
» Enable operations on collections of elements in parallel
• You construct RDDs
» by parallelizing existing collections (lists)
» by transforming existing RDDs
» from files in HDFS or any other storage system
Page21 © Hortonworks Inc. 2014
RDDs
(diagram: an RDD of 25 items split into 5 partitions, distributed across 3 workers each running a Spark executor)
• Programmer specifies the number of partitions for an RDD
(default value used if unspecified)
• More partitions = more parallelism
Page22 © Hortonworks Inc. 2014
RDDs
• Two types of operations: transformations and actions
• Transformations are lazy (not computed immediately)
• A transformed RDD is executed when an action runs on it
• Persist (cache) RDDs in memory or on disk
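A small sketch of that laziness (assuming a placeholder HDFS path):
val lines  = sc.textFile("hdfs:///tmp/some-file.txt")   // transformation: nothing is read yet
val errors = lines.filter(_.contains("ERROR"))          // transformation: still nothing computed
errors.cache()                                          // mark the RDD for in-memory reuse
val n = errors.count()                                  // action: the file is read, filtered and cached now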
Page23 © Hortonworks Inc. 2014
Example RDD Transformations
• map(func)
• filter(func)
• distinct()
• All create a new dataset from an existing one
• Do not create the dataset until an action is performed (lazy)
• Each element in an RDD is passed to the target function and the
result forms a new RDD
Page24 © Hortonworks Inc. 2014
Example Action Operations
• count()
• reduce(func)
• collect()
• take(n)
• Either:
• Returns a value to the driver program
• Exports state to an external system
Page25 © Hortonworks Inc. 2014
Example Persistence Operations
• persist() -- takes options
• cache() -- only one option: in-memory
• Stores RDD values
• in memory (what doesn't fit is recalculated when necessary)
• Replication is an option for in-memory
• on disk
• blended
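A sketch of those choices (assuming a filtered RDD like the one in the file-processing example later in this deck; pick one storage level per RDD):
import org.apache.spark.storage.StorageLevel
val fltr = sc.textFile("hdfs:///tmp/some-file.txt").filter(_.length > 0)
fltr.cache()                                    // shorthand for persist(StorageLevel.MEMORY_ONLY)
// fltr.persist(StorageLevel.MEMORY_AND_DISK)   // blended: spill to disk what does not fit in memory
// fltr.persist(StorageLevel.MEMORY_ONLY_2)     // in-memory, replicated on two executors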
Page26 © Hortonworks Inc. 2014
Spark Applications
Are a definition in code of:
• RDD creation
• Actions
• Persistence
Results in the creation of a DAG (Directed Acyclic Graph) [workflow]
• Each DAG is compiled into stages
• Each stage is executed as a series of tasks
• Each task operates in parallel on assigned partitions
Page27 © Hortonworks Inc. 2014
Spark Context
• A Spark program first creates a SparkContext object
• Tells Spark how and where to access a cluster
• Use SparkContext to create RDDs
• SparkContext, SQLContext, ZeppelinContext:
• are automatically created and exposed as the variable names 'sc', 'sqlContext' and
'z', respectively, in both the Scala and Python environments in Zeppelin
• iPython and standalone programs must use a constructor to create a new SparkContext
Note that the Scala and Python environments share the same SparkContext, SQLContext and
ZeppelinContext instances.
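Outside Zeppelin (for example in a standalone application), a minimal sketch of constructing the contexts yourself; the app name and master are placeholders:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("MyApp").setMaster("yarn-client")
val sc = new SparkContext(conf)          // the entry point to the cluster
val sqlContext = new SQLContext(sc)      // built on top of the SparkContext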
Page28 © Hortonworks Inc. 2014
1. Resilient Distributed Dataset [RDD] Graph
val v = sc.textFile("hdfs://…some-hdfs-data")
v.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey(_ + _, 3)
 .collect()
(diagram: textFile -> flatMap -> map -> reduceByKey -> collect, producing
RDD[String] -> RDD[List[String]] -> RDD[(String, Int)] -> RDD[(String, Int)] -> Array[(String, Int)])
Page29 © Hortonworks Inc. 2014
Processing A File in Scala
//Load the file:
val file = sc.textFile("hdfs://…/user/DAW/littlelog.csv")
//Trim away any empty rows:
val fltr = file.filter(_.length > 0)
//Print out the remaining rows:
fltr.foreach(println)
Page30 © Hortonworks Inc. 2014
Looking at the State in the Machine
//run debug command to inspect RDD:
scala> fltr.toDebugString
//simplified output:
res1: String =
FilteredRDD[2] at filter at <console>:14
MappedRDD[1] at textFile at <console>:12
HadoopRDD[0] at textFile at <console>:12
Page31 © Hortonworks Inc. 2014
A Word on Anonymous Functions
Scala programmers make great use of anonymous functions, as can
be seen in the code:
flatMap( line => line.split(" ") )
(callouts: "line" is the argument to the function; "line.split(" ")" is the body of the function)
Page32 © Hortonworks Inc. 2014
Scala Functions Come In a Variety of Styles
flatMap( line => line.split(" ") )        // argument type inferred
flatMap((line:String) => line.split(" ")) // argument type declared explicitly
flatMap(_.split(" "))                     // no argument declared; the placeholder _ is used instead
The body of the function may include the placeholder _, which allows exactly one use of
one argument for each _ present. _ essentially means 'whatever you pass me'.
Page33 © Hortonworks Inc. 2014
And Finally - the Formal 'def'
def myFunc(line:String): Array[String] = {
  return line.split(",")
}
//and now that it has a name:
myFunc("Hi Mom, I'm home.").foreach(println)
(callouts: "line:String" is the argument to the function; "Array[String]" is the return type; the braces enclose the body)
Page34 © Hortonworks Inc. 2014
LAB: Spark RDD & Data Frames Demo - Philly Crime Data Set
http://sandbox.hortonworks.com:8081/#/notebook/2B6HKTZDK
Page35 © Hortonworks Inc. 2014
Spark DataFrames
Page36 © Hortonworks Inc. 2014
What are DataFrames?
• Distributed collection of data organized in columns
• Equivalent to tables in databases or data frames in R/Python
• Much richer optimization than any other implementation of DF
• Can be constructed from a wide variety of sources and APIs
Why DataFrames?
• Greater accessibility
• Declarative rather than imperative
• Catalyst Optimizer
Page37 © Hortonworks Inc. 2014
Writing a DataFrame
val df = sqlContext.jsonFile("/tmp/people.json")
df.show()
df.printSchema()
df.select("First Name").show()
df.select("First Name","Age").show()
df.filter(df("age")>40).show()
df.groupBy("age").count().show()
Page38 © Hortonworks Inc. 2014
Querying RDD Using SQL
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType,StructField,StringType}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val schemaString = "name age" // assumed column names for people.txt
val schema = StructType(schemaString.split(" ").map(fieldName =>
  StructField(fieldName, StringType, true)))
val people = sc.textFile("/tmp/people.txt")
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
peopleDataFrame.registerTempTable("people")
val results = sqlContext.sql("SELECT name FROM people")
results.map(t => "Name: " + t(0)).collect().foreach(println)
Page39 © Hortonworks Inc. 2014
Querying RDD Using SQL
// SQL statements can be run directly on RDDs
val teenagers =
  sqlC.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
// The results of SQL queries are SchemaRDDs and support
// normal RDD operations:
val nameList = teenagers.map(t => "Name: " + t(0)).collect()
// Language integrated queries (a la LINQ)
val teenagers =
  people.where('age >= 10).where('age <= 19).select('name)
Page40 © Hortonworks Inc. 2014
DataFrames for Apache Spark
(chart: time to aggregate 10 million integer pairs, in seconds, for DataFrame SQL / R / Python / Scala versus RDD Python / Scala)
DataFrames can be significantly faster than RDDs, and they
perform the same regardless of language.
Page41 © Hortonworks Inc. 2014
DataFrames - Transformations & Actions
Transformations: filter, select, drop, join
Actions: count, collect, show, take
Transformations contribute to the query plan, but
nothing is executed until an action is called.
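A sketch of the same point, re-using the people DataFrame from the earlier JSON example (an illustration, not from the original deck):
val df = sqlContext.jsonFile("/tmp/people.json")
val adults = df.filter(df("age") > 21).select("name", "age")  // transformations: only extend the query plan
adults.show()                                                 // action: the plan is optimized and executed now
adults.count()                                                // another action over the same plan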
Page42 © Hortonworks Inc. 2014
LAB: DataFrames + SQL, DataFrames JSON
http://sandbox.hortonworks.com:8081/#/notebook/2B4B7EWY7
http://sandbox.hortonworks.com:8081/#/notebook/2B5RMG4AM
Page43 © Hortonworks Inc. 2014
DataFrames and JDBC
val jdbc_attendees = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:mysql://localhost:3306/db1?user=root&password=xxx",
  "dbtable" -> "attendees"))
jdbc_attendees.show()
jdbc_attendees.count()
jdbc_attendees.registerTempTable("jdbc_attendees")
val countall = sqlContext.sql("select count(*) from jdbc_attendees")
countall.map(t => "Records count is " + t(0)).collect().foreach(println)
Page44 © Hortonworks Inc. 2014
Code 'select count'
Equivalent SQL statement:
SELECT count(*) FROM pagecounts WHERE state = 'FL'
Scala statement:
val file = sc.textFile("hdfs://…/log.txt")
val numFL = file.filter(line => line.contains("fl")).count()
scala> println(numFL)
1. Load the page as an RDD
2. Filter the lines of the page, eliminating any that do not contain "fl"
3. Count those lines that remain
4. Print the value of the counted lines containing 'fl'
Page45 © Hortonworks Inc. 2014
Spark SQL
Page46 © Hortonworks Inc. 2014
Platform APIs
• Joining data from different sources
• Access data using DataFrames / SQL
Page47 © Hortonworks Inc. 2014
Platform APIs
• Community plugins
• 100+ connectors
http://spark-packages.org/
Page48 © Hortonworks Inc. 2014
LAB: JDBC and 3rd party packages
http://sandbox.hortonworks.com:8081/#/notebook/2B2P8RE82
Page49 © Hortonworks Inc. 2014
What About Integration With Hive?
scala> val hiveCTX = new org.apache.spark.sql.hive.HiveContext(sc)
scala> hiveCTX.hql("SHOW TABLES").collect().foreach(println)
…
[omniture]
[omniturelogs]
[orc_table]
[raw_products]
[raw_users]
…
Page50 © Hortonworks Inc. 2014
More Integration With Hive:
scala> hCTX.hql("DESCRIBE raw_users").collect().foreach(println)
[swid,string,null]
[birth_date,string,null]
[gender_cd,string,null]
scala> hCTX.hql("SELECT * FROM raw_users WHERE gender_cd='F' LIMIT 5").collect().foreach(println)
[0001BDD9-EABF-4D0D-81BD-D9EABFCD0D7D,8-Apr-84,F]
[00071AA7-86D2-4EB9-871A-A786D27EB9BA,7-Feb-88,F]
[00071B7D-31AF-4D85-871B-7D31AFFD852E,22-Oct-64,F]
[000F36E5-9891-4098-9B69-CEE78483B653,24-Mar-85,F]
[00102F3F-061C-4212-9F91-1254F9D6E39F,1-Nov-91,F]
Page51 © Hortonworks Inc. 2014
ORC at Spotify
• 16x less HDFS read (I/O) when using ORC versus Avro (5)
• 32x less CPU when using ORC versus Avro (5)
Page52 © Hortonworks Inc. 2014
LAB: HIVE ORC
http://sandbox.hortonworks.com:8081/#/notebook/2B6KUW16Z
Page53 © Hortonworks Inc. 2014
Spark Streaming
Page54 © Hortonworks Inc. 2014
MicroBatch Spark Streams
Page55 © Hortonworks Inc. 2014
Physical Execution
Page56 © Hortonworks Inc. 2014
Spark Streaming 101
• Spark has significant library support for streaming applications
val ssc = new StreamingContext(sc, Seconds(5))
val tweetStream = TwitterUtils.createStream(ssc, Some(auth))
• Allows combining streaming with batch/ETL, SQL & ML
• Read data from HDFS, Flume, Kafka, Twitter, ZeroMQ & custom sources
• Chop the input data stream into batches
• Spark processes the batches and publishes results in batches
• The fundamental unit is the Discretized Stream (DStream)
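A minimal DStream sketch (word counts over a socket source; the host and port are placeholders, not part of the original lab):
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))        // 5-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)   // a DStream: a sequence of RDDs
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                        // output operation, runs once per batch
ssc.start()
ssc.awaitTermination()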
Page57 © Hortonworks Inc. 2014
Spark MLlib
Page58 © Hortonworks Inc. 2014
Spark MLlib - Algorithms Offered
• Classification: logistic regression, linear SVM,
– naïve Bayes, least squares, classification tree
• Regression: generalized linear models (GLMs),
– regression tree
• Collaborative filtering: alternating least squares (ALS),
– non-negative matrix factorization (NMF)
• Clustering: k-means
• Decomposition: SVD, PCA
• Optimization: stochastic gradient descent, L-BFGS
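As a flavor of the RDD-based MLlib API, a k-means sketch (the input path and parameters are placeholders):
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("hdfs:///tmp/kmeans_data.txt")              // whitespace-separated numbers per line
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
  .cache()
val model = KMeans.train(points, 3, 20)                              // k = 3 clusters, 20 iterations
println("Within-cluster sum of squared errors: " + model.computeCost(points))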
Page59 © Hortonworks Inc. 2014
ML - Pipelines
• New algorithms: KMeans [SPARK-7879], Naive Bayes [SPARK-8600], Bisecting KMeans
[SPARK-6517], Multi-layer Perceptron (ANN) [SPARK-2352], Weighting for
Linear Models [SPARK-7685]
• New transformers (close to parity with scikit-learn):
CountVectorizer [SPARK-8703], PCA [SPARK-8664], DCT [SPARK-8471], N-Grams [SPARK-8455]
• Calling into single-machine solvers (coming soon as a package)
Page60 © Hortonworks Inc. 2014
Twitter Language Classifier
Goal: connect to the real-time Twitter stream and print only
those tweets whose language matches our chosen language.
Main issue: how do we detect the language at run time?
Solution: build a language classifier model offline that is capable of
detecting the language of a tweet (MLlib). Then apply it to the real-time
Twitter stream and do the filtering (Spark Streaming).
Page61 © Hortonworks Inc. 2014
Spark External Datasources
Page62 © Hortonworks Inc. 2014
Spark External Datasources
You can load datasets from various external sources:
• Local filesystem
• HDFS
• HDFS using a custom InputFormat
• Amazon S3
• Relational databases (RDBMS)
• Apache Cassandra, MongoDB, etc.
Page63 © Hortonworks Inc. 2014
LABS: data load from MongoDB or Cassandra
Page64 © Hortonworks Inc. 2014
Recommendation Engine - ALS
Page65 © Hortonworks Inc. 2014
Step 1: Data Ingest
• Using the MovieLens 10M data set
• http://grouplens.org/datasets/movielens/
• Ratings: UserID::MovieID::Rating::Timestamp
• 10,000,000 ratings on 10,000 movies by 72,000 users
• ratings.dat.gz
• Movies: MovieID::Title::Genres
• 10,000 movies
• movies.dat
Page66 © Hortonworks Inc. 2014
Step 1: Data Ingest
• Some simple Python code followed by the creation of the first RDD
import sys
import os
baseDir = os.path.join('movielens')
ratingsFilename = os.path.join(baseDir, 'ratings.dat.gz')
moviesFilename = os.path.join(baseDir, 'movies.dat')
numPartitions = 2
rawRatings = sc.textFile(ratingsFilename).repartition(numPartitions)
rawMovies = sc.textFile(moviesFilename)
Page67 © Hortonworks Inc. 2014
Step 2: Feature Extraction
• Transform the string data into tuples of useful data and remove
unwanted pieces
• Ratings: UserID::MovieID::Rating::Timestamp
1::1193::5::978300760
1::661::3::978302109 => [(1, 1193, 5.0), (1, 914, 3.0), …]
• Movies: MovieID::Title::Genres
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
=> [(1, 'Toy Story (1995)'), (2, u'Jumanji (1995)'), …]
Page68 © Hortonworks Inc. 2014
Step 2: Feature Extraction
def get_ratings_tuple(entry):
    items = entry.split('::')
    return int(items[0]), int(items[1]), float(items[2])
def get_movie_tuple(entry):
    items = entry.split('::')
    return int(items[0]), items[1]
ratingsRDD = rawRatings.map(get_ratings_tuple).cache()
moviesRDD = rawMovies.map(get_movie_tuple).cache()
Page69 © Hortonworks Inc. 2014
Step 2: Feature Extraction
• Inspect an RDD using collect()
• Careful: make sure the whole dataset fits in the memory of the driver
• Use take(num)
• Safer: takes a num-size subset
(diagram: the driver submits a job; executors run the tasks)
print 'Ratings: %s' % ratingsRDD.take(2)
Ratings: [(1, 1193, 5.0), (1, 914, 3.0)]
print 'Movies: %s' % moviesRDD.take(2)
Movies: [(1, u'Toy Story (1995)'), (2, u'Jumanji (1995)')]
Page70 © Hortonworks Inc. 2014
Step 3: Create Model - The naïve approach
• Recommend movies with the highest average rating
• Need a tuple containing the movie name and its average rating
• Only consider movies with at least 500 ratings
• The tuple must contain the number of ratings for the movie
• The tuple we need should be of the following form:
( averageRating, movieName, numberOfRatings )
Page71 © Hortonworks Inc. 2014
Step 3: Create Model - The naïve approach
• Calculate the average rating of a movie
• From the ratingsRDD, we create tuples containing all the ratings for a movie:
– Remember: ratingsRDD = (UserID, MovieID, Rating)
movieIDsWithRatingsRDD = (ratingsRDD
    .map(lambda (user_id, movie_id, rating): (movie_id, [rating]))
    .reduceByKey(lambda a, b: a + b))
• This is simple map-reduce in Spark:
• Map: (UserID, MovieID, Rating) => (MovieID, [Rating])
• Reduce: (MovieID1, [Rating1]), (MovieID1, [Rating2]) => (MovieID1, [Rating1, Rating2])
Page72 © Hortonworks Inc. 2014
Step 3: Create Model - The naïve approach
• Next, map the data to an RDD with the average and number of ratings
def getCountsAndAverages(RatingsTuple):
    total = 0.0
    for rating in RatingsTuple[1]:
        total += rating
    return (RatingsTuple[0],
            (len(RatingsTuple[1]), total / len(RatingsTuple[1])))
movieIDsWithAvgRatingsRDD = movieIDsWithRatingsRDD.map(getCountsAndAverages)
• Note that the new key-value tuples have MovieID as key and a nested tuple
(ratings, average) as value: [ (2, (332, 3.174698795180723)), … ]
Page73 © Hortonworks Inc. 2014
Step 3: Create Model - The naïve approach
• Only the movie name is still missing from the tuple
• The name of the movie was not present in the ratings data. It must
be joined in from the movie data
movieNameWithAvgRatingsRDD = (moviesRDD
    .join(movieIDsWithAvgRatingsRDD)
    .map(lambda (movieid, (name, (ratings, average))):
         (average, name, ratings)))
• The join creates tuples that still contain the movieID and end up nested three deep:
(Key, (Left_Value, Right_Value))
• A simple map() solves that problem and produces the tuple we need
Page74 © Hortonworks Inc. 2014
Step 3: Create Model - The naïve approach
• The RDD now contains tuples of the correct form
print movieNameWithAvgRatingsRDD.take(3)
[
 (3.68181818181818, 'Happiest Millionaire, The (1967)', 22),
 (3.04682274247491, 'Grumpier Old Men (1995)', 299),
 (2.88297872340425, 'Hocus Pocus (1993)', 94)
]
Page75 © Hortonworks Inc. 2014
Step 3: Create Model - The naïve approach
• Now we can easily filter out all the movies with less than 500 ratings,
sort the RDD by average rating and show the top 20
movieLimitedAndSortedByRatingRDD = (movieNameWithAvgRatingsRDD
    .filter(lambda (average, name, ratings): ratings > 500)
    .sortBy(sortFunction, ascending=False))
Page76 © Hortonworks Inc. 2014
Step 3: Create Model - The naïve approach
• sortFunction makes sure the tuples are sorted using both key and
value, which ensures a consistent sort even if a key appears more
than once
def sortFunction(tuple):
    key = unicode('%.3f' % tuple[0])
    value = tuple[1]
    return (key + ' ' + value)
Page77 © Hortonworks Inc. 2014
Step 3: Create Model - The naïve approach
print 'Movies with highest ratings: %s' % movieLimitedAndSortedByRatingRDD.take(20)
Movies with highest ratings: [ … 1447), … ]
Page78 © Hortonworks Inc. 2014
Step 3: Create Model - Collaborative Filtering
• The naïve approach will recommend the same movies to everybody,
regardless of their personal preferences.
• Collaborative Filtering will look for people with similar tastes and use
their ratings to give recommendations fit to your personal preferences.
Image from Wikipedia: https://en.wikipedia.org/wiki/Collaborative_filtering
Page79 © Hortonworks Inc. 2014
Step 3: Create Model - Collaborative Filtering
• We have a matrix where every row contains the ratings of one user for all
movies in the database.
• Since not every user rated every movie, this matrix is incomplete.
• Predicting the missing ratings is exactly what we need to do in order
to give the user good recommendations
• The algorithm that is usually applied to solve recommendation
problems is "Alternating Least Squares", which takes an iterative
approach to finding the missing values in the matrix.
• Spark's MLlib has a module for Alternating Least Squares recommendation, aptly called "ALS"
Page80 © Hortonworks Inc. 2014
Step 3: Create Model - Collaborative Filtering
• Machine learning workflow
(diagram: the full dataset is split into a training set, a validation set and a test set; the model is trained on the training set, tuned against the validation set, and checked for accuracy/over-fitting on the test set before making predictions)
Page81 © Hortonworks Inc. 2014
Step 3: Create Model - Collaborative Filtering
• Randomly split the dataset we have into multiple groups for training,
validating and testing using randomSplit(weights, seed=None)
trainingRDD, validationRDD, testRDD = ratingsRDD.randomSplit([6, 2, 2], seed=0L)
print 'Training: %s, validation: %s, test: %s\n' % (trainingRDD.count(),
    validationRDD.count(),
    testRDD.count())
Training: 292716, validation: 96902, test: 98032
Page82 © Hortonworks Inc. 2014
Step 3: Create Model - Collaborative Filtering
• Before we start training the model, we need a way to calculate how
good a model is, so we can compare it against other tries
• Root Mean Square Error (RMSE) is often used to compute the error of
a model
• RMSE compares the predicted values from the training set with
the real values present in the validation set. By squaring the
differences, averaging those squares, and taking the square root
of that average, we get a single number that represents the error
of the model
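Written out (with p_i the predicted and a_i the actual rating for each of the n (user, movie) pairs present in both sets):
RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (p_i - a_i)^2 }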
Page83 © Hortonworks Inc. 2014
Step 3: Create Model - Collaborative Filtering
def computeError(predictedRDD, actualRDD):
    predictedReformattedRDD = (predictedRDD
        .map(lambda (UserID, MovieID, Rating): ((UserID, MovieID), Rating)))
    actualReformattedRDD = (actualRDD
        .map(lambda (UserID, MovieID, Rating): ((UserID, MovieID), Rating)))
    squaredErrorsRDD = (predictedReformattedRDD
        .join(actualReformattedRDD)
        .map(lambda (k, (a, b)): math.pow((a - b), 2)))
    totalError = squaredErrorsRDD.reduce(lambda a, b: a + b)
    numRatings = squaredErrorsRDD.count()
    return math.sqrt(float(totalError) / numRatings)
Page84 © Hortonworks Inc. 2014
Step 3: Create Model - Collaborative Filtering
• Create a trained model using the ALS.train() method from Spark MLlib
• Rank is the most important parameter to tune
• The number of latent factors in the model (the width of the factor matrices)
• A lower rank will mean higher error; a very high rank may lead to overfitting
ALS.train(
    trainingRDD,
    rank,         # We'll try 3 ranks: 4, 8, 12
    seed = 5L,
    iterations = 5,
    lambda_ = 0.1
)
Page85 © Hortonworks Inc. 2014
Step 3: Create Model - Collaborative Filtering
• Use the trained model to predict the missing ratings in the validation set
• Create a new RDD from the validation set where the ratings are removed
• Call the predictAll() method using the trained model on that RDD
validationForPredictRDD = (validationRDD
    .map(lambda (UserID, MovieID, Rating): (UserID, MovieID)))
predictedRatingsRDD = model.predictAll(validationForPredictRDD)
Page86 © Hortonworks Inc. 2014
Step 3: Create Model - Collaborative Filtering
• Finally, use our computeError() method to calculate the error of our
trained model by comparing the predicted ratings with the real ones
error = computeError(predictedRatingsRDD, validationRDD)
Page87 © Hortonworks Inc. 2014
Step 3: Create Model - Collaborative Filtering
• Import the ALS module, create the "empty" validation RDD for prediction and set up some variables
from pyspark.mllib.recommendation import ALS
validationForPredictRDD = (validationRDD
    .map(lambda (UserID, MovieID, Rating): (UserID, MovieID)))
ranks = [4, 8, 12]
errors = [0, 0, 0]
err = 0
minError = float('inf')
bestRank = -1
bestIteration = -1
Page88 © Hortonworks Inc. 2014
Step 3: Create Model - Collaborative Filtering
for rank in ranks:
    model = ALS.train(trainingRDD, rank, seed=5L,
                      iterations=5, lambda_=0.1)
    predictedRatingsRDD = model.predictAll(validationForPredictRDD)
    error = computeError(predictedRatingsRDD, validationRDD)
    errors[err] = error
    err += 1
    print 'For rank %s the RMSE is %s' % (rank, error)
    if error < minError:
        minError = error
        bestRank = rank
Page89 © Hortonworks Inc. 2014
Step 3: Create Model - Collaborative Filtering
• The model that was trained with rank 8 has the lowest error (RMSE)
print 'The best model was trained with rank %s' % bestRank
For rank 4 the RMSE is 0.892734779484
For rank 8 the RMSE is 0.890121292255
For rank 12 the RMSE is 0.890216118367
The best model was trained with rank 8
Page90 © Hortonworks Inc. 2014
Step 4: Test Model
• So we have now found the best model, but we still need to test
whether the model is actually good
• Testing using the same validation set is not a good test, since it may
leave us vulnerable to overfitting
• The model is so fit to the validation set that it only produces good results for that set
• This is why we split off a test set at the start of the machine
learning process
• We will use the best rank we obtained to train a model and
then predict the ratings for the test set
• Calculating the RMSE for the test set predictions should tell us if our
model is usable
Page91 © Hortonworks Inc. 2014
Step 4: Test Model
• We recreate the model, remove all the ratings present in the test set
and run the predictAll() method
myModel = ALS.train(trainingRDD, 8, seed=5L,
                    iterations=5, lambda_=0.1)
testForPredictingRDD = testRDD.map(lambda (UserID, MovieID, Rating):
                                   (UserID, MovieID))
predictedTestRDD = myModel.predictAll(testForPredictingRDD)
testRMSE = computeError(testRDD, predictedTestRDD)
Page92 © Hortonworks Inc. 2014
Step 4: Test Model
• The RMSE is good. Our model does not suffer from overfitting and is
usable.
• The RMSE on the validation set was 0.890121292255, only slightly better
print 'The model had a RMSE on the test set of %s' % testRMSE
The model had a RMSE on the test set of 0.891048561304
Page93 © Hortonworks Inc. 2014
Step 5: Use the model
• Let's get some movie predictions!
• First I need to give the data set some ratings so it has something to deduce my taste
myRatedMovies = [                # Rating
    (0, 845, 5.0),   # Blade Runner (1982) - 5.0/5
    (0, 789, 4.5),   # Good Will Hunting (1997) - 4.5/5
    (0, 983, 4.8),   # Christmas Story, A (1983) - 4.8/5
    (0, 551, 2.0),   # Taxi Driver (1976) - 2.0/5
    (0, 1039, 2.0),  # Pulp Fiction (1994) - 2.0/5
    (0, 651, 5.0),   # Dr. Strangelove (1963) - 5.0/5
    (0, 1195, 4.0),  # Raiders of the Lost Ark (1981) - 4.0/5
    (0, 1110, 5.0),  # Sixth Sense, The (1999) - 4.5/5
    (0, 1250, 4.5),  # Matrix, The (1999) - 4.5/5
    (0, 1083, 4.0)   # Princess Bride, The (1987) - 4.0/5
]
myRatingsRDD = sc.parallelize(myRatedMovies)
Page94 © Hortonworks Inc. 2014
Step 5: Use the model
• Then we add my ratings to the data set
• Since we now have more ratings, let's train our model again
• and make sure the RMSE is still OK (re-using the test set RDDs from the previous step)
trainingWithMyRatingsRDD = myRatingsRDD.union(trainingRDD)
myRatingsModel = ALS.train(trainingWithMyRatingsRDD, 8,
                           seed=5L, iterations=5, lambda_=0.1)
predictedTestMyRatingsRDD = myRatingsModel.predictAll(testForPredictingRDD)
testRMSEMyRatings = computeError(testRDD, predictedTestMyRatingsRDD)
Page95 © Hortonworks Inc. 2014
Step 5: Use the model
• And of course, check the RMSE again... we're good
print 'The model had a RMSE on the test set of %s' % testRMSEMyRatings
The model had a RMSE on the test set of 0.892023318284
Page96 © Hortonworks Inc. 2014
Step 5: Use the model
• Now we need an RDD with only the movies I did not rate, to run
predictAll() on (my userid is set to zero)
• [(0, movieID1), (0, movieID2), (0, movieID3), …]
myUnratedMoviesRDD = (moviesRDD
    .map(lambda (movieID, name): movieID)
    .filter(lambda movieID: movieID not in [mine[1] for mine in myRatedMovies])
    .map(lambda movieID: (0, movieID)))
predictedRatingsRDD = myRatingsModel.predictAll(myUnratedMoviesRDD)
Page97 © Hortonworks Inc. 2014
Step 5: Use the model
• From the predicted RDD, get the top 20 predicted ratings, but only
for movies that had at least 75 ratings in total
• Re-use the RDD we created in the naïve approach that had the average ratings and
number of ratings (movieIDsWithAvgRatingsRDD)
• Map it to tuples of the form (movieID, number_of_ratings)
• Strip the userid from the predicted RDD
• Map it to tuples (movieID, predicted_rating)
• Join those two and add the movie names from the original movies
data and clean up the result
• The resulting tuple is (predicted_rating, name, number_of_ratings)
• Filter out all movies that had less than 75 ratings
Page98 © Hortonworks Inc. 2014
Step 5: Use the model
movieCountsRDD = (movieIDsWithAvgRatingsRDD
    .map(lambda (movie_id, (ratings, average)): (movie_id, ratings)))
predictedRDD = (predictedRatingsRDD
    .map(lambda (uid, movie_id, rating): (movie_id, rating)))
predictedWithCountsRDD = (predictedRDD.join(movieCountsRDD))
ratingsWithNamesRDD = (predictedWithCountsRDD
    .join(moviesRDD)
    .map(lambda (movie_id, ((pred, ratings), name)): (pred, name, ratings))
    .filter(lambda (pred, name, ratings): ratings > 75))
Page99 © Hortonworks Inc. 2014
Step 5: Use the model
• And finally get the top 20 recommended movies for myself
predictedHighestRatedMovies = ratingsWithNamesRDD.takeOrdered(20, key=lambda x: -x[0])
print ('My highest rated movies as predicted:\n%s' %
       '\n'.join(map(str, predictedHighestRatedMovies)))
Page100 © Hortonworks Inc. 2014
Step 5: Use the model
My highest rated movies as predicted:
(4.823536053603062, 'Once Upon a Time in the West (1969)', 82)
(4.743456934724456, 'Texas Chainsaw Massacre, The (1974)', 111)
(4.452221024980805, 'Evil Dead II (Dead By Dawn) (1987)', 305)
(4.387531237859994, 'Duck Soup (1933)', 279)
(4.373821653377477, 'Citizen Kane (1941)', 527)
(4.344480264132989, 'Cabin Boy (1994)', 95)
(4.332264360095111, 'Shaft (1971)', 85)
(4.217371529794628, 'Night of the Living Dead (1968)', 352)
(4.181318251399025, 'Yojimbo (1961)', 110)
(4.171790272807383, 'Naked Gun: From the Files of Police Squad', 435)
…
Apache Spark on HDP 2.3
(diagram: a user submits ratings from the Spark shell; the system trains and persists the model, serves predictions/recommendations, and uses new ratings to improve the model)
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Page102 © Hortonworks Inc. 2014
Conclusion and Q&A
Page103 © Hortonworks Inc. 2014
Learn More: Spark + Hadoop, Perfect Together
HDP Spark General Info:
http://hortonworks.com/hadoop/spark/
Learn more about our Focus on Spark:
http://hortonworks.com/hadoop/spark/#section_6
Get the HDP Spark 1.5.1 Tech Preview:
http://hortonworks.com/hadoop/spark/#section_5
Get started with Spark and Zeppelin and download the Sandbox:
http://hortonworks.com/sandbox
Try these tutorials:
http://hortonworks.com/hadoop/spark/#tutorials
http://hortonworks.com/hadoop-tutorial/apache-spark-1-5-1-technical-preview-with-hdp-2-3/
Learn more about GeoSpatial Spark processing with Magellan:
http://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/
More Related Content

What's hot

Spark and Spark Streaming
Spark and Spark StreamingSpark and Spark Streaming
Spark and Spark Streaming宇 傅
Ā 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Sparkrhatr
Ā 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
Ā 
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsApache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsTimothy Spann
Ā 
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Databricks
Ā 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Sparkdatamantra
Ā 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)Spark Summit
Ā 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at UberDatabricks
Ā 
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark Summit
Ā 
Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Databricks
Ā 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHortonworks
Ā 
Hadoop to spark-v2
Hadoop to spark-v2Hadoop to spark-v2
Hadoop to spark-v2Sujee Maniyam
Ā 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark Hubert Fan Chiang
Ā 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash CourseDataWorks Summit
Ā 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemCloudera, Inc.
Ā 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
Ā 
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkAlex Zeltov
Ā 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...Databricks
Ā 
Spark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan KesslerSpark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan KesslerSpark Summit
Ā 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Databricks
Ā 

What's hot (20)

Spark and Spark Streaming
Spark and Spark StreamingSpark and Spark Streaming
Spark and Spark Streaming
Ā 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Spark
Ā 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Ā 
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsApache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Ā 
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Ā 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Ā 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
Ā 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
Ā 
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Ā 
Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)
Ā 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
Ā 
Hadoop to spark-v2
Hadoop to spark-v2Hadoop to spark-v2
Hadoop to spark-v2
Ā 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
Ā 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
Ā 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Ā 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
Ā 
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
Ā 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Ā 
Spark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan KesslerSpark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan Kessler
Ā 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Ā 

Viewers also liked

Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
Ā 
Gobblin: Unifying Data Ingestion for Hadoop
Gobblin: Unifying Data Ingestion for HadoopGobblin: Unifying Data Ingestion for Hadoop
Gobblin: Unifying Data Ingestion for HadoopYinan Li
Ā 
High Speed Continuous & Reliable Data Ingest into Hadoop
High Speed Continuous & Reliable Data Ingest into HadoopHigh Speed Continuous & Reliable Data Ingest into Hadoop
High Speed Continuous & Reliable Data Ingest into HadoopDataWorks Summit
Ā 
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Chris Fregly
Ā 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best PracticesPrakash Chockalingam
Ā 
Data Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on HadoopData Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on Hadoopskaluska
Ā 
Bitraf Arduino workshop
Bitraf Arduino workshopBitraf Arduino workshop
Bitraf Arduino workshopJens Brynildsen
Ā 
JSON-LD: JSON for the Social Web
JSON-LD: JSON for the Social WebJSON-LD: JSON for the Social Web
JSON-LD: JSON for the Social WebGregg Kellogg
Ā 
Witness statement
Witness statementWitness statement
Witness statementLola Heavey
Ā 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word countJeff Patti
Ā 
ā€ā€™I den svenska och tyska litteraturens mittpunktā€™: Svenska Pommerns roll som...
ā€ā€™I den svenska och tyska litteraturens mittpunktā€™: Svenska Pommerns roll som...ā€ā€™I den svenska och tyska litteraturens mittpunktā€™: Svenska Pommerns roll som...
ā€ā€™I den svenska och tyska litteraturens mittpunktā€™: Svenska Pommerns roll som...Andreas Ɩnnerfors
Ā 
JSON-LD for RESTful services
JSON-LD for RESTful servicesJSON-LD for RESTful services
JSON-LD for RESTful servicesMarkus Lanthaler
Ā 
IOT 101 - A primer on Internet of Things
IOT 101 - A primer on Internet of ThingsIOT 101 - A primer on Internet of Things
IOT 101 - A primer on Internet of ThingsNagarro
Ā 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Spark Summit
Ā 
MotivaciĆ³n laboral
MotivaciĆ³n laboralMotivaciĆ³n laboral
MotivaciĆ³n laboralalexander_hv
Ā 
IBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TBIBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TBGord Sissons
Ā 

Viewers also liked (20)

Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
Ā 
Gobblin: Unifying Data Ingestion for Hadoop
Gobblin: Unifying Data Ingestion for HadoopGobblin: Unifying Data Ingestion for Hadoop
Gobblin: Unifying Data Ingestion for Hadoop
Ā 
High Speed Continuous & Reliable Data Ingest into Hadoop
High Speed Continuous & Reliable Data Ingest into HadoopHigh Speed Continuous & Reliable Data Ingest into Hadoop
High Speed Continuous & Reliable Data Ingest into Hadoop
Ā 
Spark Streaming into context
Spark Streaming into contextSpark Streaming into context
Spark Streaming into context
Ā 
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Ā 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
Ā 
Data Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on HadoopData Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on Hadoop
Ā 
Bitraf Arduino workshop
Bitraf Arduino workshopBitraf Arduino workshop
Bitraf Arduino workshop
Ā 
JSON-LD: JSON for the Social Web
JSON-LD: JSON for the Social WebJSON-LD: JSON for the Social Web
JSON-LD: JSON for the Social Web
Ā 
Witness statement
Witness statementWitness statement
Witness statement
Ā 
Data science lifecycle with Apache Zeppelin
Data science lifecycle with Apache ZeppelinData science lifecycle with Apache Zeppelin
Data science lifecycle with Apache Zeppelin
Ā 
EKSG 2017 Approved Budget
EKSG 2017 Approved Budget EKSG 2017 Approved Budget
EKSG 2017 Approved Budget
Ā 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word count
Ā 
ā€ā€™I den svenska och tyska litteraturens mittpunktā€™: Svenska Pommerns roll som...
ā€ā€™I den svenska och tyska litteraturens mittpunktā€™: Svenska Pommerns roll som...ā€ā€™I den svenska och tyska litteraturens mittpunktā€™: Svenska Pommerns roll som...
ā€ā€™I den svenska och tyska litteraturens mittpunktā€™: Svenska Pommerns roll som...
Ā 
JSON-LD for RESTful services
JSON-LD for RESTful servicesJSON-LD for RESTful services
JSON-LD for RESTful services
Ā 
IOT 101 - A primer on Internet of Things
IOT 101 - A primer on Internet of ThingsIOT 101 - A primer on Internet of Things
IOT 101 - A primer on Internet of Things
Ā 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Ā 
MotivaciĆ³n laboral
MotivaciĆ³n laboralMotivaciĆ³n laboral
MotivaciĆ³n laboral
Ā 
IBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TBIBM Hadoop-DS Benchmark Report - 30TB
IBM Hadoop-DS Benchmark Report - 30TB
Ā 
Spark+flume seattle
Spark+flume seattleSpark+flume seattle
Spark+flume seattle
Ā 

Similar to Spark Advanced Analytics NJ Data Science Meetup - Princeton University

Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitDataWorks Summit
Ā 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitSaptak Sen
Ā 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with sparkHortonworks
Ā 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
Ā 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
Ā 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internalDavid Lauzon
Ā 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
Ā 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkManish Gupta
Ā 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark Hortonworks
Ā 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
Ā 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for BeginnersAnirudh
Ā 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
Ā 
Implementing the Lambda Architecture efficiently with Apache Spark
Implementing the Lambda Architecture efficiently with Apache SparkImplementing the Lambda Architecture efficiently with Apache Spark
Implementing the Lambda Architecture efficiently with Apache SparkDataWorks Summit
Ā 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdfMaheshPandit16
Ā 
OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxAishg4
Ā 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
Ā 

Similar to Spark Advanced Analytics NJ Data Science Meetup - Princeton University (20)

Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
Ā 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop Summit
Ā 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with spark
Ā 
Apache spark
Apache sparkApache spark
Apache spark
Ā 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
Ā 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
Ā 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
Ā 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
Ā 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
Ā 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
Ā 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
Ā 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Ā 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
Ā 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
Ā 
Spark core
Spark coreSpark core
Spark core
Ā 
Implementing the Lambda Architecture efficiently with Apache Spark
Implementing the Lambda Architecture efficiently with Apache SparkImplementing the Lambda Architecture efficiently with Apache Spark
Implementing the Lambda Architecture efficiently with Apache Spark
Ā 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
Ā 
OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptx
Ā 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
Ā 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
Ā 

Spark Advanced Analytics NJ Data Science Meetup - Princeton University

  • 15. Page15 © Hortonworks Inc. 2014 Zeppelin – Dependency Management • Load libraries recursively from Maven repository • Load libraries from local filesystem • %dep • // add maven repository • z.addRepo("RepoName").url("RepoURL") • // add artifact from filesystem • z.load("/path/to.jar") • // add artifact from maven repository, with no dependency • z.load("groupId:artifactId:version").excludeAll()
  • 16. Page16 © Hortonworks Inc. 2014 Spark & Zeppelin Pace of Innovation [timeline] Spark in HDP: Spark 1.2.1 GA (HDP 2.2.4), Spark 1.3.1 GA (HDP 2.3.0), Spark 1.4.1 GA (HDP 2.3.2), Spark 1.5.2* GA (HDP 2.3.4), Spark 1.6 GA (HDP 2.4.0), Spark 1.6.1* GA (HDP 2.5.x, Q1 2016); tech previews: Spark 1.3.1 TP 5/2015, Spark 1.4.1 TP 8/2015, Spark 1.5.1 TP Nov/2015, Spark 1.6 TP Jan/2016. Apache Zeppelin: TP Oct/2015, TP refresh Dec 2015, GA Q1 2016. (Timeline as of March 1st, 2016.)
  • 17. Page17 © Hortonworks Inc. 2014 Spark in HDP customer base - 2015 [bar chart: unique # of customers filing Spark tickets/Qs, by quarter Q1 through Q4] 132 customers filed Spark tickets in 2015
  • 18. Page18 © Hortonworks Inc. 2014 Programming Spark
  • 19. Page19 © Hortonworks Inc. 2014 How Does Spark Work? • RDD • Your data is loaded in parallel into structured collections • Actions • Manipulate the state of the working model by forming new RDDs and performing calculations upon them • Persistence • Long-term storage of an RDD's state
  • 20. Page20 © Hortonworks Inc. 2014 Resilient Distributed Datasets • The primary abstraction in Spark » Immutable once constructed » Track lineage information to efficiently recompute lost data » Enable operations on collections of elements in parallel • You construct RDDs » by parallelizing existing collections (lists) » by transforming an existing RDD » from files in HDFS or any other storage system
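A minimal Scala sketch of those three construction paths (the file path is illustrative only):
// 1. parallelize an existing collection
val nums = sc.parallelize(List(1, 2, 3, 4, 5))
// 2. transform an existing RDD into a new one
val doubled = nums.map(_ * 2)
// 3. load from HDFS (or any other Hadoop-supported storage)
val lines = sc.textFile("hdfs:///tmp/sample.txt")   // hypothetical path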
  • 21. Page21 © Hortonworks Inc. 2014 RDDs [diagram: an RDD of 25 items split into 5 partitions spread across Worker / Spark executor nodes; more partitions = more parallelism] • Programmer specifies number of partitions for an RDD (default value used if unspecified)
  • 22. Page22 © Hortonworks Inc. 2014 RDDs • Two types of operations: transformations and actions • Transformations are lazy (not computed immediately) • Transformed RDD is executed when an action runs on it • Persist (cache) RDDs in memory or disk
  • 23. Page23 © Hortonworks Inc. 2014 Example RDD Transformations • map(func) • filter(func) • distinct() • All create a new DataSet from an existing one • Do not create the DataSet until an action is performed (lazy) • Each element in an RDD is passed to the target function and the result forms a new RDD
  • 24. Page24 © Hortonworks Inc. 2014 Example Action Operations • count() • reduce(func) • collect() • take() • Either: • Returns a value to the driver program • Exports state to an external system
  • 25. Page25 © Hortonworks Inc. 2014 Example Persistence Operations • persist() -- takes options • cache() -- only one option: in-memory • Stores RDD values • in memory (what doesn't fit is recalculated when necessary) • Replication is an option for in-memory • to disk • blended
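Putting slides 22-25 together, a short Scala sketch of lazy transformations, persistence and actions (the input path is illustrative):
val logs = sc.textFile("hdfs:///tmp/app.log")   // hypothetical input
val errors = logs.filter(_.contains("ERROR"))   // transformation: lazy, nothing runs yet
errors.cache()                                  // persistence: keep the filtered RDD in memory
val howMany = errors.count()                    // action: triggers the actual computation
val firstFive = errors.take(5)                  // a second action reuses the cached RDD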
  • 26. Page26 © Hortonworks Inc. 2014 Spark Applications Are a definition in code of • RDD creation • Actions • Persistence Results in the creation of a DAG (Directed Acyclic Graph) [workflow] • Each DAG is compiled into stages • Each stage is executed as a series of tasks • Each task operates in parallel on assigned partitions
  • 27. Page27 © Hortonworks Inc. 2014 Spark Context • A Spark program first creates a SparkContext object • Tells Spark how and where to access a cluster • Use SparkContext to create RDDs • SparkContext, SQLContext, ZeppelinContext: • are automatically created and exposed as variable names 'sc', 'sqlContext' and 'z', respectively, in both the Scala and Python environments in Zeppelin • iPython and standalone programs must use a constructor to create a new SparkContext Note that the Scala and Python environments share the same SparkContext, SQLContext and ZeppelinContext instances.
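Outside Zeppelin (for example in a standalone application submitted with spark-submit), you build the context yourself; a minimal Spark 1.x sketch, assuming a YARN client-mode deployment:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("MyApp").setMaster("yarn-client")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)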
  • 28. Page28 © Hortonworks Inc. 2014 1. Resilient Distributed Dataset [RDD] Graph
val v = sc.textFile("hdfs://…some-hdfs-data")
v.flatMap(line => line.split(" "))
 .map(word => (word, 1))
 .reduceByKey(_ + _, 3)
 .collect()
[lineage: textFile → RDD[String] → flatMap → map → RDD[(String, Int)] → reduceByKey → RDD[(String, Int)] → collect → Array[(String, Int)]]
  • 29. Page29 © Hortonworks Inc. 2014 Processing A File in Scala //Load the file: val file = sc.textFile("hdfs://…/user/DAW/littlelog.csv") //Trim away any empty rows: val fltr = file.filter(_.length > 0) //Print out the remaining rows: fltr.foreach(println)
  • 30. Page30 © Hortonworks Inc. 2014 Looking at the State in the Machine //run debug command to inspect RDD: scala> fltr.toDebugString //simplified output: res1: String = FilteredRDD[2] at filter at <console>:14 MappedRDD[1] at textFile at <console>:12 HadoopRDD[0] at textFile at <console>:12
  • 31. Page31 © Hortonworks Inc. 2014 A Word on Anonymous Functions Scala programmers make great use of anonymous functions as can be seen in the code: flatMap( line => line.split(" ") ) [annotated: "line" is the argument to the function, "line.split(" ")" is the body of the function]
  • 32. Page32 © Hortonworks Inc. 2014 Scala Functions Come In a Variety of Styles flatMap( line => line.split(" ") ) [argument with type inferred, then body] flatMap((line:String) => line.split(" ")) [argument with explicit type, then body] flatMap(_.split(" ")) [no argument declared; the placeholder _ stands in for it] The body may include the placeholder _, which allows exactly one use of one argument for each _ present; _ essentially means 'whatever you pass me'
  • 33. Page33 © Hortonworks Inc. 2014 And Finally – the Formal 'def' def myFunc(line:String): Array[String] = { return line.split(",") } //and now that it has a name: myFunc("Hi Mom, I'm home.").foreach(println) [annotated: argument to the function, return type of the function, body of the function]
  • 34. Page34 © Hortonworks Inc. 2014 LAB: Spark RDD & Data Frames Demo – Philly Crime Data Set http://sandbox.hortonworks.com:8081/#/notebook/2B6HKTZDK
  • 35. Page35 © Hortonworks Inc. 2014 Spark DataFrames
  • 36. Page36 © Hortonworks Inc. 2014 What are DataFrames? • Distributed collection of data organized in columns • Equivalent to tables in databases or DataFrames in R/Python • Much richer optimization than any other implementation of DF • Can be constructed from a wide variety of sources and APIs Why DataFrames? • Greater accessibility • Declarative rather than imperative • Catalyst Optimizer
  • 37. Page37 © Hortonworks Inc. 2014 Writing a DataFrame val df = sqlContext.jsonFile("/tmp/people.json") df.show() df.printSchema() df.select("First Name").show() df.select("First Name","Age").show() df.filter(df("age")>40).show() df.groupBy("age").count().show()
  • 38. Page38 © Hortonworks Inc. 2014 Querying RDD Using SQL
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val schemaString = "name age"   // assumed here; not defined on the original slide
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
val people = sc.textFile("/tmp/people.txt")
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
peopleDataFrame.registerTempTable("people")
val results = sqlContext.sql("SELECT name FROM people")
results.map(t => "Name: " + t(0)).collect().foreach(println)
  • 39. Page39 © Hortonworks Inc. 2014 Querying RDD Using SQL // SQL statements can be run directly on RDDs val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19") // The results of SQL queries are SchemaRDDs and support // normal RDD operations: val nameList = teenagers.map(t => "Name: " + t(0)).collect() // Language integrated queries (a la LINQ) val teenagers = people.where('age >= 10).where('age <= 19).select('name)
  • 40. Page40 © Hortonworks Inc. 2014 Dataframes for Apache Spark [benchmark chart: time to aggregate 10 million integer pairs (in seconds) for DataFrame SQL, DataFrame R, DataFrame Python, DataFrame Scala, RDD Python, RDD Scala] DataFrames can be significantly faster than RDDs. And they perform the same, regardless of language.
  • 41. Page41 © Hortonworks Inc. 2014 Dataframes – Transformations & Actions Transformations: filter, select, drop, join Actions: count, collect, show, take Transformations contribute to the query plan but nothing is executed until an action is called
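A small Scala sketch of that laziness, reusing the people DataFrame pattern from the earlier slides (the file path is illustrative):
val people = sqlContext.jsonFile("/tmp/people.json")   // hypothetical input
val adults = people.filter(people("age") >= 18)        // transformation: only builds the query plan
                   .select("name", "age")              // still nothing executed
adults.show()                                          // action: the plan is optimized and run
val n = adults.count()                                 // another action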
  • 42. Page42 © Hortonworks Inc. 2014 LAB: DataFrames http://sandbox.hortonworks.com:8081/#/notebook/2B4B7EWY7 http://sandbox.hortonworks.com:8081/#/notebook/2B5RMG4AM DataFrames + SQL DataFrames JSON
  • 43. Page43 © Hortonworks Inc. 2014 DataFrames and JDBC val jdbc_attendees = sqlContext.load("jdbc", Map("url" -> "jdbc:mysql://localhost:3306/db1?user=root&password=xxx","dbtable" -> "attendees")) jdbc_attendees.show() jdbc_attendees.count() jdbc_attendees.registerTempTable("jdbc_attendees") val countall = sqlContext.sql("select count(*) from jdbc_attendees") countall.map(t=>"Records count is "+t(0)).collect().foreach(println)
  • 44. Page44 © Hortonworks Inc. 2014 Code 'select count' Equivalent SQL Statement: Select count(*) from pagecounts WHERE state = 'FL' Scala statement: val file = sc.textFile("hdfs://…/log.txt") val numFL = file.filter(line => line.contains("fl")).count() scala> println(numFL) 1. Load the page as an RDD 2. Filter the lines of the page eliminating any that do not contain "fl" 3. Count those lines that remain 4. Print the value of the counted lines containing 'fl'
  • 45. Page45 © Hortonworks Inc. 2014 Spark SQL
  • 46. Page46 © Hortonworks Inc. 2014 Platform APIs • Joining Data from Different Sources • Access Data using DataFrames / SQL
  • 47. Page47 © Hortonworks Inc. 2014 Platform APIs • Community Plugins • 100+ connectors http://spark-packages.org/
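As an example, a connector from spark-packages.org can be pulled in when the shell is launched; a hedged Scala sketch using the spark-csv package (package version and file path are assumptions):
// started with: spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
val csvDF = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/tmp/people.csv")   // hypothetical file
csvDF.printSchema()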
  • 48. Page48 © Hortonworks Inc. 2014 LAB: JDBC and 3rd party packages http://sandbox.hortonworks.com:8081/#/notebook/2B2P8RE82
  • 49. Page49 © Hortonworks Inc. 2014 What About Integration With Hive? scala> val hiveCTX = new org.apache.spark.sql.hive.HiveContext(sc) scala> hiveCTX.hql("SHOW TABLES").collect().foreach(println) … [omniture] [omniturelogs] [orc_table] [raw_products] [raw_users] …
  • 50. Page50 © Hortonworks Inc. 2014 More Integration With Hive: scala> hiveCTX.hql("DESCRIBE raw_users").collect().foreach(println) [swid,string,null] [birth_date,string,null] [gender_cd,string,null] scala> hiveCTX.hql("SELECT * FROM raw_users WHERE gender_cd='F' LIMIT 5").collect().foreach(println) [0001BDD9-EABF-4D0D-81BD-D9EABFCD0D7D,8-Apr-84,F] [00071AA7-86D2-4EB9-871A-A786D27EB9BA,7-Feb-88,F] [00071B7D-31AF-4D85-871B-7D31AFFD852E,22-Oct-64,F] [000F36E5-9891-4098-9B69-CEE78483B653,24-Mar-85,F] [00102F3F-061C-4212-9F91-1254F9D6E39F,1-Nov-91,F]
  • 51. Page51 © Hortonworks Inc. 2014 ORC at Spotify • 16x less HDFS read when using ORC versus Avro • 32x less CPU when using ORC versus Avro [2]
  • 52. Page52 © Hortonworks Inc. 2014 LAB: HIVE ORC http://sandbox.hortonworks.com:8081/#/notebook/2B6KUW16Z
  • 53. Page53 © Hortonworks Inc. 2014 Spark Streaming
  • 54. Page54 © Hortonworks Inc. 2014 MicroBatch Spark Streams
  • 55. Page55 © Hortonworks Inc. 2014 Physical Execution
  • 56. Page56 © Hortonworks Inc. 2014 Spark Streaming 101 • Spark has significant library support for streaming applications val ssc = new StreamingContext(sc, Seconds(5)) val tweetStream = TwitterUtils.createStream(ssc, Some(auth)) • Allows combining Streaming with Batch/ETL, SQL & ML • Read data from HDFS, Flume, Kafka, Twitter, ZeroMQ & custom sources • Chop the input data stream into batches • Spark processes batches & results are published in batches • Fundamental unit is Discretized Streams (DStreams)
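A minimal DStream word count sketch in Scala (the socket host and port are placeholders) showing the batch-of-RDDs model:
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))        // 5-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.print()                                        // emit each batch's counts
ssc.start()
ssc.awaitTermination()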
  • 57. Page57 © Hortonworks Inc. 2014 Spark MLLib
  • 58. Page58 © Hortonworks Inc. 2014 Spark MLlib – Algorithms Offered • Classification: logistic regression, linear SVM, naïve Bayes, least squares, classification tree • Regression: generalized linear models (GLMs), regression tree • Collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF) • Clustering: k-means • Decomposition: SVD, PCA • Optimization: stochastic gradient descent, L-BFGS
  • 59. Page59 © Hortonworks Inc. 2014 ML - Pipelines • New algorithms: KMeans [SPARK-7879], Naive Bayes [SPARK-8600], Bisecting KMeans [SPARK-6517], Multi-layer Perceptron (ANN) [SPARK-2352], Weighting for Linear Models [SPARK-7685] • New transformers (close to parity with scikit-learn): CountVectorizer [SPARK-8703], PCA [SPARK-8664], DCT [SPARK-8471], N-Grams [SPARK-8455] • Calling into single machine solvers (coming soon as a package)
  • 60. Page60 © Hortonworks Inc. 2014 Twitter Language Classifier Goal: connect to the real-time Twitter stream and print only those tweets whose language matches our chosen language. Main issue: how to detect the language at run time? Solution: build a language classifier model offline capable of detecting the language of a tweet (MLlib). Then apply it to the real-time Twitter stream and do the filtering (Spark Streaming).
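A rough Scala sketch of that idea (the featurization, the historicalTweets RDD and the chosen cluster id are assumptions; one common approach is K-means over character-bigram features of the tweet text):
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.clustering.KMeans

// offline: turn historical tweet text into feature vectors and cluster them
val tf = new HashingTF(1000)
val features = historicalTweets.map(text => tf.transform(text.sliding(2).toSeq))   // historicalTweets: RDD[String], assumed
val model = KMeans.train(features.cache(), 10, 20)
val myLanguageCluster = 3   // hypothetical cluster id, chosen by inspecting sample tweets

// online: keep only the tweets that land in that cluster
tweetStream.map(_.getText)
           .filter(text => model.predict(tf.transform(text.sliding(2).toSeq)) == myLanguageCluster)
           .print()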
  • 61. Page61 © Hortonworks Inc. 2014 Spark External Datasources
  • 62. Page62 © Hortonworks Inc. 2014 Spark External Datasources You can load datasets from various external sources: • Local Filesystem • HDFS • HDFS using custom InputFormat • Amazon S3 • Relational Databases (RDBMS) • Apache Cassandra, MongoDB, etc.
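A few illustrative Scala one-liners for those sources (all URIs and credentials are placeholders; the S3 and JDBC details depend on your cluster configuration):
val local = sc.textFile("file:///tmp/data.txt")         // local filesystem
val hdfs  = sc.textFile("hdfs:///user/demo/data.txt")   // HDFS
val s3    = sc.textFile("s3n://my-bucket/data.txt")     // Amazon S3, assuming keys are configured
val jdbc  = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:mysql://localhost:3306/db1?user=root&password=xxx",
  "dbtable" -> "attendees"))                             // RDBMS, as on the earlier JDBC slide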
  • 63. Page63 © Hortonworks Inc. 2014 LABS: data load from MongoDB or Cassandra
  • 64. Page64 © Hortonworks Inc. 2014 Recommendation Engine - ALS
  • 65. Page65 © Hortonworks Inc. 2014 Step 1: Data Ingest • Using the MovieLens 10M data set • http://grouplens.org/datasets/movielens/ • Ratings: UserID::MovieID::Rating::Timestamp • 10,000,000 ratings on 10,000 movies by 72,000 users • ratings.dat.gz • Movies: MovieID::Title::Genres • 10,000 movies • movies.dat
  • 66. Page66 © Hortonworks Inc. 2014 Step 1: Data Ingest • Some simple Python code followed by the creation of the first RDD
import sys
import os

baseDir = os.path.join('movielens')
ratingsFilename = os.path.join(baseDir, 'ratings.dat.gz')
moviesFilename = os.path.join(baseDir, 'movies.dat')
numPartitions = 2
rawRatings = sc.textFile(ratingsFilename).repartition(numPartitions)
rawMovies = sc.textFile(moviesFilename)
  • 67. Page67 © Hortonworks Inc. 2014 Step 2: Feature Extraction • Transform the string data into tuples of useful data and remove unwanted pieces • Ratings: UserID::MovieID::Rating::Timestamp 1::1193::5::978300760 1::661::3::978302109 => [(1, 1193, 5.0), (1, 914, 3.0), …] • Movies: MovieID::Title::Genres 1::Toy Story (1995)::Animation|Children's|Comedy 2::Jumanji (1995)::Adventure|Children's|Fantasy => [(1, 'Toy Story (1995)'), (2, u'Jumanji (1995)'), …]
  • 68. Page68 © Hortonworks Inc. 2014 Step 2: Feature Extraction
def get_ratings_tuple(entry):
    items = entry.split('::')
    return int(items[0]), int(items[1]), float(items[2])

def get_movie_tuple(entry):
    items = entry.split('::')
    return int(items[0]), items[1]

ratingsRDD = rawRatings.map(get_ratings_tuple).cache()
moviesRDD = rawMovies.map(get_movie_tuple).cache()
  • 69. Page69 © Hortonworks Inc. 2014 Step 2: Feature Extraction • Inspect RDD using collect() • Careful: make sure the whole dataset fits in the memory of the driver [diagram: driver job and executor tasks] • Use take(num) • Safer: takes a num-size subset print 'Ratings: %s' % ratingsRDD.take(2) Ratings: [(1, 1193, 5.0), (1, 914, 3.0)] print 'Movies: %s' % moviesRDD.take(2) Movies: [(1, u'Toy Story (1995)'), (2, u'Jumanji (1995)')]
  • 70. Page70 © Hortonworks Inc. 2014 Step 3: Create Model – The naïve approach • Recommend movies with the highest average rating • Need a tuple containing the movie name and its average rating • Only consider movies with at least 500 ratings • Tuple must contain the number of ratings for the movie • The tuple we need should be of the following form: ( averageRating, movieName, numberOfRatings )
  • 71. Page71 © Hortonworks Inc. 2014 Step 3: Create Model – The naïve approach • Calculate the average rating of a movie • From the ratingsRDD, we create tuples containing all the ratings for a movie: – Remember: ratingsRDD = (UserID, MovieID, Rating)
movieIDsWithRatingsRDD = (ratingsRDD
    .map(lambda (user_id, movie_id, rating): (movie_id, [rating]))
    .reduceByKey(lambda a, b: a + b))
• This is simple map-reduce in Spark: • Map: (UserID, MovieID, Rating) => (MovieID, [Rating]) • Reduce: (MovieID1, [Rating1]), (MovieID1, [Rating2]) => (MovieID1, [Rating1, Rating2])
  • 72. Page72 © Hortonworks Inc. 2014 Step 3: Create Model – The naïve approach • Next map the data to an RDD with average and number of ratings
def getCountsAndAverages(RatingsTuple):
    total = 0.0
    for rating in RatingsTuple[1]:
        total += rating
    return (RatingsTuple[0], (len(RatingsTuple[1]), total / len(RatingsTuple[1])))

movieIDsWithAvgRatingsRDD = movieIDsWithRatingsRDD.map(getCountsAndAverages)
• Note that the new key-value tuples have MovieID as key and a nested tuple (ratings, average) as value: [ (2, (332, 3.174698795180723) ), … ]
  • 73. Page73 © Hortonworks Inc. 2014 Step 3: Create Model – The naïve approach • Only the movie name is still missing from the tuple • The name of the movie was not present in the ratings data. It must be joined in from the movie data
movieNameWithAvgRatingsRDD = (moviesRDD
    .join(movieIDsWithAvgRatingsRDD)
    .map(lambda (movieid, (name, (ratings, average))): (average, name, ratings)))
• The join creates tuples that still contain the movieID and end up nested three deep: (Key, (Left_Value, Right_Value)) • A simple map() solves that problem and produces the tuple we need
  • 74. Page74 © Hortonworks Inc. 2014 Step 3: Create Model – The naïve approach • The RDD now contains tuples of the correct form print movieNameWithAvgRatingsRDD.take(3) [ (3.68181818181818, 'Happiest Millionaire, The (1967)', 22), (3.04682274247491, 'Grumpier Old Men (1995)', 299), (2.88297872340425, 'Hocus Pocus (1993)', 94) ]
  • 75. Page75 © Hortonworks Inc. 2014 Step 3: Create Model – The naïve approach • Now we can easily filter out all the movies with less than 500 ratings, sort the RDD by average rating and show the top 20
movieLimitedAndSortedByRatingRDD = (movieNameWithAvgRatingsRDD
    .filter(lambda (average, name, ratings): ratings > 500)
    .sortBy(sortFunction, ascending=False))
  • 76. Page76 © Hortonworks Inc. 2014 Step 3: Create Model – The naïve approach • sortFunction makes sure the tuples are sorted using both key and value, which ensures a consistent sort even if a key appears more than once
def sortFunction(tuple):
    key = unicode('%.3f' % tuple[0])
    value = tuple[1]
    return (key + ' ' + value)
  • 77. Page77 © Hortonworks Inc. 2014 Step 3: Create Model – The naïve approach print 'Movies with highest ratings: %s' % movieLimitedAndSortedByRatingRDD.take(20) Movies with highest ratings: [ … ]
  • 78. Page78 © Hortonworks Inc. 2014 Step 3: Create Model – Collaborative Filtering • The naïve approach will recommend the same movies to everybody, regardless of their personal preferences. • Collaborative Filtering will look for people with similar tastes and use their ratings to give recommendations fit to your personal preferences. Image from Wikipedia: https://en.wikipedia.org/wiki/Collaborative_filtering
  • 79. Page79 © Hortonworks Inc. 2014 Step 3: Create Model – Collaborative Filtering • We have a matrix where every row is the ratings for one user for all movies in the database. • Since every user did not rate every movie, this matrix is incomplete. • Predicting the missing ratings is exactly what we need to do in order to give the user good recommendations • The algorithm that is usually applied to solve recommendation problems is "Alternating Least Squares", which takes an iterative approach to finding the missing values in the matrix. • Spark's MLlib has a module for Alternating Least Squares recommendation, aptly called "ALS"
  • 80. Page80 © Hortonworks Inc. 2014 Step 3: Create Model – Collaborative Filtering • Machine Learning workflow [diagram: Full Dataset → Training Set / Validation Set / Test Set → Model → Prediction → Accuracy (over-fitting test)]
  • 81. Page81 © Hortonworks Inc. 2014 Step 3: Create Model – Collaborative Filtering • Randomly split the dataset we have into multiple groups for training, validating and testing using randomSplit(weights, seed=None)
trainingRDD, validationRDD, testRDD = ratingsRDD.randomSplit([6, 2, 2], seed=0L)
print 'Training: %s, validation: %s, test: %s\n' % (trainingRDD.count(), validationRDD.count(), testRDD.count())
Training: 292716, validation: 96902, test: 98032
  • 82. Page82 © Hortonworks Inc. 2014 Step 3: Create Model – Collaborative Filtering • Before we start training the model, we need a way to calculate how good a model is, so we can compare it against other tries • Root Mean Square Error (RMSE) is often used to compute the error of a model • RMSE compares the predicted values from the training set with the real values present in the validation set. By squaring the differences, averaging them, and taking the square root of that average, we get a single number that represents the error of the model
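In formula form, for the N (user, movie) pairs that appear in both the predicted and the actual RDDs:
RMSE = sqrt( (1/N) * Σ (predicted_i − actual_i)^2 )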
  • 83. Page83 © Hortonworks Inc. 2014 Step 3: Create Model – Collaborative Filtering
import math   # needed for math.pow / math.sqrt below

def computeError(predictedRDD, actualRDD):
    predictedReformattedRDD = (predictedRDD
        .map(lambda (UserID, MovieID, Rating): ((UserID, MovieID), Rating)))
    actualReformattedRDD = (actualRDD
        .map(lambda (UserID, MovieID, Rating): ((UserID, MovieID), Rating)))
    squaredErrorsRDD = (predictedReformattedRDD
        .join(actualReformattedRDD)
        .map(lambda (k, (a, b)): math.pow((a - b), 2)))
    totalError = squaredErrorsRDD.reduce(lambda a, b: a + b)
    numRatings = squaredErrorsRDD.count()
    return math.sqrt(float(totalError) / numRatings)
  • 84. Page84 © Hortonworks Inc. 2014 Step 3: Create Model – Collaborative Filtering • Create a trained model using the ALS.train() method from Spark MLlib • Rank is the most important parameter to tune • The number of latent factors used in the factor matrices • A lower rank will mean higher error; a rank that is too high may lead to overfitting
ALS.train(
    trainingRDD,
    rank,        # We'll try 3 ranks: 4, 8, 12
    seed=5L,
    iterations=5,
    lambda_=0.1)
  • 85. Page85 © Hortonworks Inc. 2014 Step 3: Create Model – Collaborative Filtering • Use the trained model to predict the missing ratings in the validation set • Create a new RDD from the validation set where the ratings are removed • Call the predictAll() method using the trained model on that RDD
validationForPredictRDD = validationRDD.map(lambda (UserID, MovieID, Rating): (UserID, MovieID))
predictedRatingsRDD = model.predictAll(validationForPredictRDD)
  • 86. Page86 © Hortonworks Inc. 2014 Step 3: Create Model – Collaborative Filtering • Finally use our computeError() method to calculate the error of our trained model by comparing the predicted ratings with the real ones error = computeError(predictedRatingsRDD, validationRDD)
  • 87. Page87 © Hortonworks Inc. 2014 Step 3: Create Model – Collaborative Filtering • Import the ALS module, create the "empty" validation RDD for prediction and set up some variables
from pyspark.mllib.recommendation import ALS

validationForPredictRDD = (validationRDD
    .map(lambda (UserID, MovieID, Rating): (UserID, MovieID)))
ranks = [4, 8, 12]
errors = [0, 0, 0]
err = 0
minError = float('inf')
bestRank = -1
bestIteration = -1
  • 88. Page88 © Hortonworks Inc. 2014 Step 3: Create Model – Collaborative Filtering
for rank in ranks:
    model = ALS.train(trainingRDD, rank, seed=5L, iterations=5, lambda_=0.1)
    predictedRatingsRDD = model.predictAll(validationForPredictRDD)
    error = computeError(predictedRatingsRDD, validationRDD)
    errors[err] = error
    err += 1
    print 'For rank %s the RMSE is %s' % (rank, error)
    if error < minError:
        minError = error
        bestRank = rank
  • 89. Page89 © Hortonworks Inc. 2014 Step 3: Create Model – Collaborative Filtering • The model that was trained with rank 8 has the lowest error (RMSE) print 'The best model was trained with rank %s' % bestRank For rank 4 the RMSE is 0.892734779484 For rank 8 the RMSE is 0.890121292255 For rank 12 the RMSE is 0.890216118367 The best model was trained with rank 8
  • 90. Page90 © Hortonworks Inc. 2014 Step 4: Test Model • So we have now found the best model, but we still need to test whether the model is actually good • Testing using the same validation set is not a good test since it may leave us vulnerable to overfitting • The model is so fit to the validation set that it only produces good results for that set • This is why we split off a test set at the start of the Machine Learning process • We will use the best rank we obtained to train a model and then predict the ratings for the test set • Calculating the RMSE for the test set predictions should tell us if our model is usable
  • 91. Page91 © Hortonworks Inc. 2014 Step 4: Test Model • We recreate the model, remove all the ratings present in the test set and run the predictAll() method
myModel = ALS.train(trainingRDD, 8, seed=5L, iterations=5, lambda_=0.1)
testForPredictingRDD = testRDD.map(lambda (UserID, MovieID, Rating): (UserID, MovieID))
predictedTestRDD = myModel.predictAll(testForPredictingRDD)
testRMSE = computeError(testRDD, predictedTestRDD)
  • 92. Page92 © Hortonworks Inc. 2014 Step 4: Test Model • The RMSE is good. Our model does not suffer from overfitting and is usable. • The RMSE of the validation set was 0.890121292255, only slightly better print 'The model had a RMSE on the test set of %s' % testRMSE The model had a RMSE on the test set of 0.891048561304
  • 93. Page93 © Hortonworks Inc. 2014 Step 5: Use the model • Let's get some movie predictions! • First I need to give the data set some ratings so it has something to deduce my taste
myRatedMovies = [
    # (UserID, MovieID, Rating)
    (0, 845, 5.0),   # Blade Runner (1982) - 5.0/5
    (0, 789, 4.5),   # Good Will Hunting (1997) - 4.5/5
    (0, 983, 4.8),   # Christmas Story, A (1983) - 4.8/5
    (0, 551, 2.0),   # Taxi Driver (1976) - 2.0/5
    (0, 1039, 2.0),  # Pulp Fiction (1994) - 2.0/5
    (0, 651, 5.0),   # Dr. Strangelove (1963) - 5.0/5
    (0, 1195, 4.0),  # Raiders of the Lost Ark (1981) - 4.0/5
    (0, 1110, 5.0),  # Sixth Sense, The (1999) - 4.5/5
    (0, 1250, 4.5),  # Matrix, The (1999) - 4.5/5
    (0, 1083, 4.0)   # Princess Bride, The (1987) - 4.0/5
]
myRatingsRDD = sc.parallelize(myRatedMovies)
  • 94. Page94 © Hortonworks Inc. 2014 Step 5: Use the model • Then we add my ratings to the data set • Since we now have more ratings, let's train our model again • And make sure the RMSE is still OK (re-using the test set RDDs from the previous step)
trainingWithMyRatingsRDD = myRatingsRDD.union(trainingRDD)
myRatingsModel = ALS.train(trainingWithMyRatingsRDD, 8, seed=5L, iterations=5, lambda_=0.1)
predictedTestMyRatingsRDD = myRatingsModel.predictAll(testForPredictingRDD)
testRMSEMyRatings = computeError(testRDD, predictedTestMyRatingsRDD)
  • 95. Page95 © Hortonworks Inc. 2014 Step 5: Use the model • And of course, check the RMSE again... We're good print 'The model had a RMSE on the test set of %s' % testRMSEMyRatings The model had a RMSE on the test set of 0.892023318284
  • 96. Page96 © Hortonworks Inc. 2014 Step 5: Use the model • Now we need an RDD with only the movies I did not rate, to run predictAll() on (my userid is set to zero) • [(0, movieID1), (0, movieID2), (0, movieID3), …]
myUnratedMoviesRDD = (moviesRDD
    .map(lambda (movieID, name): movieID)
    .filter(lambda movieID: movieID not in [mine[1] for mine in myRatedMovies])
    .map(lambda movieID: (0, movieID)))
predictedRatingsRDD = myRatingsModel.predictAll(myUnratedMoviesRDD)
  • 97. Page97 © Hortonworks Inc. 2014 Step 5: Use the model • From the predicted RDD, get the top 20 predicted ratings, but only for movies that had at least 75 ratings in total • Re-use the RDD we created in the naïve approach that had the average ratings and number of ratings (movieIDsWithAvgRatingsRDD) • Map it to tuples of form (movieID, number_of_ratings) • Strip the userid from the predicted RDD • Map it to tuples (movieID, predicted_rating) • Join those two and add the movie names from the original movies data and clean up the result • The resulting tuple is (predicted_rating, name, number_of_ratings) • Filter out all movies that had less than 75 ratings
  • 98. Page98 © Hortonworks Inc. 2014 Step 5: Use the model
movieCountsRDD = movieIDsWithAvgRatingsRDD.map(lambda (movie_id, (ratings, average)): (movie_id, ratings))
predictedRDD = predictedRatingsRDD.map(lambda (uid, movie_id, rating): (movie_id, rating))
predictedWithCountsRDD = (predictedRDD.join(movieCountsRDD))
ratingsWithNamesRDD = (predictedWithCountsRDD
    .join(moviesRDD)
    .map(lambda (movie_id, ((pred, ratings), name)): (pred, name, ratings))
    .filter(lambda (pred, name, ratings): ratings > 75))
  • 99. Page99 © Hortonworks Inc. 2014 Step 5: Use the model • And finally get the top 20 recommended movies for myself
predictedHighestRatedMovies = ratingsWithNamesRDD.takeOrdered(20, key=lambda x: -x[0])
print ('My highest rated movies as predicted:\n%s' % '\n'.join(map(str, predictedHighestRatedMovies)))
  • 100. Page100 © Hortonworks Inc. 2014 Step 5: Use the model My highest rated movies as predicted: (4.823536053603062, 'Once Upon a Time in the West (1969)', 82) (4.743456934724456, 'Texas Chainsaw Massacre, The (1974)', 111) (4.452221024980805, 'Evil Dead II (Dead By Dawn) (1987)', 305) (4.387531237859994, 'Duck Soup (1933)', 279) (4.373821653377477, 'Citizen Kane (1941)', 527) (4.344480264132989, 'Cabin Boy (1994)', 95) (4.332264360095111, 'Shaft (1971)', 85) (4.217371529794628, 'Night of the Living Dead (1968)', 352) (4.181318251399025, 'Yojimbo (1961)', 110) (4.171790272807383, 'Naked Gun: From the Files of Police Squad', 435) …
  • 101. Apache Spark on HDP 2.3 [diagram labels: User, Submit Rating, SparkShell, Train Model, Persist, Predict, Recommendation, Improve Model] © Hortonworks Inc. 2011 – 2015. All Rights Reserved
  • 102. Page102 © Hortonworks Inc. 2014 Conclusion and Q&A
  • 103. Page103 © Hortonworks Inc. 2014 Learn More Spark + Hadoop Perfect Together HDP Spark General Info: http://hortonworks.com/hadoop/spark/ Learn more about our Focus on Spark: http://hortonworks.com/hadoop/spark/#section_6 Get the HDP Spark 1.5.1 Tech Preview: http://hortonworks.com/hadoop/spark/#section_5 Get started with Spark and Zeppelin and download the Sandbox: http://hortonworks.com/sandbox Try these tutorials: http://hortonworks.com/hadoop/spark/#tutorials http://hortonworks.com/hadoop-tutorial/apache-spark-1-5-1-technical-preview-with-hdp-2-3/ Learn more about GeoSpatial Spark processing with Magellan: http://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/