Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
1.
Spark ETL
How to create an optimal daily fantasy baseball roster.
Chicago Hadoop Users Group
May 12, 2015
Don Drake
don@drakeconsulting.com
@dondrake
2.
Overview
• Who am I?
• ETL
• Daily Fantasy Baseball
• Spark
• Data Flow
• Extracting - web crawler
• Parquet
• DataFrames
• Python or Scala?
• Transforming - RDDs
• Transforming - Moving Average
3.
Who Am I?
• Don Drake @dondrake
• Currently: Principal Big Data Consultant @ Allstate
• 5 years consulting on Hadoop
• Independent consultant for last 14 years
• Previous clients:
• Navteq / Nokia
• Sprint
• Mobile Meridian - Co-Founder
• MailLaunder.com - my SaaS anti-spam service
• cars.com
• Tribune Media Services
• Family Video
• Museum of Science and Industry
4.
ETL
• Informally: Any repeatable programmed data movement
• Extraction - Get data from another source
• Oracle/PostgreSQL
• CSV
• Web crawler
• Transform - Normalize formatting of phone #’s, addresses
• Create surrogate keys
• Joining data sources
• Aggregate
• Load
• Load into data warehouse
• CSV
• Sent to predictive model
5.
Spark
• Apache Spark is a fast and general purpose engine for large-scale data
processing.
• It provides high-level APIs in Java, Scala, and Python (with REPLs for Scala and Python)
• Includes an advanced DAG execution engine that supports in-memory
computing
• RDD - Resilient Distributed Dataset - the core construct of the framework
• Includes a set of high-level tools: Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming
• Can run in a cluster (Hadoop (YARN), EC2, Mesos), Standalone, Local
• Open Source; core committers from Databricks
• Latest version is 1.3.1, which includes DataFrames
• Started in 2009 as a research project at AMPLab; an Apache project since 2013.
• LOTS of momentum.
7.
Spark 101 - Execution
• Driver - your program’s main() method
• Only 1 per application
• Executors - do the distributed work
• As many as your cluster can handle
• You determine ahead of time how many executors to request
• You determine the amount of RAM each executor requires (a configuration sketch follows)
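A minimal configuration sketch, assuming a YARN cluster; the application name and sizing values below are placeholders, not figures from the talk:
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("fantasy-etl")               # hypothetical app name
        .set("spark.executor.instances", "10")   # how many executors to request
        .set("spark.executor.memory", "4g")      # RAM per executor
        .set("spark.executor.cores", "2"))       # cores per executor
sc = SparkContext(conf=conf)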
8.
Spark 101 - RDD
• RDD - Resilient Distributed Dataset
• Can be created from Hadoop Input formats (text
file, sequence file, parquet file, HBase, etc.) OR by
transforming other RDDs.
• RDDs have transformations, which return pointers to new RDDs, and actions, which return values to the driver
• RDDs can contain any type of object
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
scala> textFile.count() // Number of items in this RDD
res0: Long = 126
scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
12.
Spark 101 - Lazy RDDs
• RDDs are evaluated *only* when an action is called on them.
• This allows multiple transformations to be chained on an RDD, letting Spark compute an optimal execution plan.
• Uncached RDDs are re-evaluated *every* time an action is called on them.
• Cache an RDD if you know you will iterate over it more than once (see the sketch below).
• You can cache to RAM + disk; by default, persistence is to RAM only.
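For example, a minimal caching sketch (the file path is a placeholder):
errors = sc.textFile("events.txt").filter(lambda l: "ERROR" in l)
errors.cache()              # or errors.persist(StorageLevel.MEMORY_AND_DISK)
print errors.count()        # 1st action: reads the file and runs the filter
print errors.take(5)        # 2nd action: served from the cached partitions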
15.
Extraction
• We have seen it’s easy to create a parallelized data
structure that we can execute code against
• Pro-tip: if extracting from a relational database, use Sqoop and save as Parquet
• We need to download a set of files for each
baseball game previously played (player statistics)
• TODO List Pattern
• We know the filenames to download for each
game (they are static)
• Download all files for a game in parallel (sketched below)
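A hedged sketch of the TODO-list pattern: parallelize the known URL list and let the executors do the downloading. build_urls() and save_raw() are hypothetical helpers, not code from the talk.
import urllib2

def fetch(url):
    # runs on the executors; each task downloads one file
    data = urllib2.urlopen(url).read()
    save_raw(url, data)              # hypothetical: write to HDFS/staging area
    return (url, len(data))

todo = build_urls(game_dates)        # hypothetical: one URL per (game, file)
results = sc.parallelize(todo, numSlices=50).map(fetch).collect()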
21.
Using DataFrames - SparkSQL
• Previously called SchemaRDD; DataFrames now contain extra functionality to query and filter
• Allow you to write SQL (joins, filters, etc.) against DataFrames (or RDDs with a little effort)
• All DataFrames contain a schema
batter_mov_avg = sqlContext.parquetFile(self.rddDir + "/" +
    "batter_moving_averages.parquet")
batter_mov_avg.registerTempTable("bma")
batter_mov_avg.persist(storageLevel=StorageLevel.MEMORY_AND_DISK)
print "batter_mov_avg=", batter_mov_avg.take(2)

batter_games = sqlContext.sql("""select * from games g, bma, game_players gp
    where bma.player_id = gp.player_id
    and bma.game_date = gp.game_date
    and gp.game_id = g.game_id
    and g.game_date = gp.game_date
    and g.game_date = bma.game_date""")
22.
DataFrames to the rescue
• DataFrames offer a DSL for distributed data manipulation
• In Python, you can convert a DataFrame to a pandas DataFrame and vice versa (example below)
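A minimal sketch of the round trip (Spark 1.3 API; the derived column names are placeholders, not fields from the talk's data):
pdf = batter_mov_avg.toPandas()                 # Spark DataFrame -> pandas
pdf["obp_minus_avg"] = pdf["obp"] - pdf["avg"]  # placeholder column names
spark_df = sqlContext.createDataFrame(pdf)      # pandas -> Spark DataFrame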
23.
More DataFrames
• We can select columns from a DataFrame
• Run our own transform on it with map()
• The transform function gets a Row() object and returns one
• The toDF() function will infer data types for you.
• https://issues.apache.org/jira/browse/SPARK-7182
batting_features = batter_games.select(*unique_cols)
print "batting_features=", batting_features.schema
#print "batting_features=", batting_features.show()

def transformBatters(row_object):
    row = row_object.asDict()
    row = commonTransform(row)
    return Row(**row)

batting_features = batting_features.map(transformBatters).toDF()
batting_features.persist(storageLevel=StorageLevel.MEMORY_AND_DISK)

self.rmtree(self.rddDir + "/" + "batting_features.parquet")
batting_features.saveAsParquetFile(self.rddDir + "/" +
    "batting_features.parquet")
25.
AbstractDF
• Download: https://gist.github.com/dondrake/c7fcf42cf051492fdd91
• AbstractDF adds 3 major features to a class:
• Creates a Python object with attributes based on the field names defined in the schema
• e.g. g = Game(); g.game_id = ‘123’
• Exposes a helper function to create a Row object containing all of the fields in the schema, stored with the correct data types.
• Needed so Scala can correctly infer the data types of the values sent.
• Provides a method that returns a list of columns to use in a select statement (a rough sketch follows).
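A rough sketch of those three features; this is an illustration only, not the actual gist, and the schema fields are examples:
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, DateType

class Game(object):
    # example schema; the real class reads its own schema definition
    schema = StructType([StructField("game_id", StringType()),
                         StructField("game_date", DateType())])

    def __init__(self):
        # 1) one attribute per field name defined in the schema
        for f in self.schema.fields:
            setattr(self, f.name, None)

    def createRow(self):
        # 2) a Row with every schema field, so data types can be inferred correctly
        return Row(**dict((f.name, getattr(self, f.name)) for f in self.schema.fields))

    @classmethod
    def getColumns(cls):
        # 3) column names to plug into a select statement
        return [f.name for f in cls.schema.fields]

g = Game()
g.game_id = '123'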
27.
Python or Scala???
• Use Python for prototyping
• Access to pandas, scikit-learn, etc.
• Python is strongly typed (but not statically typed)
• Spark’s Python API support lags and is not as popular as Scala’s; bugs might not get fixed right away, or at all.
• Python is slower because of the required gateway: all data must be serialized and sent through the gateway to and from the JVM (not as bad for DataFrames)
• Use Scala for application development
• Scala learning curve is steep.
• Functional Programming learning curve is steep.
• Scala is a statically typed language
• Java was intentionally left off
• Don’t bother with Java 7 (IMO)
30.
How do we compute a 5-day MA for batting average?
• For an individual player, on a particular date, we need the previous 5 days’ batting average.
• Please note: not all players play every day.
• To build a predictive model, we need historical data, so
we would need to calculate this for each day, every
player, for (possibly) each metric.
• We would also want different moving-average durations (e.g. 5-, 7-, and 14-day moving averages) for each player-day.
• Our compute space just got big, and it is also embarrassingly parallel (a toy illustration follows).
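A toy, single-player illustration of the definition in plain Python (the game_log layout is hypothetical; the distributed version follows on the next slides):
import datetime

def moving_average(game_log, as_of_date, days=5):
    # game_log: list of dicts like {"date": date, "avg": batting_average}
    window = [g["avg"] for g in game_log
              if as_of_date - datetime.timedelta(days=days) <= g["date"] < as_of_date]
    return sum(window) / len(window) if window else None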
31.
Some Stats
• Load table (2 full years + this year so far, ~17%):
• gamePlayers = sqlContext.parquetFile(rddDir + 'game_players.parquet').cache()
• # Hitters:
• gamePlayers.filter(gamePlayers.fd_position != 'P').count()
• 248856L (# of hitter-gamedate combinations)
• # Pitchers:
• gamePlayers.filter(gamePlayers.fd_position == 'P').count()
• 244926L
• A season-level moving average would require about 248,000 * (162/2) rows of data (~20 million rows)
32.
MapReduce is dead.
Long live MapReduce.
• Spark has a set of transformations that operate on RDDs of key/value pairs (a small example follows the lists below).
• Transformations
• groupByKey([numTasks])
• reduceByKey(func, [numTasks])
• aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
• sortByKey([ascending], [numTasks])
• join(otherDataset, [numTasks])
• cogroup(otherDataset, [numTasks])
• Actions
• reduce(func)
• countByKey()
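A small example of the key/value API on a toy pair RDD:
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

pairs.reduceByKey(lambda x, y: x + y).collect()   # [('a', 4), ('b', 2)]
pairs.groupByKey().mapValues(list).collect()      # [('a', [1, 3]), ('b', [2])]
pairs.countByKey()                                # {'a': 2, 'b': 1}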
33.
Broadcast Map-side join
• “Broadcast” a dictionary of lists of game_dates, keyed by the year the game took place.
• A flatMap operation loops over the broadcast game_dates for the dates on which the stats took place.
• We use flatMap because we emit many rows from a single input (7-day, 14-day, etc.)
• Key: player_id + asOfDate + moving_length
• Value: game stats dictionary from the input
• Since this generates a lot of output, I repartition(50) the output of flatMap (see the sketch below).
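A hedged sketch of that broadcast + flatMap step; game_dates_by_year, game_stats, the layout of the stats dictionary, and the window lengths are assumptions, not code from the talk:
bc_dates = sc.broadcast(game_dates_by_year)       # {year: [game_date, ...]}

def explode(stats):
    # stats: one player's line for one game, as a dict (assumed layout)
    out = []
    for as_of_date in bc_dates.value[stats["year"]]:
        if stats["game_date"] >= as_of_date:
            continue                              # only feed later as-of dates
        for length in (5, 7, 14):                 # example window lengths
            if (as_of_date - stats["game_date"]).days <= length:
                out.append(((stats["player_id"], as_of_date, length), stats))
    return out

keyed = game_stats.flatMap(explode).repartition(50)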
34.
Reduce
• Now call groupByKey()
• This creates an RDD with a list of values for each key; these values are the stats to average.
• If performing an aggregation, use reduceByKey instead, as less shuffling is involved.
• Calculate the average (and stddev, etc.)
• Emit a new key of player_id + asOfDate
• Value is the dictionary of all moving-average fields
• Run reduceByKey() to combine (concatenate) the 7-day, 14-day, etc. metrics into a single row
• Save the results as Parquet; they will be joined with other features to create a predictive model (a sketch follows).
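A hedged sketch of this reduce side, continuing from the flatMap sketch above (the stat field names are assumptions):
def averages(kv):
    (player_id, as_of_date, length), stats_list = kv
    stats_list = list(stats_list)
    avg = sum(s["batting_avg"] for s in stats_list) / float(len(stats_list))
    return ((player_id, as_of_date), {"avg_%dday" % length: avg})

def merge(a, b):
    # concatenate the 5/7/14-day dictionaries for the same player/date
    merged = dict(a)
    merged.update(b)
    return merged

moving_avgs = keyed.groupByKey().map(averages).reduceByKey(merge)
The resulting rows can then be converted to a DataFrame and saved as Parquet, as on the earlier slides.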