Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
1. Spark ETL
How to create an optimal daily fantasy baseball roster.
Chicago Hadoop Users Group
May 12, 2015
Don Drake
don@drakeconsulting.com
@dondrake
2. Overview
• Who am I?
• ETL
• Daily Fantasy Baseball
• Spark
• Data Flow
• Extracting - web crawler
• Parquet
• DataFrames
• Python or Scala?
• Transforming - RDDs
• Transforming - Moving Average
4. Who Am I?
• Don Drake @dondrake
• Currently: Principal Big Data Consultant @ Allstate
• 5 years consulting on Hadoop
• Independent consultant for last 14 years
• Previous clients:
• Navteq / Nokia
• Sprint
• Mobile Meridian - Co-Founder
• MailLaunder.com - my SaaS anti-spam service
• cars.com
• Tribune Media Services
• Family Video
• Museum of Science and Industry
5. ETL
• Informally: Any repeatable programmed data movement
• Extraction - Get data from another source
• Oracle/PostgreSQL
• CSV
• Web crawler
• Transform - Normalize formatting of phone #’s, addresses
• Create surrogate keys
• Joining data sources
• Aggregate
• Load - Deliver the data to its destination
• Load into data warehouse
• CSV
• Send to predictive model
6. Spark
• Apache Spark is a fast and general purpose engine for large-scale data
processing.
• It provides high-level APIs in Java, Scala, and Python (REPLs for Scala and Python)
• Includes an advanced DAG execution engine that supports in-memory
computing
• RDD - Resilient Distributed Dataset — Core construct of the framework
• Includes a set of high-level tools including Spark SQL for SQL and
structured data processing, MLlib for machine learning, GraphX for graph
processing and Spark Streaming
• Can run in a cluster (Hadoop (YARN), EC2, Mesos), Standalone, Local
• Open Source, core committers from Databricks
• Latest version is 1.3.1, which includes DataFrames
• Started in 2009 (AMPLab) as research project, Apache project since 2013.
• LOTS of momentum.
9. Spark 101 - Execution
• Driver - your program’s main() method
• Only 1 per application
• Executors - do the distributed work
• As many as your cluster can handle
• You determine ahead of time how many
• You determine the amount of RAM required
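As a minimal sketch of setting those sizes up front (the application name and numbers are made up, and spark.executor.instances assumes a YARN cluster):

from pyspark import SparkConf, SparkContext

# Hypothetical settings -- tune for your own cluster
conf = (SparkConf()
        .setAppName("fantasy-baseball-etl")
        .set("spark.executor.instances", "10")  # how many executors
        .set("spark.executor.memory", "4g")     # RAM per executor
        .set("spark.executor.cores", "2"))      # cores per executor
sc = SparkContext(conf=conf)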
10. Spark 101 - RDD
• RDD - Resilient Distributed Dataset
• Can be created from Hadoop Input formats (text
file, sequence file, parquet file, HBase, etc.) OR by
transforming other RDDs.
• RDDs have actions and transformations which
return pointers to new RDDs
• RDDs can contain anything
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
scala> textFile.count() // Number of items in this RDD
res0: Long = 126
scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
15. Spark 101 - Lazy RDDs
• RDDs are evaluated *only* when an action is called on them.
• This allows multiple transformations on an RDD, letting Spark compute an optimal execution plan.
• Uncached RDDs are re-evaluated *every* time an action is called on them.
• Cache RDDs if you know you will iterate over an RDD more than once.
• You can cache to RAM + disk; by default, persisting uses RAM only.
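A small sketch of lazy evaluation plus caching (assuming the PySpark shell's sc and a local README.md):

from pyspark import StorageLevel

# textFile() and filter() are lazy transformations -- nothing runs yet
lines = sc.textFile("README.md")
spark_lines = lines.filter(lambda l: "Spark" in l)

# Persist to RAM + disk so repeated actions don't re-read the file
spark_lines.persist(StorageLevel.MEMORY_AND_DISK)

print spark_lines.count()  # first action: triggers the computation
print spark_lines.first()  # second action: served from the cache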
19. Extraction
• We have seen it’s easy to create a parallelized data
structure that we can execute code against
• Pro-tip: If extracting from relational database, use
Sqoop + save as Parquet
• We need to download a set of files for each
baseball game previously played (player statistics)
• TODO List Pattern
• We know the filenames to download for each
game (they are static)
• Download all files for a game in parallel
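A minimal sketch of the TODO list pattern (the URLs and partition count are placeholders, and urllib2 assumes Python 2): parallelize the known filenames, then let the executors fetch them.

import urllib2

def download(url):
    # Each executor fetches one file; for real data you would write it
    # out to HDFS/S3 from the executor instead of collecting it
    return (url, urllib2.urlopen(url).read())

# The pre-built "TODO list" of static filenames for each game
game_urls = ["http://example.com/game1/boxscore.xml",
             "http://example.com/game1/players.xml"]

results = sc.parallelize(game_urls, 20).map(download).collect()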
26. Using DataFrames - SparkSQL
• Previously called SchemaRDD; DataFrames now contain extra functionality to query/filter
• Allow you to write SQL (joins, filters, etc.) against DataFrames (or RDDs with a little effort)
• All DataFrames contain a schema
batter_mov_avg = sqlContext.parquetFile(self.rddDir + "/" +
    "batter_moving_averages.parquet")
batter_mov_avg.registerTempTable("bma")
batter_mov_avg.persist(storageLevel=StorageLevel.MEMORY_AND_DISK)
print "batter_mov_avg=", batter_mov_avg.take(2)

batter_games = sqlContext.sql("""select * from games g, bma, game_players gp
    where bma.player_id = gp.player_id
    and bma.game_date = gp.game_date
    and gp.game_id = g.game_id
    and g.game_date = gp.game_date
    and g.game_date = bma.game_date""")
27. DataFrames to the rescue
• DataFrames offer a DSL for distributed data manipulation
• In Python, you can convert a DataFrame to a pandas data frame and vice versa
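A quick sketch of that round trip, assuming the Spark 1.3-era Python API and a result small enough to collect:

# Spark DataFrame -> pandas data frame (rows are collected to the driver)
pdf = batter_mov_avg.limit(1000).toPandas()

# pandas data frame -> Spark DataFrame
spark_df = sqlContext.createDataFrame(pdf)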
28. More DataFrames
• We can select columns from a DataFrame
• Run our own transform on it with map()
• The transform function gets a Row() object and returns one
• The toDF() function will infer data types for you.
• https://issues.apache.org/jira/browse/SPARK-7182
batting_features = batter_games.select(*unique_cols)
print "batting_features=", batting_features.schema
#print "batting_features=", batting_features.show()

def transformBatters(row_object):
    row = row_object.asDict()
    row = commonTransform(row)
    return Row(**row)

batting_features = batting_features.map(transformBatters).toDF()
batting_features.persist(storageLevel=StorageLevel.MEMORY_AND_DISK)

self.rmtree(self.rddDir + "/" + "batting_features.parquet")
batting_features.saveAsParquetFile(self.rddDir + "/" +
    "batting_features.parquet")
30. AbstractDF
• Download: https://gist.github.com/dondrake/c7fcf42cf051492fdd91
• AbstractDF adds 3 major features to the class
• Creates a Python object with attributes based on the field names defined in the schema
• e.g. g = Game(); g.game_id = '123'
• Exposes a helper function to create a Row object containing all of the fields in the schema, stored with the correct data types.
• Needed so Scala can correctly infer the data types of the values sent.
• Provides a method that returns the list of columns to use in a select statement.
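This is not the gist itself, just a stripped-down sketch of the same idea (class layout and field names here are illustrative only):

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, DateType

class AbstractDF(object):
    schema = StructType([])

    def __init__(self):
        # Feature 1: attributes named after the fields in the schema
        for field in self.schema.fields:
            setattr(self, field.name, None)

    def createRow(self):
        # Feature 2: a Row with every schema field (type casting elided here),
        # so Scala can infer the data types correctly
        values = dict((f.name, getattr(self, f.name)) for f in self.schema.fields)
        return Row(**values)

    @classmethod
    def columns(cls):
        # Feature 3: the list of columns to use in a select statement
        return [f.name for f in cls.schema.fields]

class Game(AbstractDF):
    schema = StructType([
        StructField("game_id", StringType(), True),
        StructField("game_date", DateType(), True),
    ])

g = Game()
g.game_id = '123'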
32. Python or Scala???
• Use Python for prototyping
• Access to pandas, scikit-learn, etc.
• Python is strongly typed (but not statically typed)
• Spark Python API support lags, not as popular as Scala. Bugs might
not get fixed right away or at all.
• Python is slower due to the required gateway: all data must be serialized and sent through the gateway to/from Scala (not as bad for DataFrames)
• Use Scala for application development
• Scala learning curve is steep.
• Functional Programming learning curve is steep.
• Scala is a statically typed language
• Java was intentionally left off
• Don’t bother with Java 7 (IMO)
35. How do we compute a 5-day MA for batting average?
• For an individual player, on a particular date, we need the previous 5 days' batting average.
• Please note: not all players play every day.
• To build a predictive model, we need historical data, so we would need to calculate this for each day, every player, for (possibly) each metric.
• We would also want to use different moving average durations (e.g. 5-, 7-, and 14-day moving averages) for each player-day.
• Our compute space just got big and is also
embarrassingly parallel
36. Some Stats
• load table (2 full years + this year so far ~17%):
• gamePlayers = sqlContext.parquetFile(rddDir + 'game_players.parquet').cache()
• # Hitters:
• gamePlayers.filter(gamePlayers.fd_position != 'P').count()
• 248856L (# of hitter-gamedate combinations)
• # Pitchers
• gamePlayers.filter(gamePlayers.fd_position == 'P').count()
• 244926L
• Season-level moving average would require about 248,000 *
(162/2) rows of data (~20 million rows)
37. MapReduce is dead.
Long live MapReduce.
• Spark has a set of transformations that can be performed on RDDs of key/value pairs.
• Transformations
• groupByKey([numTasks])
• reduceByKey(func, [numTasks])
• aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
• sortByKey([ascending], [numTasks])
• join(otherDataset, [numTasks])
• cogroup(otherDataset, [numTasks])
• Actions
• reduce(func)
• countByKey()
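As a toy sketch of the key/value style (numbers made up):

# (player_id, batting_average) pairs from several games
stats = sc.parallelize([("a123", 0.250), ("a123", 0.333), ("b456", 0.200)])

# groupByKey: all values for a key end up together (more shuffling)
grouped = stats.groupByKey().mapValues(list)

# reduceByKey: values are combined per key as they shuffle (less data moved)
totals = stats.reduceByKey(lambda x, y: x + y)

counts = stats.countByKey()  # action: {'a123': 2, 'b456': 1}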
38. Broadcast Map-side join
• “Broadcast” a dictionary of lists of game_dates, where the keys of the dictionary are the year the games took place
• A flatMap operation loops over the broadcast game_dates for the year the stats took place
• We use flatMap because we emit many rows from a single input (7-day, 14-day, etc.)
• Key: player_id + asOfDate + moving_length
• Value: Game Stats dictionary from input
• Since this generates a lot of output, I
repartition(50) the output of flatMap
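A minimal sketch of that flatMap (game_stats, the field names, and the window lengths are simplified stand-ins for the real job):

from datetime import timedelta

# game_dates_by_year: {2013: [date, date, ...], ...} -- small enough to broadcast
bc_dates = sc.broadcast(game_dates_by_year)

def explode_stats(stat_row):
    out = []
    for as_of in bc_dates.value[stat_row["game_date"].year]:
        for length in (5, 7, 14):
            # emit only if this game falls inside the trailing window
            if as_of - timedelta(days=length) <= stat_row["game_date"] < as_of:
                key = (stat_row["player_id"], as_of, length)
                out.append((key, stat_row))
    return out

exploded = game_stats.flatMap(explode_stats).repartition(50)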
39. Reduce
• Now call groupByKey()
• This creates an RDD with a list of values for each key. These values are the stats to average over.
• If performing an aggregation, prefer reduceByKey; less shuffling is involved
• Calculate the average (and stddev, etc.)
• Emit a new key of player_id + asOfDate
• Value is the dictionary of all moving average fields
• Run reduceByKey() to combine (concatenate) the 7-day, 14-day metrics into a single row
• Save results as Parquet, will be joined with other features to
create a predictive model
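A sketch of that reduce side, continuing from the exploded RDD above (metric names are illustrative):

def averages(key_and_rows):
    (player_id, as_of, length), rows = key_and_rows
    rows = list(rows)
    avg = sum(r["batting_avg"] for r in rows) / float(len(rows))
    # new key drops the window length; the value carries window-specific fields
    return ((player_id, as_of), {"ba_%dday" % length: avg})

def merge(d1, d2):
    combined = dict(d1)
    combined.update(d2)
    return combined

moving_avgs = (exploded
               .groupByKey()
               .map(averages)
               .reduceByKey(merge))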