Spark ETL
How to create an optimal daily fantasy baseball roster.
Chicago Hadoop Users Group
May 12, 2015
Don Drake
don@drakeconsulting.com
@dondrake
Overview
• Who am I?
• ETL
• Daily Fantasy Baseball
• Spark
• Data Flow
• Extracting - web crawler
• Parquet
• DataFrames
• Python or Scala?
• Transforming - RDDs
• Transforming - Moving Average
Who Am I?
• Don Drake @dondrake
• Currently: Principal Big Data Consultant @ Allstate
• 5 years consulting on Hadoop
• Independent consultant for last 14 years
• Previous clients:
• Navteq / Nokia
• Sprint
• Mobile Meridian - Co-Founder
• MailLaunder.com - my SaaS anti-spam service
• cars.com
• Tribune Media Services
• Family Video
• Museum of Science and Industry
ETL
• Informally: Any repeatable programmed data movement
• Extraction - Get data from another source
• Oracle/PostgreSQL
• CSV
• Web crawler
• Transform - Normalize formatting of phone #’s, addresses
• Create surrogate keys
• Joining data sources
• Aggregate
• Load - Deliver the result to its destination
• Load into data warehouse
• CSV
• Sent to predictive model
Spark
• Apache Spark is a fast and general-purpose engine for large-scale data processing.
• It provides high-level APIs in Java, Scala, and Python (REPLs for Scala and Python)
• Includes an advanced DAG execution engine that supports in-memory
computing
• RDD - Resilient Distributed Dataset — Core construct of the framework
• Includes a set of high-level tools including Spark SQL for SQL and
structured data processing, MLlib for machine learning, GraphX for graph
processing and Spark Streaming
• Can run in a cluster (Hadoop (YARN), EC2, Mesos), Standalone, Local
• Open Source, core committers from Databricks
• Latest version is 1.3.1, which includes DataFrames
• Started in 2009 (AMPLab) as research project, Apache project since 2013.
• LOTS of momentum.
Daily Fantasy Baseball
Spark 101 - Execution
• Driver - your program’s main() method
• Only 1 per application
• Executors - do the distributed work
• As many as your cluster can handle
• You determine ahead of time how many
• You determine the amount of RAM required
Spark 101 - RDD
• RDD - Resilient Distributed Dataset
• Can be created from Hadoop Input formats (text
file, sequence file, parquet file, HBase, etc.) OR by
transforming other RDDs.
• RDDs support transformations (which return new RDDs) and actions (which return values to the driver)
• RDDs can contain anything
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

scala> textFile.count() // Number of items in this RDD
res0: Long = 126

scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
DEMO
Spark 101 - DEMO
#!/bin/env python

from pyspark import SparkContext

sc = SparkContext()

# create some numbers
nums = sc.parallelize(xrange(1000))

# how many partitions was the data split across?
nums.getNumPartitions()

# action: count the elements
nums.count()
Spark 101 - Lazy Demo
#!/bin/env python

from pyspark import SparkContext

sc = SparkContext()

# create some numbers
nums = sc.parallelize(xrange(1000))

# accumulator counts how many times isEven actually runs
check = sc.accumulator(0)

def isEven(x):
    check.add(1)
    return x % 2 == 0

# filter() is a transformation -- nothing executes yet
evens = nums.filter(isEven)

# each action re-evaluates the uncached RDD
evens.count()
evens.collect()

# check is ~2000, not 1000: the filter ran once per action
check
Spark 101 - Lazy RDDs
• RDDs are evaluated *only* when an action is called on them.
• This lets Spark see multiple transformations on an RDD and compute an optimal execution plan.
• Uncached RDDs are re-evaluated *every* time an action is called on them.
• Cache RDDs if you know you will iterate over them more than once.
• You can persist to RAM + disk; by default, caching persists to RAM only (see the sketch below).
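As a minimal sketch of that last point (reusing nums and isEven from the lazy demo above):

from pyspark import StorageLevel

evens = nums.filter(isEven)

# persist to RAM + disk; plain .cache() keeps it in RAM only
evens.persist(StorageLevel.MEMORY_AND_DISK)

evens.count()    # first action: computes the RDD and caches it
evens.collect()  # second action: served from the cache, isEven is not re-run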
Data Flow
Overview
Extraction & Ingestion
Extraction
• We have seen it’s easy to create a parallelized data
structure that we can execute code against
• Pro-tip: if extracting from a relational database, use Sqoop + save as Parquet
• We need to download a set of files for each
baseball game previously played (player statistics)
• TODO List Pattern
• We know the filenames to download for each
game (they are static)
• Download all files for a game in parallel
def getFiles(game):
    session = requests.Session()
    files = scrape.get_files([game], session=session)
    count, fails = scrape.download(files, DownloadMLB.cache)
    return (count, fails)

def summarize(a, x):
    total = a[0] + x[0]
    alist = a[1]
    alist.extend(x[1])
    return (total, alist)

def run(self):
    sc = SparkContext()
    start_scrape = datetime.now()
    begin, begin_parts = scrape.get_boundary(self.begin)
    end, end_parts = scrape.get_boundary(self.end)

    session = requests.Session()

    all_years_months_days = self.getYearsMonths(self.WEB_ROOT, session)
    games = scrape.get_games(all_years_months_days, session=session)

    gamesRDD = sc.parallelize(games)
    print "gamesRDD=", gamesRDD

    gamesRDD.foreach(dump)
    print "# partitions:", gamesRDD.getNumPartitions()
    print "count=", gamesRDD.count()

    # download each game's files in parallel, then combine the results
    res = gamesRDD.map(getFiles).reduce(summarize)
    print "res=", res

    count = res[0]
    fails = res[1]
    end_scrape = datetime.now()
    self.log.info("%d files downloaded in %s", count,
                  str(end_scrape - start_scrape))
    if fails:
        for url in fails:
            self.log.error("failed to download %s", url)

    sc.stop()
TODO DEMO
Butter… Butter… Parquet?
• Features
• Interoperability - Spark, Impala, Sqoop, much more
• Space efficient
• Query efficient
• Schema - (field name, data type, nullable)
• Columnar Format
• Different Encoding (Compression) Algorithms
• Delta Encoding - diffs per row per column
• Dictionary Encoding (~60k items), e.g. dates, IP addresses
• Run Length Encoding - (for repetitive data)
• Example: http://blog.cloudera.com/blog/2014/05/using-impala-at-scale-at-allstate/
http://parquet.apache.org/presentations/
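A minimal PySpark 1.3 sketch of the Parquet round trip (the path and toy rows are illustrative, not from the talk):

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext()
sqlContext = SQLContext(sc)

# build a tiny DataFrame from an RDD of Rows (schema is inferred)
rows = sc.parallelize([Row(player_id="p123", avg=0.310),
                       Row(player_id="p456", avg=0.275)])
df = sqlContext.createDataFrame(rows)

# write: field names, data types, and nullability travel with the file
df.saveAsParquetFile("/tmp/batters.parquet")

# read it back -- no schema declaration needed
batters = sqlContext.parquetFile("/tmp/batters.parquet")
batters.printSchema()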
Using DataFrames - Spark SQL
• Previously called SchemaRDD; DataFrames contain extra functionality to query/filter
• Allows you to write SQL (joins, filters, etc.) against DataFrames (or RDDs with a little effort)
• All DataFrames contain a schema
batter_mov_avg = sqlContext.parquetFile(self.rddDir + "/" +
    "batter_moving_averages.parquet")
batter_mov_avg.registerTempTable("bma")
batter_mov_avg.persist(storageLevel=StorageLevel.MEMORY_AND_DISK)
print "batter_mov_avg=", batter_mov_avg.take(2)

batter_games = sqlContext.sql("""select * from games g, bma, game_players gp
    where bma.player_id = gp.player_id
    and bma.game_date = gp.game_date
    and gp.game_id = g.game_id
    and g.game_date = gp.game_date
    and g.game_date = bma.game_date""")
DataFrames to the rescue
• DataFrames offer a DSL for distributed data manipulation
• In Python, you can convert a DataFrame to a pandas data frame and vice versa (example below)
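For example (assuming the batters DataFrame and sqlContext from the Parquet sketch earlier):

# DataFrame -> pandas: collects everything to the driver, so keep it small
pdf = batters.toPandas()
print pdf.describe()

# pandas -> DataFrame
df2 = sqlContext.createDataFrame(pdf)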
More DataFrames
• We can select columns from a DataFrame
• Run our own transform on it with map()
• The transform function gets a Row() object and returns one
• The toDF() function will infer data types for you.
• https://issues.apache.org/jira/browse/SPARK-7182
batting_features = batter_games.select(*unique_cols)
print "batting_features=", batting_features.schema
#print "batting_features=", batting_features.show()

def transformBatters(row_object):
    # Rows are immutable: convert to a dict, transform, rebuild a Row
    row = row_object.asDict()
    row = commonTransform(row)
    return Row(**row)

batting_features = batting_features.map(transformBatters).toDF()
batting_features.persist(storageLevel=StorageLevel.MEMORY_AND_DISK)

self.rmtree(self.rddDir + "/" + "batting_features.parquet")
batting_features.saveAsParquetFile(self.rddDir + "/" +
    "batting_features.parquet")
Programmatically Specifying the Schema

class Game(AbstractDF):
    schema = StructType(sorted(
        [
            StructField("game_id", StringType()),
            StructField("game_date", DateType()),
            StructField("id", IntegerType()),
            StructField("type", StringType()),
            StructField("local_game_time", StringType()),
            StructField("game_pk", IntegerType()),
            StructField("gameday_sw", StringType()),
            StructField("game_time_et", TimestampType()),
            StructField("home_code", StringType()),
            StructField("home_abbrev", StringType()),
            StructField("home_name", StringType()),
            StructField("home_won", IntegerType()),
            StructField("home_loss", IntegerType()),
            StructField("home_division_id", IntegerType()),
            StructField("home_league", StringType()),
            StructField("away_code", StringType()),
            StructField("away_abbrev", StringType()),
            StructField("away_name", StringType()),
            StructField("away_won", IntegerType()),
            StructField("away_loss", IntegerType()),
            StructField("away_division_id", IntegerType()),
            StructField("away_league", StringType()),
            StructField("stadium_id", IntegerType()),
            StructField("stadium_name", StringType()),
            StructField("stadium_venue_w_chan_loc", StringType()),
            StructField("stadium_location", StringType()),
            StructField("modified", TimestampType()),
        ],
        key=lambda x: x.name))
    skipSelectFields = ['modified']
AbstractDF
• Download: https://gist.github.com/dondrake/c7fcf42cf051492fdd91
• AbstractDF adds 3 major features to the class (an illustrative sketch follows):
• Creates a Python object with attributes based on the field names defined in the schema
• e.g. g = Game(); g.game_id = '123'
• Exposes a helper function to create a Row object containing all of the fields in the schema, stored with the correct data types.
• Needed so Scala can correctly infer the data types of the values sent.
• Provides a method that will return a list of columns to use in a select statement.
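The gist above is the source of truth; purely as illustration, a stripped-down base class along these lines could provide those three features (the method names here are hypothetical, not necessarily the gist's):

from pyspark.sql import Row

class AbstractDF(object):
    schema = None          # subclasses define a StructType
    skipSelectFields = []

    def __init__(self):
        # one attribute per schema field, e.g. g = Game(); g.game_id = '123'
        for f in self.schema.fields:
            setattr(self, f.name, None)

    def createRow(self):   # hypothetical name
        # include every schema field so Scala can infer the value types
        vals = dict((f.name, getattr(self, f.name))
                    for f in self.schema.fields)
        return Row(**vals)

    @classmethod
    def selectFields(cls):  # hypothetical name
        return [f.name for f in cls.schema.fields
                if f.name not in cls.skipSelectFields]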
Python or Scala?
• Use Python for prototyping
• Access to pandas, scikit-learn, etc.
• Python is strongly typed (but not statically typed)
• Spark's Python API support lags and is less popular than Scala's. Bugs might not get fixed right away, or at all.
• Python is slower because of the required gateway: all data must be serialized and sent through the gateway to and from the JVM. (Not as bad for DataFrames)
• Use Scala for application development
• Scala learning curve is steep.
• Functional Programming learning curve is steep.
• Scala is a statically typed language
• Java was intentionally left off
• Don’t bother with Java 7 (IMO)
Transformations
Building a moving average
Example of a moving average
How do we compute a 5-day MA for batting average?
• For an individual player, on a particular date, we need the batting average over his previous 5 days of games.
• Please note: not all players play every day.
• To build a predictive model, we need historical data, so we would need to calculate this for each day, every player, and (possibly) each metric.
• We would also want different moving-average durations (e.g. 5-, 7-, and 14-day moving averages) for each player-day.
• Our compute space just got big, and it is also embarrassingly parallel.
Some Stats
• Load table (2 full years + ~17% of this year so far):
• gamePlayers = sqlContext.parquetFile(rddDir + 'game_players.parquet').cache()
• # Hitters:
• gamePlayers.filter(gamePlayers.fd_position != 'P').count()
• 248856L (# of hitter-gamedate combinations)
• # Pitchers:
• gamePlayers.filter(gamePlayers.fd_position == 'P').count()
• 244926L
• A season-level moving average would require about 248,000 * (162/2) rows of data (~20 million rows)
MapReduce is dead.
Long live MapReduce.
• Spark has a set of transformations that operate on RDDs of key/value pairs (see the toy example after this list).
• Transformations
• groupByKey([numTasks])
• reduceByKey(func, [numTasks])
• aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
• sortByKey([ascending], [numTasks])
• join(otherDataset, [numTasks])
• cogroup(otherDataset, [numTasks])
• Actions
• reduce(func)
• countByKey()
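A toy pair-RDD example contrasting two of these (output order may vary):

pairs = sc.parallelize([("p123", 1), ("p456", 2), ("p123", 3)])

# groupByKey: every value for a key is shuffled to one place
pairs.groupByKey().mapValues(list).collect()
# [('p123', [1, 3]), ('p456', [2])]

# reduceByKey: combines map-side first, so far less shuffling
pairs.reduceByKey(lambda a, b: a + b).collect()
# [('p123', 4), ('p456', 2)]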
Broadcast Map-side join
• “Broadcast” a dictionary of lists of game_dates, keyed by the year the game took place.
• A flatMap operation loops over the broadcast game_dates for the dates the stats took place (see the sketch after this list).
• We use flatMap because we emit many rows from a single input (7-day, 14-day, etc.)
• Key: player_id + asOfDate + moving_length
• Value: game stats dictionary from the input
• Since this generates a lot of output, I repartition(50) the output of flatMap
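A sketch of that fan-out, under assumed record shapes: an RDD of (player_id, game_date, stats_dict) tuples named statsRDD and a {year: [game_date, ...]} dictionary named game_dates_by_year. Neither name is from the talk.

bc_dates = sc.broadcast(game_dates_by_year)
WINDOWS = (5, 7, 14)  # moving-average lengths, as in the slides

def expand(rec):
    player_id, game_date, stats = rec
    out = []
    for as_of in bc_dates.value[game_date.year]:
        for length in WINDOWS:
            # emit this game's stats toward every asOfDate whose
            # trailing `length`-day window covers game_date
            if 0 < (as_of - game_date).days <= length:
                out.append(((player_id, as_of, length), stats))
    return out

expanded = statsRDD.flatMap(expand).repartition(50)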
Reduce
• Now call groupByKey() (see the sketch after this list)
• This creates an RDD that has a list of values for each key; these values are the stats to average.
• If performing a plain aggregation, use reduceByKey instead; less shuffling is involved.
• Calculate the average (and stddev, etc.)
• Emit a new key of player_id + asOfDate
• The value is the dictionary of all moving-average fields
• Run reduceByKey() to combine (concatenate) the 7-day, 14-day metrics into a single row
• Save the results as Parquet; they will be joined with other features to create a predictive model
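Continuing the assumed shapes from the flatMap sketch, the reduce stage might look like:

def averages(values):
    # values: the stats dicts sharing one (player_id, asOfDate, length) key
    vals = list(values)
    n = float(len(vals))
    return dict((k, sum(v[k] for v in vals) / n) for k in vals[0])

def merge(a, b):
    a.update(b)   # concatenate the per-window metric dicts
    return a

mavgs = (expanded
         .groupByKey()   # reduceByKey would shuffle less for plain sums
         .map(lambda kv: ((kv[0][0], kv[0][1]),           # (player_id, asOfDate)
                          {kv[0][2]: averages(kv[1])}))   # {length: averaged stats}
         .reduceByKey(merge))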
Useful Links
1. https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
2. https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-ousterhout.pdf
3. https://zeppelin.incubator.apache.org/
4. https://codewords.recurse.com/issues/one/an-introduction-to-functional-programming
5. http://apache-spark-user-list.1001560.n3.nabble.com/
Spark Books
Q & A