Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster

Spark ETL techniques including Web Scraping, Parquet files, RDD transformations, SparkSQL, DataFrames, building moving averages and more.

1. Spark ETL: How to create an optimal daily fantasy baseball roster. Chicago Hadoop Users Group, May 12, 2015. Don Drake, don@drakeconsulting.com, @dondrake
2. Overview
• Who am I?
• ETL
• Daily Fantasy Baseball
• Spark
• Data Flow
• Extracting - web crawler
• Parquet
• DataFrames
• Python or Scala?
• Transforming - RDDs
• Transforming - Moving Average
3. Who Am I?
• Don Drake, @dondrake
• Currently: Principal Big Data Consultant @ Allstate
• 5 years consulting on Hadoop
• Independent consultant for the last 14 years
• Previous clients:
  • Navteq / Nokia
  • Sprint
  • Mobile Meridian - co-founder
  • MailLaunder.com - my SaaS anti-spam service
  • cars.com
  • Tribune Media Services
  • Family Video
  • Museum of Science and Industry
4. ETL
• Informally: any repeatable, programmed data movement
• Extraction - get data from another source
  • Oracle/PostgreSQL
  • CSV
  • Web crawler
• Transform
  • Normalize formatting of phone #'s, addresses
  • Create surrogate keys
  • Join data sources
  • Aggregate
• Load
  • Load into a data warehouse
  • CSV
  • Sent to a predictive model
5. Spark
• Apache Spark is a fast and general-purpose engine for large-scale data processing.
• It provides high-level APIs in Java, Scala, and Python (REPLs for Scala and Python).
• Includes an advanced DAG execution engine that supports in-memory computing.
• RDD (Resilient Distributed Dataset) is the core construct of the framework.
• Includes a set of high-level tools: Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
• Can run on a cluster (Hadoop YARN, EC2, Mesos), standalone, or local.
• Open source; core committers from Databricks.
• Latest version is 1.3.1, which includes DataFrames.
• Started in 2009 as an AMPLab research project; an Apache project since 2013.
• LOTS of momentum.
6. Daily Fantasy Baseball
7. Spark 101 - Execution
• Driver - your program's main() method
  • Only 1 per application
• Executors - do the distributed work
  • As many as your cluster can handle
  • You determine ahead of time how many
  • You determine the amount of RAM required
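The deck doesn't show how the executor count and memory are chosen; as a small illustration (the values below are arbitrary, not from the talk), they can be set through SparkConf before creating the context:

from pyspark import SparkConf, SparkContext

# Hypothetical sizing: 4 executors with 4 GB of RAM and 2 cores each (YARN).
conf = (SparkConf()
        .setAppName("spark-etl-demo")
        .set("spark.executor.instances", "4")
        .set("spark.executor.memory", "4g")
        .set("spark.executor.cores", "2"))
sc = SparkContext(conf=conf)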
8. Spark 101 - RDD
• RDD - Resilient Distributed Dataset
• Can be created from Hadoop input formats (text file, sequence file, Parquet file, HBase, etc.) OR by transforming other RDDs
• RDDs have actions and transformations, which return pointers to new RDDs
• RDDs can contain anything
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

scala> textFile.count() // Number of items in this RDD
res0: Long = 126

scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
9. DEMO
10. Spark 101 - DEMO

#!/usr/bin/env python

from pyspark import SparkContext

sc = SparkContext()

# create some numbers
nums = sc.parallelize(xrange(1000))

# how many partitions the data was split into
nums.getNumPartitions()

# count() is an action: it triggers execution
nums.count()
11. Spark 101 - Lazy Demo

#!/usr/bin/env python

from pyspark import SparkContext

sc = SparkContext()

# create some numbers
nums = sc.parallelize(xrange(1000))

# accumulator to count how many times isEven is actually called
check = sc.accumulator(0)

def isEven(x):
    check.add(1)
    return x % 2 == 0

# filter() is a transformation: nothing runs yet
evens = nums.filter(isEven)

# each action re-evaluates the uncached RDD
evens.count()
evens.collect()

# inspect the accumulator: it was incremented by both actions
check
12. Spark 101 - Lazy RDDs
• RDDs are executed *only* when an action is called on them.
• This allows multiple transformations on an RDD, letting Spark compute an optimal execution plan.
• Uncached RDDs are re-evaluated *every* time an action is called on them.
• Cache an RDD if you know you will iterate over it more than once. (See the caching sketch below.)
• You can cache to RAM + disk; by default, it persists to RAM only.
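A minimal PySpark sketch of the caching point above (not from the deck; the RDD is a toy example):

from pyspark import SparkContext, StorageLevel

sc = SparkContext()
nums = sc.parallelize(xrange(1000))

evens = nums.filter(lambda x: x % 2 == 0)

# cache() keeps partitions in RAM only; persist() lets you spill to disk.
evens.persist(StorageLevel.MEMORY_AND_DISK)

evens.count()    # first action computes the RDD and populates the cache
evens.collect()  # second action reuses the cached partitions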
13. Data Flow Overview

14. Extraction & Ingestion
15. Extraction
• We have seen it's easy to create a parallelized data structure that we can execute code against.
• Pro tip: if extracting from a relational database, use Sqoop and save as Parquet.
• We need to download a set of files for each baseball game previously played (player statistics).
• TODO List Pattern:
  • We know the filenames to download for each game (they are static).
  • Download all files for a game in parallel.
16.

def getFiles(game):
    session = requests.Session()
    files = scrape.get_files([game], session=session)
    count, fails = scrape.download(files, DownloadMLB.cache)
    return (count, fails)

def summarize(a, x):
    total = a[0] + x[0]
    alist = a[1]
    alist.extend(x[1])
    return (total, alist)

def run(self):
    sc = SparkContext()
    start_scrape = datetime.now()
    begin, begin_parts = scrape.get_boundary(self.begin)
    end, end_parts = scrape.get_boundary(self.end)

    session = requests.Session()

    print "here"
    all_years_months_days = self.getYearsMonths(self.WEB_ROOT, session)
    games = scrape.get_games(all_years_months_days, session=session)

    gamesRDD = sc.parallelize(games)
    print "gamesRDD=", gamesRDD

    gamesRDD.foreach(dump)
    print "# partitions:", gamesRDD.getNumPartitions()
    print "count=", gamesRDD.count()
    res = gamesRDD.map(getFiles).reduce(summarize)
    print "res=", res

    count = res[0]
    fails = res[1]
    end_scrape = datetime.now()
    self.log.info("%d files downloaded in %s", count,
                  str(end_scrape - start_scrape))
    if fails:
        for url in fails:
            self.log.error("failed to download %s", url)

    sc.stop()
17. TODO DEMO
18. Butter... Butter... Parquet?
• Features
  • Interoperability - Spark, Impala, Sqoop, and much more
  • Space efficient
  • Query efficient
• Schema - (field name, data type, nullable)
• Columnar format
• Different encoding (compression) algorithms
  • Delta encoding - diffs per row, per column
  • Dictionary encoding (~60k items), e.g. dates, IP addresses
  • Run-length encoding - for repetitive data
• Example: http://blog.cloudera.com/blog/2014/05/using-impala-at-scale-at-allstate/
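A minimal sketch of reading and writing Parquet with the Spark 1.3 Python API used later in this deck; the paths and the filter are hypothetical:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Read a Parquet file into a DataFrame; the schema travels with the file.
games = sqlContext.parquetFile("/data/mlb/games.parquet")
games.printSchema()

# Write a filtered DataFrame back out as Parquet.
home_winners = games.filter(games.home_won > games.home_loss)
home_winners.saveAsParquetFile("/data/mlb/home_winners.parquet")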
19. http://parquet.apache.org/presentations/

20. http://parquet.apache.org/presentations/
21. Using DataFrames - SparkSQL
• Previously called SchemaRDD; DataFrames now contain extra functionality to query/filter
• Allow you to write SQL (joins, filters, etc.) against DataFrames (or RDDs, with a little effort)
• All DataFrames contain a schema

batter_mov_avg = sqlContext.parquetFile(self.rddDir + "/" +
                                        "batter_moving_averages.parquet")
batter_mov_avg.registerTempTable("bma")
batter_mov_avg.persist(storageLevel=StorageLevel.MEMORY_AND_DISK)
print "batter_mov_avg=", batter_mov_avg.take(2)

batter_games = sqlContext.sql("""
    select * from games g, bma, game_players gp
    where bma.player_id = gp.player_id
      and bma.game_date = gp.game_date
      and gp.game_id = g.game_id
      and g.game_date = gp.game_date
      and g.game_date = bma.game_date""")
22. DataFrames to the rescue
• DataFrames offer a DSL for distributed data manipulation
• In Python, you can convert a DataFrame to a pandas data frame and vice versa (see the sketch below)
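A small sketch of the pandas round trip mentioned above (the Parquet path and column names are hypothetical):

import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Spark DataFrame -> pandas: collects to the driver, so keep it small.
batters = sqlContext.parquetFile("/data/mlb/batting_features.parquet")
batters_pd = batters.toPandas()

# pandas -> Spark DataFrame.
small_pd = pd.DataFrame({"player_id": ["p1", "p2"], "avg": [0.301, 0.275]})
small_df = sqlContext.createDataFrame(small_pd)
small_df.show()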
23. More DataFrames
• We can select columns from a DataFrame
• Run our own transform on it with map()
• The transform function gets a Row() object and returns one
• The toDF() function will infer data types for you
• https://issues.apache.org/jira/browse/SPARK-7182

batting_features = batter_games.select(*unique_cols)
print "batting_features=", batting_features.schema
#print "batting_features=", batting_features.show()

def transformBatters(row_object):
    row = row_object.asDict()
    row = commonTransform(row)
    return Row(**row)

batting_features = batting_features.map(transformBatters).toDF()
batting_features.persist(storageLevel=StorageLevel.MEMORY_AND_DISK)

self.rmtree(self.rddDir + "/" + "batting_features.parquet")
batting_features.saveAsParquetFile(self.rddDir + "/" +
                                   "batting_features.parquet")
24. Programmatically Specifying the Schema

class Game(AbstractDF):
    schema = StructType(sorted(
        [
            StructField("game_id", StringType()),
            StructField("game_date", DateType()),
            StructField("id", IntegerType()),
            StructField("type", StringType()),
            StructField("local_game_time", StringType()),
            StructField("game_pk", IntegerType()),
            StructField("gameday_sw", StringType()),
            StructField("game_time_et", TimestampType()),
            StructField("home_code", StringType()),
            StructField("home_abbrev", StringType()),
            StructField("home_name", StringType()),
            StructField("home_won", IntegerType()),
            StructField("home_loss", IntegerType()),
            StructField("home_division_id", IntegerType()),
            StructField("home_league", StringType()),
            StructField("away_code", StringType()),
            StructField("away_abbrev", StringType()),
            StructField("away_name", StringType()),
            StructField("away_won", IntegerType()),
            StructField("away_loss", IntegerType()),
            StructField("away_division_id", IntegerType()),
            StructField("away_league", StringType()),
            StructField("stadium_id", IntegerType()),
            StructField("stadium_name", StringType()),
            StructField("stadium_venue_w_chan_loc", StringType()),
            StructField("stadium_location", StringType()),
            StructField("modified", TimestampType()),
        ],
        key=lambda x: x.name))
    skipSelectFields = ['modified']
25. AbstractDF
• Download: https://gist.github.com/dondrake/c7fcf42cf051492fdd91
• AbstractDF adds 3 major features to the class:
  • Creates a Python object with attributes based on the field names defined in the schema
    • e.g. g = Game(); g.game_id = '123'
  • Exposes a helper function to create a Row object containing all of the fields in the schema, stored with the correct data types
    • Needed so Scala can correctly infer the data types of the values sent
  • Provides a method that will return a list of columns to use in a select statement
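The gist itself isn't reproduced in the deck; below is a minimal, hypothetical sketch of a base class with those three features (the method names are illustrative and may not match the gist's actual API):

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

class AbstractDF(object):
    # Subclasses define `schema` and, optionally, `skipSelectFields`.
    schema = StructType([])
    skipSelectFields = []

    def __init__(self):
        # One attribute per schema field, defaulting to None.
        for field in self.schema.fields:
            setattr(self, field.name, None)

    def createRow(self):
        # Build a Row containing every schema field (hypothetical helper name).
        return Row(**dict((f.name, getattr(self, f.name)) for f in self.schema.fields))

    def selectFields(self):
        # Columns to use in a select statement, minus any skipped fields.
        return [f.name for f in self.schema.fields if f.name not in self.skipSelectFields]

class Player(AbstractDF):
    schema = StructType([
        StructField("player_id", StringType()),
        StructField("age", IntegerType()),
    ])

p = Player()
p.player_id = 'p123'
p.age = 29
print p.createRow(), p.selectFields()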
26. Python or Scala??
27. Python or Scala???
• Use Python for prototyping
  • Access to pandas, scikit-learn, etc.
  • Python is strongly typed (but not statically typed)
  • Spark's Python API support lags and is not as popular as Scala; bugs might not get fixed right away, or at all
  • Python is slower due to the gateway required: all data must be serialized and sent through the gateway to/from Scala (not as bad for DataFrames)
• Use Scala for application development
  • Scala's learning curve is steep
  • The functional programming learning curve is steep
  • Scala is a statically typed language
• Java was intentionally left off
  • Don't bother with Java 7 (IMO)
28. Transformations: Building a moving average

29. Example of a moving average
30. How do we compute a 5-day MA for batting average?
• For an individual player, on a particular date, we need the previous 5 days' batting average.
• Please note: not all players play every day.
• To build a predictive model, we need historical data, so we would need to calculate this for each day, every player, for (possibly) each metric.
• We would also want to use different moving average durations (e.g. 5-, 7-, and 14-day moving averages) for each player-day.
• Our compute space just got big, and it is also embarrassingly parallel.
31. Some Stats
• Load table (2 full years + ~17% of this year so far):
  • gamePlayers = sqlContext.parquetFile(rddDir + 'game_players.parquet').cache()
• # Hitters:
  • gamePlayers.filter(gamePlayers.fd_position != 'P').count()
  • 248856L (# of hitter-gamedate combinations)
• # Pitchers:
  • gamePlayers.filter(gamePlayers.fd_position == 'P').count()
  • 244926L
• A season-level moving average would require about 248,000 * (162/2) rows of data (~20 million rows).
32. MapReduce is dead. Long live MapReduce.
• Spark has a set of transformations that can be performed on RDDs of key/value pairs. (See the sketch after this slide.)
• Transformations
  • groupByKey([numTasks])
  • reduceByKey(func, [numTasks])
  • aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
  • sortByKey([ascending], [numTasks])
  • join(otherDataset, [numTasks])
  • cogroup(otherDataset, [numTasks])
• Actions
  • reduce(func)
  • countByKey()
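An illustrative PySpark key/value sketch (not from the deck): the same per-player sum written with groupByKey and with reduceByKey, the latter pre-aggregating within each partition so less data is shuffled:

from pyspark import SparkContext

sc = SparkContext()

# (player_id, hits) pairs for a few hypothetical games.
stats = sc.parallelize([("p1", 2), ("p1", 0), ("p2", 3), ("p2", 1), ("p1", 1)])

# groupByKey ships every value for a key to one place, then we sum.
totals_grouped = stats.groupByKey().mapValues(lambda hits: sum(hits))

# reduceByKey combines values within each partition before shuffling.
totals_reduced = stats.reduceByKey(lambda a, b: a + b)

print totals_grouped.collect()   # e.g. [('p1', 3), ('p2', 4)]
print totals_reduced.collect()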
33. Broadcast map-side join
• "Broadcast" a dictionary of lists of game_dates; the keys of the dictionary are the year the game took place.
• A flatMap operation loops over the broadcast game_dates for the year the stats took place.
• We use flatMap because we emit many rows from a single input (7-day, 14-day, etc.).
• Key: player_id + asOfDate + moving_length
• Value: game stats dictionary from the input
• Since this generates a lot of output, I repartition(50) the output of flatMap.
34. Reduce
• Now call groupByKey()
  • Creates an RDD that has a list of values with the same key; these values are the stats to calculate the average from.
  • If performing aggregation, use reduceByKey instead: less shuffling involved.
• Calculate the average (and stddev, etc.)
  • Emit a new key of player_id + asOfDate
  • Value is the dictionary of all moving average fields
• Run reduceByKey() to combine (concatenate) the 7-day, 14-day metrics into a single row.
• Save results as Parquet; they will be joined with other features to create a predictive model. (A condensed sketch of the whole flow follows.)
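Pulling slides 33 and 34 together, here is a condensed, hypothetical sketch of the moving-average flow. The field names (player_id, game_date, avg) are assumptions for illustration, and it derives as-of dates with timedelta offsets rather than the talk's broadcast dictionary of actual game dates:

from datetime import date, timedelta
from pyspark import SparkContext

sc = SparkContext()

WINDOWS = [5, 7, 14]  # moving-average lengths in days

# Each record: one player's stats for one game date (hypothetical fields).
stats = sc.parallelize([
    {"player_id": "p1", "game_date": date(2015, 5, 1), "avg": 0.300},
    {"player_id": "p1", "game_date": date(2015, 5, 2), "avg": 0.250},
    {"player_id": "p1", "game_date": date(2015, 5, 4), "avg": 0.400},
])

def explode_windows(row):
    # Emit the row once for every (as-of date, window length) it contributes to,
    # keyed by player + as-of date + window length.
    for window in WINDOWS:
        for offset in range(1, window + 1):
            as_of = row["game_date"] + timedelta(days=offset)
            yield ((row["player_id"], as_of, window), row)

def summarize(kv):
    # Average the grouped values and emit a per-window field name.
    (player_id, as_of, window), rows = kv
    values = [r["avg"] for r in rows]
    return ((player_id, as_of), {"avg_ma_%d" % window: sum(values) / len(values)})

def merge(a, b):
    # Concatenate the 5-day, 7-day, 14-day dictionaries into a single row.
    merged = dict(a)
    merged.update(b)
    return merged

moving_averages = (stats.flatMap(explode_windows)
                        .repartition(50)
                        .groupByKey()
                        .map(summarize)
                        .reduceByKey(merge))

print moving_averages.take(3)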
35. Useful Links
1. https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
2. https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-ousterhout.pdf
3. https://zeppelin.incubator.apache.org/
4. https://codewords.recurse.com/issues/one/an-introduction-to-functional-programming
5. http://apache-spark-user-list.1001560.n3.nabble.com/
36. Spark Books

37. Q & A
