Frustration-Reduced
PySpark
Data engineering with DataFrames
Ilya Ganelin
Why are we here?
• Spark for quick and easy batch ETL (no streaming)
• Actually using data frames
  - Creation
  - Modification
  - Access
  - Transformation
• Lab!
• Performance tuning and operationalization
What does it take to solve a data science problem?
• Data Prep
  - Ingest
  - Cleanup
  - Error-handling & missing values
• Data munging
  - Transformation
  - Formatting
  - Splitting
• Modeling
  - Feature extraction
  - Algorithm selection
  - Data creation
    - Train
    - Test
    - Validate
  - Model building
  - Model scoring
Why Spark?
• Batch/micro-batch processing of large datasets
• Easy to use, easy to iterate, wealth of common industry-standard ML algorithms
• Super fast if properly configured
• Bridges the gap between the old (SQL, single-machine analytics) and the new (declarative/functional distributed programming)
Why not Spark?
• Breaks easily with poor usage or improperly specified configs
• Scaling up to larger datasets (500 GB -> TB scale) requires deep understanding of internal configurations, garbage collection tuning, and Spark mechanisms
• While there are lots of ML algorithms, many of them simply don't work, don't work at scale, or have poorly defined interfaces / documentation
Scala
• Yes, I recommend Scala
• The Python API is underdeveloped, especially for MLlib
• Java (until Java 8) is a second-class citizen compared to Scala in terms of convenience
• Spark is written in Scala – understanding Scala helps you navigate the source
• You can leverage the spark-shell to rapidly prototype new code and constructs
• http://www.scala-lang.org/docu/files/ScalaByExample.pdf
Why DataFrames?
• Iterate on datasets MUCH faster
• Column access is easier
• Data inspection is easier
• groupBy and join are faster due to under-the-hood optimizations
• Some chunks of MLlib are now optimized to use data frames
Why not DataFrames?
• The RDD API is still much better developed
• Getting data into DataFrames can be clunky
• Transforming data inside DataFrames can be clunky
• Many of the algorithms in MLlib still depend on RDDs
Creation
• Read in a file with an embedded header
• http://stackoverflow.com/questions/24718697/pyspark-drop-rows

DataFrame Creation
• Create a DF
  - Option A – Infer types from an RDD of Rows
  - Option B – Specify the schema as strings

DataFrame Creation
• Option C – Define the schema explicitly
• Check your work with df.show()
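A minimal sketch of the three creation options, written against the modern SparkSession entry point (the deck itself targets Spark 1.6's sqlContext); the file path and the name/age columns are hypothetical:

from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("df-creation").getOrCreate()
sc = spark.sparkContext

# Read a text file and drop the embedded header line
lines = sc.textFile("passengers.csv")                      # hypothetical path
header = lines.first()
parts = lines.filter(lambda l: l != header).map(lambda l: l.split(","))

# Option A - infer types from an RDD of Rows
rows = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
df_a = spark.createDataFrame(rows)

# Option B - pass the column names as strings and let types be inferred
df_b = spark.createDataFrame(parts.map(lambda p: (p[0], int(p[1]))), ["name", "age"])

# Option C - define the schema explicitly
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df_c = spark.createDataFrame(parts.map(lambda p: (p[0], int(p[1]))), schema)

# Check your work
df_c.show()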
Column Manipulation
• Selection
• GroupBy
  - Confusing! You get a GroupedData object, not an RDD or DataFrame
  - Use agg or the built-ins to get back to a DataFrame
  - Can convert to an RDD with dataFrame.rdd
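A minimal sketch of that select/groupBy flow, assuming a DataFrame df with hypothetical columns "dept" and "salary":

from pyspark.sql import functions as F

depts   = df.select("dept")                      # still a DataFrame
grouped = df.groupBy("dept")                     # GroupedData, not a DataFrame!
summary = grouped.agg(F.avg("salary").alias("avg_salary"),
                      F.count("*").alias("n"))   # agg returns a DataFrame
rows    = summary.rdd                            # drop to an RDD of Rows if you must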
Custom Column Functions
• Add a column with a custom function:
• http://stackoverflow.com/questions/33287886/replace-empty-strings-with-none-null-values-in-dataframe
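In the spirit of the linked answer, a minimal sketch that adds a column via a Python UDF; the DataFrame df and its string column "name" are hypothetical:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Replace empty strings with None (null) in a new column
empty_to_null = F.udf(lambda s: None if s == "" else s, StringType())
df2 = df.withColumn("name_clean", empty_to_null(df["name"]))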
Row Manipulation
• Filter
  - Range:
  - Equality:
  - Column functions
• https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.Column
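Minimal filter sketches, assuming a DataFrame df with hypothetical columns "age" and "sex":

from pyspark.sql import functions as F

adults  = df.filter((df["age"] >= 18) & (df["age"] < 65))   # range
females = df.filter(df["sex"] == "female")                  # equality
has_age = df.filter(F.col("age").isNotNull())               # Column function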
Joins
• Option A (inner join)
• Option B (explicit)
• Join types: inner, outer, left_outer, right_outer, leftsemi
• DataFrame joins benefit from Tungsten optimizations
• Note: PySpark will not drop columns for outer joins
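A join sketch, assuming two DataFrames people and cabins (hypothetical names) that share an "id" column:

# Option A - join on a shared column name (inner by default)
joined_a = people.join(cabins, "id")

# Option B - explicit join condition and join type
joined_b = people.join(cabins, people["id"] == cabins["id"], "left_outer")
# With an explicit condition both "id" columns survive the outer join,
# so select or drop the duplicate yourself.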
Null Handling
• Built-in support for handling nulls/NA in data frames
• Drop, fill, replace
• https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions
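The NA functions hang off df.na; a minimal sketch assuming hypothetical columns "age" (numeric) and "embarked" (string):

dropped  = df.na.drop(subset=["age"])               # drop rows where age is null
filled   = df.na.fill({"age": 0, "embarked": "?"})  # per-column fill values
replaced = df.na.replace(["?"], ["S"], "embarked")  # replace values in a column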
What does it take to solve a data science problem?
• Data Prep
  - Ingest
  - Cleanup
  - Error-handling & missing values
• Data munging
  - Transformation
  - Formatting
  - Splitting
• Modeling
  - Feature extraction
  - Algorithm selection
  - Data creation
    - Train
    - Test
    - Validate
  - Model building
  - Model scoring
Lab Rules
• Ask Google and StackOverflow before you ask me
• You do not have to use my code.
• Use DataFrames until you can't.
• Keep track of what breaks!
• There are no stupid questions.
Lab
• Ingest data
• Remove invalid entries or fill missing entries
• Split into test, train, validate
• Reformat a single column, e.g. map IDs or change the format
• Add a custom metric or feature based on other columns
• Run a classification algorithm on this data to figure out who will survive!
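For the split step, one option is randomSplit; a sketch assuming a cleaned DataFrame df:

train, test, validate = df.randomSplit([0.7, 0.2, 0.1], seed=42)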
What problems did you encounter?
What are you still confused about?
Spark Architecture

Partitions, Caching, and Serialization
• Partitions
  - How data is split on disk
  - Affects memory usage and shuffle size
  - Partition count ~ speed, partition count ~ 1/memory
• Caching
  - Persist RDDs in distributed memory
  - Major speedup for repeated operations
• Serialization
  - Efficient movement of data
  - Java vs. Kryo
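Minimal sketches of the three knobs; the file path and partition count here are placeholders, not recommendations:

from pyspark import SparkConf, SparkContext

# Serialization: prefer Kryo over the default Java serializer (set before the context starts)
conf = SparkConf().set("spark.serializer",
                       "org.apache.spark.serializer.KryoSerializer")
sc = SparkContext(conf=conf)

# Partitions: more partitions -> more parallelism, less memory per partition
rdd = sc.textFile("some_big_file.txt")   # hypothetical path
rdd = rdd.repartition(200)

# Caching: persist in distributed memory; the first action materializes the cache
rdd.cache()
rdd.count()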
Shuffle!
• All-to-all operations
  - reduceByKey, groupByKey
• Data movement
  - Serialization
  - Akka
• Memory overhead
  - Dumps to disk when OOM
  - Garbage collection
• EXPENSIVE!

Map Reduce
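Why the slide singles out reduceByKey and groupByKey: reduceByKey combines values map-side before the shuffle, so far less data crosses the network. A sketch assuming an RDD pairs of (word, 1) tuples:

counts_cheap     = pairs.reduceByKey(lambda a, b: a + b)   # map-side combine before the shuffle
counts_expensive = pairs.groupByKey().mapValues(sum)       # ships every value across the network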
What else?
• Save your work => write completed datasets to file
• Work on small data first, then go to big data
• Create test data to capture edge cases
• LMGTFY
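For the "save your work" step, a minimal sketch (the output path is hypothetical):

df.write.mode("overwrite").parquet("hdfs:///tmp/cleaned_passengers")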
By popular demand:

screen pyspark \
  --driver-memory 100g \
  --num-executors 60 \
  --executor-cores 5 \
  --master yarn-client \
  --conf "spark.executor.memory=20g" \
  --conf "spark.io.compression.codec=lz4" \
  --conf "spark.shuffle.consolidateFiles=true" \
  --conf "spark.dynamicAllocation.enabled=false" \
  --conf "spark.shuffle.manager=tungsten-sort" \
  --conf "spark.akka.frameSize=1028" \
  --conf "spark.executor.extraJavaOptions=-Xss256k -XX:MaxPermSize=128m -XX:PermSize=96m -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=6 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AggressiveOpts -XX:+UseCompressedOops"
Any Spark on YARN
• E.g. deploy Spark 1.6 on CDH 5.4
• Download your Spark binary to the cluster and untar it
• In $SPARK_HOME/conf/spark-env.sh:
  - export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/conf
    - This tells Spark where Hadoop is deployed and gives it the link it needs to run on YARN
  - export SPARK_DIST_CLASSPATH=$(/usr/bin/hadoop classpath)
    - This defines the location of the Hadoop binaries used at run time
References
• http://spark.apache.org/docs/latest/programming-guide.html
• http://spark.apache.org/docs/latest/sql-programming-guide.html
• http://tinyurl.com/leqek2d (Working With Spark, by Ilya Ganelin)
• http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/ (by Sandy Ryza)
• http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ (by Sandy Ryza)
• http://www.slideshare.net/ilganeli/frustrationreduced-pyspark-data-engineering-with-dataframes

