
Beyond SQL: Speeding up Spark with DataFrames



In this talk I describe how you can use Spark SQL DataFrames to speed up Spark programs, even without writing any SQL. By writing programs using the new DataFrame API you can write less code, read less data and let the optimizer do the hard work.


Beyond SQL: Speeding up Spark with DataFrames

  1. Beyond SQL: Speeding up Spark with DataFrames. Michael Armbrust (@michaelarmbrust), Spark Summit East, March 2015.
  2. About Me and SQL. Spark SQL: part of the core distribution since Spark 1.0 (April 2014); graduated from alpha in Spark 1.3. [Charts: number of unique contributors and number of commits per month.]
  3. About Me and SQL (build): adds that Spark SQL runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments, e.g. SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data).
  4. About Me and SQL (build): adds that Spark SQL can connect existing BI tools to Spark through JDBC.
  5. About Me and SQL (build): adds that Spark SQL has bindings in Python, Scala, and Java.
  6. About Me and SQL (final build). Spark SQL: part of the core distribution since Spark 1.0 (April 2014); runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments; connects existing BI tools to Spark through JDBC; bindings in Python, Scala, and Java. About me: @michaelarmbrust, lead developer of Spark SQL at Databricks.
  7. The not-so-secret truth... Spark SQL is not about SQL.
  8. Execution Engine Performance. [Chart: TPC-DS query performance, Shark vs. Spark SQL.]
  9. The not-so-secret truth... Spark SQL is about more than SQL.
  10. Spark SQL: The whole story. Creating and running Spark programs faster: write less code, read less data, let the optimizer do the hard work.
  11. DataFrame, noun [dey-tuh-freym]: 1. A distributed collection of rows organized into named columns. 2. An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas). 3. Archaic: previously SchemaRDD (cf. Spark < 1.3).
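
  For reference, a minimal sketch (using the Spark 1.3-era Python API that the slides use; the column names and values are made up for illustration) of creating such a DataFrame from local data:

      from pyspark import SparkContext
      from pyspark.sql import SQLContext

      sc = SparkContext(appName="dataframe-example")
      sqlCtx = SQLContext(sc)

      # Rows become a distributed collection organized into named columns.
      df = sqlCtx.createDataFrame(
          [("Alice", 34), ("Bob", 23), ("Carol", 51)],
          ["name", "age"])

      df.printSchema()   # prints the inferred schema
      df.show()          # prints the rows as a small table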
  12. Write Less Code: Input & Output. Spark SQL's Data Source API can read and write DataFrames using a variety of formats, both built-in (such as { JSON }) and external (such as JDBC), and more.
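
  As an illustration (not from the slides), a sketch of reading and writing through the Data Source API; the paths are placeholders and the JDBC connection details are assumptions:

      # Read JSON into a DataFrame, then write the same data back out as Parquet.
      events = sqlCtx.load("/data/events", "json")
      events.save("/data/events_parquet", "parquet")

      # External sources go through the same API, e.g. a JDBC table
      # (the url and table name below are made up):
      # users = sqlCtx.load(source="jdbc",
      #                     url="jdbc:postgresql://dbhost/mydb",
      #                     dbtable="users")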
  13. Write Less Code: High-Level Operations. Common operations can be expressed concisely as calls to the DataFrame API: selecting required columns, joining different data sources, aggregation (count, sum, average, etc.), and filtering. (A combined sketch follows below.)
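
  A sketch combining those four operations; the events and users tables and their columns are assumptions for illustration:

      from pyspark.sql.functions import avg, count

      events = sqlCtx.table("events")
      users  = sqlCtx.table("users")

      stats = (events
          .join(users, events.user_id == users.user_id)   # join different sources
          .where(events.country == "US")                   # filter
          .select("name", "duration")                      # select required columns
          .groupBy("name")                                 # aggregate per user
          .agg(count("duration"), avg("duration")))

      stats.show()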
  14. Write Less Code: Compute an Average. The same computation in Hadoop MapReduce (Java) and with Spark RDDs (Python):

      // MapReduce: mapper and reducer
      private IntWritable one = new IntWritable(1);
      private IntWritable output = new IntWritable();

      protected void map(LongWritable key, Text value, Context context) {
          String[] fields = value.toString().split("\t");
          output.set(Integer.parseInt(fields[1]));
          context.write(one, output);
      }

      IntWritable one = new IntWritable(1);
      DoubleWritable average = new DoubleWritable();

      protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) {
          int sum = 0;
          int count = 0;
          for (IntWritable value : values) {
              sum += value.get();
              count++;
          }
          average.set(sum / (double) count);
          context.write(key, average);
      }

      # Spark RDDs (Python)
      data = sc.textFile(...).map(lambda line: line.split("\t"))
      data.map(lambda x: (x[0], [int(x[1]), 1])) \
          .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
          .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
          .collect()
  15. Write Less Code: Compute an Average.

      Using RDDs:

          data = sc.textFile(...).map(lambda line: line.split("\t"))
          data.map(lambda x: (x[0], [int(x[1]), 1])) \
              .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
              .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
              .collect()

      Using DataFrames:

          sqlCtx.table("people") \
              .groupBy("name") \
              .agg("name", avg("age")) \
              .collect()

      Full API docs: Python, Scala, Java.
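
  Note that avg comes from pyspark.sql.functions; a self-contained version of the DataFrame snippet might look like the following (the people table with name and age columns is assumed):

      from pyspark.sql.functions import avg

      averages = (sqlCtx.table("people")
          .groupBy("name")
          .agg(avg("age").alias("avg_age"))
          .collect())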
  16. Not Just Less Code: Faster Implementations. [Chart: time to aggregate 10 million int pairs (secs) for RDD (Scala), RDD (Python), DataFrame (Scala), DataFrame (Python), and DataFrame (SQL).]
  17. Demo: Data Sources API. Using Spark SQL to read, write, and transform data in a variety of formats. http://people.apache.org/~marmbrus/talks/dataframe.demo.pdf
  18. Read Less Data. The fastest way to process big data is to never read it. Spark SQL can help you read less data automatically, by:
      •  Converting to more efficient formats
      •  Using columnar formats (e.g., Parquet)
      •  Using partitioning (e.g., /year=2014/month=02/…) [1]
      •  Skipping data using statistics (e.g., min, max) [2]
      •  Pushing predicates into storage systems (e.g., JDBC)
      [1] Only supported for Parquet and Hive; more support coming in Spark 1.4.
      [2] Turned off by default in Spark 1.3.
      (A sketch follows below.)
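
  A sketch of these ideas using the same 1.3-era API as the slides (the paths and the year column are assumptions): convert JSON to a columnar format once, then let later reads touch only the needed columns and, where supported, only the matching partitions:

      # One-time conversion to Parquet.
      raw = sqlCtx.load("/data/events", "json")
      raw.save("/data/events_parquet", "parquet")

      # Later reads only fetch the requested columns, and filters can be
      # pushed down to the Parquet reader (and to partitioned layouts such
      # as /year=2014/month=02/ where supported).
      events = sqlCtx.load("/data/events_parquet", "parquet")
      recent = events.where(events.year == 2014).select("user_id", "timestamp")
      recent.count()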
  19. Plan Optimization & Execution. A SQL AST or a DataFrame becomes an Unresolved Logical Plan; Analysis (using the Catalog) produces a Logical Plan; Logical Optimization produces an Optimized Logical Plan; Physical Planning produces candidate Physical Plans, from which a Cost Model selects one; Code Generation then turns the Selected Physical Plan into RDDs. DataFrames and SQL share the same optimization/execution pipeline.
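
  One way to see this shared pipeline is DataFrame.explain(), which prints the plans for either route; the people table and its columns are assumptions:

      from pyspark.sql.functions import col

      adults_df  = sqlCtx.table("people").where(col("age") > 21).select("name")
      adults_sql = sqlCtx.sql("SELECT name FROM people WHERE age > 21")

      adults_df.explain(True)    # logical and physical plans for the DataFrame route
      adults_sql.explain(True)   # the SQL route goes through the same Catalyst pipeline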
  20. Optimization happens as late as possible, therefore Spark SQL can optimize even across functions.
  21. Example: optimizing across functions (JSON events).

      def add_demographics(events):
          u = sqlCtx.table("users")                            # Load Hive table
          return (events
              .join(u, events.user_id == u.user_id)            # Join on user_id
              .withColumn("city", zipToCity(u.zip)))           # Run udf to add city column

      events = add_demographics(sqlCtx.load("/data/events", "json"))
      training_data = events.where(events.city == "New York") \
                            .select(events.timestamp) \
                            .collect()

      [Logical plan: a filter over a join of the events file and the users table; the join is expensive, so ideally only the relevant users are joined. Physical plan: join of scan (events) with filter over scan (users).]
  22. The same program, with a partitioned Hive users table and Parquet events.

      def add_demographics(events):
          u = sqlCtx.table("users")                            # Load partitioned Hive table
          return (events
              .join(u, events.user_id == u.user_id)            # Join on user_id
              .withColumn("city", zipToCity(u.zip)))           # Run udf to add city column

      events = add_demographics(sqlCtx.load("/data/events", "parquet"))
      training_data = events.where(events.city == "New York") \
                            .select(events.timestamp) \
                            .collect()

      [Logical plan: unchanged (a filter over a join of the events file and the users table). Physical plan with predicate pushdown and column pruning: join of optimized scan (events) with optimized scan (users).]
  23. Machine Learning Pipelines.

      from pyspark.ml import Pipeline
      from pyspark.ml.feature import Tokenizer, HashingTF
      from pyspark.ml.classification import LogisticRegression

      tokenizer = Tokenizer(inputCol="text", outputCol="words")
      hashingTF = HashingTF(inputCol="words", outputCol="features")
      lr = LogisticRegression(maxIter=10, regParam=0.01)
      pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

      df = sqlCtx.load("/path/to/data")
      model = pipeline.fit(df)

      [Diagram: ds0 -> tokenizer -> ds1 -> hashingTF -> ds2 -> lr -> ds3, with lr producing lr.model; the whole chain forms a Pipeline / Pipeline Model.]
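
  For completeness, a hedged usage sketch: the fitted PipelineModel is itself a transformer, so it can be applied to a new DataFrame that has the same text column (the test path below is a placeholder):

      test_df = sqlCtx.load("/path/to/test_data")
      predictions = model.transform(test_df)            # adds the words, features, and prediction columns
      predictions.select("text", "prediction").show()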
  24. Create and Run Spark Programs Faster: write less code, read less data, let the optimizer do the hard work. Questions?
