Introducing DataFrames in Spark for Large-scale Data Science

Video of this presentation: https://www.youtube.com/watch?v=vxeLcoELaP4
1. DataFrames for Large-scale Data Science. Reynold Xin (@rxin). Feb 17, 2015 (Spark User Meetup)
2. Year of the lamb, goat, sheep, and ram …?
3. A slide from 2013 …
4. From MapReduce to Spark

   Hadoop MapReduce (Java):

   public static class WordCountMapClass extends MapReduceBase
       implements Mapper<LongWritable, Text, Text, IntWritable> {

     private final static IntWritable one = new IntWritable(1);
     private Text word = new Text();

     public void map(LongWritable key, Text value,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
       String line = value.toString();
       StringTokenizer itr = new StringTokenizer(line);
       while (itr.hasMoreTokens()) {
         word.set(itr.nextToken());
         output.collect(word, one);
       }
     }
   }

   public static class WordCountReduce extends MapReduceBase
       implements Reducer<Text, IntWritable, Text, IntWritable> {

     public void reduce(Text key, Iterator<IntWritable> values,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
       int sum = 0;
       while (values.hasNext()) {
         sum += values.next().get();
       }
       output.collect(key, new IntWritable(sum));
     }
   }

   Spark (Scala):

   val file = spark.textFile("hdfs://...")
   val counts = file.flatMap(line => line.split(" "))
                    .map(word => (word, 1))
                    .reduceByKey(_ + _)
   counts.saveAsTextFile("hdfs://...")
5. Spark’s Growth: Google Trends for “Apache Spark”
6. Beyond Hadoop Users

   Early adopters: understand MapReduce & functional APIs
   New audiences: data scientists, statisticians, R users, PyData users, …
7. RDD API

   • Most data is structured (JSON, CSV, Avro, Parquet, Hive …)
     –  Programming RDDs inevitably ends up with a lot of tuples (_1, _2, …)
   • Functional transformations (e.g. map/reduce) are not as intuitive
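   For example, a per-key average written against the plain RDD API quickly reduces records to anonymous tuple positions. This is a hypothetical Python sketch, not from the deck; it assumes a SparkContext named sc and a made-up events.csv of key,value lines:

   pairs = (sc.textFile("events.csv")
              .map(lambda line: line.split(","))
              .map(lambda cols: (cols[0], (float(cols[1]), 1))))      # (key, (value, count))

   averages = (pairs
               .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))  # sum values and counts
               .mapValues(lambda s: s[0] / s[1]))                     # per-key average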
9. DataFrames in Spark

   • Distributed collection of data grouped into named columns (i.e. RDD with schema)
   • Domain-specific functions designed for common tasks
     –  Metadata
     –  Sampling
     –  Project, filter, aggregation, join, …
     –  UDFs
   • Available in Python, Scala, Java, and R (via SparkR)
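   A minimal Python sketch of these operations (not from the deck; it assumes a SparkContext named sc and a hypothetical people.json with name, age, and dept fields):

   from pyspark.sql import SQLContext

   sqlContext = SQLContext(sc)                  # sc: an existing SparkContext

   # Creation: infer the schema from a JSON dataset
   people = sqlContext.jsonFile("people.json")

   # Project and filter using named columns instead of tuple positions
   adults = people.select("name", "age", "dept").filter(people.age >= 18)

   # Aggregation and join
   avg_age = adults.groupBy("dept").agg({"age": "avg"})
   joined  = adults.join(avg_age, adults.dept == avg_age.dept)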
10. Runtime performance of aggregating 10 million int pairs (secs)

   (Bar chart comparing RDD Scala, RDD Python, Spark Scala DF, and Spark Python DF; x-axis 0–10 seconds.)
11. Agenda

   • Introduction
   • Learn by demo
   • Design & internals
     –  API design
     –  Plan optimization
     –  Integration with data sources
12. Learn by Demo (in a Databricks Cloud Notebook)

   • Creation
   • Project
   • Filter
   • Aggregations
   • Join
   • SQL
   • UDFs
   • Pandas

   For the purpose of distributing the slides online, I’m attaching screenshots of the notebooks; a condensed code sketch of the SQL, UDF, and Pandas steps follows them below.
13–26. Notebook screenshots of the demo: creation, projection, filtering, aggregations, joins, SQL, UDFs, and Pandas interop.
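   Since the screenshots do not reproduce as text, here is a condensed, hypothetical sketch of the SQL, UDF, and Pandas steps, under the same assumptions as the earlier sketch (a SparkContext named sc and a made-up people.json):

   from pyspark.sql import SQLContext
   from pyspark.sql.functions import udf
   from pyspark.sql.types import IntegerType

   sqlContext = SQLContext(sc)                    # sc: an existing SparkContext
   people = sqlContext.jsonFile("people.json")    # hypothetical input, as before

   # SQL: register the DataFrame as a temporary table and query it
   people.registerTempTable("people")
   teenagers = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")

   # UDFs: wrap an ordinary Python function as a column expression
   name_len = udf(lambda s: len(s), IntegerType())
   with_len = people.select(people.name, name_len(people.name).alias("name_len"))

   # Pandas: bring a (small) result back to the driver as a pandas DataFrame
   pdf = teenagers.toPandas()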
27. Machine Learning Integration

   tokenizer = Tokenizer(inputCol="text", outputCol="words")
   hashingTF = HashingTF(inputCol="words", outputCol="features")
   lr = LogisticRegression(maxIter=10, regParam=0.01)
   pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

   df = context.load("/path/to/data")
   model = pipeline.fit(df)
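   The fitted pipeline is itself a transformer over DataFrames, so scoring new data is one more call. A sketch, where test_df is a hypothetical DataFrame with a text column:

   predictions = model.transform(test_df)   # appends the words, features, and prediction columns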
28. Design Philosophy

   Simple tasks easy
   -  DSL for common operations
   -  Infer schema automatically (CSV, Parquet, JSON, …)
   -  MLlib pipeline integration

   Performance
   -  Catalyst optimizer
   -  Code generation

   Complex tasks possible
   -  RDD API
   -  Full expression library

   Interoperability
   -  Various data sources and formats
   -  Pandas, R, Hive …
29. DataFrame Internals

   • Represented internally as a “logical plan”
   • Execution is lazy, allowing it to be optimized by Catalyst
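   A small illustration of that laziness (a sketch, assuming a DataFrame named events with uid and date columns): building up transformations runs no job; work starts only when an action is called.

   recent = events.filter(events.date >= "2015-01-01").select("uid", "date")
   # Nothing has executed yet: 'recent' is only a logical plan held by the driver.

   n = recent.count()    # an action: Catalyst optimizes the plan and a Spark job runs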
30. Plan Optimization & Execution

   SQL AST or DataFrame → Unresolved Logical Plan → (Analysis, using the Catalog) → Logical Plan → (Logical Optimization) → Optimized Logical Plan → (Physical Planning) → Physical Plans → (Cost Model selects one) → Selected Physical Plan → (Code Generation) → RDDs

   DataFrames and SQL share the same optimization/execution pipeline.
32.

   joined = users.join(events, users.id == events.uid)
   filtered = joined.filter(events.date >= "2015-01-01")

   logical plan:
     filter
       join                (this join is expensive)
         scan (users)
         scan (events)

   physical plan:
     join
       scan (users)
       filter
         scan (events)
33. Data Sources supported by DataFrames

   Built-in and external data sources, e.g. { JSON }, JDBC, and more …
34. More Than Naïve Scans

   • Data Sources API can automatically prune columns and push filters to the source
     –  Parquet: skip irrelevant columns and blocks of data; turn string comparisons into integer comparisons for dictionary-encoded data
     –  JDBC: rewrite queries to push predicates down
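   A hedged example of what that buys for Parquet (the path and column names are made up, and a SQLContext named sqlContext is assumed): only the selected columns are read, and the date predicate can be evaluated against Parquet metadata so irrelevant blocks are skipped.

   events = sqlContext.parquetFile("/data/events")      # hypothetical Parquet dataset

   recent_uids = (events
                  .filter(events.date > "2015-01-01")   # pushed down to the Parquet reader
                  .select("uid"))                       # column pruning: only uid and date are read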
35.

   joined = users.join(events, users.id == events.uid)
   filtered = joined.filter(events.date > "2015-01-01")

   logical plan:
     filter
       join
         scan (users)
         scan (events)

   optimized plan:
     join
       scan (users)
       filter
         scan (events)

   optimized plan with intelligent data sources:
     join
       scan (users)
       filter pushed into scan (events)
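   The effect of these rewrites can be checked from user code. A sketch using DataFrame.explain() with the users and events DataFrames from the example above:

   joined = users.join(events, users.id == events.uid)
   filtered = joined.filter(events.date > "2015-01-01")

   filtered.explain(True)   # prints the parsed, analyzed, optimized, and physical plans;
                            # in the optimized plan the date filter sits below the join,
                            # next to (or inside) the scan of events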
36. DataFrames in Spark

   • APIs in Python, Java, Scala, and R (via SparkR)
   • For new users: make it easier to program Big Data
   • For existing users: make Spark programs simpler & easier to understand, while improving performance
   • Experimental API in Spark 1.3 (early March)
37. Our Vision
38. Thank you! Questions?
39. More Information

   Blog post introducing DataFrames: http://tinyurl.com/spark-dataframes
   Build from source: http://github.com/apache/spark (branch-1.3)
