Trends for Big Data and Apache Spark in 2017
Matei Zaharia
@matei_zaharia
Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, I’ll cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, I’ll talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming.

1. Trends for Big Data and Apache Spark in 2017
Matei Zaharia
@matei_zaharia
2. 2016: A Great Year for Spark
• Spark 2.0: stabilizes structured APIs, 10x speedups, SQL 2003 support
• Structured Streaming
• 3.6x growth in meetup members (66K in 2015 to 240K in 2016)
3. This Talk
What are the new trends for big data apps in 2017?
Work to address them at Databricks + elsewhere
4. Three Key Trends
1. Hardware: compute bottleneck
2. Users: democratizing access to big data
3. Applications: production apps
5. Three Key Trends
1. Hardware: compute bottleneck
2. Users: democratizing access to big data
3. Applications: production apps
6. Hardware Trends
          2010
Storage   100 MB/s (HDD)
Network   1 Gbps
CPU       ~3 GHz
7. Hardware Trends
          2010             2017
Storage   100 MB/s (HDD)   1000 MB/s (SSD)
Network   1 Gbps           10 Gbps
CPU       ~3 GHz           ~3 GHz
8. Hardware Trends
          2010             2017              Change
Storage   100 MB/s (HDD)   1000 MB/s (SSD)   10x
Network   1 Gbps           10 Gbps           10x
CPU       ~3 GHz           ~3 GHz            (no change)
Response: simpler but more parallel devices (e.g. GPU, FPGA)
9. Summary
In 2005-2010, I/O was the name of the game
• Network locality, compression, in-memory caching
Now, compute efficiency matters even for data-intensive apps
• And it is harder to obtain with more types of hardware!
10. Spark Effort: Project Tungsten
Optimize Apache Spark's CPU and memory usage, via:
• Binary storage format
• Runtime code generation
11. Tungsten's Binary Encoding
Example row: (123, "data", "bricks")
[Diagram: the row packed into one contiguous buffer: a null bitset (0x0), the int 123 stored in its fixed-width slot, per-field words (32L, 48L) holding the offset to data and the field lengths, then the variable-length bytes: 4 "data", 6 "bricks".]
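To make that layout concrete, here is a toy Scala encoder for the row above. This is a hedged sketch of the idea only, not Spark's actual UnsafeRow implementation: the packed (offset << 32 | length) words and the unaligned variable-length section are simplifying assumptions, and real Tungsten rows pad and align fields, which is why the slide's offsets differ from this toy's.

  import java.nio.ByteBuffer

  // Toy sketch of a Tungsten-style binary row for (123, "data", "bricks").
  object ToyRowEncoder {
    def encode(i: Int, s1: String, s2: String): Array[Byte] = {
      val b1 = s1.getBytes("UTF-8")                   // "data"   -> 4 bytes
      val b2 = s2.getBytes("UTF-8")                   // "bricks" -> 6 bytes
      val fixed = 8 + 3 * 8                           // null bitset + one word per field
      val buf = ByteBuffer.allocate(fixed + b1.length + b2.length)
      buf.putLong(0L)                                 // null bitset: 0x0, no nulls
      buf.putLong(i.toLong)                           // field 0: the int, stored in place
      buf.putLong((fixed.toLong << 32) | b1.length)   // field 1: offset 32, length 4
      buf.putLong(((fixed + b1.length).toLong << 32) | b2.length) // field 2: offset 36, length 6
      buf.put(b1).put(b2)                             // variable-length section
      buf.array()
    }
  }

Because every field lives at a fixed word, a reader can fetch field 0 with plain pointer arithmetic, which is exactly what the generated code on the next slide exploits.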
12. Tungsten's Code Generation
DataFrame/SQL code:
  df.where(df("year") > 2015)
Logical expression:
  GreaterThan(year#234, Literal(2015))
Generated Java code (compiles to pointer arithmetic):
  boolean filter(Object baseObject) {
    long offset = baseOffset + bitSetWidthInBytes + 3*8L;
    int value = Platform.getInt(baseObject, offset);
    return value > 2015;
  }
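For intuition, a minimal sketch of how such a generator can be built: walk a small expression tree and emit Java source. The names (ToyCodegen, Expr, FieldRef, genFilter) are hypothetical, not Catalyst's classes; real Spark generates entire classes and compiles them with Janino.

  // Hypothetical mini-generator: expression tree in, Java source out.
  object ToyCodegen {
    sealed trait Expr
    case class FieldRef(ordinal: Int) extends Expr    // e.g. year#234 at ordinal 3
    case class Lit(value: Int) extends Expr
    case class GreaterThan(l: Expr, r: Expr) extends Expr

    def gen(e: Expr): String = e match {
      case FieldRef(i) =>
        // Same pointer arithmetic as the slide: skip the bitset, index a word
        s"Platform.getInt(baseObject, baseOffset + bitSetWidthInBytes + $i*8L)"
      case Lit(v)            => v.toString
      case GreaterThan(l, r) => s"(${gen(l)} > ${gen(r)})"
    }

    def genFilter(pred: Expr): String =
      s"boolean filter(Object baseObject) { return ${gen(pred)}; }"
  }

  // ToyCodegen.genFilter(ToyCodegen.GreaterThan(ToyCodegen.FieldRef(3), ToyCodegen.Lit(2015)))
  // emits source equivalent to the generated filter above.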
13. Impact of Tungsten
Whole-stage code gen:  Spark 1.6: 14M rows/s  ->  Spark 2.0: 125M rows/s
Optimized Parquet:     Spark 1.6: 11M rows/s  ->  Spark 2.0: 90M rows/s
14. Ongoing Work
Standard binary format to pass data to external code
• Either existing format or Apache Arrow (SPARK-19489, SPARK-13545)
• Binary format for data sources (SPARK-15689)
Integrations with deep learning libraries
• Intel BigDL, Databricks TensorFrames (see talks today)
Accelerators as a first-class resource
15. Three Key Trends
1. Hardware: compute bottleneck
2. Users: democratizing access to big data
3. Applications: production apps
16. Languages Used for Spark
[Bar chart of Spark user survey results. 2014 languages used: Scala 84%, Java 38%, Python 38%. 2016 languages used: Scala 65%, Python 62%, Java 29%, R 20%.]
+ Everyone uses SQL (95%)
17. Our Approach
Spark SQL
• Spark 2.0: more of SQL 2003 than any other open source engine
High-level APIs based on single-node tools
• DataFrames, ML Pipelines, PySpark, SparkR (see the sketch below)
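As a quick illustration of these high-level APIs, a hedged Spark 2.x Scala sketch: it assumes an active SparkSession named spark, and the paths, column names, and trainingDF are made up for the example.

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

  // DataFrames: declarative, pandas/R-style transformations
  val logs = spark.read.json("s3://logs")              // path is illustrative
  val slowUsers = logs.filter(logs("latency") > 100)
    .groupBy("userid").count()

  // ML Pipelines: scikit-learn-style composition of feature steps + model
  val pipeline = new Pipeline().setStages(Array(
    new Tokenizer().setInputCol("text").setOutputCol("words"),
    new HashingTF().setInputCol("words").setOutputCol("features"),
    new LogisticRegression()))
  val model = pipeline.fit(trainingDF)                 // trainingDF: assumed DataFrame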
18. Ongoing Work
Next generation of SQL and DataFrames
• Cost-based optimizer (SPARK-16026 + many others)
• Improved data sources (SPARK-16099, SPARK-18352)
Continue improving Python/R (SPARK-18924, 17919, 13534, …)
Make Spark easier to run on a single node
• Publish to PyPI (SPARK-18267) and CRAN (SPARK-15799)
• Optimize for large servers
• As convenient as Python multiprocessing (a local-mode sketch follows)
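For context, single-node use is already possible today via Spark's local mode; the sketch below is self-contained, with the app name and input path as placeholders.

  import org.apache.spark.sql.SparkSession

  // Local mode: one JVM, all cores, no cluster manager required
  val spark = SparkSession.builder()
    .appName("single-node-example")      // placeholder name
    .master("local[*]")                  // use every core on this machine
    .getOrCreate()
  import spark.implicits._               // encoders for Dataset operations

  val wordCounts = spark.read.textFile("/tmp/logs.txt")   // placeholder path
    .flatMap(_.split("\\s+"))
    .groupByKey(identity)
    .count()
  wordCounts.show()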
19. Three Key Trends
1. Hardware: compute bottleneck
2. Users: democratizing access to big data
3. Applications: production apps
20. Big Data in Production
Big data is moving from offline analytics to production use
• Incorporate new data in seconds (streaming)
• Power low-latency queries (data serving)
Currently very hard to build: separate streaming, serving & batch systems
Our goal: a single API for "continuous apps"
[Diagram: a continuous application combining an input stream, batch jobs over static data, and a serving system answering ad-hoc queries]
21. Structured Streaming
High-level streaming API built on DataFrames
• Transactional I/O to files, databases, queues
• Integration with batch & interactive queries
22. Structured Streaming
High-level streaming API built on DataFrames
• Transactional I/O to files, databases, queues
• Integration with batch & interactive queries
API: incrementalize an existing DataFrame query
Example batch job:
  logs = ctx.read.format("json").load("s3://logs")
  logs.groupBy("userid", "hour").avg("latency")
    .write.format("parquet")
    .save("s3://stats")
23. Structured Streaming
High-level streaming API built on DataFrames
• Transactional I/O to files, databases, queues
• Integration with batch & interactive queries
API: incrementalize an existing DataFrame query
Example as streaming:
  logs = ctx.readStream.format("json").load("s3://logs")
  logs.groupBy("userid", "hour").avg("latency")
    .writeStream.format("parquet")
    .start("s3://stats")
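For completeness, the same incrementalized query sketched in Spark 2.x Scala, with two details the slide elides: streaming file sources need a user-supplied schema, and the file sink needs a checkpoint location. The schema fields and checkpoint path here are assumptions, and depending on the Spark version an event-time watermark may also be required before an aggregation can be written to the file sink.

  import org.apache.spark.sql.types.StructType

  // Streaming file sources require an explicit schema (fields assumed here)
  val logSchema = new StructType()
    .add("userid", "string")
    .add("hour", "int")
    .add("latency", "double")

  val logs = spark.readStream
    .schema(logSchema)
    .format("json")
    .load("s3://logs")

  logs.groupBy("userid", "hour").avg("latency")
    .writeStream
    .format("parquet")
    .option("checkpointLocation", "s3://stats/_checkpoints")  // assumed path
    .start("s3://stats")

Note how only the read and write calls change from the batch job; the query in the middle is untouched, which is the point of the API.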
24. Early Experience
Running in our analytics pipeline since the second half of 2016
Powers real-time metrics for properties including Nickelodeon and MTV
Monitors 1000s of WiFi access points
25. Ongoing Work
Integrations with more systems
• JDBC source and sink (SPARK-19478, SPARK-19031)
• Unified access to Kafka (SPARK-18682)
New operators
• mapWithState operator (SPARK-19067)
• Session windows (SPARK-10816)
Performance and latency
26. Three Key Trends
1. Hardware: compute bottleneck
2. Users: democratizing access to big data
3. Applications: production apps
All integrated in Apache Spark 2.0
27. Thanks
Enjoy Spark Summit!