
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and demos by Michael Armbrust and Tim Hunter


2017 continues to be an exciting year for big data and Apache Spark. I will talk about two major initiatives that Databricks has been building: Structured Streaming, the new high-level API for stream processing, and new libraries that we are developing for machine learning. These initiatives can provide order-of-magnitude performance improvements over current open source systems while making stream processing and machine learning more accessible than ever before.



  1. 1. New Frontiers for Apache Spark - Matei Zaharia (@matei_zaharia)
  2. 2. Welcome to Spark Summit 2017. Our largest summit, following another year of community growth. [Chart: Spark Meetup members worldwide - 66K (2015), 225K (2016), 365K (2017). Chart: Spark version usage in Databricks from 06/2016 to 06/2017, across versions 1.5, 1.6, 2.0, and 2.1.]
  3. 3. Summit Highlights
  4. 4. Apache Spark Philosophy: a unified engine for complete data applications, with high-level, user-friendly APIs (SQL, Streaming, ML, Graph, ...)
  5. 5. Coming in Spark 2.2 • Data warehousing: cost-based SQL optimizer • Structured Streaming: marked production-ready • Python usability: pip install pyspark (a quick sketch follows below). Currently in the release candidate stage on the dev list.
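A minimal sketch of the improved Python story: install PySpark straight from PyPI and start a session. The application name and local master below are placeholders, not part of the talk.
      # pip install pyspark   <- installs Spark directly from PyPI
      from pyspark.sql import SparkSession

      # Create (or reuse) a local SparkSession; "demo" is just a placeholder app name.
      spark = SparkSession.builder.appName("demo").master("local[*]").getOrCreate()
      spark.range(10).show()  # quick sanity check that the session works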
  6. 6. Two New Open Source Efforts from Databricks: 1. Deep Learning, 2. Streaming Performance
  7. 7. Deep Learning has Huge Potential: unprecedented ability to work with unstructured data such as images and text.
  8. 8. But Deep Learning is Hard to Use. Current APIs (TensorFlow, Keras, BigDL, etc.) are low-level: • Build a computation graph from scratch • Scale-out typically requires manual parallelization. Hard to expose models in larger applications. Very similar to early big data APIs (MapReduce).
  9. 9. Our Goal: enable an order of magnitude more users to build applications using deep learning, and provide scale & production use out of the box.
  10. 10. Deep Learning Pipelines: a new high-level API for deep learning that integrates with Apache Spark's ML Pipelines • Common use cases in just a few lines of code • Automatically scale out on Spark • Expose models in batch/streaming apps & Spark SQL. Builds on existing engines (TensorFlow, Keras, BigDL).
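As an illustration of "common use cases in a few lines of code", here is a minimal sketch against the early Deep Learning Pipelines release; the image directory is a placeholder, and the helper and column names (readImages, DeepImagePredictor, filePath) follow that early sparkdl API and may differ in later versions.
      from sparkdl import readImages, DeepImagePredictor

      # Load a directory of images into a DataFrame (placeholder path).
      image_df = readImages("/data/images")

      # Run a pre-trained ImageNet model (InceptionV3) over every image, scaled out by Spark.
      predictor = DeepImagePredictor(inputCol="image", outputCol="predicted_labels",
                                     modelName="InceptionV3", decodePredictions=True, topK=5)
      predictor.transform(image_df).select("filePath", "predicted_labels").show(truncate=False)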
  11. 11. Deep Learning Pipelines Demo Tim Hunter - @timjhunter Spark Summit 2017
  12. 12. Using Apache Spark and Deep Learning • New library: Deep Learning Pipelines • Simple API for Deep Learning, integrated with ML pipelines • Scales common tasks with transformers and estimators • Embeds Deep Learning models in Spark
  13. 13. Example: Image classification
  14. 14. Example: Identify the James Bond cars
  15. 15. Example: Identify the James Bond cars
  16. 16. Good application for Deep Learning • Neural networks are very good at dealing with images • Can work with complex situations: invariance to rotations, incomplete data
  17. 17. Transfer Learning
  18. 18. Transfer Learning
  19. 19. Transfer Learning
  20. 20. Transfer Learning
  21. 21. Transfer Learning
  22. 22. Transfer Learning [Diagram: DeepImageFeaturizer feeds a classifier with a SoftMax output, e.g. GIANT PANDA 0.9, RED PANDA 0.05, RACCOON 0.01, ...]
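A sketch of the transfer-learning pattern this slide depicts, assuming the early spark-deep-learning API: a pre-trained network acts as a fixed featurizer and a standard MLlib classifier is trained on top. train_df and test_df are hypothetical DataFrames of (image, label) rows.
      from pyspark.ml import Pipeline
      from pyspark.ml.classification import LogisticRegression
      from sparkdl import DeepImageFeaturizer

      # Pre-trained InceptionV3 acts as a fixed feature extractor for each image.
      featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                       modelName="InceptionV3")
      # An ordinary MLlib classifier learns the task-specific mapping on those features.
      lr = LogisticRegression(maxIter=20, regParam=0.05, labelCol="label")

      model = Pipeline(stages=[featurizer, lr]).fit(train_df)  # train_df is hypothetical
      predictions = model.transform(test_df)                   # so is test_df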
  23. 23. Deep Learning without Deep Pockets • Simple API for Deep Learning, integrated with MLlib • Scales common tasks with transformers and estimators • Embeds Deep Learning models in MLlib and Spark SQL • Early release of Deep Learning Pipelines: https://github.com/databricks/spark-deep-learning
  24. 24. Two New Open Source Efforts from Databricks: 1. Deep Learning, 2. Streaming Performance
  25. 25. Structured Streaming: Ready For Production Michael Armbrust - @michaelarmbrust Spark Summit 2017
  26. 26. What is Structured Streaming? Our goal was to build the easiest streaming engine using the power of Spark SQL. • High-Level APIs — DataFrames, Datasets and SQL. Same in streaming and in batch. • Event-time Processing — Native support for working with out-of-order and late data. • End-to-end Exactly Once — Transactional both in processing and output.
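To make those three points concrete, a minimal PySpark sketch of an end-to-end streaming query with an event-time window; the built-in rate source stands in for a real stream such as Kafka, and the derived event_type column is purely illustrative.
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import expr, window

      spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

      # The built-in "rate" source emits (timestamp, value) rows; it stands in for Kafka here.
      events = (spark.readStream.format("rate").option("rowsPerSecond", 100).load()
                .withColumn("event_type", expr("IF(value % 2 = 0, 'view', 'click')")))

      # Event-time windowed count; the query body looks exactly like a batch aggregation.
      counts = (events
                .where("event_type = 'view'")
                .withWatermark("timestamp", "1 minute")   # tolerate late / out-of-order data
                .groupBy(window("timestamp", "10 seconds"))
                .count())

      # Incremental, fault-tolerant execution; the console sink is only for demonstration.
      query = counts.writeStream.outputMode("update").format("console").start()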
  27. 27. Simple APIs: Benchmark. The query in Kafka Streams (filter by click type and project, join with the campaigns table, then a grouped, windowed count):
      KStream<String, ProjectedEvent> filteredEvents = kEvents.filter((key, value) -> {
          return value.event_type.equals("view");
      }).mapValues((value) -> {
          return new ProjectedEvent(value.ad_id, value.event_time);
      });
      KTable<String, String> kCampaigns = builder.table("campaigns", "campaign-state");
      KTable<String, CampaignAd> deserCampaigns = kCampaigns.mapValues((value) -> {
          Map<String, String> campMap = Json.parser.readValue(value);
          return new CampaignAd(campMap.get("ad_id"), campMap.get("campaign_id"));
      });
      KStream<String, String> joined = filteredEvents.join(deserCampaigns, (value1, value2) -> {
          return value2.campaign_id;
      }, Serdes.String(), Serdes.serdeFrom(new ProjectedEventSerializer(), new ProjectedEventDeserializer()));
      KStream<String, String> keyedByCampaign = joined.selectKey((key, value) -> value);
      KTable<Windowed<String>, Long> counts = keyedByCampaign.groupByKey()
          .count(TimeWindows.of(10000), "time-windows");
  28. 28. Simple APIs: Benchmark. The same query with DataFrames (the Kafka Streams code from the previous slide is shown alongside for comparison):
      events
        .where("event_type = 'view'")
        .join(table("campaigns"), "ad_id")
        .groupBy(
          window('event_time, "10 seconds"),
          'campaign_id)
        .count()
  29. 29. Simple APIs: Benchmark. The same query in SQL (again shown alongside the Kafka Streams code for comparison):
      SELECT COUNT(*) FROM events
      JOIN campaigns USING (ad_id)
      WHERE event_type = 'view'
      GROUP BY window(event_time, '10 seconds'), campaign_id
  30. 30. Performance: Benchmark (https://data-artisans.com/blog/extending-the-yahoo-streaming-benchmark). Throughput at ~200 ms latency, at 5x lower cost. Streaming queries use the Catalyst optimizer and the Tungsten execution engine. [Chart: throughput in millions of records/second for Kafka Streams, Apache Flink, and Apache Spark.]
  31. 31. GA in Spark 2.2 • Processed over 3 trillion rows last month in production • More production use cases of Structured Streaming than DStreams among Databricks customers • Spark 2.2 removes the "Experimental" tag from all APIs
  32. 32. What about Latency?
  33. 33. Demo: Streaming Find James Bond • Stream of events containing images • Need to join with a locations table and filter using ML (a rough sketch of this pipeline follows below) • Alert quickly on any suspicious Bond sightings… Need < 10 ms to catch him!
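A rough sketch of the shape of that demo pipeline. Every name here (sightings, locations, bond_model, the parquet path) is a hypothetical placeholder; bond_model stands for a fitted pipeline like the transfer-learning one sketched earlier.
      # All names below are hypothetical placeholders.
      locations = spark.read.parquet("/data/locations")    # small static dimension table

      enriched = sightings.join(locations, "camera_id")    # stream-static join
      scored = bond_model.transform(enriched)              # MLlib transformers also run on streams

      alerts = scored.where("prediction = 1.0")            # keep only suspected Bond sightings
      alerts.writeStream.format("console").outputMode("append").start()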
  34. 34. Continuous Processing: a new execution mode that allows fully pipelined execution. • Streaming execution without micro-batches • Supports async checkpointing and ~1 ms latency • No changes required for user code. Proposal available at https://issues.apache.org/jira/browse/SPARK-20928
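For context, the shape this proposal eventually took (in Spark 2.3, after this talk) is a continuous trigger on an otherwise unchanged query; a sketch, reusing the events stream from the earlier example. Note that continuous mode targets map/filter-style queries rather than aggregations.
      # Same filter query, executed without micro-batches via a continuous trigger.
      # (This API shipped in Spark 2.3; shown only to illustrate "no user code changes".)
      views = events.where("event_type = 'view'")
      query = (views.writeStream
               .format("console")
               .trigger(continuous="1 second")   # checkpointing interval, not a batch interval
               .start())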
  35. 35. Apache Spark Structured Streaming: the easiest streaming engine is now also the fastest!
  36. 36. Conclusion: we're bringing two new workloads to Apache Spark: • Deep learning • Low-latency streaming. Find out more in the sessions today!
  37. 37. Thanks Enjoy Spark Summit!
