Sergei Sokolenko "Advances in Stream Analytics: Apache Beam and Google Cloud Dataflow deep-dive"

In this session, Sergei Sokolenko, the Google product manager for Cloud Dataflow, will share the implementation details of many of the unique features available in Apache Beam and Cloud Dataflow, including:

- autoscaling of resources based on data inputs;
- separating compute and state storage for better scaling of resources;
- simultaneous grouping and joining of hundreds of terabytes in a hybrid in-memory/on-disk file system;
- dynamic rebalancing of work items away from overutilized worker nodes; and many others.

Customers benefit from these advances through faster execution of jobs, resource savings, and a fully managed data processing environment that runs in the Cloud and removes the need to manage infrastructure.

  1. 1. Advances in Stream Analytics: Google Cloud Dataflow and Apache Beam Kyiv, October 5th, 2019 Sergei Sokolenko Google
  2. 2. Session overview: your choices for doing streaming processing in Google Cloud; separating state storage from compute; autoscaling; making streaming easy.
  3. 3. Google Cloud Platform: our global infrastructure (map of current and future regions with their zone counts, network edge points of presence, CDN nodes, Dedicated Interconnect locations, and submarine cables).
  4. 4. A comprehensive Big Data platform, not just infrastructure: data ingestion at any scale, reliable streaming data pipelines, advanced analytics, data warehousing and data lakes. Products: Apache Beam, Cloud Pub/Sub, Cloud Dataflow, Cloud Dataproc, BigQuery, Cloud Storage, Data Transfer Service, Cloud Composer, Cloud IoT Core, Cloud Dataprep, Cloud AI Services, Google Data Studio, TensorFlow, Sheets, Storage Transfer Service, Data Catalog, Data Fusion.
  5. 5. Google’s data processing timeline, 2002-2016: GFS, MapReduce, Big Table, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow, Apache Beam.
  6. 6. Why FlumeJava? MapReduce: MAP tasks emit (K,V) pairs, the shuffle groups them into (K,V*) per key, and RED (reduce) tasks fold each group into a (K,W) result (sketched in plain Java below).
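
To make the (K,V) → (K,V*) → (K,W) contract concrete, here is a minimal single-process sketch (mine, not from the deck) in plain Java: the mapper emits (word, 1) pairs, an in-memory shuffle groups them per key, and the reducer sums each group. The sample input is made up.

    import java.util.*;
    import java.util.stream.*;

    public class MapReduceSketch {
      public static void main(String[] args) {
        List<String> input = List.of("to be or not to be", "to stream or to batch");  // made-up data

        // MAP: each record emits (K,V) pairs, here (word, 1).
        Stream<Map.Entry<String, Integer>> mapped = input.stream()
            .flatMap(line -> Arrays.stream(line.split(" ")))
            .map(word -> Map.entry(word, 1));

        // SHUFFLE: group values by key, producing (K, V*).
        Map<String, List<Integer>> grouped = mapped.collect(
            Collectors.groupingBy(Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // REDUCE: fold each value list into a single (K, W) result.
        grouped.forEach((word, ones) -> System.out.println(
            word + " -> " + ones.stream().mapToInt(Integer::intValue).sum()));
      }
    }
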
  7. 7. MapReduce can quickly get out of hand: one Google pipeline had 116 stages!
  8. 8. DAGs offer a better abstraction from execution (diagram: fs:// and Database sources flowing through Filter, Join, and Group stages into fs:// and Database sinks; see the Beam sketch below).
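
The same DAG shape can be written directly against Beam's Java SDK. A hedged sketch (the bucket paths and the ERROR predicate are hypothetical, not from the deck); the runner, not the author, decides how the graph executes:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Filter;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class DagPipeline {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        p.apply("Read", TextIO.read().from("gs://my-bucket/events*.txt"))  // hypothetical source
            .apply("KeepErrors", Filter.by((String line) -> line.contains("ERROR")))  // Filter stage
            .apply("CountPerMessage", Count.perElement())  // Group/aggregate stage
            .apply("Format", MapElements.into(TypeDescriptors.strings())
                .via((KV<String, Long> kv) -> kv.getKey() + "," + kv.getValue()))
            .apply("Write", TextIO.write().to("gs://my-bucket/error-counts"));  // hypothetical sink
        p.run().waitUntilFinish();
      }
    }
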
  9. 9. By 2025, more than a quarter of data created in the global datasphere will be real-time in nature. (Source: IDC)
  10. 10. Data Streams and Late Arriving Data (diagram: events with an 8:00 event time arriving at processing times spread between 8:00 and 14:00).
  11. 11. Goal: Grouping by Event Time into Time Windows (diagram: input arriving in processing-time order between 9:00 and 14:00, output grouped into event-time windows; see the Beam sketch below).
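
In Beam's Java SDK this goal is expressed by assigning elements to event-time windows before aggregating. A minimal sketch (the hourly window size and the KV<String, Integer> element type are assumptions for illustration):

    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class WindowedSums {
      // Place each element into the hourly window containing its event
      // timestamp, then sum per key within each window.
      static PCollection<KV<String, Integer>> hourlyTotals(
          PCollection<KV<String, Integer>> input) {
        return input
            .apply(Window.<KV<String, Integer>>into(
                FixedWindows.of(Duration.standardHours(1))))
            .apply(Sum.integersPerKey());
      }
    }
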
  12. 12. MillWheel: low-latency, accurate data processing.
  13. 13. Common steps in Stream Analytics (reference architecture of streaming processing in GCP): IoT events, end-user apps, and DBs feed Cloud Pub/Sub (ingest & distribute); Dataflow Streaming aggregates, enriches, and detects; Dataflow Batch backfills and reprocesses; Cloud Composer orchestrates; results flow to Cloud AI Platform and Bigtable (machine learning) and, via the BigQuery Streaming API, to BigQuery (data warehousing) for action.
  14. 14. What is Beam and Dataflow? Apache Beam (SDK): open source programming model; unified batch and streaming; top Apache project by dev@ activity; runner and language portability. Cloud Dataflow: automatic optimizations scale to millions of QPS; serverless, fully managed data processing; state storage in Shuffle and Streaming Engine; exactly-once streaming semantics.
  15. 15. The Beam Vision: Sum per key in every SDK. Java: Input.apply(Sum.integersPerKey()). Python: input | Sum.PerKey(). Go: stats.Sum(s, input). SQL: SELECT key, SUM(value) FROM input GROUP BY key. Runners: Cloud Dataflow, Apache Spark, Apache Flink, Apache Apex, Gearpump, Apache Samza, Apache Nemo (incubating), IBM Streams. (A full Java example follows below.)
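
For context, a complete (if toy) pipeline around the Java one-liner from the slide; the sample data is made up, and any runner in the list above could execute the graph:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    public class SumPerKeyExample {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        // Made-up input; any PCollection<KV<String, Integer>> works.
        PCollection<KV<String, Integer>> input =
            p.apply(Create.of(KV.of("a", 1), KV.of("b", 2), KV.of("a", 3)));
        input.apply(Sum.integersPerKey());  // yields ("a", 4) and ("b", 2)
        p.run().waitUntilFinish();
      }
    }
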
  16. 16. Lessons Learned While Building Cloud Dataflow: ● Separating compute from state storage ● Automatic scaling ● Building streaming systems can be hard, but it does not have to be
  17. 17. Separating compute from state storage to improve scalability
  18. 18. Traditional Distributed Data Processing Architecture: ● Jobs executed on clusters of VMs ● Job state stored on network-attached volumes ● Control plane orchestrates data plane (diagram: user code on VMs connected over the network to state storage and a control-plane VM).
  19. 19. Traditional architecture works well... except for Joins and Group Bys (same Filter/Join/Group DAG between fs:// and Database endpoints).
  20. 20. Shuffling key-value pairs: ● Unsorted data elements (diagram: four nodes, each holding a mixed bag of records such as <key1, record>, <key5, record>, <key3, record>, ...)
  21. 21. Shuffling key-value pairs: ● Unsorted data elements ● Goal: sort data elements by key (diagram: the target layout, with records contiguous by key)
  22. 22. Shuffling key-value pairs: ● Unsorted data elements ● Goal: sort data elements by key ● KV pairs need to be exchanged between nodes
  23. 23. Shuffling key-value pairs: ● Unsorted data elements ● Goal: sort data elements by key ● KV pairs need to be exchanged between nodes ● Until everything is sorted (diagram: each node now owns a key range: key1-key2, key3-key4, key5-key6, key7-key8; see the toy sketch below)
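
The exchange step in slides 20-23 is essentially partitioning by key, so that every record for a given key lands on the node that owns it. A toy single-process illustration (mine, not the deck's), using hash partitioning in place of the key ranges shown on the slide:

    import java.util.*;

    public class ToyShuffle {
      public static void main(String[] args) {
        int numNodes = 4;
        // Each "node" is just a list bucket in this sketch.
        List<List<Map.Entry<String, String>>> nodes = new ArrayList<>();
        for (int i = 0; i < numNodes; i++) nodes.add(new ArrayList<>());

        List<Map.Entry<String, String>> records = List.of(  // made-up records
            Map.entry("key1", "record"), Map.entry("key5", "record"),
            Map.entry("key3", "record"), Map.entry("key8", "record"));

        // Route every record to the node that owns its key, so all
        // records sharing a key end up co-located and can be grouped.
        for (Map.Entry<String, String> rec : records) {
          int owner = Math.floorMod(rec.getKey().hashCode(), numNodes);
          nodes.get(owner).add(rec);
        }
        for (int i = 0; i < numNodes; i++)
          System.out.println("node " + i + ": " + nodes.get(i));
      }
    }
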
  24. 24. Traditional architecture requires manual tuning when data volumes exceed dozens of TBs (same VM and state-storage diagram as before).
  25. 25. Distributed in-memory Shuffle in batch Cloud Dataflow: compute connects over a petabit network to the Dataflow Shuffle service, a hybrid distributed in-memory and on-disk file system behind a shuffle proxy, with automatic zone placement across zones 'a', 'b', and 'c' of a region.
  26. 26. Faster processing, no tuning required: Dataflow Shuffle is usually faster than worker-based shuffle, including shuffles backed by SSD persistent disks (SSD-PD). (Chart: shuffle runtime in minutes.)
  27. 27. Supporting larger datasets: Dataflow Shuffle has been used to shuffle 200TB+ datasets. (Chart: shuffle dataset size in TB.)
  28. 28. Storing state: what about streaming pipelines? Streaming shuffle: just like in batch, streams need to be grouped and joined via a distributed streaming shuffle. Windowing: late-arriving data requires buffering time-window data, accumulating elements until triggering conditions occur (see the Beam trigger sketch below).
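
In Beam's Java SDK, that buffer-then-trigger behavior is declared on the window itself. A sketch with assumed numbers (5-minute windows, 1 hour of allowed lateness, late panes firing per element; none of these values come from the deck):

    import org.apache.beam.sdk.transforms.windowing.AfterPane;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class BufferUntilTriggered {
      // Buffer each 5-minute window until the watermark says the window is
      // complete, then emit; elements up to 1 hour late fire follow-up
      // panes that accumulate with what was already emitted.
      static PCollection<KV<String, Integer>> windowed(
          PCollection<KV<String, Integer>> input) {
        return input.apply(
            Window.<KV<String, Integer>>into(
                    FixedWindows.of(Duration.standardMinutes(5)))
                .triggering(AfterWatermark.pastEndOfWindow()
                    .withLateFirings(AfterPane.elementCountAtLeast(1)))
                .withAllowedLateness(Duration.standardHours(1))
                .accumulatingFiredPanes());
      }
    }
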
  29. 29. Goal: Grouping by Event Time into Time Windows (same diagram as slide 11: input in processing-time order, output grouped into event-time windows).
  30. 30. Even more state to store on disks in streaming. Shuffle data elements: key ranges are assigned to workers, and the data elements for those keys are stored on Persistent Disks (diagram: ranges key 0000 … key 1234, key 1235 … key ABC2, key ABC3 … key DEF5, key DEF6 … key GHI2). Time window data: also assigned to workers; when time windows close, the data is processed on the workers.
  31. 31. Dataflow Streaming Engine benefits: ● Better supportability ● Fewer worker resources ● Smoother autoscaling (diagram: user code stays on workers, while streaming shuffle and window state storage move into the Streaming Engine).
  32. 32. Autoscaling: even better with separate compute and state storage (diagram: Dataflow without Streaming Engine keeps key-range state on worker VMs; Dataflow with Streaming Engine moves streaming shuffle and window state storage into the service).
  33. 33. Dataflow with Streaming Engine vs. Dataflow without Streaming Engine (side-by-side comparison).
  34. 34. Streaming can be hard, but does not have to be
  35. 35. We’ve set out to make Streaming as accessible as Batch.
  36. 36. Easy Stream Analytics in SQL (diagram: Input1 joined with Input2, grouped, written to Output): SELECT input1.*, input2.* FROM input1 LEFT OUTER JOIN input2 ON input1.Id = input2.Id. Use Dataflow SQL from the BigQuery UI: ● Join Pub/Sub streams with files or tables ● Write into BigQuery for dashboarding ● Store Pub/Sub schemas in Data Catalog ● Use SQL skills for streaming data processing
  37. 37. Demo
  38. 38. Demo: Streaming Analytics with SQL (a PubSub topic of transactions and a BigQuery table feed a streaming SQL pipeline on Dataflow, which writes to BigQuery tables):

      SELECT
        sr.sales_region,
        TUMBLE_START("INTERVAL 5 SECOND") AS period_start,
        SUM(tr.payload.amount) AS amount
      FROM `pubsub.dataflow-sql.transactions` AS tr
      INNER JOIN `bigquery.dataflow-sql.opsdb.us_state_salesregions` AS sr
        ON tr.payload.state = sr.state_code
      GROUP BY
        sr.sales_region,
        TUMBLE(tr.event_timestamp, "INTERVAL 5 SECOND")
  39. 39. Main takeaways: ● Google Cloud offers both infrastructure-as-a-service and fully managed services ● Separating compute from state storage helps make stream and batch processing scalable ● SQL brings the complexity of streaming processing way down
  40. 40. Thank you!