Building Modern Data Pipelines for Time Series Data on GCP with InfluxData by Ron Pantofaro

InfluxData
Mar. 17, 2019

Building Modern Data Pipelines for Time Series Data on GCP with InfluxData by Ron Pantofaro

  1. Data pipelines on Google Cloud. Ron Pantofaro, Solutions Architect, Google (@panto)
  2. User experience. Business goal: respond to business events as they happen. Diagram: data from IoT, mobile, web, and endpoint clients, plus transactional & device data and databases, flows through Ingest, Transform, and Analyze stages to data consumers.
  3. IT goal: Simplify ETL architecture. The diagram contrasts moving data as files with moving it as events through Pub/Sub (data producer to data consumer).
     Files: ● Applications must persist millions of small files ● Every file must arrive as a precondition to job completion, delaying processing ● Unit of access is different in application logic
     Events: ● No persistence of files required ● Every message guaranteed to be delivered to every reader ● Unit of access is the same in application logic
  4. The Lambda Model: mobile devices generating tens of thousands of events/sec, tens of billions of events/month, and hundreds of billions of events/year feed both a streaming path and a batch path.
  5. A Unified Model: the same event volumes (tens of thousands of events/sec, tens of billions/month, hundreds of billions/year) handled with Apache Beam plus a choice of runners ("or ... or" in the diagram) instead of separate batch and streaming pipelines.
  6. What is Cloud Pub/Sub? A fully managed, real-time messaging service that allows you to send and receive messages between independent applications. A publisher sends messages to a topic; a subscriber receives them through a subscription. (A hedged publisher sketch appears after the slide list.)
  7. What is Cloud Dataflow? Diagram: a managed service sitting between a source and a sink, handling compute and storage for both bounded and unbounded data, with resource management, a resource auto-scaler, a dynamic work rebalancer, a work scheduler, monitoring, log collection, graph optimization, auto-healing, and intelligent watermarking.
  8. A Unified Model on Google Cloud Platform: events from mobile devices (tens of thousands of events/sec, tens of billions/month, hundreds of billions/year) flow through Cloud Pub/Sub and Cloud Dataflow into storage.
  9. What is Cloud Dataflow? A simple graphic showing how Dataflow can integrate and transform data from two sources: one discrete (batch) job and endless incoming (streaming) data.
  10. Why use Cloud Dataflow? (1) Fully managed and auto-configured (deploy, schedule & monitor); (2) auto graph-optimized for the best execution path; (3) autoscaling mid-job; (4) dynamic work rebalancing mid-job.
  11. Why use Cloud Dataflow? Focus on (2) auto graph optimization: the diagram shows adjacent steps C and D being fused into a single step C+D for the best execution path.
  12. Why use Cloud Dataflow? Focus on (3) autoscaling mid-job: the diagram shows the worker pool tracking load as throughput moves between 50, 800, 1,200, and 5,000 RPS (*means 100% cluster utilization by definition).
  13. Why use Cloud Dataflow? Focus on (4) dynamic work rebalancing mid-job: the diagram compares a 100-minute run without rebalancing to a 65-minute run with it.
  14. Autoscaling at work: start off with 3 workers and a 10-minute estimate, and things are looking okay. Re-estimation then shows orders of magnitude more work (roughly 3 days instead of 10 minutes): the job needs 100 workers! But 100 workers sit idle if you don't have 100 pieces of work, so the existing work has to be redistributed as well, and that is really the most important part.
  15. // Word-count pipeline as shown on the slide (Dataflow SDK for Java 1.x style); "…" elides the options setup.
      public static void main(String[] args) {
        …
        Pipeline p = Pipeline.create(options);
        p.apply(TextIO.Read.named("ReadLines")
             .from(options.getInputFile()))      // read lines from the input file
         .apply(new CountWords())                // count the words
         .apply(ParDo.of(new FormatAsTextFn()))  // format each count as text
         .apply(TextIO.Write.named("WriteCounts")
             .to(options.getOutput()));          // write the counts to the output location
        p.run();
      }
  16. pipeline
          // Read from Pub/Sub
          .apply(PubsubIO.Read.named("read from PubSub")
              .topic(String.format("projects/%s/topics/%s",
                  options.getSourceProject(), options.getSourceTopic()))
              .timestampLabel("ts")
              .withCoder(TableRowJsonCoder.of()))
          // Window of 1 second
          .apply("window 1s", Window.into(FixedWindows.of(Duration.standardSeconds(1))))
          // Create KV pairs
          .apply("mark rides", MapElements.via(new MarkRides()))
          // Count them by key
          .apply("count similar", Count.perKey())
          // Format for output
          .apply("format rides", MapElements.via(new TransformRides()))
          // Write to Pub/Sub
          .apply(PubsubIO.Write.named("WriteToPubsub")
              .topic(String.format("projects/%s/topics/%s",
                  options.getSinkProject(), options.getSinkTopic()))
              .withCoder(TableRowJsonCoder.of()));
  17. Using Dataflow templates: launching a simple pipeline. Ingest with Cloud Pub/Sub, run pipelines on Cloud Dataflow, analyze in BigQuery.
  18. Pub/Sub to BigQuery. Dataflow templates let you stage your job's artifacts in Google Cloud Storage and launch template jobs via the REST API or the Cloud Console. (A hedged Beam sketch of a Pub/Sub-to-BigQuery pipeline appears after the slide list.)
  19. Why analytics on Google Cloud Platform? The integrated, open way to ingest, process, and analyze data that is also easy to adopt, scale, and manage: (1) simpler operation via serverless infrastructure and fully managed services; (2) faster development with a single code base for batch & streaming with Apache Beam; (3) lower cost with efficient scheduling, fast auto-scaling, and granular billing; (4) easier to get started, with hybrid deployments via open, standard APIs.
  20. Simplify management and operations. All resources are provisioned automatically for nearly limitless scale: ● Ingest data from anywhere to anywhere at up to 100 GB/s with consistent performance ● Data processing worker nodes auto-scale for maximum utilization, with dynamic rebalancing ● Rely on encryption everywhere, policy-based access control, and HIPAA compliance ● End-to-end monitoring and alerting help troubleshoot pipelines while they're running
  21. Mix-and-match GCP's native services with open source. Diagram: ingest with Cloud Pub/Sub or Apache Kafka (on Container Engine); transform with Cloud Dataflow (Apache Beam) or Apache Spark (on Cloud Dataproc); analyze in BigQuery (data warehouse), Cloud Bigtable (HBase API), and Cloud ML; data from IoT, mobile, web, and endpoint clients, plus transactional & device data and databases, reaches data consumers through Data Studio and 3rd-party BI tools.
  22. Monitor: metrics collection for data consumers with Telegraf, Sensu, collectd, and Stackdriver.
  23. IoT Pipeline Example
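
Sketch: publishing to Cloud Pub/Sub (slide 6). A minimal, hedged example of the publisher side using the Java client library (google-cloud-pubsub); the project name, topic name, and JSON payload are placeholders, not part of the original deck.

    import com.google.cloud.pubsub.v1.Publisher;
    import com.google.protobuf.ByteString;
    import com.google.pubsub.v1.ProjectTopicName;
    import com.google.pubsub.v1.PubsubMessage;

    public class PublishExample {
      public static void main(String[] args) throws Exception {
        // Placeholder project and topic; a subscription on this topic feeds the pipeline.
        ProjectTopicName topic = ProjectTopicName.of("my-project", "device-events");
        Publisher publisher = Publisher.newBuilder(topic).build();
        try {
          PubsubMessage message = PubsubMessage.newBuilder()
              .setData(ByteString.copyFromUtf8("{\"deviceId\":\"sensor-1\",\"value\":42}"))
              .build();
          // publish() is asynchronous; get() waits for the server-assigned message ID.
          String messageId = publisher.publish(message).get();
          System.out.println("Published message " + messageId);
        } finally {
          publisher.shutdown();
        }
      }
    }

Subscribers then receive these messages through a subscription on the topic, as in the slide 6 diagram.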
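
Sketch: a Pub/Sub-to-BigQuery pipeline (slides 17-18). The Google-provided template handles this for you; the hedged sketch below only illustrates what such a pipeline can look like when written directly against the Apache Beam Java SDK. It is not the template's source, and the topic, table, and field names are placeholders.

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptor;

    public class PubsubToBigQuerySketch {
      public static void main(String[] args) {
        // Run with a streaming-capable runner, e.g. --runner=DataflowRunner --streaming.
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadFromPubsub",
                PubsubIO.readStrings().fromTopic("projects/my-project/topics/device-events"))
         // Wrap each raw JSON string in a TableRow; a real pipeline would parse fields here.
         .apply("WrapAsTableRow",
                MapElements.into(TypeDescriptor.of(TableRow.class))
                    .via(json -> new TableRow().set("payload", json)))
         .setCoder(TableRowJsonCoder.of())
         .apply("WriteToBigQuery",
                BigQueryIO.writeTableRows()
                    .to("my-project:telemetry.raw_events")  // table is assumed to already exist
                    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        p.run();
      }
    }

In practice, launching the ready-made template from the Cloud Console or the REST API (slide 18) avoids writing or maintaining this code at all.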

Editor's Notes

  1. This tradeoff gives rise to the Lambda architecture. In the Lambda model we create a data pipeline that handles streaming data, perhaps computing results an event at a time. We also create a batch pipeline that handles the data once it is complete, in order to true up the results we got from the incomplete data with results from the complete data. The problems: we potentially wait a long time for correct results, and, as in this picture, we usually end up with two different codebases to process what is really the same data.
  2. When the job starts, Dataflow automatically optimizes the pipeline, fusing some operations and breaking others apart. This optimization somewhat resembles what database execution engines do when you provide SQL that they turn into a physical execution plan. Dataflow might choose to fuse operations together in order to avoid costly processing or I/O.
  3. While the job runs, Dataflow monitors the throughput and can automatically scale the worker count up in response to spikes and down when workers are no longer needed.
  4. Dataflow monitors the execution time of tasks within the job and automatically rebalances work across workers. This rebalancing ensures that stragglers or skewed data do not cause your job to run longer. The picture shows two examples. On the left, a job without rebalancing: stragglers take longer to complete their tasks, perhaps because a few workers are misbehaving, so a handful of thin lines reach up to the top while the majority finish earlier. With work rebalancing, work is redistributed from the stragglers to other nodes, and the pipeline completes dramatically faster.
  5. What does autoscaling do for you? Take an example where you have a pipeline that starts with 3 workers. Everything is looking good, but some time later the original completion-time estimate is updated from 10 minutes to 3 days! Dataflow decides that it would like to use 100 workers to complete the job faster. So what does it do? Dataflow creates more workers, then takes the existing work and redistributes it across the nodes automatically. (A hedged sketch of the relevant worker-scaling options follows these notes.)
  6. So here is an example of a very simple pipeline. We read from a text file that we specify on the command line, apply a CountWords transform to, you guessed it, count the words, then format the counts and write them to a file we specify on the command line.
  7. A more complex example that uses Pub/Sub to get its data and similarly counts things. One distinction here is how easy it is to count things within a window. The full source for this example is at https://github.com/googlecodelabs/cloud-dataflow-nyc-taxi-tycoon/blob/master/dataflow/src/main/java/com/google/codelabs/dataflow/CountRides.java
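
As a companion to notes 3 and 5 above, here is a hedged sketch of how worker autoscaling can be bounded when launching a Beam pipeline on the Dataflow runner; the 100-worker cap mirrors the example in the notes and is only a placeholder, not a value from the deck.

    import org.apache.beam.runners.dataflow.DataflowRunner;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class AutoscalingOptionsSketch {
      public static void main(String[] args) {
        // The usual --project, --region, and --gcpTempLocation flags are still required to run on Dataflow.
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
        options.setRunner(DataflowRunner.class);
        // Throughput-based autoscaling, capped at 100 workers (equivalent command-line
        // flags: --autoscalingAlgorithm=THROUGHPUT_BASED --maxNumWorkers=100).
        options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
        options.setMaxNumWorkers(100);

        Pipeline p = Pipeline.create(options);
        // Attach transforms as in slides 15-16, then run; Dataflow scales the worker
        // pool within the configured bound and rebalances work as described above.
        p.run();
      }
    }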