ABD330_Combining Batch and Stream Processing to Get the Best of Both Worlds

Today, many architects and developers are looking to build solutions that integrate batch and real-time data processing, and deliver the best of both approaches. Lambda architecture (not to be confused with the AWS Lambda service) is a design pattern that leverages both batch and real-time processing within a single solution to meet the latency, accuracy, and throughput requirements of big data use cases. Come join us for a discussion on how to implement Lambda architecture (batch, speed, and serving layers) and best practices for data processing, loading, and performance tuning.


  1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Combining Batch and Stream Processing to Get the Best of Both Worlds. Rajeev Srinivasan, AWS Solution Architect. Ujjwal Ratan, AWS Solution Architect. ABD330, November 27, 2017.
  2. What Is This Chalk Talk About? Problem Statement: Today, businesses are looking for ways to process large-scale batch data and high-velocity streaming data simultaneously in the cloud, using a proven architecture that meets their latency, accuracy, and throughput requirements. “Lambda Architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch-processing and stream-processing methods.” (Wikipedia) Lambda Architecture is not to be confused with AWS Lambda.
  3. Lambda Architecture Block Diagram
  4. Lambda Architecture – Flow Diagram. [Diagram labels: Batch Layer, Speed Layer, Serving Layer; New Data, Master Dataset, Batch Recompute, Pre-Compute View, Batch View; Stream Processing, Real-Time Data, Real-Time Increment, Incremental View, Real-Time View; Partial aggregates; Merged Query]
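The batch, speed, and serving layers named on this slide can be sketched in plain Python. This is a minimal illustration, not the presenters' implementation: the event shape (`key`, `value`), the function names, and the in-memory lists standing in for the master dataset and the real-time data are all assumptions made for the example.

```python
from collections import Counter


def batch_recompute(master_dataset):
    """Batch layer: rebuild the batch view from the full master dataset."""
    view = Counter()
    for event in master_dataset:
        view[event["key"]] += event["value"]
    return view


def realtime_increment(realtime_view, event):
    """Speed layer: fold one newly arrived event into the real-time view."""
    realtime_view[event["key"]] += event["value"]
    return realtime_view


def merged_query(batch_view, realtime_view, key):
    """Serving layer: combine the two partial aggregates for one key."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)


# Hypothetical data: events already in the master dataset ...
master_dataset = [
    {"key": "clicks", "value": 3},
    {"key": "clicks", "value": 2},
    {"key": "views", "value": 10},
]
batch_view = batch_recompute(master_dataset)

# ... and events that arrived after the last batch run.
realtime_view = Counter()
for event in [{"key": "clicks", "value": 1}, {"key": "signups", "value": 1}]:
    realtime_increment(realtime_view, event)

print(merged_query(batch_view, realtime_view, "clicks"))  # 6
```

Sums merge cleanly because addition is associative. An aggregate such as an average would need each layer to keep a sum and a count as its partial aggregate, with the division done at query time.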
  5. Lambda Architecture on AWS – Batch Layer. [Diagram labels: Data Source, Landing S3 bucket, AWS Glue Crawler, AWS Glue Data Catalog, AWS Glue ETL, Amazon EMR, Batch View S3 bucket, Amazon Athena, Amazon QuickSight, Serverless Batch Processing]
  6. Lambda Architecture on AWS – Speed Layer. [Diagram labels: Data Source, Amazon Kinesis Stream, Amazon Kinesis Analytics, AWS Lambda, Amazon Kinesis Firehose (Incremental stream), Incremental View S3 bucket, Amazon EMR, Serverless Stream Processing, Visualization]
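The slide places an AWS Lambda function in the speed layer but does not say what it does. As a hedged sketch, the handler below shows one thing such a function might do: decode the records Lambda receives from a Kinesis stream and produce per-key counts for the invocation. The `Records[].kinesis.data` base64 envelope is the shape Lambda uses for Kinesis event sources; the JSON payload, its `key` field, and the helper `_kinesis_record` are assumptions made for the example.

```python
import base64
import json
from collections import Counter


def handler(event, context):
    """Decode Kinesis records and return per-key counts for this batch."""
    counts = Counter()
    for record in event.get("Records", []):
        # Kinesis payloads reach Lambda base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        message = json.loads(payload)
        counts[message["key"]] += 1  # "key" is an assumed payload field
    return dict(counts)


def _kinesis_record(obj):
    """Hypothetical helper: wrap a dict the way a Kinesis event record does."""
    data = base64.b64encode(json.dumps(obj).encode("utf-8")).decode("ascii")
    return {"kinesis": {"data": data}}


# Local usage with a hand-built event; no AWS access is needed.
sample_event = {
    "Records": [
        _kinesis_record({"key": "clicks"}),
        _kinesis_record({"key": "views"}),
        _kinesis_record({"key": "clicks"}),
    ]
}
print(handler(sample_event, None))
```

In a real deployment the counts would be forwarded to the incremental stream instead of being returned; that step is left out here because the slide does not describe it.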
  7. Lambda Architecture on AWS. [Diagram labels: Data Source; Batch Layer, Speed Layer, Serving Layer; Landing S3 bucket, AWS Glue Crawler, AWS Glue Catalog, AWS Glue ETL, Amazon EMR, Batch View S3 bucket; Kinesis Stream, Kinesis Analytics, AWS Lambda, Kinesis Firehose (Incremental stream), Incremental View S3 bucket; Amazon Athena, Merged View S3 bucket; Visualization]
  8. Lambda Architecture on AWS. [Diagram labels: Data Source; Batch Layer, Speed Layer, Serving Layer; Landing S3 bucket, AWS Glue Crawler, AWS Glue Catalog, AWS Glue ETL, Amazon EMR, Batch View S3 bucket; Kinesis Stream; Merged View S3 bucket; Visualization]
  9. Combining Stream and Batch Processing Query – Spark

        import . . .

        // Querying batch data from Amazon S3 using Spark SQL
        val dataFromS3 = sqlContext.read.json("s3://")
        dataFromS3.registerTempTable("batchData")
        . . .

        // Querying data from the Kinesis stream using Spark Streaming
        val ssc = new StreamingContext(sc, …)
        val kinesisStreams = (0 until numStreams).map { i =>
          KinesisUtils.createStream(ssc, appName, streamName, endpointUrl, regionName,
            InitialPositionInStream.LATEST, kinesisCheckpointInterval,
            StorageLevel.MEMORY_AND_DISK_2)
        }
        // createStream yields byte arrays, so decode them before treating records as strings
        val unionStreams = ssc.union(kinesisStreams).map(bytes => new String(bytes, "UTF-8"))

        unionStreams.foreachRDD((rdd: RDD[String]) => {
          . . .
          rdd.toDF().registerTempTable("streamData")
          // Merge query using Spark SQL (the join now uses the declared alias s)
          val mergedResult = sqlContext.sql(
            "SELECT ... FROM streamData s JOIN batchData b ON s.data = b.data ...")
          mergedResult.write.mode(SaveMode.Overwrite).parquet("s3://...")
        })
        ssc.start()
  10. Security Options
      Kinesis Stream and Kinesis Firehose
        • Server-side encryption with an AWS KMS-managed key (SSE-KMS)
        • Kinesis Stream also supports client-side encryption
      AWS Lambda
        • KMS/customer key to encrypt data both at rest and in transit
      Amazon Athena
        • Server-side encryption with an Amazon S3-managed key (SSE-S3)
        • Server-side encryption with an AWS Key Management Service (AWS KMS)-managed key (SSE-KMS)
        • CREATE TABLE statement with TBLPROPERTIES 'has_encrypted_data'='true'
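The Athena option on this slide, a CREATE TABLE statement carrying the `has_encrypted_data` table property, could be generated along the following lines. This is a sketch under stated assumptions: the function name, the table name, the column list, the Parquet storage format, and the `s3://example-bucket/...` location are all hypothetical and chosen only for illustration. The function only builds the DDL string; it does not submit anything to Athena.

```python
def athena_encrypted_table_ddl(table, columns, location):
    """Build an Athena CREATE TABLE statement flagged as reading encrypted data.

    `columns` is a list of (name, type) pairs; all arguments are illustrative.
    """
    column_sql = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {column_sql}\n)\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{location}'\n"
        "TBLPROPERTIES ('has_encrypted_data'='true')"
    )


# Hypothetical table over a batch-view bucket.
ddl = athena_encrypted_table_ddl(
    "batch_view",
    [("event_key", "string"), ("total", "bigint")],
    "s3://example-bucket/batch-view/",
)
print(ddl)
```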
  11. Whiteboarding Session
  12. Thank you! Please complete your survey!
