
Building Real-Time Data Analytics Applications on AWS - September 2016 Webinar Series

Evolving your analytics from batch processing to real-time processing can have a major impact on your business. However, designing and maintaining applications that continuously collect and analyze large volumes of data every hour can be extremely difficult without the right tools. With Amazon Kinesis you can easily ingest streaming data without building any complex pipelines. In this webinar, we show you how to build your first real-time processing application on AWS, using a combination of managed services such as Amazon Kinesis and open-source tools such as Apache Spark and Zeppelin. First, you will learn how to push randomly generated sensor event data into Kinesis. Once the data is available in Kinesis, we consume and process it using Spark Streaming. Finally, we build data visualizations on this data using Zeppelin, an open-source GUI tool that creates interactive notebooks for data exploration with Spark and can be installed on Amazon EMR with a single click.

Learning Objectives:
• Understand the basics of ingesting streaming data with Amazon Kinesis Streams
• Create a real-time ingestion application
• Work with Apache Zeppelin and Spark on Amazon EMR to process Kinesis Streams

Who Should Attend:
• Developers, Data Analysts, Data Engineers, Data Scientists and Architects

  1. Building Real-Time Data Analytics Applications on AWS: build powerful applications easily. Keith Steward, Ph.D., Specialist (EMR) Solutions Architect. September 29, 2016. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. Agenda
     1. Introduction to Amazon EMR and Kinesis
     2. Real-Time Processing Application Demo
     3. Best Practices
  3. Why Amazon EMR?
     • Easy to use: launch a cluster in minutes
     • Low cost: pay an hourly rate
     • Elastic: easily add or remove capacity
     • Reliable: spend less time monitoring
     • Secure: manage firewalls
     • Flexible: customize the cluster
  4. The Amazon EMR service
     • Storage: Amazon S3 (EMRFS), HDFS
     • Cluster resource management: YARN
     • Processing frameworks: MapReduce (batch), Tez (interactive), Spark (in-memory)
     • Applications: Hive, Pig, Spark SQL/Streaming/ML, Mahout, Sqoop, HBase/Phoenix, Presto
     • Tooling: Hue (SQL interface/metastore management), Zeppelin (interactive notebook), Ganglia (monitoring), HiveServer2/Spark Thrift Server (JDBC/ODBC)
  5. Options to submit jobs to EMR
     • Amazon EMR Step API: submit a Spark application to the cluster
     • Airflow, Luigi, or other schedulers on EC2, or AWS Data Pipeline: create pipelines to schedule job submission or build complex workflows
     • AWS Lambda: submit applications to the EMR Step API or directly to Spark on your cluster (see the sketch below)
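     For illustration, a minimal sketch of the Lambda route, using boto3 to call the Step API; the cluster ID, bucket, and script path are placeholders:

        import boto3

        emr = boto3.client("emr")

        def handler(event, context):
            # Submit a Spark application to a running EMR cluster
            # through the Step API via spark-submit.
            resp = emr.add_job_flow_steps(
                JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
                Steps=[{
                    "Name": "spark-app",
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": ["spark-submit", "--deploy-mode", "cluster",
                                 "s3://my-bucket/jobs/my_spark_job.py"],
                    },
                }],
            )
            return resp["StepIds"]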
  6. Many storage layers to choose from: in addition to the EMR File System (EMRFS) on Amazon S3, Amazon EMR works with Amazon DynamoDB, Amazon RDS, Amazon Kinesis, and Amazon Redshift.
  7. EMR 5.0 - Applications
  8. Apache Spark 2.0
  9. Quick introduction to Spark
     [Diagram: a DAG of RDD transformations (map, join, filter, groupBy) split into stages, with cached partitions marked]
     • Massively parallel
     • Uses DAGs instead of map-reduce for execution
     • Minimizes I/O by storing data in DataFrames in memory
     • Partitioning-aware to avoid network-intensive shuffles
  10. Spark components to match your use case
  11. Spark 2.0 – Performance Enhancements
     • Second-generation Tungsten engine
     • Whole-stage code generation to create optimized bytecode at runtime
     • Improvements to the Catalyst optimizer for query performance
     • New vectorized Parquet decoder to increase throughput
  12. Datasets & DataFrames (Spark 2.0)
     Datasets:
     • Distributed collection of data
     • Strong typing; can use lambda functions
     • Object-oriented operations (similar to the RDD API)
     • Optimized encoders for better performance; minimize serialization/deserialization overhead
     • Compile-time type safety for robust applications
     DataFrames:
     • A Dataset organized into named columns
     • Represented as a Dataset of rows
  13. Spark SQL (Spark 2.0)
     • SparkSession replaces the old SQLContext and HiveContext
     • Seamlessly mix SQL with Spark programs (a short example follows)
     • ANSI SQL parser and subquery support
     • HiveQL compatibility; can directly use tables in the Hive metastore
     • Connect through JDBC/ODBC using the Spark Thrift Server
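     A minimal PySpark sketch of the SparkSession entry point; the sensor_events table is hypothetical:

        from pyspark.sql import SparkSession

        # One entry point replaces SQLContext and HiveContext.
        spark = (SparkSession.builder
                 .appName("spark-sql-demo")
                 .enableHiveSupport()  # use tables in the Hive metastore
                 .getOrCreate())

        # Mix SQL with DataFrame operations.
        df = spark.sql("SELECT device_id, AVG(temperature) AS avg_temp "
                       "FROM sensor_events GROUP BY device_id")
        df.show()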
  14. Spark 2.0 – Structured Streaming
     • Structured Streaming API: an extension of the DataFrame/Dataset API (instead of DStreams)
     • SparkSession is the new entry point for streaming
     • Unifies processing of static and streaming datasets, abstracting away the velocity of the data (see the sketch below)
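     A sketch of the idea, assuming CSV files landing in an S3 prefix as the source (Kinesis is not a built-in Structured Streaming source in Spark 2.0; the schema and path are illustrative):

        from pyspark.sql import SparkSession
        from pyspark.sql.types import (StructType, StructField, StringType,
                                       DoubleType, TimestampType)

        spark = SparkSession.builder.appName("structured-streaming-demo").getOrCreate()

        schema = StructType([
            StructField("device_id", StringType()),
            StructField("temperature", DoubleType()),
            StructField("timestamp", TimestampType()),
        ])

        # readStream treats the directory as an unbounded table; the same
        # DataFrame operations would work on a static read of the path.
        events = spark.readStream.schema(schema).csv("s3://my-bucket/incoming/")
        avg_temp = events.groupBy("device_id").avg("temperature")

        query = (avg_temp.writeStream
                 .outputMode("complete")
                 .format("console")
                 .start())
        query.awaitTermination()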
  15. Configuring Executors – Dynamic Allocation
     • Optimal resource utilization
     • YARN dynamically adjusts the number of executors based on the resource needs of the Spark application
     • Spark uses the executor memory and executor cores settings in the configuration for each executor
     • Amazon EMR uses dynamic allocation by default, and calculates the default executor size from the instance family of your core instance group (an example configuration follows)
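     Executor settings are supplied as EMR configuration classifications, passed as Configurations to boto3 run_job_flow or --configurations on the CLI. A sketch of two common approaches; the values are illustrative:

        # Let EMR size executors to fill each node ...
        configurations = [
            {"Classification": "spark",
             "Properties": {"maximizeResourceAllocation": "true"}},
        ]

        # ... or set executor size yourself in spark-defaults.
        configurations = [
            {"Classification": "spark-defaults",
             "Properties": {"spark.executor.memory": "4g",
                            "spark.executor.cores": "2"}},
        ]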
  16. Choose your instance types: try different configurations to find your optimal architecture.
     • CPU: c3 family, c4 family, cc1.4xlarge, cc2.8xlarge
     • Memory: m2 family, r3 family
     • Disk/IO: d2 family, i2 family
     • General: m4 family, m3 family, m1 family
     Match the family to the workload: batch processing, machine learning, Spark and interactive analysis, or large HDFS.
  17. Zeppelin 0.6.1 – New Features
     • Shiro authentication
     • Notebook authorization
     Configuration to save notebooks in S3 (zeppelin-env.sh):
        export ZEPPELIN_NOTEBOOK_S3_BUCKET=bucket_name
        export ZEPPELIN_NOTEBOOK_S3_USER=username        # optional
        export ZEPPELIN_NOTEBOOK_S3_KMS_KEY_ID=kms-key-id
  18. Amazon Kinesis: streaming data made easy. Services that make it easy to capture, deliver, and process streams on AWS:
     • Amazon Kinesis Streams: build your own custom applications that process or analyze streaming data
     • Amazon Kinesis Firehose: easily load massive volumes of streaming data into Amazon S3 and Amazon Redshift
     • Amazon Kinesis Analytics: easily analyze data streams using standard SQL queries
  19. Amazon Kinesis Concepts
     • Streams: ordered sequences of data records
     • Data records: the unit of data stored in a Kinesis stream
     • Producers: anything that puts records into a Kinesis stream, e.g. via the Kinesis Producer Library (KPL)
     • Consumers: endpoints that get records from a Kinesis stream and process them, e.g. via the Kinesis Client Library (KCL); a basic sketch of both sides follows
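     A minimal boto3 sketch of both roles against the demo stream (the shard ID and payload are illustrative; the KCL automates shard discovery, checkpointing, and load balancing, which raw get_records does not):

        import boto3

        kinesis = boto3.client("kinesis", region_name="us-east-1")

        # Producer side: put one record, keyed by device ID.
        kinesis.put_record(StreamName="spark-demo",
                           Data=b"device-42,21.5,1475190000",
                           PartitionKey="device-42")

        # Consumer side: read records from a single shard.
        it = kinesis.get_shard_iterator(
            StreamName="spark-demo",
            ShardId="shardId-000000000000",
            ShardIteratorType="TRIM_HORIZON")["ShardIterator"]
        for rec in kinesis.get_records(ShardIterator=it, Limit=100)["Records"]:
            print(rec["Data"])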
  20. Amazon Kinesis Streams features
     • Kinesis Client Library in Python, Node.js, Ruby, and more
     • PutRecords API: up to 500 records or 5 MB of payload per request (a batching sketch follows)
     • Kinesis Producer Library to simplify producer development
     • Server-side timestamps
     • Individual record payloads of up to 1 MB
     • Low end-to-end propagation delay
     • Stream retention defaults to 24 hours, but can be increased to 7 days
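     A sketch of batching with PutRecords while respecting the 500-record limit; the helper and record shape are made up for illustration:

        import boto3

        kinesis = boto3.client("kinesis", region_name="us-east-1")

        def put_batch(records):
            # records: list of (partition_key, bytes) tuples.
            for i in range(0, len(records), 500):  # API cap: 500 records/call
                chunk = records[i:i + 500]
                kinesis.put_records(
                    StreamName="spark-demo",
                    Records=[{"PartitionKey": k, "Data": d} for k, d in chunk])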
  21. Real Time Processing Application and Demo
  22. Real Time Processing Application pipeline: Produce (Kinesis Producer built with the KPL) → Collect (Amazon Kinesis) → Process (Spark on Amazon EMR) → Analyze (Apache Zeppelin)
  23. Real Time Processing Application – 5 Steps (http://bit.ly/realtime-aws)
     1. Create a Kinesis stream
     2. Create an Amazon EMR cluster
     3. Start the Kinesis producer (on Amazon EC2)
     4. Ad-hoc analytics: analyze data using Apache Zeppelin on Amazon EMR with Spark Streaming
     5. Continuous streaming: Spark Streaming application submitted to the Amazon EMR cluster (the Kinesis consumer) using the Step API
  24. Real Time Processing Application
     1. Create the Kinesis stream:
        $ aws kinesis create-stream --stream-name spark-demo --shard-count 2
  25. Real Time Processing Application
     2. Create the Amazon EMR cluster with Spark and Zeppelin:
        $ aws emr create-cluster --release-label emr-5.0.0 \
            --applications Name=Zeppelin Name=Spark Name=Hadoop \
            --enable-debugging \
            --ec2-attributes KeyName=test-key-1,AvailabilityZone=us-east-1d \
            --log-uri s3://kinesis-spark-streaming1/logs \
            --instance-groups \
              Name=Master,InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
              Name=Core,InstanceGroupType=CORE,InstanceType=m3.xlarge,InstanceCount=2 \
              Name=Task,InstanceGroupType=TASK,InstanceType=m3.xlarge,InstanceCount=2 \
            --name "kinesis-processor"
  26. Real Time Processing Application
     3. Start the Kinesis producer (using the Kinesis Producer Library):
        a. On a host (e.g. EC2), download the JAR and run the producer:
           $ wget https://s3.amazonaws.com/chayel-emr/KinesisProducer.jar
           $ java -jar KinesisProducer.jar
        b. A continuous stream of data records is fed into Kinesis in CSV format:
           device_id,temperature,timestamp
        GitHub link: https://github.com/manjeetchayel/emr-kpl-demo
        (A Python sketch of an equivalent producer follows.)
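     If you would rather not run the Java producer, a rough boto3 equivalent that emits the same device_id,temperature,timestamp CSV records (device IDs, value ranges, and rate are made up):

        import boto3, random, time

        kinesis = boto3.client("kinesis", region_name="us-east-1")

        while True:
            device_id = "device-{}".format(random.randint(1, 10))
            record = "{},{:.1f},{}".format(
                device_id, random.uniform(15.0, 35.0), int(time.time()))
            kinesis.put_record(StreamName="spark-demo",
                               Data=record.encode("utf-8"),
                               PartitionKey=device_id)
            time.sleep(0.1)  # roughly 10 records/second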
  27. Real Time Processing Application
     4. Use Zeppelin on Amazon EMR:
        • Configure the Spark interpreter to use the org.apache.spark:spark-streaming-kinesis-asl_2.11:2.0.0 dependency
        • Import the notebook from https://raw.githubusercontent.com/manjeetchayel/aws-big-data-blog/master/aws-blog-realtime-analytics-using-zeppelin/Spark_Streaming.json
        • Run the Spark code blocks to generate real-time analytics (a PySpark sketch of the consumption pattern follows)
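     The imported notebook carries the actual demo code; as a rough PySpark sketch of the same consumption pattern with the kinesis-asl library (app name, batch interval, and aggregation are arbitrary choices):

        from pyspark import SparkContext
        from pyspark.streaming import StreamingContext
        from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

        sc = SparkContext(appName="kinesis-zeppelin-demo")
        ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

        # The app name becomes the KCL DynamoDB checkpoint table, so keep it unique.
        lines = KinesisUtils.createStream(
            ssc, "spark-demo-app", "spark-demo",
            "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
            InitialPositionInStream.LATEST, checkpointInterval=10)

        # Parse the CSV records and count readings per device in each batch.
        counts = (lines.map(lambda line: line.split(","))
                       .map(lambda fields: (fields[0], 1))
                       .reduceByKey(lambda a, b: a + b))
        counts.pprint()

        ssc.start()
        ssc.awaitTermination()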
  28. Best Practices
  29. Best Practices – Kinesis
     • Use the Kinesis Producer Library
     • Increase capacity utilization with aggregation
     • Set up CloudWatch alarms
     • Use randomized partition keys to distribute puts across shards
     • Deal with unsuccessful records (a retry sketch follows this list)
     • Use PutRecords for higher throughput
     • Use the Kinesis Scaling Utility to scale streams
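     PutRecords reports per-record failures instead of raising an error, so retry only the failed subset. A sketch, reusing the demo stream name; the record shape is illustrative:

        import boto3, uuid

        kinesis = boto3.client("kinesis", region_name="us-east-1")

        def put_with_retries(records, max_attempts=3):
            # records: list of (partition_key, bytes) tuples.
            entries = [{"PartitionKey": k, "Data": d} for k, d in records]
            for _ in range(max_attempts):
                resp = kinesis.put_records(StreamName="spark-demo", Records=entries)
                if resp["FailedRecordCount"] == 0:
                    return
                # Keep only the entries whose result carries an ErrorCode.
                entries = [e for e, r in zip(entries, resp["Records"])
                           if "ErrorCode" in r]
            raise RuntimeError("records still failing after retries")

        # Randomized partition keys spread puts evenly across shards.
        put_with_retries([(uuid.uuid4().hex, b"device-1,20.4,1475190000")])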
  30. Best Practices – Amazon EMR + Spark Streaming
     • Use the Step API to submit long-running applications (see the sketch below)
     • Run Spark applications in cluster mode
     • Use maximizeResourceAllocation
     • Use a unique Spark application name for the KCL DynamoDB metadata
     • Set up CloudWatch alarms
     • Use IAM roles
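     A sketch combining the first two points: submit the streaming job through the Step API in cluster mode, with a unique name the job can use for its KCL checkpoint table (cluster ID and S3 path are placeholders):

        import boto3, uuid

        emr = boto3.client("emr", region_name="us-east-1")

        app_name = "kinesis-consumer-{}".format(uuid.uuid4().hex[:8])
        emr.add_job_flow_steps(
            JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
            Steps=[{
                "Name": app_name,
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster",
                             "--name", app_name,
                             "s3://my-bucket/jobs/streaming_job.py",
                             app_name],  # hand the name to the job as its KCL app name
                },
            }],
        )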
  31. Thank you!
     stewardk@amazon.com
     http://aws.amazon.com/emr
     http://blogs.aws.amazon.com/bigdata
