Real-time Analytics with Open-Source

Working with big volumes of data is a complicated task, and it is even harder if you have to do everything in real time and figure it all out yourself. Over the past decades, many open-source projects have helped solve problems across the data analytics lifecycle: ingestion, storage, processing and visualisation of data. This session uses practical examples to discuss architectural best practices and lessons learned when solving real-time analytics and data visualisation problems at scale with open-source software on Amazon Web Services. It also includes a demo that uses source code from AWS Labs to visualise live data streams at scale.

Olivier Klein, Solutions Architect, Amazon Web Services, Greater China

Published in: Technology

  1. 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Olivier Klein 奧樂凱 Solutions Architect, Greater China April 2016 Real-time Analytics with Open-Source and AWS
  2. 2. “We see our customers as invited guests to a party, and we are the hosts. It’s our job to make every important aspect of the customer experience a little bit better.” Jeff Bezos CEO, Amazon.com
  3. 3. Data analysis for a better customer experience • Your business creates and stores data and logs all the time • Data points and logs allow you to understand individual customer experience and improve it • Analysis of logs and trails helps gain insights
  4. 4. How does Open-Source fit into Data Analytics?
  5. 5. Most Notably: Apache Hadoop • Open-Source Project for distributed storage and distributed processing of very large data sets • Scales linearly on commodity hardware compute nodes • Has an entire ecosystem built around it for various purposes
  6. 6. Hadoop Ecosystem • Accumulo – cell-based access control NoSQL • Avro – data serialization system • Cascading – alternative language APIs on MR • Cassandra – multi-master NoSQL DB • Chukwa – data collection system at scale • Flume – collecting, aggregating, moving logs • Giraph – iterative graph processing system • HBase – large table NoSQL DB • HDFS – distributed file system • Hive – SQL on MapReduce Data Warehouse • Mahout – scalable machine learning library • MapReduce – parallel processing on YARN • Nutch – web crawler software • Pig – high-level scripting on MapReduce • R – statistical computing and graphics • Spark – general compute engine on YARN • Sqoop – transferring data to/from RDBMS • Tez – data-flow programming on YARN • Thrift – build scalable cross-language services • ZooKeeper – coordination
  7. 7. Tell me more about Big Data!
  8. 8. Ever Increasing Amount of Data Volume Velocity Variety
  9. 9. Generation Collection & Storage Analytics & Computation Collaboration & Sharing
  10. 10. More devices Lower cost Higher throughput Generation Collection & Storage Analytics & Computation Collaboration & Sharing
  11. 11. Highly constrained More devices Lower cost Higher throughput Generation Collection & Storage Analytics & Computation Collaboration & Sharing
  12. 12. Amazon Web Services helps remove constraints
  13. 13. Big Data: • Potentially massive datasets • Iterative, experimental style of data manipulation and analysis • Frequently not a steady-state workload; peaks and valleys • Data is a combination of structured and unstructured data in many formats AWS Cloud: • Virtually unlimited capacity • Iterative, experimental usage cost through on-demand infrastructure • Fully scalable infrastructure for highly variable workloads • Tools & Services for managing structured, unstructured and stream data
  14. 14. Let’s simplify Big Data with AWS!
  15. 15. Three Types of Data Analytics Retrospective analysis and reporting Here-and-now real-time processing and dashboards Predictions to enable smart apps
  16. 16. Ingest Store Process Visualize Data Answers Time Simplified Big Data Pipeline
  17. 17. Amazon S3 Amazon DynamoDB Amazon RDS Ingest Store Process Visualize Amazon Mobile Analytics Amazon EMR Amazon Redshift Amazon Lambda Amazon Kinesis Firehose Amazon Machine Learning Amazon EC2 Amazon Glacier Amazon Elasticsearch Service Amazon Kinesis Analytics Amazon QuickSight AWS Import/Export Snowball Amazon Kinesis
  18. 18. Amazon S3 Amazon DynamoDB Amazon RDS Ingest Store Process Visualize Amazon Mobile Analytics Amazon EMR Amazon Redshift Amazon Lambda Amazon Kinesis Firehose Amazon Machine Learning Amazon EC2 Amazon Glacier Amazon Elasticsearch Service Amazon Kinesis Analytics Amazon QuickSight AWS Import/Export Snowball Amazon Kinesis
  19. 19. Fluentd: Open Source Log Collection https://github.com/fluent/fluentd/ • Fluentd is an open source data collector to unify data collection and consumption • Integration into many data sources (App Logs, Syslogs, Twitter etc.) • Direct integration into AWS such as S3 & Kinesis

          <source>
            type tail
            format apache2
            path /var/log/apache2/access_log
            tag s3.apache.access
          </source>
          <match s3.*.*>
            type s3
            s3_bucket myweblogs
            path logs/
          </match>
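To make the `tail` plus `format apache2` step more concrete, here is a minimal Python sketch of what such a collector does with each new access-log line: parse it into named fields and attach the tag before forwarding. The regular expression, the `parse_access_line` helper and the sample log line are illustrative assumptions, not Fluentd's actual implementation.

```python
import re
from typing import Optional

# Hypothetical pattern for the Apache "combined" log format (an approximation,
# not the exact expression Fluentd uses for `format apache2`).
APACHE_COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<code>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?$'
)

def parse_access_line(line: str, tag: str = "s3.apache.access") -> Optional[dict]:
    """Turn one access-log line into a tagged record, or None if it does not match."""
    match = APACHE_COMBINED.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["code"] = int(record["code"])
    # A size of "-" means no body was sent.
    record["size"] = None if record["size"] == "-" else int(record["size"])
    return {"tag": tag, "record": record}

# Made-up sample line for illustration only.
sample_line = (
    '192.168.0.1 - - [28/Feb/2013:12:00:00 +0900] '
    '"GET / HTTP/1.1" 200 777 "-" "Opera/12.0"'
)
parsed = parse_access_line(sample_line)
```

In the configuration on the slide, records tagged `s3.apache.access` would then be picked up by the `<match s3.*.*>` block and written to the bucket.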
  20. 20. Amazon S3 Amazon DynamoDB Amazon RDS Ingest Store Process Visualize Amazon Mobile Analytics Amazon EMR Amazon Redshift Amazon Lambda Amazon Kinesis Firehose Amazon Machine Learning Amazon EC2 Amazon Glacier Amazon Elasticsearch Service Amazon Kinesis Analytics Amazon QuickSight AWS Import/Export Snowball Amazon Kinesis
  21. 21. Amazon S3 • Highly available object storage • Designed for 99.999999999% annual data durability • Replicated across 3 facilities • Virtually unlimited scale • Pay only for what you use, you don’t need to pre-provision • Allows event notifications to trigger further action Amazon S3
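As a rough worked example of what the durability figure on this slide implies, the sketch below computes the expected number of objects lost per year for a given object count. The count of 10,000,000 objects and the assumption that losses are independent are illustrative choices, not figures from the talk.

```python
from fractions import Fraction

# Durability as quoted on the slide (eleven nines), kept exact with Fraction
# to avoid floating-point rounding on such a small loss probability.
DURABILITY = Fraction("0.99999999999")

def expected_annual_loss(num_objects: int, durability: Fraction = DURABILITY) -> Fraction:
    """Expected number of objects lost per year, assuming independent losses."""
    return num_objects * (1 - durability)

# Hypothetical bucket holding ten million objects.
loss_per_year = expected_annual_loss(10_000_000)
years_per_lost_object = 1 / loss_per_year
```

Under these assumptions, a bucket of that size would be expected to lose a single object about once every 10,000 years.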
  22. 22. Amazon S3 Amazon DynamoDB Amazon RDS Ingest Store Process Visualize Amazon Mobile Analytics Amazon EMR Amazon Redshift Amazon Lambda Amazon Kinesis Firehose Amazon Machine Learning Amazon EC2 Amazon Glacier Amazon Elasticsearch Service Amazon Kinesis Analytics Amazon QuickSight AWS Import/Export Snowball Amazon Kinesis
  23. 23. Amazon DynamoDB • Schemaless Data Model • Seamless scalability • No storage or throughput limits • Consistent low latency performance • High durability and availability • Replicated across 3 facilities DynamoDB table items attributes Fully Managed NoSQL Database Service
  24. 24. 500,000 writes / second to their Amazon DynamoDB tables 200 additional servers during Superbowl 0 additional servers right after
  25. 25. Amazon S3 Amazon DynamoDB Amazon RDS Ingest Store Process Visualize Amazon Mobile Analytics Amazon EMR Amazon Redshift Amazon Lambda Amazon Kinesis Firehose Amazon Machine Learning Amazon EC2 Amazon Glacier Amazon Elasticsearch Service Amazon Kinesis Analytics Amazon QuickSight AWS Import/Export Snowball Amazon Kinesis
  26. 26. Stream in Real Time: Amazon Kinesis • Real-time data processing over large distributed streams • Elastic capacity that scales to millions of events per second • React in real time to incoming stream events • Reliable stream storage replicated across 3 facilities Amazon Kinesis
  27. 27. Kinesis for Real-Time
  28. 28. AWS Labs – Open Source Code for AWS • Code and Connectors used with Amazon Kinesis and other AWS services are Open-Source • Available under Apache License 2.0 https://github.com/awslabs
  29. 29. Amazon S3 Amazon DynamoDB Amazon RDS Ingest Store Process Visualize Amazon Mobile Analytics Amazon EMR Amazon Redshift Amazon Lambda Amazon Kinesis Firehose Amazon Machine Learning Amazon EC2 Amazon Glacier Amazon Elasticsearch Service Amazon Kinesis Analytics Amazon QuickSight AWS Import/Export Snowball Amazon Kinesis
  30. 30. Amazon Elasticsearch Service • Powerful, real-time, distributed, open-source search and analytics engine built on Apache Lucene • Full integration into AWS with IAM for security, CloudTrail for auditing and CloudWatch for monitoring • Fully managed cluster that scales for data size and throughput
  31. 31. Amazon S3 Amazon DynamoDB Amazon RDS Ingest Store Process Visualize Amazon Mobile Analytics Amazon EMR Amazon Redshift Amazon Lambda Amazon Kinesis Firehose Amazon Machine Learning Amazon EC2 Amazon Glacier Amazon Elasticsearch Service Amazon Kinesis Analytics Amazon QuickSight AWS Import/Export Snowball Amazon Kinesis
  32. 32. Amazon EMR • Amazon EMR is a fully managed Hadoop cluster • Transient and long running clusters • Direct integration into Amazon S3 and Amazon Kinesis • Easy to scale and enable burstable capacity • Integration with AWS Spot Market
  33. 33. 1 instance x 100 hours = 100 instances x 1 hour (and with Spot Pricing not only faster but also cheaper)
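The arithmetic behind this slide can be sketched in a few lines of Python. The hourly prices below are hypothetical placeholders (real On-Demand and Spot prices vary by instance type, region and time), and the sketch assumes the job parallelises perfectly across instances.

```python
# Hypothetical prices in cents per instance-hour; not real AWS prices.
ON_DEMAND_CENTS_PER_HOUR = 10
SPOT_CENTS_PER_HOUR = 3

def job_cost_cents(instances: int, hours_each: int, cents_per_hour: int) -> int:
    """Total cost of a job: instance-hours consumed times the hourly price."""
    return instances * hours_each * cents_per_hour

# Same 100 instance-hours of work, spread differently over time.
serial_cost = job_cost_cents(1, 100, ON_DEMAND_CENTS_PER_HOUR)      # done in 100 hours
parallel_cost = job_cost_cents(100, 1, ON_DEMAND_CENTS_PER_HOUR)    # done in 1 hour
parallel_spot_cost = job_cost_cents(100, 1, SPOT_CENTS_PER_HOUR)    # done in 1 hour, on Spot
```

With these placeholder prices the serial and parallel runs cost the same, while the parallel run on Spot capacity is both faster and cheaper, which is the point the slide makes.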
  34. 34. Process – Amazon EMR • Amazon EMR supports all common Hadoop Frameworks such as: • Spark, Pig, Hive, Hue, Oozie … • HBase, Presto, Impala … • Decouples storage from compute • Allows independent scaling • Direct Integration with DynamoDB and S3 Amazon S3 Amazon DynamoDB Amazon EMR
  35. 35. • FINRA regulates trading practices of brokerage firms and exchange markets to protect market integrity • Market surveillance platform stores 30 billion market events every day • Leverages Amazon S3 to store events and allow analysts to interactively query market dynamics using Amazon EMR Hive & HBase clusters with increased agility Re-Architecting Compliance Unlimited Storage Distributed Computing Interactive Market Queries Ensure compliance 30 billion market events
  36. 36. Amazon EMR integration: Hive

          CREATE TABLE call_data_records (
            start_time bigint,
            end_time bigint,
            phone_number STRING,
            carrier STRING,
            recorded_duration bigint,
            calculated_duration bigint,
            lat double,
            long double
          )
          ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
          STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
          TBLPROPERTIES("kinesis.stream.name"="MyTestStream");
  37. 37. Apache Spark • Apache Spark is an in-memory cluster computing engine that uses RDDs (Resilient Distributed Datasets) for fast processing • Often faster than MapReduce because intermediate results stay in memory instead of being written to HDFS between stages • Apache Spark Streaming can read directly from DynamoDB, S3 and a Kinesis stream
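  38. 38. Processing Amazon Kinesis streams Amazon Kinesis EMR with Spark Streaming • Counting tweets on a sliding window

          // Simplified for the slide: the real KinesisUtils.createStream and
          // countByWindow calls take additional arguments.
          KinesisUtils.createStream("twitter-stream")
            .filter(_.getText.contains("Big Data"))
            .countByWindow(Seconds(5))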
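The idea of counting matching tweets over a sliding window can be illustrated without a cluster. The following is a plain-Python sketch of the windowing logic only, not Spark code: the `count_by_window` function, the half-open window convention and the sample events are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

def count_by_window(
    events: Iterable[Tuple[float, str]],
    keyword: str,
    window_s: float,
    slide_s: float,
) -> List[Tuple[float, int]]:
    """Count events containing `keyword` in windows [end - window_s, end),
    where `end` advances by `slide_s` seconds each step."""
    matching = sorted(ts for ts, text in events if keyword in text)
    if not matching:
        return []
    counts = []
    end = slide_s
    # Keep sliding until the window start has moved past the last matching event.
    while end - window_s <= matching[-1]:
        start = end - window_s
        counts.append((end, sum(1 for ts in matching if start <= ts < end)))
        end += slide_s
    return counts

# Made-up (timestamp in seconds, tweet text) pairs.
tweets = [
    (0.5, "Big Data rocks"),
    (1.2, "hello world"),
    (3.0, "Big Data on AWS"),
    (6.5, "more Big Data"),
]
window_counts = count_by_window(tweets, "Big Data", window_s=5, slide_s=2)
```

Each result pair is the window's end time and the number of matching tweets inside it; because the window (5 s) is longer than the slide (2 s), one tweet can be counted in several consecutive windows.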
  39. 39. Amazon S3 Amazon DynamoDB Amazon RDS Ingest Store Process Visualize Amazon Mobile Analytics Amazon EMR Amazon Redshift Amazon Lambda Amazon Kinesis Firehose Amazon Machine Learning Amazon EC2 Amazon Glacier Amazon Elasticsearch Service Amazon Kinesis Analytics Amazon QuickSight AWS Import/Export Snowball Amazon Kinesis
  40. 40. React in Real-Time: AWS Lambda • Run your code in the cloud, fully managed and highly-available • Triggered through API calls or state changes in your setup (S3, DynamoDB, SNS, Kinesis) • Scales automatically to match the incoming event rate • Charged per 100ms execution time Amazon Kinesis Amazon Lambda Amazon S3 Amazon DynamoDB Amazon API Gateway Amazon SNS
  41. 41. AWS Lambda • Use AWS Lambda to clean and massage incoming data • Write code to load data sources (S3, DynamoDB) automatically in your data warehouse (e.g. Amazon Redshift) • React in real-time to incoming events in Amazon Kinesis Amazon Lambda Amazon Redshift Amazon Kinesis
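A minimal sketch of the "clean and massage incoming data" step, assuming the function is triggered by a Kinesis stream: Lambda delivers each record's payload base64-encoded under `Records[].kinesis.data`. The JSON payload shape (`user` and `text` fields), the cleaning rules and the `make_kinesis_event` test helper are hypothetical.

```python
import base64
import json

def handler(event, context):
    """Decode Kinesis records, drop malformed or empty ones, and trim the text."""
    cleaned = []
    for record in event.get("Records", []):
        raw = base64.b64decode(record["kinesis"]["data"])
        try:
            doc = json.loads(raw.decode("utf-8"))
        except ValueError:
            continue  # skip payloads that are not valid UTF-8 JSON
        if not isinstance(doc, dict):
            continue
        text = str(doc.get("text", "")).strip()
        if text:
            cleaned.append({"user": doc.get("user"), "text": text})
    return cleaned

def make_kinesis_event(payloads):
    """Build a minimal fake event for local testing (real events carry more fields)."""
    return {
        "Records": [
            {"kinesis": {"data": base64.b64encode(p).decode("ascii")}}
            for p in payloads
        ]
    }

sample_event = make_kinesis_event([
    json.dumps({"user": "alice", "text": "  Big Data on AWS  "}).encode("utf-8"),
    b"not json",
    json.dumps({"user": "bob", "text": "   "}).encode("utf-8"),
])
result = handler(sample_event, None)
```

In a real deployment the cleaned records would be forwarded to a downstream store such as Amazon Redshift or Amazon Elasticsearch Service rather than returned.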
  42. 42. Amazon S3 Amazon DynamoDB Amazon RDS Ingest Store Process Visualize Amazon Mobile Analytics Amazon EMR Amazon Redshift Amazon Lambda Amazon Kinesis Firehose Amazon Machine Learning Amazon EC2 Amazon Glacier Amazon Elasticsearch Service Amazon Kinesis Analytics Amazon QuickSight AWS Import/Export Snowball Amazon Kinesis
  43. 43. Amazon Redshift • Fully managed petabyte-scale data warehouse • Scalable number of cluster nodes • ODBC/JDBC connector for BI tools using SQL • Supports Amazon DynamoDB and Amazon S3 to load data • Less than a tenth of the cost of traditional solutions Amazon Redshift
  44. 44. Amazon Redshift – Use Case • Web Log Analysis at amazon.com (Online Retail Business) • Understand customer behavior • Who’s browsing but not buying? • Which products are winners? • What sequence led to higher customer conversion? • Metrics • Every day 2TB new data • Largest table: 400TB
  45. 45. Amazon Redshift – Use Case • Performance • Scan 2.25 trillion rows of data in 14 minutes • Load 5 billion rows of data in 10 minutes • Comparison • Hadoop (Pig) to Redshift from 2 days to 1 hour • Oracle DB to Redshift from 90 hours to 8 hours
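To put the figures on this slide into perspective, the short calculation below converts them into rows per second and speed-up factors. It uses only the numbers quoted on the slide; the variable names are illustrative.

```python
# Figures quoted on the slide.
SCAN_ROWS, SCAN_MINUTES = 2_250_000_000_000, 14
LOAD_ROWS, LOAD_MINUTES = 5_000_000_000, 10

# Throughput in whole rows per second.
scan_rows_per_second = SCAN_ROWS // (SCAN_MINUTES * 60)
load_rows_per_second = LOAD_ROWS // (LOAD_MINUTES * 60)

# Speed-up factors for the two migrations (2 days = 48 hours).
hadoop_speedup = (2 * 24) / 1
oracle_speedup = 90 / 8
```

That works out to roughly 2.7 billion rows scanned per second, about 8.3 million rows loaded per second, and speed-ups of 48x over the Pig job and 11.25x over the Oracle job.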
  46. 46. Amazon S3 Amazon DynamoDB Amazon RDS Ingest Store Process Visualize Amazon Mobile Analytics Amazon EMR Amazon Redshift Amazon Lambda Amazon Kinesis Firehose Amazon Machine Learning Amazon EC2 Amazon Glacier Amazon Elasticsearch Service Amazon Kinesis Analytics Amazon QuickSight AWS Import/Export Snowball Amazon Kinesis
  47. 47. Amazon QuickSight • Fast, cloud-powered BI service for 1/10th the cost of old-guard BI software • Connectors for files, third party platforms and AWS services • In-memory calculation engine (SPICE) to accelerate analysis and visualization • Supports other partner BI tools • $9 per user per month
  48. 48. Amazon S3 Amazon DynamoDB Amazon RDS Ingest Store Process Visualize Amazon Mobile Analytics Amazon EMR Amazon Redshift Amazon Lambda Amazon Kinesis Firehose Amazon EC2 Amazon Glacier Amazon Elasticsearch Service Amazon Kinesis Analytics Amazon QuickSight AWS Import/Export Snowball Amazon Kinesis Amazon Machine Learning
  49. 49. Kibana: Open Source Visualization https://github.com/elastic/kibana • Kibana is an open-source project from Elastic for visualizing data in the browser • Uses Elasticsearch as indexing engine (based on Apache Lucene)
  50. 50. Let’s put it all together: Demo Time!
  51. 51. Amazon Kinesis Twitter Stream Amazon Lambda Demo: Live Twitter Feed Analysis * https://blog.twitter.com/2013/new-tweets-per-second-record-and-how Twitter Blog* - On a typical day (in 2013): • More than 500 million Tweets sent • Average 5,700 TPS Amazon Elasticsearch Service
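For a feel of what the tweet rates on this slide mean for stream capacity, here is a rough sizing sketch. It relies on the per-shard write limits Amazon Kinesis documents (about 1,000 records per second and about 1 MB per second); the average serialized tweet size of 2,500 bytes is a hypothetical assumption, and real sizing should be driven by peak rather than average load.

```python
import math

# Figures from the slide.
TWEETS_PER_DAY = 500_000_000
AVG_TWEETS_PER_SECOND = 5_700

# Per-shard write limits (check the current Kinesis documentation for exact values).
SHARD_MAX_RECORDS_PER_SECOND = 1_000
SHARD_MAX_BYTES_PER_SECOND = 1_000_000

# Hypothetical average size of one tweet serialized as JSON.
ASSUMED_TWEET_BYTES = 2_500

def shards_needed(records_per_second: int, record_bytes: int) -> int:
    """Shards required to satisfy both the record-count and byte-rate limits."""
    by_count = math.ceil(records_per_second / SHARD_MAX_RECORDS_PER_SECOND)
    by_bytes = math.ceil(records_per_second * record_bytes / SHARD_MAX_BYTES_PER_SECOND)
    return max(by_count, by_bytes)

# 500 million tweets per day expressed as a per-second average.
implied_tweets_per_second = TWEETS_PER_DAY // (24 * 60 * 60)
shards_for_average_load = shards_needed(AVG_TWEETS_PER_SECOND, ASSUMED_TWEET_BYTES)
```

With the assumed record size, byte throughput rather than record count is the limiting factor, so the shard count is driven mostly by how large each tweet payload is.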
  52. 52. Thank you! Olivier Klein 奧樂凱 Solutions Architect, Greater China
