AWS Webcast - Managing Big Data in the AWS Cloud_20140924


This presentation deck covers specific services such as Amazon S3, Kinesis, Redshift, Elastic MapReduce, and DynamoDB, including their features and performance characteristics. It also covers architectural designs for the optimal use of these services based on the dimensions of your data source (structured or unstructured data, volume, item size, and transfer rates) and on application considerations (latency, cost, and durability). Finally, it shares customer success stories and resources to help you get started.

Published in: Technology

  1. 1. Managing Big Data in the AWS Cloud Siva Raghupathy Principal Solutions Architect Amazon Web Services
  2. 2. Agenda • Big data challenges • AWS big data portfolio • Architectural considerations • Customer success stories • Resources to help you get started • Q&A
  3. 3. Data Volume, Velocity, & Variety • 4.4 zettabytes (ZB) of data exists in the digital universe today – 1 ZB = 1 billion terabytes • 450 billion transactions per day by 2020 • More unstructured data than structured data [Chart: data volume growing from GB through TB, PB, and EB to ZB, 1990–2020]
  4. 4. Big Data • Hourly server logs: how your systems were misbehaving an hour ago • Weekly / monthly bill: what you spent this past billing cycle • Daily customer-preferences report from your website’s click stream: tells you what deal or ad to try next time • Daily fraud reports: tell you if there was fraud yesterday Real-time Big Data • Real-time metrics: what just went wrong now • Real-time spending alerts/caps: guarantee that you can’t overspend • Real-time analysis: tells you what to offer the current customer now • Real-time detection: blocks fraudulent use now Big Data: Best Served Fresh
  5. 5. Data Analysis Gap [Chart: data volume, 1990–2020; the gap between generated data and data available for analysis] Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
  6. 6. Big Data • Potentially massive datasets • Iterative, experimental style of data manipulation and analysis • Frequently not a steady-state workload; peaks and valleys • Time to results is key • Hard to configure/manage AWS Cloud • Massive, virtually unlimited capacity • Iterative, experimental style of infrastructure deployment/usage • At its most efficient with highly variable workloads • Parallel compute clusters from a single data source • Managed services
  7. 7. AWS Big Data Portfolio Collect / Ingest Kinesis Store Process / Analyze Visualize / Report EMR EC2 Redshift Data Pipeline S3 DynamoDB Glacier RDS Import Export Direct Connect Amazon SQS
  8. 8. Ingest: The act of collecting and storing data
  9. 9. Why Data Ingest Tools? • Data ingest tools convert random streams of data into a smaller set of sequential streams – Sequential streams are easier to process – Easier to scale – Easier to persist [Diagram: many processing nodes feeding Kafka or Kinesis, which feeds downstream processing]
  10. 10. Data Ingest Tools • Facebook Scribe: data collectors • Apache Kafka: data collectors • Apache Flume: data movement and transformation • Amazon Kinesis: data collectors
  11. 11. Real-time processing of streaming data High throughput Elastic Easy to use Connectors for EMR, S3, Redshift, DynamoDB Amazon Kinesis
  12. 12. Amazon Kinesis Architecture • Millions of sources producing 100s of terabytes per hour • Front end: authentication, authorization • Durable, highly consistent storage replicates data across three data centers (Availability Zones) • Ordered stream of events supports multiple readers – Aggregate and archive to S3 – Real-time dashboards and alarms – Machine learning algorithms or sliding window analytics – Aggregate analysis in Hadoop or a data warehouse • Inexpensive: $0.028 per million puts
  13. 13. Kinesis Stream: Managed ability to capture and store data • Streams are made of shards • Each shard ingests up to 1 MB/sec and up to 1,000 TPS • Each shard emits up to 2 MB/sec • All data is stored for 24 hours • Scale Kinesis streams by adding or removing shards • Replay data within the 24-hour window
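The per-shard limits on this slide imply a simple sizing rule: a stream needs enough shards to satisfy whichever of the three limits is tightest. The following is a minimal sketch of that arithmetic, not an AWS API; the function name `shards_needed` and the example workload are hypothetical.

```python
import math

# Per-shard limits quoted on the slide.
SHARD_INGEST_MB_PER_SEC = 1.0         # write throughput per shard
SHARD_INGEST_RECORDS_PER_SEC = 1000   # write transactions (TPS) per shard
SHARD_EMIT_MB_PER_SEC = 2.0           # read throughput per shard


def shards_needed(ingest_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    by_ingest = math.ceil(ingest_mb_per_sec / SHARD_INGEST_MB_PER_SEC)
    by_records = math.ceil(records_per_sec / SHARD_INGEST_RECORDS_PER_SEC)
    by_read = math.ceil(read_mb_per_sec / SHARD_EMIT_MB_PER_SEC)
    # A stream always has at least one shard.
    return max(1, by_ingest, by_records, by_read)


# Hypothetical workload: 5 MB/sec written, 12,000 records/sec,
# and 8 MB/sec read by all consumers combined.
print(shards_needed(5, 12_000, 8))
```

In this example the record rate, not the byte rate, is the binding limit, which is typical of workloads made of many small records.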
  14. 14. Simple Put interface to store data in Kinesis • Producers use a PUT call to store data in a stream • PutRecord {Data, PartitionKey, StreamName} • A partition key is supplied by the producer and used to distribute the PUTs across shards • Kinesis applies an MD5 hash to the supplied partition key; the resulting value falls within the hash key range of exactly one shard • A unique sequence number is returned to the producer upon a successful PUT call
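The partition-key routing described on this slide can be illustrated locally. The sketch below assumes the hash key space is split evenly across shards, which is a simplification: a real stream can have uneven ranges after shards are split or merged. The helper names (`hash_key`, `even_shard_ranges`, `shard_for`) are hypothetical and are not part of any AWS SDK.

```python
import hashlib

HASH_SPACE = 2 ** 128  # an MD5 digest is a 128-bit value


def hash_key(partition_key):
    """Map a partition key to a 128-bit integer using MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def even_shard_ranges(shard_count):
    """Split the hash space into contiguous, inclusive (start, end) ranges.

    Assumption: ranges are equal-sized, as for a freshly created stream.
    """
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        # The last shard absorbs any remainder so the space is fully covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key, ranges):
    """Return the index of the shard whose hash key range contains the key."""
    value = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= value <= end:
            return index
    raise ValueError("hash key ranges do not cover the hash space")


ranges = even_shard_ranges(4)
for key in ("user-abc", "user-xyz", "sensor-17"):
    print(key, "-> shard", shard_for(key, ranges))
```

Because the shard is chosen by hash, records that share a partition key always land on the same shard, while distinct keys spread across shards.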
  15. 15. Building Kinesis Processing Apps: Kinesis Client Library • Client library for fault-tolerant, at-least-once, continuous processing • Java client library, source available on GitHub • Build and deploy your app with the KCL on your EC2 instance(s) • The KCL is the intermediary between your application and the stream – Automatically starts a Kinesis Worker for each shard – Simplifies reading by abstracting individual shards – Increases or decreases Workers as the number of shards changes – Uses checkpoints to keep track of a Worker’s location in the stream, and restarts Workers if they fail • Integrates with Auto Scaling groups to redistribute Workers to new instances
  16. 16. Sending & Reading Data from Kinesis Streams • Sending (write): HTTP POST, AWS SDK, Log4j, Flume, Fluentd • Reading (read): Get* APIs, Kinesis Client Library + Connector Library, Apache Storm, Amazon Elastic MapReduce
  17. 17. AWS Partners for Data Load and Transformation Hparser, Big Data Edition Flume, Sqoop
  18. 18. Storage
  19. 19. Storage • Structured, simple query: NoSQL (Amazon DynamoDB); Cache (Amazon ElastiCache: Memcached, Redis) • Structured, complex query: SQL (Amazon RDS); Data Warehouse (Amazon Redshift); Search (Amazon CloudSearch) • Unstructured, no query: Cloud Storage (Amazon S3, Amazon Glacier) • Unstructured, custom query: Hadoop/HDFS (Amazon Elastic MapReduce) [Diagram axes: data structure complexity, query structure complexity]
  20. 20. Store anything Object storage Scalable Designed for 99.999999999% durability Amazon S3
  21. 21. Why is Amazon S3 good for Big Data? • No limit on the number of Objects • Object size up to 5TB • Central data storage for all systems • High bandwidth • 99.999999999% durability • Versioning, Lifecycle Policies • Glacier Integration
  22. 22. Amazon S3 Best Practices • Use random hash prefix for keys • Ensure a random access pattern • Use Amazon CloudFront for high throughput GETs and PUTs • Leverage the high durability, high throughput design of Amazon S3 for backup and as a common storage sink • Durable sink between data services • Supports de-coupling and asynchronous delivery • Consider RRS for lower cost, lower durability storage of derivatives or copies • Consider parallel threads and multipart upload for faster writes • Consider parallel threads and range get for faster reads
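Two of these practices are easy to show without touching S3 itself: deriving a hash prefix for an object key, and splitting an object into byte ranges that parallel threads could fetch with range GETs. This is a minimal sketch; the helper names `prefixed_key` and `byte_ranges`, the 4-character prefix length, and the sample key are assumptions made for illustration.

```python
import hashlib


def prefixed_key(natural_key, prefix_len=4):
    """Prepend a short hash of the key so that keys spread across prefixes."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{natural_key}"


def byte_ranges(object_size, part_size):
    """Return inclusive (first_byte, last_byte) pairs covering the object.

    Each pair corresponds to one HTTP Range header, e.g. "bytes=0-3".
    """
    return [
        (start, min(start + part_size, object_size) - 1)
        for start in range(0, object_size, part_size)
    ]


# Sequential, date-based names would otherwise share one leading prefix.
print(prefixed_key("logs/2014-09-24/server-0001.log"))

# A 10-byte object fetched in 4-byte parts.
print(byte_ranges(10, 4))
```

The hash prefix is deterministic, so a reader that knows the natural key can recompute the full object key without a lookup table.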
  23. 23. Aggregate All Data in S3 Surrounded by a collection of the right tools EMR Kinesis Data Pipeline Redshift DynamoDB RDS Cassandra Storm Spark Streaming Amazon S3 Amazon S3
  24. 24. Fully-managed NoSQL database service Built on solid-state drives (SSDs) Consistent low latency performance Any throughput rate No storage limits Amazon DynamoDB
  25. 25. DynamoDB Concepts • Table • Items • Attributes • Schema-less: the schema is defined per attribute
  26. 26. DynamoDB: Access and Query Model • Two primary key options • Hash key: Key lookups: “Give me the status for user abc” • Composite key (Hash with Range): “Give me all the status updates for user ‘abc’ that occurred within the past 24 hours” • Support for multiple data types – String, number, binary… or sets of strings, numbers, or binaries • Supports both strong and eventual consistency – Choose your consistency level when you make the API call – Different parts of your app can make different choices • Global Secondary Indexes
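The difference between the two primary key options can be modelled in a few lines. The class below is a toy in-memory model, not the DynamoDB API: items are grouped by hash key and kept sorted by range key, so a range condition within one hash key becomes a binary search. The class name `CompositeKeyTable` and the sample data are hypothetical.

```python
import bisect
from collections import defaultdict


class CompositeKeyTable:
    """Toy model of a table with a composite (hash + range) primary key."""

    def __init__(self):
        # hash key -> list of (range_key, item), kept sorted by range_key
        self._partitions = defaultdict(list)

    def put(self, hash_key, range_key, item):
        rows = self._partitions[hash_key]
        keys = [rk for rk, _ in rows]
        pos = bisect.bisect_left(keys, range_key)
        if pos < len(rows) and rows[pos][0] == range_key:
            rows[pos] = (range_key, item)      # same primary key: overwrite
        else:
            rows.insert(pos, (range_key, item))

    def query(self, hash_key, range_from, range_to):
        """Return items for one hash key with range_from <= range key <= range_to."""
        rows = self._partitions.get(hash_key, [])
        keys = [rk for rk, _ in rows]
        lo = bisect.bisect_left(keys, range_from)
        hi = bisect.bisect_right(keys, range_to)
        return [item for _, item in rows[lo:hi]]


# "Give me all the status updates for user 'abc' within the past 24 hours."
now = 1_000_000                      # hypothetical timestamp, in seconds
table = CompositeKeyTable()
table.put("abc", now - 3_600, "status from an hour ago")
table.put("abc", now - 90_000, "status from more than a day ago")
table.put("xyz", now - 10, "another user's status")
print(table.query("abc", now - 86_400, now))
```

A hash-only key corresponds to the degenerate case in which each hash key holds a single item, so a lookup needs no range condition.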
  27. 27. DynamoDB: High Availability and Durability
  28. 28. What does DynamoDB handle for me? • Scaling without down-time • Automatic sharding • Security inspections, patches, upgrades • Automatic hardware failover • Multi-AZ replication • Hardware configuration designed specifically for DynamoDB • Performance tuning …and a lot more
  29. 29. Amazon DynamoDB Best Practices • Keep item size small • Store metadata in Amazon DynamoDB and blobs in Amazon S3 • Use a table with a hash key for extremely high scale • Use a hash-range key to model – 1:N relationships – Multi-tenancy • Avoid hot keys and hot partitions • Use a table per day, week, or month for storing time series data • Use conditional updates
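The table-per-period practice for time series data needs a consistent naming rule so that writers and readers agree on which table holds a given timestamp. A minimal sketch follows; the naming scheme, the function name `table_name`, and the base name `events` are assumptions, not a DynamoDB convention.

```python
from datetime import datetime, timezone


def table_name(base, ts, period="day"):
    """Return the name of the table that holds data for the period containing ts."""
    if period == "day":
        suffix = ts.strftime("%Y%m%d")
    elif period == "week":
        iso_year, iso_week, _ = ts.isocalendar()   # ISO 8601 week numbering
        suffix = f"{iso_year}W{iso_week:02d}"
    elif period == "month":
        suffix = ts.strftime("%Y%m")
    else:
        raise ValueError(f"unsupported period: {period}")
    return f"{base}_{suffix}"


ts = datetime(2014, 9, 24, 12, 0, tzinfo=timezone.utc)
for period in ("day", "week", "month"):
    print(table_name("events", ts, period))
```

With one table per period, old data can be dropped or archived by deleting a whole table rather than deleting items one by one.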
  30. 30. Relational Databases Fully managed; zero admin MySQL, PostgreSQL, Oracle & SQL Server Amazon RDS
  31. 31. Process and Analyze
  32. 32. Processing Frameworks • Batch Processing – Take large amount (>100TB) of cold data and ask questions – Takes hours to get answers back • Stream Processing (real-time) – Take small amount of hot data and ask questions – Takes short amount of time to get your answer back
  33. 33. Processing Frameworks • Batch Processing – Amazon EMR (Hadoop) – Amazon Redshift • Stream Processing – Spark Streaming – Storm
  34. 34. Columnar data warehouse ANSI SQL compatible Massively parallel Petabyte scale Fully-managed Very cost-effective Amazon Redshift
  35. 35. Amazon Redshift architecture • Leader Node – SQL endpoint – Stores metadata – Coordinates query execution • Compute Nodes – Local, columnar storage – Execute queries in parallel – Load, backup, restore via Amazon S3 – Parallel load from Amazon DynamoDB • Hardware optimized for data processing • Two hardware platforms – DW1: HDD; scale from 2TB to 1.6PB – DW2: SSD; scale from 160GB to 256TB [Diagram labels: JDBC/ODBC; 10 GigE (HPC); Ingestion, Backup, Restore]
  36. 36. Amazon Redshift Best Practices • Use COPY command to load large data sets from Amazon S3, Amazon DynamoDB, Amazon EMR/EC2/Unix/Linux hosts – Split your data into multiple files – Use GZIP or LZOP compression – Use manifest file • Choose proper sort key – Range or equality on WHERE clause • Choose proper distribution key – Join column, foreign key or largest dimension, group by column
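The three COPY sub-points (multiple files, compression, manifest) can be prepared on the client side before anything is uploaded. The sketch below splits records into gzip-compressed parts and builds a manifest in the JSON shape that COPY reads when its MANIFEST option is used (an `entries` list of `url` and `mandatory` fields). The bucket name, key prefix, part-naming scheme, and function names are hypothetical, and the upload and the COPY statement itself are left out.

```python
import gzip
import json


def split_and_compress(lines, parts):
    """Distribute lines round-robin into `parts` gzip-compressed blobs."""
    buckets = [[] for _ in range(parts)]
    for i, line in enumerate(lines):
        buckets[i % parts].append(line)
    return [
        gzip.compress("".join(f"{line}\n" for line in bucket).encode("utf-8"))
        for bucket in buckets
    ]


def part_key(prefix, index):
    """Object key for one part (hypothetical naming scheme)."""
    return f"{prefix}.part{index:04d}.gz"


def build_manifest(bucket, prefix, parts):
    """Build a manifest that lists every part file explicitly."""
    entries = [
        {"url": f"s3://{bucket}/{part_key(prefix, i)}", "mandatory": True}
        for i in range(parts)
    ]
    return json.dumps({"entries": entries}, indent=2)


rows = [f"{i}|user{i}|2014-09-24" for i in range(10)]
blobs = split_and_compress(rows, 4)
print([len(b) for b in blobs])                       # compressed sizes in bytes
print(build_manifest("my-bucket", "load/events", 4))
```

Splitting the input lets the compute nodes load parts in parallel, and the manifest pins the load to exactly the files listed.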
  37. 37. Hadoop/HDFS clusters Hive, Pig, Impala, HBase Easy to use; fully managed On-demand and spot pricing Tight integration with S3, DynamoDB, and Kinesis Amazon Elastic MapReduce
  38. 38. How Does EMR Work? 1. Put the data into S3 2. Choose: Hadoop distribution, # of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase 3. Launch the cluster using the EMR console, CLI, SDK, or APIs 4. Get the output from S3 [Diagram: EMR cluster, S3]
  39. 39. How Does EMR Work? • You can easily resize the cluster • And launch parallel clusters using the same data [Diagram: EMR clusters, S3]
  40. 40. How Does EMR Work? • Use Spot nodes to save time and money [Diagram: EMR cluster, S3]
  41. 41. The Hadoop Ecosystem works inside of EMR
  42. 42. Amazon EMR Best Practices • Balance transient vs persistent clusters to get the best TCO • Leverage Amazon S3 integration – Consistent View for EMRFS • Use Compression (LZO is a good pick) • Avoid small files (< 100MB; s3distcp can help!) • Size cluster to suit each job • Use EC2 Spot Instances
  43. 43. Amazon EMR Nodes and Size • Tuning cluster size can be more efficient than tuning Hadoop code • Use m1 and c1 family for functional testing • Use m3 and c3 xlarge and larger nodes for production workloads • Use cc2/c3 for memory and CPU intensive jobs • hs1, hi1, i2 instances for HDFS workloads • Prefer a smaller cluster of larger nodes
  44. 44. Partners – Analytics (Scientific, algorithmic, predictive, etc)
  45. 45. Visualize
  46. 46. Partners - BI & Data Visualization
  47. 47. Putting All The AWS Data Tools Together & Architectural Considerations
  48. 48. One tool to rule them all
  49. 49. Data Characteristics: Hot, Warm, Cold
          Characteristic  Hot        Warm     Cold
          Volume          MB–GB      GB–TB    PB
          Item size       B–KB       KB–MB    KB–TB
          Latency         ms         ms, sec  min, hrs
          Durability      Low–High   High     Very High
          Request rate    Very High  High     Low
          Cost/GB         $$-$       $-¢¢     ¢
  50. 50.
          Service             Average latency       Data volume          Item size          Request rate              Cost ($/GB/month)  Durability
          ElastiCache         ms                    GB                   B-KB               Very High                 $$                 Low-Moderate
          Amazon DynamoDB     ms                    GB-TB (no limit)     B-KB (64 KB max)   Very High                 ¢¢                 Very High
          Amazon RDS          ms, sec               GB-TB (3 TB max)     KB (~row size)     High                      ¢¢                 High
          CloudSearch         ms, sec               GB-TB                KB (1 MB max)      High                      $                  High
          Amazon Redshift     sec, min              TB-PB (1.6 PB max)   KB (64 K max)      Low                       ¢                  High
          Amazon EMR (Hive)   sec, min, hrs         GB-PB (~nodes)       KB-MB              Low                       ¢                  High
          Amazon S3           ms, sec, min (~size)  GB-PB (no limit)     KB-GB (5 TB max)   Low-Very High (no limit)  ¢                  Very High
          Amazon Glacier      hrs                   GB-PB (no limit)     GB (40 TB max)     Very Low (no limit)       ¢                  Very High
  51. 51. Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB? “I’m currently scoping out a project that will greatly increase my team’s use of Amazon S3. Hoping you could answer some questions. The current iteration of the design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…”
          Request rate (writes/sec)  Object size (bytes)  Total size (GB/month)  Objects per month
          300                        2,048                1,483                  777,600,000
  52. 52. DynamoDB or S3?
          Request rate (writes/sec)  Object size (bytes)  Total size (GB/month)  Objects per month
          300                        2,048                1,483                  777,600,000
  53. 53.
          Scenario    Request rate (writes/sec)  Object size (bytes)  Total size (GB/month)  Objects per month  Use
          Scenario 1  300                        2,048                1,483                  777,600,000        Amazon DynamoDB
          Scenario 2  300                        32,768               23,730                 777,600,000        Amazon S3
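The object counts and monthly sizes in these scenarios follow directly from the request rate and the object size. The sketch below reproduces them under two assumptions that happen to match the slide figures: a 30-day month, and 1 GB = 2^30 bytes. The function name `monthly_footprint` is hypothetical, and no pricing is modelled.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30        # assumption: reproduces the slide's object count
BYTES_PER_GB = 2 ** 30     # assumption: reproduces the slide's GB figures


def monthly_footprint(writes_per_sec, object_bytes):
    """Return (objects per month, total size in GB per month)."""
    objects = writes_per_sec * SECONDS_PER_DAY * DAYS_PER_MONTH
    total_gb = objects * object_bytes / BYTES_PER_GB
    return objects, total_gb


for label, size in (("Scenario 1", 2_048), ("Scenario 2", 32_768)):
    objects, total_gb = monthly_footprint(300, size)
    print(f"{label}: {objects:,} objects/month, {total_gb:,.0f} GB/month")
```

The request rate is the same in both scenarios, so the object count is identical; only the object size, and therefore the stored volume, differs by a factor of 16.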
  54. 54. Lambda Architecture
  55. 55. Putting it all together De-coupled architecture • Multi-tier data processing architecture • Ingest & Store de-coupled from Processing • Ingest tools write to multiple data stores • Processing frameworks (Hadoop, Spark, etc.) read from data stores • Consumers can decide which data store to read from depending on their data processing requirement
  56. 56. [Diagram: tools arranged by data temperature (hot to cold) and answer latency (low to high)] • Ingest: Kinesis/Kafka • Data stores: NoSQL/DynamoDB, Hadoop HDFS, S3 • Processing: Spark Streaming/Storm, Redshift, Impala, Spark, EMR/Hadoop
  57. 57. Customer Use Cases
  58. 58. Automatic spelling corrections Autocomplete Search Recommendations
  59. 59. A look at how it works Data Analyzed Using EMR: Months of user history Common misspellings Weste Winstin Westa Whenstin Automatic spelling corrections
  60. 60. Yelp web site log data goes into Amazon S3 Months of user search data Search terms Misspellings Final click throughs Amazon S3
  61. 61. Amazon Elastic MapReduce spins up a 200 node Hadoop cluster Hadoop Cluster Amazon S3 Amazon EMR
  62. 62. All 200 nodes of the cluster simultaneously look for common misspellings Hadoop Cluster Amazon S3 Amazon EMR Westen Wistin Westan
  63. 63. A map of common misspellings and suggested corrections is loaded back into Amazon S3. Hadoop Cluster Amazon S3 Amazon EMR Westen Wistin Westan
  64. 64. Then the cluster is shut down Yelp only pays for the time they used it Hadoop Cluster Amazon S3 Amazon EMR
  65. 65. Each of Yelp’s 80 Engineers Can Do This Whenever They Have a Big Data Problem • Yelp spins up over 250 Hadoop clusters per week in EMR. Amazon S3 Amazon EMR
  66. 66. Data Innovation Meets Action at Scale at NASDAQ OMX • NASDAQ’s technology powers more than 70 marketplaces in 50 countries • NASDAQ’s global platform can handle more than 1 million messages/second at a median speed of sub-55 microseconds • NASDAQ owns and operates 26 markets, including 3 clearinghouses and 5 central securities repositories • More than 5,500 structured products are tied to NASDAQ’s global indexes, with a notional value of at least $1 trillion • NASDAQ powers 1 in 10 of the world’s securities transactions
  67. 67. NASDAQ’s Big Data Challenge • Archiving Market Data – A classic “Big Data” problem • Power Surveillance and Business Intelligence/Analytics • Minimize Cost – Not only infrastructure, but development/IT labor costs too • Empower the business for self-service
  68. 68. Market Data Is Big Data • SIP Total Monthly Message Volumes: OPRA, UQDF and CQS • OPRA Annual Increase: 69% • CQS Annual Increase: 10% • UQDF Annual Decrease: 6%
          Total monthly message volume (UQDF, CQS) and combined average daily volume
          Date    UQDF           CQS             Combined average daily volume
          Aug-12  2,317,804,321  8,241,554,280   459,102,548
          Sep-12  1,948,330,199  7,452,279,225   494,768,917
          Oct-12  1,016,336,632  7,452,279,225   403,267,422
          Nov-12  2,148,867,295  9,552,313,807   557,199,100
          Dec-12  2,017,355,401  8,052,399,165   503,487,728
          Jan-13  2,099,233,536  7,474,101,082   455,873,077
          Feb-13  1,969,123,978  7,531,093,813   500,011,463
          Mar-13  2,010,832,630  7,896,498,260   495,366,545
          Apr-13  2,447,109,450  9,805,224,566   556,924,273
          May-13  2,400,946,680  9,430,865,048   537,809,624
          Jun-13  2,601,863,331  11,062,086,463  683,197,490
          Jul-13  2,142,134,920  8,266,215,553   473,106,840
          Aug-13  2,188,338,764  9,079,813,726   512,188,750
          Total monthly message volume (OPRA) and average daily volume
          Date    OPRA             Average daily volume
          Aug-12  80,600,107,361   3,504,352,494
          Sep-12  77,303,404,427   4,068,600,233
          Oct-12  98,407,788,187   4,686,085,152
          Nov-12  104,739,265,089  4,987,584,052
          Dec-12  81,363,853,339   4,068,192,667
          Jan-13  82,227,243,377   3,915,583,018
          Feb-13  87,207,025,489   4,589,843,447
          Mar-13  93,573,969,245   4,678,698,462
          Apr-13  123,865,614,055  5,630,255,184
          May-13  134,587,099,561  6,117,595,435
          Jun-13  162,771,803,250  8,138,590,163
          Jul-13  120,920,111,089  5,496,368,686
          Aug-13  136,237,441,349  6,192,610,970
          [Chart: NASDAQ Exchange Daily Peak Messages]
          Charts courtesy of the Financial Information Forum. Financial Information Forum, Redistribution without permission from FIF prohibited, email:
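The annual change figures quoted on this slide can be checked against the monthly totals it lists. The sketch below assumes that "annual" means comparing the Aug-12 total with the Aug-13 total for each feed; the slide does not state its method, but that comparison reproduces the quoted percentages after rounding.

```python
def percent_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


# (Aug-12 total, Aug-13 total) monthly message volumes from the slide.
monthly_totals = {
    "OPRA": (80_600_107_361, 136_237_441_349),
    "CQS": (8_241_554_280, 9_079_813_726),
    "UQDF": (2_317_804_321, 2_188_338_764),
}

for feed, (aug_2012, aug_2013) in monthly_totals.items():
    print(f"{feed}: {percent_change(aug_2012, aug_2013):+.1f}%")
```

OPRA dominates the total volume by more than an order of magnitude, so its growth rate drives the archive's storage requirements.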
  69. 69. NASDAQ’s Legacy Solution • On-premises MPP DB – Relatively expensive, finite storage – Required periodic additional expenses to add more storage – Ongoing IT (administrative) human costs • Legacy BI tool – Requires developer involvement for new data sources, reports, dashboards, etc.
  70. 70. New Solution: Amazon Redshift • Cost Effective – Redshift is 43% of the cost of legacy • Assuming equal storage capacities – Doesn’t include IT ongoing costs! • Performance – Outperforms NASDAQ’s legacy BI/DB solution – Insert 550K rows/second on a 2 node 8XL cluster • Elastic – NASDAQ can add additional capacity on demand, easy to grow their cluster
  71. 71. New Solution: Pentaho BI/ETL • Amazon Redshift partner – ers/pentaho/ • Self Service – Tools empower BI users to integrate new data sources, create their own analytics, dashboards, and reports without requiring development involvement • Cost effective
  72. 72. Net Result • New solution is cheaper, faster, and offers capabilities that NASDAQ didn’t have before – Empowers NASDAQ’s business users to explore data like they never could before – Reduces IT and development as bottlenecks – Margin improvement (expense reduction and supports business decisions to grow revenue)
  73. 73. NEXT STEPS
  74. 74. AWS is here to help Solution Architects Professional Services Premium Support AWS Partner Network (APN)
  75. 75. Partner with an AWS Big Data expert
  76. 76. Big Data Case Studies Learn from other AWS customers
  77. 77. AWS Marketplace AWS Online Software Store Shop the big data category
  78. 78. AWS Public Data Sets Free access to big data sets
  79. 79. AWS Grants Program AWS in Education
  80. 80. AWS Big Data Test Drives APN Partner-provided labs
  81. 81. AWS Training & Events Webinars, Bootcamps, and Self-Paced Labs
  82. 82. Big Data on AWS Course on Big Data
  85. 85. Thank You!