Big Data Architectural Patterns and Best Practices on AWS

The world is producing an ever increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems. But what services should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.


  1. 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Ian Meyers, Principal Solutions Architect July 7th, 2016 Big Data Architectural Patterns and Best Practices on AWS
  2. 2. Big Data Evolution Batch processing Stream processing Machine learning
  3. 3. Powerful Services, Open Ecosystem Amazon Glacier S3 DynamoDB RDS EMR Amazon Redshift Data Pipeline Amazon Kinesis Kinesis-enabled app Lambda ML SQS ElastiCache DynamoDB Streams Amazon Elasticsearch Service Amazon Web Services Open Source & Commercial Ecosystem
  4. 4. Architectural Principles • Separation of compute & storage: event driven, loose coupling, independent scaling & sizing • "Data Sympathy" - use the right tool for the job: data structure, latency, throughput, access patterns • Managed services, open technologies: scalable/elastic, available, reliable, secure, no/low admin • Focus on the data, and your business, not undifferentiated tasks
  5. 5. Big Data != Big Cost Aim to reduce the cost of experimentation to near 0
  6. 6. Simplify Big Data Processing COLLECT STORE PROCESS / ANALYZE CONSUME Time to answer (Latency) Throughput Cost
  7. 7. COLLECT: mobile apps, web apps, data centers, devices, and sensors & IoT platforms produce records, documents, files, messages, and streams. Ingestion options include easily capturing mobile data through sync and server logging, file-based integration to streaming storage, and API integration via SDKs and ecosystem tools (AWS Mobile SDK, Amazon Cognito Sync, AWS Mobile Analytics, Amazon CloudWatch Logs agent, AWS CloudTrail, AWS Direct Connect, AWS Import/Export Snowball, AWS IoT, Amazon Kinesis Agent, Amazon Kinesis Producer Library).
  8. 8. Principles for Collection Architecture • Should be easy to install, manage, and understand • Meet requirements for data durability & recovery • Ensure ability to meet latency & performance requirements • Data handling should be sympathetic to the type of data
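
These collection principles apply whichever ingestion path is used; as a concrete illustration of API integration via the SDKs (slide 7), here is a minimal producer sketch using boto3 and Amazon Kinesis Streams. The region, stream name, and event shape are illustrative assumptions, not part of the deck.

    import json
    import boto3

    kinesis = boto3.client("kinesis", region_name="eu-west-1")

    def put_event(event, partition_key):
        # Records sharing a partition key land on the same shard,
        # which is what preserves per-client ordering downstream.
        kinesis.put_record(
            StreamName="clickstream",            # assumed stream name
            Data=json.dumps(event).encode("utf-8"),
            PartitionKey=partition_key,
        )

    put_event({"user": "u-123", "action": "page_view"}, partition_key="u-123")
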
  9. 9. Simplify Big Data Processing COLLECT STORE PROCESS / ANALYZE CONSUME Time to answer (Latency) Throughput Cost
  10. 10. Types of Data: the sources from COLLECT land in different STORE tiers. Cache: heavily read, application-defined structure. Database: consistently read, relational or document data. Search: frequently read, often requiring natural language processing or geospatial access. File store: data received as files, or where object storage is appropriate. Queue: data intended for a single recipient, with low latency. Stream storage: data intended for many recipients, with low latency.
  11. 11. COLLECT feeds STORE: stream storage (Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams), a queue (Amazon SQS), a file store (Amazon S3), and the cache, SQL, NoSQL, and search tiers (Amazon ElastiCache, Amazon RDS, Amazon DynamoDB, Amazon Elasticsearch Service).
  12. 12. Database Anti-pattern Myriad of Applications One GREAT BIG Database
  13. 13. Best Practice - Use the Right Tool for the Job: a myriad of applications backed by a purpose-built data tier. Search: Amazon Elasticsearch Service, Amazon CloudSearch. Cache: Amazon ElastiCache (Redis, Memcached). SQL: Amazon Aurora, Amazon RDS, Amazon Redshift. NoSQL: Amazon DynamoDB, Cassandra, HBase, MongoDB.
  14. 14. COLLECT feeds STORE, with Amazon S3 as the file store alongside stream storage (Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams), the queue (Amazon SQS), and the cache, database, and search tiers (Amazon ElastiCache, Amazon RDS, Amazon DynamoDB, Amazon Elasticsearch Service).
  15. 15. Why is Amazon S3 Good for Big Data? • Virtually unlimited storage, unlimited object count • Very high bandwidth – no aggregate throughput limit • Highly available – can tolerate AZ failure • Designed for 99.999999999% durability • Tiered storage (Standard, IA, Amazon Glacier) via life-cycle policies • Native support for versioning • Secure – SSL, client/server-side encryption at rest • Low cost
  16. 16. Why is Amazon S3 Good for Big Data? • Natively supported by big data frameworks (Spark, Hive, Presto, etc.) • No need to run compute clusters for storage (unlike HDFS) • No need to pay for data replication • Can run transient Hadoop clusters & Amazon EC2 Spot Instances • Multiple & heterogeneous analysis clusters can use the same data
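
Because S3 is natively supported by these frameworks, a transient cluster can read shared data in place rather than copying it into HDFS first. A minimal PySpark sketch (Spark 2.x API), assuming an EMR cluster with Spark installed and a placeholder bucket and prefix:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-read-example").getOrCreate()

    # On EMR the s3:// scheme is served by EMRFS, so no HDFS copy is needed.
    events = spark.read.json("s3://my-data-lake/events/2016/07/")
    events.groupBy("action").count().show()
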
  17. 17. What about HDFS & Data Tiering (S3 IA & Amazon Glacier)? • Use HDFS for very frequently accessed (hot) data • Use Amazon S3 Standard as system of record for durability • Use Amazon S3 Standard – IA for near line archive data • Use Amazon Glacier for cold/offline archive data
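
The hot/warm/cold tiering above is typically expressed as an S3 lifecycle policy rather than manual copies. A hedged boto3 sketch; the bucket name, prefix, and transition ages are assumptions for illustration:

    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-data-lake",                     # assumed bucket
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "tier-aging-data",
                    "Filter": {"Prefix": "events/"},
                    "Status": "Enabled",
                    "Transitions": [
                        {"Days": 30, "StorageClass": "STANDARD_IA"},  # near-line archive
                        {"Days": 90, "StorageClass": "GLACIER"},      # cold/offline archive
                    ],
                }
            ]
        },
    )
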
  18. 18. Message & Stream Data: Amazon SQS (managed message queue service), Amazon Kinesis Streams (managed stream storage + processing), Amazon Kinesis Firehose (managed data delivery), and Amazon DynamoDB Streams (managed NoSQL database whose tables can be stream-enabled).
  19. 19. Why Stream Storage? Decouple producers & consumers; persistent buffer; collect multiple streams; preserve client ordering; streaming analysis; parallel consumption. (Diagram: records 1-4 written to shard 1 / partition 1 and shard 2 / partition 2 of an Amazon Kinesis stream or DynamoDB stream are read independently by Consumer 1 and Consumer 2.)
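
To make parallel consumption concrete, here is a minimal single-shard reader using boto3; in production the Kinesis Client Library (KCL) handles shard discovery, checkpointing, and load balancing. The region and stream name are placeholders.

    import time
    import boto3

    kinesis = boto3.client("kinesis", region_name="eu-west-1")

    stream = kinesis.describe_stream(StreamName="clickstream")["StreamDescription"]
    shard_id = stream["Shards"][0]["ShardId"]          # read only the first shard
    iterator = kinesis.get_shard_iterator(
        StreamName="clickstream", ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",              # start from the oldest record
    )["ShardIterator"]

    while True:
        resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in resp["Records"]:
            print(record["Data"])                      # each consumer sees the same records
        iterator = resp["NextShardIterator"]
        time.sleep(1)                                  # stay under per-shard read limits
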
  20. 20. What Data Store Should I Use? Data structure → Fixed schema, JSON, key-value Access patterns → Store data in the format you will access it Data / access characteristics → Hot, warm, cold Cost → Right cost
  21. 21. Data Structure and Access Patterns. Access patterns: put/get (key, value) → cache, NoSQL; simple relationships (1:N, M:N) → NoSQL; cross-table joins, transactions, SQL → SQL; faceting, natural language, geo → search. Data structure: fixed schema → SQL, NoSQL; schema-free (JSON) → NoSQL, search; (key, value) → cache, NoSQL.
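
A minimal sketch of the put/get (key, value) access pattern on a NoSQL store, using DynamoDB through boto3; the table name and key schema are illustrative assumptions:

    import boto3

    table = boto3.resource("dynamodb").Table("user_profiles")   # assumed table

    # Simple key-value access: write an item, then read it back by its key.
    table.put_item(Item={"user_id": "u-123", "plan": "pro", "score": 42})
    item = table.get_item(Key={"user_id": "u-123"}).get("Item")
    print(item)
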
  22. 22. What Is the Temperature of Your Data / Access ?
  23. 23. Data / Access Characteristics: Hot, Warm, Cold. Volume: MB–GB (hot), GB–TB (warm), PB (cold). Item size: B–KB (hot), KB–MB (warm), KB–TB (cold). Latency: ms (hot), ms–sec (warm), min–hrs (cold). Durability: low–high (hot), high (warm), very high (cold). Request rate: very high (hot), high–med (warm), low (cold). Cost/GB: $$–$ (hot), $–¢¢ (warm), ¢ (cold).
  24. 24. What Data Store Should I Use? (hot to cold) Amazon ElastiCache: average latency ms; typical data stored GB; typical item size B–KB; request rate high – very high; storage cost $$ per GB/month; durability low – moderate; availability high (2 AZ). Amazon DynamoDB: latency ms; GB–TBs (no limit); item size KB (400 KB max); request rate very high (no limit); cost ¢¢; durability very high; availability very high (3 AZ). Amazon RDS/Aurora: latency ms, sec; GB–TB (64 TB max); item size KB (64 KB max); request rate high; cost ¢¢; durability very high; availability very high (3 AZ). Amazon Elasticsearch: latency ms, sec; GB–TB; item size KB (2 GB max); request rate high; cost ¢¢; durability high; availability high (2 AZ). Amazon S3: latency ms, sec, min (~ size); MB–PB (no limit); item size KB–TB (5 TB max); request rate low – high (no limit); cost ¢; durability very high; availability very high (3 AZ). Amazon Glacier: latency hrs; GB–PB (no limit); item size GB (40 TB max); request rate very low; cost ¢/10; durability very high; availability very high (3 AZ).
  25. 25. Simplify Big Data Processing COLLECT STORE PROCESS / ANALYZE CONSUME Time to answer (Latency) Throughput Cost
  26. 26. COLLECT feeds STORE, which feeds PROCESS / ANALYZE: hot and warm data in stream storage (Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams), the queue (Amazon SQS), the file store (Amazon S3), and the cache, SQL, NoSQL, and search tiers (Amazon ElastiCache, Amazon RDS, Amazon DynamoDB, Amazon Elasticsearch Service) flows into processing.
  27. 27. Tools and Frameworks for PROCESS / ANALYZE: Amazon Kinesis Analytics, Amazon KCL apps, AWS Lambda, Amazon SQS apps on Amazon EC2, Amazon EMR, Presto, Amazon Redshift, and Amazon Machine Learning. Batch: minutes or hours on cold data; daily/weekly/monthly reports. Interactive: seconds on warm/cold data; self-service dashboards. Messaging: milliseconds or seconds on hot data; message/event buffering. Streaming: milliseconds or seconds on hot data; billing/fraud alerts, 1-minute metrics.
  28. 28. Tools and Frameworks by analysis type. Batch: Amazon EMR (MapReduce, Hive, Pig, Spark). Interactive: Amazon Redshift, Amazon EMR (Presto, Spark). Messaging: Amazon SQS application on Amazon EC2. Streaming: micro-batch with Spark Streaming or KCL; real-time with Amazon Kinesis Analytics, Storm, AWS Lambda, or KCL.
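
As an example of the real-time streaming option, here is a hedged sketch of an AWS Lambda handler invoked with a Kinesis stream configured as its event source; the threshold-based alert is purely illustrative.

    import base64
    import json

    def handler(event, context):
        # Kinesis delivers record payloads base64-encoded inside the event.
        for record in event["Records"]:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            if payload.get("amount", 0) > 10000:
                print("possible fraud:", payload)   # e.g. forward to an alerting topic
        return {"records_processed": len(event["Records"])}
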
  29. 29. What Analytics Technology Should I Use? Amazon Redshift: low query latency; high durability; data volume 1.6 PB max; AWS managed; native storage; high SQL compatibility. Amazon EMR with Presto: low latency; high durability; volume scales with nodes; AWS managed; HDFS / S3 storage; high SQL compatibility. Amazon EMR with Spark: low latency; high durability; volume scales with nodes; AWS managed; HDFS / S3 storage; low SQL compatibility (SparkSQL). Amazon EMR with Hive: high latency; high durability; volume scales with nodes; AWS managed; HDFS / S3 storage; medium SQL compatibility (HQL).
  30. 30. What Streaming Analysis Technology Should I Use? Spark Streaming: scale ~ nodes (no limits); micro-batch; AWS managed via EMR; single-AZ; Java, Python, Scala. Apache Storm: scale ~ nodes (no limits); real-time; not managed (EC2); availability configurable multi-AZ; any language via Thrift. Kinesis KCL application: scale ~ nodes (no limits); near-real-time; not managed (KCL + EC2 + Auto Scaling); multi-AZ; Java, plus .NET, Python, Ruby, Node.js via the MultiLang Daemon. AWS Lambda: automatic scaling (no limits); near-real-time; AWS managed; multi-AZ; Node.js, Java, Python. Amazon SQS apps: scale ~ nodes (no limits); near-real-time; not managed (EC2 + Auto Scaling); multi-AZ; AWS SDK languages (Java, .NET, Python, …).
  31. 31. Simplify Big Data Processing COLLECT STORE PROCESS / ANALYZE CONSUME Time to answer (Latency) Throughput Cost
  32. 32. The full data bus: COLLECT feeds STORE (stream storage, queue, file store, cache, SQL, NoSQL, search), STORE feeds PROCESS / ANALYZE (stream, message, interactive, batch, and ML engines), ETL moves data between the tiers, and results flow on to CONSUME.
  33. 33. CONSUME: Amazon QuickSight and other analysis & visualization tools for reporting, exploratory and ad-hoc analysis, and data visualization (business users); notebooks and IDEs for data science (data scientists, developers); apps & services via APIs.
  34. 34. Design Patterns
  35. 35. Primitive: Batch Analysis & Enhancement process store Amazon S3 Amazon DynamoDB Amazon EMR Presto Spark SQL Amazon Redshift Amazon QuickSight AWS Marketplace
  36. 36. Primitive: Event Processing “Data Bus” Multiple stages Storage raises events to decouple processing stages Store Process Store Process
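
A hedged sketch of one stage of such a data bus: an object stored in S3 raises an event that triggers the next processing stage, here an AWS Lambda function. The bucket names and the transformation are illustrative assumptions.

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        # Triggered by an S3 ObjectCreated event on the source bucket.
        for rec in event["Records"]:
            bucket = rec["s3"]["bucket"]["name"]
            key = rec["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            transformed = body.upper()                      # stand-in for real processing
            # Writing the result raises the next event in the chain.
            s3.put_object(Bucket="processed-zone", Key=key, Body=transformed)
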
  37. 37. Primitive: Pub/Sub process store Amazon Kinesis AWS Lambda Amazon Kinesis S3 Connector Apache Spark Kinesis Client Library
  38. 38. Combining Primitives: Event Processing Network Combines data bus and pub/sub patterns to create asynchronous, continuously executing transformations Store Process Store Process Process Store Process Process Process
  39. 39. Primitive: Messaging “Fan Out” process store Amazon SQS AWS Lambda Amazon Kinesis
  40. 40. Real Time & Batch Architecture: Amazon Kinesis feeds real-time processing (KCL, AWS Lambda, Storm, Spark Streaming on Amazon EMR, Amazon ML) that writes to real-time storage (Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon ES) serving applications, and also feeds batch analytics (store in Amazon S3, process with Amazon Redshift and Amazon EMR using Presto, Hive, Pig, Spark).
  41. 41. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Zoltan Arvai, Big Data Architect July 7th, 2016 Leveraging S3 and EMR at Adbrain
  42. 42. About Adbrain • Adbrain is a market leader in providing Probabilistic Identity Solutions • At the core of our business we build an identity graph that represents connected devices on the planet • We enable our clients to understand and target their audience better
  43. 43. What does that mean? • Ingest large amounts of log data from different data sources • Build a graph using statistical and machine learning algorithms • Extract a relevant user graph for our clients
  44. 44. Our main scenarios • Production graph generation and servicing our clients • Data Science team to run computations (Spark) • R&D work with ML • Data feed evaluation • Graph analytics • Data Engineering development workload
  45. 45. Our original setup • HDFS Cluster on 18 nodes with most of our hot data • Not leveraging co-location of data • Spark clusters are separated • Spark instance type optimized according to workload • Custom cluster provisioning solution to run Spark jobs • Built in-house • Required maintenance
  46. 46. challenge
  47. 47. What are the issues? • “Hadoop is a cheap Big Data solution on commodity hardware” … define cheap for a startup • Scaling Hadoop means more focus on infrastructure • Data grows really quickly (adding a new country, data feed, etc.)
  48. 48. What do we need? • Better scale • Cheaper solution • Less focus on infrastructure • Reduced risk
  49. 49. EMR and S3 to the rescue! • S3 is significantly cheaper than HDFS • We use Infrequent Access and Glacier storage as well • EMR is a fully managed solution • Up-to-date Spark distributions • Support for Spot Instances • Managed Hadoop when / if needed Amazon S3 Amazon EMR
  50. 50. The implementation challenge
  51. 51. Some technical challenges • Spark 1.3.1 did not work well with S3 and Parquet files • Writing to S3 using the default ParquetOutputCommitter is slow • Calculating output summary metadata is slow • S3 is not consistent by default (it’s not a file system) • S3 can be slower than HDFS
  52. 52. Solutions • Upgrade to Spark 1.5.1+ • Using DirectParquetOutputCommitter with speculation disabled • parquet.enable.summary-metadata = false • Enable S3 Consistent View in EMR! • (Remember the DynamoDB tables!) • In case of a multi-step Spark workflow, utilize HDFS on the Spark Cluster
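
A sketch of how these workarounds might be wired into a Spark 1.5-era job configuration. The property and class names below are the ones documented for that release line (and have since been superseded in newer Spark versions), so treat this as illustrative rather than current guidance.

    from pyspark import SparkConf, SparkContext

    conf = (
        SparkConf()
        .setAppName("graph-build")
        # The direct Parquet committer requires speculation to be off.
        .set("spark.speculation", "false")
        .set("spark.sql.parquet.output.committer.class",
             "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
        # Skip the slow _metadata / _common_metadata summary files.
        .set("spark.hadoop.parquet.enable.summary-metadata", "false")
    )
    sc = SparkContext(conf=conf)

    # Intermediate outputs of a multi-step workflow go to cluster HDFS;
    # only the final result is written back to S3 (with EMRFS consistent view enabled).
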
  53. 53. The end result • Decreased costs by 70%+ • Focus on development instead of commodities (hardware/versioning/patching) • Eliminated risks
  54. 54. Thank you! aws.amazon.com/big-data
  55. 55. Remember to complete your evaluations!
