
Fast Track to Your Data Lake on AWS


  1. Fast Track to Your Data Lake on AWS
     John Mallory, Business Development
     March 16, 2017
     © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. Data has gravity
     …easier to move processing to the data
     4K/8K, Genomics, Seismic, Financial, Logs, IoT
  3. Data has Business Value
  4. Challenges with Legacy Data Architectures
     • Can’t move data across silos
     • Can’t afford to keep all of the data
     • Can’t scale with dynamic data and real-time processing
     • Can’t scale management of data
     • Can’t find the people who know how to configure and manage complex infrastructure
     • Can’t afford the investments to keep refreshing infrastructure and data centers
  5. Enter Data Lake Architectures
     A data lake is a new and increasingly popular architecture for storing and analyzing massive volumes and heterogeneous types of data.
     Benefits of a Data Lake
     • All Data in One Place
     • Quick Ingest
     • Storage vs Compute
     • Schema on Read
  6. 1&2: Consolidate (Data) & Separate (Storage & Compute)
     • Use S3 as the data lake storage tier, not a single analytics tool like Hadoop or a data warehouse
     • Decoupled storage and compute is cheaper and more efficient to operate
     • Decoupled storage and compute allow us to evolve to clusterless architectures (e.g. Lambda, Athena & Glue)
     • Do not build data silos in Hadoop or the EDW
     • Gain flexibility to use all the analytics tools in the ecosystem around S3 & future-proof the architecture
  7. Why Choose Amazon S3 for a Data Lake?
     Durable
     • Designed for 11 9s of durability
     Secure
     • Multiple Encryption Options
     • Robust/Highly Flexible Access Controls
     High performance
     • Multipart upload
     • Range GET
     • Scalable Throughput
     Scalable
     • Store as much as you need
     • Scale storage and compute independently
     • Scale without limits
     • Affordable
     Integrated
     • Amazon EMR
     • Amazon Redshift
     • Amazon DynamoDB
     • Amazon Athena
     • Amazon Rekognition
     • Amazon Glue
     Easy to use
     • Simple REST API
     • AWS SDKs
     • Read-after-create consistency
     • Event notification
     • Lifecycle policies
     • Simple Management Tools
     • Hadoop compatibility
  8. Case Study: Re-architecting Compliance
     “For our market surveillance systems, we are looking at about 40% [savings with AWS], but the real benefits are the business benefits: We can do things that we physically weren’t able to do before, and that is priceless.” - Steve Randich, CIO
     What FINRA needed
     • Infrastructure for its market surveillance platform
     • Support of analysis and storage of approximately 75 billion market events every day
     • Store 5PB of historical data for analysis & training
     Why they chose AWS
     • Fulfillment of FINRA’s security requirements
     • Ability to create a flexible platform using dynamic clusters (Hadoop, Hive, and HBase), Amazon EMR, and Amazon S3
     Benefits realized
     • Increased agility, speed, and cost savings
     • Estimated savings of $10-20m annually by using AWS
  9. 3: Implement the Right Security Controls
     Security
     • Identity and Access Management (IAM) policies
     • Bucket policies
     • Access Control Lists (ACLs)
     • Private VPC endpoints to Amazon S3
     • SSL endpoints
     Encryption
     • Server Side Encryption (SSE-S3)
     • S3 Server Side Encryption with provided keys (SSE-C, SSE-KMS)
     • Client-side Encryption
     Compliance
     • Bucket access logs
     • Lifecycle Management Policies
     • Access Control Lists (ACLs)
     • Versioning & MFA deletes
     • Certifications: HIPAA, PCI, SOC 1/2/3 etc.
  10. 4: Choose the Right Ingestion Methods
      AWS Snowball & Snowmobile
      • Accelerate PBs with AWS-provided appliances
      • 50, 80, 100 TB models
      • 100PB Snowmobile
      AWS Storage Gateway
      • Instant hybrid cloud
      • Up to 120 MB/s cloud upload rate (4x improvement), and
      Amazon Kinesis Firehose
      • Ingest device streams directly into AWS data stores
      AWS Direct Connect
      • COLO to AWS
      • Use native copy tools
      Native/ISV Connectors
      • Sqoop, Flume, DistCp
      • Commvault, Veritas, etc.
      Amazon S3 Transfer Acceleration
      • Move data up to 300% faster using AWS’s private network
  11. 5: Catalog Your Data
      [Diagram labels] S3: Put data in S3; Amazon DynamoDB; Amazon Elasticsearch Service; Metadata; Search capabilities
      What is in the data lake? Documents the data lake; Summary statistics; Classification; Data Sources
      Glue: Coming Mid-year
  12. Amazon Glue (in Preview)
      Glue automates the undifferentiated heavy lifting of ETL
      • Cataloging data sources
      • Identifying data formats and data types
      • Generating Extract, Transform, Load code
      • Executing ETL jobs; managing dependencies
      • Handling errors
      • Managing and scaling resources
  13. 6: Keep More Data
      • S3 Standard: active data; access in milliseconds; $0.021/GB/mo
      • S3 Standard - Infrequent Access: infrequently accessed data; access in milliseconds; $0.0125/GB/mo
      • Amazon Glacier: archive data; access in minutes to hours; $0.004/GB/mo
  14. 7: Use Athena for Ad Hoc Data Exploration
      Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL
  15. Athena is Serverless
      • No infrastructure or administration
      • Zero spin-up time
      • Transparent upgrades
  16. Query Data Directly from Amazon S3
      • No loading of data
      • Query data in its raw format
      • Athena supports multiple data formats
        • Text, CSV, TSV, JSON, weblogs, AWS service logs
        • Or convert to an optimized form like ORC or Parquet for the best performance and lowest cost
      • No ETL required
      • Stream data directly from Amazon S3
  17. 8: Use the Right Data Formats
      • Pay by the amount of data scanned per query
      • Use Compressed Columnar Formats
        • Parquet
        • ORC
      • Easy to integrate with a wide variety of tools

      Dataset                                  Size on Amazon S3       Query run time   Data scanned            Cost
      Logs stored as text files                1 TB                    237 seconds      1.15 TB                 $5.75
      Logs stored in Apache Parquet format*    130 GB                  5.13 seconds     2.69 GB                 $0.013
      Savings                                  87% less with Parquet   34x faster       99% less data scanned   99.7% cheaper
  18. 9: Choose the Right Tools
      • Amazon Redshift: Enterprise Data Warehouse
      • Amazon EMR: Hadoop/Spark
      • Amazon Athena: Clusterless SQL
      • Amazon Glue: Clusterless ETL
      • Amazon Aurora: Managed Relational Database
      • Amazon Machine Learning: Predictive Analytics
      • Amazon QuickSight: Business Intelligence/Visualization
      • Amazon Elasticsearch Service: Elasticsearch
      • Amazon ElastiCache: Redis In-memory Datastore
      • Amazon DynamoDB: Managed NoSQL Database
      • Amazon Rekognition & Amazon Polly: Image Recognition & Text-to-Speech AI APIs
      • Amazon Lex: Voice or Text Chatbots
  19. A Sample Data Lake Pipeline
      Ad-hoc access to data using Athena
      Athena can query aggregated datasets as well
  20. AWS Data Lake Analytic Capabilities
      [Architecture diagram; labels only] Data Sources; Transactional Data; Amazon S3 Data Lake; Amazon Kinesis Streams & Firehose; Streaming Analytics Tools (AWS Lambda, Spark Streaming on EMR, Apache Storm on EMR, Apache Flink on EMR, Amazon Kinesis Analytics); Hadoop / Spark (Amazon EMR); Data Warehouse (Amazon Redshift); NoSQL Database (Amazon DynamoDB); Relational Database (Amazon Aurora); Amazon Elasticsearch Service; Predictive Analytics (Amazon Machine Learning); Clusterless SQL Query (Amazon Athena); Clusterless ETL (Amazon Glue); Amazon ElastiCache Redis; Any Open Source Tool of Choice on EC2; Serving Tier; Data Science Sandbox; Visualization / Reporting
  21. 10: Evolve as Needed
      • Use S3 as the storage repository for your data lake, instead of a Hadoop cluster or data warehouse
      • Decoupled storage and compute is cheaper and more efficient to operate
      • Decoupled storage and compute allow us to evolve to clusterless architectures like Athena
      • Do not build data silos in Hadoop or the Enterprise DW
      • Gain flexibility to use all the analytics tools in the ecosystem around S3 & future-proof the architecture
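
The cost column on slide 17 follows from Athena charging by the amount of data scanned. Below is a minimal sketch of that arithmetic. The rate of $5 per TB scanned is an assumption: it is not stated on the slide, but it is the rate that reproduces the slide's $5.75 and $0.013 figures when 1 TB is taken as 1024 GB.

```python
# Assumed price in USD per TB scanned. Not stated on the slide; chosen
# because it reproduces the slide's cost figures.
PRICE_PER_TB_USD = 5.0
GB_PER_TB = 1024


def athena_query_cost(scanned_gb: float, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Return the query cost in USD for a given amount of data scanned."""
    return scanned_gb / GB_PER_TB * price_per_tb


# Figures from slide 17: 1.15 TB scanned as text, 2.69 GB scanned as Parquet.
text_cost = athena_query_cost(1.15 * GB_PER_TB)
parquet_cost = athena_query_cost(2.69)

print(f"text: ${text_cost:.2f}, parquet: ${parquet_cost:.3f}")
```

Because the charge scales with bytes scanned rather than with dataset size, a columnar format that lets a query skip unneeded columns lowers cost as well as run time.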

Editor's Notes

  • As content quality improves and the need to support multiple ways of viewing it proliferates, we are facing the challenge of content gravity.
    While it’s relatively easy to process the media, it’s becoming exceedingly difficult to move it around and store it. For example, moving from HD to 4K and eventually 8K content may increase the storage footprint by 10x or more.
    Storage is not the only challenge here: as the content grows heavier, it becomes harder to transfer it quickly and cost-effectively to affiliates and partners in the supply chain. The conclusion is that you should strive to keep the data as close as possible to sufficient processing resources.
  • The native features of S3 are exactly what you want from a data lake:
    Replication across AZs for high availability and durability
    Massively parallel and scalable
    Storage scales independently of compute
    Low storage cost at < $0.025/GB

    This is nearly impossible to achieve with a fixed database cluster
  • The Financial Industry Regulatory Authority (FINRA), one of the largest independent securities regulators in the U.S., was established to help watch and regulate financial trading practices.

    To respond to rapidly changing market dynamics, FINRA moved its market surveillance platform to AWS to analyze and store approximately 75 billion market events every day. FINRA selected AWS because it offered the right services while fulfilling the company’s security requirements. By using dynamic clusters (Hadoop, Hive, and HBase), and services such as Amazon EMR and Amazon S3, FINRA was able to create a flexible platform that could adapt to changing market dynamics.

    By using AWS, FINRA has been able to increase agility, speed and cost savings while allowing them to operate at scale. The company estimates it will save $10 to $20 million annually by using AWS.
  • AWS has a broad set of capabilities that make security easy
    With all your data in S3 you have a variety of encryption options
    Client Side
    Server Side
    Encryption with KMS Keys
    You can extend encryption to a 3rd party provider
    We integrate with HSM as well
    IAM lets you create users and roles and restrict each of them to only the capabilities you allow
    You can set S3 bucket policies for IAM users
    S3 has a private VPC endpoint so you don’t need to exit your VPC via a NAT gateway
    And you have native features such as setting Lifecycle policies for your S3 data as well as bucket access logs.
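
As one illustration of combining bucket policies with the encryption options above, the sketch below builds a bucket policy document that denies uploads not encrypted with SSE-KMS. It only constructs the JSON; the function name and the bucket name `example-data-lake` are hypothetical, and a real policy should be reviewed against your own access requirements before it is applied.

```python
import json


def require_sse_kms_policy(bucket: str) -> dict:
    """Build a bucket policy that rejects PutObject requests lacking SSE-KMS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedObjectUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # Deny when the request does not ask for KMS-managed encryption.
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            }
        ],
    }


# Hypothetical bucket name, used only for illustration.
policy = require_sse_kms_policy("example-data-lake")
print(json.dumps(policy, indent=2))
```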
  • EBS: Raised max throughput to 320 MB/sec (PIOPS) and 160 MB/sec (GP2), plus larger and faster SSD volumes (raised max volume size from 1 TB to 16 TB)

    Snowball: Physical storage device by AWS to accelerate PB-scale data transfer with AWS-provided appliances
    Kinesis Firehose: Ingest data streams directly into AWS data stores (S3 and Redshift). You can use Amazon Kinesis to ingest data from hundreds of thousands of sensors processing hundreds of terabytes of data per hour.
    Zero administration: Capture and deliver streaming data into S3, Redshift, and other destinations without writing any applications or managing infrastructure.
    Direct-to-data-store integration: Batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 seconds using simple configurations.
    Seamless Elasticity: Seamlessly scales to match data throughput without intervention.
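
The batching behavior described above can be pictured as a buffer that is flushed when either a size threshold or a time interval is reached. The sketch below is a plain-Python simulation and not the Firehose API: the class name, the size threshold, and the explicit `now` timestamps are hypothetical, and only the 60-second interval comes from the note above.

```python
from typing import List


class DeliveryBuffer:
    """Simulated delivery buffer: flushes on a size or age threshold."""

    def __init__(self, max_bytes: int = 5 * 1024 * 1024, max_seconds: float = 60.0):
        self.max_bytes = max_bytes        # hypothetical size threshold
        self.max_seconds = max_seconds    # interval taken from the note above
        self.delivered: List[bytes] = []  # stands in for objects written to S3
        self._records: List[bytes] = []
        self._size = 0
        self._opened_at = 0.0

    def put(self, record: bytes, now: float) -> None:
        """Add a record; flush if the batch is large enough or old enough."""
        if not self._records:
            self._opened_at = now         # first record opens a new batch
        self._records.append(record)
        self._size += len(record)
        # Simplification: age is only checked when a record arrives.
        if self._size >= self.max_bytes or now - self._opened_at >= self.max_seconds:
            self.flush()

    def flush(self) -> None:
        """Deliver the current batch as one concatenated object."""
        if self._records:
            self.delivered.append(b"".join(self._records))
            self._records = []
            self._size = 0


buf = DeliveryBuffer(max_bytes=10, max_seconds=60.0)
buf.put(b"abcd", now=0.0)
buf.put(b"efgh", now=1.0)
buf.put(b"ijkl", now=2.0)   # 12 bytes buffered: size threshold reached
buf.put(b"mn", now=10.0)    # opens a second batch
buf.put(b"op", now=75.0)    # 65 seconds after the batch opened: age threshold reached
print(buf.delivered)
```

Larger thresholds produce fewer, bigger objects in the destination, at the price of higher delivery latency.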

  • Show me all my customer data
    Search is important: how to discover what is there, where it is, etc.
    (Glue will replace later)
    Is this one step too far?
    (Benefits of an AWS data lake slide: data governance, and what it is in terms of indexing, cataloging, and managing your data, rather than the nuts and bolts of a data catalog.)
    Use the topic of data governance itself
    Elasticsearch is also used for querying the data lake itself: load processed data into Elasticsearch (integrated with the Hadoop workflow in a data lake?). Ask Bob Taylor about integrating an index search element into Hadoop.
  • Across the board, we provide 3 storage options with 3 different performance characteristics and price points. On the left, we have S3 Standard, which is our high-performance object storage for the internet, designed for very active, hot workloads. Data in S3 Standard is available in milliseconds, and pricing starts at $0.03/GB/month. On the right-hand side, we have Glacier, our cold storage service designed for long-term archival and infrequently accessed data. Data in Glacier has a 3-5 hour access latency, and pricing starts at $0.007/GB/month. Between the hot and cold options, we have a “warm” option: S3 Standard - Infrequent Access (S3-IA), designed for data you plan to access maybe a few times a year, or what we think of as “active archive”. S3-IA pricing starts at $0.0125/GB/month. From an archiving perspective, customers typically use S3-IA and Glacier together.

    A quick note on terminology: S3 stores data in buckets and each piece of data is an object; Glacier stores data in vaults (the equivalent of S3 buckets) and each piece of data is called an archive (similar to an object). You will hear me use bucket/vault/object/archive later on.
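
To make the price differences concrete, here is a small sketch that compares the monthly storage charge for the same dataset in each tier. It uses the per-GB prices shown on slide 13 (this note quotes different starting prices, so substitute whichever applies). The 100,000 GB dataset size is a hypothetical input, and the calculation deliberately ignores request fees, retrieval fees, and volume discounts.

```python
# Per-GB monthly prices in USD as shown on slide 13.
PRICE_PER_GB_MONTH = {
    "S3 Standard": 0.021,
    "S3 Standard - Infrequent Access": 0.0125,
    "Amazon Glacier": 0.004,
}


def monthly_storage_cost(size_gb: float, tier: str) -> float:
    """Storage-only monthly cost in USD (no request or retrieval fees)."""
    return size_gb * PRICE_PER_GB_MONTH[tier]


dataset_gb = 100_000  # hypothetical dataset size
costs = {tier: round(monthly_storage_cost(dataset_gb, tier), 2)
         for tier in PRICE_PER_GB_MONTH}

for tier, cost in costs.items():
    print(f"{tier}: ${cost:,.2f}/month")
```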
  • You simply put your Data in S3 and submit SQL against it
  • For a data lake, Athena won’t be the only application reading the data. ORC and Parquet were chosen because they are open source and are available for use with other analytics tools.

    You can use a few lines of PySpark code, running on Amazon EMR, to convert your files to Parquet for the best performance and cost
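
A minimal sketch of such a conversion, assuming the source logs are JSON and that PySpark is available (as it is on an EMR cluster with Spark installed). The function name, bucket, and prefixes are hypothetical.

```python
def convert_logs_to_parquet(source_uri: str, target_uri: str) -> None:
    """Read JSON logs from source_uri and rewrite them as Parquet at target_uri."""
    # Imported here so the module can be loaded where PySpark is not installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("logs-to-parquet").getOrCreate()
    try:
        logs = spark.read.json(source_uri)                 # schema is inferred
        logs.write.mode("overwrite").parquet(target_uri)   # columnar output
    finally:
        spark.stop()


# Example call with hypothetical locations (run on a cluster with S3 access):
# convert_logs_to_parquet("s3://example-data-lake/raw/logs/",
#                         "s3://example-data-lake/parquet/logs/")
```

Other source formats would use a different reader, such as `spark.read.csv` or `spark.read.text`, with parsing logic specific to the log layout.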

    When you create a table for Athena, you are essentially just creating metadata and, as you run queries, the schema is applied to the data.
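
As a sketch of what that metadata looks like, the helper below assembles a `CREATE EXTERNAL TABLE` statement of the kind Athena accepts. The helper name, database, table, columns, and S3 location are all hypothetical; the statement only describes data that already sits in S3 and does not move or load it.

```python
from typing import Dict


def external_table_ddl(database: str, table: str, columns: Dict[str, str],
                       location: str, stored_as: str = "PARQUET") -> str:
    """Build a CREATE EXTERNAL TABLE statement pointing at data in S3."""
    cols = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {cols}\n"
        f")\n"
        f"STORED AS {stored_as}\n"
        f"LOCATION '{location}'"
    )


# Hypothetical schema and location, for illustration only.
ddl = external_table_ddl(
    database="weblogs",
    table="requests",
    columns={"request_time": "timestamp", "client_ip": "string", "status": "int"},
    location="s3://example-data-lake/parquet/logs/",
)
print(ddl)
```

Dropping such a table removes only the metadata; the objects in S3 are left untouched.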

    Data is streamed to Athena from S3; it is not copied, and there is no ETL (extraction, transformation, and load) step. This makes Athena ideal for customers using S3 as a data lake

    No loading of data required. Query data where it lives.
  • QuickSight can talk to Athena using a JDBC driver