Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

STG206_Big Data Data Lakes and Data Oceans

778 views

Published on

Increasingly, valuable customer data sources are dispersed among on-premises data centers, SaaS providers, partners, third-party data providers, and public datasets. Building a data lake on AWS offers a foundation for storing on-premises, third-party, and public datasets cost effectively with high performance. This workshop introduces AWS tools and technologies you can use to analyze and extract value from petabyte-scale datasets, including Amazon Athena and Amazon Redshift Spectrum.

STG206_Big Data Data Lakes and Data Oceans

  1. 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS re:INVENT Big Data: Data Lakes and Data Oceans L e x C r o s e t t , P r i n c i p a l S o l u t i o n s A r c h i t e c t S a j e e M a t h e w , P r i n c i p a l S o l u t i o n s A r c h i t e c t E r i c k D a m e , E n t e r p r i s e S o l u t i o n s A r c h i t e c t J o h n M a l l o r y , B u s i n e s s D e v e l o p m e n t M a n a g e r November 28, 2017
  2. 2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What to Expect from the Workshop • Data lake and analytics review (30 min.) • Set up a data lake using an AWS solution (30 min.) • Add information to the data lake (30 min.) • Perform lightweight analysis with AWS big data tools (1 hour)
  3. 3. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Introduction to Data Lake Concepts
  4. 4. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Unlocking Data Most companies and organizations are embarking on ambitious innovation initiatives to unlock their data The data already exists but goes unused or is locked away from complimentary data sets in isolated data silos
  5. 5. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Challenges with Legacy Data Architectures • Can’t move data across silos • Can’t deal with dynamic data and real-time processing • Can’t deal with format diversity and change rate • Complex ETL processes • Difficult to find people with adequate skills to configure and manage these systems • Can’t integrate with the explosion of available social and behavior tracking data
  6. 6. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Legacy Data Architectures Exist as Isolated Data Silos Hadoop Cluster SQL Database Data Warehouse Appliance
  7. 7. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Legacy Data Architectures Are Monolithic Multiple layers of functionality all on a single cluster CPU Memory HDFS Storage CPU Memory HDFS Storage CPU Memory HDFS Storage Hadoop Master Node
  8. 8. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Evolution of Data Architectures 2009: Decoupled EMR architecture CPU Memory Hadoop Master Node CPU Memory CPU Memory Improvements • Decoupled storage and compute • Scale CPU and memory resources independently and up and down • Only pay for the 500 TB dataset (not 3X) • Multi-physical facility replication via Amazon S3 • Multiple clusters can run in parallel against shared data in Amazon S3 • Each job gets its own optimized cluster. For example, Spark on memory intensive, Hive on CPU intensive, HBase on I/O intensive, and so on Constraints • Still have a cluster to provision and manage • Must expose EMR cluster to SQL users via Hive, Presto, and so on S3 as HDFS
  9. 9. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Evolution of Data Architectures 2012: Amazon Redshift—cloud DW Improvements Constraints • Still have to load data into a schema Leader node Compute node 10 GigE (HPC) Ingestion Backup Restore Customer VPC Internal VPC BI tools SQL clientsAnalytics tools Compute node Compute node JDBC/ODBC • Automated installation, patching, backups • No servers to manage and maintain • MPP columnar relational database • $1,000/TB/year • Accessible to any ODBC or JDBC BI Tool
  10. 10. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Enter Data Lake Architectures Data lake is a new and increasingly popular architecture to store and analyze massive volumes and heterogeneous types of data
  11. 11. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Benefits of a Data Lake—All Data is in One Place Analyze all of your data, from all of your sources, in one stored location “Why is the data distributed in many locations? Where is the single source of truth?”
  12. 12. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Benefits of a Data Lake—Quick Ingest Quickly ingest data without needing to force it into a predefined schema “How can I collect data quickly from various sources and store it efficiently?”
  13. 13. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Benefits of a Data Lake—Storage vs. Compute Separating your storage and compute allows you to scale each component as required “How can I scale up with the volume of data being generated?”
  14. 14. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Benefits of a Data Lake—Schema on Read “Is there a way I can apply multiple analytics and processing frameworks to the same data?” A data lake enables ad-hoc analysis by applying schemas on read, not write
  15. 15. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Approach to Data Lake
  16. 16. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon S3 is the Data Lake
  17. 17. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Designed for 11 9s of durability Designed for 99.99% availability Durable Available High performance  Multiple upload  Range GET  Scalable throughput  Store as much as you need  Scale storage and compute independently  No minimum usage commitments Scalable  Amazon EMR  Amazon Redshift Spectrum  Amazon DynamoDB  Amazon Athena  AWS Glue  Amazon Rekognition  Amazon Macie Integrated  Simple REST API  AWS SDKs  Simple management tools  Event notification  Lifecycle policies Easy to use Why Amazon S3 for a Data Lake?
  18. 18. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Benefits of an AWS Amazon S3 Data Lake Fixed cluster data lake AWS Amazon S3 data lake • Limited to only the single tool contained on the cluster (for example, Hadoop or data warehouse or Cassandra). Use cases and ecosystem tools change rapidly. • Expensive to add nodes to add storage capacity • Expensive to replicate data against node loss • Complexity in scaling local storage capacity • Long refresh cycles to add additional storage equipment • Decouple storage and compute by making S3 object based storage, not a fixed tool to manage the data lake • Flexibility to use any and all tools in the ecosystem. The right tool for the job. • Catalog, transform, and query in place • Future-proof your architecture. As new use cases and tools emerge you can plug and play current best of breed.
  19. 19. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Building a Data Lake on AWS Kinesis Firehose Athena Query Service
  20. 20. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Processing and Analytics Real-time Batch AI and Predictive BI and Data Visualization Transactional and RDBMS AWS Lambda Apache Storm on EMR Apache Flink on EMR Spark Streaming on EMR Elasticsearch Service Kinesis Analytics, Kinesis Streams DynamoDB NoSQL DB Relational Database Aurora EMR Hadoop, Spark, Presto Amazon Redshift Data Warehouse Amazon Athena Query Service Amazon Lex Speech recognition Amazon Rekognition Amazon Polly Text to speech Machine Learning Predictive analytics Kinesis Streams and Firehose
  21. 21. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Further Evolution of Data Lake Architectures Today: clusterless Improvements • No cluster/infrastructure to manage • Business users and analysts can write SQL without having to provision a cluster or touch infrastructure • Pay by the query • Zero administration • Process data where it lives Constraints • Limited to SQL, Hive, and Spark jobs today • More frameworks to come! SQL Interface in web browser Amazon Athena for SQL S3 Data Lake AWS Glue for ETL S3 Data Lake Spark and Hive Interface in web browser
  22. 22. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Open Data Formats—Free Your Data! • Parquet, ORC, Avro, JSON, CSV, others • Allows multiple analytic tools to operate on same data • Proprietary data = countdown to next migration
  23. 23. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Use an Optimal Combination of Highly Interoperable Services Amazon QuickSight 4. Visualize all your data
  24. 24. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Summary of AWS Analytics, Database, and AI Tools Amazon Redshift Enterprise Data Warehouse Amazon EMR Hadoop/Spark Amazon Athena Clusterless SQL AWS Glue Clusterless ETL Amazon Aurora Managed Relational Database Amazon Machine Learning Predictive Analytics Amazon QuickSight Business Intelligence/Visualization Amazon Elasticsearch Service Elasticsearch Amazon ElastiCache Redis In-memory Datastore Amazon DynamoDB Managed NoSQL Database Amazon Rekognition Deep Learning-based Image Recognition Amazon Lex Voice or Text Chatbots
  25. 25. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Innovations to Produce More Efficient Data Lake Architectures
  26. 26. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Direct Connect AWS Snowball ISV Connectors Amazon Kinesis Firehose Amazon S3 Transfer Acceleration AWS Storage Gateway Data Ingestion into Amazon S3
  27. 27. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Kinesis Firehose Load massive volumes of streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch Zero administration: capture and deliver streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch without writing an application or managing infrastructure Direct-to-data store integration: batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 seconds using simple configurations Seamless elasticity: seamlessly scales to match data throughput without intervention Capture and submit streaming data to Firehose Analyze streaming data using your favorite BI tools Firehose loads streaming data continuously into Amazon S3, Amazon Redshift, and Amazon Elasticsearch
  28. 28. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Encryption ComplianceSecurity  Identity and access Management (IAM) policies  Bucket policies  Access Control Lists (ACLs)  Private VPC endpoints to Amazon S3  Amazon S3 object tagging to manage access policies  SSL endpoints  Server-side encryption (SSE-S3)  S3 server-side encryption with provided keys (SSE-C, SSE-KMS)  Client-side encryption  Buckets access logs  Lifecycle management policies  Access Control Lists (ACLs)  Versioning and MFA deletes  Certifications—HIPAA, PCI, SOC 1/2/3, etc. Implement the Right Cloud Security Controls
  29. 29. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Lake Best Practices • Use Amazon S3 as the storage repository for your data lake, instead of a Hadoop cluster or data warehouse • Decoupled storage and compute is cheaper and more efficient to operate • Decoupled storage and compute allow us to evolve to clusterless architectures like Amazon Athena and AWS Glue • Do not build data silos in Hadoop or the Enterprise DW • Use granular encryption, roles, and access controls to build a secure, multi-tenant centralized data platform • Gain flexibility to use all the analytics tools in the ecosystem around Amazon S3 and future-proof the architecture
  30. 30. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Workshop Process • Log in to your AWS account • Apply account credit • Open the workshop guide • Find and run the Data Lake Foundation on AWS • Once complete, load the Behavioral Risk Factor Surveillance system (BRFSS) data into Amazon S3 following the workshop instructions • Follow the analysis steps in the workshop guide • Run Delete Stack to remove resources when done
  31. 31. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Questions?
  32. 32. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. THANK YOU!

×