Building Data Lakes and Analytics on AWS

In this session, we show you how to understand what data you have, how to drive insights, and how to make predictions using purpose-built AWS services. Learn about the common pitfalls of building data lakes and discover how to successfully drive analytics and insights from your data. Also learn how services such as Amazon S3, AWS Glue, Amazon Redshift, Amazon Athena, Amazon EMR, Amazon Kinesis, and Amazon ML services work together to build a successful data lake for various roles, including data scientists and business users.

  1. Build Data Lakes and Analytics on AWS: Patterns and Best Practices
  2. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Big Data: Different forms of challenges. Visualization, Variability, Volume, Velocity, Variety, Veracity, Value.
  3. Challenges are often driven by: data growth faster than ever; data variety is increasing. Sources: https://www.promptcloud.com, https://john-popelaars.blogspot.com, https://ww.signiant.com, https://www.linkedin.com/pulse/world-today-data-rich-information-poor-guru-p-mohapatra-pmp/
  4. AWS Data Lake helps address this. Quickly ingest and store any type of data. Insights and security, together … Run the right tool for the right job without manually copying data around.
  5. Data Lakes from AWS: Analytics; Machine Learning; Real-time data movement; Traditional data movement; Data Lake on AWS; Ingestion; Intelligence; Storage; Catalog. Variety of ingestion tools. Decoupled analytics from storage/catalog.
  6. What data do I have?
  7. What data do I have? Gartner: “Through 2018, 80% of data lakes will not include effective metadata management capabilities, making them inefficient.” “Metadata Is the Fish Finder in Data Lake.” Data Lake on AWS: Storage | Archival Storage | Data Catalog
  8. AWS Glue components. Data Catalog (Discover): Apache Hive Metastore compatible; integrated with AWS services; automatic crawl and discover data. Job Authoring (Develop): auto-generates ETL code; Python and Apache Spark; edit, debug, and share. Job Execution (Deploy): serverless execution; flexible scheduling; monitoring and alerting.
  9. What can crawlers discover? IAM Role; AWS Glue Crawler. Connections: Databases and Amazon Redshift (JDBC connection); Amazon S3 (object connection); Amazon DynamoDB (NoSQL connection). Built-in classifiers: MySQL, MariaDB, PostgreSQL, Aurora, Oracle, Amazon Redshift, Avro, Parquet, ORC, XML, JSON & JSONPaths, AWS CloudTrail, BSON, Logs (Apache (Grok), Linux (Grok), MS (Grok), Ruby, Redis, and many others), Delimited (comma, pipe, tab, semicolon) <ALWAYS GROWING…>. Create additional custom classifiers.
  10. But I have my own data formats …? There is a custom classifier for that … Row-based: Grok classifier. A grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time. XML: XML classifier. The XML tag that defines a table row in the XML document. JSON: JSON classifier. The JSON path to the object, array, or value that defines a row of the table being created. Type the name in either dot or bracket JSON syntax using AWS Glue supported operators.
  11. Other ways of populating the catalog: call the AWS Glue CreateTable API; create a table manually; run a DDL statement (in Amazon Athena or Amazon EMR); import from Apache Hive Metastore (diagram: Apache Hive Metastore, AWS Glue ETL, AWS Glue Data Catalog).
  12. How do I hydrate my data lake?
  13. How do I drive value? Machine learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly. Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight. Traditional data movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service. Real-time data movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams. Data Lake on AWS: Storage | Archival Storage | Data Catalog
  14. Ingest data based on the type of data. Open and comprehensive. Data movement from your datacenters: AWS Direct Connect (dedicated network connection); AWS Snowball (secure appliances); AWS Snowmobile (ruggedized shipping container); AWS Database Migration Service (database migration); AWS Storage Gateway (gateway that lets applications write to the cloud). Data movement from real-time sources: AWS IoT Core (connect devices to AWS); Amazon Kinesis Data Firehose and Amazon Kinesis Data Streams (real-time data streams); Amazon Kinesis Video Streams (real-time video streams). Amazon S3, Amazon Glacier, AWS Glue.
  15. Real-time data movement and data lakes on AWS. Kinesis Agent, Apache Kafka, AWS SDK, LOG4J, Flume, Fluentd, AWS Mobile SDK, Kinesis Producer Library. Amazon Kinesis Data Streams; Amazon Kinesis Data Firehose. Data Lake on AWS: Amazon S3 (data); AWS Glue Data Catalog (data definition).
  16. IMPORTANT: Ingest data in its raw form … Open and comprehensive. Amazon S3, Amazon Glacier, AWS Glue. Store the data in its raw form BEFORE transforming, analyzing, manipulating, or doing … anything … to it. CSV, ORC, Grok, Avro, Parquet, JSON. This becomes your source of record you can always go back to … Lifecycle policies allow you to shift it to warm and cold storage.
  17. Tiered storage to optimize price/performance. Tiers: Amazon S3 Standard; Amazon S3 Standard-Infrequent Access; Amazon S3 One Zone-Infrequent Access; Amazon Glacier (lowest cost). Access spectrum: Active, Infrequent, Archive. Migrate between tiers based on lifecycle policies. Store data at $0.023*/GB/month with Amazon S3. Store data at $0.004*/GB/month with Amazon Glacier.
  18. Datasets in the Lake? Raw datasets: immutable datasets that you can always go back to. Abstract out the complexities of how the data is stored through the catalog and SerDes. Optimizing analytics and machine learning: curated datasets, query-optimized for consumption across a wide number of tools.
  19. Preparing raw data for consumption. Raw data stored in Data Lake: raw ingestion. Preparation: normalized, partitioned, compressed, storage optimized. Extract, Load, Transform (ELT). Curated datasets. Data Catalog.
  20. Which tool should I use to analyze my data?
  21. How do I drive value? Machine learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly. Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight. Traditional data movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service. Real-time data movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams. Data Lake on AWS: Storage | Archival Storage | Data Catalog
  22. Different tools for different users… Business Reporting; Data Catalog; Central Storage; SageMaker; Machine Learning/Deep Learning; Data Scientists; Data Engineer.
  23. Amazon Athena – interactive analysis. Interactive query service to analyze data in Amazon S3 using standard SQL. No infrastructure to set up or manage and no data to load. Ability to run SQL queries on data archived in Amazon Glacier (coming soon). Query instantly: zero setup cost; just point to Amazon S3 and start querying. Pay per query: pay only for queries run; save 30%–90% on per-query costs through compression. Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types. Easy: serverless, with zero infrastructure and zero administration; integrated with Amazon QuickSight.
  24. Amazon EMR – big data processing. Analytics and ML at scale: 19 open-source projects (Apache Hadoop, Spark, HBase, Presto, and more); enterprise-grade security. Latest versions: updated with the latest open source frameworks within 30 days of release. Low cost: flexible billing with per-second billing, Amazon EC2 Spot, Reserved Instances, and Auto Scaling to reduce costs 50%-80%. Use Amazon S3 storage: process data directly in the Amazon S3 data lake securely with high performance using the EMRFS connector. Easy: launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, cluster tuning.
  25. Hadoop / Spark Analytics on AWS. Amazon EMR (managed Hadoop / Spark): YARN (Hadoop Resource Manager) with batch, script, interactive, real-time, machine learning, and NoSQL workloads. Data Lake on AWS: Amazon S3 (object storage).
  26. Fitting this into the Common Data Catalog. Amazon S3: source of truth, stores the data. AWS Glue Data Catalog: unified data view, describes the data. Amazon EMR interactive Spark cluster (EMRFS, HDFS). Amazon EMR transient ETL job (EMRFS, HDFS). MySQL DB instance. …
  27. Amazon Redshift – data warehousing. Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost. Massively parallel, scale from gigabytes to petabytes. Fast at scale: columnar storage technology to improve I/O efficiency and scale query performance. Inexpensive: as low as $1,000 per terabyte per year, 1/10 the cost of traditional data warehouse solutions; start at $0.25 per hour. Open file formats: analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3. Secure: audit everything; encrypt data end-to-end; extensive certification and compliance.
  28. Data warehouse … Amazon Redshift data warehouse: relational data; gigabytes to petabytes scale; reporting and analysis; schema defined prior to data load. AWS Glue ETL; On Prem; Redshift COPY. Amazon QuickSight; existing or new BI tool.
  29. A Data Lake is not an Enterprise Data Warehouse (EDW). Data Lake: complementary to EDW (not a replacement). EDW: can be sourced from the Data Lake. Data Lake: schema on read (no predefined schemas). EDW: schema on write (predefined schemas). Data Lake: structured/semi-structured/unstructured data. EDW: structured data only. Data Lake: fast ingestion of new data/content. EDW: time-consuming to introduce new content. Data Lake: data science + prediction/advanced analytics + BI use cases. EDW: BI use cases. Data Lake: data at low level of detail/granularity. EDW: data at summary/aggregated level of detail. Data Lake: loosely defined SLAs. EDW: tight SLAs (production schedules). Data Lake: flexibility in tools (open source/tools for advanced analytics). EDW: limited flexibility in tools (SQL only). Data Lake: elastic storage and compute capacity, decoupled. EDW: explicitly sized environments, compute and storage scaled linearly.
  30. Amazon Redshift Spectrum. Extend the data warehouse to exabytes of data in the Amazon S3 data lake. Exabyte Redshift SQL queries against Amazon S3. Join data across Redshift and Amazon S3. Scale compute and storage separately. Stable query performance and unlimited concurrency. CSV, ORC, Grok, Avro, & Parquet data formats. Pay only for the amount of data scanned. Diagram: Amazon S3 data lake; Amazon Redshift data; Amazon Redshift Spectrum query engine.
  31. Amazon Redshift Spectrum: Query your Data Lake. Amazon Redshift (JDBC/ODBC); COPY commands for hot data. Amazon Redshift Spectrum: scale-out serverless compute (1, 2, 3, 4, ... N). AWS Glue Data Catalog. Query directly on Data Lake.
  32. Data Lakes extend the traditional data warehouse. Data warehouse: business intelligence; OLTP, ERP, CRM, LOB. Data lake: big data processing, real-time, machine learning; devices, web, sensors, social. Relational and nonrelational data. TBs–EBs scale. Diverse analytical engines. Low-cost storage & analytics.
  33. Machine Learning & Big Data
  34. Big Data driving Machine Learning. Better decisions. Object storage, databases, data warehouse, streaming analytics, BI, Hadoop, Spark/Presto, Elasticsearch. Better products. Machine learning, deep learning/AI. More users. More data: click stream, user activity, generated content, purchases, clicks, likes, sensor data.
  35. Agility in Machine Learning. Machine learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly. Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight. On-premises data movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service. Real-time data movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams. Data Lake on AWS: Storage | Archival Storage | Data Catalog
  36. In Summary…
  37. Core Tenets: data lakes and data warehouses complement each other; loose coupling, but highly performant (storage, analytics, metadata management, etc.); future-proof your analytics; choose the best tool for the job; elasticity and multiple clusters for dedicated purposes; replace capacity planning with a consumption model; don’t forget metadata management.
  38. Use the right storage tier and data format. Data structure → fixed schema, JSON, key-value. Access patterns → store data in the format you will access it. Data characteristics → hot, warm, cold. Cost → right cost.
  39. Thank you!
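
The tiering advice on slides 16 and 17 can be made concrete with a little arithmetic. What follows is a minimal sketch, not an AWS-provided tool. The two prices are the asterisked figures quoted on slide 17 and may not match current pricing; the function names, the `raw/` prefix, and the 30- and 90-day thresholds used below are illustrative assumptions. The rule dictionary mirrors the shape of an S3 lifecycle rule, but nothing here calls AWS.

```python
# Per-GB monthly prices as quoted on slide 17 (asterisked there, so treat
# them as indicative figures, not current list prices).
S3_STANDARD_USD_PER_GB_MONTH = 0.023
GLACIER_USD_PER_GB_MONTH = 0.004


def monthly_storage_cost(hot_gb, archive_gb):
    """Monthly storage cost in USD for data split across the two tiers.

    Ignores request, retrieval, and transition charges (an assumption
    made to keep the sketch small).
    """
    return (hot_gb * S3_STANDARD_USD_PER_GB_MONTH
            + archive_gb * GLACIER_USD_PER_GB_MONTH)


def raw_zone_lifecycle_rule(prefix, ia_after_days, glacier_after_days):
    """Build one lifecycle rule that moves objects under `prefix` to colder tiers.

    The prefix and day thresholds are hypothetical; choose them from your
    own access patterns.
    """
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("expected 0 < ia_after_days < glacier_after_days")
    return {
        "ID": "tier-down-" + prefix.strip("/"),
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }
```

With the slide's figures, `monthly_storage_cost(10_000, 0)` comes to roughly 230 USD per month, while `monthly_storage_cost(2_000, 8_000)` comes to roughly 78 USD. A rule such as `raw_zone_lifecycle_rule("raw/", 30, 90)` could be placed in the `Rules` list passed to boto3's `put_bucket_lifecycle_configuration`; that call is deliberately left out so the sketch runs without credentials.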

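Slide 10 describes a grok pattern as a named set of regular expressions matched against the data one line at a time. The sketch below illustrates that idea in plain Python. It is not AWS Glue's implementation. The `%{NAME:field}` token form is standard grok syntax, but the four-entry pattern library and the helper names `grok_to_regex` and `classify_lines` are assumptions made for illustration.

```python
import re

# A tiny, hypothetical pattern library. Real grok distributions ship many
# more named patterns than these four.
NAMED_PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "INT": r"[+-]?\d+",
    "NOTSPACE": r"\S+",
}

# Matches grok tokens of the form %{PATTERN_NAME:field_name}.
_GROK_TOKEN = re.compile(r"%\{(\w+):(\w+)\}")


def grok_to_regex(grok):
    """Expand each %{NAME:field} token into a named regex capture group."""
    def expand(token):
        name, field = token.group(1), token.group(2)
        return "(?P<{}>{})".format(field, NAMED_PATTERNS[name])
    return re.compile(_GROK_TOKEN.sub(expand, grok))


def classify_lines(grok, lines):
    """Return one dict (a table row) per matching line; skip the rest."""
    pattern = grok_to_regex(grok)
    rows = []
    for line in lines:
        match = pattern.fullmatch(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows
```

For example, the pattern `"%{IP:client} %{WORD:method} %{NOTSPACE:path} %{INT:status}"` applied to the line `10.0.0.1 GET /index.html 200` yields a single row with the fields `client`, `method`, `path`, and `status`, all as strings. A line that does not fit the pattern produces no row.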
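
Slides 19, 23, and 30 link cost to how much data a query reads: curated data is partitioned and compressed, and Athena and Redshift Spectrum charge for the data scanned. The sketch below shows why partitioning helps under those conditions. It is a rough illustration: the `curated/` prefix, the function names, and the use of 10**12 bytes per terabyte are assumptions, and the price per terabyte is a parameter because the slides do not state one.

```python
from datetime import date

BYTES_PER_TB = 10 ** 12  # decimal terabyte, assumed for simplicity


def partition_prefix(table, day):
    """Hive-style key prefix for one daily partition of a curated table."""
    return "curated/{}/year={:04d}/month={:02d}/day={:02d}/".format(
        table, day.year, day.month, day.day
    )


def pruned_bytes(partition_sizes, wanted_days):
    """Bytes read when the engine touches only the requested partitions.

    `partition_sizes` maps a date to the stored size of that partition.
    """
    return sum(size for day, size in partition_sizes.items() if day in wanted_days)


def scan_cost_usd(bytes_scanned, usd_per_tb):
    """Cost of a query that is billed purely on bytes scanned."""
    return bytes_scanned / BYTES_PER_TB * usd_per_tb
```

Given three daily partitions of 400 GB, 300 GB, and 300 GB and an assumed price of 5 USD per terabyte scanned, a query over all three days costs about 5 USD, while a query restricted to one 300 GB day costs about 1.50 USD. Compression works the same way: a smaller stored partition is a smaller number in `partition_sizes`.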