Big Data & Data Lakes Building Blocks

Traditional data storage and analytic tools no longer provide the agility and flexibility required to deliver relevant business insights. That’s why organizations are shifting to a data lake architecture. This approach allows you to store massive amounts of data in a central location so it's readily available to be categorized, processed, analyzed, and consumed by diverse organizational groups. In this session, we’ll assemble a data lake using services such as Amazon S3, Amazon Kinesis, Amazon Athena, Amazon EMR, and AWS Glue.

  1. Big Data and Data Lakes Building Blocks. Johan Broman, Manager, Solutions Architecture, AWS Nordics; Theo Hultberg, Director of Technology, Burt. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
  2. Data Drives Better Decision Making. Data lake leaders who were highly efficient in capturing a diversity of data and making it accessible to their organization in a timely fashion outperformed their peers by 9% in organic revenue growth.* Organic revenue growth: leaders 24%, followers 15%. *Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence.
  3. Traditionally, Analytics Looked Like This. Relational data; TBs–PBs scale; schema defined before data load; operational reporting and on demand; large initial capex + $10K–$50K / TB / year. Diagram: OLTP, ERP, CRM, LOB; data warehouse; business intelligence.
  4. Isolated data silos: Hadoop cluster, SQL database, data warehouse appliance.
  5. Enter Data Lake Architectures. A data lake is a new and increasingly popular architecture to store and analyze massive volumes and heterogeneous types of data.
  6. Data Lakes on AWS. Unmatched durability and availability at exabyte scale; comprehensive security, compliance, and audit capabilities; object-level controls; usage and cost analysis insight into your data; most ways to bring data in; twice as many partner integrations. Diagram: data lake (Amazon S3, Amazon Glacier, AWS Glue); Machine Learning; Analytics; Internet of Things; Snowball, Snowmobile, Kinesis Data Firehose, Kinesis Data Streams, Kinesis Video Streams.
  7. Why Amazon S3 for the Data Lake? Durable: designed for 11 9s of durability. Available: designed for 99.99% availability. High performance: multipart upload; Range GET. Scalable: store as much as you need; scale storage and compute independently; no minimum usage commitments. Integrated: Amazon Redshift / Spectrum; Amazon EMR; Amazon Athena; Amazon DynamoDB. Easy to use: simple REST API; AWS SDKs; read-after-create consistency; event notification; lifecycle policies.
  8. Data Lakes on AWS. Data lake: Amazon S3, Amazon Glacier, AWS Glue. Machine Learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly. Internet of Things (IoT): AWS IoT Core, AWS Greengrass, AWS IoT Analytics, Amazon FreeRTOS, AWS IoT 1-Click, AWS IoT Button, AWS IoT Device Management, AWS IoT Device Defender. Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight.
  9. Data Lakes Extend the Traditional Approach. Relational and non-relational data; TBs–EBs scale; schema defined during analysis; diverse analytical engines to gain insights; designed for low-cost storage and analytics. Diagram: OLTP, ERP, CRM, LOB; data warehouse; business intelligence; data lake; devices, web, sensors, social; catalog; machine learning, DW queries, big data processing, interactive, real-time.
  10. Storing is not enough. Data needs to be discoverable. “Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).” Gartner. Diagram: CRM, ERP, data warehouse, mainframe data, web, social, log files, machine data, semi-structured, unstructured.
  11. AWS Glue: Data Catalog. Make data discoverable. Automatically discovers data and stores schema; catalog makes data searchable and available for ETL; catalog contains table and job definitions; computes statistics to make queries efficient. Diagram: AWS Glue Data Catalog; discover data and extract schema; compliance.
  12. Data preparation accounts for ~80% of the work. Chart categories: building training sets; cleaning and organizing data; collecting data sets; mining data for patterns; refining algorithms; other.
  13. AWS Glue: ETL Service. Make ETL scripting and deployment easy. Automatically generates ETL code; code is customizable with Python and Spark; endpoints provided to edit, debug, and test code; jobs are scheduled or event-based; serverless.
  14. Amazon Athena: Interactive Analysis. Interactive query service to analyze data in Amazon S3 using standard SQL. No infrastructure to set up or manage and no data to load. Ability to run SQL queries on data archived in Amazon Glacier (coming soon). Query instantly: zero setup cost; just point to Amazon S3 and start querying. Pay per query: pay only for queries run; save 30–90% on per-query costs through compression. Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types. Easy: serverless, with zero infrastructure and zero administration; integrated with Amazon QuickSight.
  15. Amazon Athena
  16. Amazon EMR: Big Data Processing. Analytics and ML at scale. Nineteen open-source projects: Apache Hadoop, Spark, HBase, Presto, and more. Enterprise-grade security. Latest versions: updated with the latest open source frameworks within 30 days of release. Low cost: flexible billing with per-second billing, EC2 Spot, Reserved Instances, and auto scaling to reduce costs 50–80%. Use S3 storage: process data directly in the Amazon S3 data lake securely with high performance using the EMRFS connector. Easy: launch fully managed Hadoop and Spark in minutes; no cluster setup, node provisioning, or cluster tuning.
  17. Amazon EMR
  18. Data Lakes Extend the Traditional Approach. Relational and non-relational data; TBs–EBs scale; schema defined during analysis; diverse analytical engines to gain insights; designed for low-cost storage and analytics. Diagram: OLTP, ERP, CRM, LOB; data warehouse; business intelligence; data lake; devices, web, sensors, social; catalog; machine learning, DW queries, big data processing, interactive, real-time.
  19. Theo Hultberg, Director of Technology, @iconara
  20. volume, variety
  21. S3, ETL
  22. S3
  23. ? S3
  24. S3, ETL, API, Athena
  25. API, Athena. Query shown on the slide:
      SELECT "app", "account", SUM("logged_in_users")
      FROM "usage"
      WHERE "date" BETWEEN DATE '2018-05-01' AND DATE '2018-05-16'
      GROUP BY 1, 2
  26. S3, ETL, API, Athena
  27. S3, Glue, Aurora, API, ETL, Redshift, Athena
  28. thank you! @iconara
  29. Thank you!
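
Slides 24 and 25 show an API issuing a SQL aggregation to Athena. The following is a minimal sketch of how such a request might be assembled in Python. It is not the presenters' code. The helper names, the database name `analytics`, and the bucket `example-bucket` are hypothetical; the dictionary keys follow the parameters of boto3's Athena `start_query_execution` call, which is mentioned only in a comment so the block runs without AWS access.

```python
from datetime import date


def build_usage_query(start: date, end: date) -> str:
    """Return the slide-25 aggregation query for an inclusive date range."""
    # isoformat() yields YYYY-MM-DD, the literal form used after DATE on the slide.
    return (
        'SELECT "app", "account", SUM("logged_in_users") '
        'FROM "usage" '
        f"WHERE \"date\" BETWEEN DATE '{start.isoformat()}' AND DATE '{end.isoformat()}' "
        "GROUP BY 1, 2"
    )


def build_athena_request(query: str, database: str, output_location: str) -> dict:
    """Assemble keyword arguments for an Athena query submission."""
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = build_athena_request(
    build_usage_query(date(2018, 5, 1), date(2018, 5, 16)),
    database="analytics",  # hypothetical database name
    output_location="s3://example-bucket/athena-results/",  # hypothetical bucket
)

# With boto3 installed and credentials configured, the request could be sent with:
#   boto3.client("athena").start_query_execution(**request)
print(request["QueryString"])
```

Building the dates from `date` objects rather than raw strings keeps malformed input out of the query text; a production API would still need to validate any user-supplied identifiers.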

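
Slide 14 says Athena is billed per query and that compression can cut per-query costs by 30–90%. The arithmetic behind that claim can be sketched as below, assuming cost is proportional to bytes scanned. The rate of 5.0 per terabyte is a placeholder assumption, not a figure from the slides, and the sketch ignores any billing minimums or rounding; check current pricing before relying on it.

```python
BYTES_PER_TB = 10**12  # assumption: decimal terabytes, for simple arithmetic


def query_cost(bytes_scanned: int, price_per_tb: float = 5.0) -> float:
    """Cost of one query under a flat, assumed price per terabyte scanned."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


def savings_fraction(raw_bytes: int, compressed_bytes: int) -> float:
    """Fraction of per-query cost saved when the scanned data shrinks."""
    return 1 - compressed_bytes / raw_bytes


# Hypothetical data set: 2 TB uncompressed, 0.5 TB after compression.
raw = 2 * BYTES_PER_TB
compressed = BYTES_PER_TB // 2

raw_cost = query_cost(raw)
compressed_cost = query_cost(compressed)
saved = savings_fraction(raw, compressed)

print(f"uncompressed: {raw_cost:.2f}, compressed: {compressed_cost:.2f}, saved: {saved:.0%}")
```

In this hypothetical case a 4:1 compression ratio yields a 75% saving, which falls inside the 30–90% range quoted on the slide.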