ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and Amazon Athena


"Learn how to architect a data lake where different teams within your organization can publish and consume data in a self-service manner. As organizations aim to become more data-driven, data engineering teams have to build architectures that can cater to the needs of diverse users, from developers, to business analysts, to data scientists. Each of these user groups employs different tools, has different data needs, and accesses data in different ways.

In this talk, we will dive deep into assembling a data lake using Amazon S3, Amazon Kinesis, Amazon Athena, Amazon EMR, and AWS Glue. The session will feature Mohit Rao, Architect and Integration lead at Atlassian, the maker of products such as JIRA, Confluence, and Stride. First, we will look at a couple of common architectures for building a data lake. Then we will show how Atlassian built a self-service data lake, where any team within the company can publish a dataset to be consumed by a broad set of users."

  1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS re:INVENT. Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and Amazon Athena. Rohan Dhupelia, Analytics Platform Manager, Atlassian. Abhishek Sinha, Senior Product Manager, Amazon Athena. ABD318
  2. Agenda • Characteristics of a data lake • The basic components of building a data lake and services corresponding to each • Example 1: Building a data lake to unify real-time and batch data processing needs • Example 2: The Atlassian self-service data lake
  3. Reasons for building a data lake. Exponential growth in data: Transactions, ERP, Sensor Data, Billing, Web logs, Social, Infrastructure logs.
  4. Reasons for building a data lake. Exponential growth in data: Transactions, ERP, Sensor Data, Billing, Web logs, Social, Infrastructure logs. Diversified consumers: Data Scientists, Business Analysts, External Consumers, Applications.
  5. Reasons for building a data lake. Exponential growth in data: Transactions, ERP, Sensor Data, Billing, Web logs, Social, Infrastructure logs. Diversified consumers: Data Scientists, Business Analysts, External Consumers, Applications. Multiple access mechanisms: API Access, BI Tools, Notebooks.
  6. Characteristics of a data lake: Future Proof, Flexible Access, Dive in Anywhere, Collect Anything.
  7. Amazon S3 as the data lake
  8. Simplified architectural view. Data sources: Transactions, Web logs / cookies, ERP, Connected devices. Ingestion mechanism. Storage: Amazon S3. Process. Consume.
  9. There are lots of ingestion tools. Data sources: Transactions, Web logs / cookies, ERP, Connected devices. Ingestion: S3 Transfer Acceleration. Storage: Amazon S3. Process. Consume.
  10. Variety of data processing tools. Data sources: Transactions, Web logs / cookies, ERP, Connected devices. Ingestion: S3 Transfer Acceleration. Storage: Amazon S3. Processing: Amazon AI (ML/DL services), Amazon Athena (interactive query), Amazon EMR (managed Hadoop & Spark), Amazon Redshift + Spectrum (petabyte-scale data warehousing), Amazon Elasticsearch (real-time log analytics & search). Consume.
  11. And multiple ways to consume the data. Data sources: Transactions, Web logs / cookies, ERP, Connected devices. Ingestion: S3 Transfer Acceleration. Storage: Amazon S3. Processing: Amazon AI (ML/DL services), Amazon Athena (interactive query), Amazon EMR (managed Hadoop & Spark), Amazon Redshift + Spectrum (petabyte-scale data warehousing), Amazon Elasticsearch (real-time log analytics & search). Consumption: Amazon QuickSight (fast, easy to use, cloud BI), analytic notebooks (Jupyter, Zeppelin, HUE), Amazon API Gateway (programmatic access).
  12. Because data is never perfect. AWS Lambda: trigger-based code execution. AWS Glue: event-based, serverless ETL engine. Amazon EMR: Spark and Hive running on EMR. What the ETL layer does: clean, transform, concatenate, convert to better formats, schedule transformations, run event-driven transformations, express transformations as code.
  13. ETL when you need it. Data sources: Transactions, Web logs / cookies, ERP, Connected devices. Ingestion: S3 Transfer Acceleration. Storage: Amazon S3. Processing: Amazon AI (ML/DL services), Amazon Athena (interactive query), Amazon EMR (managed Hadoop & Spark), Amazon Redshift + Spectrum (petabyte-scale data warehousing), Amazon Elasticsearch (real-time log analytics & search). Consumption: Amazon QuickSight (fast, easy to use, cloud BI), analytic notebooks (Jupyter, Zeppelin, HUE), API Gateway (programmatic access).
  14. Metadata? AWS Glue Data Catalog: central metadata catalog for the data lake. One per account. Allows you to share metadata between Amazon Athena, Amazon Redshift Spectrum, EMR & JDBC sources. We added a few extensions: • Search over metadata for data discovery • Connection info: JDBC URLs, credentials • Classification for identifying and parsing files • Versioning of table metadata as schemas evolve and other metadata are updated
  15. AWS Glue Data Catalog crawlers: helping catalog your data. Crawlers automatically build your Data Catalog and keep it in sync. • Automatically discover new data and extract schema definitions • Detect schema changes and version tables • Detect Hive-style partitions on Amazon S3. Built-in classifiers for popular types; custom classifiers using Grok expressions. Run ad hoc or on a schedule; serverless, so you pay only when a crawler runs.
  16. AWS Glue Data Catalog
  17. Data Catalog: Table Details. Table schema, table properties, data statistics, nested fields.
  18. Data Catalog: Version Control. List of table versions; compare schema versions.
  19. Automatic Partition Detection. Table partitions: automatically register available partitions.
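Hive-style partitions encode the partition columns in the object key as `name=value` path segments, which is what makes automatic registration possible. The following is a minimal Python sketch of that idea only; the helper name `parse_hive_partitions` and the example key are hypothetical, and this is not how AWS Glue crawlers are implemented internally.

```python
def parse_hive_partitions(key: str) -> dict:
    """Return the partition columns found in an S3 object key."""
    partitions = {}
    # Every path segment except the final file name may be a partition.
    for segment in key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


# Hypothetical key, for illustration only.
example_key = "sensors/year=2017/month=11/day=28/part-0000.json.gz"
print(parse_hive_partitions(example_key))
```

Segments without an `=` (such as the `sensors` prefix here) are treated as plain folders and ignored.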
  20. A central metadata store for your lake. Data sources: Transactions, Web logs / cookies, ERP, Connected devices. Ingestion: S3 Transfer Acceleration. Storage: Amazon S3. Processing: Amazon AI (ML/DL services), Amazon Athena (interactive query), Amazon EMR (managed Hadoop & Spark), Amazon Redshift + Spectrum (petabyte-scale data warehousing), Amazon Elasticsearch (real-time log analytics & search). Consumption: Amazon QuickSight (fast, easy to use, cloud BI), analytic notebooks (Jupyter, Zeppelin, HUE), API Gateway (programmatic access). Catalog: AWS Glue Data Catalog (Hive-compatible metastore).
  21. Real-time (in-stream processing). Data sources: Transactions, Web logs / cookies, ERP, Connected devices. Ingestion: S3 Transfer Acceleration. Storage: Amazon S3. Processing: Amazon AI (ML/DL services), Amazon Athena (interactive query), Amazon EMR (managed Hadoop & Spark), Amazon Redshift + Spectrum (petabyte-scale data warehousing), Amazon Elasticsearch (real-time log analytics & search). Consumption: Amazon QuickSight (fast, easy to use, cloud BI), analytic notebooks (Jupyter, Zeppelin, HUE), API Gateway (programmatic access). Catalog: AWS Glue Data Catalog (Hive-compatible metastore). Streaming: Spark Streaming & Flink on EMR, Amazon Kinesis Analytics.
  22. Write once, catalog once, read multiple, ETL anywhere. Data sources: Transactions, Web logs / cookies, ERP, Connected devices. Ingestion: S3 Transfer Acceleration. Storage: Amazon S3. Processing: Amazon AI (ML/DL services), Amazon Athena (interactive query), Amazon EMR (managed Hadoop & Spark), Amazon Redshift + Spectrum (petabyte-scale data warehousing), Amazon Elasticsearch (real-time log analytics & search). Consumption: Amazon QuickSight (fast, easy to use, cloud BI), analytic notebooks (Jupyter, Zeppelin, HUE), API Gateway (programmatic access). Catalog: AWS Glue Data Catalog (Hive-compatible metastore). Streaming: Spark Streaming & Flink on EMR, Amazon Kinesis Analytics.
  23. Characteristics of a data lake: Future Proof, Flexible Access, Dive in Anywhere, Collect Anything.
  24. Let’s take an example. Source: sensor/IoT device, record-level data. Business questions: 1. What is going on with a specific sensor? 2. Daily aggregations (device, inefficiencies, average temperature) 3. A real-time view of how many sensors are showing inefficiencies. Operations: 1. Scale 2. High availability 3. Less management overhead 4. Pay for what I need.
  25. Let’s push this data into Kinesis. Diagram labels: Kinesis Firehose; Amazon S3; Amazon Athena.
  27. Querying it in Amazon Athena. Either create a crawler to auto-generate the schema, or write DDL through the Athena console, API, or JDBC/ODBC driver. Then start querying data.
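For the "write DDL" option, a table definition can be assembled as a string and submitted through any of the listed interfaces. The sketch below only builds the statement text; the function name, the column types, the comma-delimited file format and the bucket location are all assumptions made for illustration, since the slides do not show the actual DDL.

```python
def build_create_table_ddl(database, table, columns, location):
    """Build a CREATE EXTERNAL TABLE statement for comma-delimited text files."""
    column_sql = ",\n  ".join(f"`{name}` {col_type}" for name, col_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.`{table}` (\n"
        f"  {column_sql}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}'"
    )


# Column names follow the queries later in the deck; the types are assumed.
sensor_columns = [
    ("uuid", "string"),
    ("deviceid", "int"),
    ("devicets", "string"),
    ("temp", "int"),
    ("settemp", "int"),
    ("pct_inefficiency", "double"),
]
# The bucket name is a placeholder, not the one used in the talk.
print(build_create_table_ddl("awsblogsgluedemo", "raw", sensor_columns,
                             "s3://example-bucket/raw-time-series/"))
```

A crawler avoids this step entirely by inferring the columns and types from the files.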
  28. Query daily aggregates in Amazon Athena. Diagram labels: Kinesis Firehose; Amazon S3 “raw-time-series”; Amazon S3 “daily-average”; Amazon Athena.
  29. AWS Glue job: serverless, event-driven execution. Data is written out to S3. The output table is automatically created in Amazon Athena.
  30. Query daily aggregates in Amazon Athena. Diagram labels: Kinesis Firehose; Amazon S3 “raw-time-series”; Amazon S3 “daily-average”; Amazon Athena.
  31. Kinesis Analytics for in-stream analytics. Diagram labels: Kinesis Firehose; Amazon S3 “raw-time-series”; Amazon S3 “daily-average”; Kinesis Analytics; Kinesis Firehose; Amazon S3 “results”; Amazon Athena.
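The real-time question from the example ("how many sensors are showing inefficiencies") is a windowed count over the stream. The sketch below shows the windowing logic in plain Python and is not Kinesis Analytics code; the event shape, the epoch-second `devicets`, the 60-second window and the threshold are assumptions chosen for illustration.

```python
from collections import defaultdict


def count_inefficient_devices(events, window_seconds=60, threshold=0.0):
    """Count distinct inefficient devices per tumbling (non-overlapping) window."""
    devices_by_window = defaultdict(set)
    for event in events:
        # Align the event timestamp to the start of its window.
        window_start = event["devicets"] - event["devicets"] % window_seconds
        if event["pct_inefficiency"] > threshold:
            devices_by_window[window_start].add(event["deviceid"])
    return {start: len(devices) for start, devices in sorted(devices_by_window.items())}


sample_events = [
    {"deviceid": 1, "devicets": 0, "pct_inefficiency": 5.0},
    {"deviceid": 1, "devicets": 30, "pct_inefficiency": 7.5},
    {"deviceid": 2, "devicets": 45, "pct_inefficiency": 0.0},
    {"deviceid": 2, "devicets": 61, "pct_inefficiency": 2.0},
]
print(count_inefficient_devices(sample_events))
```

Windows in which no device exceeds the threshold are simply absent from the result.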
  32. “daily-agg” table with daily aggregation
      KPI: Overall device daily inefficiency
        SELECT ( SUM(daily_avg_inefficiency)/COUNT(*) ) AS all_device_avg_inefficiency, date
        FROM awsblogsgluedemo.daily_avg_inefficiency
        GROUP BY date;
      “result” table
      Top 10 most inefficient devices, event-level granularity
        SELECT col0 AS "uuid", col1 AS "deviceid", col2 AS "devicets", col3 AS "temp", col4 AS "settemp", col5 AS "pct_inefficiency"
        FROM awsblogsgluedemo.results
        ORDER BY pct_inefficiency DESC
        LIMIT 10;
      “raw” table with raw data
      Top 20 most active devices
        SELECT deviceid, COUNT(*) AS num_events
        FROM awsblogsgluedemo."raw"
        GROUP BY deviceid
        ORDER BY num_events DESC
        LIMIT 20;
      Events by Device ID
        SELECT uuid, devicets, deviceid, temp
        FROM awsblogsgluedemo."raw"
        WHERE deviceid = 1
        ORDER BY devicets DESC;
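The KPI query can be tried locally without Athena. The sketch below runs the same aggregation against an in-memory SQLite table; the table columns other than `daily_avg_inefficiency` and `date`, and all of the sample rows, are made up for illustration, and SQLite stands in for Athena only to show what the query computes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_avg_inefficiency "
    "(deviceid INTEGER, date TEXT, daily_avg_inefficiency REAL)"
)
# Made-up sample rows: two devices on the first day, one on the second.
sample_rows = [
    (1, "2017-11-01", 10.0),
    (2, "2017-11-01", 20.0),
    (1, "2017-11-02", 30.0),
]
conn.executemany("INSERT INTO daily_avg_inefficiency VALUES (?, ?, ?)", sample_rows)

kpi = conn.execute(
    "SELECT SUM(daily_avg_inefficiency) / COUNT(*) AS all_device_avg_inefficiency, date "
    "FROM daily_avg_inefficiency GROUP BY date ORDER BY date"
).fetchall()
print(kpi)
```

When the column contains no NULLs, `SUM(x) / COUNT(*)` gives the same result as `AVG(x)`.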
  33. Overall architecture. Diagram labels: Kinesis Firehose; Amazon S3 “raw-time-series”; Amazon S3 “daily-average”; Kinesis Analytics; Kinesis Firehose; Amazon S3 “results”; Amazon Athena.
  34. Characteristics • Scale to hundreds of thousands of data sources • Virtually infinite storage scalability • Real-time and batch processing layers • Interactive queries • Highly available and durable • Pay only for what you use • No servers to manage
  35. Very easy to try: existing template
  36. ROHAN DHUPELIA | ANALYTICS PLATFORM MANAGER | @ROHANDHUPELIA Building the Atlassian Data Lake
  37. ATLASSIAN OVERVIEW. Planning & tracking; Messaging & communicate; Organizing projects; Content collaboration; Code collaboration. Meetings, Decisions, Mentions, Files, Convos, Reactions. Software Teams, IT Teams, Marketing Teams, Finance Teams, HR Teams.
  38. Socrates: The Atlassian Data Lake. Image courtesy of © Bar Harel, CC BY-SA 4.0, Wikimedia Commons
  39. The numbers. 500+ TBs stored in the data lake. 1B+ events ingested into the data lake daily. 100 integrations providing analytical events. 1000 internal users using the data lake daily.
  40. Data lake services
  41. Ingest: Moving away from pull-based ingestion
  42. Challenges with pull-based ingestion. Complex: various technologies to maintain. Disruptive: analytics extracts strain source systems. Brittle: as sources change, the pipelines break and need updating.
  43. Our Ingestion Journey, late 2015. Sources: Web, CRM, Billing, Product. Mechanisms: Kinesis, REST, JDBC, GraphQL. Destination: Socrates (Data Lake).
  44. Our Ingestion Journey, early 2016. Sources: Web, CRM, Billing, Product, Micro Services. Mechanisms: Kinesis, REST, JDBC, Webhook, ODBC, SFTP, GraphQL. Destination: Socrates (Data Lake).
  45. Our Ingestion Journey, late 2016. Sources: Web, CRM, Billing, Product, Micro Services. Destination: Socrates (Data Lake).
  46. Our Ingestion Journey, early 2017. Sources: Web, CRM, Billing, Product, Micro Services, Other Micro Services, Other Enterprise Systems. Via: StreamHub (Enterprise Bus). Destination: Socrates (Data Lake).
  47. What is StreamHub? Event-Driven Architecture: producers and subscribers integrate via events. Schema Registry: validates that messages are compatible.
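The slides do not say which compatibility rules the StreamHub schema registry enforces. As a rough illustration of what such a check can look like, the sketch below uses one common rule set (existing fields must keep their type, and newly added fields must be optional); the schema representation and the function name are hypothetical, not StreamHub's.

```python
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """Check a new schema version against the previous one.

    Schemas are plain dicts of field name -> {"type": ..., "required": ...}.
    """
    # Existing fields must still be present with an unchanged type.
    for name, spec in old_schema.items():
        if name not in new_schema or new_schema[name]["type"] != spec["type"]:
            return False
    # Newly added fields must be optional, so older producers stay valid.
    for name, spec in new_schema.items():
        if name not in old_schema and spec.get("required", False):
            return False
    return True


v1 = {"issue_id": {"type": "string", "required": True}}
v2 = {**v1, "comment": {"type": "string", "required": False}}
v3 = {"issue_id": {"type": "int", "required": True}}
print(is_backward_compatible(v1, v2), is_backward_compatible(v1, v3))
```

Rejecting an incompatible schema at publish time is what keeps downstream pipelines from breaking when a producer changes.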
  48. How do we land it?
  49. atlassian-socrates-raw-landed/
      └── avi:jira:created:comment/
          └── day=2017-10-10/
              ├── events-13:20:15.479940.json.gz
              ├── events-13:21:23.479940.json.gz
              ├── events-13:21:52.479940.json.gz
              ├── events-13:23:37.479940.json.gz
              ├── events-13:23:56.479940.json.gz
              ├── events-13:24:15.479940.json.gz
              ├── events-13:24:21.479940.json.gz
              ├── events-13:25:34.479940.json.gz
              └── events-13:26:13.479940.json.gz
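The landed layout above is an event-type prefix, a Hive-style `day=` partition, and a timestamped gzip file name. A minimal sketch of building such a key follows; the function name is hypothetical and the pattern is inferred from the listing on the slide, not taken from Atlassian's code.

```python
from datetime import datetime


def landed_key(event_type: str, received_at: datetime) -> str:
    """Build an object key of the form <event-type>/day=<date>/events-<time>.json.gz."""
    day = received_at.strftime("%Y-%m-%d")
    # %f is zero-padded microseconds, matching the six digits in the listing.
    stamp = received_at.strftime("%H:%M:%S.%f")
    return f"{event_type}/day={day}/events-{stamp}.json.gz"


print(landed_key("avi:jira:created:comment", datetime(2017, 10, 10, 13, 20, 15, 479940)))
```

Keeping the date in the path as `day=...` is what later allows query engines to prune whole days instead of scanning every file.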
  50. atlassian-socrates-raw-published-stg1/
      └── avi:jira:created:comment/
          └── day=2017-10-10/
              ├── <sub-partition>/
              │   ├── events-part01.snappy.parquet
              │   ├── events-part02.snappy.parquet
              │   ├── events-part03.snappy.parquet
              │   └── events-part04.snappy.parquet
              └── <sub-partition>/
                  ├── events-part05.snappy.parquet
                  ├── events-part06.snappy.parquet
                  ├── events-part07.snappy.parquet
                  └── events-part08.snappy.parquet
  51. atlassian-socrates-raw-published-stg2/
      └── avi:jira:created:comment/
          └── day=2017-10-10/
              ├── business_key_1/
              │   └── events-part01.snappy.parquet
              └── business_key_2/
                  └── events-part01.snappy.parquet
  52. Prepare: Cleansing and transforming our data
  53. Challenges with preparation. Cluster Management: clusters could be hard to upgrade, and it was hard to attribute costs to jobs. Re-Inventing the Wheel: lots of time spent re-implementing patterns to perform transformations. Data Engineering Bottleneck: teams would rely on us to help them with their data transformation needs.
  54. Airflow. Diagram labels: RAW / UNALTERED; JOB SCOPED CLUSTERS; PREPARED / TRANSFORMED; CRM/Billing; Product/Web; Aggregated / Derived; Dimensional Model; User Defined Extracts; Support/Ops; Account / Chargeback; Upscale; Quarantine.
  55. Airflow DAG. Callouts: Copy logs for debugging; Spin up a dedicated EMR cluster; Shut down EMR cluster.
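The job-scoped cluster pattern in the DAG callouts comes down to an ordering guarantee: start a dedicated cluster, run the job, then copy logs and shut the cluster down even when the job fails. The sketch below expresses that in plain Python with stub callables; it is not Airflow or EMR code, and every name in it is hypothetical.

```python
def run_on_job_scoped_cluster(job, start_cluster, copy_logs, stop_cluster):
    """Run one job on its own cluster and always clean up afterwards."""
    cluster_id = start_cluster()
    try:
        return job(cluster_id)
    finally:
        try:
            # Logs are copied first so they are still available for debugging.
            copy_logs(cluster_id)
        finally:
            # Shutdown runs even if the job or the log copy raised.
            stop_cluster(cluster_id)


# Stub callables stand in for real cluster operations.
events = []
result = run_on_job_scoped_cluster(
    job=lambda cluster_id: events.append(f"run on {cluster_id}") or "done",
    start_cluster=lambda: events.append("start") or "cluster-1",
    copy_logs=lambda cluster_id: events.append("copy logs"),
    stop_cluster=lambda cluster_id: events.append("stop"),
)
print(result, events)
```

Because each job gets its own cluster, the cluster's cost can be attributed to that job, which addresses the cost attribution challenge listed on slide 53.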
  56. Transformation as a Service (TaaS)
  57. Organize: Storing, securing, and governing our data
  58. Challenges with organizing data. Security: How can we provision buckets for teams who don’t want to face the AWS console head-on? Categorizing Data: How can we structure our data lake in a way that will scale well? Teams want flexibility: How do we give teams flexibility on how they organize themselves?
  59. Areas of the data lake. Landed: unaltered, unformatted, unmasked. Raw: optimized, partitioned, masked. Modeled: conformed dimensions, standardized facts, aggregated/derived values. Self-Serve: BYO data, user/team managed.
  60. Request a Schema…
  61. Self-Service Schemas: What gets provisioned. We call them Zones: we used to call them “Playgrounds”, but they were often used for production loads, e.g. zone_marketing. Provisions the components: • Create an S3 bucket, tagged to the user • Create a schema in our metastore(s) • Create an Active Directory group. Use Vault to control access rights: • A tool that manages secrets • Creates a temporary IAM user (2 hours) • Passes the credentials to the user
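Provisioning a zone means deriving a consistent set of resource names from one team name. The sketch below follows the naming visible in the deck (`zone_marketing`, `atlassian-zone-marketing`, `zone-marketing-write` and `zone-marketing-read`); the function itself is hypothetical, and the Active Directory group is left out because the slides do not show its naming.

```python
def zone_resources(team: str) -> dict:
    """Derive the resource names provisioned for a self-service zone."""
    slug = team.strip().lower().replace(" ", "-")
    return {
        "s3_bucket": f"atlassian-zone-{slug}",
        # Schema names use underscores, as in zone_marketing.
        "schema": f"zone_{slug.replace('-', '_')}",
        "vault_policies": [f"zone-{slug}-write", f"zone-{slug}-read"],
    }


print(zone_resources("Marketing"))
```

Deriving every name from a single input keeps the bucket, the schema and the access policies in step without a team touching the AWS console.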
  62. Self-Service Schemas: How users interact
      Authenticate against Vault
        $ vault auth -method=ldap username=<ad_username>
        Password (will be hidden): <ad_password>
        ...
        token_policies: [zone-marketing-write zone-marketing-read]
      Retrieve your credentials
        $ vault read aws/creds/zone-marketing-write
        Key              Value
        ---              -----
        lease_id         aws/creds/zone-marketing-write/e1x2a3m4p5l6e7
        lease_duration   25h0m0s
        lease_renewable  true
        access_key       AKIAISANEXAMPLEKEYID
        secret_key       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
        security_token   <nil>
  63. Self-Service Schemas: How users interact
      Apply Credentials
        $ aws configure
        AWS Access Key ID [None]: AKIAISANEXAMPLEKEYID
        AWS Secret Access Key [None]: 1r2a3n4d5o6m7s8t9r0i1n2g3O4f5C6o7u8r9s0e
      Upload your file
        $ aws s3 cp examplefile s3://atlassian-zone-bucketname
      List your bucket
        $ aws s3 ls s3://atlassian-zone-marketing/
        PRE example_directory/
        PRE another_example_directory/
        2016-12-08 13:21:35 0 example_text_file.txt
        2016-09-27 12:24:48 0 example_csv_file.csv
  64. Discover: Finding, understanding, and exploring data
  65. Challenges with data discovery. Managing query engines: query engine usage is unpredictable, and doing a bad job blocks analysts. Finding data: difficult to know which table to trust or to use for what purpose. Teams want options: different visualization tools better suit different needs.
  66. Visual Layer: Tableau, R Shiny, Zeppelin Notebooks, Redash. Interactive Layer: Amazon Athena, Presto (EMR), Spark/Hive (EMR). Metastore Layer: Hive Metastore, AWS Glue Metastore. Storage Layer: Raw Buckets, Model Buckets, Zone Buckets (Self-Service).
  67. Before: Presto • Many failed queries • Difficulties upgrading • Hard to secure. After: Amazon Athena • Ability to attribute costs • Less infrastructure/operational overhead • Not paying for what we don’t use • Uses bucket security policies
  68. Challenges with Amazon Athena. No AD Authentication: only access via JDBC to begin with, using keys. Cost Management: costs need to be monitored to spot any unusual spikes. Early Adopter Pains: there wasn’t parity with Presto to begin with.
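Since Athena charges by the amount of data scanned, spotting unusual spikes can start from daily bytes-scanned totals. The sketch below is one possible approach, not Atlassian's monitoring: the price per terabyte, the decimal terabyte size, the 7-day baseline and the 2x factor are all assumptions, so check the current pricing before relying on the numbers.

```python
def daily_costs(bytes_scanned_per_day, usd_per_tb=5.0, bytes_per_tb=10**12):
    """Convert daily bytes scanned into an estimated daily cost in USD."""
    return [scanned / bytes_per_tb * usd_per_tb for scanned in bytes_scanned_per_day]


def find_spikes(costs, window=7, factor=2.0):
    """Return indexes of days whose cost exceeds factor x the trailing average."""
    spikes = []
    for i in range(window, len(costs)):
        baseline = sum(costs[i - window:i]) / window
        if baseline > 0 and costs[i] > factor * baseline:
            spikes.append(i)
    return spikes


# Made-up usage: a steady week at 1 TB per day, then a 5 TB day.
scanned = [10**12] * 7 + [5 * 10**12]
costs = daily_costs(scanned)
print(costs[-1], find_spikes(costs))
```

The first `window` days are never flagged because there is not yet a full baseline to compare them against.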
  69. Visualization Stack. Tableau: interactive exploration on core data sets and corporate dashboards. R Shiny: web apps and standalone dashboards. Zeppelin Notebooks: web-based notebooks. Redash: quick queries and visualizations on all data.
  70. Search the Data Catalog
  71. Key Takeaways. It’s not just flicking on a switch: you can’t just turn on AWS components and have an instant data lake. AWS helps you move up the value chain: using AWS helps you focus on areas where you can be adding value.
  72. ROHAN DHUPELIA | ANALYTICS PLATFORM MANAGER | @ROHANDHUPELIA Thank you!
