AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
For discovery-phase research, life sciences companies have to support infrastructure that processes millions to billions of transactions. The advent of a data lake to accomplish such a task is showing itself to be a stable and productive data platform pattern to meet the goal. We discuss how to build a data lake on AWS, using services and techniques such as AWS CloudFormation, Amazon EC2, Amazon S3, IAM, and AWS Lambda. We also review a reference architecture from Amgen that uses a data lake to aid in their Life Science Research.

  1. How to Build a Big Data Analytics Data Lake
     Neeraj Verma – AWS Solutions Architect
     Saurav Mahanti – Senior Manager, Information Systems
     Dario Rivera – AWS Solutions Architect
     November 28, 2016
     © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. What to expect from this short talk
     • Data Lake concept
     • Data Lake – important capabilities
     • Amgen's Data Lake initiative
     • How to build a Data Lake in your AWS account
  3. Data Lake Concept
  4. What is a Data Lake?
     A Data Lake is a new and increasingly popular way to store and analyze massive volumes and heterogeneous types of data in a centralized repository.
  5. Benefits of a Data Lake – Quick Ingest
     "How can I collect data quickly from various sources and store it efficiently?"
     Quickly ingest data without needing to force it into a pre-defined schema.
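To ground the quick-ingest idea, here is a minimal Python (boto3) sketch of landing raw records in S3 exactly as they arrive; the bucket name and key layout are illustrative assumptions, not part of the talk's reference design.

```python
# Minimal ingestion sketch with boto3. The bucket name and key layout
# are illustrative assumptions, not the talk's reference design.
import datetime
import json

import boto3

s3 = boto3.client("s3")

def ingest(record: dict, source: str) -> str:
    """Write a raw record to S3 as-is; no schema is imposed on write."""
    now = datetime.datetime.utcnow()
    key = f"raw/{source}/{now:%Y/%m/%d}/{now:%H%M%S%f}.json"
    s3.put_object(
        Bucket="my-data-lake",  # hypothetical bucket name
        Key=key,
        Body=json.dumps(record).encode("utf-8"),
    )
    return key

ingest({"device": "sensor-42", "temp_c": 21.7}, source="iot")
```

Because nothing is validated against a schema on write, a new source can start landing data the moment it exists.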
  6. Benefits of a Data Lake – All Data in One Place
     "Why is the data distributed in many locations? Where is the single source of truth?"
     Store and analyze all of your data, from all of your sources, in one centralized location.
  7. Benefits of a Data Lake – Storage vs. Compute
     "How can I scale up with the volume of data being generated?"
     Separating your storage and compute allows you to scale each component as required.
  8. Benefits of a Data Lake – Schema on Read
     "Is there a way I can apply multiple analytics and processing frameworks to the same data?"
     A Data Lake enables ad hoc analysis by applying schemas on read, not on write.
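As a sketch of what schema-on-read means in practice, the example below projects two different "schemas" onto the same raw JSON-lines data at read time; the file path and field names are hypothetical.

```python
# Schema-on-read sketch: two consumers impose different views on the
# same raw JSON lines at read time. Path and fields are hypothetical.
import json

def read_raw(path):
    with open(path) as f:
        for line in f:
            yield json.loads(line)

# Consumer A: a narrow, typed view for temperature analytics.
temps = [(r["device"], float(r["temp_c"])) for r in read_raw("events.jsonl")]

# Consumer B: a different view over the very same bytes, keeping only
# device identifiers for an inventory; no rewrite of stored data needed.
devices = {r["device"] for r in read_raw("events.jsonl")}
```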
  9. Important Capabilities of a "Data Lake"
 10. Important components of a Data Lake: Ingest and Store; Catalogue & Search; Protect & Secure; Access & User Interface
 11. Many tools to support the data analytics lifecycle
     (Architecture diagram: COLLECT → INGEST/STORE → PROCESS/ANALYZE → CONSUME. Collection spans devices, sensors and IoT platforms (AWS IoT), mobile and web apps, data centers (AWS Direct Connect, AWS Import/Export Snowball), messaging, and logging (Amazon CloudWatch, AWS CloudTrail). Ingest/store covers streams and queues (Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon SQS, Apache Kafka), files (Amazon S3), and databases/caches (Amazon DynamoDB, Amazon RDS, Amazon ElastiCache). Process/analyze includes AWS Lambda, Amazon KCL apps, Amazon EMR, Presto, Amazon Machine Learning, Amazon Kinesis Analytics, Amazon Redshift, and Amazon Elasticsearch Service. Consumption ranges from Amazon QuickSight to notebooks, IDEs, APIs, and apps on Amazon EC2.)
 12. Ingest and Store
     • Ingest real-time and batch data
     • Support for any type of data at scale
     • Durable
     • Low cost
 13. Use S3 as the data substrate – apply compute as needed
     (Diagram: Amazon S3 as highly durable, low-cost, scalable storage at the center, with data arriving via Import/Export Snowball and compute services — EMR, Kinesis, Redshift, DynamoDB, RDS, Data Pipeline, Spark Streaming, Storm — applied against it as needed.)
 14. Amazon S3 as your cluster's persistent data store
     • Separate compute and storage
     • Resize and shut down analytics compute environments with no data loss
     • Point multiple compute clusters at the same data in Amazon S3
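A hedged PySpark sketch of this pattern, with hypothetical bucket and prefixes: the cluster reads from and writes back to S3, so it holds no state of its own and can be resized or terminated, and several clusters can point at the same data concurrently.

```python
# PySpark sketch: compute reads from and writes back to S3, so the
# cluster itself holds no state. Bucket and prefixes are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-backed-analytics").getOrCreate()

# Any number of clusters can point at this prefix at the same time.
events = spark.read.json("s3://my-data-lake/raw/iot/")

counts = events.groupBy("device").count()

# Results also land in S3; shutting this cluster down loses nothing.
counts.write.mode("overwrite").parquet("s3://my-data-lake/curated/device-counts/")
```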
 15. Data ingestion into Amazon S3: AWS Direct Connect, AWS Snowball, ISV connectors, Amazon Kinesis Firehose, S3 Transfer Acceleration, AWS Storage Gateway
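Of these paths, Amazon Kinesis Firehose is the fully managed streaming route into S3; a minimal boto3 sketch (the delivery stream name is an assumption) looks like this:

```python
# Streaming ingestion sketch: Firehose batches records and delivers
# them to S3 automatically. The delivery stream name is hypothetical.
import json

import boto3

firehose = boto3.client("firehose")

payload = json.dumps({"device": "sensor-42", "temp_c": 21.7}) + "\n"
firehose.put_record(
    DeliveryStreamName="data-lake-ingest",  # hypothetical stream name
    Record={"Data": payload.encode("utf-8")},
)
```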
 16. Catalogue & Search
     • Metadata lake – used for summary statistics and data classification management
     • Simplified model for data discovery & governance
 17. Catalogue & Search Architecture
     (Diagram: a Put Object into the S3 bucket emits Object Created / Object Deleted events to AWS Lambda, which does a Put Item into the metadata index (DynamoDB); the table's update stream triggers a second AWS Lambda that updates the index in Amazon Elasticsearch Service. Data collectors (EC2, ECS) extract the search fields.)
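A minimal sketch of the first Lambda in this flow, assuming a hypothetical DynamoDB table named metadata-index with partition key object_key (the actual solution's schema is not shown in the slides):

```python
# Catalogue-ingestion Lambda sketch. Table name and key schema are
# hypothetical; the deck does not specify the real ones.
import urllib.parse

import boto3

table = boto3.resource("dynamodb").Table("metadata-index")  # hypothetical
s3 = boto3.client("s3")

def handler(event, context):
    """Triggered by S3 ObjectCreated/ObjectRemoved events; keeps the
    DynamoDB metadata index in sync with the bucket's contents."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        if record["eventName"].startswith("ObjectCreated"):
            head = s3.head_object(Bucket=bucket, Key=key)
            table.put_item(Item={
                "object_key": key,
                "bucket": bucket,
                "size_bytes": head["ContentLength"],
                "content_type": head.get("ContentType", "unknown"),
                "last_modified": head["LastModified"].isoformat(),
            })
        elif record["eventName"].startswith("ObjectRemoved"):
            table.delete_item(Key={"object_key": key})
```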
 18. Protect and Secure
     • Access control – authentication & authorization
     • Data protection – encryption
     • Logging and monitoring
 19. Implement the right controls
     • Security: Identity and Access Management (IAM) policies; bucket policies; access control lists (ACLs); query string authentication; private VPC endpoints to Amazon S3; SSL endpoints
     • Encryption: server-side encryption (SSE-S3); server-side encryption with provided keys (SSE-C, SSE-KMS); client-side encryption
     • Compliance: bucket access logs; lifecycle management policies; access control lists (ACLs); versioning & MFA deletes; certifications – HIPAA, PCI, SOC 1/2/3, etc.
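As one concrete control from the encryption column, here is a hedged boto3 sketch of server-side encryption on upload; the bucket name and KMS key alias are assumptions:

```python
# Server-side encryption sketch: SSE-S3 needs only a flag; SSE-KMS
# also takes a key ID. Bucket name and KMS alias are hypothetical.
import boto3

s3 = boto3.client("s3")

# SSE-S3: Amazon S3 manages the encryption keys.
s3.put_object(Bucket="my-data-lake", Key="raw/a.json", Body=b"{}",
              ServerSideEncryption="AES256")

# SSE-KMS: encrypt under a specific KMS key, with audited key usage.
s3.put_object(Bucket="my-data-lake", Key="raw/b.json", Body=b"{}",
              ServerSideEncryption="aws:kms",
              SSEKMSKeyId="alias/data-lake-key")  # hypothetical alias
```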
 20. API & User Interface
     • Exposes the data lake to customers
     • Programmatically query the catalogue
     • Expose a search API
     • Ensures that entitlements are respected
 21. API & UI Architecture
     (Diagram: users reach a static website backed by user management; requests go through API Gateway to AWS Lambda, which queries the metadata index.)
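A minimal sketch of the search microservice behind API Gateway (Lambda proxy integration), reusing the hypothetical metadata-index table from the catalogue sketch above; a real deployment would also enforce the caller's entitlements before returning results:

```python
# Search-API Lambda sketch behind API Gateway (proxy integration).
# Table name is the same hypothetical one as in the catalogue sketch.
import json

import boto3

table = boto3.resource("dynamodb").Table("metadata-index")  # hypothetical

def handler(event, context):
    """GET /search?prefix=... — returns catalogue entries whose key
    begins with the given prefix."""
    prefix = (event.get("queryStringParameters") or {}).get("prefix", "")
    scan = table.scan()  # fine for a sketch; use Elasticsearch for real search
    items = [i for i in scan["Items"] if i["object_key"].startswith(prefix)]
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(items, default=str),
    }
```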
 22. Putting It All Together
 23. Common add-on capabilities to a Data Lake
     • Backend ETL
     • Organizational control gates for dataset access
     • Data correlation identification via machine learning
     • Dataset updates to dependent compute environments
 24. Data Lake is a journey
     There are multiple implementation methods for building a Data Lake:
     • Using a combination of specialized/open-source tools from various providers to build a customized solution
     • Using managed services from AWS such as S3, Amazon Elasticsearch Service, DynamoDB, Cognito, Lambda, etc. as a core Data Lake solution
     • Using a Data-Lake-as-a-Service solution
 25. Data Lake Evolution @ Amgen
     Saurav Mahanti, Senior Manager – Information Systems
 26. One Data Lake
     • One code base
     • Multiple locations (cloud / on-premises)
     • Common scalable infrastructure
     • Multiple business functions
     • Shared features
 27. A conceptual view of the Data Lake
     • Data Lake Platform – common tools and capabilities: innovative new tools and capabilities are reused across all functions (e.g. search, data processing & storage, visualization tools).
     • Functions can manage their own data while contributing to the common data layer.
     • Business applications are built to meet specific information needs, from simple data access and data visualization to complex statistical/predictive models.
     (Diagram: application columns for Manufacturing Data — batch genealogy visualization, PD self-service analytics, instrument bus data search; Real World Data — FDA Sentinel analytics, Epi programmer's workbench, patient population analytics; Commercial Data — US field sales reporting, global forecasting, US marketing analytics.)
     Best Practices Awards: Bio-IT World 2016 winner
 28. High-level component architecture of the Data Lake
     • Procure/Ingest – a collection of tools and technologies (data ingestion adapter, managed infrastructure, self-service adapter, application accelerator) to easily move data to and from the Data Lake.
     • Curate and Enrichment – enhances the value and usability of the data through automation, cataloging and linking (self-service data integration tools, IS-owned automated integration, data catalog, reference data linkage).
     • Storage and Processing – the core of the Enterprise Data Lake that enables its scale and flexibility (user data workspace; raw, mastered and apps zones; elastic compute capability).
     • Centralized Analytics Toolkit – provides different reporting and analytics solutions for end users to access the data on the Data Lake (analytical toolkit, data science toolkit, BI reporting, user-specific apps, access portal).
 29. Data processing and storage layer
     • The combination of Hadoop HDFS and Amazon S3 provides the right cost/performance balance with unlimited scalability, while maintaining security and encryption at the data file level.
     • Powerful execution engines like YARN, MapReduce and Spark bring the "compute to the data".
     • Hive and Impala provide SQL over structured data; HBase is used for NoSQL/transactional jobs.
     • Solr is used for search capabilities over documents.
     • MarkLogic and Neo4j provide semantic and graph capabilities.
     • Amazon Redshift and Amazon EMR provide low-cost elastic computing, with the ability to spin up clusters for data processing and metrics calculations.
 30. Procure and Ingest – pre-built and configurable common components to load any type of data into the Lake
     • Cloud data integration tools like SnapLogic enable data analysts and end users to build data pipelines into the Data Lake using pre-built connectors to cloud-hosted services like Box, and make it easy to move datasets between HDFS, S3, sFTP and file shares.
     • Structured data ingestion – a common component for scheduled production data loads of incremental or full data; uses Python and native Hadoop tools like Sqoop and Hive for efficiency and speed.
     • Real-time data ingestion – for real-time or streaming data; uses the Kafka messaging queue, Spark Streaming, HBase and Java.
     • Unstructured data pipeline – a Morphlines document pipeline for document ingestion, text processing and indexing; uses Morphlines, Lily indexer and HBase.
 31. Curate and Enrichment – enhance the value of the data by linking and cataloging
     • Reference data linkage – connect datasets to ontologies and vocabularies to get more relevant and better results.
     • Data Catalog – an enterprise-wide metadata catalog that stores, describes, indexes, and shows how to access any registered data asset.
     • Analytical subject areas – build targeted applications using data integration tools like SnapLogic, or deploy packaged applications or data marts that transform the data in the Lake for consumption by end-user tools.
 32. Centralized Analytics Toolkit – provides reporting and analytics solutions for end users to access the data in the Lake
     • Data science toolkit – lets analysts and data scientists use tools like SAS, R and Jupyter (Python notebooks) to connect to their analytics sandbox within the Lake and submit distributed computing jobs.
     • Reporting and business intelligence tools like MicroStrategy and Cognos can be deployed either directly on the Lake or on derived analytical subject areas.
     • Analytics and visualization tools are the most common methods used to query and analyze data in the Lake.
     • Focused applications that target a specific use case can be built using open-source products and deployed to a specific user community through the portal.
     • The Data Lake Portal provides a user-friendly, mobile-enabled and secure portal to all the end-user applications.
 33. Business impact – Real World Data Platform: a shared capability used by multiple business functions across R&D and Commercial
     • RWD Data Lake – claims, EMR and more, with datasets converted to the common OMOP data model (Asia, US, EU, rest of world); superior processing speed enables simultaneous processing of terabytes of data.
     • Design & analytic toolbox – pre-calculated cohorts for all pipeline and marketed medicines; rapid descriptive RWD queries and targeted demand by therapy area; study design; advanced analytics (Spotfire, R, SAS, Achilles).
     • Outcomes – marketed product defense; clinical trial speed & outcomes; product value and benefit:risk; evidence-based research.
 34. That's a lot of work! – Where do I even start?!
 35. Introducing: Data Lake Solution
     Presented by: Dario Rivera – AWS Solutions Architect
     Built by: Sean Senior – AWS Solutions Builder
 36. Data Lake Solution – a package of code, deployed via CloudFormation into your AWS account.
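As a sketch of what "deployed via CloudFormation" means operationally, the solution's stack could be launched with boto3 as below; the stack name and template URL are placeholders (the real template location is published with the solution on AWS Answers):

```python
# CloudFormation launch sketch. Stack name and TemplateURL are
# placeholders, not the solution's actual published values.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

response = cfn.create_stack(
    StackName="data-lake-solution",  # hypothetical stack name
    TemplateURL="https://s3.amazonaws.com/example-bucket/data-lake.template",  # placeholder
    Capabilities=["CAPABILITY_IAM"],  # the stack creates IAM roles/policies
)
print("Launching stack:", response["StackId"])

# Block until the stack finishes creating.
cfn.get_waiter("stack_create_complete").wait(StackName="data-lake-solution")
```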
 37. Architecture Overview
 38. Data Lake serverless composition
     • Amazon S3 (2 buckets): primary data lake content storage; static website hosting
     • Amazon API Gateway (1 RESTful API)
     • Amazon DynamoDB (7 tables)
     • Amazon Cognito Your User Pools (1 user pool)
     • AWS Lambda: 5 microservice functions; 1 Amazon API Gateway custom authorizer function; 1 Amazon Cognito user pool event trigger function (pre sign-up, post confirmation)
     • Amazon Elasticsearch Service (1 cluster)
     • AWS IAM (7 policies, 7 roles)
 39. Demo of the AWS Data Lake Solution: http://tinyurl.com/DataLakeSolution
 40. All of this, and it costs less than $1/hour to run the Data Lake Solution.*
     * Excluding storage and analytics environment costs.
 41. The Data Lake Solution will be available at the end of Q4, via AWS Answers: https://aws.amazon.com/answers/
 42. Thank you!
     Neeraj Verma – rajverma@amazon.com
     Saurav Mahanti – smahanti@amgen.com
     Dario Rivera – darior@amazon.com
