© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Pop-up Loft
Preparing Data for the Lake
John Mallory
johmallo@amazon.com
Characteristics of a data lake
Future Proof
Flexible Access
Dive in Anywhere
Collect Anything
Amazon S3 as the data lake
Simplified architectural view
Data sources: transactions, web logs / cookies, ERP, connected devices
Ingestion mechanism
Amazon S3
Process
Consume
There are lots of ingestion tools
Data sources: transactions, web logs / cookies, ERP, connected devices
Ingestion: S3 Transfer Acceleration
Amazon S3
Process
Consume
Variety of data processing tools
Processing tools on top of Amazon S3:
Amazon AI: ML/DL services
Amazon Athena: interactive query
Amazon EMR: managed Hadoop and Spark
Amazon Redshift + Spectrum: petabyte-scale data warehousing
Amazon Elasticsearch: real-time log analytics and search
And multiple ways to consume the data
Amazon QuickSight: fast, easy-to-use cloud BI
Analytic notebooks: Jupyter, Zeppelin, HUE
Amazon API Gateway: programmatic access
Because data is never perfect
AWS Lambda: trigger-based code execution
AWS Glue: event-based serverless ETL engine
Amazon EMR: Spark and Hive running on EMR
Clean
Transform
Concatenate
Convert to better formats
Schedule transformations
Event-driven transformations
Transformations expressed as code
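The clean, transform, and convert steps above can be expressed as ordinary code. The following is a minimal Python sketch, not a Glue, Lambda, or EMR job: it assumes comma-separated sensor readings with the columns deviceid, devicets, and temp, and the function name and the JSON Lines output format are illustrative choices only.

```python
import csv
import io
import json


def clean_and_convert(csv_text):
    """Drop malformed rows, cast types, and emit JSON Lines (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out_lines = []
    for row in reader:
        try:
            record = {
                "deviceid": int(row["deviceid"]),
                "devicets": row["devicets"].strip(),
                "temp": float(row["temp"]),
            }
        except (KeyError, TypeError, ValueError, AttributeError):
            # Clean: skip rows with missing or non-numeric fields.
            continue
        if not record["devicets"]:
            continue
        # Convert: one JSON object per line.
        out_lines.append(json.dumps(record, sort_keys=True))
    return "\n".join(out_lines)


# Made-up sample input: one good row, one bad device id, one missing temp.
sample = (
    "deviceid,devicets,temp\n"
    "1,2018-03-01 10:00:00,71.5\n"
    "bad,2018-03-01 10:01:00,70\n"
    "2,2018-03-01 10:02:00,\n"
)
print(clean_and_convert(sample))
```

In a real lake the target would more likely be a columnar format such as Parquet, which needs a library outside the Python standard library.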
ETL when you need it
Metadata?
AWS Glue Data Catalog
Central metadata catalog for the data lake
One per account (per AWS Region)
Allows you to share metadata between Amazon Athena, Amazon Redshift Spectrum, EMR & JDBC sources
We added a few extensions:
• Search over metadata for data discovery
• Connection info: JDBC URLs, credentials
• Classification for identifying and parsing files
• Versioning of table metadata as schemas evolve and other metadata are updated
AWS Glue Data Catalog - Crawlers
Helping catalog your data
Crawlers automatically build your Data Catalog and keep it in sync
Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive-style partitions on Amazon S3
Built-in classifiers for popular types; custom classifiers using Grok expressions
Run ad hoc or on a schedule; serverless, so you only pay when the crawler runs
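Hive-style partitions are key=value segments in the object path. The sketch below shows how such a layout can be read from an S3 key; it is a plain Python illustration of the naming convention, not the crawler's actual implementation, and the key and function name are hypothetical.

```python
def parse_hive_partitions(s3_key):
    """Return the partition columns found in a Hive-style object key."""
    partitions = {}
    # Every path segment except the file name may be a key=value partition.
    for segment in s3_key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


# Hypothetical key layout for illustration.
key = "raw-time-series/year=2018/month=03/day=01/part-0000.json"
print(parse_hive_partitions(key))
```

Segments without an equals sign, such as the leading prefix, are ignored, so only year, month, and day come back as partition columns.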
AWS Glue Data Catalog
Data Catalog: Table Details
Table schema
Table properties
Data statistics
Nested fields
Data Catalog: Version Control
List of table versions
Compare schema versions
Automatic Partition Detection
Automatically register available partitions
Table partitions
A central metadata store for your lake
AWS Glue Data Catalog: Hive-compatible metastore
Real-time (in-stream processing)
Spark Streaming & Flink on EMR
Amazon Kinesis Analytics
Write once, catalog once, read multiple, ETL anywhere
Let’s take an example
Sensor/IoT device: record-level data
Business Questions
1. What is going on with a specific sensor
2. Daily aggregations (device, inefficiencies, average temperature)
3. A real-time view of how many sensors are showing inefficiencies
Operations
1. Scale
2. High availability
3. Less management overhead
4. Pay for what I need
Let’s push this data into a Kinesis
Kinesis Firehose -> Amazon S3 -> Amazon Athena
Querying it in Amazon Athena
Either create a crawler to auto-generate the schema
OR
Write a DDL statement from the Athena console, API, or JDBC/ODBC driver
Start querying data
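For the DDL route, the statement is a CREATE EXTERNAL TABLE that points at the S3 location. The sketch below only builds the statement text; it assumes JSON records, and the column types, the bucket name, and the helper function are all hypothetical. The column names come from the raw-table queries later in this deck.

```python
def build_create_table_ddl(database, table, columns, location):
    """Build a CREATE EXTERNAL TABLE statement for JSON data in S3 (hypothetical helper)."""
    column_sql = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.`{table}` (\n"
        f"  {column_sql}\n"
        ")\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )


# Column types and the S3 location are assumptions for illustration.
raw_columns = [
    ("uuid", "string"),
    ("devicets", "string"),
    ("deviceid", "int"),
    ("temp", "int"),
]
ddl = build_create_table_ddl(
    "awsblogsgluedemo", "raw", raw_columns, "s3://example-bucket/raw-time-series/"
)
print(ddl)
```

The printed statement could then be pasted into the Athena console or submitted through the API or a JDBC/ODBC connection.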
Query daily aggregates in Amazon Athena
Kinesis Firehose
Amazon S3: “raw-time-series”
Amazon S3: “daily-average”
Amazon Athena
AWS Glue Job
Serverless, event-driven execution
Data is written out to S3
Output table is automatically created in Amazon Athena
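The daily aggregation such a job performs can be sketched in plain Python. This is not AWS Glue code: it assumes each event already carries a pct_inefficiency value and a "YYYY-MM-DD HH:MM:SS" timestamp, and the function name and sample events are made up for illustration.

```python
from collections import defaultdict


def daily_avg_inefficiency(records):
    """Average pct_inefficiency per (deviceid, date)."""
    totals = defaultdict(lambda: [0.0, 0])
    for rec in records:
        day = rec["devicets"][:10]  # assumes "YYYY-MM-DD HH:MM:SS" timestamps
        bucket = totals[(rec["deviceid"], day)]
        bucket[0] += rec["pct_inefficiency"]
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sorted(totals.items())}


# Made-up events for illustration.
events = [
    {"deviceid": 1, "devicets": "2018-03-01 10:00:00", "pct_inefficiency": 10.0},
    {"deviceid": 1, "devicets": "2018-03-01 18:00:00", "pct_inefficiency": 20.0},
    {"deviceid": 2, "devicets": "2018-03-01 11:00:00", "pct_inefficiency": 5.0},
]
print(daily_avg_inefficiency(events))
```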
Kinesis Analytics for in-stream analytics
Kinesis Firehose
Amazon S3: “raw-time-series”
Amazon S3: “daily-average”
Kinesis Analytics, Kinesis Firehose
Amazon S3: “results”
Amazon Athena
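The real-time question from the example (how many sensors are showing inefficiencies right now) is a windowed count over the stream. Kinesis Analytics expresses this in SQL over streaming data; the sketch below illustrates the same tumbling-window idea in plain Python. The threshold, the window size, and the event tuple layout are assumptions.

```python
from collections import defaultdict


def inefficient_sensors_per_window(events, window_seconds=60, threshold=10.0):
    """Count distinct devices above the threshold in each tumbling window."""
    windows = defaultdict(set)
    for epoch_seconds, deviceid, pct_inefficiency in events:
        if pct_inefficiency > threshold:
            # Align the event to the start of its fixed-size window.
            window_start = epoch_seconds - (epoch_seconds % window_seconds)
            windows[window_start].add(deviceid)
    return {start: len(devices) for start, devices in sorted(windows.items())}


# Made-up stream of (epoch_seconds, deviceid, pct_inefficiency) tuples.
stream = [(0, 1, 12.0), (30, 2, 3.0), (45, 1, 15.0), (70, 2, 11.0), (90, 3, 20.0)]
print(inefficient_sensors_per_window(stream))
```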
“raw” table with raw data
Top 20 most active devices
SELECT deviceid, COUNT(*) AS num_events
FROM awsblogsgluedemo."raw"
GROUP BY deviceid
ORDER BY num_events DESC
LIMIT 20;
Events by Device ID
SELECT uuid, devicets, deviceid, temp
FROM awsblogsgluedemo."raw"
WHERE deviceid = 1
ORDER BY devicets DESC;
“daily-agg” table with daily aggregation
KPI - Overall device daily inefficiency
SELECT (SUM(daily_avg_inefficiency) / COUNT(*)) AS all_device_avg_inefficiency, date
FROM awsblogsgluedemo.daily_avg_inefficiency
GROUP BY date;
“result” table
Top 10 most inefficient devices - event-level granularity
SELECT col0 AS "uuid", col1 AS "deviceid", col2 AS "devicets",
col3 AS "temp", col4 AS "settemp", col5 AS "pct_inefficiency"
FROM awsblogsgluedemo.results
ORDER BY pct_inefficiency DESC
LIMIT 10;
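To try the shape of the "most active devices" query without an AWS account, an in-memory SQLite table can stand in for the raw table. SQLite is not Athena and the SQL dialects differ, so this is only a local illustration; the rows are made-up sample data and the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "raw" (uuid TEXT, devicets TEXT, deviceid INTEGER, temp INTEGER)')

# Made-up sample rows for illustration.
rows = [
    ("a", "2018-03-01 10:00:00", 1, 70),
    ("b", "2018-03-01 10:01:00", 1, 71),
    ("c", "2018-03-01 10:02:00", 2, 65),
]
conn.executemany('INSERT INTO "raw" VALUES (?, ?, ?, ?)', rows)

# Same grouping and ordering as the slide's query, capped at 20 devices.
top_devices = conn.execute(
    'SELECT deviceid, COUNT(*) AS num_events FROM "raw" '
    "GROUP BY deviceid ORDER BY num_events DESC LIMIT 20"
).fetchall()
print(top_devices)
```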
Overall architecture
Kinesis Firehose
Amazon S3: “raw-time-series”
Amazon S3: “daily-average”
Kinesis Analytics, Kinesis Firehose
Amazon S3: “results”
Amazon Athena
Characteristics
✓ Scale to hundreds of thousands of data sources
✓ Virtually infinite storage scalability
✓ Real-time and batch processing layers
✓ Interactive queries
✓ Highly available and durable
✓ Pay only for what you use
✓ No servers to manage
Pop-up Loft
aws.amazon.com/activate
Everything and Anything Startups Need to Get Started on AWS

OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Preparing Data for the Lake: Data Analytics Week SF

  • 1. Pop-up Loft. Preparing Data for the Lake. John Mallory, johmallo@amazon.com. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 2. Characteristics of a data lake: future proof; flexible access; dive in anywhere; collect anything.
  • 3. Amazon S3 as the data lake
  • 4. Simplified architectural view. Diagram labels: Data sources (transactions, web logs / cookies, ERP, connected devices); Ingestion mechanism; Amazon S3; Process; Consume.
  • 5. There are lots of ingestion tools. Diagram labels: Data sources (transactions, web logs / cookies, ERP, connected devices); S3 Transfer Acceleration; Amazon S3; Process; Consume.
  • 6. Variety of data processing tools. Diagram labels: Data sources (transactions, web logs / cookies, ERP, connected devices); S3 Transfer Acceleration; Amazon S3; Amazon AI (ML/DL services); Amazon Athena (interactive query); Amazon EMR (managed Hadoop & Spark); Amazon Redshift + Spectrum (petabyte-scale data warehousing); Amazon Elasticsearch (real-time log analytics & search); Consume.
  • 7. And multiple ways to consume the data. Diagram labels: Data sources (transactions, web logs / cookies, ERP, connected devices); S3 Transfer Acceleration; Amazon S3; Amazon AI (ML/DL services); Amazon Athena (interactive query); Amazon EMR (managed Hadoop & Spark); Amazon Redshift + Spectrum (petabyte-scale data warehousing); Amazon Elasticsearch (real-time log analytics & search); Amazon QuickSight (fast, easy to use, cloud BI); Analytic notebooks (Jupyter, Zeppelin, HUE); Amazon API Gateway (programmatic access).
  • 8. Because data is never perfect. Tools: AWS Lambda (trigger-based code execution); AWS Glue (event-based serverless ETL engine); Amazon EMR (Spark and Hive running on EMR). Tasks: clean; transform; concatenate; convert to better formats; schedule transformations; event-driven transformations; transformations expressed as code.
  • 9. ETL when you need it. Diagram labels as on slide 7 (data sources, S3 Transfer Acceleration, Amazon S3, processing services, QuickSight, analytic notebooks, API Gateway).
  • 10. Metadata? AWS Glue Data Catalog: the central metadata catalog for the data lake. One per account. Allows you to share metadata between Amazon Athena, Amazon Redshift Spectrum, EMR, and JDBC sources. We added a few extensions: search over metadata for data discovery; connection info (JDBC URLs, credentials); classification for identifying and parsing files; versioning of table metadata as schemas evolve and other metadata are updated.
  • 11. AWS Glue Data Catalog crawlers: helping catalog your data. Crawlers automatically build your Data Catalog and keep it in sync. They automatically discover new data and extract schema definitions; detect schema changes and version tables; and detect Hive-style partitions on Amazon S3. Built-in classifiers cover popular types; custom classifiers use Grok expressions. Run ad hoc or on a schedule; serverless, so you pay only when the crawler runs.
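To make "Hive-style partitions" concrete: the convention encodes partition columns as `name=value` directory segments in the object key. The sketch below only illustrates that naming convention; it is not how Glue crawlers are implemented, and the function name and the sample key are hypothetical.

```python
from typing import Dict


def parse_hive_partitions(s3_key: str) -> Dict[str, str]:
    """Return partition column -> value for every `name=value` path segment."""
    partitions: Dict[str, str] = {}
    # The last segment is the object name, so only directory segments are inspected.
    for segment in s3_key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


# Hypothetical key under a "raw-time-series" prefix, partitioned by date.
key = "raw-time-series/year=2018/month=06/day=14/part-0000.json"
print(parse_hive_partitions(key))
```

A key without any `name=value` segments yields an empty mapping, which corresponds to an unpartitioned table.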
  • 12. AWS Glue Data Catalog
  • 13. Data Catalog: Table Details. Table schema; table properties; data statistics; nested fields.
  • 14. Data Catalog: Version Control. List of table versions; compare schema versions.
  • 15. Automatic Partition Detection. Table partitions; automatically register available partitions.
  • 16. A central metadata store for your lake. Diagram labels as on slide 7, plus AWS Glue Data Catalog (Hive-compatible metastore).
  • 17. Real-time (in-stream processing). Diagram labels as on slide 7, plus AWS Glue Data Catalog (Hive-compatible metastore), Spark Streaming & Flink on EMR, and Amazon Kinesis Analytics.
  • 18. Write once, catalog once, read multiple, ETL anywhere. Diagram labels as on slide 7, plus AWS Glue Data Catalog (Hive-compatible metastore), Spark Streaming & Flink on EMR, and Amazon Kinesis Analytics.
  • 19. Characteristics of a data lake: future proof; flexible access; dive in anywhere; collect anything.
  • 20. Let’s take an example: a sensor/IoT device producing record-level data. Business questions: 1. What is going on with a specific sensor? 2. Daily aggregations (device, inefficiencies, average temperature). 3. A real-time view of how many sensors are showing inefficiencies. Operations: 1. Scale. 2. High availability. 3. Less management overhead. 4. Pay for what I need.
  • 21. Let’s push this data into Kinesis. Diagram labels: Kinesis Firehose; Amazon S3; Amazon Athena.
  • 23. Querying it in Amazon Athena. Either create a crawler to auto-generate the schema, or write the DDL yourself through the Athena console, API, or JDBC/ODBC driver. Then start querying the data.
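As a rough illustration of the "write the DDL yourself" option, the sketch below builds a `CREATE EXTERNAL TABLE` statement for the raw sensor table. The column names are taken from the example queries later in this deck; the column types, the JSON SerDe choice, the helper name, and the bucket path are all assumptions made for illustration, not details confirmed by the slides.

```python
from typing import List, Tuple

# Column names come from the deck's example queries; the types are assumptions.
RAW_COLUMNS: List[Tuple[str, str]] = [
    ("uuid", "string"),
    ("devicets", "string"),
    ("deviceid", "int"),
    ("temp", "int"),
]


def build_raw_table_ddl(database: str, table: str, s3_location: str) -> str:
    """Build Athena DDL for an external table over JSON files (assumed format)."""
    column_sql = ",\n".join(f"  `{name}` {dtype}" for name, dtype in RAW_COLUMNS)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"{column_sql}\n"
        ")\n"
        # Assumes the records were delivered to S3 as JSON.
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{s3_location}'"
    )


# Hypothetical bucket and prefix.
ddl = build_raw_table_ddl(
    "awsblogsgluedemo", "raw", "s3://example-bucket/raw-time-series/"
)
print(ddl)
```

The resulting string could be pasted into the Athena console or submitted through the API or a JDBC/ODBC connection; a crawler would produce an equivalent table definition without hand-written DDL.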
  • 24. Query daily aggregates in Amazon Athena. Diagram labels: Kinesis Firehose; Amazon S3 “raw-time-series”; Amazon Athena; Amazon S3 “daily-average”.
  • 25. AWS Glue job: serverless, event-driven execution. Data is written out to S3. The output table is automatically created in Amazon Athena.
  • 26. Query daily aggregates in Amazon Athena. Diagram labels: Kinesis Firehose; Amazon S3 “raw-time-series”; Amazon Athena; Amazon S3 “daily-average”.
  • 27. Kinesis Analytics for in-stream analytics. Diagram labels: Kinesis Firehose; Amazon S3 “raw-time-series”; Amazon Athena; Amazon S3 “daily-average”; Kinesis Analytics; Kinesis Firehose; Amazon S3 “results”.
  • 28. “daily-agg” table with daily aggregation. KPI: overall device daily inefficiency:
      SELECT ( SUM(daily_avg_inefficiency)/COUNT(*) ) AS all_device_avg_inefficiency, date FROM awsblogsgluedemo.daily_avg_inefficiency GROUP BY date;
    “result” table. Top 10 most inefficient devices (event-level granularity):
      SELECT col0 AS "uuid", col1 AS "deviceid", col2 AS "devicets", col3 AS "temp", col4 AS "settemp", col5 AS "pct_inefficiency" FROM awsblogsgluedemo.results ORDER BY pct_inefficiency DESC LIMIT 10;
    “raw” table with raw data. Top 20 most active devices:
      SELECT deviceid, COUNT(*) AS num_events FROM awsblogsgluedemo."raw" GROUP BY deviceid ORDER BY num_events DESC LIMIT 20;
    Events by device ID:
      SELECT uuid, devicets, deviceid, temp FROM awsblogsgluedemo."raw" WHERE deviceid = 1 ORDER BY devicets DESC;
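The KPI query on this slide divides the sum of per-device daily averages by the number of rows for each date, which is a plain mean per day. The sketch below mirrors that arithmetic in Python so it can be checked by hand; the function name and the sample rows are made up for illustration and are not data from the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def overall_daily_inefficiency(rows: Iterable[Mapping]) -> Dict[str, float]:
    """Mirror SUM(daily_avg_inefficiency)/COUNT(*) ... GROUP BY date."""
    totals = defaultdict(lambda: [0.0, 0])  # date -> [running sum, row count]
    for row in rows:
        acc = totals[row["date"]]
        acc[0] += row["daily_avg_inefficiency"]
        acc[1] += 1
    return {date: total / count for date, (total, count) in totals.items()}


# Hypothetical rows shaped like the daily aggregation table.
sample_rows = [
    {"deviceid": 1, "date": "2018-06-14", "daily_avg_inefficiency": 10.0},
    {"deviceid": 2, "date": "2018-06-14", "daily_avg_inefficiency": 20.0},
    {"deviceid": 1, "date": "2018-06-15", "daily_avg_inefficiency": 5.0},
]
kpi = overall_daily_inefficiency(sample_rows)
print(kpi)
```

Note that this is an unweighted mean of per-device averages, so a device with few events counts as much as a device with many.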
  • 29. Overall architecture. Diagram labels: Kinesis Firehose; Amazon S3 “raw-time-series”; Amazon Athena; Amazon S3 “daily-average”; Kinesis Analytics; Kinesis Firehose; Amazon S3 “results”.
  • 30. Characteristics: scale to hundreds of thousands of data sources; virtually infinite storage scalability; real-time and batch processing layers; interactive queries; highly available and durable; pay only for what you use; no servers to manage.
  • 31. Pop-up Loft. aws.amazon.com/activate. Everything and Anything Startups Need to Get Started on AWS.