"Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day Kyiv 2019

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
KYIV
06.11.19
Building a Modern Data
Platform in the Cloud
Alex Casalboni
Sr. Technical Evangelist
Amazon Web Services
@alex_casalboni

About me
• Software Engineer & Web Developer
• Worked in a startup for 4.5 years
• ServerlessDays Organizer
• AWS Customer since 2013

bit.ly/AWSDataLakeDemo

To Become a Leader, Data is Your Differentiator
Organizations that successfully generate business value from their data will outperform their peers. An Aberdeen survey saw organizations that implemented a Data Lake outperforming similar companies by 9% in organic revenue growth.*
Organic revenue growth: Leaders 24%, Followers 15%
*Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence

Data variety and data volumes are increasing rapidly
Multiple Consumers and Applications
Ingest
Discover
Catalog
Understand
Curate
Find insights

Purpose-built
engines
Right tool for the job

Collect, Store, Analyze
• Amazon Kinesis Firehose
• AWS Direct Connect
• AWS Snowball
• Amazon Kinesis Analytics
• Amazon Kinesis Streams
• Amazon S3
• Amazon Glacier
• Amazon CloudSearch
• Amazon RDS, Amazon Aurora
• Amazon DynamoDB
• Amazon Elasticsearch
• Amazon EMR
• Amazon Redshift
• Amazon QuickSight
• AWS Database Migration Service
• AWS Glue
• Amazon Athena
• Amazon SageMaker

Traditionally, Analytics Used to Look Like This
OLTP, ERP, CRM, LOB
Data Warehouse
Business Intelligence
• Relational data
• TBs–PBs scale
• Schema defined prior to data load
• Operational reporting and ad hoc
• Large initial CAPEX + $10K–$50K/TB/Year

“A data lake is a centralized repository that
allows you to store all your structured and
unstructured data at any scale”

Collect analyze
semi-structured unstructured
Decoupled
ingestion
on-read
warehouses

exabyte scale
once
many tools
Open formats

S3
Catalog & search: Elasticsearch, Glue, DynamoDB
API/UI: Cognito, API Gateway
Analytics & processing: Athena, QuickSight, Redshift Spectrum
Ingest: Lambda, Kinesis Streams, Kinesis Firehose, Direct Connect, AWS IoT
Security & auditing: KMS, CloudTrail, IAM, Macie

Fortnite | 125+ million players
CHALLENGE
• Need to create a constant feedback loop for designers
• Gain up-to-the-minute understanding of gamer satisfaction to guarantee gamers are engaged, thus resulting in the most popular game played in the world

time
• Kinesis Video Streams: Capture, process, and store video streams for analytics
• Kinesis Data Streams: Build custom applications that analyze data streams
• Kinesis Data Firehose: Load data streams into AWS data stores
• Kinesis Data Analytics: Analyze data streams with SQL

Kinesis Firehose: Delivery stream
Kinesis Agent, Record producers: Put Records
AWS Lambda: Transformations & enrichment (Raw to Transformed)
Amazon S3: Source record backup
Transformed records:
• Amazon S3: Buffered files
• Amazon Redshift: Table loads
• Amazon Elasticsearch Service: Domain loads
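
The Lambda transformation step in this diagram can be sketched in Python as below. Firehose passes the function a batch of base64-encoded records and expects each one back with the same recordId, a result status, and re-encoded data. The enrichment itself (adding a "hex" field from r, g, b fields) is a hypothetical example chosen for illustration, not something the slides describe.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and return it in the shape Firehose expects."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical enrichment: derive a hex colour from r, g, b.
            payload["hex"] = "#{:02x}{:02x}{:02x}".format(
                payload["r"], payload["g"], payload["b"]
            )
            # Newline-terminate so records stay separable once buffered into S3.
            data = (json.dumps(payload) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data).decode("utf-8"),
            })
        except (ValueError, KeyError, TypeError):
            # Return the record untouched and flag it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Records flagged as ProcessingFailed are not silently dropped by this sketch; they are handed back unchanged so the delivery stream can treat them as failures.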

Open-source standards (Apache)
Parquet, ORC, etc.
Optimize Performance
Optimize Costs
Analytical queries
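
Why columnar formats help analytical queries can be illustrated with a toy sketch in plain Python. It is not Parquet or ORC (both also add compression, encodings, and statistics); it only shows the column-pruning idea, using made-up events and JSON text length as a stand-in for bytes scanned.

```python
import json

# Hypothetical sample events, for illustration only.
events = [
    {"c": "Red", "region": "Europe", "browser": "Chrome"},
    {"c": "Blue", "region": "Asia", "browser": "Firefox"},
    {"c": "Red", "region": "Europe", "browser": "Safari"},
]

# Row layout: one JSON document per record; any query reads every byte.
row_bytes = sum(len(json.dumps(e)) for e in events)

# Column layout: one array per field; a query reads only the fields it needs.
columns = {key: [e[key] for e in events] for key in events[0]}
column_bytes = {key: len(json.dumps(values)) for key, values in columns.items()}

# "How many Red events?" touches only column "c".
red_count = columns["c"].count("Red")
scanned_fraction = column_bytes["c"] / row_bytes
```

With a pay-per-data-scanned query engine, reading a fraction of the bytes lowers both query time and query cost, which is the "Optimize Performance" and "Optimize Costs" point of the slide.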

Storing is Not Enough, Data Needs to Be Discoverable
“Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).”
Gartner IT Glossary, 2018
https://www.gartner.com/it-glossary/dark-data
CRM, ERP, Data warehouse, Mainframe data, Web, Social, Log files, Machine data
Semi-structured, Unstructured

Chart categories: Building training sets, Cleaning and organizing data, Collecting data sets, Mining data for patterns, Refining algorithms, Other
Highlighted value: 80%

Data Catalog: Discover data and extract schema
ETL Job authoring: Auto-generates customizable ETL code in Python and Spark
• Data & schema automatic discovery
• Generates customizable code for ETL
• Schedule and run ETL jobs periodically
• Serverless model

AWS Glue Crawlers
Automatically catalog your data
Crawlers automatically build your data catalog and keep it in sync
• Automatically discover new data & extract schema definitions
• Detect schema changes and version tables
• Detect Hive style partitions on Amazon S3
• Built-in classifiers for popular types; custom classifiers using Grok expressions
• Run ad hoc or on a schedule; serverless: pay only when the crawler runs
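
To make "Hive style partitions" concrete: they are key=value folder names embedded in the object path. The sketch below is a plain-Python illustration of that naming convention, not the crawler's actual implementation; the function name and the example key are hypothetical.

```python
def parse_hive_partitions(key):
    """Return the partition columns encoded in an object key as a dict."""
    partitions = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


# Hypothetical key layout for the demo events.
example = parse_hive_partitions("events/year=2019/month=4/day=17/part-0000.json")
```

Here `example` holds year, month, and day as partition columns, which is the kind of structure a query engine can use to skip whole folders when a query filters on those columns.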

AWS Lake Formation (join the preview)
Build, secure, and manage a data lake in days
• Build a data lake in days, not months: Build and deploy a fully managed data lake with a few clicks
• Enforce security policies across multiple services: Centrally define security, governance, and auditing policies in one place and enforce those policies for all users and all applications
• Combine different analytics approaches: Empower analyst and data scientist productivity, giving them self-service discovery and safe access to all data from a single catalog

Processing and Querying In Place
Fully Managed Process & Query: AWS Glue, Amazon Athena, Amazon Redshift, Amazon SageMaker
User-Defined Functions: AWS Lambda
• Bring your own functions & code
• Execute without provisioning servers

Query S3 using standard SQL (Presto as distributed engine)
Serverless - No infrastructure to set up or manage
Multiple data format support – Define Schema on Demand
Query instantly • Pay per query • Open • Easy

Data scanned: 169.53GB (of 2.2TB)
Query duration: 44.66 seconds
Cost: $0.85
($5/TB or $0.005/GB)
SELECT gram, year, sum(count)
FROM ngram
WHERE gram = 'just say no'
GROUP BY gram, year
ORDER BY year ASC;
registry.opendata.aws/google-ngrams
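
The cost figure above follows directly from the amount of data scanned. A minimal sketch of the arithmetic, assuming decimal units (1 TB = 1000 GB, which is what the slide's $0.005/GB implies) and a hypothetical helper name:

```python
PRICE_PER_TB_USD = 5.00  # rate quoted on the slide
GB_PER_TB = 1000         # assumption: decimal units, matching $0.005/GB


def athena_query_cost(gb_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Estimate the cost in USD of a query from the data it scanned."""
    return gb_scanned / GB_PER_TB * price_per_tb


cost = athena_query_cost(169.53)          # the query on the slide
full_scan_cost = athena_query_cost(2200)  # scanning the whole 2.2 TB data set
```

Rounded to cents, `cost` is $0.85, matching the slide, while a full scan of the 2.2 TB data set would come to about $11. The query read roughly 8% of the data, which is where partitioning and columnar formats pay off.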

Amazon QuickSight
easy
Empower
everyone
Seamless
connectivity
Fast analysis Serverless

JSON Payload Example for each event
{
  "r": 255,
  "g": 0,
  "b": 0,
  "c": "Red",
  "device": {
    "id": "4992157",
    "browser": "Chrome",
    "browserVersion": "72.0.3626.109",
    "os": "Mac OS",
    "isMobile": false,
    "isMobileIOS": false,
    "isMobileAndroid": false
  },
  "dt": {
    "year": 2019,
    "month": 4,
    "day": 17,
    "hour": 16,
    "minutes": 30,
    "seconds": 47,
    "millis": 725
  },
  "id": 1551116627725,
  "region": "Europe",
  "awsExperience": "1-3 Years",
  "awsServiceArea": "Management Tools"
}
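
A payload like this has to be serialized before a producer hands it to a Firehose delivery stream. The sketch below is one possible way to do that in Python (the demo itself appears to send events from the client through the AWS SDK; this helper and its name are assumptions). It writes each event as a single newline-terminated JSON line, since Firehose does not add delimiters between records when it buffers them into S3 objects.

```python
import json


def to_firehose_record(event):
    """Serialize one event as a newline-terminated JSON line."""
    line = json.dumps(event, separators=(",", ":")) + "\n"
    # The {"Data": bytes} shape matches the Record argument of PutRecord.
    return {"Data": line.encode("utf-8")}


record = to_firehose_record({"r": 255, "g": 0, "b": 0, "c": "Red"})
```

With boto3, such a record could be sent with `boto3.client("firehose").put_record(DeliveryStreamName="my-stream", Record=record)`, where "my-stream" is a placeholder name.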

Demo Architecture
Users: Client, Mobile client (AWS SDK)
Web App: Amazon CloudFront, Amazon Cognito, Amazon S3
Amazon Kinesis Data Firehose
S3 Bucket
AWS Glue
Amazon Athena
Amazon QuickSight
AWS Cloud, Region

Thank you!
Alex Casalboni
@alex_casalboni
