AWS Dev Day Kyiv 2019
Track: Analytics & Machine Learning
Session: "Building a Modern Data platform in the Cloud"
Speaker: Alex Casalboni, AWS Technical Evangelist
Level: 300
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
Video: https://youtu.be/HIDnAG9AxZo
4. Organizations that successfully generate business value from their data will outperform their peers. An Aberdeen survey saw organizations that implemented a data lake outperforming similar companies by 9% in organic revenue growth.*
Organic revenue growth: Leaders 24%, Followers 15%
*Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence
To Become a Leader, Data is Your Differentiator
5. Data variety and data volumes are increasing rapidly
Multiple Consumers and Applications
Ingest
Discover
Catalog
Understand
Curate
Find insights
7. Collect, Store, Analyze
Amazon Kinesis Firehose
AWS Direct Connect
AWS Snowball
Amazon Kinesis Analytics
Amazon Kinesis Streams
Amazon S3
Amazon Glacier
Amazon CloudSearch
Amazon RDS, Amazon Aurora
Amazon DynamoDB
Amazon Elasticsearch Service
Amazon EMR
Amazon Redshift
Amazon QuickSight
AWS Database Migration Service
AWS Glue
Amazon Athena
Amazon SageMaker
8. Traditionally, Analytics Used to Look Like This
OLTP, ERP, CRM, LOB
Data Warehouse
Business Intelligence
• Relational data
• TBs–PBs scale
• Schema defined prior to data load
• Operational reporting and ad hoc
• Large initial CAPEX + $10K–$50K/TB/Year
10. “A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale”
14. CHALLENGE
Need to create a constant feedback loop for designers
Gain up-to-the-minute understanding of gamer satisfaction to guarantee gamers are engaged, thus resulting in the most popular game played in the world
Fortnite | 125+ million players
16. time
Kinesis Video Streams: Capture, process, and store video streams for analytics
Kinesis Data Streams: Build custom applications that analyze data streams
Kinesis Data Firehose: Load data streams into AWS data stores
Kinesis Data Analytics: Analyze data streams with SQL
17. Kinesis Agent, Record producers
Put Records
Kinesis Firehose: Delivery stream
Transformed records
Amazon S3: Buffered files
Amazon Redshift: Table loads
Amazon Elasticsearch Service: Domain loads
Amazon S3: Source record backup
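The "Put Records" step above can be sketched from the producer side. This is a minimal illustration, not the speaker's code: the helper names (`to_firehose_records`, `chunk`, `send`) and the newline-delimited JSON encoding are assumptions made for the example. The 500-record cap is the documented PutRecordBatch limit; the per-request size limit is not handled here.

```python
import json
from typing import Dict, Iterable, Iterator, List

# PutRecordBatch accepts at most 500 records per call.
MAX_RECORDS_PER_BATCH = 500


def to_firehose_records(events: Iterable[dict]) -> List[Dict[str, bytes]]:
    """Serialize each event as one newline-terminated JSON record."""
    return [{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events]


def chunk(records: List[dict], size: int = MAX_RECORDS_PER_BATCH) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def send(events: Iterable[dict], stream_name: str) -> int:
    """Send events to a delivery stream; returns how many records were rejected.

    `stream_name` is a hypothetical delivery stream that must already exist.
    """
    import boto3  # imported lazily so the helpers above work without AWS access

    firehose = boto3.client("firehose")
    failed = 0
    for batch in chunk(to_firehose_records(events)):
        response = firehose.put_record_batch(
            DeliveryStreamName=stream_name, Records=batch
        )
        # Rejected records are only counted here; a real producer would retry them.
        failed += response["FailedPutCount"]
    return failed
```

Terminating each record with a newline keeps the buffered files that Firehose writes to S3 readable line by line.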
18. Kinesis Agent, Record producers
Put Records
Kinesis Firehose: Delivery stream
AWS Lambda: Transformations & enrichment (Raw to Transformed)
Transformed records
Amazon S3: Buffered files
Amazon Redshift: Table loads
Amazon Elasticsearch Service: Domain loads
Amazon S3: Source record backup
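The Lambda transformation step can be sketched as follows. Firehose hands the function a batch of base64-encoded records and expects each one back with the same `recordId`, a `result` of `Ok`, `Dropped`, or `ProcessingFailed`, and the re-encoded `data`. The enrichment itself (adding a `processed` flag) and the assumption that records are JSON objects are hypothetical choices for illustration.

```python
import base64
import json


def lambda_handler(event, context):
    """Firehose data-transformation handler: raw records in, enriched records out."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            # Hand the record back unchanged and mark it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue

        payload["processed"] = True  # hypothetical enrichment step
        encoded = base64.b64encode(
            (json.dumps(payload) + "\n").encode("utf-8")
        ).decode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": encoded,
        })
    return {"records": output}
```

Records that fail transformation are returned as-is with `ProcessingFailed`, so one malformed record does not fail the rest of the batch.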
22. Storing is Not Enough, Data Needs to Be Discoverable
“Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).”
Gartner IT Glossary, 2018
https://www.gartner.com/it-glossary/dark-data
CRM, ERP, Data warehouse, Mainframe data, Web, Social, Log files, Machine data, Semi-structured, Unstructured
23. Building training sets
Cleaning and organizing data
Collecting data sets
Mining data for patterns
Refining algorithms
Other
80%
24. &
Data Catalog: Discover data and extract schema
ETL Job authoring: Auto-generates customizable ETL code in Python and Spark
Data & schema automatic discovery
Generates customizable code for ETL
Schedule and run ETL jobs periodically
Serverless model
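To make "discover data and extract schema" concrete, here is a toy schema inference over newline-delimited JSON. It only illustrates the idea and is not how AWS Glue is implemented; the function names, the type labels, and the merge rule are all assumptions made for this sketch.

```python
import json
from typing import Dict, Iterable


def infer_type(value) -> str:
    """Map a parsed JSON value to an illustrative column type label."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "struct"
    return "string"


def merge_types(a: str, b: str) -> str:
    """Reconcile two observed types for the same column."""
    if a == b:
        return a
    if {a, b} == {"bigint", "double"}:
        return "double"  # widen integers to floating point
    return "string"      # fall back to string on any other conflict


def infer_schema(lines: Iterable[str]) -> Dict[str, str]:
    """Scan JSON-lines records and return a column-to-type mapping."""
    schema: Dict[str, str] = {}
    for line in lines:
        if not line.strip():
            continue
        for key, value in json.loads(line).items():
            if value is None:
                continue  # nulls carry no type information
            observed = infer_type(value)
            schema[key] = merge_types(schema[key], observed) if key in schema else observed
    return schema
```

For example, a file where `score` is `10` in one record and `12.5` in another ends up with `score` typed as `double`.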
26. AWS Lake Formation (join the preview)
Build, secure, and manage a data lake in days
Build a data lake in days, not months: Build and deploy a fully managed data lake with a few clicks
Enforce security policies across multiple services: Centrally define security, governance, and auditing policies in one place and enforce those policies for all users and all applications
Combine different analytics approaches: Empower analyst and data scientist productivity, giving them self-service discovery and safe access to all data from a single catalog
28. Processing and Querying In Place
Fully Managed Process & Query: AWS Glue, Amazon Athena, Amazon Redshift, Amazon SageMaker
User-Defined Functions: AWS Lambda
• Bring your own functions & code
• Execute without provisioning servers
29. Query S3 using standard SQL (Presto as distributed engine)
Serverless - No infrastructure to set up or manage
Multiple data format support – Define Schema on Demand
Query instantly, Pay per query, Open, Easy
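Querying S3 through Athena from code might look like the sketch below, which uses the boto3 Athena client. The database name, the results bucket, and the helper names are hypothetical; the slides do not show how the speaker ran queries.

```python
import time
from typing import Dict


def build_query_request(sql: str, database: str, output_location: str) -> Dict:
    """Assemble the keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes result files to this S3 prefix.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(sql: str, database: str, output_location: str,
              poll_seconds: float = 1.0) -> Dict:
    """Start a query and poll until it reaches a terminal state."""
    import boto3  # imported lazily so build_query_request works without AWS access

    athena = boto3.client("athena")
    request = build_query_request(sql, database, output_location)
    query_id = athena.start_query_execution(**request)["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(
            QueryExecutionId=query_id
        )["QueryExecution"]
        if execution["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return execution
        time.sleep(poll_seconds)
```

A call such as `run_query("SELECT 1", "example_db", "s3://example-bucket/athena-results/")` returns the execution details once the query finishes, including the amount of data scanned under `Statistics`.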
31. Data scanned: 169.53GB (of 2.2TB)
Query duration: 44.66 seconds
Cost: $0.85
($5/TB or $0.005/GB)
SELECT gram, year, sum(count)
FROM ngram
WHERE gram = 'just say no'
GROUP BY gram, year
ORDER BY year ASC;
registry.opendata.aws/google-ngrams
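The cost figure in the query statistics above follows directly from the per-GB rate quoted on the slide. A short check, using only the slide's numbers (the function name is an assumption for the example):

```python
PRICE_PER_GB_USD = 0.005  # rate quoted on the slide: $5/TB or $0.005/GB
PRICE_PER_TB_USD = 5.0


def query_cost_usd(gb_scanned: float, price_per_gb: float = PRICE_PER_GB_USD) -> float:
    """Cost of a pay-per-query scan at the given per-GB rate."""
    return gb_scanned * price_per_gb


# The query scanned 169.53 GB of the 2.2 TB dataset.
cost = query_cost_usd(169.53)
print(f"Query cost: ${cost:.2f}")  # Query cost: $0.85

# For comparison, scanning the full 2.2 TB at the same rate.
full_scan_cost = 2.2 * PRICE_PER_TB_USD
print(f"Full scan:  ${full_scan_cost:.2f}")  # Full scan:  $11.00
```

Because pricing depends on data scanned, anything that lets a query read less of the dataset reduces the bill proportionally.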