How AWS saved my first data lake journey
− A story of a college student in the Korean startup scene −
Presented by Hyun Joong Kim
ABOUT ME
Hyun Joong Kim
- Member of KRUG / AUSG
- Senior at Hanyang University
  - Department of Information Systems
- Former intern at MyMusicTaste
  - Data team
- Former intern at eBay Korea
  - Data Platform team
AGENDA
1. About AUSG
2. MyMusicTaste
3. Data Lake
4. Retro
5. Closing
About AUSG
AWSKRUG University Student Group
Who are we?
AUSG Members
Studies
Services
Seminars
Amathon, the hackathon
MyMusicTaste
MyMusicTaste(1)
MyMusicTaste(2)
MyMusicTaste(3)
MyMusicTaste(4)
Demo
Data Lake
Background
Motivation
• In-development: lighter, modular, fewer barriers
  • System-wide dependencies → per-service dependencies
  • Decoupled parallel development/deployment
  • More cool stuff!
• In-production: lighter, scalable, fault tolerant
  • Single point of failure → distributed fault tolerance
  • Whole-application auto-scaling → scale individual services as needed
Motivation
• Diversified data sources
  • Elasticsearch
  • DynamoDB
  • PostgreSQL
  • S3
  • Redis
  • …
…and more!
Use Cases & Characteristics
• Must be able to ingest and store all types of data
  • Internal: transactional data, application logs, operational data
  • External: programmatically and manually extracted metadata
• [Close-to-]real-time representation (for platform)
• Must be mapped to a [loose] schema and be queryable
• Must be a source for further ETL and analytics operations
• Must be fully production-ready: security, testing, logging, etc.
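As a rough illustration of the "[loose] schema" requirement, the sketch below coerces records from different sources onto a small set of known columns and keeps everything else in a catch-all field. The column names, the `extras` field, and the `to_loose_row` helper are all hypothetical; the slides do not show the actual schema used at MyMusicTaste.

```python
from typing import Any, Callable, Dict

# Hypothetical loose schema: known column -> function used to coerce its value.
LOOSE_SCHEMA: Dict[str, Callable[[Any], Any]] = {
    "event_id": str,
    "user_id": int,
    "amount": float,
}


def to_loose_row(record: Dict[str, Any], source: str) -> Dict[str, Any]:
    """Map an arbitrary source record onto the loose schema.

    Known columns are coerced to their declared type (or set to None when
    missing or not convertible); unknown keys are preserved under "extras"
    so no ingested data is lost.
    """
    row: Dict[str, Any] = {}
    for column, cast in LOOSE_SCHEMA.items():
        value = record.get(column)
        try:
            row[column] = cast(value) if value is not None else None
        except (TypeError, ValueError):
            row[column] = None
    row["source"] = source
    row["extras"] = {k: v for k, v in record.items() if k not in LOOSE_SCHEMA}
    return row


example_row = to_loose_row({"event_id": 7, "user_id": "42", "venue": "Seoul"}, "postgres")
```

Keeping unknown keys instead of rejecting the record is what makes the schema "loose": new source fields do not break ingestion, and they can be promoted to real columns later.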
Data Lake @ MMT
Overall Architecture
Data Lake Query Layer
Amazon S3 (storage for streams, snapshots, raw, pre- & post-transformed, and final query layer data)
AWS Glue Data Catalog (Hive metastore)
Amazon Athena (Presto SQL queries)
Periscope Data (dashboards, charts, visualizations)
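A minimal sketch of how a query against this layer might be submitted from Python with boto3's Athena client. The database name, table name, and results bucket are placeholders, not values from the talk, and `run_query` needs AWS credentials to actually execute.

```python
from typing import Any, Dict


def build_athena_request(sql: str, database: str, output_s3: str) -> Dict[str, Any]:
    """Build the keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        # The database is resolved through the Glue Data Catalog.
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_s3},
    }


def run_query(sql: str, database: str, output_s3: str) -> str:
    """Submit the query and return its execution id (requires AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        **build_athena_request(sql, database, output_s3)
    )
    return response["QueryExecutionId"]


# Placeholder names for illustration only.
request = build_athena_request(
    "SELECT * FROM events LIMIT 10",
    database="datalake",
    output_s3="s3://example-athena-results/",
)
```

Athena runs queries asynchronously, so a real caller would poll `get_query_execution` with the returned id before reading results.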
So what is this 'Glue'?
Architecture at a Glance
Retros
Retro: What went well
• Glue catalog & HiveQL
• Automated schema discovery with Glue crawlers
• Read & write to S3 + Parquet
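One common way to lay out Parquet files on S3 so that Glue crawlers can pick up partition columns is Hive-style `key=value` prefixes. The sketch below builds such object keys; the table name, file name, and date-based partitioning are assumptions for illustration, since the slides do not describe the actual layout.

```python
from datetime import date


def partitioned_key(table: str, day: date, filename: str) -> str:
    """Return a Hive-style partitioned S3 object key for a Parquet file.

    Prefixes such as year=2019/month=01/day=05 can be registered by a
    crawler as partition columns named year, month, and day.
    """
    return (
        f"{table}/year={day.year}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


# Hypothetical table and file names.
example_key = partitioned_key("events", date(2019, 1, 5), "part-0000.parquet")
```

Queries that filter on the partition columns then scan only the matching prefixes, which matters for Athena because it bills by data scanned.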
Retro: What could be improved
• Glue :(
  • Bookmarks are black-boxed and show some non-deterministic behavior
  • Development and maintenance of Glue scripts is clunky
  • Cost!!! … minimal monthly cost for running one job every 30 mins:
      2 DPUs       5 DPUs       10 DPUs
      $211.20048   $528.00048   $1055.99952
• Streaming from Aurora PostgreSQL
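The monthly figures quoted for one job every 30 minutes can be reproduced to within a cent under the following assumptions, none of which appear on the slide: a price of $0.44 per DPU-hour, a 10-minute minimum billed duration per run, and a 30-day month. This is a sketch of one plausible calculation, not necessarily how the numbers were originally derived.

```python
# Assumptions (not stated on the slide):
PRICE_PER_DPU_HOUR = 0.44   # USD per DPU-hour
MIN_BILLED_MINUTES = 10     # minimum billed duration per job run
DAYS_PER_MONTH = 30


def monthly_glue_cost(dpus: int, interval_minutes: int = 30) -> float:
    """Minimum monthly cost of one job run every `interval_minutes`."""
    runs_per_month = (24 * 60 // interval_minutes) * DAYS_PER_MONTH
    billed_hours = runs_per_month * MIN_BILLED_MINUTES / 60
    return dpus * billed_hours * PRICE_PER_DPU_HOUR


for dpus in (2, 5, 10):
    print(f"{dpus:>2} DPUs: ${monthly_glue_cost(dpus):,.2f}")
```

Under these assumptions a 30-minute schedule means 1,440 runs and 240 billed hours per month, so the cost is a floor set by the minimum billing unit and scales linearly with the DPU count, however little work each run does.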
Future Work: Spark + EMR
• Control the scaling & lifecycle of our EMR resources, and reduce cost drastically WHILE increasing load
• Faster development
• Better management options
Future Work: E[CWL]K Framework
Future Work: Warehousing & Analytics
Closing
SERVICE ISSUES
- Exposure to services for college students
  - Seminars and hands-on labs that helped
- Exposure to architectures and services
  - Not a 100% fit, but it helped me learn where the pieces belong
- Slack page for KRUG
  - Community members who are interested in such projects
  - Active Q&A that helped when issues came up
Personal thoughts
- Would love to have more members join
- Make a bigger pool of enthusiastic students
- AWS is great, but there is no magic
- A bigger network of people who are interested in technology in general
- Study the core of how things inside AWS work the way they do
Reference
1. https://datafloq.com/read/what-is-a-data-lake-what-are-the-benefits/2589
2. https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
3. https://docs.aws.amazon.com/aws-technical-content/latest/building-data-lakes/amazon-s3-data-lake-storage-platform.html
4. https://aws.amazon.com/glue/
5. http://calculator.s3.amazonaws.com/index.html
www.mymusictaste.com
Special Thanks To Paul Elliot,
data lead at MyMusicTaste