Big Data Journey in HK01


Mole WONG
Director, Data Management
HK01 Company Limited

1997 – 2016
From Learning → Teaching
2016 - 2017
Senior Software Engineer
2017
Senior Software Engineer
2017 - Present
Architecting, Research, Development → Management
whoami

Outline
• HK01 – more than a media company
• Data – Ingress → Analytics → M.L.

• A growing HK Internet Company
• Start with Digital Media
• Platformization
• Construct our own infrastructure with OSS

Why:
Understand our users.
Provide actionable insights.
How:
Data driven - to steer our product direction.
What:
Data team & data - definition, ingress, process, insight.

A team of 13: Data Engineers + Data Scientists + Data Analysts + PM

[Diagram: WEB, APP, Tracker API, AWS API Gateway w/ Lambda (in JS), AWS Kinesis Firehose & Analytics, AWS S3, AWS Redshift]
Infrastructure v1.0
AWS Managed Service    Purpose
Kinesis Analytics      Real-time ETL
Kinesis Firehose       Batch write to S3 and Redshift
S3                     Data ground truth
Redshift               Data Warehouse (or Trash bin)

[Diagram: same components as Infrastructure v1.0: WEB, APP, Tracker API, AWS API Gateway w/ Lambda (in JS), AWS Kinesis Firehose & Analytics, AWS S3, AWS Redshift]
Infrastructure v1.0 - Usage
EDA
Data Products
Data Reporting

Infrastructure v1.0 - Problem
$ psql mole-redshift
> connection limit 500 exceeded for non-bootstrap users
$ fO_o
Our team used to call this the “rainbow rangers”

Workload splitting, but…
• Data migration & synchronization
• Scalability
• Cost ($_$)
Infrastructure v1.1 & its problems
[Diagram: WEB, APP, Tracker API, AWS API Gateway w/ Lambda (in JS), AWS Kinesis Firehose & Analytics, AWS S3, AWS Redshift Clusters: 1 Production, 2 Ad-hoc queries, 3 Reporting]

[Diagram: WEB, APP, Tracker API, Realtime Parquet Transformation, AWS Redshift Spectrum]
Using Parquet – Columnar Storage Format
Two-phase transformation:
1. Automatic Parquet transformation with Firehose
2. S3 Object Creation Trigger → Lambda → Python Handler
Infrastructure v2.0
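The second phase (S3 Object Creation Trigger → Lambda → Python Handler) can be sketched as follows. This is a minimal sketch under stated assumptions, not HK01's actual handler: the function names and the returned structure are hypothetical. The `Records[].s3.bucket.name` and `Records[].s3.object.key` fields follow the S3 event notification format, in which object keys arrive URL-encoded.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        # Only react to object-creation events (e.g. ObjectCreated:Put).
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys are URL-encoded in the notification (spaces arrive as '+').
        key = unquote_plus(s3["object"]["key"])
        objects.append((bucket, key))
    return objects


def handler(event, context):
    """Hypothetical Lambda entry point: list the objects that need transforming."""
    targets = extract_s3_objects(event)
    # A real handler would read each object, transform it, and write the
    # result back to S3; that I/O is left out of this sketch.
    return {"processed": [f"s3://{bucket}/{key}" for bucket, key in targets]}
```

Keeping the event parsing in its own pure function makes the handler easy to test locally without any AWS access.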

Parquet Transformation
Btw, we are deploying Lambda using Serverless.
Apache Parquet & Apache Arrow
1. Parquet: Columnar data on disk
2. Arrow: Columnar data in memory
Reference: https://arrow.apache.org/docs/python/parquet.html
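To illustrate why a columnar layout suits analytic queries, the sketch below pivots row-oriented events into columns in plain Python. It is only a conceptual stand-in for what Parquet (on disk) and Arrow (in memory) do, with none of their encoding or compression; the event fields are hypothetical.

```python
def rows_to_columns(rows):
    """Pivot a list of row dicts into a dict of equally long column lists."""
    columns = {}
    for index, row in enumerate(rows):
        for name, value in row.items():
            # A column first seen in a later row is back-filled with None.
            columns.setdefault(name, [None] * index).append(value)
        for values in columns.values():
            # Columns missing from this row get a None placeholder.
            if len(values) <= index:
                values.append(None)
    return columns


# Hypothetical tracker events in row-oriented form.
events = [
    {"user_id": "u1", "article_id": "a1", "duration_s": 30},
    {"user_id": "u2", "article_id": "a1", "duration_s": 45},
    {"user_id": "u1", "article_id": "a2", "duration_s": 10},
]
columns = rows_to_columns(events)
# An aggregate over one field reads a single list instead of every row.
total_duration = sum(columns["duration_s"])
```

A query that needs only `duration_s` never touches the other fields, which is the property a columnar format exploits to cut I/O.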

Parquet Transformation
[Code figures: write and upload to S3; read and use Parquet with pandas]
Lambda Function Handler:
- Enrich, cleanse, or transform data with Pandas
- Write data back to S3
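The enrich/cleanse/transform step of the handler might look like the sketch below. The slides use pandas for this step; plain Python is swapped in here so the example stays dependency-free, and the event schema (`user_id`, `article_id`, `ts`) is an assumption made for illustration.

```python
from datetime import datetime, timezone


def transform_events(events):
    """Cleanse and enrich raw tracker events (hypothetical schema)."""
    cleaned = []
    for event in events:
        # Cleanse: skip events without a user or article identifier.
        if not event.get("user_id") or not event.get("article_id"):
            continue
        # Transform: convert the epoch timestamp to a UTC datetime.
        ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
        cleaned.append({
            "user_id": event["user_id"],
            "article_id": event["article_id"],
            "event_time": ts.isoformat(),
            # Enrich: add a derived date field.
            "event_date": ts.date().isoformat(),
        })
    return cleaned
```

With pandas the same steps would typically be a `dropna` on the identifier columns followed by vectorised datetime conversion.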

[Figure: Events when an article is read]

[Diagram: WEB, APP, Tracker API, Realtime Parquet Transformation, AWS Redshift Spectrum]
Computation Overhead Improvement
1. Shifting the disk I/O load to S3;
2. S3DistCp (distributed copy) to achieve Spark I/O speedup;
3. Redshift Spectrum allows computation to scale automatically;
Infrastructure v2.0

Dual Pipelines
ETL Pipeline
- PySpark on EMR
M.L. Pipeline
- Dockerized execution
- Mostly Python libraries
Common:
- Data in Parquet format
- Scheduled by Airflow
[Diagram: ETL Pipeline, Machine Learning Pipeline]

Dual Pipelines
Pros
- Running ETL and M.L. jobs independently!
- Team members no longer compete for resources;
Cons
- Heavy DevOps jobs;
- Hard to implement upsert operations;
[Diagram: ETL Pipeline, Machine Learning Pipeline]

Goal: to enrich user experience with article recommendations
Pipelines
ETL: Video Data
ETL: Article Data
ETL: User Engagement
ETL: Data imports
M.L.: Article Topic Modeling
M.L.: User Reading Habits → Collaborative Filtering
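Both pipelines are scheduled by Airflow, which runs tasks in dependency order. The sketch below shows that ordering idea with the standard-library `graphlib` module (Python 3.9+) rather than Airflow itself. The task names are adapted from the list above, and the dependencies between them are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each task maps to the set of tasks it waits for.
DEPENDENCIES = {
    "etl_article_data": {"etl_data_imports"},
    "etl_user_engagement": {"etl_data_imports"},
    "ml_article_topic_modeling": {"etl_article_data"},
    "ml_collaborative_filtering": {"etl_user_engagement"},
}


def execution_order(dependencies):
    """Return one valid run order in which every task follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
```

In Airflow the same graph would be declared as a DAG of operators, and the scheduler would also handle retries, timing, and parallel execution.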

Objective:
- Find clusters of published articles;
- Understand which cluster(s) a user loves;
When a new article is published:
- Predict which cluster the new article is in;
- Recommend that article to targeted users;
Topic Modeling
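The "new article" flow above can be sketched as a nearest-centroid lookup followed by a user filter. This is an illustration only: the deck does not say how cluster membership is predicted, so cosine similarity to cluster centroids is an assumed technique, and the vectors, cluster names, and user preferences are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict_cluster(article_vector, centroids):
    """Return the id of the cluster whose centroid is most similar to the article."""
    return max(centroids, key=lambda cid: cosine_similarity(article_vector, centroids[cid]))


def users_to_target(cluster_id, user_favourites):
    """Return the users whose favourite clusters include the given cluster."""
    return sorted(user for user, clusters in user_favourites.items() if cluster_id in clusters)


# Hypothetical cluster centroids and user preferences.
centroids = {"sports": [0.9, 0.1, 0.0], "finance": [0.1, 0.8, 0.3]}
user_favourites = {"u1": {"sports"}, "u2": {"finance"}, "u3": {"sports", "finance"}}

new_article = [0.2, 0.7, 0.1]
cluster = predict_cluster(new_article, centroids)
targets = users_to_target(cluster, user_favourites)
```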

Objective:
- Find clusters of published articles;
Why not use tags:
- Quality & quantity
Topic Modeling
A tag which is too specific?
An article with no tags?
An article with too many tags?
A tag which is too common?

Topic Modeling
Article → Tokenization → TF-IDF → Modeling
Challenge: HK01 is a Chinese media company, so articles need Chinese word segmentation
Tokenization: https://pypi.org/project/jieba/
Modeling: an unsupervised learning (clustering) problem.
We use Keras AutoEncoder instead of LDA (a story behind it)
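The TF-IDF step of the flow above can be sketched in plain Python. This uses one common variant (term frequency normalised by document length, IDF = ln(N / df)); the deck does not say which variant or library HK01 used. The documents are assumed to be already tokenized, for example by a segmenter such as jieba, and the sample tokens are made up.

```python
import math
from collections import Counter


def tfidf(tokenized_docs):
    """Compute TF-IDF weights for each document given as a list of tokens."""
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical pre-tokenized articles.
docs = [
    ["港鐵", "延誤", "港鐵"],
    ["港鐵", "加價"],
    ["樓市", "加價"],
]
weights = tfidf(docs)
```

Terms that appear in every document get a weight of zero, so the vectors passed on to the modeling step emphasise the words that distinguish one article from another.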

Recommendation Feed
To recommend a list of articles to users based on
- The popularity of an article
- The reading preference of users
Objective:
- To minimize the time a user takes to find a preferred article
What we know:
- User history: a list of articles a user read
- List of articles to recommend

Implementation
- https://github.com/benfred/implicit
- ALS: Alternating Least Squares
- Optimized for implicit feedback
Collaborative Filtering
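ALS on implicit feedback starts from a sparse user-by-article matrix built from reading history. The sketch below prepares that input as index lookups plus (row, column, value) triples. It is an assumption-laden illustration: the function name and the use of read counts as the confidence value are hypothetical, and the call into the `implicit` library is not shown.

```python
def build_interactions(user_history):
    """Map reading histories to index lookups and sparse (row, col, value) triples."""
    user_index = {user: i for i, user in enumerate(sorted(user_history))}
    articles = sorted({article for read in user_history.values() for article in read})
    article_index = {article: j for j, article in enumerate(articles)}
    counts = {}
    for user, read in user_history.items():
        for article in read:
            cell = (user_index[user], article_index[article])
            # Repeated reads raise the value stored for that cell.
            counts[cell] = counts.get(cell, 0) + 1
    triples = sorted((row, col, value) for (row, col), value in counts.items())
    return user_index, article_index, triples


# Hypothetical user history: a list of articles each user read.
history = {"u1": ["a1", "a2", "a1"], "u2": ["a2"]}
user_index, article_index, triples = build_interactions(history)
```

The triples can be loaded into a sparse matrix, and the index lookups are kept so that recommended column indices can be mapped back to article ids.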

Reference: https://github.com/benfred/implicit

50:50 A/B Test Results:
- 1st slot click-through rate: +100%
- Overall page view: +30%
Discussions:
- Content Strategy: promote target articles?
- Content Discovery: promote newly-published articles?
- Content Aging: demote old articles?
Collaborative Filtering
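The percentages in the A/B results are relative changes between the two groups. A minimal sketch of that arithmetic follows; the click and impression counts are hypothetical and are not HK01's data.

```python
def click_through_rate(clicks, impressions):
    """Clicks divided by impressions; 0.0 when there were no impressions."""
    return clicks / impressions if impressions else 0.0


def relative_lift(control, treatment):
    """Relative change of treatment over control, e.g. 1.0 means +100%."""
    if control == 0:
        raise ValueError("control metric must be non-zero")
    return (treatment - control) / control


# Hypothetical counts for the two halves of a 50:50 split.
control_ctr = click_through_rate(clicks=200, impressions=10_000)
treatment_ctr = click_through_rate(clicks=400, impressions=10_000)
lift = relative_lift(control_ctr, treatment_ctr)
```

With these made-up counts the click-through rate doubles, which is what a lift of +100% means.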

• Build a big data platform that scales;
• Parquet on S3: a key enabler;
• Pipelines:
  • ETL tasks:
    • I/O intensive (many joins)
    • Let Spark w/ EMR do the heavy lifting
  • Machine learning tasks:
    • Memory intensive (matrix operations)
    • Dockerized executions (on K8S in the future)
Key Takeaways
