Analytics Infrastructure @ Viki 
Grokking Engineering 
Dec 2014
Talk Outline 
• Introduction 
• Background + Problems 
• Data Architecture 
– Data Collection & Storage 
– Data Processing & Aggregation 
– Data Presentation & Visualization
– Real-time dashboard and alerts 
• Other Comments
What’s Viki? 
Link
YouTube - A Typical Web Application
• Daily/weekly registered users by platform and country?
• How many video uploads do we have every day?
Behavioral Data? (vs Transactional Data) 
• Transactional Data
– Mission-critical data (e.g. user accounts, bookings, payments)
– Often fixed schema
– Lower volume
– Transaction control
• Behavioral Data
– Logging data (e.g. page view, video start, ad impression)
– Often semi-structured (JSON)
– Huge volume
– No transaction control
Data Infrastructure 
1. Collect and Store Data
2. Centralize and Process Data
3. Present and Visualize Data
1. Collect & Store Data 
{
"origin":"tv_show_show", "app_ver":"2.9.3.151",
"uuid":"80833c5a760597bf1c8339819636df04",
"user_id":"5298933u",
"vs_id":"1008912v-1380452660-7920",
"app_id":"100004a", "event":"video_play",
"timed_comment":"off", "stream_quality":"variable",
"bottom_subtitle":"en", "device_size":"tablet",
"feature":"auto_play", "video_id":"1008912v",
"subtitle_completion_percent":"100",
"device_id":"iPad2,1|6.1.3|apple", "t":"1380452846",
"ip":"99.232.169.246", "country":"ca",
"city_name":"Toronto", "region_name":"ON"
}
• Samples: page view, video start, ad impression, etc.
• Behavioral Data
– Semi-structured (JSON)
– Massive volume (100M+/day)
– Does not fit a traditional RDBMS
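A minimal Python sketch of reading one such event, assuming events arrive as one JSON object per record and that `t` holds epoch seconds stored as a string. The helper name `event_date` is illustrative and not from the talk.

```python
import json
from datetime import datetime, timezone

# A trimmed copy of the sample event above (all values are strings).
RAW_EVENT = '{"event": "video_play", "video_id": "1008912v", "country": "ca", "t": "1380452846"}'


def event_date(raw: str) -> str:
    """Return the UTC calendar date (YYYY-MM-DD) of a JSON-encoded event."""
    event = json.loads(raw)
    # Assumption: "t" is epoch seconds encoded as a string.
    ts = datetime.fromtimestamp(int(event["t"]), tz=timezone.utc)
    return ts.strftime("%Y-%m-%d")


print(event_date(RAW_EVENT))
```

Because every value is a string and fields vary between events, this kind of parsing has to happen somewhere before the data can land in fixed-schema tables.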
1. Collect & Store Data 
• Fluentd
– Scalable
– Extensible
– Forwards data to Hadoop, MongoDB, PostgreSQL, etc.
Hydration System 
• Inject time-sensitive information into events
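One way to picture hydration, as a hedged Python sketch: at collection time, copy the user's current attributes into the event so later analysis sees the value as it was when the event happened. The state store `USER_STATE` and the field `subscription_status` are hypothetical; the talk does not say which fields are injected.

```python
from typing import Dict


def hydrate(event: Dict[str, str], user_state: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Return a copy of `event` with the user's current attributes attached."""
    hydrated = dict(event)
    snapshot = user_state.get(event.get("user_id", ""), {})
    for key, value in snapshot.items():
        # Keep any field the client already sent; only fill in missing ones.
        hydrated.setdefault(key, value)
    return hydrated


# Hypothetical state store and field name, for illustration only.
USER_STATE = {"5298933u": {"subscription_status": "free"}}
event = {"event": "video_play", "user_id": "5298933u", "t": "1380452846"}
hydrated = hydrate(event, USER_STATE)
print(hydrated)
```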
2. Centralizing & Processing Data
• Centralizing All Data Sources 
• Cleaning Data 
• Transforming Data 
• Managing Job Dependencies
Getting All Data To 1 Place 
thor db:cp --source prod1 --destination analytics -t public.* --force-schema prod1 
thor db:cp --source A --destination B -t reporting.video_plays --increment
b) Click-stream Data (Hadoop) → Analytics DB:
{"origin":"tv_show_show", "app_ver":"2.9.3.151",
"uuid":"80833c5a760597bf1c8339819636df04", "user_id":"5298933u",
"vs_id":"1008912v-1380452660-7920", "app_id":"100004a",
"event":"video_play", "timed_comment":"off", "stream_quality":"variable",
"bottom_subtitle":"en", "device_size":"tablet", "feature":"auto_play",
"video_id":"1008912v", "subtitle_completion_percent":"100",
"device_id":"iPad2,1|6.1.3|apple", "t":"1380452846", "ip":"99.232.169.246",
"country":"ca", "city_name":"Toronto", "region_name":"ON"}
… 
date        source   partner  event       video_id  country  cnt
2013-09-29  ios      viki     video_play  1008912v  ca       2
2013-09-29  android  viki     video_play  1008912v  us       18
… 
Hadoop → Aggregation (Hive) → Export Output / Sqoop → PostgreSQL
SELECT 
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ) AS `date_d`, 
v['source'], 
v['partner'], 
v['event'], 
v['video_id'], 
v['country'], 
COUNT(1) as cnt 
FROM events 
WHERE TIME_RANGE(time, '2013-09-29', '2013-09-30') 
AND v['event'] = 'video_play' 
GROUP BY 
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ), 
v['source'], 
v['partner'], 
v['event'], 
v['video_id'], 
v['country']; 
Simple Aggregation SQL
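For readers less familiar with Hive, the same grouping idea can be sketched in Python. This is an illustration only, not how the pipeline runs: it omits the date column and the time-range filter, and the sample events are made up.

```python
from collections import Counter
from typing import Dict, Iterable

# The grouping columns from the query, minus the date.
DIMENSIONS = ("source", "partner", "event", "video_id", "country")


def count_by_dimensions(events: Iterable[Dict[str, str]]) -> Counter:
    """Count video_play events per combination of DIMENSIONS."""
    counts: Counter = Counter()
    for e in events:
        if e.get("event") != "video_play":
            continue  # same filter as v['event'] = 'video_play'
        counts[tuple(e.get(d) for d in DIMENSIONS)] += 1
    return counts


# Made-up sample events.
EVENTS = [
    {"source": "ios", "partner": "viki", "event": "video_play", "video_id": "1008912v", "country": "ca"},
    {"source": "ios", "partner": "viki", "event": "video_play", "video_id": "1008912v", "country": "ca"},
    {"source": "ios", "partner": "viki", "event": "pv", "country": "ca"},
]
counts = count_by_dimensions(EVENTS)
print(counts)
```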
But… 
The Data Is Not Clean! 
Event properties and names change as we develop:
Old Version: {"user_id": "152", "country_code": "sg"}
New Version: {"user_id": "152u", "country": "sg"}
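A hedged Python sketch of mapping old-version events onto the new shape, using only the two differences visible in this example (the `country_code` rename and the trailing `u` on user ids). The function name `normalize_event` is illustrative.

```python
from typing import Dict


def normalize_event(event: Dict[str, str]) -> Dict[str, str]:
    """Map an old-version event onto the new-version field names and formats."""
    out = dict(event)
    # Old events used "country_code"; new events use "country".
    if "country" not in out and "country_code" in out:
        out["country"] = out.pop("country_code")
    # Old user ids were bare digits; new ids carry a trailing "u".
    user_id = out.get("user_id", "")
    if user_id and user_id[-1].isdigit():
        out["user_id"] = user_id + "u"
    return out


old = {"user_id": "152", "country_code": "sg"}
new = {"user_id": "152u", "country": "sg"}
print(normalize_event(old) == normalize_event(new))
```

The function leaves new-version events unchanged, so it can be applied to a mixed stream.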
SELECT SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ) AS `date_d`, 
v['app_id'] AS `app_id`, 
CASE WHEN v['app_ver'] LIKE '%_ax' THEN 'axis' 
WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon' 
WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon' 
WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo' 
WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian' 
WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren' 
WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere' 
ELSE LOWER( v['partner'] ) 
END AS `partner`, 
CASE WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) ) THEN 'direct' 
WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct' 
WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) THEN 'embed' 
WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android' 
WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios' 
ELSE TRIM( v['source'] ) 
END AS `source` , 
LOWER( CASE WHEN LENGTH( TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ) = 2 
THEN TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) 
ELSE NULL END ) AS `country` , 
COALESCE ( v['device_size'] ,v['device'] ) AS `device`, 
COUNT( 1 ) AS `cnt` 
FROM events 
WHERE time >= 1380326400 AND time <= 1380412799 
AND v['event'] = 'video_play' 
GROUP BY 
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ), 
v['app_id'], 
CASE WHEN v['app_ver'] LIKE '%_ax' THEN 'axis' 
WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon' 
WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon' 
WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo' 
WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian' 
WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren' 
WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere' 
ELSE LOWER( v['partner'] ) 
END, 
CASE WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) ) THEN 'direct' 
WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct' 
WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) THEN 'embed' 
WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android' 
WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios' 
ELSE TRIM( v['source'] ) 
END, 
LOWER( CASE WHEN LENGTH( TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ) = 2 
THEN TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) 
ELSE NULL END ), 
COALESCE ( v['device_size'] ,v['device'] ); 
(Not so) simple Aggregation SQL 
Hadoop
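The `partner` CASE expression is essentially a suffix lookup, which is easier to see as a table. The Python sketch below mirrors it as written. Note that in a LIKE pattern `_` matches any single character, so `'%_ax'` means "any prefix, one arbitrary character, then ax" rather than a literal underscore; whether a literal underscore was intended is not stated in the talk. The function name `derive_partner` is illustrative.

```python
from typing import Optional

# Suffix-to-partner table copied from the CASE expression above.
PARTNER_BY_SUFFIX = {
    "ax": "axis", "kd": "amazon", "kf": "amazon", "lv": "lenovo",
    "nx": "nexian", "sf": "smartfren", "vp": "samsung_viki_premiere",
}


def derive_partner(app_ver: Optional[str], partner: Optional[str]) -> Optional[str]:
    """Mirror the partner CASE: app_ver suffix first, else lower-cased partner."""
    # LIKE '%_ax' needs at least one character before the two-letter suffix.
    if app_ver and len(app_ver) >= 3:
        hit = PARTNER_BY_SUFFIX.get(app_ver[-2:])
        if hit is not None:
            return hit
    return partner.lower() if partner is not None else None


print(derive_partner("2.9.3_kf", "Viki"))
print(derive_partner("2.9.3.151", "Viki"))
```

Keeping the mapping in one table avoids repeating it in both the SELECT list and the GROUP BY clause.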
UPDATE "reporting"."cl_main_2013_09" 
SET source = 'embed', partner = 'partner1' 
WHERE app_id = '100105a' AND (source != 'embed' OR partner != 'partner1'); 
UPDATE "reporting"."cl_main_2013_09" 
SET app_id = '100105a' 
WHERE (source = 'embed' AND partner = 'partner1') AND (app_id != '100105a'); 
UPDATE reporting.cl_main_2013_09 
SET user_id = user_id || 'u' 
WHERE RIGHT(user_id, 1) ~ '[0-9]'; 
UPDATE "reporting"."cl_main_2013_09" 
SET app_id = '100106a' 
WHERE (source = 'embed' AND partner = 'partner2') AND (app_id != '100106a'); 
UPDATE reporting.cl_main_2013_09 
SET source = 'raynor', partner = 'viki', app_id = '100000a' 
WHERE event = 'pv' 
AND source IS NULL 
AND partner IS NULL 
AND app_id IS NULL; 
…post-import cleanup 
PostgreSQL 
Cleaning Up Data Takes Lots of Time
Transforming Data 
… 
Table A 
Table B 
… 
Analytics DB (PostgreSQL)
a) Reducing Table Size By Dropping Dimension (Aggregation)
video_plays_with_video_id (PostgreSQL, 20M records):
date        source  partner  event       video_id  country  cnt
2013-09-29  ios     viki     video_play  1v        ca       2
2013-09-29  ios     viki     video_play  2v        ca       18
…
video_plays (PostgreSQL, 4M records):
date        source  partner  event       country  cnt
2013-09-29  ios     viki     video_play  ca       20
…
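The roll-up can be sketched in Python: drop `video_id` from the key and sum `cnt` over what remains. This only illustrates the idea using the two sample rows; in the talk the tables live in PostgreSQL.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# date, source, partner, event, video_id, country, cnt
Row = Tuple[str, str, str, str, str, str, int]


def drop_video_id(rows: List[Row]) -> Dict[Tuple[str, str, str, str, str], int]:
    """Sum cnt over all video_ids that share the remaining dimensions."""
    totals: Dict[Tuple[str, str, str, str, str], int] = defaultdict(int)
    for date, source, partner, event, _video_id, country, cnt in rows:
        totals[(date, source, partner, event, country)] += cnt
    return dict(totals)


ROWS: List[Row] = [
    ("2013-09-29", "ios", "viki", "video_play", "1v", "ca", 2),
    ("2013-09-29", "ios", "viki", "video_play", "2v", "ca", 18),
]
summary = drop_video_id(ROWS)
print(summary)
```

Two detail rows collapse into one summary row, which is why the summary table is so much smaller.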
b) Injecting Extra Fields For Analysis
shows (1) to videos (n)
shows (PostgreSQL):
id  title
1c  Game of Thrones
2c  How I Met Your Mother
…
shows → shows (with num_videos):
id  title                  num_videos
1c  Game of Thrones        30
2c  How I Met Your Mother  16
…
Injecting Extra Fields For Analysis
containers (1) to videos (n)
containers (PostgreSQL):
id  title
1c  Game of Thrones
2c  My Girlfriend Is A Gumiho
…
containers → containers (with video_count):
id  title                      video_count
1c  Game of Thrones            30
2c  My Girlfriend Is A Gumiho  16
…
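A small Python sketch of the same denormalization: count the videos per container once and store the result on the container row, so reports do not need to join against `videos` each time. The foreign-key name `container_id` and the generated video ids are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List


def add_video_count(containers: List[Dict[str, str]],
                    videos: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Return container rows extended with a precomputed video_count."""
    # Assumption: each video row points at its container via "container_id".
    counts = Counter(v["container_id"] for v in videos)
    return [{**c, "video_count": counts.get(c["id"], 0)} for c in containers]


CONTAINERS = [
    {"id": "1c", "title": "Game of Thrones"},
    {"id": "2c", "title": "My Girlfriend Is A Gumiho"},
]
# Made-up video rows: 30 for the first container, 16 for the second.
VIDEOS = (
    [{"id": f"{i}v", "container_id": "1c"} for i in range(30)]
    + [{"id": f"{i}v", "container_id": "2c"} for i in range(30, 46)]
)
enriched = add_video_count(CONTAINERS, VIDEOS)
print(enriched)
```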
Chunk Tables By Month 
… 
video_plays_2013_06 
video_plays_2013_07 
video_plays_2013_08 
video_plays_2013_09 
video_plays (parent table) 
ALTER TABLE video_plays_2013_09 INHERIT video_plays;
ALTER TABLE video_plays_2013_09
ADD CHECK (date >= '2013-09-01' AND date < '2013-10-01');
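Since a new chunk is needed every month, the statements are easy to generate. A minimal Python sketch, assuming the `<parent>_YYYY_MM` naming shown above and a `date` column; the helper name `month_partition_ddl` is illustrative. It emits the INHERIT statement and the CHECK constraint, with the condition parenthesized as PostgreSQL requires.

```python
from datetime import date
from typing import List


def month_partition_ddl(parent: str, year: int, month: int) -> List[str]:
    """Build the INHERIT and CHECK statements for one monthly child table."""
    start = date(year, month, 1)
    # First day of the following month, rolling over the year in December.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    child = f"{parent}_{year:04d}_{month:02d}"
    return [
        f"ALTER TABLE {child} INHERIT {parent};",
        f"ALTER TABLE {child} ADD CHECK (date >= '{start}' AND date < '{end}');",
    ]


ddl = month_partition_ddl("video_plays", 2013, 9)
print("\n".join(ddl))
```

The half-open range (`>=` start, `<` next month) keeps adjacent chunks from overlapping.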
Managing Job Dependency 
… 
Job A 
Job B 
… 
Analytics DB (PostgreSQL)
Managing Job Dependency 
… 
tableA 
tableB 
… 
Analytics DB (PostgreSQL)
Azkaban 
Cron dependency management
(Viki Cron Dependency Graph)
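The core idea behind dependency management is ordering jobs so that each runs only after the jobs it depends on. A hedged Python sketch using the standard library's `graphlib`; this is not Azkaban's API, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Dict, List, Set

# Hypothetical job names; each job maps to the set of jobs it depends on.
JOBS: Dict[str, Set[str]] = {
    "import_production_tables": set(),
    "export_clickstream_aggregates": set(),
    "clean_video_plays": {"export_clickstream_aggregates"},
    "build_video_plays_summary": {"clean_video_plays", "import_production_tables"},
}


def run_order(jobs: Dict[str, Set[str]]) -> List[str]:
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(jobs).static_order())


order = run_order(JOBS)
print(order)
```

With plain cron, the same ordering has to be approximated by spacing out start times, which breaks as soon as an upstream job runs long.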
3. Data Presentation and Visualization
Query Reports
Summary report 
• Higher level view of metrics 
• See changes over time 
• (screen shot)
Data Explorer 
“The world is your oyster”
4. Real Time Infrastructure
Real Time Infrastructure (Apache Storm)
Real Time Dashboard
Alerts 
Know when the house is burning down!
Then Global Content Source and Consumption
Our Technology Stack 
• Languages/Frameworks 
– Ruby, Rails, Python, Go, JavaScript, NodeJS 
– Fluentd (Log collector) 
– Java, Apache Storm, Kestrel 
• Databases 
– PostgreSQL, MongoDB, Redis 
– Hadoop/Hive, Amazon Redshift 
– Amazon Elastic MapReduce
Hadoop vs. Amazon Redshift 
• Hadoop is a big-data storage and processing platform
– HDFS: data-storage layer
– YARN: resource management
– MapReduce/Pig/Hive/Spark: processing layer
• Amazon Redshift (MPP: massively parallel processing)
– Columnar-storage database, meant for analytics
– OLAP: Online Analytical Processing
– Examples: Vertica, Amazon Redshift, ParAccel
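A toy Python illustration of why columnar storage suits analytics: when values are stored per column, an aggregate such as a sum reads only the one column it needs. This is a conceptual sketch with made-up rows, not a description of Redshift internals.

```python
from typing import Dict, List

# Row-oriented layout: each record keeps all of its fields together.
ROWS = [
    {"date": "2013-09-29", "country": "ca", "cnt": 2},
    {"date": "2013-09-29", "country": "us", "cnt": 18},
]


def to_columns(rows: List[Dict[str, object]]) -> Dict[str, List[object]]:
    """Pivot rows into one list per column (assumes all rows share the same keys)."""
    columns: Dict[str, List[object]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


COLUMNS = to_columns(ROWS)
# The aggregate touches only the "cnt" column and ignores the rest.
total = sum(COLUMNS["cnt"])
print(total)
```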
Recap
Thank You! 
http://bit.ly/viki-datawarehouse 
http://engineering.viki.com/blog/2014/data-warehouse-and-analytics-infrastructure-at-viki/ 
engineering.viki.com 
huy@viki.com

Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 

Grokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen

  • 1. Analytics Infrastructure @ Viki Grokking Engineering Dec 2014
  • 2. Talk Outline • Introduction • Background + Problems • Data Architecture – Data Collection & Storage – Data Processing & Aggregation – Data Presentation & Visualization – Real-time dashboard and alerts • Other Comments
  • 4. YouTube - A Typical Web Application • Daily/weekly registered users by different platforms, countries? • How many video uploads do we have every day?
  • 5. YouTube - A Typical Web Application • Daily/weekly registered users by different platforms, countries? • How many video uploads do we have every day?
  • 6.
  • 7. Behavioral Data? (vs Transactional Data) • Transactional Data – Mission-critical data (e.g. user accounts, bookings, payments) – Often fixed schema – Lower volume – Transaction control • Behavioral Data – Logging data (e.g. page view, video start, ad impression) – Often semi-structured (JSON) – Huge volume – No transaction control
  • 9. Data Infrastructure 1. Collect and Store Data 2. Centralize and Process Data 3. Present and Visualize Data
  • 10. 1. Collect & Store Data { "origin":"tv_show_show", "app_ver":"2.9.3.151", "uuid":"80833c5a760597bf1c8339819636df04", "user_id":"5298933u", "vs_id":"1008912v-1380452660-7920", "app_id":"100004a", "event":"video_play", "timed_comment":"off", "stream_quality":"variable", "bottom_subtitle":"en", "device_size":"tablet", "feature":"auto_play", "video_id":"1008912v", "subtitle_completion_percent":"100", "device_id":"iPad2,1|6.1.3|apple", "t":"1380452846", "ip":"99.232.169.246", "country":"ca", "city_name":"Toronto", "region_name":"ON" } • Samples: page view, video start, ad impression, etc. • Behavioral Data – Semi-structured (JSON) – Massive volume (100M+/day) – Does not fit a traditional RDBMS
  • 11. 1. Collect & Store Data • fluentd – Scalable – Extensible – Forward data to Hadoop, MongoDB, PostgreSQL etc.
  • 12. Hydration System • Inject time-sensitive information into events
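The slide gives no detail on how hydration works, so the following is only a minimal sketch of the idea: attaching values that are true at the moment the event arrives, so they are still correct when the event is analyzed later. The `subscription` field and the `user_state` lookup are hypothetical and are not taken from the talk.

```python
import time


def hydrate(event, user_state, now=None):
    """Return a copy of `event` with time-sensitive fields attached.

    `user_state` is a hypothetical lookup (user_id -> dict of attributes
    as they are right now). Fields already present on the event are kept.
    """
    enriched = dict(event)
    # Stamp the event with the collection time if the client did not send one.
    timestamp = now if now is not None else time.time()
    enriched.setdefault("t", str(int(timestamp)))
    # Copy the user's current attributes onto the event.
    for key, value in user_state.get(event.get("user_id"), {}).items():
        enriched.setdefault(key, value)
    return enriched
```

Because the function returns a copy, the raw event can still be stored untouched next to the hydrated one.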
  • 14. 2. Centralizing & Processing Data
  • 15. 2. Centralizing & Processing Data • Centralizing All Data Sources • Cleaning Data • Transforming Data • Managing Job Dependencies
  • 16. 2. Centralizing & Processing Data • Centralizing All Data Sources • Cleaning Data • Transforming Data • Managing Job Dependencies
  • 17. Getting All Data To 1 Place thor db:cp --source prod1 --destination analytics -t public.* --force-schema prod1 thor db:cp --source A --destination B -t reporting.video_plays --increment
  • 18. b) Click-stream Data (Hadoop) → Analytics DB: {"origin":"tv_show_show", "app_ver":"2.9.3.151", "uuid":"80833c5a760597bf1c8339819636df04", "user_id":"5298933u", "vs_id":"1008912v-1380452660-7920", "app_id":"100004a", "event":"video_play", "timed_comment":"off", "stream_quality":"variable", "bottom_subtitle":"en", "device_size":"tablet", "feature":"auto_play", "video_id":"1008912v", "subtitle_completion_percent":"100", "device_id":"iPad2,1|6.1.3|apple", "t":"1380452846", "ip":"99.232.169.246", "country":"ca", "city_name":"Toronto", "region_name":"ON"} …
      date        source   partner  event       video_id  country  cnt
      2013-09-29  ios      viki     video_play  1008912v  ca       2
      2013-09-29  android  viki     video_play  1008912v  us       18
      …
      Diagram labels: Hadoop, PostgreSQL, Aggregation (Hive), Export Output / Sqoop
  • 19. Simple Aggregation SQL: SELECT SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ) AS `date_d`, v['source'], v['partner'], v['event'], v['video_id'], v['country'], COUNT(1) as cnt FROM events WHERE TIME_RANGE(time, '2013-09-29', '2013-09-30') AND v['event'] = 'video_play' GROUP BY SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ), v['source'], v['partner'], v['event'], v['video_id'], v['country'];
  • 20. But… The Data Is Not Clean! Event properties and names change as we develop: Old Version: {"user_id": "152", "country_code": "sg"} New Version: {"user_id": "152u", "country": "sg"}
  • 21. SELECT SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ) AS `date_d`, v['app_id'] AS `app_id`, CASE WHEN v['app_ver'] LIKE '%_ax' THEN 'axis' WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon' WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon' WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo' WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian' WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren' WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere' ELSE LOWER( v['partner'] ) END AS `partner`, CASE WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) ) THEN 'direct' WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct' WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) THEN 'embed' WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android' WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios' ELSE TRIM( v['source'] ) END AS `source` , LOWER( CASE WHEN LENGTH( TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ) = 2 THEN TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ELSE NULL END ) AS `country` , COALESCE ( v['device_size'] ,v['device'] ) AS `device`, COUNT( 1 ) AS `cnt` FROM events WHERE time >= 1380326400 AND time <= 1380412799 AND v['event'] = 'video_play' GROUP BY SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ), v['app_id'], CASE WHEN v['app_ver'] LIKE '%_ax' THEN 'axis' WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon' WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon' WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo' WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian' WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren' WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere' ELSE LOWER( v['partner'] ) END , CASE WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) ) THEN 'direct' WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct' WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN 
( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) THEN 'embed' WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android' WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios' ELSE TRIM( v['source'] ) END, LOWER( CASE WHEN LENGTH( TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ) = 2 THEN TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ELSE NULL END ), COALESCE ( v['device_size'] ,v['device'] ); (Not so) simple Aggregation SQL Hadoop
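Two of the cleanup rules above can be sketched outside SQL to make the intent clearer: the `COALESCE(country, country_code)` fallback with the two-letter check, and the trailing `u` on numeric user ids. This is a minimal illustration under those assumptions, not the pipeline used at Viki; the function names are hypothetical.

```python
def coalesce(*values):
    """Return the first value that is not None, like SQL COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None


def normalize_event(raw):
    """Map an old- or new-version event onto the new field names."""
    event = dict(raw)
    # Old clients sent `country_code`; new clients send `country`.
    country = coalesce(raw.get("country"), raw.get("country_code"))
    country = country.strip().lower() if country is not None else None
    # Keep only plausible two-letter codes, as the Hive query does.
    event["country"] = country if country and len(country) == 2 else None
    event.pop("country_code", None)
    # Old clients sent bare numeric user ids; new ones end in "u".
    user_id = raw.get("user_id")
    if user_id and user_id[-1].isdigit():
        event["user_id"] = user_id + "u"
    return event
```

Applied to the old-version example from slide 20, this yields the same record as the new-version example, so both generations of events can be aggregated together.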
  • 22. Cleaning Up Data Takes Lots of Time (…post-import cleanup, PostgreSQL): UPDATE "reporting"."cl_main_2013_09" SET source = 'embed', partner = 'partner1' WHERE app_id = '100105a' AND (source != 'embed' OR partner != 'partner1'); UPDATE "reporting"."cl_main_2013_09" SET app_id = '100105a' WHERE (source = 'embed' AND partner = 'partner1') AND (app_id != '100105a'); UPDATE reporting.cl_main_2013_09 SET user_id = user_id || 'u' WHERE RIGHT(user_id, 1) ~ '[0-9]'; UPDATE "reporting"."cl_main_2013_09" SET app_id = '100106a' WHERE (source = 'embed' AND partner = 'partner2') AND (app_id != '100106a'); UPDATE reporting.cl_main_2013_09 SET source = 'raynor', partner = 'viki', app_id = '100000a' WHERE event = 'pv' AND source IS NULL AND partner IS NULL AND app_id IS NULL;
  • 23. Transforming Data • Centralizing All Data Sources • Cleaning Data • Transforming Data • Managing Job Dependencies
  • 24. Transforming Data … Table A Table B … Analytics DB (PostgreSQL)
  • 25. a) Reducing Table Size By Dropping Dimension (Aggregation), PostgreSQL
      video_plays_with_video_id (20M records):
      date        source  partner  event       video_id  country  cnt
      2013-09-29  ios     viki     video_play  1v        ca       2
      2013-09-29  ios     viki     video_play  2v        ca       18
      …
      video_plays (4M records):
      date        source  partner  event       country  cnt
      2013-09-29  ios     viki     video_play  ca       20
      …
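The aggregation on this slide can be sketched in a few lines of Python. This is only an illustration of dropping one dimension and summing the measure; in the talk the work is done in PostgreSQL, and the function name `drop_dimension` is hypothetical.

```python
from collections import defaultdict


def drop_dimension(rows, dimension, measure="cnt"):
    """Group rows by every column except `dimension` and sum `measure`."""
    totals = defaultdict(int)
    for row in rows:
        # The remaining columns form the grouping key.
        key = tuple(sorted(
            (name, value) for name, value in row.items()
            if name not in (dimension, measure)
        ))
        totals[key] += row[measure]
    return [dict(key, **{measure: total}) for key, total in totals.items()]
```

With the two rows from the slide (`video_id` 1v with 2 plays and 2v with 18 plays), dropping `video_id` leaves a single row whose `cnt` is 20.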
  • 26. b) Injecting Extra Fields For Analysis, PostgreSQL (shows 1 : n videos)
      shows (before):
      id  title
      1c  Game of Thrones
      2c  How I Met Your Mother
      …
      shows (after):
      id  title                  num_videos
      1c  Game of Thrones        30
      2c  How I Met Your Mother  16
      …
  • 27. Injecting Extra Fields For Analysis, PostgreSQL (containers 1 : n videos)
      containers (before):
      id  title
      1c  Game of Thrones
      2c  My Girlfriend Is A Gumiho
      …
      containers (after):
      id  title                      video_count
      1c  Game of Thrones            30
      2c  My Girlfriend Is A Gumiho  16
      …
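As a rough sketch of this denormalization step, the snippet below counts the videos that belong to each container and attaches the result as `video_count`. The foreign-key name `container_id` is an assumption; the slides only show a 1 : n relation between containers and videos.

```python
from collections import Counter


def with_video_count(containers, videos):
    """Return containers with a precomputed `video_count` column."""
    # `container_id` is a hypothetical foreign key on each video row.
    counts = Counter(video["container_id"] for video in videos)
    return [
        dict(container, video_count=counts[container["id"]])
        for container in containers
    ]
```

Precomputing the count means report queries can read it from one table instead of joining against `videos` every time.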
  • 28. Chunk Tables By Month … video_plays_2013_06 video_plays_2013_07 video_plays_2013_08 video_plays_2013_09 video_plays (parent table) ALTER TABLE video_plays_2013_09 INHERIT video_plays; ALTER TABLE video_plays_2013_09 ADD CHECK (date >= '2013-09-01' AND date < '2013-10-01');
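Since a new child table is needed every month, the statements above lend themselves to being generated. The following is a hedged sketch that only builds the SQL strings; the `CREATE TABLE ... (LIKE ...)` step and the function name are assumptions, as the slide shows only the `INHERIT` and `CHECK` statements.

```python
from datetime import date


def month_partition_ddl(parent, year, month, column="date"):
    """Build the DDL for one monthly child table of `parent`."""
    start = date(year, month, 1)
    # First day of the following month, rolling over into the next year.
    end = date(year + (month == 12), month % 12 + 1, 1)
    child = f"{parent}_{year:04d}_{month:02d}"
    return [
        # Assumed step: create the child with the parent's columns.
        f"CREATE TABLE {child} (LIKE {parent} INCLUDING ALL);",
        f"ALTER TABLE {child} ADD CHECK "
        f"({column} >= '{start}' AND {column} < '{end}');",
        f"ALTER TABLE {child} INHERIT {parent};",
    ]
```

The half-open range (`>=` the first day, `<` the first day of the next month) keeps adjacent months from overlapping, which is what lets PostgreSQL skip child tables that cannot match a query's date filter.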
  • 29. Managing Job Dependency • Centralizing All Data Sources • Cleaning Data • Transforming Data • Managing Job Dependencies
  • 30. Managing Job Dependency … Job A Job B … Analytics DB (PostgreSQL)
  • 31. Managing Job Dependency … tableA tableB … Analytics DB (PostgreSQL)
  • 32. Azkaban Cron dependency management (Viki Cron Dependency Graph)
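The core idea behind dependency management, running each job only after the jobs it depends on, can be illustrated with a topological sort. This sketch uses Python's standard `graphlib` module (Python 3.9+) and is not Azkaban's API; the job names in the usage note are hypothetical.

```python
from graphlib import TopologicalSorter


def run_order(dependencies):
    """Return job names in an order that respects their dependencies.

    `dependencies` maps each job to the set of jobs that must finish first.
    Raises graphlib.CycleError if the jobs depend on each other in a loop.
    """
    return list(TopologicalSorter(dependencies).static_order())
```

For example, `run_order({"aggregate": {"copy_tables", "import_events"}, "report": {"aggregate"}})` places both inputs before `aggregate` and `aggregate` before `report`, which fixed crontab start times cannot guarantee when an upstream job runs long.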
  • 33. 3. Data Presentation and Visualization
  • 35. Summary report • Higher level view of metrics • See changes over time • (screen shot)
  • 36. Data Explorer “The world is your oyster”
  • 37. 4. Real Time Infrastructure
  • 38. Real Time Infrastructure (Apache Storm)
  • 40. Alerts Know when the house is burning down!
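The talk does not say how peaks and valleys are detected, so the following is only one simple possibility: flag any point that lies several standard deviations away from the mean of the series. The threshold of 3 and the function name are assumptions made for illustration.

```python
from statistics import mean, stdev


def find_anomalies(counts, threshold=3.0):
    """Return (index, value, kind) for points far from the series mean."""
    if len(counts) < 2:
        return []
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:
        return []  # A perfectly flat series has no peaks or valleys.
    anomalies = []
    for index, value in enumerate(counts):
        z_score = (value - mu) / sigma
        if z_score >= threshold:
            anomalies.append((index, value, "peak"))
        elif z_score <= -threshold:
            anomalies.append((index, value, "valley"))
    return anomalies
```

A production alerting job would likely compare against the same hour on previous days rather than a global mean, since video traffic tends to follow a daily cycle.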
  • 41. Then Global Content Source and Consumption
  • 42. Our Technology Stack • Languages/Frameworks – Ruby, Rails, Python, Go, JavaScript, NodeJS – Fluentd (Log collector) – Java, Apache Storm, Kestrel • Databases – PostgreSQL, MongoDB, Redis – Hadoop/Hive, Amazon Redshift – Amazon Elastic MapReduce
  • 43. Hadoop vs. Amazon Redshift • Hadoop is a big-data storage and processing platform – HDFS: data-storage layer – YARN: resource management – MapReduce/Pig/Hive/Spark: processing layer • Amazon Redshift (MPP, massively parallel processing) – Columnar-storage database, meant for analytics purposes. – OLAP – Online Analytical Processing – Examples: Vertica, Amazon Redshift, ParAccel
  • 44. Recap
  • 45. Thank You! http://bit.ly/viki-datawarehouse http://engineering.viki.com/blog/2014/data-warehouse-and-analytics-infrastructure-at-viki/ engineering.viki.com huy@viki.com

Editor's Notes

  1. Theme of the talk: we have all this data at Viki; how do we collect it, make it usable, and then derive business value from it? Plus other cool technologies that we use.
  2. What are the predominant countries of origin for our content (Korea)? How many Registered Users do we have and what is the growth trajectory? What videos are translated into what languages and what is the percent completion rate? What are our top performing shows (video starts) in the last 3 months? By geographic region? What are the sources of our Registered Users (how do they find us, embed referral traffic, Google search, etc.)? What videos are translated into what languages and what are the corresponding consumption patterns?
  3. What are the predominant countries of origin for our content (Korea)? How many Registered Users do we have and what is the growth trajectory? What videos are translated into what languages and what is the percent completion rate? What are our top performing shows (video starts) in the last 3 months? By geographic region? What are the sources of our Registered Users (how do they find us, embed referral traffic, Google search, etc.)? What videos are translated into what languages and what are the corresponding consumption patterns?
  4. Website database tables: a user registration web flow would persist information provided by the user in a PostgreSQL database table. Accounting data: highly transactional, with high data-integrity requirements, as in Enterprise Resource Planning (ERP) and General Ledger (GL) accounting systems, where the structure of the data is predefined and fixed and its integrity can be reasonably relied upon. …s of which can be distributed in a Hadoop cluster
  5. Add example of an event JSON
  6. Simple? Client libraries. We collect the events and put them in a queue. Structured and unstructured data.
  7. Simple? Client libraries. We collect the events and put them in a queue. Structured and unstructured data.
  8. Simple? Client libraries. We collect the events and put them in a queue. Structured and unstructured data.
  9. Centralizing All Data Sources, Data Cleanliness, Data Transformation, Managing Job Dependencies
  10. Centralizing All Data Sources, Data Cleanliness, Data Transformation, Managing Job Dependencies
  11. To run queries on our data effectively, we need to bring all the data into the same database. In this case we chose Postgres, since all our databases are already in Postgres. Does anyone here know Postgres? It’s like MySQL, but better. We’ve built command-line tools to copy tables from database to database. The following command copies all tables in the public schema of the gaia database to our analytics database and gives them a separate schema. In PG, a schema is something like a namespace for tables.
  12. Take a look at one sample event stored in Hadoop in semi-structured JSON form: a video play event for that video id, running on an iPad, coming from the autoplay feature, from Toronto, Ontario, Canada. That’s a hell of a lot of dimensions. We want to aggregate and select a subset of dimensions to port into PG. The Hadoop provider we use (Treasure Data) has a feature that lets you specify a destination data store (in this case Postgres); it executes the Hadoop job and writes the results into the selected database. It’s the equivalent of using Sqoop to bulk-export data into Postgres.
  13. As we develop, our data changes: we make mistakes, we forget to set a variable somewhere, we change our data structure. So the new data gets mixed up with the old data, and to make it meaningful, the simple query becomes not so simple.
  14. Centralizing All Data Sources, Data Cleanliness, Data Transformation, Managing Job Dependencies
  15. Once all our data is in Postgres, we start transforming and aggregating it for various purposes.
  16. For example, to reduce the size of a table so it can be served on a web UI front-end, we aggregate the data further. In this example, we drop the video_id dimension, grouping the 2 records into a new record whose cnt field totals 20. That reduces the table size.
  17. For example, to reduce the size of a table so it can be served on a web UI front-end, we aggregate the data further. In this example, we drop the video_id dimension, grouping the 2 records into a new record whose cnt field totals 20. That reduces the table size.
  18. For example, to reduce the size of a table so it can be served on a web UI front-end, we aggregate the data further. In this example, we drop the video_id dimension, grouping the 2 records into a new record whose cnt field totals 20. That reduces the table size.
  19. We also chunk our data tables by month, so that when a new month comes, you don’t touch the old months’ data. This also reduces the index size and makes it easier to archive old data. When we first implemented this, we didn’t know how to query across months, so we had to write complicated queries (like UNION), and sometimes we even had to load the data into memory and process it there. But then we found out about this awesome feature in Postgres called Table Inheritance. It lets you define a parent table with a bunch of children. You just query the parent table, and depending on your query, it works out the correct child tables to hit.
  20. Centralizing All Data Sources, Data Cleanliness, Data Transformation, Managing Job Dependencies
  21. Can anyone tell me what this means? OK, no one can. That’s exactly my point. At some point, our daily job workflow grew so complicated that it became hard to manage with crontab.
  22. Can anyone tell me what this means? OK, no one can. That’s exactly my point. At some point, our daily job workflow grew so complicated that it became hard to manage with crontab.
  23. There are too many reports! I want to see the high-level metrics all in one place.
  24. Enabling the product and business folks to “write” their own queries
  25. How do you process 1k events a second? Scalable and distributed, guaranteed message passing, fault tolerant.
  26. How do you process 1k events a second? Scalable and distributed, guaranteed message passing, fault tolerant.
  27. We try to detect peaks and valleys in our real-time data, and send out alerts every hour if there are any.
  28. We are always exploring new technologies and finding the best tool for the job. The real reason is we get bored ;) I wasn’t hired as a Rails developer, I was hired as a developer.