Analytics Infrastructure @ Viki 
Grokking Engineering 
Dec 2014
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen

Details: http://www.grokkingengineering.org/

Published in: Data & Analytics

  1. Analytics Infrastructure @ Viki. Grokking Engineering, Dec 2014
  2. Talk Outline
     • Introduction
     • Background + Problems
     • Data Architecture
       – Data Collection & Storage
       – Data Processing & Aggregation
       – Data Presentation & Visualization
       – Real-time dashboard and alerts
     • Other Comments
  3. What’s Viki? Link
  4. YouTube - A Typical Web Application
     • Daily/weekly registered users by different platforms, countries?
     • How many video uploads do we have every day?
  6. Behavioral Data? (vs Transactional Data)
     • Transactional Data
       – Mission-critical data (e.g. user accounts, bookings, payments)
       – Often fixed schema
       – Lower volume
       – Transaction control
     • Behavioral Data
       – Logging data (e.g. page view, video start, ad impression)
       – Often semi-structured (JSON)
       – Huge volume
       – No transaction control
  7. Data Infrastructure
  8. Data Infrastructure
     1. Collect and Store Data
     2. Centralize and Process Data
     3. Present and Visualize Data
  9. 1. Collect & Store Data
     {
       "origin": "tv_show_show", "app_ver": "2.9.3.151",
       "uuid": "80833c5a760597bf1c8339819636df04",
       "user_id": "5298933u", "vs_id": "1008912v-1380452660-7920",
       "app_id": "100004a", "event": "video_play",
       "timed_comment": "off", "stream_quality": "variable",
       "bottom_subtitle": "en", "device_size": "tablet",
       "feature": "auto_play", "video_id": "1008912v",
       "subtitle_completion_percent": "100",
       "device_id": "iPad2,1|6.1.3|apple", "t": "1380452846",
       "ip": "99.232.169.246", "country": "ca",
       "city_name": "Toronto", "region_name": "ON"
     }
     • Samples: page view, video start, ad impression, etc.
     • Behavioral Data
       – Semi-structured (JSON)
       – Massive volume (100M+/day)
       – Does not fit a traditional RDBMS
  10. 1. Collect & Store Data
      • fluentd
        – Scalable
        – Extensible
        – Forward data to Hadoop, MongoDB, PostgreSQL etc.
  11. Hydration System
      • Inject time-sensitive information into events
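
The slide does not say which attributes are injected, so the following is only a minimal sketch of the idea. It assumes an attribute that can change after the event happened (here a hypothetical `user_plan` looked up by `user_id`) and copies its current value onto the event at processing time. The lookup table and the field name are assumptions for illustration, not part of the talk.

```python
import copy

# Hypothetical lookup of user state at processing time; in a real pipeline
# this would come from a database or cache rather than a literal dict.
USER_PLANS = {"5298933u": "paid"}


def hydrate(event, user_plans):
    """Return a copy of the event with a point-in-time attribute attached."""
    hydrated = copy.deepcopy(event)
    # "user_plan" is an illustrative field name, not one from the slides.
    hydrated["user_plan"] = user_plans.get(event.get("user_id"), "unknown")
    return hydrated


event = {"event": "video_play", "user_id": "5298933u", "t": "1380452846"}
print(hydrate(event, USER_PLANS))
```

Storing the value on the event means later reports do not depend on whatever the user's state happens to be when the report runs.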
  12. Hydration System
  13. 2. Centralizing & Processing Data
  14. 2. Centralizing & Processing Data
      • Centralizing All Data Sources
      • Cleaning Data
      • Transforming Data
      • Managing Job Dependencies
  16. Getting All Data To 1 Place
      thor db:cp --source prod1 --destination analytics -t public.* --force-schema prod1
      thor db:cp --source A --destination B -t reporting.video_plays --increment
  17. b) Click-stream Data (Hadoop) → Analytics DB:
      Hadoop:
      {"origin": "tv_show_show", "app_ver": "2.9.3.151",
       "uuid": "80833c5a760597bf1c8339819636df04",
       "user_id": "5298933u", "vs_id": "1008912v-1380452660-7920",
       "app_id": "100004a", "event": "video_play",
       "timed_comment": "off", "stream_quality": "variable",
       "bottom_subtitle": "en", "device_size": "tablet",
       "feature": "auto_play", "video_id": "1008912v",
       "subtitle_completion_percent": "100",
       "device_id": "iPad2,1|6.1.3|apple", "t": "1380452846",
       "ip": "99.232.169.246", "country": "ca",
       "city_name": "Toronto", "region_name": "ON"}
      …
      Aggregation (Hive), then Export Output / Sqoop
      PostgreSQL:
      date        source   partner  event       video_id  country  cnt
      2013-09-29  ios      viki     video_play  1008912v  ca       2
      2013-09-29  android  viki     video_play  1008912v  us       18
      …
  18. Simple Aggregation SQL
      SELECT
        SUBSTR( FROM_UNIXTIME( time ), 0, 10 ) AS `date_d`,
        v['source'],
        v['partner'],
        v['event'],
        v['video_id'],
        v['country'],
        COUNT(1) AS cnt
      FROM events
      WHERE TIME_RANGE(time, '2013-09-29', '2013-09-30')
        AND v['event'] = 'video_play'
      GROUP BY
        SUBSTR( FROM_UNIXTIME( time ), 0, 10 ),
        v['source'],
        v['partner'],
        v['event'],
        v['video_id'],
        v['country'];
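
For readers less familiar with Hive, the same group-and-count can be sketched in plain Python. This is an illustration of what the query computes, not how the pipeline runs it. It assumes each event is a dict carrying its Unix timestamp in `t` (as in the sample event) and that `source` and `partner` are already present on the event.

```python
from collections import Counter
from datetime import datetime, timezone

DIMENSIONS = ("source", "partner", "event", "video_id", "country")


def aggregate_video_plays(events):
    """Count video_play events per (date, source, partner, event, video_id, country)."""
    counts = Counter()
    for e in events:
        if e.get("event") != "video_play":
            continue
        # Equivalent of SUBSTR(FROM_UNIXTIME(time), 0, 10), taken in UTC here.
        date_d = datetime.fromtimestamp(int(e["t"]), tz=timezone.utc).strftime("%Y-%m-%d")
        counts[(date_d,) + tuple(e.get(d) for d in DIMENSIONS)] += 1
    return counts


sample = {"t": "1380452846", "source": "ios", "partner": "viki",
          "event": "video_play", "video_id": "1008912v", "country": "ca"}
events = [sample, dict(sample), dict(sample, event="pv")]
for key, cnt in aggregate_video_plays(events).items():
    print(key, cnt)
```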
  19. But… The Data Is Not Clean!
      Event properties and names change as we develop:
      Old Version: {"user_id": "152", "country_code": "sg"}
      New Version: {"user_id": "152u", "country": "sg"}
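
One way to absorb this kind of schema drift is to normalize every event to the newest shape before aggregating. The sketch below covers only the two changes shown on this slide (the `country_code` to `country` rename and the `u` suffix on `user_id`); the function name is hypothetical, and a real pipeline would accumulate many more rules.

```python
def normalize_event(raw):
    """Return a copy of the event rewritten to the newer field conventions."""
    event = dict(raw)

    # Older events used "country_code"; newer ones use "country".
    legacy_country = event.pop("country_code", None)
    country = (event.get("country") or legacy_country or "").strip().lower()
    # Keep only two-letter codes; anything else is treated as missing.
    event["country"] = country if len(country) == 2 else None

    # Older user ids were bare numbers; newer ones end in "u".
    user_id = event.get("user_id")
    if user_id and user_id[-1] in "0123456789":
        event["user_id"] = user_id + "u"
    return event


print(normalize_event({"user_id": "152", "country_code": "sg"}))
print(normalize_event({"user_id": "152u", "country": "sg"}))
```

Both calls print the same dict, so old and new events end up in the same group.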
  20. (Not so) simple Aggregation SQL (Hadoop)
      SELECT
        SUBSTR( FROM_UNIXTIME( time ), 0, 10 ) AS `date_d`,
        v['app_id'] AS `app_id`,
        CASE
          WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
          WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
          WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
          WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
          WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
          WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
          WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
          ELSE LOWER( v['partner'] )
        END AS `partner`,
        CASE
          WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com', 'viki.com', 'www.viki.mx', 'viki.mx', '' ) ) ) THEN 'direct'
          WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct'
          WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com', 'viki.com', 'www.viki.mx', 'viki.mx', '' ) ) THEN 'embed'
          WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android'
          WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios'
          ELSE TRIM( v['source'] )
        END AS `source`,
        LOWER( CASE
          WHEN LENGTH( TRIM( COALESCE( v['country'], v['country_code'] ) ) ) = 2
            THEN TRIM( COALESCE( v['country'], v['country_code'] ) )
          ELSE NULL
        END ) AS `country`,
        COALESCE( v['device_size'], v['device'] ) AS `device`,
        COUNT( 1 ) AS `cnt`
      FROM events
      WHERE time >= 1380326400 AND time <= 1380412799
        AND v['event'] = 'video_play'
      GROUP BY
        SUBSTR( FROM_UNIXTIME( time ), 0, 10 ),
        v['app_id'],
        CASE
          WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
          WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
          WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
          WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
          WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
          WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
          WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
          ELSE LOWER( v['partner'] )
        END,
        CASE
          WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com', 'viki.com', 'www.viki.mx', 'viki.mx', '' ) ) ) THEN 'direct'
          WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct'
          WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com', 'viki.com', 'www.viki.mx', 'viki.mx', '' ) ) THEN 'embed'
          WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android'
          WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios'
          ELSE TRIM( v['source'] )
        END,
        LOWER( CASE
          WHEN LENGTH( TRIM( COALESCE( v['country'], v['country_code'] ) ) ) = 2
            THEN TRIM( COALESCE( v['country'], v['country_code'] ) )
          ELSE NULL
        END ),
        COALESCE( v['device_size'], v['device'] );
  21. Cleaning Up Data Takes Lots of Time: post-import cleanup (PostgreSQL)
      UPDATE "reporting"."cl_main_2013_09"
        SET source = 'embed', partner = 'partner1'
        WHERE app_id = '100105a' AND (source != 'embed' OR partner != 'partner1');
      UPDATE "reporting"."cl_main_2013_09"
        SET app_id = '100105a'
        WHERE (source = 'embed' AND partner = 'partner1') AND (app_id != '100105a');
      UPDATE reporting.cl_main_2013_09
        SET user_id = user_id || 'u'
        WHERE RIGHT(user_id, 1) ~ '[0-9]';
      UPDATE "reporting"."cl_main_2013_09"
        SET app_id = '100106a'
        WHERE (source = 'embed' AND partner = 'partner2') AND (app_id != '100106a');
      UPDATE reporting.cl_main_2013_09
        SET source = 'raynor', partner = 'viki', app_id = '100000a'
        WHERE event = 'pv' AND source IS NULL AND partner IS NULL AND app_id IS NULL;
      …
  22. Transforming Data
      • Centralizing All Data Sources
      • Cleaning Data
      • Transforming Data
      • Managing Job Dependencies
  23. Transforming Data … Table A Table B … Analytics DB (PostgreSQL)
  24. a) Reducing Table Size By Dropping Dimension (Aggregation) (PostgreSQL)
      video_plays_with_video_id (20M records):
      date        source  partner  event       video_id  country  cnt
      2013-09-29  ios     viki     video_play  1v        ca       2
      2013-09-29  ios     viki     video_play  2v        ca       18
      …
      video_plays (4M records):
      date        source  partner  event       country  cnt
      2013-09-29  ios     viki     video_play  ca       20
      …
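
Dropping a dimension amounts to re-grouping on the remaining columns and summing `cnt`. The sketch below shows that roll-up in Python on the two rows from the slide; the function name is hypothetical, and in the talk's setup this step would more plausibly be a SQL `GROUP BY` inside PostgreSQL.

```python
from collections import defaultdict


def drop_dimension(rows, dimension, measure="cnt"):
    """Re-aggregate rows after removing one dimension column."""
    totals = defaultdict(int)
    for row in rows:
        # Group key: every remaining column except the measure itself.
        key = tuple((k, v) for k, v in row.items() if k not in (dimension, measure))
        totals[key] += row[measure]
    return [{**dict(key), measure: total} for key, total in totals.items()]


rows = [
    {"date": "2013-09-29", "source": "ios", "partner": "viki",
     "event": "video_play", "video_id": "1v", "country": "ca", "cnt": 2},
    {"date": "2013-09-29", "source": "ios", "partner": "viki",
     "event": "video_play", "video_id": "2v", "country": "ca", "cnt": 18},
]
print(drop_dimension(rows, "video_id"))
```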
  25. b) Injecting Extra Fields For Analysis (PostgreSQL)
      shows 1 : n videos
      shows (before):
      id  title
      1c  Game of Thrones
      2c  How I Met Your Mother
      …
      shows (after):
      id  title                  num_videos
      1c  Game of Thrones        30
      2c  How I Met Your Mother  16
      …
  26. Injecting Extra Fields For Analysis (PostgreSQL)
      containers 1 : n videos
      containers (before):
      id  title
      1c  Game of Thrones
      2c  My Girlfriend Is A Gumiho
      …
      containers (after):
      id  title                      video_count
      1c  Game of Thrones            30
      2c  My Girlfriend Is A Gumiho  16
      …
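
The derived column is simply the number of child rows per parent, stored on the parent so that reports do not need the join. A minimal sketch follows, assuming each video row refers to its container through a `container_id` field; that field name is a guess, since the slides only show the 1-to-n relationship.

```python
from collections import Counter


def with_video_count(containers, videos):
    """Return container rows with a precomputed video_count column."""
    # "container_id" is an assumed foreign-key name on the videos table.
    counts = Counter(video["container_id"] for video in videos)
    return [{**c, "video_count": counts.get(c["id"], 0)} for c in containers]


containers = [{"id": "1c", "title": "Game of Thrones"},
              {"id": "2c", "title": "My Girlfriend Is A Gumiho"}]
videos = [{"id": "1v", "container_id": "1c"},
          {"id": "2v", "container_id": "1c"},
          {"id": "3v", "container_id": "2c"}]
print(with_video_count(containers, videos))
```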
  27. Chunk Tables By Month
      video_plays (parent table)
      … video_plays_2013_06, video_plays_2013_07, video_plays_2013_08, video_plays_2013_09
      ALTER TABLE video_plays_2013_09 INHERIT video_plays;
      ALTER TABLE video_plays_2013_09 ADD CHECK (date >= '2013-09-01' AND date < '2013-10-01');
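
Because a new child table is needed every month, the two statements are natural to generate rather than write by hand. The helper below is a hypothetical sketch that only builds the SQL strings; it writes the `CHECK` expression in parentheses, which is the form PostgreSQL accepts, and handles the December-to-January rollover.

```python
from datetime import date


def month_partition_ddl(parent, year, month):
    """Build the statements that attach one monthly child table to its parent."""
    start = date(year, month, 1)
    # First day of the following month (exclusive upper bound).
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    child = f"{parent}_{year:04d}_{month:02d}"
    return [
        f"ALTER TABLE {child} INHERIT {parent};",
        f"ALTER TABLE {child} ADD CHECK (date >= '{start}' AND date < '{end}');",
    ]


for statement in month_partition_ddl("video_plays", 2013, 9):
    print(statement)
```

With the range constraint in place, PostgreSQL's constraint exclusion can skip child tables whose month cannot match a query's date filter.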
  28. Managing Job Dependency
      • Centralizing All Data Sources
      • Cleaning Data
      • Transforming Data
      • Managing Job Dependencies
  29. Managing Job Dependency … Job A Job B … Analytics DB (PostgreSQL)
  30. Managing Job Dependency … tableA tableB … Analytics DB (PostgreSQL)
  31. Azkaban: Cron dependency management (Viki Cron Dependency Graph)
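
The problem a dependency manager solves is ordering: a job must not start until the jobs whose tables it reads have finished. The sketch below illustrates that idea with Python's standard `graphlib` module (Python 3.9+). It is not Azkaban's configuration format, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs; each one maps to the set of jobs it depends on.
JOBS = {
    "import_events": set(),
    "clean_events": {"import_events"},
    "aggregate_video_plays": {"clean_events"},
    "summary_report": {"aggregate_video_plays"},
}


def run_order(jobs):
    """Return job names in an order where every dependency comes first."""
    # static_order() raises graphlib.CycleError if the dependencies form a cycle.
    return list(TopologicalSorter(jobs).static_order())


print(run_order(JOBS))
```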
  32. 3. Data Presentation and Visualization
  33. Query Reports
  34. Summary report
      • Higher level view of metrics
      • See changes over time
      • (screen shot)
  35. Data Explorer “The world is your oyster”
  36. 4. Real Time Infrastructure
  37. Real Time Infrastructure (Apache Storm)
  38. Real Time Dashboard
  39. Alerts: Know when the house is burning down!
  40. Then Global Content Source and Consumption
  41. Our Technology Stack
      • Languages/Frameworks
        – Ruby, Rails, Python, Go, JavaScript, Node.js
        – Fluentd (log collector)
        – Java, Apache Storm, Kestrel
      • Databases
        – PostgreSQL, MongoDB, Redis
        – Hadoop/Hive, Amazon Redshift
        – Amazon Elastic MapReduce
  42. Hadoop vs. Amazon Redshift
      • Hadoop is a big-data storage and processing platform
        – HDFS: data-storage layer
        – YARN: resource management
        – MapReduce/Pig/Hive/Spark: processing layer
      • Amazon Redshift (MPP, massively parallel processing)
        – Columnar-storage database, meant for analytics
        – OLAP: Online Analytical Processing
        – Examples: Vertica, Amazon Redshift, ParAccel
  43. Recap
  44. Thank You!
      http://bit.ly/viki-datawarehouse
      http://engineering.viki.com/blog/2014/data-warehouse-and-analytics-infrastructure-at-viki/
      engineering.viki.com
      huy@viki.com
