Tu Pham - CTO @ Eway
Google Cloud Next -- Surabaya, Indonesia - 06/2019
End to End Business Intelligence on Google Cloud
About Me
- CTO at Eway JSC
- Google Developer Expert on Cloud Platform
- 8 years of experience in Big Data and Cloud Computing
- Open source contributor, blogger, father
2
In Affiliate Marketing, We Are Partners With
- Indonesia:
  - Go-Jek
  - Bukalapak
  - Traveloka
- The World:
  - Lazada
  - Shopee
  - Aliexpress
  - Adcombo
  - Leadbit
12
> 5M transactions in 2018
13
> $100M gross merchandise volume in 2018
14
When You Have (Big) Data
17
How Do We Use This Data?
18
Use Cases
- Reporting
- Business Analytics
- Operational Analytics
- Product Features
- System Monitoring
19
Reporting
- Reporting to Partners, Advertisers, Publishers, ...
20
Business Analytics
- Analyzing growth, user behavior, sign-up funnels, and sign-up referrals
21
Operational Analytics
- Root cause analysis, latency analysis, error analysis
22
Operational Analytics
- Better threshold alerts, security alerts, and capacity planning
23
Product Features
- Top products, publisher challenges, A/B testing
24
Sample: End-To-End Flow For Mining User Behavior
26
How do we collect this data?
27
Step 1: GC Compute Engine Instances Collect Raw Data
- Technology: Cloud Load Balancing, Compute Engine
- Why Cloud Load Balancing:
  - TCP/UDP Load Balancing
  - Seamless Autoscaling
  - Scalable
- Why Compute Engine:
  - High Performance
  - Scalable
  - Low Cost
  - Fast Networking
  - Custom Machine Types
28
Step 1: GC Compute Engine Instances Collect Raw Data
29
How do we process this data?
30
Step 2: GC Compute Engine Instances Convert Raw Data To Apache Parquet Files
- Technology: Compute Engine, Parquet file format (a conversion sketch follows below)
- Why Parquet:
  - Self-describing, columnar storage format
  - Language-independent
  - High query performance
  - Spark SQL is much faster with Parquet
  - High compression (up to 70%), meaning less disk I/O
31
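A minimal spark-shell sketch of this conversion step, assuming the raw logs are newline-delimited JSON (the real input format and all paths here are illustrative only):

> // Read the raw event logs; Spark infers the schema from the JSON fields.
> val raw = spark.read.json("/log/raw/2017/07/user_engagement/*.json")
> // Write columnar, snappy-compressed Parquet (snappy is Spark's default codec).
> raw.write.option("compression", "snappy").parquet("/log/2017/07/user_engagement/")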
Step 2: GC Compute Engine Instances Convert Raw Data To Apache Parquet Files
32
Step 2: GC Compute Engine Instances Convert Raw Data To Apache Parquet Files
33
Step 3: GC Compute Engine Uploads Parquet Files To GC Cloud Storage
- Technology: Compute Engine, Parquet file format, Cloud Storage (an upload sketch follows below)
- Why Cloud Storage:
  - Four storage classes
  - Easy to integrate
  - Object Lifecycle Management
  - Fast Networking
34
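A hedged spark-shell sketch of the upload, assuming the Cloud Storage connector (the gs:// filesystem) is available on the instances; the bucket name is a placeholder, and outside Spark the same upload could be done with gsutil:

> // Read the Parquet output from Step 2.
> val events = spark.read.parquet("/log/2017/07/user_engagement/")
> // Write it under a dated prefix in a Cloud Storage bucket, which keeps
> // Object Lifecycle Management rules (e.g. moving old data to a colder class) simple.
> events.write.mode("overwrite").parquet("gs://my-bi-datalake/user_engagement/2017/07/")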
Step 3: GC Compute Engine Uploads Parquet Files To GC Cloud Storage
35
Step 3: GC Compute Engine Uploads Parquet Files To GC Cloud Storage
36
Step 3: GC Compute Engine Uploads Parquet Files To GC Cloud Storage
37
Step 3: GC Compute Engine Uploads Parquet Files To GC Cloud Storage
38
How do we visualize this data?
39
Step 4: Explore Dataset Using BI Tools
- Technology: DataPrep, BigQuery, Grafana, PowerBI
40
Step 4: Explore Dataset Using BI Tools
- Technology: DataPrep
41
Step 4: Explore Dataset Using BI Tools
- Technology: BigQuery
42
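One possible way to get the Parquet data into BigQuery for exploration is the open-source spark-bigquery connector; this is an illustrative sketch rather than our exact setup, and the project, dataset, table, and bucket names are placeholders:

> val events = spark.read.parquet("gs://my-bi-datalake/user_engagement/2017/07/")
> // The connector stages rows in a temporary GCS bucket, then loads them into BigQuery.
> events.write.format("bigquery").option("table", "my-project.bi_dataset.user_engagement").option("temporaryGcsBucket", "my-bi-temp-bucket").mode("append").save()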
Step 4b: Explore Dataset Using BI Tools
43
Step 4: Explore Dataset Using BI Tools
- Technology: Grafana
46
Step 4: Explore Dataset Using BI Tools
- Technology: PowerBI
47
Step 4b: Explore Dataset Using BI Tools
48
Step 5: Aggregate Data
> val df = spark.read.parquet("/log/2017/07/user_engagement/1.snappy.parquet")
49
Step 5: Aggregate Data
> df.show()
+--------------------+-----------------+--------------------+------------------+------+-----+----------+
|                  id|       categoryId|             topicId|            userId|action|value|   created|
+--------------------+-----------------+--------------------+------------------+------+-----+----------+
|"100011125479181_...|"253397751448382"|"253397751448382_...|"100011125479181 "|"view"|   ""|1490621079|
|"100004354358107_...|"253397751448382"|"253397751448382_...| "100004354358107"|"view"|   ""|1490491531|
|"100014752680147_...|"253397751448382"|"253397751448382_...| "100014752680147"|"like"|   ""|1490457109|
50
Step 5: Aggregate Data
> val df_group_count = df.groupBy("userId", "categoryId", "action").count()
> df_group_count.show()
+-----------------+-----------------+-------+-----+
|           userId|       categoryId| action|count|
+-----------------+-----------------+-------+-----+
|"100011896037126"|"253397751448382"| "like"|    2|
|"100010391178709"|"253397751448382"| "like"|    1|
|"100011186707422"|"253397751448382"| "like"|    1|
|"100012202096674"|"253397751448382"| "like"|    1|
51
Step 5: Aggregate Big Data
52
Step 5: Aggregate Big Data
- Number of unique users per topic
- Average user engagement per topic
(a sketch of these aggregates follows below)
53
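A sketch of how these two aggregates could be computed from the df loaded in Step 5; as an illustrative assumption, "engagement" is approximated here as the number of actions a user performs on a topic:

> import org.apache.spark.sql.functions.{countDistinct, avg}
> // Number of unique users per topic.
> val uniqueUsersPerTopic = df.groupBy("topicId").agg(countDistinct("userId").as("unique_users"))
> uniqueUsersPerTopic.show()
> // Average user engagement per topic: count actions per (topic, user), then average per topic.
> val perUserActions = df.groupBy("topicId", "userId").count()
> val avgEngagementPerTopic = perUserActions.groupBy("topicId").agg(avg("count").as("avg_user_engagement"))
> avgEngagementPerTopic.show()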
Become Geek
54
Where Are The AI / ML?
55
Create Your Principles
Principles:
- KISS (Keep It Simple, Stupid)
- DRY (Don't Repeat Yourself)
- Single Responsibility
- Low Cost
- Scalable
56
Tips: Be 1% Better Every Day
- Create your system principles
- Design the system architecture, data flow, data model, and data structures first
- Separate realtime and batch flows
- Separate data storage strategies between data types
- Cut network, instance, and storage costs with metric monitoring and an alert system
57
Thank You - Q&A
● Eway: https://eway.vn
● My Contact: tupp@eway.vn
58
