Free Classifieds
www.olx.com
Amazon Redshift at OLX Group
Advanced analytics and big data innovation at the
world’s largest online classifieds business
Dobo Radichkov | London, 5 July 2017
2
Contents
• Introduction to Naspers and OLX Group
• OLX capabilities powered by Redshift
• OLX Redshift technical best practices
• Q&A
3
Contents
• Introduction to Naspers and OLX Group
• OLX capabilities powered by Redshift
• OLX Redshift technical best practices
• Q&A
4
Introducing NASPERS
An $83B global internet
& entertainment group
and one of the largest
technology investors
in the world
$15B Revenue
$2B Earnings
130 Countries
1.5B Audience reach
27,000 People
5
Introducing OLX GROUP: The world’s largest classifieds business
40
Countries
20+
Offices
3000
Employees
15+
Brands
6
OLX Group is a powerful global community
1.7B+ monthly visits
35B+ monthly page views
60M+ monthly listings
300M+ monthly active users
Mobile leader
• 4.4 app rating
• #1 app in 22+ countries
• People spend more than twice as long in OLX apps versus competitors
Scale – listed every second:
• 2 houses
• 2 cars
• 3 fashion items
• 3 mobile phones
Global footprint:
• 40 countries
• 20+ offices
• 3,000 employees
• 15+ brands
7
Contents
• Introduction to Naspers and OLX Group
• OLX capabilities powered by Redshift
• OLX Redshift technical best practices
• Q&A
8
At OLX, Redshift powers 3 important business capabilities
Customer
Lifecycle
Management
Personalisation &
Relevance
Business
Intelligence
9
At OLX, Redshift powers 3 important business capabilities
Customer
Lifecycle
Management
Personalisation &
Relevance
Business
Intelligence
Enhance the user
experience and lifetime
value via personalised,
relevant, targeted and
unified omni-channel
user communications
10
[Chart: activity level (buying and selling) over time, across lifecycle stages: Visitor → First-time user → Returning user → Loyal user (buying and selling) → Loyal user (across multiple categories) → Fan!]
Increase the customer lifetime value… through the ‘right’ product, marketing and customer care treatments
Fundamentally, CLM is all about fuelling retention-driven growth by
treating our customers the best way possible in their lifecycle stage
11
[Diagram: CLM implementation flow – (1) Platforms and data (transactional, website & mobile app, customer care, social / 3rd party), (2) Single customer view, (3) Insights & analytics, (4) Customer segmentation, (5) Customer treatments, (6) Execution]
The OLX CLM implementation is enabled by automated, data-driven,
targeted and personalised customer treatments
12
At OLX, Redshift powers 3 important business capabilities
Customer
Lifecycle
Management
Personalisation &
Relevance
Business
Intelligence
Enhance the user
experience and lifetime
value via personalised,
relevant, targeted and
unified omni-channel
user communications
Grow buyer engagement, seller success
and transactions by
showing relevant &
personalised content to
each of our users
13
Context: Panamera is tackling the challenges of Search & Relevance
across 3 pillars of content discovery
Home page Search experience Recommendations
Show the most relevant
content to each of our users
• Personalised
content feed driving
buyer engagement
and seller success
driven by buyer
interests, social
relationship,
proximity, freshness
and ad / user quality
• Core search results experience including text matching, spell checking, synonym mapping, language-specific optimisations, etc.
• Search auto-complete, auto-suggest, instant results and curated content
• Recommended
content (e.g. listings,
categories, search)
used to personalise
elements of the
buyer + seller user
journey(s) based on
past behaviour,
preferences and
activity
14
At OLX, Redshift powers 3 important business capabilities
Customer
Lifecycle
Management
Personalisation &
Relevance
Business
Intelligence
Enhance the user
experience and lifetime
value via personalised,
relevant, targeted and
unified omni-channel
user communications
Grow buyer engagement, seller success
and transactions by
showing relevant &
personalised content to
each of our users
Empower the business
with a robust BI data
platform, high-quality
executive reporting,
and actionable
customer insights
15
Contents
• Introduction to Naspers and OLX Group
• OLX capabilities powered by Redshift
• OLX Redshift technical best practices
• Q&A
16
Our guiding principles for big data development using Redshift
ü Use few but powerful technologies (and become world-class at them)
ü Keep architecture simple and minimise points of failure
ü Standardise, build on each other, foster continuous improvement
“Everything should be made as simple
as possible, but not simpler.”
Einstein
17
OLX Redshift technical best practices
Ø Technical architecture
Ø Data management
Ø Recommenders
Ø Unit testing
Ø OLAP cubes
Ø Tableau integration
18
OLX Redshift technical best practices
Ø Technical architecture
Ø Data management
Ø Recommenders
Ø Unit testing
Ø OLAP cubes
Ø Tableau integration
19
Let’s walk through our high-level data architecture step by step…
RDL
ODL
Master
64 × ds1.xl
CLM platform
100 × dc1.l
RDL
ODL
RDL+
ODL+
ADL+
Analyst
sandboxes
(read / write
access)
BI platform
64 × ds1.xl
ADL
CLM
APIs
Management
dashboards
Operational
dashboards
SCV ADL
ODL ODL
Ad hoc analytics
LiveSync Hydra Moderation CRMs APIs Crawlers …
Master data infrastructure Extended BI infrastructure
Data refreshed every ~3-5 hours Data refreshed every 24 hours
Load / unload Transformation / modelling Replication
Reporting platform
100 × dc1.l
Mktg
channels
20
LiveSync in-house technology enables dynamic synchronisation of
MySQL production databases to Redshift
LiveSync
Platform database
Live DB replica
(MySQL)
Lazarus
MySQL extractor
(Python)
S3 storage
Data lake
LiveSync
Redshift loader
(Python)
LiveSync
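For illustration only – the actual LiveSync Redshift loader is a Python utility, and the table name, S3 path and IAM role below are hypothetical – the Redshift side of one load cycle essentially boils down to a COPY from the S3 data lake into an RDL table:
-- Hypothetical sketch of a single LiveSync load step
COPY rdl.livesync_users
FROM 's3://olx-data-lake/livesync/users/2017-07-05/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-loader'
FORMAT AS CSV GZIP
TIMEFORMAT 'auto';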
21
Ninja / Hydra in-house multi-tracking capability collects structured
clickstream data from each client device
LiveSync
Client device
(desktop,
mobile, apps)
Ninja tracker
Client-side library
(JS, PHP,
Android, iOS)
Hydra tracker
Server-side
Java application
S3 storage
Data lake
Hydra
Ninja / Hydra
22
LiveSync and Hydra raw data form the backbone of our architecture
and are loaded in Raw Data Layer (RDL) of our Master Redshift cluster
LiveSync Hydra
RDL
Master
64 × ds1.xl
Load / unload Transformation / modelling Replication
LiveSync:
• ~1,000 tables
• ~100 billion records
• ~12 TB compressed storage
Hydra:
• ~400 tables
• ~400 billion records
• ~30 TB compressed storage
Raw data layer (RDL)
23
Raw data is transformed and modelled into the core Operational Data
Layer (ODL) used to feed all data applications
LiveSync Hydra
RDL
Master
64 × ds1.xl
ODL
Stats:
• ~100 tables
• ~150 billion records
• ~5 TB compressed storage
Facts:
• Listings
• Listing liquidity
• Replies
• Revenue transactions
• Clickstream events
• Listing impressions
• …
Dimensions:
• Users
• Business units
• Geographies
• Categories
• Channels
• (~40 dimensions in total)
Operational data layer (ODL)
Load / unload Transformation / modelling Replication
24
From here, the ODL is replicated to each data application via our Rzeka
data replication in-house utility
LiveSync Hydra
RDL
Master
64 × ds1.xl
ODL
CLM platform
100 × dc1.l
Reporting platform
100 × dc1.l
ODL ODL
RDL
ODL
BI platform
64 × ds1.xl
Rzeka replication utility
What is it?
• A fully-configurable Python utility enabling incremental
Redshift-to-Redshift data replication
Load / unload Transformation / modelling Replication
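Rzeka itself is Python, but conceptually each incremental replication cycle can be thought of as an UNLOAD on the source cluster followed by a COPY on the target cluster; the table, S3 prefix, IAM roles and watermark column below are assumptions used only to sketch the idea:
-- On the Master cluster: unload only the rows changed since the last replication watermark
UNLOAD ('SELECT * FROM odl.fact_listings WHERE date_updated_nk >= ''2017-07-04''')
TO 's3://olx-replication/odl/fact_listings/2017-07-05/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload'
GZIP ALLOWOVERWRITE;
-- On the target cluster (e.g. the Reporting platform): append the increment
COPY odl.fact_listings
FROM 's3://olx-replication/odl/fact_listings/2017-07-05/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-loader'
GZIP;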
25
ODL is used as a basis to build the CLM Single Customer View
enabling CLM treatments and recommendations
LiveSync Hydra
RDL
Master
64 × ds1.xl
ODL
CLM platform
100 × dc1.l
ODL ODL
RDL
ODL
BI platform
64 × ds1.xl
CLM data platform
What does it do?
• User mapping, customer lifecycle segmentation,
recommendation & sort order algorithms, CLM treatments
generation & execution, CLM reporting & analytics
CLM
APIs
Mktg
channels
SCV
Load / unload Transformation / modelling Replication
Reporting platform
100 × dc1.l
26
Similarly, ODL is used to generate the Analytical Data Layer (ADL) in
our Reporting platform
LiveSync Hydra
RDL
Master
64 × ds1.xl
ODL
CLM platform
100 × dc1.l
ODL ODL
RDL
ODL
SCV
Management
dashboards
ADL
BI platform
64 × ds1.xl
ADL
Triton mgmt. reporting
What is it?
• Data models and cubes for Top Management KPIs across
seller, buyer, liquidity, revenue and product activity
• Tableau dashboard implementation
Load / unload Transformation / modelling Replication
CLM
APIs
Mktg
channels
Reporting platform
100 × dc1.l
27
The BI data warehouse sources additional raw data into an Extended
Raw Data Layer (RDL+)
LiveSync Hydra
RDL
Master
64 × ds1.xl
ODL
CLM platform
100 × dc1.l
ODL ODL
RDL
ODL
BI platform
64 × ds1.xl
SCV
Management
dashboards
ADL
RDL+
Moderation CRMs APIs Crawlers …
ADL
Extended Raw Data Layer (RDL+)
What is it?
• New raw data sources supporting extended management
and operational analytics across e.g. Performance
Marketing, CS, Competitor, Salesforce, etc.
Load / unload Transformation / modelling Replication
CLM
APIs
Mktg
channels
Reporting platform
100 × dc1.l
28
The RDL+ is modelled into an Extended Operational & Analytical
Data Layers (ODL+ & ADL+) used to power operational reporting
LiveSync Hydra
RDL
Master
64 × ds1.xl
ODL
CLM platform
100 × dc1.l
ODL ODL
RDL
ODL
BI platform
64 × ds1.xl
SCV
Management
dashboards
ADL
RDL+
Moderation CRMs APIs Crawlers …
ADL
ODL+
ADL+
Operational
dashboards
ODL+ and ADL+
What is it?
• Data layers enabling
extended operational
reporting and
analysis
Load / unload Transformation / modelling Replication
CLM
APIs
Mktg
channels
Reporting platform
100 × dc1.l
29
Ad hoc analysis is enabled through read-only SQL endpoints and
read/write sandboxes inside the BI platform
LiveSync Hydra
RDL
Master
64 × ds1.xl
ODL
CLM platform
100 × dc1.l
ODL ODL
RDL
ODL
BI platform
64 × ds1.xl
SCV
Management
dashboards
ADL
RDL+
Moderation CRMs APIs Crawlers …
ADL
ODL+
ADL+
Operational
dashboards
Analyst
sandboxes
for ad hoc
analysis
Analyst
sandboxes
(read / write
access)
Ad hoc analytics
Load / unload Transformation / modelling Replication
CLM
APIs
Mktg
channels
Reporting platform
100 × dc1.l
30
Detailed end-state OLX Group central data architecture
RDL
ODL
Master
64 × ds1.xl
CLM platform
100 × dc1.l
RDL
ODL
RDL+
ODL+
ADL+
Analyst
sandboxes
(read / write
access)
BI platform
64 × ds1.xl
ADL
CLM
APIs
Management
dashboards
Operational
dashboards
SCV ADL
ODL ODL
Ad hoc analytics
LiveSync Hydra Moderation CRMs APIs Crawlers …
Master data infrastructure Extended BI infrastructure
Data refreshed every ~3-5 hours Data refreshed every 24 hours
Load / unload Transformation / modelling Replication
Reporting platform
100 × dc1.l
Mktg
channels
31
Side note: With Amazon’s new Athena and Spectrum services we are
exploring new architectural possibilities & improvements
LiveSync Hydra Moderation CRMs APIs Crawlers …
Redshift
Athena metadata catalogue (using Hive DDL)
Athena JDBC endpoint
Presto distributed SQL engine
Spectrum
external S3
tables
Redshift
native
tables
Spectrum distributed SQL engine
JOIN
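As a hedged sketch of the direction being explored (catalogue database, IAM role, S3 location, column names and the join key are illustrative assumptions), Spectrum lets clickstream files in S3 be registered as external tables through the shared metadata catalogue and JOINed with native Redshift tables:
-- Register the shared (Athena / Glue) catalogue as an external schema in Redshift
CREATE EXTERNAL SCHEMA spectrum_rdl
FROM DATA CATALOG DATABASE 'olx_data_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum';
-- Define an external table over raw clickstream files kept in S3
CREATE EXTERNAL TABLE spectrum_rdl.hydra_events (
user_id BIGINT,
event_name VARCHAR(200),
event_time TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://olx-data-lake/hydra/events/';
-- JOIN external S3 data with a native Redshift table
SELECT d.country_sk, COUNT(1) AS events
FROM spectrum_rdl.hydra_events e
JOIN odl.dim_users d ON d.user_nk = e.user_id -- hypothetical join key
GROUP BY 1;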
32
OLX Redshift technical best practices
Ø Technical architecture
Ø Data management
Ø Recommenders
Ø Unit testing
Ø OLAP cubes
Ø Tableau integration
33
Data management framework
Code
Data
work-
flows
Data
layers
Clusters
34
Data management framework
Code
Data
work-
flows
Data
layers
Clusters
35
1 cluster = 1 capability = 1 owner ≈ 1 team
Master data
platform
64 × ds1.xl
CLM platform
100 × dc1.l
BI platform
64 × ds1.xl
Reporting
platform
100 × dc1.l
Tableau cluster
48 × dc1.l
36
Data management framework
Code
Data
work-
flows
Data
layers
Clusters
37
We typically organise our data in 3 data layers
RDL
ODL
ADL
Raw Data Layer
Raw disaggregated and
unprocessed clickstream and
production database data
Operational Data Layer
Clean, structured and standardised
dimensional model serving as
foundation for all data applications
Application Data Layer
Data models specific to each data application
– including CLM algorithms, recommenders,
BI metrics, OLAP cubes, etc.
38
Large scale clickstream data management
Optimise
table design
Partition
data
Create
abstraction
views
RDL
39
Large scale clickstream data management
Optimise
table design
Partition
data
Create
abstraction
views
Goal is to achieve best possible distribution, date range
WHERE performance, and compression effectiveness
RDL
40
Large scale clickstream data management
Optimise
table design
Partition
data
Create
abstraction
views
Goal is to achieve best possible distribution, date range
WHERE performance, and compression effectiveness
Option 1
Distribution key user_id
Sort key event_timestamp
Best possible
distribution
Date range
performance
Effective
compression
(poor compression
of text columns)
RDL
41
Large scale clickstream data management
Optimise
table design
Partition
data
Create
abstraction
views
Goal is to achieve best possible distribution, date range
WHERE performance, and compression effectiveness
Option 1 Option 2
Distribution key user_id user_id
Sort key event_timestamp user_id
Best possible
distribution
Date range
performance (always full scan)
Effective
compression
(poor compression
of text columns)
(1/2x table size)
RDL
42
Large scale clickstream data management
Optimise
table design
Partition
data
Create
abstraction
views
Goal is to achieve best possible distribution, date range
WHERE performance, and compression effectiveness
Option 1 Option 2 Option 3
Distribution key user_id user_id user_id
Sort key event_timestamp user_id
event_date,
user_id,
event_timestamp
Best possible
distribution
Date range
performance (always full scan)
Effective
compression
(poor compression
of text columns)
(1/2x table size) (1/2x table size)
RDL
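A minimal DDL sketch of Option 3 (the table and column names are simplified illustrations; in practice encodings are chosen per column, e.g. guided by ANALYZE COMPRESSION):
CREATE TABLE rdl.clickstream_events (
event_date DATE ENCODE RAW, -- leading sort key: cheap date-range WHERE pruning
user_id BIGINT ENCODE RAW, -- distribution key: co-locates each user's events
event_timestamp TIMESTAMP ENCODE RAW,
event_name VARCHAR(200) ENCODE LZO, -- text columns compress well once rows are clustered
event_properties VARCHAR(4000) ENCODE LZO
)
DISTSTYLE KEY DISTKEY (user_id)
SORTKEY (event_date, user_id, event_timestamp);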
43
Large scale clickstream data management
Examples:
• europe_android_201706
• latam_web_201703
• asia_ios_201706
• (~650 tables in total)
Benefits:
• Easily DROP older data when no longer needed
• Minimise use of DELETE / VACUUM operations
• Localise points of failure and ringfence data repairs
• Allow for platform and channel-specific table schema
Optimise
table design
Partition
data
Create
abstraction
views
We partition our clickstream data into 1 table per
PLATFORM × CHANNEL × MONTH combination
RDL
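Operationally the partitioning boils down to plain DDL per partition plus cheap retention management; the month and retention window below are illustrative:
-- One physical table per PLATFORM x CHANNEL x MONTH, cloned from an existing partition
-- (LIKE carries over the distribution style and sort key of the previous month's table)
CREATE TABLE rdl.europe_android_201707 (LIKE rdl.europe_android_201706);
-- Retiring old data is a metadata-only DROP instead of costly DELETE + VACUUM
DROP TABLE IF EXISTS rdl.europe_android_201506;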
44
Large scale clickstream data management
Types of abstraction views
• Current month → europe_android_current_month
• Previous month → latam_web_previous_month
• Last X months (X = 3/6/12) → asia_ios_last_X_months
• All months → africa_android
• View creation is automated via Python script
• (~300 views in total)
Benefits:
• Abstraction of underlying partitioning mechanism
• Time-agnostic ETLs and analytical queries
Optimise
table design
Partition
data
Create
abstraction
views
We create abstraction VIEWs over individual tables
that UNION ALL the data into relevant groups
RDL
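A sketch of one such generated view (the real views are emitted by the Python script over the exact partition list; the schema name is an assumption):
CREATE OR REPLACE VIEW rdl.europe_android_last_3_months AS
SELECT * FROM rdl.europe_android_201704
UNION ALL
SELECT * FROM rdl.europe_android_201705
UNION ALL
SELECT * FROM rdl.europe_android_201706;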
45
Dimensional modelling approach
Surrogate
keys
Hierarchical
dimensions
ODL
ADL
46
Dimensional modelling approach
Surrogate
keys
Hierarchical
dimensions
OLX operates in dozens of markets using many
different classifieds platforms built on different
technologies and using different production databases
It is impossible to ensure system id uniqueness →
Need a robust mechanism for surrogate key modelling
ODL
ADL
47
Dimensional modelling approach
Surrogate
keys
Hierarchical
dimensions
OLX operates in dozens of markets using many
different classifieds platforms built on different
technologies and using different production databases
It is impossible to ensure system id uniqueness →
Need a robust mechanism for surrogate key modelling
Option 1 Example
Use combination of
system ids that
guarantees uniqueness
Leads to nightmare
spaghetti SQL that is
difficult and
time-consuming to
write, read and maintain
SELECT ...
FROM odl.fact_listings f
JOIN odl.dim_categories d
ON f.platform_id = d.platform_id
AND f.country_id = d.country_id
AND f.brand_id = d.brand_id
AND f.category_id = d.category_id
AND f.category_level = d.category_level
JOIN ...
GROUP BY ...
ODL
ADL
48
Dimensional modelling approach
Surrogate
keys
Hierarchical
dimensions
OLX operates in dozens of markets using many
different classifieds platforms built on different
technologies and using different production databases
It is impossible to ensure system id uniqueness →
Need a robust mechanism for surrogate key modelling
Option 1 Option 2
Use combination of
system ids that
guarantees uniqueness
Create globally unique
identifier (GUID) for
each dimension value
• Complex to implement
• Requires GUID
mapping tables to be
able to trace values
back to system ids
• Dimension keys lose
semantic meaning
ODL
ADL
49
Dimensional modelling approach
Surrogate
keys
Hierarchical
dimensions
OLX operates in dozens of markets using many
different classifieds platforms built on different
technologies and using different production databases
It is impossible to ensure system id uniqueness →
Need a robust mechanism for surrogate key modelling
Option 1 Option 2 Option 3
Use combination of
system ids that
guarantees uniqueness
Create globally unique
identifier (GUID) for
each dimension value
Create smart &
persistent surrogate
keys that preserve
semantic meaning
ODL
ADL
50
Dimensional modelling approach
Surrogate
keys
Hierarchical
dimensions
SELECT ...
FROM fact_listings
JOIN dim_countries USING (country_sk) -- 'olx|eu|ua'
JOIN dim_categories USING (category_sk) -- 'olx|asia|in|5|84|1531'
JOIN dim_geographies USING (geography_sk) -- 'olx|eu|ua|17|194|194'
JOIN dim_channels USING (channel_sk) -- 'mobile_app|android'
JOIN dim_listing_status USING (listing_status_sk) -- 'inactive|mod|mod_removed'
JOIN dim_listing_types USING (listing_type_sk) -- 'private'
JOIN dim_listing_feeds USING (listing_feed_sk) -- 'normal'
JOIN dim_listing_net USING (listing_net_sk) -- 'net|mod>live>eod'
JOIN dim_currencies USING (currency_sk) -- 'aed'
JOIN dim_users USING (user_sk)
--'olx|latam|pe|platform|email|freddy@gmail.com'
JOIN ...
GROUP BY ...
Example
Benefits:
• Simple and guaranteed JOINs
• ‘Readable’ key values
• Negligible impact on query performance (= Redshift rocks!)
ODL
ADL
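How such a readable key might be assembled is sketched below; the source table and columns are hypothetical stand-ins for whichever combination of system ids guarantees uniqueness:
-- Hypothetical sketch: derive a pipe-delimited, human-readable surrogate key
SELECT LOWER(brand_code) || '|' || LOWER(region_code) || '|' || LOWER(country_code) AS country_sk, -- e.g. 'olx|eu|ua'
country_name
FROM rdl.system_countries;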
51
Dimensional modelling approach
Surrogate
keys
Hierarchical
dimensions
We model most dimensions as hierarchical dimensions
ODL
ADL
Approach:
• Extract parent-child relationship from system dimensions
• For non-system dimensions, define child-parent relationship in
configuration tables
• Create hierarchical dimensions using a generic SQL hierarchy
generation script (see the sketch below)
• Set all dimension key values in fact tables as deepest available
hierarchy level (ideally key of tree leaf values)
Benefits:
• Consistent dimensional modelling approach
• Can easily traverse hierarchy from leaves all the way up to root
• Easy to read & write JOINs with single-key ON conditions
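A minimal sketch of the idea behind the hierarchy generation, assuming a two-level channel hierarchy and a hypothetical configuration table (the production script is generic across dimensions and levels):
CREATE OR REPLACE VIEW odl.dim_channels_view AS
SELECT child.channel_sk, -- leaf key, e.g. 'mobile_app|android'
child.channel_name AS channel_l2_name, -- e.g. 'android'
parent.channel_sk AS channel_l1_sk,
parent.channel_name AS channel_l1_name -- e.g. 'mobile_app'
FROM cfg.dim_channels_config child
LEFT JOIN cfg.dim_channels_config parent
ON child.parent_channel_sk = parent.channel_sk;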
52
Dimensional modelling approach – Hierarchical dimensions
ODL
ADL
Example: dim_channels dimension
Configuration
53
Dimensional modelling approach – Hierarchical dimensions
ODL
ADL
Example: dim_channels dimension
Output
54
Dimensional modelling approach – Hierarchical dimensions
Code taster
ODL
ADL
55
Data management framework
Code
Data
work-
flows
Data
layers
Clusters
56
The simplest workflow typically takes an input and applies dimensional
modelling & business rules to transform the data into the desired output
Input table(s)
Transformation
Transformation
View
Table
SELECT
INSERT
Dimension(s)
Map(s)
Every component uses
the output from other
previously modelled
components as input
Relevant Dimensions
and Maps are included
to apply dimensional
model and required
business rules
SQL transformation logic is
decoupled and stored in a
separate Transformation VIEW
CREATE TABLE fact_output (...)
DISTSTYLE KEY DISTKEY (...) SORTKEY(...);
CREATE OR REPLACE VIEW fact_output_view AS
SELECT ... -- business logic and projections
FROM fact_input
JOIN dimensions
JOIN maps
GROUP BY ...; -- aggregation logic
TRUNCATE fact_output;
INSERT INTO fact_output
SELECT * FROM fact_output_view;
ANALYZE fact_output;
Final table that
contains output of
data workflow
57
For more complex workflows, we use one or more staging steps to
ensure code modularity and have better control over performance
Input table(s)
View
Table
SELECT
INSERT
Transformation
staging step(s)
Transformation
Transformation
staging step(s)
Transformation
Dimension(s)
Map(s)
Intermediate staging logic is
encapsulated in separate VIEWs
Staging output is materialised in
dedicated tables to be used as input for
subsequent transformation steps
CREATE TABLE fact_output_staging_step1 (...)
DISTSTYLE KEY DISTKEY (...) SORTKEY(...);
CREATE TABLE fact_output_staging_step2 (...)
DISTSTYLE KEY DISTKEY (...) SORTKEY(...);
-- ... more staging steps as required ...
CREATE TABLE fact_output (...)
DISTSTYLE KEY DISTKEY (...) SORTKEY(...);
CREATE OR REPLACE VIEW
fact_output_staging_step1_view AS
SELECT ... -- business logic and projections
FROM fact_input
JOIN ...
GROUP BY ...; -- aggregation logic
CREATE OR REPLACE VIEW
fact_output_staging_step2_view AS
SELECT ... -- business logic and projections
FROM fact_output_staging_step1
JOIN ...
GROUP BY ...; -- aggregation logic
-- ... more staging steps as required ...
CREATE OR REPLACE VIEW fact_output_view AS
SELECT ... -- business logic and projections
FROM fact_output_staging_step2
JOIN ...
GROUP BY ...; -- aggregation logic
TRUNCATE fact_output_staging_step1;
INSERT INTO fact_output_staging_step1
SELECT * FROM fact_output_staging_step1_view;
ANALYZE fact_output_staging_step1;
TRUNCATE fact_output_staging_step2;
INSERT INTO fact_output_staging_step2
SELECT * FROM fact_output_staging_step2_view;
ANALYZE fact_output_staging_step2;
-- ... more staging steps as required ...
TRUNCATE fact_output;
INSERT INTO fact_output
SELECT * FROM fact_output_view;
ANALYZE fact_output;
58
We use Feeders to apply the same Transformation to different Inputs
Input table(s)
View
Table
SELECT
INSERT
Feeder
Transformation
Transformation
Dimension(s)
Map(s)
CREATE OR REPLACE VIEW fact_output_feeder1_view AS
-- Combine & apply projections / pre-processing
SELECT ... FROM fact_input1 UNION ALL
SELECT ... FROM fact_input2 UNION ALL
SELECT ... FROM fact_input3;
CREATE OR REPLACE VIEW fact_output_feeder2_view AS
-- Combine & apply projections / pre-processing
SELECT ... FROM fact_input4 UNION ALL
SELECT ... FROM fact_input5 UNION ALL
SELECT ... FROM fact_input6;
CREATE OR REPLACE VIEW fact_output_feeder_view AS
SELECT * FROM fact_output_feeder2_view;
CREATE OR REPLACE VIEW fact_output_view AS
SELECT ... -- business logic and projections
FROM fact_output_feeder_view
...;
TRUNCATE fact_output;
-- Run transformation on fact_input1
CREATE OR REPLACE VIEW fact_output_feeder_view AS
SELECT * FROM fact_output_feeder1_view;
INSERT INTO fact_output
SELECT * FROM fact_output_view;
ANALYZE fact_output;
-- Run transformation on fact_input2
CREATE OR REPLACE VIEW fact_output_feeder_view AS
SELECT * FROM fact_output_feeder2_view;
INSERT INTO fact_output
SELECT * FROM fact_output_view;
ANALYZE fact_output;
Feeders are VIEWs that decouple the
input data from the Transformation logic.
They can be used to collate multiple
inputs (e.g. UNION ALL) & apply basic
pre-processing and projections serving
as basis for rest of data flow
Input table(s)
59
Feeders can be very powerful with incremental data workflows
Input table(s)
Transformation
Transformation
Dimension(s)
Map(s)
View
Table
SELECT
INSERT
Fast incremental
load feeder
Slow incremental
load feeder
Full load feeder
CREATE OR REPLACE VIEW fact_output_feeder_full_load_view AS
SELECT ... -- projections & pre-processing
-- Use full available time range
FROM fact_input;
CREATE OR REPLACE VIEW fact_output_feeder_incr_load_slow_view AS
SELECT ... -- projections & pre-processing
FROM fact_input
-- Use 4-week incremental time window
WHERE date_nk >= (GETDATE() :: DATE - INTERVAL '4 week') :: DATE;
CREATE OR REPLACE VIEW fact_output_feeder_incr_load_fast_view AS
SELECT ... -- projections & pre-processing
FROM fact_input
-- Use 2-day incremental time window
WHERE date_nk >= (GETDATE() :: DATE - INTERVAL '2 day') :: DATE;
CREATE OR REPLACE VIEW fact_output_feeder_view AS
SELECT * FROM fact_output_feeder_incr_load_fast_view;
Typically we switch between 3 load feeders
depending on the ETL processing approach:
(1) Full load – Processes the input in its
entirety. Used when running transfor-
mation for the first time on full scale.
(2) Incremental Fast load – Processes last
few days / hours of data. Used by
default for scheduled production job.
(3) Incremental Slow load – Processes last
month of data. Used to manually fix
problems from previous runs without
having to re-process years of data.
60
Another great use of Feeders is as a method to switch between
Production and Development environments
Input table(s)
Transformation
Transformation
Dimension(s)
Map(s)
Development feeder Production feeder
View
Table
SELECT
INSERT
We use Production /
Development feeder VIEWs to
switch between full scale (e.g. all
countries) and development
scale (e.g. few small countries).
This enables fast ETL runtimes
during development and testing
CREATE OR REPLACE VIEW fact_output_feeder_development_view AS
SELECT ... -- projections & pre-processing
FROM fact_input
WHERE country_sk IN ('olx|mea|gh', 'olx|mea|za');
CREATE OR REPLACE VIEW fact_output_feeder_production_view AS
SELECT ... -- projections & pre-processing
FROM fact_input;
CREATE OR REPLACE VIEW fact_output_feeder_view AS
SELECT * FROM fact_output_feeder_development_view;
61
Ultimately, these patterns can be combined in various ways
depending on the requirements of the data workflow
Input table(s)
Development feeder
Transformation
staging step(s)
Transformation
Production feeder
Fast incremental
load feeder
Slow incremental
load feeder
Full load feeder
Transformation
staging step(s)
Transformation
Root feeder
View
Table
SELECT
INSERT
Dimension(s)
Map(s)
62
Individual data workflows add up to our full data management
ecosystem
Component 1
Component 2
Component 3
Component 4
…
…
63
Data management framework
Code
Data
work-
flows
Data
layers
Clusters
64
High-level code repository organisation
master rdl hydra android *.sql
ios *.sql
… *.sql
livesync … *.sql
odl dim dim_actions *.sql
dim_categories *.sql
… *.sql
map map_actions *.sql
map_channels *.sql
… *.sql
fact fact_event_clickstream *.sql
fact_listings *.sql
… *.sql
stg … *.sql
… … *.sql
adl dim, map, fact … … *.sql
… … *.sql
clm rdl dim, map, fact … … *.sql
odl dim, map, fact … … *.sql
adl dim, map, fact … … *.sql
…
Cluster Data layer
Component
group
Component Code
65
Example repository structure and file naming
We use standard file naming
with special prefixes to
decouple the logical building
blocks of the data workflow:
• Table definition
• View definition (feeders,
transformations, unit tests)
• ETL scripts
• Configuration scripts
• Analysis
66
OLX Redshift technical best practices
Ø Technical architecture
Ø Data management
Ø Recommenders
Ø Unit testing
Ø OLAP cubes
Ø Tableau integration
67
Collaborative filtering visual 101
68
Collaborative filtering visual 101
Depeche Mode The Cure Keane Placebo Suede
69
Collaborative filtering visual 101
Depeche Mode The Cure Keane Placebo Suede
Jack Jill Jim Jen
70
Collaborative filtering visual 101
Depeche Mode The Cure Keane Placebo Suede
Jack Jill Jim Jen
71
1st degree
recommendation
Collaborative filtering visual 101
Depeche Mode The Cure Keane Placebo Suede
Jack Jill Jim Jen
72
1st degree
recommendation
1st degree
recommendation
Collaborative filtering visual 101
Depeche Mode The Cure Keane Placebo Suede
Jack Jill Jim Jen
73
1st degree
recommendation
1st degree
recommendation
Collaborative filtering visual 101
Depeche Mode The Cure Keane Placebo Suede
2nd degree
recommendation
Jack Jill Jim Jen
74
Collaborative filtering implementation
Get and rate all
relevant item
interactions
Compute item
similarity matrix
Compute 1st and 2nd
degree item-to-item
recommendations
Compute personalised
user-to-item
recommendations
75
Collaborative filtering implementation
Get and rate all
relevant item
interactions
Compute item
similarity matrix
Compute 1st and 2nd
degree item-to-item
recommendations
Compute personalised
user-to-item
recommendations
CREATE TABLE item_interactions AS
SELECT user,
band,
SUM(DECODE(action, 'like', 1, 'play', 3)) AS score
FROM clickstream
WHERE action IN ('like', 'play')
GROUP BY 1, 2;
user band score
Jack Depeche Mode 5
Jack The Cure 3
Jack Keane 8
Jill Depeche Mode 4
Jill Keane 10
Jill Suede 7
Jim Keane 15
… … …
76
Collaborative filtering implementation
Get and rate all
relevant item
interactions
Compute item
similarity matrix
Compute 1st and 2nd
degree item-to-item
recommendations
Compute personalised
user-to-item
recommendations
CREATE TABLE similarity_matrix AS
SELECT i1.band AS band1,
i2.band AS band2,
COUNT(1) AS frequency,
SUM(i1.score + i2.score) AS score
FROM item_interactions i1
JOIN item_interactions i2
ON i1.user = i2.user
AND i1.band <> i2.band
GROUP BY 1,2
HAVING COUNT(1) > 1;
user band score
Jack Depeche Mode 5
Jack The Cure 3
Jack Keane 8
Jill Depeche Mode 4
Jill Keane 10
Jill Suede 7
Jim Keane 15
… … …
Depeche
Mode
TheCure
Keane
Placebo
Suede
Depeche Mode 3 4
The Cure 3 3
Keane 4 3 5
Placebo 2
Suede 5 2
77
user band score
Jack Depeche Mode 5
Jack The Cure 3
Jack Keane 8
Jill Depeche Mode 4
Jill Keane 10
Jill Suede 7
Jim Keane 15
… … …
Depeche
Mode
TheCure
Keane
Placebo
Suede
Depeche Mode 3 4
The Cure 3 3
Keane 4 3 5
Placebo 2
Suede 5 2
Collaborative filtering implementation
Get and rate all
relevant item
interactions
Compute item
similarity matrix
Compute 1st and 2nd
degree item-to-item
recommendations
Compute personalised
user-to-item
recommendations
1st degree item-to-item recommendations
band1 band2 frequency sum_score rec_rank
Depeche Mode Keane 4 49 1
Depeche Mode The Cure 3 39 2
Keane Suede 5 80 1
Keane Depeche Mode 4 49 2
Keane The Cure 3 48 3
Placebo Suede 2 62 1
Suede Keane 5 80 1
Suede Placebo 2 62 2
The Cure Keane 3 48 1
The Cure Depeche Mode 3 39 2
CREATE TABLE dobo.rec_item2item AS
SELECT band1, band2, frequency, sum_score,
ROW_NUMBER() OVER (PARTITION BY band1 ORDER BY frequency DESC, sum_score DESC) AS rec_rank
FROM similarity_matrix;
78
Collaborative filtering implementation
Get and rate all
relevant item
interactions
Compute item
similarity matrix
Compute 1st and 2nd
degree item-to-item
recommendations
Compute personalised
user-to-item
recommendations
INSERT INTO dobo.rec_item2item
WITH max_rank AS (
SELECT band1, MAX(rec_rank) AS max_rank_1st_degree
FROM dobo.rec_item2item
GROUP BY 1
)
SELECT rec_1st.band1,
rec_2nd.band2,
NULL AS frequency,
NULL AS sum_score,
ROW_NUMBER() OVER ( PARTITION BY rec_1st.band1
ORDER BY MIN(rec_1st.rec_rank *
rec_2nd.rec_rank),
MIN(rec_1st.rec_rank)
) + max_rank_1st_degree AS rec_rank
FROM dobo.rec_item2item rec_1st
JOIN max_rank USING (band1)
JOIN dobo.rec_item2item rec_2nd
ON rec_1st.band2 = rec_2nd.band1
AND rec_1st.band1 <> rec_2nd.band2
-- exclude items already in 1st degree recommendations
LEFT JOIN dobo.rec_item2item rec_excl
ON rec_1st.band1 = rec_excl.band1
AND rec_2nd.band2 = rec_excl.band2
WHERE rec_excl.band1 IS NULL
GROUP BY 1, 2, max_rank_1st_degree;
1st degree item-to-item recommendations
band1 band2 frequency sum_score rec_rank
Depeche Mode Keane 4 49 1
Depeche Mode The Cure 3 39 2
Keane Suede 5 80 1
Keane Depeche Mode 4 49 2
Keane The Cure 3 48 3
Placebo Suede 2 62 1
Suede Keane 5 80 1
Suede Placebo 2 62 2
The Cure Keane 3 48 1
The Cure Depeche Mode 3 39 2
2nd degree item-to-item recommendations
band1 band2 frequency sum_score rec_rank
Depeche Mode Suede 3
Keane Placebo 4
Placebo Keane 2
Suede Depeche Mode 3
Suede The Cure 4
The Cure Suede 3
79
Collaborative filtering implementation
Get and rate all
relevant item
interactions
Compute item
similarity matrix
Compute 1st and 2nd
degree item-to-item
recommendations
Compute
personalised
user-to-item
recommendations
user band score
Jack Depeche Mode 5
Jack The Cure 3
Jack Keane 8
Jill Depeche Mode 4
Jill Keane 10
Jill Suede 7
Jim Keane 15
… … …
Depeche
Mode
TheCure
Keane
Placebo
Suede
Depeche Mode 3 4
The Cure 3 3
Keane 4 3 5
Placebo 2
Suede 5 2
user-to-item recommendations
user band frequency sum_score rec_rank
Ana Placebo 62 6 1
Ana Depeche Mode 49 5 2
Ana The Cure 48 7 3
Dave Placebo 62 6 1
Dave Depeche Mode 49 5 2
Dave The Cure 48 7 3
Eric Placebo 62 6 1
Eric Depeche Mode 49 5 2
Eric The Cure 48 7 3
Jack Placebo 4 1
Jack Suede 80 7 2
Jen Depeche Mode 3 1
Jen The Cure 4 2
Jen Keane 80 3 3
Jill The Cure 87 9 1
Jill Placebo 62 6 2
Jim Depeche Mode 49 5 1
Jim The Cure 48 7 2
John Placebo 4 1
John Suede 80 7 2
Sam Placebo 4 1
Sam Suede 80 7 2
SELECT int.user, rec.band2,
SUM(rec.sum_score) AS frequency, SUM(rec.rec_rank) AS sum_score,
ROW_NUMBER() OVER ( PARTITION BY int.user ORDER BY
SUM(rec.sum_score) DESC, SUM(rec.rec_rank) ASC) AS rec_rank
FROM dobo.item_interactions int
JOIN dobo.rec_item2item rec
ON int.band = rec.band1
-- Exclude recommendations that the user already interacted with
LEFT JOIN dobo.item_interactions excl
ON rec.band2 = excl.band
AND int.user = excl.user
WHERE excl.band IS NULL
GROUP BY 1,2;
80
At OLX, we apply this approach to implement a variety of recommenders:
• Item-to-item – People who viewed items A, B, C also viewed items X, Y, Z
• Category-to-category – People who bought Cars were also interested in Car Parts
• Search-to-category – People who searched for ‘black leather sofa’ were interested in Furniture
• Search-to-search – People who searched for ‘porsche’ also searched for ‘bmw’, ‘mercedes’, ‘ferrari’
• Category-to-search – People who browsed Mobile phones searched for ‘iphone 7’, ‘samsung galaxy’, …
• + many more …
81
Example recommenders in production
www.olx.ph demo
82
Other examples
Related search
recommendations for
browsing users
Personalised item
recommendations for
active buyers
Recommended
categories to post for
active sellers
Users like you also liked
Try also these related searches
Here are some other selling ideas
83
OLX Redshift technical best practices
Ø Technical architecture
Ø Data management
Ø Recommenders
Ø Unit testing
Ø OLAP cubes
Ø Tableau integration
84
Unit testing is an established paradigm in test-driven development
85
OLX is developing Qualis – a unit testing framework for Redshift / SQL
ü Switch from reactive to proactive
error handling
ü Enable SQL codebase scale out
ü Reduce maintenance time and ad
hoc data investigations
ü Make data platform more robust
ü Free up time for innovation
86
Qualis includes a Redshift-side framework (being piloted) and a Python test
automation & visualisation layer (currently in development)
Test 1 Test 2 Test 3
Test 4 Test 5 Test 6
Test 7 Test 8 Test 9
Test 10 Test 11 …
Redshift database Python
Qualis tests are
implemented in Redshift
using VIEWs that return
standard output &
PASS/FAIL result
The Qualis script runs daily, SELECTs the test output from each VIEW using a flexible configuration, and parses & aggregates the results
Final output is visualised
using plain text and / or
third party visualisation
tool (e.g. Tableau)
Visualisation
87
Example #1: Duplicates detection test
CREATE OR REPLACE VIEW
clm.utest_fact_segmentation_duplication_view AS
WITH test AS (
SELECT country_sk AS country_sk,
COUNT(1) AS cnt,
COUNT(DISTINCT user_sk) AS cnt_distinct
FROM clm.fact_segmentation
GROUP BY 1
)
SELECT 'fact_segmentation' AS test_module,
country_sk,
'duplication' AS test_group,
NULL AS test_subgroup,
NULL AS test_instance,
CASE WHEN cnt = cnt_distinct THEN 'PASS' ELSE
'FAIL' END || ' [cnt: ' || cnt || '; COUNT(DISTINCT): ' ||
cnt_distinct || ']' AS duplicates
FROM test
ORDER BY 1,2,3,4,5;
Compares overall
COUNT to
COUNT(DISTINCT) for
each OLX country to
detect data duplication
88
Example #1: Duplicates detection test
CREATE OR REPLACE VIEW
clm.utest_fact_segmentation_duplication_view AS
WITH test AS (
SELECT country_sk AS country_sk,
COUNT(1) AS cnt,
COUNT(DISTINCT user_sk) AS cnt_distinct
FROM clm.fact_segmentation
GROUP BY 1
)
SELECT 'fact_segmentation' AS test_module,
country_sk,
'duplication' AS test_group,
NULL AS test_subgroup,
NULL AS test_instance,
CASE WHEN cnt = cnt_distinct THEN 'PASS' ELSE
'FAIL' END || ' [cnt: ' || cnt || '; COUNT(DISTINCT): ' ||
cnt_distinct || ']' AS duplicates
FROM test
ORDER BY 1,2,3,4,5;
Compares overall
COUNT to
COUNT(DISTINCT) for
each OLX country to
detect data duplication
89
Example #2: Gap detection in time-series data
Aggregates data into hourly
buckets and identifies any
missing hours, while including
some logic to reduce false
positives (e.g. during night hours
in smaller OLX markets)
CREATE OR REPLACE VIEW clm.utest_fact_event_clickstream_agg_user_mapped_gaps_view AS
WITH
hours AS (
SELECT country_sk,
DATE_TRUNC('HOUR', current_local_time - INTERVAL '1 HOUR' * row_num) AS hour
FROM global_bi.dim_counter
CROSS JOIN clm.fact_country_current_time
WHERE row_num BETWEEN 25 AND 2 * 7 * 24 -- 2 weeks of hours (2 x 7 days x 24 hours); skip the most recent 24 hours
),
test AS (
SELECT country_sk,
DATE_TRUNC('HOUR', time_event_local) AS hour,
COUNT(1) AS cnt
FROM (
SELECT *,
-- Use average of first and last event timestamp within aggregated time window to approximate overall timing of event(s)
TIMESTAMP 'epoch' + INTERVAL '1 second' * ((
DATEDIFF('second', 'epoch', time_first_event_local) +
DATEDIFF('second', 'epoch', time_last_event_local)
) / 2) AS time_event_local
FROM clm.fact_event_clickstream_agg_user_mapped
) fc
WHERE date_event_nk >= (GETDATE() :: DATE - INTERVAL '16 DAYS') :: DATE -- Performance filter
GROUP BY 1,2
),
countries_in_scope AS (
SELECT country_sk,
AVG(1.0 * cnt) AS avg_cnt_allday,
AVG(1.0 * CASE WHEN DATE_PART('hour', hour) BETWEEN 0 AND 6 THEN cnt END) AS avg_cnt_night
FROM test
GROUP BY 1
)
SELECT 'fact_event_clickstream_agg_user_mapped' AS test_module,
country_sk,
'gaps' AS test_group,
NULL AS test_subgroup,
DATE_TRUNC('day', hour) :: DATE AS test_instance,
CASE WHEN COUNT(CASE WHEN COALESCE(test.cnt, 0) = 0 THEN 1 END) = 0 THEN 'PASS' ELSE 'FAIL'
|| ' [' || COUNT(CASE WHEN COALESCE(test.cnt, 0) = 0 THEN 1 END) || ' missing hours: '
|| LISTAGG(CASE WHEN COALESCE(test.cnt, 0) = 0 THEN DATE_PART('hour', hour) END, ',')
WITHIN GROUP (ORDER BY DATE_PART('hour', hour)) || '; Avg/hr (all day): ' || avg_cnt_allday :: INT || '; Avg/hr (night
only): ' || avg_cnt_night :: INT || ']'
END AS gap_test
FROM hours
JOIN countries_in_scope USING (country_sk)
LEFT JOIN test USING (country_sk, hour)
WHERE -- Do not test night hours if night activity is very low on average
CASE WHEN avg_cnt_night <= 500 THEN DATE_PART('hour', hour) NOT BETWEEN 0 AND 6 ELSE 1 :: BOOL END = 1 :: BOOL
-- Do not test countries with super low activity to minimise false FAILs
AND avg_cnt_allday >= 500
GROUP BY 2,5,avg_cnt_allday,avg_cnt_night
ORDER BY 1,2,3,4,5;
90
Example #2: Gap detection in time-series data
91
Example #3: Simple business logic validation
Configuration sheet specifies
test rules (1 cell = 1 test) → in
this example, testing for data
coverage (minimum % of
records with non-NULL values)
per customer segment
92
Example #3: Simple business logic validation
93
Example #4: Complex business logic validation
Our most advanced test case to date
validates segment business rules using
equality / inequality conditions across
different segments / dimensions
94
Example #4: Complex business logic validation
95
Example test result visualisation
96
OLX Redshift technical best practices
Ø Technical architecture
Ø Data management
Ø Recommenders
Ø Unit testing
Ø OLAP cubes
Ø Tableau integration
97
Challenge: OLX has complex global reporting needs
Dimensions
(avg. cardinality)
Time (month / week / day) → ~200
Business unit (4-level hierarchy) → ~50
Category (6-level hierarchy) → ~600
Geography (3-level hierarchy) → ~220
Channel (3-level hierarchy) → ~10
Segment (3-way segmentation) → ~27
Type (3-level hierarchy) → ~7
Measures: ~20 measures (additive & non-additive)
Measure variants: Current, Lag, Year ago, Target
Up to ~200 trillion data points!!
98
3 possible solutions
1. Query disaggregated data and compute measures in real-time
• 1a. Keep data in Redshift → on-the-fly calculations over billions of records are not fast enough for a responsive user experience
• 1b. Load data into fast columnar storage (e.g. Cassandra or Tableau’s internal database) → size of disaggregated data exceeds limits (e.g. Tableau) & is too large to efficiently enable daily loads
2. Use a standalone OLAP product from a big-name database vendor → too expensive, too complex, requires hiring specialists with domain knowledge
3. OLX Redshift OLAP framework → pre-aggregated cubes with direct Tableau integration offer the most pragmatic & simplest solution
99
Under the framework, we use a configurable cube matrix to specify the
slices which we are interested in reporting on
• Number of users –
additive only across
Business Unit
dimension
• Number of listings
– additive across all
dimensions
[Example dimensions – Time: Q3 2017 → Apr / May / Jun 2017 → individual days (1 Apr … 30 Jun); Channel: Total → Desktop web, Mobile web, Mobile apps → Android, iOS; Business unit: Total → Europe (PL, PT), LATAM (AR, CO)]
Example cube matrix:
• Perspectives and cardinalities – Time (non-additive): Quarter (1), Month (3), Day (91); Channel (non-additive): Total (1), L1 (3), L2 (2); Business unit: Total (1), Region (2), Country (4)
• # records in cube – Full cube: 3,990; Sub-cube 1: 364; Sub-cube 2: 28; Sub-cube 3: 80
• 472 vs. 3,990 records (~12% of the size of the full cube)
100
In this example, the 3 sub-cubes translate into 11 slices
[Slice table – same dimension perspectives and cardinalities as above; # records per slice: Slice 1: 364, Slice 2: 4, Slice 3: 12, Slice 4: 2, Slice 5: 6, Slice 6: 1, Slice 7: 3, Slice 8: 12, Slice 9: 36, Slice 10: 8, Slice 11: 24]
101
Pseudo-SQL OLAP cube implementation
SELECT CASE WHEN cube_matrix.time_quarter THEN dim_time.quarter_name
WHEN cube_matrix.time_month THEN dim_time.month_name
WHEN cube_matrix.time_day THEN dim_time.date
END AS time_value,
CASE WHEN cube_matrix.channel_total THEN 'Total'
WHEN cube_matrix.channel_l1 THEN dim_channel.channel_l1_name
WHEN cube_matrix.channel_l2 THEN dim_channel.channel_l2_name
END AS channel_value,
CASE WHEN cube_matrix.bus_unit_total THEN 'Total'
WHEN cube_matrix.bus_unit_region THEN dim_bus_unit.region_name
WHEN cube_matrix.bus_unit_country THEN dim_bus_unit.country_name
END AS bus_unit_value,
COUNT(1) AS num_listings,
COUNT(DISTINCT user_key) AS num_users
FROM fact_listings
JOIN dim_time USING (time_key)
JOIN dim_channel USING (channel_key)
JOIN dim_bus_unit USING (bus_unit_key)
CROSS JOIN cube_matrix
GROUP BY 1,2,3
CHALLENGE
This CROSS JOIN
can be very expensive
as it explodes the
input fact table by the
number of slices
configured in the cube
matrix
SOLUTION
Aggregate only
non-additive
dimensions first
(Time and Channel),
then aggregate
additive dimensions
(Business Unit) using
already partially
aggregated output
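A hedged pseudo-SQL sketch of that two-step approach, reusing the names from the pseudo-SQL above; the split of the cube matrix into a Time × Channel part and a Business Unit part, and the staging table, are assumptions for illustration:
-- Step 1: resolve only the non-additive dimensions (Time, Channel),
-- keeping the additive Business Unit levels and the user grain untouched
CREATE TEMP TABLE cube_stage AS
SELECT CASE WHEN m.time_quarter THEN dim_time.quarter_name
WHEN m.time_month THEN dim_time.month_name
WHEN m.time_day THEN dim_time.date
END AS time_value,
CASE WHEN m.channel_total THEN 'Total'
WHEN m.channel_l1 THEN dim_channel.channel_l1_name
WHEN m.channel_l2 THEN dim_channel.channel_l2_name
END AS channel_value,
dim_bus_unit.region_name,
dim_bus_unit.country_name,
user_key,
COUNT(1) AS num_listings
FROM fact_listings
JOIN dim_time USING (time_key)
JOIN dim_channel USING (channel_key)
JOIN dim_bus_unit USING (bus_unit_key)
CROSS JOIN cube_matrix_time_channel m -- only the Time x Channel slice combinations
GROUP BY 1, 2, 3, 4, 5;
-- Step 2: roll up the additive Business Unit levels from the much smaller staging table
SELECT time_value,
channel_value,
CASE WHEN m.bus_unit_total THEN 'Total'
WHEN m.bus_unit_region THEN region_name
WHEN m.bus_unit_country THEN country_name
END AS bus_unit_value,
SUM(num_listings) AS num_listings,
COUNT(DISTINCT user_key) AS num_users
FROM cube_stage
CROSS JOIN cube_matrix_bus_unit m
GROUP BY 1, 2, 3;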
102
Summary of OLX Redshift OLAP framework
Input
Operational data
model from OLX
data warehouse
Facts
(~15 tables)
Dimensions
(~15 tables)
~250B records
Step 1
Prepare cube
pre-aggregates at
the smallest grain
possible (i.e. user)
Cube 1
pre-aggregate
Cube 2
pre-aggregate
Cube 3
pre-aggregate
…
~6B records
OLAP cube configuration
(definition of cube dimensional
model and cube matrix)
Step 2
Aggregate all
non-additive cube
slices
Cube 1
non-additive
aggregation
Cube 2
non-additive
aggregation
Cube 3
non-additive
aggregation
…
~100M records
Step 3
Aggregate all
additive cube slices
using output from
previous step
Cube 1
additive
aggregation
Cube 2
additive
aggregation
Cube 3
additive
aggregation
…
~460M records
Step 4
Combine individual
cube outputs into a
single cube
Consolidated
cube
~200M records
103
Production example: Dimensions
104
Production example: Dimension perspectives
105
Production example: Cubes
106
Production example: Cube matrix / slices
…
107
Structure of final cube output
CREATE TABLE fact_cube
(
-- Unique key
cube_grain_sk VARCHAR(500) ENCODE RAW
-- Dimensions
,time_perspective VARCHAR(200) ENCODE RAW
,time_date_nk DATE ENCODE RAW
,time_value VARCHAR(200) ENCODE LZO
,time_display_value VARCHAR(255) ENCODE LZO
,time_num_days_in_period VARCHAR(255) ENCODE LZO
,time_num_days_in_period_lag VARCHAR(255) ENCODE LZO
,time_num_days_in_period_yago VARCHAR(255) ENCODE LZO
,dim01_perspective VARCHAR(200) ENCODE LZO
,dim02_perspective VARCHAR(200) ENCODE LZO
-- ...
,dim01_display_value VARCHAR(255) ENCODE RAW
,dim02_display_value VARCHAR(255) ENCODE RAW
-- ...
-- Measures
,measure01 BIGINT ENCODE LZO
,measure01_lag BIGINT ENCODE LZO
,measure01_target BIGINT ENCODE LZO
,measure01_yago BIGINT ENCODE LZO
,measure02 BIGINT ENCODE LZO
,measure02_lag BIGINT ENCODE LZO
,measure02_target BIGINT ENCODE LZO
,measure02_yago BIGINT ENCODE LZO
-- ...
)
DISTSTYLE KEY DISTKEY (cube_grain_sk)
SORTKEY (
time_date_nk,
dim01_display_value,
dim02_display_value,
...
);
In our current production version,
the final cube table has ~200M
rows and ~150 columns (~40 for
dimensions, ~110 for measures)
Example value:
time~day|2017-06-16
//business_unit~country|olx|eu|ro
//category~g2|core|for_sale
//geography~l1|olx|eu|ro|13
//channel~total|total
//user_hlv~segment|low_volume
//user_ftr~total|total
//user_pnp~total|total
//listing_liqg~total|total
108
Production example: Consolidated cube output
109
OLX Redshift technical best practices
Ø Technical architecture
Ø Data management
Ø Recommenders
Ø Unit testing
Ø OLAP cubes
Ø Tableau integration
110
Summary of OLX Redshift OLAP framework
Input
Operational data
model from OLX
data warehouse
Facts
(~15 tables)
Dimensions
(~15 tables)
~250B records
Step 1
Prepare cube
pre-aggregates at
the smallest grain
possible (i.e. user)
Cube 1
pre-aggregate
Cube 2
pre-aggregate
Cube 3
pre-aggregate
…
~6B records
OLAP cube configuration
(definition of cube dimensional
model and cube matrix)
Step 2
Aggregate all
non-additive cube
slices
Cube 1
non-additive
aggregation
Cube 2
non-additive
aggregation
Cube 3
non-additive
aggregation
…
~100M records
Step 3
Aggregate all
additive cube slices
using output from
previous step
Cube 1
additive
aggregation
Cube 2
additive
aggregation
Cube 3
additive
aggregation
…
~460M records
Step 4
Combine individual
cube outputs into a
single cube
Consolidated
cube
~200M records
Tableau view
Output
Tableau live
dashboard
connection to
dedicated Tableau
Redshift cluster
Tableau
Tableau abstraction view adds
derivative measures
(calculated on the fly) and
formatted values needed to
implement reporting dashboard
Strive for 1
query per Tableau
interaction running
within max. ~3 sec.
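A sketch of what such a Tableau-facing abstraction view can look like, reusing the column names from the fact_cube DDL above; the derived measure and its formula are illustrative, not the production definitions:
CREATE OR REPLACE VIEW fact_cube_tableau_view AS
SELECT time_display_value,
dim01_display_value,
dim02_display_value,
measure01,
measure01_yago,
-- Derivative measure computed on the fly, e.g. year-on-year growth in %
CASE WHEN measure01_yago > 0
THEN ROUND(100.0 * (measure01 - measure01_yago) / measure01_yago, 1)
END AS measure01_yoy_growth_pct
FROM fact_cube;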
Tableau demo
112
Debugging Tableau interaction with Redshift
-- Get the last queries run from Tableau
SELECT DATEDIFF('ms', starttime, endtime) / 1000.0 AS duration,
query, xid, pid, starttime, querytxt
FROM stl_query
WHERE userid = 105
ORDER BY starttime desc
LIMIT 100
-- Get the SQL
SELECT LISTAGG(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
text, '\\n', '\n' ),
'"', '' ),
', ', ',\n\t\t' ),
'AND ', '\n AND\t' ),
'FROM ', '\n FROM\t' ),
'WHERE ', '\n WHERE\t' ),
'SELECT ', '\nSELECT\t' ),
'declare', '--declare' ),
'')
WITHIN GROUP(ORDER BY sequence, starttime) AS sql
FROM svl_statementtext
WHERE userid = 105
AND xid = 5650302
AND text NOT LIKE 'begin%'
AND text NOT LIKE 'fetch%'
AND text NOT LIKE 'close%';
113
Contents
• Introduction to Naspers and OLX Group
• OLX capabilities powered by Redshift
• OLX Redshift technical best practices
• Q&A
114
Thank you! Questions?
Dobo Radichkov
Sr. Director, Global Analytics and
Customer Lifecycle Management
Dobo@OLX.com
OLX Group
www.olx.com
Free classifieds
We are hiring!
www.joinolx.com
Roles:
• Data engineers
• Data scientists
• PHP / Java / Android
/ iOS developers
Locations: Berlin,
Lisbon, Buenos Aires,
Dubai, Barcelona,
Moscow, Delhi
MongoDB & Hadoop - Understanding Your Big Data
 
Redis Labs - NOAH18 Tel Aviv
Redis Labs - NOAH18 Tel Aviv Redis Labs - NOAH18 Tel Aviv
Redis Labs - NOAH18 Tel Aviv
 
Domino and AWS: collaborative analytics and model governance at financial ser...
Domino and AWS: collaborative analytics and model governance at financial ser...Domino and AWS: collaborative analytics and model governance at financial ser...
Domino and AWS: collaborative analytics and model governance at financial ser...
 
Partner Recruitment Webinar: "Join the Most Productive Ecosystem in Big Data ...
Partner Recruitment Webinar: "Join the Most Productive Ecosystem in Big Data ...Partner Recruitment Webinar: "Join the Most Productive Ecosystem in Big Data ...
Partner Recruitment Webinar: "Join the Most Productive Ecosystem in Big Data ...
 
Emea partners recruitment webinar
Emea partners recruitment webinarEmea partners recruitment webinar
Emea partners recruitment webinar
 
Webinar: General Technical Overview of MongoDB for Ops Teams
Webinar: General Technical Overview of MongoDB for Ops TeamsWebinar: General Technical Overview of MongoDB for Ops Teams
Webinar: General Technical Overview of MongoDB for Ops Teams
 
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
 
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
 
Webinar: Enterprise Trends for Database-as-a-Service
Webinar: Enterprise Trends for Database-as-a-ServiceWebinar: Enterprise Trends for Database-as-a-Service
Webinar: Enterprise Trends for Database-as-a-Service
 
Neo4j GraphTalks - Einführung in Graphdatenbanken
Neo4j GraphTalks - Einführung in GraphdatenbankenNeo4j GraphTalks - Einführung in Graphdatenbanken
Neo4j GraphTalks - Einführung in Graphdatenbanken
 
GraphTalks Hamburg - Einführung in Graphdatenbanken
GraphTalks Hamburg - Einführung in GraphdatenbankenGraphTalks Hamburg - Einführung in Graphdatenbanken
GraphTalks Hamburg - Einführung in Graphdatenbanken
 
Advanced applications with MongoDB
Advanced applications with MongoDBAdvanced applications with MongoDB
Advanced applications with MongoDB
 
Graphdatenbank Neo4j: Konzept, Positionierung, Status Region DACH - Bruno Un...
 Graphdatenbank Neo4j: Konzept, Positionierung, Status Region DACH - Bruno Un... Graphdatenbank Neo4j: Konzept, Positionierung, Status Region DACH - Bruno Un...
Graphdatenbank Neo4j: Konzept, Positionierung, Status Region DACH - Bruno Un...
 
Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101
 
Neo4j GraphTalks - Introduction to GraphDatabases and Neo4j
Neo4j GraphTalks - Introduction to GraphDatabases and Neo4jNeo4j GraphTalks - Introduction to GraphDatabases and Neo4j
Neo4j GraphTalks - Introduction to GraphDatabases and Neo4j
 

More from Dobo Radichkov

Holland & Barrett: Gen AI Prompt Engineering for Tech teams
Holland & Barrett: Gen AI Prompt Engineering for Tech teamsHolland & Barrett: Gen AI Prompt Engineering for Tech teams
Holland & Barrett: Gen AI Prompt Engineering for Tech teams
Dobo Radichkov
 
Unleashing the Power of GPT & LLM: A Holland & Barrett Exploration
Unleashing the Power of GPT & LLM: A Holland & Barrett ExplorationUnleashing the Power of GPT & LLM: A Holland & Barrett Exploration
Unleashing the Power of GPT & LLM: A Holland & Barrett Exploration
Dobo Radichkov
 
Data and Analytics at Holland & Barrett: Building a '3-Michelin-star' Data Pl...
Data and Analytics at Holland & Barrett: Building a '3-Michelin-star' Data Pl...Data and Analytics at Holland & Barrett: Building a '3-Michelin-star' Data Pl...
Data and Analytics at Holland & Barrett: Building a '3-Michelin-star' Data Pl...
Dobo Radichkov
 
Customer lifecycle management for fun and profit at OLX, Berlin marketplace c...
Customer lifecycle management for fun and profit at OLX, Berlin marketplace c...Customer lifecycle management for fun and profit at OLX, Berlin marketplace c...
Customer lifecycle management for fun and profit at OLX, Berlin marketplace c...
Dobo Radichkov
 
OLX Ventures blockchain perspective, Feb 2018
OLX Ventures blockchain perspective, Feb 2018OLX Ventures blockchain perspective, Feb 2018
OLX Ventures blockchain perspective, Feb 2018
Dobo Radichkov
 
Real-time serverless analytics at Shedd – OLX data summit, Mar 2018, Barcelona
Real-time serverless analytics at Shedd – OLX data summit, Mar 2018, BarcelonaReal-time serverless analytics at Shedd – OLX data summit, Mar 2018, Barcelona
Real-time serverless analytics at Shedd – OLX data summit, Mar 2018, Barcelona
Dobo Radichkov
 

More from Dobo Radichkov (6)

Holland & Barrett: Gen AI Prompt Engineering for Tech teams
Holland & Barrett: Gen AI Prompt Engineering for Tech teamsHolland & Barrett: Gen AI Prompt Engineering for Tech teams
Holland & Barrett: Gen AI Prompt Engineering for Tech teams
 
Unleashing the Power of GPT & LLM: A Holland & Barrett Exploration
Unleashing the Power of GPT & LLM: A Holland & Barrett ExplorationUnleashing the Power of GPT & LLM: A Holland & Barrett Exploration
Unleashing the Power of GPT & LLM: A Holland & Barrett Exploration
 
Data and Analytics at Holland & Barrett: Building a '3-Michelin-star' Data Pl...
Data and Analytics at Holland & Barrett: Building a '3-Michelin-star' Data Pl...Data and Analytics at Holland & Barrett: Building a '3-Michelin-star' Data Pl...
Data and Analytics at Holland & Barrett: Building a '3-Michelin-star' Data Pl...
 
Customer lifecycle management for fun and profit at OLX, Berlin marketplace c...
Customer lifecycle management for fun and profit at OLX, Berlin marketplace c...Customer lifecycle management for fun and profit at OLX, Berlin marketplace c...
Customer lifecycle management for fun and profit at OLX, Berlin marketplace c...
 
OLX Ventures blockchain perspective, Feb 2018
OLX Ventures blockchain perspective, Feb 2018OLX Ventures blockchain perspective, Feb 2018
OLX Ventures blockchain perspective, Feb 2018
 
Real-time serverless analytics at Shedd – OLX data summit, Mar 2018, Barcelona
Real-time serverless analytics at Shedd – OLX data summit, Mar 2018, BarcelonaReal-time serverless analytics at Shedd – OLX data summit, Mar 2018, Barcelona
Real-time serverless analytics at Shedd – OLX data summit, Mar 2018, Barcelona
 

Recently uploaded

Gregory Harris' Civics Presentation.pptx
Gregory Harris' Civics Presentation.pptxGregory Harris' Civics Presentation.pptx
Gregory Harris' Civics Presentation.pptx
gharris9
 
Media as a Mind Controlling Strategy In Old and Modern Era
Media as a Mind Controlling Strategy In Old and Modern EraMedia as a Mind Controlling Strategy In Old and Modern Era
Media as a Mind Controlling Strategy In Old and Modern Era
faizulhassanfaiz1670
 
Burning Issue Presentation By Kenmaryon.pdf
Burning Issue Presentation By Kenmaryon.pdfBurning Issue Presentation By Kenmaryon.pdf
Burning Issue Presentation By Kenmaryon.pdf
kkirkland2
 
Obesity causes and management and associated medical conditions
Obesity causes and management and associated medical conditionsObesity causes and management and associated medical conditions
Obesity causes and management and associated medical conditions
Faculty of Medicine And Health Sciences
 
María Carolina Martínez - eCommerce Day Colombia 2024
María Carolina Martínez - eCommerce Day Colombia 2024María Carolina Martínez - eCommerce Day Colombia 2024
María Carolina Martínez - eCommerce Day Colombia 2024
eCommerce Institute
 
Supercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdf
Supercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdfSupercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdf
Supercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdf
Access Innovations, Inc.
 
Getting started with Amazon Bedrock Studio and Control Tower
Getting started with Amazon Bedrock Studio and Control TowerGetting started with Amazon Bedrock Studio and Control Tower
Getting started with Amazon Bedrock Studio and Control Tower
Vladimir Samoylov
 
Announcement of 18th IEEE International Conference on Software Testing, Verif...
Announcement of 18th IEEE International Conference on Software Testing, Verif...Announcement of 18th IEEE International Conference on Software Testing, Verif...
Announcement of 18th IEEE International Conference on Software Testing, Verif...
Sebastiano Panichella
 
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
OECD Directorate for Financial and Enterprise Affairs
 
Presentatie 8. Joost van der Linde & Daniel Anderton - Eliq 28 mei 2024
Presentatie 8. Joost van der Linde & Daniel Anderton - Eliq 28 mei 2024Presentatie 8. Joost van der Linde & Daniel Anderton - Eliq 28 mei 2024
Presentatie 8. Joost van der Linde & Daniel Anderton - Eliq 28 mei 2024
Dutch Power
 
Tom tresser burning issue.pptx My Burning issue
Tom tresser burning issue.pptx My Burning issueTom tresser burning issue.pptx My Burning issue
Tom tresser burning issue.pptx My Burning issue
amekonnen
 
Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024
Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024
Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024
Dutch Power
 
somanykidsbutsofewfathers-140705000023-phpapp02.pptx
somanykidsbutsofewfathers-140705000023-phpapp02.pptxsomanykidsbutsofewfathers-140705000023-phpapp02.pptx
somanykidsbutsofewfathers-140705000023-phpapp02.pptx
Howard Spence
 
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdfBonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
khadija278284
 
Acorn Recovery: Restore IT infra within minutes
Acorn Recovery: Restore IT infra within minutesAcorn Recovery: Restore IT infra within minutes
Acorn Recovery: Restore IT infra within minutes
IP ServerOne
 
International Workshop on Artificial Intelligence in Software Testing
International Workshop on Artificial Intelligence in Software TestingInternational Workshop on Artificial Intelligence in Software Testing
International Workshop on Artificial Intelligence in Software Testing
Sebastiano Panichella
 
Doctoral Symposium at the 17th IEEE International Conference on Software Test...
Doctoral Symposium at the 17th IEEE International Conference on Software Test...Doctoral Symposium at the 17th IEEE International Conference on Software Test...
Doctoral Symposium at the 17th IEEE International Conference on Software Test...
Sebastiano Panichella
 
0x01 - Newton's Third Law: Static vs. Dynamic Abusers
0x01 - Newton's Third Law:  Static vs. Dynamic Abusers0x01 - Newton's Third Law:  Static vs. Dynamic Abusers
0x01 - Newton's Third Law: Static vs. Dynamic Abusers
OWASP Beja
 
Bitcoin Lightning wallet and tic-tac-toe game XOXO
Bitcoin Lightning wallet and tic-tac-toe game XOXOBitcoin Lightning wallet and tic-tac-toe game XOXO
Bitcoin Lightning wallet and tic-tac-toe game XOXO
Matjaž Lipuš
 
AWANG ANIQKMALBIN AWANG TAJUDIN B22080004 ASSIGNMENT 2 MPU3193 PHILOSOPHY AND...
AWANG ANIQKMALBIN AWANG TAJUDIN B22080004 ASSIGNMENT 2 MPU3193 PHILOSOPHY AND...AWANG ANIQKMALBIN AWANG TAJUDIN B22080004 ASSIGNMENT 2 MPU3193 PHILOSOPHY AND...
AWANG ANIQKMALBIN AWANG TAJUDIN B22080004 ASSIGNMENT 2 MPU3193 PHILOSOPHY AND...
AwangAniqkmals
 

Recently uploaded (20)

Gregory Harris' Civics Presentation.pptx
Gregory Harris' Civics Presentation.pptxGregory Harris' Civics Presentation.pptx
Gregory Harris' Civics Presentation.pptx
 
Media as a Mind Controlling Strategy In Old and Modern Era
Media as a Mind Controlling Strategy In Old and Modern EraMedia as a Mind Controlling Strategy In Old and Modern Era
Media as a Mind Controlling Strategy In Old and Modern Era
 
Burning Issue Presentation By Kenmaryon.pdf
Burning Issue Presentation By Kenmaryon.pdfBurning Issue Presentation By Kenmaryon.pdf
Burning Issue Presentation By Kenmaryon.pdf
 
Obesity causes and management and associated medical conditions
Obesity causes and management and associated medical conditionsObesity causes and management and associated medical conditions
Obesity causes and management and associated medical conditions
 
María Carolina Martínez - eCommerce Day Colombia 2024
María Carolina Martínez - eCommerce Day Colombia 2024María Carolina Martínez - eCommerce Day Colombia 2024
María Carolina Martínez - eCommerce Day Colombia 2024
 
Supercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdf
Supercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdfSupercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdf
Supercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdf
 
Getting started with Amazon Bedrock Studio and Control Tower
Getting started with Amazon Bedrock Studio and Control TowerGetting started with Amazon Bedrock Studio and Control Tower
Getting started with Amazon Bedrock Studio and Control Tower
 
Announcement of 18th IEEE International Conference on Software Testing, Verif...
Announcement of 18th IEEE International Conference on Software Testing, Verif...Announcement of 18th IEEE International Conference on Software Testing, Verif...
Announcement of 18th IEEE International Conference on Software Testing, Verif...
 
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
 
Presentatie 8. Joost van der Linde & Daniel Anderton - Eliq 28 mei 2024
Presentatie 8. Joost van der Linde & Daniel Anderton - Eliq 28 mei 2024Presentatie 8. Joost van der Linde & Daniel Anderton - Eliq 28 mei 2024
Presentatie 8. Joost van der Linde & Daniel Anderton - Eliq 28 mei 2024
 
Tom tresser burning issue.pptx My Burning issue
Tom tresser burning issue.pptx My Burning issueTom tresser burning issue.pptx My Burning issue
Tom tresser burning issue.pptx My Burning issue
 
Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024
Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024
Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024
 
somanykidsbutsofewfathers-140705000023-phpapp02.pptx
somanykidsbutsofewfathers-140705000023-phpapp02.pptxsomanykidsbutsofewfathers-140705000023-phpapp02.pptx
somanykidsbutsofewfathers-140705000023-phpapp02.pptx
 
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdfBonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
 
Acorn Recovery: Restore IT infra within minutes
Acorn Recovery: Restore IT infra within minutesAcorn Recovery: Restore IT infra within minutes
Acorn Recovery: Restore IT infra within minutes
 
International Workshop on Artificial Intelligence in Software Testing
International Workshop on Artificial Intelligence in Software TestingInternational Workshop on Artificial Intelligence in Software Testing
International Workshop on Artificial Intelligence in Software Testing
 
Doctoral Symposium at the 17th IEEE International Conference on Software Test...
Doctoral Symposium at the 17th IEEE International Conference on Software Test...Doctoral Symposium at the 17th IEEE International Conference on Software Test...
Doctoral Symposium at the 17th IEEE International Conference on Software Test...
 
0x01 - Newton's Third Law: Static vs. Dynamic Abusers
0x01 - Newton's Third Law:  Static vs. Dynamic Abusers0x01 - Newton's Third Law:  Static vs. Dynamic Abusers
0x01 - Newton's Third Law: Static vs. Dynamic Abusers
 
Bitcoin Lightning wallet and tic-tac-toe game XOXO
Bitcoin Lightning wallet and tic-tac-toe game XOXOBitcoin Lightning wallet and tic-tac-toe game XOXO
Bitcoin Lightning wallet and tic-tac-toe game XOXO
 
AWANG ANIQKMALBIN AWANG TAJUDIN B22080004 ASSIGNMENT 2 MPU3193 PHILOSOPHY AND...
AWANG ANIQKMALBIN AWANG TAJUDIN B22080004 ASSIGNMENT 2 MPU3193 PHILOSOPHY AND...AWANG ANIQKMALBIN AWANG TAJUDIN B22080004 ASSIGNMENT 2 MPU3193 PHILOSOPHY AND...
AWANG ANIQKMALBIN AWANG TAJUDIN B22080004 ASSIGNMENT 2 MPU3193 PHILOSOPHY AND...
 

OLX Group presentation for AWS Redshift meetup in London, 5 July 2017

  • 1. Free Classifieds www.olx.com Amazon Redshift at OLX Group Advanced analytics and big data innovation at the world’s largest online classifieds business Dobo Radichkov | London, 5 July 2017
  • 2. 2 Contents • Introduction to Naspers and OLX Group • OLX capabilities powered by Redshift • OLX Redshift technical best practices • Q&A
  • 3. 3 Contents • Introduction to Naspers and OLX Group • OLX capabilities powered by Redshift • OLX Redshift technical best practices • Q&A
  • 4. 4 Introducing NASPERS A $83B global internet & entertainment group and one of the largest technology investors in the world $15BRevenue $2BEarnings 130 Countries 1.5BAudience reach 27.000 People
  • 5. 5 Introducing OLX GROUP: The world’s largest classifieds business 40 Countries 20+ Offices 3000 Employees 15+ Brands
  • 6. 6 OLX Group is a powerful global community 1.7B+ monthly visits 35B+ monthly page views 60M+ monthly listings 300M+ monthly active users 4.4 APP RATING #1 app 22+ COUNTRIES Mobile leader People spend more than twice as long in OLX apps versus competitors Scale • 2 houses • 2 cars • 3 fashion items • 3 mobile phones Listed every second • 40 countries • 20+ offices • 3,000 employees • 15+ brands Global footprint
  • 7. 7 Contents • Introduction to Naspers and OLX Group • OLX capabilities powered by Redshift • OLX Redshift technical best practices • Q&A
  • 8. 8 At OLX, Redshift powers 3 important business capabilities Customer Lifecycle Management Personalisation & Relevance Business Intelligence
  • 9. 9 At OLX, Redshift powers 3 important business capabilities Customer Lifecycle Management Personalisation & Relevance Business Intelligence Enhance the user experience and lifetime value via personalised, relevant, targeted and unified omni-channel user communications
  • 10. 10 Activity level (buying and selling) Time First-time user Returning user Loyal user (buying and selling) Loyal user (across multiple categories) Fan!Visitor Increase the customer lifetime value… … through the ‘right’ product, marketing and customer care treatments Fundamentally, CLM is all about fuelling retention-driven growth by treating our customers the best way possible in their lifecycle stage
  • 11. 11 Trans- actional Website & mobile app Customer care Social / 3rd party Customer segmen- tation 4 Insights & analytics 3 Execution 6 Customer treatments 5 Single customer view 2 Platforms and data 1 The OLX CLM implementation is enabled by automated, data-driven, targeted and personalised customer treatments
  • 12. 12 At OLX, Redshift powers 3 important business capabilities Customer Lifecycle Management Personalisation & Relevance Business Intelligence Enhance the user experience and lifetime value via personalised, relevant, targeted and unified omni-channel user communications Grow buyer engage- ment, seller success and transactions by showing relevant & personalised content to each of our users
  • 13. 13 Context: Panamera is tackling the challenges of Search & Relevance across 3 pillars of content discovery Home page Search experience Recommendations Show the most relevant content to each of our users • Personalised content feed driving buyer engagement and seller success driven by buyer interests, social relationship, proximity, freshness and ad / user quality • Core search results experience including text matching, spell checking, synonym mapping, language- specific optimisa- tions, etc. • Search auto- complete, auto- suggest, instant results and curated content • Recommended content (e.g. listings, categories, search) used to personalise elements of the buyer + seller user journey(s) based on past behaviour, preferences and activity
  • 14. 14 At OLX, Redshift powers 3 important business capabilities Customer Lifecycle Management Personalisation & Relevance Business Intelligence Enhance the user experience and lifetime value via personalised, relevant, targeted and unified omni-channel user communications Grow buyer engage- ment, seller success and transactions by showing relevant & personalised content to each of our users Empower the business with robust BI data platform, high-quality executive reporting, and actionable customer insights
  • 15. 15 Contents • Introduction to Naspers and OLX Group • OLX capabilities powered by Redshift • OLX Redshift technical best practices • Q&A
  • 16. 16 Our guiding principle for big data development using Redshift ✓ Use few but powerful technologies (and become world-class at them) ✓ Keep architecture simple and minimise points of failure ✓ Standardise, build on each other, foster continuous improvement “Everything should be made as simple as possible, but not simpler.” Einstein
  • 17. 17 OLX Redshift technical best practices ➢ Technical architecture ➢ Data management ➢ Recommenders ➢ Unit testing ➢ OLAP cubes ➢ Tableau integration
  • 18. 18 OLX Redshift technical best practices ➢ Technical architecture ➢ Data management ➢ Recommenders ➢ Unit testing ➢ OLAP cubes ➢ Tableau integration
  • 19. 19 Let’s walk through our high-level data architecture step by step… RDL ODL Master 64 × ds1.xl CLM platform 100 × dc1.l RDL ODL RDL+ ODL+ ADL+ Analyst sandboxes (read / write access) BI platform 64 × ds1.xl ADL CLM APIs Management dashboards Operational dashboards SCV ADL ODL ODL Ad hoc analytics LiveSync Hydra Moderation CRMs APIs Crawlers … Master data infrastructure Extended BI infrastructure Data refreshed every ~3-5 hours Data refreshed every 24 hours Load / unload Transformation / modelling Replication Reporting platform 100 × dc1.l Mktg channels
  • 20. 20 LiveSync in-house technology enables dynamic synchronisation of MySQL production databases to Redshift LiveSync Platform database Live DB replica (MySQL) Lazarus MySQL extractor (Python) S3 storage Data lake LiveSync Redshift loader (Python) LiveSync
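
For illustration, a minimal sketch of the final load step that a pipeline of this shape might issue against Redshift once the extractor has landed files in S3 (the schema, table, bucket path and IAM role below are assumptions, not taken from the deck):

  -- Hypothetical staging table for one replicated MySQL table
  CREATE TABLE IF NOT EXISTS rdl.livesync_users_staging (
    user_id    BIGINT,
    email      VARCHAR(256),
    created_at TIMESTAMP,
    updated_at TIMESTAMP
  );

  -- Bulk load the extractor's gzipped CSV output from S3
  COPY rdl.livesync_users_staging
  FROM 's3://example-bucket/livesync/users/2017-07-05/'
  IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load-role'
  CSV GZIP TIMEFORMAT 'auto';

  -- Merge into the RDL target: delete rows that changed, then insert the new versions
  BEGIN;
  DELETE FROM rdl.livesync_users
  USING rdl.livesync_users_staging s
  WHERE rdl.livesync_users.user_id = s.user_id;
  INSERT INTO rdl.livesync_users SELECT * FROM rdl.livesync_users_staging;
  COMMIT;
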
  • 21. 21 Ninja / Hydra in-house multi-tracking capability collects structured clickstream data from each client device LiveSync Client device (desktop, mobile, apps) Ninja tracker Client-side library (JS, PHP, Android, iOS) Hydra tracker Server-side Java application S3 storage Data lake Hydra Ninja / Hydra
  • 22. 22 LiveSync and Hydra raw data form the backbone of our architecture and are loaded in Raw Data Layer (RDL) of our Master Redshift cluster LiveSync Hydra RDL Master 64 × ds1.xl Load / unload Transformation / modelling Replication LiveSync: • ~1,000 tables • ~100 billion records • ~12 TB compressed storage Hydra: • ~400 tables • ~400 billion records • ~30 TB compressed storage Raw data layer (RDL)
  • 23. 23 Raw data is transformed and modelled into the core Operational Data Layer (ODL) used to feed all data applications LiveSync Hydra RDL Master 64 × ds1.xl ODL Stats: • ~100 tables • ~150 billion records • ~5 TB compressed storage Facts: • Listings • Listing liquidity • Replies • Revenue transactions • Clickstream events • Listing impressions • … Dimensions: • Users • Business units • Geographies • Categories • Channels • (~40 dimensions in total) Operational data layer (ODL) Load / unload Transformation / modelling Replication
  • 24. 24 From here, the ODL is replicated to each data application via our Rzeka data replication in-house utility LiveSync Hydra RDL Master 64 × ds1.xl ODL CLM platform 100 × dc1.l Reporting platform 100 × dc1.l ODL ODL RDL ODL BI platform 64 × ds1.xl Rzeka replication utility What is it? • A fully-configurable Python utility enabling incremental Redshift-to-Redshift data replication Load / unload Transformation / modelling Replication
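
One way such incremental Redshift-to-Redshift replication is commonly expressed in SQL is an UNLOAD on the source cluster followed by a COPY and merge on the target, driven from a time-based watermark; a hedged sketch with illustrative names (this is not the actual Rzeka code):

  -- On the Master cluster: unload only recently changed rows to S3
  UNLOAD ('SELECT * FROM odl.fact_listings WHERE updated_at >= DATEADD(day, -2, GETDATE())')
  TO 's3://example-bucket/rzeka/fact_listings/increment_'
  IAM_ROLE 'arn:aws:iam::123456789012:role/example-unload-role'
  GZIP ESCAPE ALLOWOVERWRITE;

  -- On the target cluster (e.g. BI platform): load the increment and merge it in
  -- (the increment table is assumed to exist with the same layout as the target)
  COPY odl.fact_listings_increment
  FROM 's3://example-bucket/rzeka/fact_listings/increment_'
  IAM_ROLE 'arn:aws:iam::123456789012:role/example-copy-role'
  GZIP ESCAPE;

  BEGIN;
  DELETE FROM odl.fact_listings
  USING odl.fact_listings_increment i
  WHERE odl.fact_listings.listing_sk = i.listing_sk;
  INSERT INTO odl.fact_listings SELECT * FROM odl.fact_listings_increment;
  COMMIT;
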
  • 25. 25 ODL is used as a basis to build the CLM Single Customer View enabling CLM treatments and recommendations LiveSync Hydra RDL Master 64 × ds1.xl ODL CLM platform 100 × dc1.l ODL ODL RDL ODL BI platform 64 × ds1.xl CLM data platform What does it do? • User mapping, customer lifecycle segmentation, recommendation & sort order algorithms, CLM treatments generation & execution, CLM reporting & analytics CLM APIs Mktg channels SCV Load / unload Transformation / modelling Replication Reporting platform 100 × dc1.l
  • 26. 26 Similarly, ODL is used to generate the Analytical Data Layer (ADL) in our Reporting platform LiveSync Hydra RDL Master 64 × ds1.xl ODL CLM platform 100 × dc1.l ODL ODL RDL ODL SCV Management dashboards ADL BI platform 64 × ds1.xl ADL Triton mgmt. reporting What is it? • Data models and cubes for Top Management KPIs across seller, buyer, liquidity, revenue and product activity • Tableau dashboard implementation Load / unload Transformation / modelling Replication CLM APIs Mktg channels Reporting platform 100 × dc1.l
  • 27. 27 The BI data warehouse sources additional raw data into an Extended Raw Data Layer (RDL+) LiveSync Hydra RDL Master 64 × ds1.xl ODL CLM platform 100 × dc1.l ODL ODL RDL ODL BI platform 64 × ds1.xl SCV Management dashboards ADL RDL+ Moderation CRMs APIs Crawlers … ADL Extended Raw Data Layer (RDL+) What is it? • New raw data sources supporting extended management and operational analytics across e.g. Performance Marketing, CS, Competitor, Salesforce, etc. Load / unload Transformation / modelling Replication CLM APIs Mktg channels Reporting platform 100 × dc1.l
  • 28. 28 The RDL+ is modelled into Extended Operational & Analytical Data Layers (ODL+ & ADL+) used to power operational reporting LiveSync Hydra RDL Master 64 × ds1.xl ODL CLM platform 100 × dc1.l ODL ODL RDL ODL BI platform 64 × ds1.xl SCV Management dashboards ADL RDL+ Moderation CRMs APIs Crawlers … ADL ODL+ ADL+ Operational dashboards ODL+ and ADL+ What is it? • Data layers enabling extended operational reporting and analysis Load / unload Transformation / modelling Replication CLM APIs Mktg channels Reporting platform 100 × dc1.l
  • 29. 29 Ad hoc analysis enabled through read-only SQL endpoints and read/write sandboxes inside the BI platform LiveSync Hydra RDL Master 64 × ds1.xl ODL CLM platform 100 × dc1.l ODL ODL RDL ODL BI platform 64 × ds1.xl SCV Management dashboards ADL RDL+ Moderation CRMs APIs Crawlers … ADL ODL+ ADL+ Operational dashboards Analyst sandboxes for ad hoc analysis Analyst sandboxes (read / write access) Ad hoc analytics Load / unload Transformation / modelling Replication CLM APIs Mktg channels Reporting platform 100 × dc1.l
  • 30. 30 Detailed end-state OLX Group central data architecture RDL ODL Master 64 × ds1.xl CLM platform 100 × dc1.l RDL ODL RDL+ ODL+ ADL+ Analyst sandboxes (read / write access) BI platform 64 × ds1.xl ADL CLM APIs Management dashboards Operational dashboards SCV ADL ODL ODL Ad hoc analytics LiveSync Hydra Moderation CRMs APIs Crawlers … Master data infrastructure Extended BI infrastructure Data refreshed every ~3-5 hours Data refreshed every 24 hours Load / unload Transformation / modelling Replication Reporting platform 100 × dc1.l Mktg channels
  • 31. 31 Side note: With Amazon’s new Athena and Spectrum services we are exploring new architectural possibilities & improvements LiveSync Hydra Moderation CRMs APIs Crawlers … Redshift Athena metadata catalogue (using Hive DDL) Athena JDBC endpoint Presto distributed SQL engine Spectrum external S3 tables Redshift native tables Spectrum distributed SQL engine JOIN
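
To make the Spectrum idea on this slide concrete, a minimal sketch of exposing S3 data as an external table and joining it with native Redshift tables; the schema, table, columns, bucket and IAM role are illustrative assumptions:

  -- Register an external schema backed by the Athena / Glue data catalogue
  CREATE EXTERNAL SCHEMA spectrum_rdl
  FROM DATA CATALOG
  DATABASE 'rdl_external'
  IAM_ROLE 'arn:aws:iam::123456789012:role/example-spectrum-role'
  CREATE EXTERNAL DATABASE IF NOT EXISTS;

  -- Expose raw clickstream files sitting on S3 as an external table
  CREATE EXTERNAL TABLE spectrum_rdl.hydra_events (
    user_id         BIGINT,
    event_name      VARCHAR(64),
    event_timestamp TIMESTAMP
  )
  STORED AS PARQUET
  LOCATION 's3://example-bucket/hydra/events/';

  -- JOIN external S3 data with native Redshift tables in a single query
  SELECT d.country_sk, COUNT(1) AS events
  FROM spectrum_rdl.hydra_events e
  JOIN odl.dim_users d ON d.user_nk = e.user_id
  GROUP BY 1;
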
  • 32. 32 OLX Redshift technical best practices ➢ Technical architecture ➢ Data management ➢ Recommenders ➢ Unit testing ➢ OLAP cubes ➢ Tableau integration
  • 35. 35 1 cluster = 1 capability = 1 owner ≈ 1 team Master data platform 64 × ds1.xl CLM platform 100 × dc1.l BI platform 64 × ds1.xl Reporting platform 100 × dc1.l Tableau cluster 48 × dc1.l
  • 37. 37 We typically organise our data in 3 data layers RDL ODL ADL Raw Data Layer Raw disaggregated and unprocessed clickstream and production database data Operational Data Layer Clean, structured and standardised dimensional model serving as foundation for all data applications Application Data Layer Data models specific to each data application – including CLM algorithms, recommenders, BI metrics, OLAP cubes, etc.
  • 38. 38 Large scale clickstream data management Optimise table design Partition data Create abstraction views RDL
  • 39. 39 Large scale clickstream data management Optimise table design Partition data Create abstraction views Goal is to achieve best possible distribution, date range WHERE performance, and compression effectiveness RDL
  • 40. 40 Large scale clickstream data management Optimise table design Partition data Create abstraction views Goal is to achieve best possible distribution, date range WHERE performance, and compression effectiveness Option 1 Distribution key user_id Sort key event_timestamp Best possible distribution Date range performance Effective compression (poor compression of text columns) RDL
  • 41. 41 Large scale clickstream data management Optimise table design Partition data Create abstraction views Goal is to achieve best possible distribution, date range WHERE performance, and compression effectiveness Option 1 Option 2 Distribution key user_id user_id Sort key event_timestamp user_id Best possible distribution Date range performance (always full scan) Effective compression (poor compression of text columns) (1/2x table size) RDL
  • 42. 42 Large scale clickstream data management Optimise table design Partition data Create abstraction views Goal is to achieve best possible distribution, date range WHERE performance, and compression effectiveness Option 1 Option 2 Option 3 Distribution key user_id user_id user_id Sort key event_timestamp user_id event_date, user_id, event_timestamp Best possible distribution Date range performance (always full scan) Effective compression (poor compression of text columns) (1/2x table size) (1/2x table size) RDL
  • 43. 43 Large scale clickstream data management Examples: • europe_android_201706 • latam_web_201703 • asia_ios_201706 • (~650 tables in total) Benefits: • Easily DROP older data when no longer needed • Minimise use of DELETE / VACUUM operations • Localise points of failure and ringfence data repairs • Allow for platform and channel-specific table schema Optimise table design Partition data Create abstraction views We partition our clickstream data into 1 table per PLATFORM × CHANNEL × MONTH combination RDL
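
A minimal DDL sketch of the Option 3 key design combined with the PLATFORM × CHANNEL × MONTH partitioning described above; the column list and schema name are illustrative assumptions:

  -- One table per platform x channel x month, distributed on user_id,
  -- with event_date leading the sort key for date-range pruning
  CREATE TABLE rdl.europe_android_201706 (
    event_date      DATE,
    event_timestamp TIMESTAMP,
    user_id         BIGINT,
    event_name      VARCHAR(64),
    event_params    VARCHAR(4096)
  )
  DISTSTYLE KEY
  DISTKEY (user_id)
  SORTKEY (event_date, user_id, event_timestamp);
  -- Column compression encodings would then be chosen per column, e.g. guided by ANALYZE COMPRESSION

  -- Retiring an old month becomes a cheap DROP instead of DELETE + VACUUM
  DROP TABLE IF EXISTS rdl.europe_android_201506;
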
  • 44. 44 Large scale clickstream data management Types of abstraction views • Current month → europe_android_current_month • Previous month → latam_web_previous_month • Last X months (X = 3/6/12) → asia_ios_last_X_months • All months → africa_android • View creation is automated via Python script • (~300 views in total) Benefits: • Abstraction of underlying partitioning mechanism • Time-agnostic ETLs and analytical queries Optimise table design Partition data Create abstraction views We create abstraction VIEWs over individual tables that UNION ALL the data into relevant groups RDL
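
A sketch of what one of the generated abstraction VIEWs could look like (the script-generated originals are not shown in the deck; table names are illustrative):

  -- Abstraction view UNION ALL-ing the monthly partitions for one platform x channel
  CREATE OR REPLACE VIEW rdl.europe_android_last_3_months AS
  SELECT * FROM rdl.europe_android_201704
  UNION ALL
  SELECT * FROM rdl.europe_android_201705
  UNION ALL
  SELECT * FROM rdl.europe_android_201706;
  -- ETLs and analysts query the view; the generator re-points it as new months arrive
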
  • 46. 46 Dimensional modelling approach Surrogate keys Hierarchical dimensions OLX operates in dozens of markets using many different classifieds platforms built on different technologies and using different production databases It is impossible to ensure system id uniqueness → Need robust mechanism for surrogate key modelling ODL ADL
  • 47. 47 Dimensional modelling approach Surrogate keys Hierarchical dimensions OLX operates in dozens of markets using many different classifieds platforms built on different technologies and using different production databases It is impossible to ensure system id uniqueness → Need robust mechanism for surrogate key modelling Option 1 Example Use combination of system ids that guarantees uniqueness Leads to nightmare spaghetti SQL that is difficult and time-consuming to write, read and maintain SELECT ... FROM odl.fact_listings f JOIN odl.dim_categories d ON f.platform_id = d.platform_id AND f.country_id = d.country_id AND f.brand_id = d.brand_id AND f.category_id = d.category_id AND f.category_level = d.category_level JOIN ... GROUP BY ... ODL ADL
  • 48. 48 Dimensional modelling approach Surrogate keys Hierarchical dimensions OLX operates in dozens of markets using many different classifieds platforms built on different technologies and using different production databases It is impossible to ensure system id uniqueness → Need robust mechanism for surrogate key modelling Option 1 Option 2 Use combination of system ids that guarantees uniqueness Create globally unique identifier (GUID) for each dimension value • Complex to implement • Requires GUID mapping tables to be able to trace values back to system ids • Dimension keys lose semantic meaning ODL ADL
  • 49. 49 Dimensional modelling approach Surrogate keys Hierarchical dimensions OLX operates in dozens of markets using many different classifieds platforms built on different technologies and using different production databases It is impossible to ensure system id uniqueness → Need robust mechanism for surrogate key modelling Option 1 Option 2 Option 3 Use combination of system ids that guarantees uniqueness Create globally unique identifier (GUID) for each dimension value Create smart & persistent surrogate keys that preserve semantic meaning ODL ADL
  • 50. 50 Dimensional modelling approach Surrogate keys Hierarchical dimensions SELECT ... FROM fact_listings JOIN dim_countries USING (country_sk) -- 'olx|eu|ua' JOIN dim_categories USING (category_sk) -- 'olx|asia|in|5|84|1531' JOIN dim_geographies USING (geography_sk) -- 'olx|eu|ua|17|194|194' JOIN dim_channels USING (channel_sk) -- 'mobile_app|android' JOIN dim_listing_status USING (listing_status_sk) -- 'inactive|mod|mod_removed' JOIN dim_listing_types USING (listing_type_sk) -- 'private' JOIN dim_listing_feeds USING (listing_feed_sk) -- 'normal' JOIN dim_listing_net USING (listing_net_sk) -- 'net|mod>live>eod' JOIN dim_currencies USING (currency_sk) -- 'aed' JOIN dim_users USING (user_sk) --'olx|latam|pe|platform|email|freddy@gmail.com' JOIN ... GROUP BY ... Example Benefits: • Simple and guaranteed JOINs • ‘Readable’ key values • Negligible impact on query performance (= Redshift rocks!) ODL ADL
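
For illustration, smart surrogate keys of this kind can be derived directly from the underlying system identifiers during ODL modelling; a hedged sketch with made-up source columns:

  -- Build a readable, globally unique surrogate key by concatenating system ids
  SELECT
    'olx|' || LOWER(region_code) || '|' || LOWER(country_code) AS country_sk,  -- e.g. 'olx|eu|ua'
    country_name
  FROM rdl.system_countries;
  -- Fact tables carry the same concatenation, so joins reduce to a single USING (country_sk)
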
  • 51. 51 Dimensional modelling approach Surrogate keys Hierarchical dimensions We model most dimensions as hierarchical dimensions ODL ADL Approach: • Extract parent-child relationship from system dimensions • For non-system dimensions, define child-parent relationship in configuration tables • Create hierarchical dimensions using generic SQL hierarchy generation script • Set all dimension key values in fact tables as deepest available hierarchy level (ideally key of tree leaf values) Benefits: • Consistent dimensional modelling approach • Can easily traverse hierarchy from leaves all the way up to root • Easy to read & write JOINs with single-key ON conditions
  • 52. 52 Dimensional modelling approach – Hierarchical dimensions ODL ADL Example: dim_channels dimension Configuration
  • 53. 53 Dimensional modelling approach – Hierarchical dimensions ODL ADL Example: dim_channels dimension Output
  • 54. 54 Dimensional modelling approach – Hierarchical dimensions Code taster ODL ADL
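
The code taster on this slide is a screenshot and does not survive in the transcript; as a stand-in, a minimal hypothetical sketch of flattening a child-to-parent configuration table into a fixed-depth hierarchical dimension with self-joins (no recursion needed):

  -- Assumed configuration table odl.map_channels: (channel_sk, channel_name, parent_sk)
  CREATE OR REPLACE VIEW odl.dim_channels_view AS
  SELECT
    leaf.channel_sk                                               AS channel_sk,      -- deepest level, referenced by facts
    leaf.channel_name                                             AS channel_l3_name,
    COALESCE(p.channel_name, leaf.channel_name)                   AS channel_l2_name, -- parent, or repeat the leaf
    COALESCE(gp.channel_name, p.channel_name, leaf.channel_name)  AS channel_l1_name  -- grandparent / top of tree
  FROM odl.map_channels leaf
  LEFT JOIN odl.map_channels p  ON leaf.parent_sk = p.channel_sk
  LEFT JOIN odl.map_channels gp ON p.parent_sk = gp.channel_sk;
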
  • 56. 56 Simplest workflow typically takes an input and applies dimensional modelling & business rules to transform the data into desired output Input table(s) Transformation Transformation View Table SELECT INSERT Dimension(s) Map(s) Every component uses the output from other previously modelled components as input Relevant Dimensions and Maps are included to apply dimensional model and required business rules SQL transformation logic is decoupled and stored in a separate Transformation VIEW CREATE TABLE fact_output (...) DISTSTYLE KEY DISTKEY (...) SORTKEY(...); CREATE OR REPLACE VIEW fact_output_view AS SELECT ... -- business logic and projections FROM fact_input JOIN dimensions JOIN maps GROUP BY ...; -- aggregation logic TRUNCATE fact_output; INSERT INTO fact_output SELECT * FROM fact_output_view; ANALYZE fact_output; Final table that contains output of data workflow
  • 57. 57 For more complex workflows, we use one or more staging steps to ensure code modularity and have better control over performance Input table(s) View Table SELECT INSERT Transformation staging step(s) Transformation Transformation staging step(s) Transformation Dimension(s) Map(s) Intermediate staging logic is encapsulated in separate VIEWs Staging output is materialised in dedicated tables to be used as input into subsequent transformation steps CREATE TABLE fact_output_staging_step1 (...) DISTSTYLE KEY DISTKEY (...) SORTKEY(...); CREATE TABLE fact_output_staging_step2 (...) DISTSTYLE KEY DISTKEY (...) SORTKEY(...); -- ... more staging steps as required ... CREATE TABLE fact_output (...) DISTSTYLE KEY DISTKEY (...) SORTKEY(...); CREATE OR REPLACE VIEW fact_output_staging_step1_view AS SELECT ... -- business logic and projections FROM fact_input JOIN ... GROUP BY ...; -- aggregation logic CREATE OR REPLACE VIEW fact_output_staging_step2_view AS SELECT ... -- business logic and projections FROM fact_output_staging_step1 JOIN ... GROUP BY ...; -- aggregation logic -- ... more staging steps as required ... CREATE OR REPLACE VIEW fact_output_view AS SELECT ... -- business logic and projections FROM fact_output_staging_step2 JOIN ... GROUP BY ...; -- aggregation logic TRUNCATE fact_output_staging_step1; INSERT INTO fact_output_staging_step1 SELECT * FROM fact_output_staging_step1_view; ANALYZE fact_output_staging_step1; TRUNCATE fact_output_staging_step2; INSERT INTO fact_output_staging_step2 SELECT * FROM fact_output_staging_step2_view; ANALYZE fact_output_staging_step2; -- ... more staging steps as required ... TRUNCATE fact_output; INSERT INTO fact_output SELECT * FROM fact_output_view; ANALYZE fact_output;
  • 58. 58 We use Feeders to apply the same Transformation to different Inputs Input table(s) View Table SELECT INSERT Feeder Transformation Transformation Dimension(s) Map(s) CREATE OR REPLACE VIEW fact_output_feeder1_view AS -- Combine & apply projections / pre-processing SELECT ... FROM fact_input1 UNION ALL SELECT ... FROM fact_input2 UNION ALL SELECT ... FROM fact_input3; CREATE OR REPLACE VIEW fact_output_feeder2_view AS -- Combine & apply projections / pre-processing SELECT ... FROM fact_input4 UNION ALL SELECT ... FROM fact_input5 UNION ALL SELECT ... FROM fact_input6; CREATE OR REPLACE VIEW fact_output_feeder_view AS SELECT * FROM fact_output_feeder2_view; CREATE OR REPLACE VIEW fact_output_view AS SELECT ... -- business logic and projections FROM fact_output_feeder_view ...; TRUNCATE fact_output; -- Run transformation on feeder 1 CREATE OR REPLACE VIEW fact_output_feeder_view AS SELECT * FROM fact_output_feeder1_view; INSERT INTO fact_output SELECT * FROM fact_output_view; ANALYZE fact_output; -- Run transformation on feeder 2 CREATE OR REPLACE VIEW fact_output_feeder_view AS SELECT * FROM fact_output_feeder2_view; INSERT INTO fact_output SELECT * FROM fact_output_view; ANALYZE fact_output; Feeders are VIEWs that decouple the input data from the Transformation logic. They can be used to collate multiple inputs (e.g. UNION ALL) & apply basic pre-processing and projections serving as basis for rest of data flow Input table(s)
  • 59. 59 Feeders can be very powerful with incremental data workflows Input table(s) Transformation Transformation Dimension(s) Map(s) View Table SELECT INSERT Fast incremental load feeder Slow incremental load feeder Full load feeder CREATE OR REPLACE VIEW fact_output_feeder_full_load_view AS SELECT ... -- projections & pre-processing -- Use full available time range FROM fact_input; CREATE OR REPLACE VIEW fact_output_feeder_incr_load_slow_view AS SELECT ... -- projections & pre-processing FROM fact_input -- Use 4-week incremental time window WHERE date_nk >= (GETDATE() :: DATE - INTERVAL '4 week') :: DATE; CREATE OR REPLACE VIEW fact_output_feeder_incr_load_fast_view AS SELECT ... -- projections & pre-processing FROM fact_input -- Use 2-day incremental time window WHERE date_nk >= (GETDATE() :: DATE - INTERVAL '2 day') :: DATE; CREATE OR REPLACE VIEW fact_output_feeder_view AS SELECT * FROM fact_output_feeder_incr_load_fast_view; Typically we switch between 3 load feeders depending on the ETL processing approach: (1) Full load – Processes the input in its entirety. Used when running transformation for the first time on full scale. (2) Incremental Fast load – Processes last few days / hours of data. Used by default for scheduled production job. (3) Incremental Slow load – Processes last month of data. Used to manually fix problems from previous runs without having to re-process years of data.
  • 60. 60 Another great use of Feeders is as a method to switch between Production and Development environments Input table(s) Transformation Transformation Dimension(s) Map(s) Development feeder Production feeder View Table SELECT INSERT We use Production / Development feeder VIEWs to switch between full scale (e.g. all countries) and development scale (e.g. a few small countries). This enables fast ETL runtimes during development and testing CREATE OR REPLACE VIEW fact_output_feeder_development_view AS SELECT ... -- projections & pre-processing FROM fact_input WHERE country_sk IN ('olx|mea|gh', 'olx|mea|za'); CREATE OR REPLACE VIEW fact_output_feeder_production_view AS SELECT ... -- projections & pre-processing FROM fact_input; CREATE OR REPLACE VIEW fact_output_feeder_view AS SELECT * FROM fact_output_feeder_development_view;
  • 61. 61 Ultimately, these patterns can be combined in various ways depending on the requirements of the data workflow Input table(s) Development feeder Transformation staging step(s) Transformation Production feeder Fast incremental load feeder Slow incremental load feeder Full load feeder Transformation staging step(s) Transformation Root feeder View Table SELECT INSERT Dimension(s) Map(s)
  • 62. 62 Individual data workflows add up to our full data management ecosystem Component 1 Component 2 Component 3 Component 4 … …
  • 64. 64 High-level code repository organisation master rdl hydra android *.sql ios *.sql … *.sql livesync … *.sql odl dim dim_actions *.sql dim_categories *.sql … *.sql map map_actions *.sql map_channels *.sql … *.sql fact fact_event_clickstream *.sql fact_listings *.sql … *.sql stg … *.sql … … *.sql adl dim, map, fact … … *.sql … … *.sql clm rdl dim, map, fact … … *.sql odl dim, map, fact … … *.sql adl dim, map, fact … … *.sql … Cluster Data layer Component group Component Code
  • 65. 65 Example repository structure and file naming We use standard file naming with special prefixes to decouple the logical building blocks of the data workflow: • Table definition • View definition (feeders, transformations, unit tests) • ETL scripts • Configuration scripts • Analysis
  • 66. 66 OLX Redshift technical best practices ➢ Technical architecture ➢ Data management ➢ Recommenders ➢ Unit testing ➢ OLAP cubes ➢ Tableau integration
  • 68. 68 Collaborative filtering visual 101 Depeche Mode The Cure Keane Placebo Suede
  • 69. 69 Collaborative filtering visual 101 Depeche Mode The Cure Keane Placebo Suede Jack Jill Jim Jen
  • 70. 70 Collaborative filtering visual 101 Depeche Mode The Cure Keane Placebo Suede Jack Jill Jim Jen
  • 71. 71 1st degree recommendation Collaborative filtering visual 101 Depeche Mode The Cure Keane Placebo Suede Jack Jill Jim Jen
  • 72. 72 1st degree recommendation 1st degree recommendation Collaborative filtering visual 101 Depeche Mode The Cure Keane Placebo Suede Jack Jill Jim Jen
  • 73. 73 1st degree recommendation 1st degree recommendation Collaborative filtering visual 101 Depeche Mode The Cure Keane Placebo Suede 2nd degree recommendation Jack Jill Jim Jen
  • 74. 74 Collaborative filtering implementation Get and rate all relevant item interactions Compute item similarity matrix Compute 1st and 2nd degree item-to-item recommendations Compute personalised user-to-item recommendations
  • 75. 75 Collaborative filtering implementation Get and rate all relevant item interactions Compute item similarity matrix Compute 1st and 2nd degree item-to-item recommendations Compute personalised user-to-item recommendations CREATE TABLE item_interactions AS SELECT user, band, SUM(DECODE(action, 'like', 1, 'play', 3)) AS score FROM clickstream WHERE action IN ('like', 'play') GROUP BY 1, 2; user band score Jack Depeche Mode 5 Jack The Cure 3 Jack Keane 8 Jill Depeche Mode 4 Jill Keane 10 Jill Suede 7 Jim Keane 15 … … …
  • 76. 76 Collaborative filtering implementation Get and rate all relevant item interactions Compute item similarity matrix Compute 1st and 2nd degree item-to-item recommendations Compute personalised user-to-item recommendations CREATE TABLE similarity_matrix AS SELECT i1.band AS band1, i2.band AS band2, COUNT(1) AS frequency, SUM(i1.score + i2.score) AS sum_score FROM item_interactions i1 JOIN item_interactions i2 ON i1.user = i2.user AND i1.band <> i2.band GROUP BY 1,2 HAVING COUNT(1) > 1; user band score Jack Depeche Mode 5 Jack The Cure 3 Jack Keane 8 Jill Depeche Mode 4 Jill Keane 10 Jill Suede 7 Jim Keane 15 … … … Depeche Mode TheCure Keane Placebo Suede Depeche Mode 3 4 The Cure 3 3 Keane 4 3 5 Placebo 2 Suede 5 2
  • 77. 77 user band score Jack Depeche Mode 5 Jack The Cure 3 Jack Keane 8 Jill Depeche Mode 4 Jill Keane 10 Jill Suede 7 Jim Keane 15 … … … Depeche Mode TheCure Keane Placebo Suede Depeche Mode 3 4 The Cure 3 3 Keane 4 3 5 Placebo 2 Suede 5 2 Collaborative filtering implementation Get and rate all relevant item interactions Compute item similarity matrix Compute 1st and 2nd degree item-to-item recommendations Compute personalised user-to-item recommendations 1st degree item-to-item recommendations band1 band2 frequency sum_score rec_rank Depeche Mode Keane 4 49 1 Depeche Mode The Cure 3 39 2 Keane Suede 5 80 1 Keane Depeche Mode 4 49 2 Keane The Cure 3 48 3 Placebo Suede 2 62 1 Suede Keane 5 80 1 Suede Placebo 2 62 2 The Cure Keane 3 48 1 The Cure Depeche Mode 3 39 2 CREATE TABLE dobo.rec_item2item AS SELECT band1, band2, frequency, sum_score, ROW_NUMBER() OVER (PARTITION BY band1 ORDER BY frequency DESC, sum_score DESC) AS rec_rank FROM similarity_matrix;
  • 78. 78 Collaborative filtering implementation Get and rate all relevant item interactions Compute item similarity matrix Compute 1st and 2nd degree item-to-item recommendations Compute personalised user-to-item recommendations INSERT INTO dobo.rec_item2item WITH max_rank AS ( SELECT band1, MAX(rec_rank) AS max_rank_1st_degree FROM dobo.rec_item2item GROUP BY 1 ) SELECT rec_1st.band1, rec_2nd.band2, NULL AS frequency, NULL AS sum_score, ROW_NUMBER() OVER ( PARTITION BY rec_1st.band1 ORDER BY MIN(rec_1st.rec_rank * rec_2nd.rec_rank), MIN(rec_1st.rec_rank) ) + max_rank_1st_degree AS rec_rank FROM dobo.rec_item2item rec_1st JOIN max_rank USING (band1) JOIN dobo.rec_item2item rec_2nd ON rec_1st.band2 = rec_2nd.band1 AND rec_1st.band1 <> rec_2nd.band2 -- exclude items already in 1st degree recommendations LEFT JOIN dobo.rec_item2item rec_excl ON rec_1st.band1 = rec_excl.band1 AND rec_2nd.band2 = rec_excl.band2 WHERE rec_excl.band1 IS NULL GROUP BY 1, 2, max_rank_1st_degree; 1st degree item-to-item recommendations band1 band2 frequency sum_score rec_rank Depeche Mode Keane 4 49 1 Depeche Mode The Cure 3 39 2 Keane Suede 5 80 1 Keane Depeche Mode 4 49 2 Keane The Cure 3 48 3 Placebo Suede 2 62 1 Suede Keane 5 80 1 Suede Placebo 2 62 2 The Cure Keane 3 48 1 The Cure Depeche Mode 3 39 2 2nd degree item-to-item recommendations band1 band2 frequency sum_score rec_rank Depeche Mode Suede 3 Keane Placebo 4 Placebo Keane 2 Suede Depeche Mode 3 Suede The Cure 4 The Cure Suede 3
  • 79. 79 Collaborative filtering implementation Get and rate all relevant item interactions Compute item similarity matrix Compute 1st and 2nd degree item-to-item recommendations Compute personalised user-to-item recommendations user band score Jack Depeche Mode 5 Jack The Cure 3 Jack Keane 8 Jill Depeche Mode 4 Jill Keane 10 Jill Suede 7 Jim Keane 15 … … … Depeche Mode TheCure Keane Placebo Suede Depeche Mode 3 4 The Cure 3 3 Keane 4 3 5 Placebo 2 Suede 5 2 user-to-item recommendations user band frequency sum_score rec_rank Ana Placebo 62 6 1 Ana Depeche Mode 49 5 2 Ana The Cure 48 7 3 Dave Placebo 62 6 1 Dave Depeche Mode 49 5 2 Dave The Cure 48 7 3 Eric Placebo 62 6 1 Eric Depeche Mode 49 5 2 Eric The Cure 48 7 3 Jack Placebo 4 1 Jack Suede 80 7 2 Jen Depeche Mode 3 1 Jen The Cure 4 2 Jen Keane 80 3 3 Jill The Cure 87 9 1 Jill Placebo 62 6 2 Jim Depeche Mode 49 5 1 Jim The Cure 48 7 2 John Placebo 4 1 John Suede 80 7 2 Sam Placebo 4 1 Sam Suede 80 7 2 SELECT int.user, rec.band2, SUM(rec.sum_score) AS frequency, SUM(rec.rec_rank) AS sum_score, ROW_NUMBER() OVER ( PARTITION BY int.user ORDER BY SUM(rec.sum_score) DESC, SUM(rec.rec_rank) ASC) AS rec_rank FROM dobo.item_interactions int JOIN dobo.rec_item2item rec ON int.band = rec.band1 -- Exclude recommendations that the user already interacted with LEFT JOIN dobo.item_interactions excl ON rec.band2 = excl.band AND int.user = excl.user WHERE excl.band IS NULL GROUP BY 1,2;
  • 80. 80 At OLX, we apply this approach to implement a variety of recommenders Item-to-item People who viewed items A, B, C also viewed items X, Y, Z Category-to-category People who bought Cars were also interested in Car Parts Search-to-category People who searched for ‘black leather sofa’ were interested in Furniture Search-to-search People who searched for ‘porsche’ also searched for ‘bmw’, ‘mercedes’, ‘ferrari’ Category-to-search People who browsed Mobile phones searched for ‘iphone 7’, ‘samsung galaxy’, … + many more … … Recommender Description
  • 81. 81 Example recommenders in production www.olx.ph demo
  • 82. 82 Other examples Related search recommendations for browsing users Personalised item recommendations for active buyers Recommended categories to post for active sellers Users like you also liked Try also these related searches Here are some other selling ideas
  • 83. 83 OLX Redshift technical best practices ➢ Technical architecture ➢ Data management ➢ Recommenders ➢ Unit testing ➢ OLAP cubes ➢ Tableau integration
  • 84. 84 Unit testing is an established paradigm in test-driven development
  • 85. 85 OLX is developing Qualis – a unit testing framework for Redshift / SQL ✓ Switch from reactive to proactive error handling ✓ Enable SQL codebase scale out ✓ Reduce maintenance time and ad hoc data investigations ✓ Make data platform more robust ✓ Free up time for innovation
  • 86. 86 Qualis includes Redshift-side framework (being piloted) and Python test automation & visualisation (currently in development) Test 1 Test 2 Test 3 Test 4 Test 5 Test 6 Test 7 Test 8 Test 9 Test 10 Test 11 … Redshift database Python Qualis tests are implemented in Redshift using VIEWs that return standard output & PASS/FAIL result Qualis script runs daily and SELECTs test output from each VIEW using flexible configuration and parses & aggregates results Final output is visualised using plain text and / or third party visualisation tool (e.g. Tableau) Visualisation
  • 87. 87 Example #1: Duplicates detection test CREATE OR REPLACE VIEW clm.utest_fact_segmentation_duplication_view AS WITH test AS ( SELECT country_sk AS country_sk, COUNT(1) AS cnt, COUNT(DISTINCT user_sk) AS cnt_distinct FROM clm.fact_segmentation GROUP BY 1 ) SELECT 'fact_segmentation' AS test_module, country_sk, 'duplication' AS test_group, NULL AS test_subgroup, NULL AS test_instance, CASE WHEN cnt = cnt_distinct THEN 'PASS' ELSE 'FAIL' END || ' [cnt: ' || cnt || '; COUNT(DISTINCT): ' || cnt_distinct || ']' AS duplicates FROM test ORDER BY 1,2,3,4,5; Compares overall COUNT to COUNT(DISTINCT) for each OLX country to detect data duplication
  • 88. 88 Example #1: Duplicates detection test CREATE OR REPLACE VIEW clm.utest_fact_segmentation_duplication_view AS WITH test AS ( SELECT country_sk AS country_sk, COUNT(1) AS cnt, COUNT(DISTINCT user_sk) AS cnt_distinct FROM clm.fact_segmentation GROUP BY 1 ) SELECT 'fact_segmentation' AS test_module, country_sk, 'duplication' AS test_group, NULL AS test_subgroup, NULL AS test_instance, CASE WHEN cnt = cnt_distinct THEN 'PASS' ELSE 'FAIL' END || ' [cnt: ' || cnt || '; COUNT(DISTINCT): ' || cnt_distinct || ']' AS duplicates FROM test ORDER BY 1,2,3,4,5; Compares overall COUNT to COUNT(DISTINCT) for each OLX country to detect data duplication
  • 89. 89 Example #2: Gap detection in time-series data Aggregates data into hourly buckets and identifies any missing hours, while including some logic to reduce false positives (e.g. during night hours in smaller OLX markets) CREATE OR REPLACE VIEW clm.utest_fact_event_clickstream_agg_user_mapped_gaps_view AS WITH hours AS ( SELECT country_sk, DATE_TRUNC('HOUR', current_local_time - INTERVAL '1 HOUR' * row_num) AS hour FROM global_bi.dim_counter CROSS JOIN clm.fact_country_current_time WHERE row_num BETWEEN 25 AND 2 * 7 * 24 -- 2 weeks x 7 days x 24 hours (start checking for gaps from 24 hours ago) ), test AS ( SELECT country_sk, DATE_TRUNC('HOUR', time_event_local) AS hour, COUNT(1) AS cnt FROM ( SELECT *, -- Use average of first and last event timestamp within aggregated time window to approximate overall timing of event(s) TIMESTAMP 'epoch' + INTERVAL '1 second' * (( DATEDIFF('second', 'epoch', time_first_event_local) + DATEDIFF('second', 'epoch', time_last_event_local) ) / 2) AS time_event_local FROM clm.fact_event_clickstream_agg_user_mapped ) fc WHERE date_event_nk >= (GETDATE() :: DATE - INTERVAL '16 DAYS') :: DATE -- Performance filter GROUP BY 1,2 ), countries_in_scope AS ( SELECT country_sk, AVG(1.0 * cnt) AS avg_cnt_allday, AVG(1.0 * CASE WHEN DATE_PART('hour', hour) BETWEEN 0 AND 6 THEN cnt END) AS avg_cnt_night FROM test GROUP BY 1 ) SELECT 'fact_event_clickstream_agg_user_mapped' AS test_module, country_sk, 'gaps' AS test_group, NULL AS test_subgroup, DATE_TRUNC('day', hour) :: DATE AS test_instance, CASE WHEN COUNT(CASE wHEN COALESCE(test.cnt, 0) = 0 THEN 1 END) = 0 THEN 'PASS' ELSE 'FAIL' || ' (' || COUNT(CASE wHEN COALESCE(test.cnt, 0) = 0 THEN 1 END) || ' missing hours: ' || LISTAGG(CASE WHEN COALESCE(test.cnt, 0) = 0 THEN DATE_PART('hour', hour) END, ',') WITHIN GROUP (ORDER BY DATE_PART('hour', hour)) || '; Avg/hr (all day): ' || avg_cnt_allday :: INT || '; Avg/hr (night only): ' || avg_cnt_night :: INT || ']' END AS gap_test FROM hours JOIN countries_in_scope USING (country_sk) LEFT JOIN test USING (country_sk, hour) WHERE -- Do not test night hours if night activity is very low on average CASE WHEN avg_cnt_night <= 500 THEN DATE_PART('hour', hour) NOT BETWEEN 0 AND 6 ELSE 1 :: BOOL END = 1 :: BOOL -- Do not test countries with super low activity to minimise false FAILs AND avg_cnt_allday >= 500 GROUP BY 2,5,avg_cnt_allday,avg_cnt_night ORDER BY 1,2,3,4,5;
  • 90. 90 Example #2: Gap detection in time-series data
• 91. 91 Example #3: Simple business logic validation. A configuration sheet specifies the test rules (1 cell = 1 test) → in this example, testing for data coverage (minimum % of records with non-NULL values) per customer segment.
  • 92. 92 Example #3: Simple business logic validation
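The deck shows this test via a configuration-sheet screenshot rather than SQL; purely as an illustration, a coverage check in the same PASS/FAIL view style might look like the sketch below (the user_segment column and the 95% threshold are assumptions, not taken from the deck).

-- Hedged sketch (hypothetical column and threshold): coverage test in the same view style
CREATE OR REPLACE VIEW clm.utest_fact_segmentation_coverage_view AS
WITH test AS (
    SELECT country_sk,
           COUNT(1) AS cnt,
           COUNT(user_segment) AS cnt_not_null   -- COUNT(col) ignores NULLs
    FROM clm.fact_segmentation
    GROUP BY 1
)
SELECT 'fact_segmentation' AS test_module,
       country_sk,
       'coverage' AS test_group,
       'user_segment' AS test_subgroup,
       NULL AS test_instance,
       CASE WHEN cnt_not_null >= 0.95 * cnt THEN 'PASS' ELSE 'FAIL' END
           || ' [' || cnt_not_null || ' of ' || cnt || ' rows populated]' AS coverage
FROM test
ORDER BY 1,2,3,4,5;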
• 93. 93 Example #4: Complex business logic validation. The most advanced test case so far validates segment business rules using equality / inequality conditions across different segments / dimensions.
  • 94. 94 Example #4: Complex business logic validation
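The deck presents the rule configuration visually; as a hedged illustration only, one such cross-segment inequality check could be expressed like this (the segment names and the rule itself are hypothetical, not from the deck).

-- Hedged sketch (hypothetical segments and rule): per-country cross-segment inequality check
WITH seg AS (
    SELECT country_sk,
           SUM(CASE WHEN user_segment = 'loyal'     THEN 1 ELSE 0 END) AS cnt_loyal,
           SUM(CASE WHEN user_segment = 'returning' THEN 1 ELSE 0 END) AS cnt_returning
    FROM clm.fact_segmentation
    GROUP BY 1
)
SELECT 'fact_segmentation' AS test_module,
       country_sk,
       'business_rules' AS test_group,
       'loyal_vs_returning' AS test_subgroup,
       CASE WHEN cnt_loyal <= cnt_returning THEN 'PASS'
            ELSE 'FAIL [loyal: ' || cnt_loyal || '; returning: ' || cnt_returning || ']'
       END AS rule_check
FROM seg
ORDER BY 1,2,3;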
  • 95. 95 Example test result visualisation
  • 96. 96 OLX Redshift technical best practices Ø Technical architecture Ø Data management Ø Recommenders Ø Unit testing Ø OLAP cubes Ø Tableau integration
• 97. 97 Challenge: OLX has complex global reporting needs
Dimensions (avg. cardinality):
• Time (month / week / day) → ~200
• Business unit (4-level hierarchy) → ~50
• Category (6-level hierarchy) → ~600
• Geography (3-level hierarchy) → ~220
• Channel (3-level hierarchy) → ~10
• Segment (3-way segmentation) → ~27
• Type (3-level hierarchy) → ~7
Measures: ~20 measures (additive & non-additive)
Measure variants: Current, Lag, Year ago, Target
Up to ~200 trillion data points (200 × 50 × 600 × 220 × 10 × 27 × 7 dimension combinations × ~20 measures × 4 variants ≈ 2 × 10^14)!
• 98. 98 3 possible solutions
1. Query disaggregated data and compute measures in real time
   1a. Keep data in Redshift → on-the-fly calculations over billions of records are not fast enough for a responsive user experience
   1b. Load data into fast columnar storage (e.g. Cassandra or Tableau's internal database) → the size of the disaggregated data exceeds tool limits (e.g. Tableau) and is too large to enable efficient daily loads
2. Use a standalone OLAP product from a big-name database vendor → too expensive, too complex, and requires hiring specialists with domain knowledge
3. OLX Redshift OLAP framework → pre-aggregated cubes with direct Tableau integration offer the most pragmatic and simplest solution
• 99. 99 Under the framework, we use a configurable cube matrix to specify the slices which we are interested in reporting on
Example measures:
• Number of users: additive only across the Business unit dimension
• Number of listings: additive across all dimensions
Example dimensions:
• Time: Q3 2017 → Apr / May / Jun 2017 → individual days (1 Apr … 30 Jun)
• Channel: Total → Desktop web / Mobile web / Mobile apps → Android / iOS
• Business unit: Total → Europe / LATAM → PL, PT / AR, CO
Cube matrix (perspective cardinalities):
• Time (non-additive): Quarter = 1, Month = 3, Day = 91
• Channel (non-additive): Total = 1, L1 = 3, L2 = 2
• Business unit: Total = 1, Region = 2, Country = 4
Records in cube: Full cube = 3,990 (95 × 6 × 7); Sub-cube 1 = 364; Sub-cube 2 = 28; Sub-cube 3 = 80
→ 472 vs. 3,990 records (~12% of the size of the full cube)
• 100. 100 Under this example, the 3 sub-cubes translate to 11 slices
• Slice 1: 364 records (Sub-cube 1)
• Slices 2-7: 4, 12, 2, 6, 1, 3 records (Sub-cube 2, 28 records in total)
• Slices 8-11: 12, 36, 8, 24 records (Sub-cube 3, 80 records in total)
Each slice fixes exactly one perspective per dimension; for instance, a Day × Total channel × Country slice would contain 91 × 1 × 4 = 364 records. Together the 11 slices add up to the 472 records of the 3 sub-cubes.
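The cube matrix table itself is not shown in the deck; one plausible way to store this slice configuration, matching the flag names referenced by the pseudo-SQL on the next slide, is sketched below (the slice_id column is an assumption).

-- Hedged sketch: one row per slice; each flag picks exactly one perspective per dimension
CREATE TABLE cube_matrix (
     slice_id          INT
    ,time_quarter      BOOLEAN
    ,time_month        BOOLEAN
    ,time_day          BOOLEAN
    ,channel_total     BOOLEAN
    ,channel_l1        BOOLEAN
    ,channel_l2        BOOLEAN
    ,bus_unit_total    BOOLEAN
    ,bus_unit_region   BOOLEAN
    ,bus_unit_country  BOOLEAN
);

-- Example: a Day x Total channel x Country slice as discussed above
INSERT INTO cube_matrix VALUES (1, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE);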
• 101. 101 Pseudo-SQL OLAP cube implementation

SELECT CASE WHEN cube_matrix.time_quarter THEN dim_time.quarter_name
            WHEN cube_matrix.time_month   THEN dim_time.month_name
            WHEN cube_matrix.time_day     THEN dim_time.date
       END AS time_value,
       CASE WHEN cube_matrix.channel_total THEN 'Total'
            WHEN cube_matrix.channel_l1    THEN dim_channel.channel_l1_name
            WHEN cube_matrix.channel_l2    THEN dim_channel.channel_l2_name
       END AS channel_value,
       CASE WHEN cube_matrix.bus_unit_total   THEN 'Total'
            WHEN cube_matrix.bus_unit_region  THEN dim_bus_unit.region_name
            WHEN cube_matrix.bus_unit_country THEN dim_bus_unit.country_name
       END AS bus_unit_value,
       COUNT(1) AS num_listings,
       COUNT(DISTINCT user_key) AS num_users
FROM fact_listings
JOIN dim_time     USING (time_key)
JOIN dim_channel  USING (channel_key)
JOIN dim_bus_unit USING (bus_unit_key)
CROSS JOIN cube_matrix
GROUP BY 1,2,3

CHALLENGE: this CROSS JOIN can be very expensive, as it explodes the input fact table by the number of slices configured in the cube matrix.
SOLUTION: aggregate only the non-additive dimensions first (Time and Channel), then aggregate the additive dimensions (Business unit) using the already partially aggregated output.
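As a hedged sketch of the two-phase solution (not the production code; table and column names follow the pseudo-SQL above): resolve the non-additive Time and Channel perspectives once at country grain, then roll the additive Business unit dimension up from that much smaller intermediate table.

-- Phase 1: resolve the non-additive Time and Channel perspectives at the lowest
--          additive grain (country), so COUNT(DISTINCT) runs only here
CREATE TEMP TABLE cube_stage AS
SELECT CASE WHEN s.time_quarter THEN dim_time.quarter_name
            WHEN s.time_month   THEN dim_time.month_name
            WHEN s.time_day     THEN dim_time.date
       END AS time_value,
       CASE WHEN s.channel_total THEN 'Total'
            WHEN s.channel_l1    THEN dim_channel.channel_l1_name
            WHEN s.channel_l2    THEN dim_channel.channel_l2_name
       END AS channel_value,
       dim_bus_unit.region_name,
       dim_bus_unit.country_name,
       COUNT(1)                 AS num_listings,
       COUNT(DISTINCT user_key) AS num_users
FROM fact_listings
JOIN dim_time     USING (time_key)
JOIN dim_channel  USING (channel_key)
JOIN dim_bus_unit USING (bus_unit_key)
CROSS JOIN (SELECT DISTINCT time_quarter, time_month, time_day,
                            channel_total, channel_l1, channel_l2
            FROM cube_matrix) s      -- explode only by the non-additive perspective combinations
GROUP BY 1,2,3,4;

-- Phase 2: roll the additive Business unit dimension up from the (much smaller) stage.
-- Summing per-country distinct user counts is valid here because, per slide 99,
-- number of users is additive across the Business unit dimension.
SELECT time_value,
       channel_value,
       region_name       AS bus_unit_value,
       SUM(num_listings) AS num_listings,
       SUM(num_users)    AS num_users
FROM cube_stage
GROUP BY 1,2,3;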
• 102. 102 Summary of OLX Redshift OLAP framework
Input: operational data model from the OLX data warehouse (facts: ~15 tables; dimensions: ~15 tables; ~250B records)
OLAP cube configuration: definition of the cube dimensional model and the cube matrix
Step 1: Prepare cube pre-aggregates at the smallest grain possible (i.e. user); one pre-aggregate per cube (cube 1, 2, 3, …); ~6B records
Step 2: Aggregate all non-additive cube slices; ~100M records
Step 3: Aggregate all additive cube slices using the output from the previous step; ~460M records
Step 4: Combine the individual cube outputs into a single consolidated cube; ~200M records
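Step 4 essentially stacks the per-cube outputs; a minimal sketch, assuming hypothetical per-cube output tables that already share the consolidated column layout:

-- Hedged sketch (hypothetical table names): consolidate per-cube outputs into the final cube
CREATE TABLE fact_cube_staging AS
SELECT * FROM cube1_additive_agg
UNION ALL
SELECT * FROM cube2_additive_agg
UNION ALL
SELECT * FROM cube3_additive_agg;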
  • 106. 106 Production example: Cube matrix / slices …
• 107. 107 Structure of final cube output
In our current production version, the final cube table has ~200M rows and ~150 columns (~40 for dimensions, ~110 for measures).

CREATE TABLE fact_cube (
    -- Unique key
     cube_grain_sk                  VARCHAR(500)  ENCODE RAW
    -- Dimensions
    ,time_perspective               VARCHAR(200)  ENCODE RAW
    ,time_date_nk                   DATE          ENCODE RAW
    ,time_value                     VARCHAR(200)  ENCODE LZO
    ,time_display_value             VARCHAR(255)  ENCODE LZO
    ,time_num_days_in_period        VARCHAR(255)  ENCODE LZO
    ,time_num_days_in_period_lag    VARCHAR(255)  ENCODE LZO
    ,time_num_days_in_period_yago   VARCHAR(255)  ENCODE LZO
    ,dim01_perspective              VARCHAR(200)  ENCODE LZO
    ,dim02_perspective              VARCHAR(200)  ENCODE LZO
    -- ...
    ,dim01_display_value            VARCHAR(255)  ENCODE RAW
    ,dim02_display_value            VARCHAR(255)  ENCODE RAW
    -- ...
    -- Measures
    ,measure01                      BIGINT        ENCODE LZO
    ,measure01_lag                  BIGINT        ENCODE LZO
    ,measure01_target               BIGINT        ENCODE LZO
    ,measure01_yago                 BIGINT        ENCODE LZO
    ,measure02                      BIGINT        ENCODE LZO
    ,measure02_lag                  BIGINT        ENCODE LZO
    ,measure02_target               BIGINT        ENCODE LZO
    ,measure02_yago                 BIGINT        ENCODE LZO
    -- ...
)
DISTSTYLE KEY
DISTKEY (cube_grain_sk)
SORTKEY (
    time_date_nk,
    dim01_display_value,
    dim02_display_value,
    ...
);

Example cube_grain_sk value:
time~day|2017-06-16 //business_unit~country|olx|eu|ro //category~g2|core|for_sale //geography~l1|olx|eu|ro|13 //channel~total|total //user_hlv~segment|low_volume //user_ftr~total|total //user_pnp~total|total //listing_liqg~total|total
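How cube_grain_sk is assembled is not spelled out in the deck; judging from the example value, it appears to concatenate one 'dimension~perspective|value' token per dimension, roughly as in this sketch (the source table and the value columns are assumptions):

-- Hedged sketch: assemble a grain key in the spirit of the example value above
SELECT 'time~' || time_perspective || '|' || time_value
       || ' //business_unit~' || dim01_perspective || '|' || dim01_value
       || ' //category~'      || dim02_perspective || '|' || dim02_value
       -- ... one token per remaining dimension
       AS cube_grain_sk
FROM cube_additive_agg;   -- hypothetical intermediate table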
  • 109. 109 OLX Redshift technical best practices Ø Technical architecture Ø Data management Ø Recommenders Ø Unit testing Ø OLAP cubes Ø Tableau integration
• 110. 110 Summary of OLX Redshift OLAP framework
Input: operational data model from the OLX data warehouse (facts: ~15 tables; dimensions: ~15 tables; ~250B records)
OLAP cube configuration: definition of the cube dimensional model and the cube matrix
Step 1: Prepare cube pre-aggregates at the smallest grain possible (i.e. user); ~6B records
Step 2: Aggregate all non-additive cube slices; ~100M records
Step 3: Aggregate all additive cube slices using the output from the previous step; ~460M records
Step 4: Combine the individual cube outputs into a single consolidated cube; ~200M records
Tableau view: the Tableau abstraction view adds derivative measures (calculated on the fly) and formatted values needed to implement the reporting dashboard
Output: Tableau live dashboard connection to a dedicated Tableau Redshift cluster
Goal: strive for 1 query per Tableau interaction, running within max. ~3 sec.
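The Tableau abstraction view itself is not shown; as an illustration only, it could expose on-the-fly derivative measures and formatted values along these lines (the derivative measure definitions are assumptions, not from the deck):

-- Hedged sketch (hypothetical derivative measures) of a Tableau abstraction view over fact_cube
CREATE OR REPLACE VIEW fact_cube_tableau_view AS
SELECT *,
       -- Derivative measures calculated on the fly
       CASE WHEN measure01_lag    > 0 THEN 1.0 * measure01 / measure01_lag  - 1 END AS measure01_growth_vs_lag,
       CASE WHEN measure01_yago   > 0 THEN 1.0 * measure01 / measure01_yago - 1 END AS measure01_growth_vs_yago,
       CASE WHEN measure01_target > 0 THEN 1.0 * measure01 / measure01_target  END AS measure01_target_attainment,
       -- Formatted values for display
       TO_CHAR(time_date_nk, 'DD Mon YYYY') AS time_date_formatted
FROM fact_cube;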
• 112. 112 Debugging Tableau interaction with Redshift

-- Get the last queries run from Tableau
SELECT DATEDIFF('ms', starttime, endtime) / 1000.0 AS duration,
       query, xid, pid, starttime, querytxt
FROM stl_query
WHERE userid = 105
ORDER BY starttime DESC
LIMIT 100;

-- Get the SQL of one transaction and re-insert line breaks / tabs so the
-- Tableau-generated SQL becomes readable
SELECT LISTAGG(
         REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
             text
           , '\\n', '\n')
           , '"', '')
           , ', ', ',\n\t\t')
           , 'AND ', '\n AND\t')
           , 'FROM ', '\n FROM\t')
           , 'WHERE ', '\n WHERE\t')
           , 'SELECT ', '\nSELECT\t')
           , 'declare', '--declare')
         , ''
       ) WITHIN GROUP (ORDER BY sequence, starttime) AS sql
FROM svl_statementtext
WHERE userid = 105
  AND xid = 5650302
  AND text NOT LIKE 'begin%'
  AND text NOT LIKE 'fetch%'
  AND text NOT LIKE 'close%';
  • 113. 113 Contents • Introduction to Naspers and OLX Group • OLX capabilities powered by Redshift • OLX Redshift technical best practices • Q&A
• 114. 114 Thank you! Questions?
Dobo Radichkov
Sr. Director, Global Analytics and Customer Lifecycle Management
Dobo@OLX.com
OLX Group | www.olx.com | Free classifieds
We are hiring! www.joinolx.com
Roles: data engineers, data scientists, PHP / Java / Android / iOS developers
Locations: Berlin, Lisbon, Buenos Aires, Dubai, Barcelona, Moscow, Delhi