OLX Group presentation for the AWS Redshift meetup in London, focusing on using Redshift to power Customer Lifecycle Management, Personalisation & Relevance, and Business Intelligence.
This is a mostly technical presentation covering OLX's best practices for using Redshift to its full potential. Topics discussed include:
- Data architecture
- Data management
- Recommenders
- Unit testing
- OLAP cubes
- Tableau integration
OLX Group presentation for AWS Redshift meetup in London, 5 July 2017
1. Free Classifieds
www.olx.com
Amazon Redshift at OLX Group
Advanced analytics and big data innovation at the
world’s largest online classifieds business
Dobo Radichkov | London, 5 July 2017
2. 2
Contents
• Introduction to Naspers and OLX Group
• OLX capabilities powered by Redshift
• OLX Redshift technical best practices
• Q&A
3. 3
Contents
• Introduction to Naspers and OLX Group
• OLX capabilities powered by Redshift
• OLX Redshift technical best practices
• Q&A
4. 4
Introducing NASPERS
An $83B global internet & entertainment group and one of the largest technology investors in the world
$15B Revenue
$2B Earnings
130 Countries
1.5B Audience reach
27,000 People
5. 5
Introducing OLX GROUP: The world’s largest classifieds business
40 Countries
20+ Offices
3,000 Employees
15+ Brands
6. 6
OLX Group is a powerful global community
Scale
• 1.7B+ monthly visits
• 35B+ monthly page views
• 60M+ monthly listings
• 300M+ monthly active users
Mobile leader
• 4.4 app rating
• #1 app in 22+ countries
• People spend more than twice as long in OLX apps versus competitors
Listed every second
• 2 houses
• 2 cars
• 3 fashion items
• 3 mobile phones
Global footprint
• 40 countries
• 20+ offices
• 3,000 employees
• 15+ brands
7. 7
Contents
• Introduction to Naspers and OLX Group
• OLX capabilities powered by Redshift
• OLX Redshift technical best practices
• Q&A
8. 8
At OLX, Redshift powers 3 important business capabilities
Customer Lifecycle Management
Personalisation & Relevance
Business Intelligence
9. 9
At OLX, Redshift powers 3 important business capabilities
Customer Lifecycle Management: Enhance the user experience and lifetime value via personalised, relevant, targeted and unified omni-channel user communications
Personalisation & Relevance
Business Intelligence
10. 10
Fundamentally, CLM is all about fuelling retention-driven growth by treating our customers the best way possible in their lifecycle stage
Lifecycle stages (activity level in buying and selling over time): Visitor → First-time user → Returning user → Loyal user (buying and selling) → Loyal user (across multiple categories) → Fan!
Increase the customer lifetime value… through the ‘right’ product, marketing and customer care treatments
11. 11
Trans-
actional
Website &
mobile app
Customer
care
Social /
3rd party
Customer
segmen-
tation
4
Insights &
analytics
3
Execution
6
Customer
treatments
5
Single
customer
view
2
Platforms
and data
1
The OLX CLM implementation is enabled by automated, data-driven,
targeted and personalised customer treatments
12. 12
At OLX, Redshift powers 3 important business capabilities
Customer Lifecycle Management: Enhance the user experience and lifetime value via personalised, relevant, targeted and unified omni-channel user communications
Personalisation & Relevance: Grow buyer engagement, seller success and transactions by showing relevant & personalised content to each of our users
Business Intelligence
13. 13
Context: Panamera is tackling the challenges of Search & Relevance across 3 pillars of content discovery, showing the most relevant content to each of our users
Home page:
• Personalised content feed driving buyer engagement and seller success, driven by buyer interests, social relationship, proximity, freshness and ad / user quality
Search experience:
• Core search results experience including text matching, spell checking, synonym mapping, language-specific optimisations, etc.
• Search auto-complete, auto-suggest, instant results and curated content
Recommendations:
• Recommended content (e.g. listings, categories, search) used to personalise elements of the buyer + seller user journey(s) based on past behaviour, preferences and activity
14. 14
At OLX, Redshift powers 3 important business capabilities
Customer Lifecycle Management: Enhance the user experience and lifetime value via personalised, relevant, targeted and unified omni-channel user communications
Personalisation & Relevance: Grow buyer engagement, seller success and transactions by showing relevant & personalised content to each of our users
Business Intelligence: Empower the business with a robust BI data platform, high-quality executive reporting, and actionable customer insights
15. 15
Contents
• Introduction to Naspers and OLX Group
• OLX capabilities powered by Redshift
• OLX Redshift technical best practices
• Q&A
16. 16
Our guiding principle for big data development using Redshift
✓ Use few but powerful technologies (and become world-class at them)
✓ Keep architecture simple and minimise points of failure
✓ Standardise, build on each other, foster continuous improvement
“Everything should be made as simple as possible, but not simpler.” (Einstein)
17. 17
OLX Redshift technical best practices
• Technical architecture
• Data management
• Recommenders
• Unit testing
• OLAP cubes
• Tableau integration
18. 18
OLX Redshift technical best practices
• Technical architecture
• Data management
• Recommenders
• Unit testing
• OLAP cubes
• Tableau integration
19. 19
Let’s walk through our high-level data architecture step by step…
RDL
ODL
Master
64 × ds1.xl
CLM platform
100 × dc1.l
RDL
ODL
RDL+
ODL+
ADL+
Analyst
sandboxes
(read / write
access)
BI platform
64 × ds1.xl
ADL
CLM
APIs
Management
dashboards
Operational
dashboards
SCV ADL
ODL ODL
Ad hoc analytics
LiveSync Hydra Moderation CRMs APIs Crawlers …
Master data infrastructure (data refreshed every ~3-5 hours)
Extended BI infrastructure (data refreshed every 24 hours)
Load / unload Transformation / modelling Replication
Reporting platform
100 × dc1.l
Mktg
channels
20. 20
LiveSync in-house technology enables dynamic synchronisation of
MySQL production databases to Redshift
LiveSync
Platform database
Live DB replica
(MySQL)
Lazarus
MySQL extractor
(Python)
S3 storage
Data lake
LiveSync
Redshift loader
(Python)
LiveSync
21. 21
Ninja / Hydra in-house multi-tracking capability collects structured
clickstream data from each client device
LiveSync
Client device
(desktop,
mobile, apps)
Ninja tracker
Client-side library
(JS, PHP,
Android, iOS)
Hydra tracker
Server-side
Java application
S3 storage
Data lake
Hydra
Ninja / Hydra
22. 22
LiveSync and Hydra raw data form the backbone of our architecture
and are loaded in Raw Data Layer (RDL) of our Master Redshift cluster
LiveSync Hydra
RDL
Master
64 × ds1.xl
Load / unload Transformation / modelling Replication
LiveSync:
• ~1,000 tables
• ~100 billion records
• ~12 TB compressed storage
Hydra:
• ~400 tables
• ~400 billion records
• ~30 TB compressed storage
Raw data layer (RDL)
23. 23
Raw data is transformed and modelled into the core Operational Data
Layer (ODL) used to feed all data applications
LiveSync Hydra
RDL
Master
64 × ds1.xl
ODL
Stats:
• ~100 tables
• ~150 billion records
• ~5 TB compressed storage
Facts:
• Listings
• Listing liquidity
• Replies
• Revenue transactions
• Clickstream events
• Listing impressions
• …
Dimensions:
• Users
• Business units
• Geographies
• Categories
• Channels
• (~40 dimensions in total)
Operational data layer (ODL)
Load / unload Transformation / modelling Replication
24. 24
From here, the ODL is replicated to each data application via our Rzeka
data replication in-house utility
LiveSync Hydra
RDL
Master
64 × ds1.xl
ODL
CLM platform
100 × dc1.l
Reporting platform
100 × dc1.l
ODL ODL
RDL
ODL
BI platform
64 × ds1.xl
Rzeka replication utility
What is it?
• A fully-configurable Python utility enabling incremental
Redshift-to-Redshift data replication
Load / unload Transformation / modelling Replication
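The Rzeka code itself is not shown in the deck; the following is a minimal sketch of the UNLOAD / COPY cycle that an incremental Redshift-to-Redshift replication utility typically wraps. The bucket, IAM role, table name and date filter are all illustrative placeholders, not OLX's actual configuration.

```sql
-- Hypothetical sketch of one incremental replication cycle (not the actual
-- Rzeka implementation). Bucket, role and table names are placeholders.

-- On the source (Master) cluster: export the incremental window to S3
UNLOAD ('SELECT * FROM odl.fact_listings WHERE date_nk >= ''2017-06-01''')
TO 's3://replication-bucket/odl/fact_listings/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-replication'
GZIP DELIMITER '|' ALLOWOVERWRITE;

-- On the target (CLM / Reporting / BI) cluster: replace the same window
DELETE FROM odl.fact_listings WHERE date_nk >= '2017-06-01';
COPY odl.fact_listings
FROM 's3://replication-bucket/odl/fact_listings/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-replication'
GZIP DELIMITER '|';
```

The delete-then-copy of a bounded date window is what makes the replication incremental and idempotent: re-running a failed cycle produces the same target state.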
25. 25
ODL is used as a basis to build the CLM Single Customer View
enabling CLM treatments and recommendations
LiveSync Hydra
RDL
Master
64 × ds1.xl
ODL
CLM platform
100 × dc1.l
ODL ODL
RDL
ODL
BI platform
64 × ds1.xl
CLM data platform
What does it do?
• User mapping, customer lifecycle segmentation,
recommendation & sort order algorithms, CLM treatments
generation & execution, CLM reporting & analytics
CLM
APIs
Mktg
channels
SCV
Load / unload Transformation / modelling Replication
Reporting platform
100 × dc1.l
26. 26
Similarly, ODL is used to generate the Analytical Data Layer (ADL) in
our Reporting platform
LiveSync Hydra
RDL
Master
64 × ds1.xl
ODL
CLM platform
100 × dc1.l
ODL ODL
RDL
ODL
SCV
Management
dashboards
ADL
BI platform
64 × ds1.xl
ADL
Triton mgmt. reporting
What is it?
• Data models and cubes for Top Management KPIs across
seller, buyer, liquidity, revenue and product activity
• Tableau dashboard implementation
Load / unload Transformation / modelling Replication
CLM
APIs
Mktg
channels
Reporting platform
100 × dc1.l
27. 27
The BI data warehouse sources additional raw data into an Extended
Raw Data Layer (RDL+)
LiveSync Hydra
RDL
Master
64 × ds1.xl
ODL
CLM platform
100 × dc1.l
ODL ODL
RDL
ODL
BI platform
64 × ds1.xl
SCV
Management
dashboards
ADL
RDL+
Moderation CRMs APIs Crawlers …
ADL
Extended Raw Data Layer (RDL+)
What is it?
• New raw data sources supporting extended management
and operational analytics across e.g. Performance
Marketing, CS, Competitor, Salesforce, etc.
Load / unload Transformation / modelling Replication
CLM
APIs
Mktg
channels
Reporting platform
100 × dc1.l
28. 28
The RDL+ is modelled into an Extended Operational & Analytical
Data Layers (ODL+ & ADL+) used to power operational reporting
LiveSync Hydra
RDL
Master
64 × ds1.xl
ODL
CLM platform
100 × dc1.l
ODL ODL
RDL
ODL
BI platform
64 × ds1.xl
SCV
Management
dashboards
ADL
RDL+
Moderation CRMs APIs Crawlers …
ADL
ODL+
ADL+
Operational
dashboards
ODL+ and ADL+
What is it?
• Data layers enabling
extended operational
reporting and
analysis
Load / unload Transformation / modelling Replication
CLM
APIs
Mktg
channels
Reporting platform
100 × dc1.l
29. 29
Ad hoc analysis enabled through read-only SQL endpoints and
read/write sandboxes inside the BI platform
LiveSync Hydra
RDL
Master
64 × ds1.xl
ODL
CLM platform
100 × dc1.l
ODL ODL
RDL
ODL
BI platform
64 × ds1.xl
SCV
Management
dashboards
ADL
RDL+
Moderation CRMs APIs Crawlers …
ADL
ODL+
ADL+
Operational
dashboards
Analyst
sandboxes
for ad hoc
analysis
Analyst
sandboxes
(read / write
access)
Ad hoc analytics
Load / unload Transformation / modelling Replication
CLM
APIs
Mktg
channels
Reporting platform
100 × dc1.l
30. 30
Detailed end-state OLX Group central data architecture
RDL
ODL
Master
64 × ds1.xl
CLM platform
100 × dc1.l
RDL
ODL
RDL+
ODL+
ADL+
Analyst
sandboxes
(read / write
access)
BI platform
64 × ds1.xl
ADL
CLM
APIs
Management
dashboards
Operational
dashboards
SCV ADL
ODL ODL
Ad hoc analytics
LiveSync Hydra Moderation CRMs APIs Crawlers …
Master data infrastructure (data refreshed every ~3-5 hours)
Extended BI infrastructure (data refreshed every 24 hours)
Load / unload Transformation / modelling Replication
Reporting platform
100 × dc1.l
Mktg
channels
31. 31
Side note: With Amazon’s new Athena and Spectrum services we are
exploring new architectural possibilities & improvements
LiveSync Hydra Moderation CRMs APIs Crawlers …
Redshift
Athena metadata catalogue (using Hive DDL)
Athena JDBC endpoint
Presto distributed SQL engine
Spectrum
external S3
tables
Redshift
native
tables
Spectrum distributed SQL engine
JOIN
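The Spectrum pattern sketched in the diagram can be expressed in DDL roughly as follows. This is an illustrative sketch only: the schema, bucket, column names and the `user_id` join key are assumptions, not the production setup.

```sql
-- Hypothetical sketch: raw clickstream stays in S3 as a Spectrum external
-- table and is joined against a native Redshift table. All names illustrative.
CREATE EXTERNAL SCHEMA spectrum_rdl
FROM DATA CATALOG DATABASE 'rdl'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum';

CREATE EXTERNAL TABLE spectrum_rdl.clickstream_events (
    user_id         BIGINT,
    event_name      VARCHAR(64),
    event_timestamp TIMESTAMP
)
PARTITIONED BY (event_date DATE)
STORED AS PARQUET
LOCATION 's3://olx-data-lake/clickstream/';

-- JOIN external S3 data with a native Redshift table
SELECT d.country_sk, COUNT(1)
FROM spectrum_rdl.clickstream_events e
JOIN odl.dim_users d ON e.user_id = d.user_id
WHERE e.event_date >= '2017-06-01'
GROUP BY 1;
```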
32. 32
OLX Redshift technical best practices
• Technical architecture
• Data management
• Recommenders
• Unit testing
• OLAP cubes
• Tableau integration
37. 37
We typically organise our data in 3 data layers
RDL
ODL
ADL
Raw Data Layer
Raw disaggregated and
unprocessed clickstream and
production database data
Operational Data Layer
Clean, structured and standardised
dimensional model serving as
foundation for all data applications
Application Data Layer
Data models specific to each data application
– including CLM algorithms, recommenders,
BI metrics, OLAP cubes, etc.
38. 38
Large scale clickstream data management
Optimise
table design
Partition
data
Create
abstraction
views
RDL
39. 39
Large scale clickstream data management
Optimise
table design
Partition
data
Create
abstraction
views
Goal is to achieve best possible distribution, date range
WHERE performance, and compression effectiveness
RDL
40. 40
Large scale clickstream data management
Optimise
table design
Partition
data
Create
abstraction
views
Goal is to achieve best possible distribution, date range
WHERE performance, and compression effectiveness
Option 1
Distribution key user_id
Sort key event_timestamp
Best possible
distribution
Date range
performance
Effective
compression
(poor compression
of text columns)
RDL
41. 41
Large scale clickstream data management
Optimise
table design
Partition
data
Create
abstraction
views
Goal is to achieve best possible distribution, date range
WHERE performance, and compression effectiveness
Option 1 Option 2
Distribution key user_id user_id
Sort key event_timestamp user_id
Best possible
distribution
Date range
performance (always full scan)
Effective
compression
(poor compression
of text columns)
(1/2x table size)
RDL
42. 42
Large scale clickstream data management
Optimise
table design
Partition
data
Create
abstraction
views
Goal is to achieve best possible distribution, date range
WHERE performance, and compression effectiveness
Option 1 Option 2 Option 3
Distribution key user_id user_id user_id
Sort key event_timestamp user_id
event_date,
user_id,
event_timestamp
Best possible
distribution
Date range
performance (always full scan)
Effective
compression
(poor compression
of text columns)
(1/2x table size) (1/2x table size)
RDL
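Option 3 from the table above can be written as a table definition along these lines (the column list is abbreviated and hypothetical): the compound sort key leads with event_date for date-range pruning, then user_id and event_timestamp for locality and compression, with user_id as the distribution key.

```sql
-- Sketch of Option 3 (illustrative columns, not the production schema)
CREATE TABLE rdl.clickstream_events (
    event_date      DATE        NOT NULL,
    user_id         BIGINT      NOT NULL,
    event_timestamp TIMESTAMP   NOT NULL,
    event_name      VARCHAR(64) ENCODE lzo
)
DISTSTYLE KEY DISTKEY (user_id)
SORTKEY (event_date, user_id, event_timestamp);
```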
43. 43
Large scale clickstream data management
Examples:
• europe_android_201706
• latam_web_201703
• asia_ios_201706
• (~650 tables in total)
Benefits:
• Easily DROP older data when no longer needed
• Minimise use of DELETE / VACUUM operations
• Localise points of failure and ringfence data repairs
• Allow for platform and channel-specific table schema
Optimise
table design
Partition
data
Create
abstraction
views
We partition our clickstream data into 1 table per
PLATFORM × CHANNEL × MONTH combination
RDL
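Under this naming convention, partition maintenance is plain DDL. A sketch (month suffixes illustrative): retiring an old month is a metadata-only DROP, which is what avoids the DELETE / VACUUM cost mentioned above.

```sql
-- Retire a month no longer needed
DROP TABLE IF EXISTS rdl.europe_android_201606;

-- Open a new monthly partition, cloning the schema of the previous one
CREATE TABLE rdl.europe_android_201707 (LIKE rdl.europe_android_201706);
```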
44. 44
Large scale clickstream data management
Types of abstraction views
• Current month → europe_android_current_month
• Previous month → latam_web_previous_month
• Last X months (X = 3/6/12) → asia_ios_last_X_months
• All months → africa_android
• View creation is automated via Python script
• (~300 views in total)
Benefits:
• Abstraction of underlying partitioning mechanism
• Time-agnostic ETLs and analytical queries
Optimise
table design
Partition
data
Create
abstraction
views
We create abstraction VIEWs over individual tables
that UNION ALL the data into relevant groups
RDL
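One such abstraction view might look as follows (the month list is illustrative; in practice these definitions are generated by the Python script mentioned above and refreshed as months roll over):

```sql
-- Sketch of a "last 3 months" abstraction view over monthly partitions
CREATE OR REPLACE VIEW rdl.europe_android_last_3_months AS
SELECT * FROM rdl.europe_android_201704
UNION ALL
SELECT * FROM rdl.europe_android_201705
UNION ALL
SELECT * FROM rdl.europe_android_201706;
```

ETLs and analytical queries reference the view, so the underlying partitioning scheme can change without touching any consumer.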
46. 46
Dimensional modelling approach
Surrogate
keys
Hierarchical
dimensions
OLX operates in dozens of markets using many
different classifieds platforms built on different
technologies and using different production databases
It is impossible to ensure system id uniqueness →
Need robust mechanism for surrogate key modelling
ODL
ADL
47. 47
Dimensional modelling approach
Surrogate
keys
Hierarchical
dimensions
OLX operates in dozens of markets using many
different classifieds platforms built on different
technologies and using different production databases
It is impossible to ensure system id uniqueness →
Need robust mechanism for surrogate key modelling
Option 1 Example
Use combination of
system ids that
guarantees uniqueness
Leads to nightmare
spaghetti SQL that is
difficult and
time-consuming to
write, read and maintain
SELECT ...
FROM odl.fact_listings f
JOIN odl.dim_categories d
ON f.platform_id = d.platform_id
AND f.country_id = d.country_id
AND f.brand_id = d.brand_id
AND f.category_id = d.category_id
AND f.category_level = d.category_level
JOIN ...
GROUP BY ...
ODL
ADL
48. 48
Dimensional modelling approach
Surrogate
keys
Hierarchical
dimensions
OLX operates in dozens of markets using many
different classifieds platforms built on different
technologies and using different production databases
It is impossible to ensure system id uniqueness →
Need robust mechanism for surrogate key modelling
Option 1 Option 2
Use combination of
system ids that
guarantees uniqueness
Create globally unique
identifier (GUID) for
each dimension value
• Complex to implement
• Requires GUID
mapping tables to be
able to trace values
back to system ids
• Dimension keys lose
semantic meaning
ODL
ADL
49. 49
Dimensional modelling approach
Surrogate
keys
Hierarchical
dimensions
OLX operates in dozens of markets using many
different classifieds platforms built on different
technologies and using different production databases
It is impossible to ensure system id uniqueness →
Need robust mechanism for surrogate key modelling
Option 1 Option 2 Option 3
Use combination of
system ids that
guarantees uniqueness
Create globally unique
identifier (GUID) for
each dimension value
Create smart &
persistent surrogate
keys that preserve
semantic meaning
ODL
ADL
50. 50
Dimensional modelling approach
Surrogate
keys
Hierarchical
dimensions
SELECT ...
FROM fact_listings
JOIN dim_countries USING (country_sk) -- 'olx|eu|ua'
JOIN dim_categories USING (category_sk) -- 'olx|asia|in|5|84|1531'
JOIN dim_geographies USING (geography_sk) -- 'olx|eu|ua|17|194|194'
JOIN dim_channels USING (channel_sk) -- 'mobile_app|android'
JOIN dim_listing_status USING (listing_status_sk) -- 'inactive|mod|mod_removed'
JOIN dim_listing_types USING (listing_type_sk) -- 'private'
JOIN dim_listing_feeds USING (listing_feed_sk) -- 'normal'
JOIN dim_listing_net USING (listing_net_sk) -- 'net|mod>live>eod'
JOIN dim_currencies USING (currency_sk) -- 'aed'
JOIN dim_users USING (user_sk)
--'olx|latam|pe|platform|email|freddy@gmail.com'
JOIN ...
GROUP BY ...
Example
Benefits:
• Simple and guaranteed JOINs
• ‘Readable’ key values
• Negligible impact on query performance (= Redshift rocks!)
ODL
ADL
51. 51
Dimensional modelling approach
Surrogate
keys
Hierarchical
dimensions
We model most dimensions as hierarchical dimensions
ODL
ADL
Approach:
• Extract parent-child relationship from system dimensions
• For non-system dimensions, define child-parent relationship in
configuration tables
• Create hierarchical dimensions using generic SQL hierarchy
generation script
• Set all dimension key values in fact tables as deepest available
hierarchy level (ideally key of tree leaf values)
Benefits:
• Consistent dimensional modelling approach
• Can easily traverse hierarchy from leaves all the way up to root
• Easy to read & write JOINs with single-key ON conditions
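The flattened-hierarchy shape this approach produces can be sketched as below. The level columns (`category_l1_name`, `category_l2_name`) are hypothetical: the point is that each dimension row carries its full ancestor path, so rolling up from the leaf-level surrogate key to any level is a single-key JOIN plus GROUP BY.

```sql
-- Sketch: roll leaf-level facts up to the top two category levels
SELECT c.category_l1_name,   -- root level, e.g. 'Vehicles'
       c.category_l2_name,   -- e.g. 'Cars'
       COUNT(1) AS listings
FROM odl.fact_listings f
JOIN odl.dim_categories c USING (category_sk)  -- leaf-level surrogate key
GROUP BY 1, 2;
```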
56. 56
Simplest workflow typically takes an input and applies dimensional
modelling & business rules to transform the data into desired output
Input table(s)
Transformation
Transformation
View
Table
SELECT
INSERT
Dimension(s)
Map(s)
Every component uses
the output from other
previously modelled
components as input
Relevant Dimensions
and Maps are included
to apply dimensional
model and required
business rules
SQL transformation logic is
decoupled and stored in a
separate Transformation VIEW
CREATE TABLE fact_output (...)
DISTSTYLE KEY DISTKEY (...) SORTKEY(...);
CREATE OR REPLACE VIEW fact_output_view AS
SELECT ... -- business logic and projections
FROM fact_input
JOIN dimensions
JOIN maps
GROUP BY ...; -- aggregation logic
TRUNCATE fact_output;
INSERT INTO fact_output
SELECT * FROM fact_output_view;
ANALYZE fact_output;
Final table that
contains output of
data workflow
57. 57
For more complex workflows, we use one or more staging steps to
ensure code modularity and have better control over performance
Input table(s)
View
Table
SELECT
INSERT
Transformation
staging step(s)
Transformation
Transformation
staging step(s)
Transformation
Dimension(s)
Map(s)
Intermediate staging logic is
encapsulated in separate VIEWs
Staging output is materialised in
dedicated tables to be used as input into
subsequent transformation steps
CREATE TABLE fact_output_staging_step1 (...)
DISTSTYLE KEY DISTKEY (...) SORTKEY(...);
CREATE TABLE fact_output_staging_step2 (...)
DISTSTYLE KEY DISTKEY (...) SORTKEY(...);
-- ... more staging steps as required ...
CREATE TABLE fact_output (...)
DISTSTYLE KEY DISTKEY (...) SORTKEY(...);
CREATE OR REPLACE VIEW
fact_output_staging_step1_view AS
SELECT ... -- business logic and projections
FROM fact_input
JOIN ...
GROUP BY ...; -- aggregation logic
CREATE OR REPLACE VIEW
fact_output_staging_step2_view AS
SELECT ... -- business logic and projections
FROM fact_output_staging_step1
JOIN ...
GROUP BY ...; -- aggregation logic
-- ... more staging steps as required ...
CREATE OR REPLACE VIEW fact_output_view AS
SELECT ... -- business logic and projections
FROM fact_output_staging_step2
JOIN ...
GROUP BY ...; -- aggregation logic
TRUNCATE fact_output_staging_step1;
INSERT INTO fact_output_staging_step1
SELECT * FROM fact_output_staging_step1_view;
ANALYZE fact_output_staging_step1;
TRUNCATE fact_output_staging_step2;
INSERT INTO fact_output_staging_step2
SELECT * FROM fact_output_staging_step2_view;
ANALYZE fact_output_staging_step2;
-- ... more staging steps as required ...
TRUNCATE fact_output;
INSERT INTO fact_output
SELECT * FROM fact_output_view;
ANALYZE fact_output;
58. 58
We use Feeders to apply the same Transformation to different Inputs
Input table(s)
View
Table
SELECT
INSERT
Feeder
Transformation
Transformation
Dimension(s)
Map(s)
CREATE OR REPLACE VIEW fact_output_feeder1_view AS
-- Combine & apply projections / pre-processing
SELECT ... FROM fact_input1 UNION ALL
SELECT ... FROM fact_input2 UNION ALL
SELECT ... FROM fact_input3;
CREATE OR REPLACE VIEW fact_output_feeder2_view AS
-- Combine & apply projections / pre-processing
SELECT ... FROM fact_input4 UNION ALL
SELECT ... FROM fact_input5 UNION ALL
SELECT ... FROM fact_input6;
CREATE OR REPLACE VIEW fact_output_feeder_view AS
SELECT * FROM fact_output_feeder2_view;
CREATE OR REPLACE VIEW fact_output_view AS
SELECT ... -- business logic and projections
FROM fact_output_feeder_view
...;
TRUNCATE fact_output;
-- Run transformation on fact_input1
CREATE OR REPLACE VIEW fact_output_feeder_view AS
SELECT * FROM fact_output_feeder1_view;
INSERT INTO fact_output
SELECT * FROM fact_output_view;
ANALYZE fact_output;
-- Run transformation on fact_input2
CREATE OR REPLACE VIEW fact_output_feeder_view AS
SELECT * FROM fact_output_feeder2_view;
INSERT INTO fact_output
SELECT * FROM fact_output_view;
ANALYZE fact_output;
Feeders are VIEWs that decouple the
input data from the Transformation logic.
They can be used to collate multiple
inputs (e.g. UNION ALL) & apply basic
pre-processing and projections serving
as basis for rest of data flow
Input table(s)
59. 59
Feeders can be very powerful with incremental data workflows
Input table(s)
Transformation
Transformation
Dimension(s)
Map(s)
View
Table
SELECT
INSERT
Fast incremental
load feeder
Slow incremental
load feeder
Full load feeder
CREATE OR REPLACE VIEW fact_output_feeder_full_load_view AS
SELECT ... -- projections & pre-processing
-- Use full available time range
FROM fact_input;
CREATE OR REPLACE VIEW fact_output_feeder_incr_load_slow_view AS
SELECT ... -- projections & pre-processing
FROM fact_input
-- Use 4-week incremental time window
WHERE date_nk >= (GETDATE() :: DATE - INTERVAL '4 week') :: DATE;
CREATE OR REPLACE VIEW fact_output_feeder_incr_load_fast_view AS
SELECT ... -- projections & pre-processing
FROM fact_input
-- Use 2-day incremental time window
WHERE date_nk >= (GETDATE() :: DATE - INTERVAL '2 day') :: DATE;
CREATE OR REPLACE VIEW fact_output_feeder_view AS
SELECT * FROM fact_output_feeder_incr_load_fast_view;
Typically we switch between 3 load feeders
depending on the ETL processing approach:
(1) Full load – Processes the input in its
entirety. Used when running the
transformation for the first time at full scale.
(2) Incremental Fast load – Processes last
few days / hours of data. Used by
default for scheduled production job.
(3) Incremental Slow load – Processes last
month of data. Used to manually fix
problems from previous runs without
having to re-process years of data.
60. 60
Another great use of Feeders is as a method to switch between
Production and Development environments
Input table(s)
Transformation
Transformation
Dimension(s)
Map(s)
Development feeder Production feeder
View
Table
SELECT
INSERT
We use Production /
Development feeder VIEWs to
switch between full scale (e.g. all
countries) and development
scale (e.g. few small countries).
This enables fast ETL runtimes
during development and testing.
CREATE OR REPLACE VIEW fact_output_feeder_development_view AS
SELECT ... -- projections & pre-processing
FROM fact_input
WHERE country_sk IN ('olx|mea|gh', 'olx|mea|za');
CREATE OR REPLACE VIEW fact_output_feeder_production_view AS
SELECT ... -- projections & pre-processing
FROM fact_input;
CREATE OR REPLACE VIEW fact_output_feeder_view AS
SELECT * FROM fact_output_feeder_development_view;
61. 61
Ultimately, these patterns can be combined in various ways
depending on the requirements of the data workflow
Input table(s)
Development feeder
Transformation
staging step(s)
Transformation
Production feeder
Fast incremental
load feeder
Slow incremental
load feeder
Full load feeder
Transformation
staging step(s)
Transformation
Root feeder
View
Table
SELECT
INSERT
Dimension(s)
Map(s)
62. 62
Individual data workflows add up to our full data management
ecosystem
Component 1
Component 2
Component 3
Component 4
…
…
65. 65
Example repository structure and file naming
We use standard file naming
with special prefixes to
decouple the logical building
blocks of the data workflow:
• Table definition
• View definition (feeders,
transformations, unit tests)
• ETL scripts
• Configuration scripts
• Analysis
66. 66
OLX Redshift technical best practices
• Technical architecture
• Data management
• Recommenders
• Unit testing
• OLAP cubes
• Tableau integration
74. 74
Collaborative filtering implementation
Get and rate all
relevant item
interactions
Compute item
similarity matrix
Compute 1st and 2nd
degree item-to-item
recommendations
Compute personalised
user-to-item
recommendations
75. 75
Collaborative filtering implementation
Get and rate all
relevant item
interactions
Compute item
similarity matrix
Compute 1st and 2nd
degree item-to-item
recommendations
Compute personalised
user-to-item
recommendations
CREATE TABLE item_interactions AS
SELECT user,
band,
SUM(DECODE(action, 'like', 1, 'play', 3)) AS score
FROM clickstream
WHERE action IN ('like', 'play')
GROUP BY 1, 2;
user band score
Jack Depeche Mode 5
Jack The Cure 3
Jack Keane 8
Jill Depeche Mode 4
Jill Keane 10
Jill Suede 7
Jim Keane 15
… … …
76. 76
Collaborative filtering implementation
Get and rate all
relevant item
interactions
Compute item
similarity matrix
Compute 1st and 2nd
degree item-to-item
recommendations
Compute personalised
user-to-item
recommendations
CREATE TABLE similarity_matrix AS
SELECT i1.band AS band1,
i2.band AS band2,
COUNT(1) AS frequency,
SUM(i1.score + i2.score) AS score
FROM item_interactions i1
JOIN item_interactions i2
ON i1.user = i2.user
AND i1.band <> i2.band
GROUP BY 1,2
HAVING COUNT(1) > 1;
user band score
Jack Depeche Mode 5
Jack The Cure 3
Jack Keane 8
Jill Depeche Mode 4
Jill Keane 10
Jill Suede 7
Jim Keane 15
… … …
Depeche
Mode
TheCure
Keane
Placebo
Suede
Depeche Mode 3 4
The Cure 3 3
Keane 4 3 5
Placebo 2
Suede 5 2
77. 77
user band score
Jack Depeche Mode 5
Jack The Cure 3
Jack Keane 8
Jill Depeche Mode 4
Jill Keane 10
Jill Suede 7
Jim Keane 15
… … …
Depeche
Mode
TheCure
Keane
Placebo
Suede
Depeche Mode 3 4
The Cure 3 3
Keane 4 3 5
Placebo 2
Suede 5 2
Collaborative filtering implementation
Get and rate all
relevant item
interactions
Compute item
similarity matrix
Compute 1st and 2nd
degree item-to-item
recommendations
Compute personalised
user-to-item
recommendations
1st degree item-to-item recommendations
band1 band2 frequency sum_score rec_rank
Depeche Mode Keane 4 49 1
Depeche Mode The Cure 3 39 2
Keane Suede 5 80 1
Keane Depeche Mode 4 49 2
Keane The Cure 3 48 3
Placebo Suede 2 62 1
Suede Keane 5 80 1
Suede Placebo 2 62 2
The Cure Keane 3 48 1
The Cure Depeche Mode 3 39 2
CREATE TABLE dobo.rec_item2item AS
SELECT band1, band2, frequency, sum_score,
ROW_NUMBER() OVER (PARTITION BY band1 ORDER BY frequency DESC, sum_score DESC) AS rec_rank
FROM similarity_matrix;
78. 78
Collaborative filtering implementation
Get and rate all
relevant item
interactions
Compute item
similarity matrix
Compute 1st and 2nd
degree item-to-item
recommendations
Compute personalised
user-to-item
recommendations
INSERT INTO dobo.rec_item2item
WITH max_rank AS (
SELECT band1, MAX(rec_rank) AS max_rank_1st_degree
FROM dobo.rec_item2item
GROUP BY 1
)
SELECT rec_1st.band1,
rec_2nd.band2,
NULL AS frequency,
NULL AS sum_score,
ROW_NUMBER() OVER ( PARTITION BY rec_1st.band1
ORDER BY MIN(rec_1st.rec_rank *
rec_2nd.rec_rank),
MIN(rec_1st.rec_rank)
) + max_rank_1st_degree AS rec_rank
FROM dobo.rec_item2item rec_1st
JOIN max_rank USING (band1)
JOIN dobo.rec_item2item rec_2nd
ON rec_1st.band2 = rec_2nd.band1
AND rec_1st.band1 <> rec_2nd.band2
-- exclude items already in 1st degree recommendations
LEFT JOIN dobo.rec_item2item rec_excl
ON rec_1st.band1 = rec_excl.band1
AND rec_2nd.band2 = rec_excl.band2
WHERE rec_excl.band1 IS NULL
GROUP BY 1, 2, max_rank_1st_degree;
1st degree item-to-item recommendations
band1 band2 frequency sum_score rec_rank
Depeche Mode Keane 4 49 1
Depeche Mode The Cure 3 39 2
Keane Suede 5 80 1
Keane Depeche Mode 4 49 2
Keane The Cure 3 48 3
Placebo Suede 2 62 1
Suede Keane 5 80 1
Suede Placebo 2 62 2
The Cure Keane 3 48 1
The Cure Depeche Mode 3 39 2
2nd degree item-to-item recommendations
band1 band2 frequency sum_score rec_rank
Depeche Mode Suede 3
Keane Placebo 4
Placebo Keane 2
Suede Depeche Mode 3
Suede The Cure 4
The Cure Suede 3
79. 79
Collaborative filtering implementation
Get and rate all
relevant item
interactions
Compute item
similarity matrix
Compute 1st and 2nd
degree item-to-item
recommendations
Compute
personalised
user-to-item
recommendations
user band score
Jack Depeche Mode 5
Jack The Cure 3
Jack Keane 8
Jill Depeche Mode 4
Jill Keane 10
Jill Suede 7
Jim Keane 15
… … …
Depeche
Mode
TheCure
Keane
Placebo
Suede
Depeche Mode 3 4
The Cure 3 3
Keane 4 3 5
Placebo 2
Suede 5 2
user-to-item recommendations
user band frequency sum_score rec_rank
Ana Placebo 62 6 1
Ana Depeche Mode 49 5 2
Ana The Cure 48 7 3
Dave Placebo 62 6 1
Dave Depeche Mode 49 5 2
Dave The Cure 48 7 3
Eric Placebo 62 6 1
Eric Depeche Mode 49 5 2
Eric The Cure 48 7 3
Jack Placebo 4 1
Jack Suede 80 7 2
Jen Depeche Mode 3 1
Jen The Cure 4 2
Jen Keane 80 3 3
Jill The Cure 87 9 1
Jill Placebo 62 6 2
Jim Depeche Mode 49 5 1
Jim The Cure 48 7 2
John Placebo 4 1
John Suede 80 7 2
Sam Placebo 4 1
Sam Suede 80 7 2
SELECT ia.user, rec.band2,
SUM(rec.sum_score) AS frequency, SUM(rec.rec_rank) AS sum_score,
ROW_NUMBER() OVER ( PARTITION BY ia.user ORDER BY
SUM(rec.sum_score) DESC, SUM(rec.rec_rank) ASC) AS rec_rank
FROM dobo.item_interactions ia -- 'int' is a reserved word, so use a safe alias
JOIN dobo.rec_item2item rec
ON ia.band = rec.band1
-- Exclude recommendations that the user already interacted with
LEFT JOIN dobo.item_interactions excl
ON rec.band2 = excl.band
AND ia.user = excl.user
WHERE excl.band IS NULL
GROUP BY 1,2;
80. 80
At OLX, we apply this approach to implement a variety of recommenders:
• Item-to-item: People who viewed items A, B, C also viewed items X, Y, Z
• Category-to-category: People who bought Cars were also interested in Car Parts
• Search-to-category: People who searched for ‘black leather sofa’ were interested in Furniture
• Search-to-search: People who searched for ‘porsche’ also searched for ‘bmw’, ‘mercedes’, ‘ferrari'
• Category-to-search: People who browsed Mobile phones searched for ‘iphone 7’, ‘samsung galaxy’, …
• + many more …
82. 82
Other examples
Related search
recommendations for
browsing users
Personalised item
recommendations for
active buyers
Recommended
categories to post for
active sellers
Users like you also liked
Try also these related searches
Here are some other selling ideas
83. 83
OLX Redshift technical best practices
• Technical architecture
• Data management
• Recommenders
• Unit testing
• OLAP cubes
• Tableau integration
85. 85
OLX is developing Qualis – a unit testing framework for Redshift / SQL
ü Switch from reactive to proactive
error handling
ü Enable SQL codebase scale out
ü Reduce maintenance time and ad
hoc data investigations
ü Make data platform more robust
ü Free up time for innovation
86. 86
Qualis includes Redshift-side framework (being piloted) and Python test
automation & visualisation (currently in development)
Test 1 Test 2 Test 3
Test 4 Test 5 Test 6
Test 7 Test 8 Test 9
Test 10 Test 11 …
Redshift database Python
Qualis tests are
implemented in Redshift
using VIEWs that return
standard output &
PASS/FAIL result
Qualis script runs daily
and SELECTs test output
from each VIEW using
flexible configuration and
parses & aggregates
results
Final output is visualised
using plain text and / or
third party visualisation
tool (e.g. Tableau)
Visualisation
87. 87
Example #1: Duplicates detection test
CREATE OR REPLACE VIEW
clm.utest_fact_segmentation_duplication_view AS
WITH test AS (
SELECT country_sk AS country_sk,
COUNT(1) AS cnt,
COUNT(DISTINCT user_sk) AS cnt_distinct
FROM clm.fact_segmentation
GROUP BY 1
)
SELECT 'fact_segmentation' AS test_module,
country_sk,
'duplication' AS test_group,
NULL AS test_subgroup,
NULL AS test_instance,
CASE WHEN cnt = cnt_distinct THEN 'PASS' ELSE
'FAIL' END || ' [cnt: ' || cnt || '; COUNT(DISTINCT): ' ||
cnt_distinct || ']' AS duplicates
FROM test
ORDER BY 1,2,3,4,5;
Compares overall
COUNT to
COUNT(DISTINCT) for
each OLX country to
detect data duplication
88. 88
Example #1: Duplicates detection test
CREATE OR REPLACE VIEW
clm.utest_fact_segmentation_duplication_view AS
WITH test AS (
SELECT country_sk AS country_sk,
COUNT(1) AS cnt,
COUNT(DISTINCT user_sk) AS cnt_distinct
FROM clm.fact_segmentation
GROUP BY 1
)
SELECT 'fact_segmentation' AS test_module,
country_sk,
'duplication' AS test_group,
NULL AS test_subgroup,
NULL AS test_instance,
CASE WHEN cnt = cnt_distinct THEN 'PASS' ELSE
'FAIL' END || ' [cnt: ' || cnt || '; COUNT(DISTINCT): ' ||
cnt_distinct || ']' AS duplicates
FROM test
ORDER BY 1,2,3,4,5;
Compares overall
COUNT to
COUNT(DISTINCT) for
each OLX country to
detect data duplication
89. 89
Example #2: Gap detection in time-series data
Aggregates data into hourly
buckets and identifies any
missing hours, while including
some logic to reduce false
positives (e.g. during night hours
in smaller OLX markets)
CREATE OR REPLACE VIEW clm.utest_fact_event_clickstream_agg_user_mapped_gaps_view AS
WITH
hours AS (
SELECT country_sk,
DATE_TRUNC('HOUR', current_local_time - INTERVAL '1 HOUR' * row_num) AS hour
FROM global_bi.dim_counter
CROSS JOIN clm.fact_country_current_time
WHERE row_num BETWEEN 25 AND 2 * 7 * 24 -- 2 weeks x 7 days x 24 hours (start checking for gaps from 24 hours ago)
),
test AS (
SELECT country_sk,
DATE_TRUNC('HOUR', time_event_local) AS hour,
COUNT(1) AS cnt
FROM (
SELECT *,
-- Use average of first and last event timestamp within aggregated time window to approximate overall timing of event(s)
TIMESTAMP 'epoch' + INTERVAL '1 second' * ((
DATEDIFF('second', 'epoch', time_first_event_local) +
DATEDIFF('second', 'epoch', time_last_event_local)
) / 2) AS time_event_local
FROM clm.fact_event_clickstream_agg_user_mapped
) fc
WHERE date_event_nk >= (GETDATE() :: DATE - INTERVAL '16 DAYS') :: DATE -- Performance filter
GROUP BY 1,2
),
countries_in_scope AS (
SELECT country_sk,
AVG(1.0 * cnt) AS avg_cnt_allday,
AVG(1.0 * CASE WHEN DATE_PART('hour', hour) BETWEEN 0 AND 6 THEN cnt END) AS avg_cnt_night
FROM test
GROUP BY 1
)
SELECT 'fact_event_clickstream_agg_user_mapped' AS test_module,
country_sk,
'gaps' AS test_group,
NULL AS test_subgroup,
DATE_TRUNC('day', hour) :: DATE AS test_instance,
CASE WHEN COUNT(CASE WHEN COALESCE(test.cnt, 0) = 0 THEN 1 END) = 0 THEN 'PASS' ELSE 'FAIL'
|| ' [' || COUNT(CASE WHEN COALESCE(test.cnt, 0) = 0 THEN 1 END) || ' missing hours: '
|| LISTAGG(CASE WHEN COALESCE(test.cnt, 0) = 0 THEN DATE_PART('hour', hour) END, ',')
WITHIN GROUP (ORDER BY DATE_PART('hour', hour)) || '; Avg/hr (all day): ' || avg_cnt_allday :: INT
|| '; Avg/hr (night only): ' || avg_cnt_night :: INT || ']'
END AS gap_test
FROM hours
JOIN countries_in_scope USING (country_sk)
LEFT JOIN test USING (country_sk, hour)
WHERE -- Do not test night hours if night activity is very low on average
CASE WHEN avg_cnt_night <= 500 THEN DATE_PART('hour', hour) NOT BETWEEN 0 AND 6 ELSE 1 :: BOOL END = 1 :: BOOL
-- Do not test countries with super low activity to minimise false FAILs
AND avg_cnt_allday >= 500
GROUP BY 2,5,avg_cnt_allday,avg_cnt_night
ORDER BY 1,2,3,4,5;
91. 91
Example #3: Simple business logic validation
Configuration sheet specifies
test rules (1 cell = 1 test) → in
this example, testing for data
coverage (minimum % of
records with non-NULL values)
per customer segment
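Such a coverage rule can be sketched as follows (a minimal illustration under assumed inputs; the column name, records, and thresholds are hypothetical, not from the configuration sheet):

```python
def coverage_test(records, column, threshold):
    """PASS if at least `threshold` fraction of records have a non-NULL
    value in `column` -- mirroring one cell (= one test) of the sheet."""
    if not records:
        return "FAIL (no records)"
    non_null = sum(1 for r in records if r.get(column) is not None)
    coverage = non_null / len(records)
    return "PASS" if coverage >= threshold else f"FAIL ({coverage:.0%} < {threshold:.0%})"

# One segment: 3 of 4 records populated, tested against a 90% coverage rule
records = [{"email": "a@x.com"}, {"email": "b@x.com"},
           {"email": None}, {"email": "c@x.com"}]
print(coverage_test(records, "email", 0.90))  # FAIL (75% < 90%)
```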
93. 93
Example #4: Complex business logic validation
The most advanced test case (to date) validates segment business rules
using equality / inequality conditions across different segments /
dimensions
96. 96
OLX Redshift technical best practices
• Technical architecture
• Data management
• Recommenders
• Unit testing
• OLAP cubes
• Tableau integration
97. 97
Challenge: OLX has complex global reporting needs
Dimensions (avg. cardinality):
• Time (month / week / day) → ~200
• Business unit (4-level hierarchy) → ~50
• Category (6-level hierarchy) → ~600
• Geography (3-level hierarchy) → ~220
• Channel (3-level hierarchy) → ~10
• Segment (3-way segmentation) → ~27
• Type (3-level hierarchy) → ~7
Measures: ~20 measures (additive & non-additive)
Measure variants: Current, Lag, Year ago, Target
Up to ~200 trillion data points!!
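The headline figure follows directly from multiplying the cardinalities above — a quick back-of-the-envelope check:

```python
# Average cardinality per dimension, from the slide
dims = {"time": 200, "business_unit": 50, "category": 600,
        "geography": 220, "channel": 10, "segment": 27, "type": 7}
measures = 20
variants = 4  # Current, Lag, Year ago, Target

points = measures * variants
for c in dims.values():
    points *= c
print(f"{points:.2e}")  # ~2e14, i.e. ~200 trillion data points
```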
98. 98
3 possible solutions:
1. Query disaggregated data and compute measures in real-time
   1a. Keep data in Redshift → on-the-fly calculations over billions of
       records are not fast enough for a responsive user experience
   1b. Load data into fast columnar storage (e.g. Cassandra or Tableau's
       internal database) → size of disaggregated data exceeds limits
       (e.g. Tableau) & is too large to efficiently enable daily loads
2. Use a standalone OLAP product from a big-name database vendor →
   too expensive, too complex, requires hiring specialists with domain
   knowledge
3. OLX Redshift OLAP framework → pre-aggregated cubes with direct
   Tableau integration offer the most pragmatic & simplest solution
99. 99
Under the framework, we use a configurable cube matrix to specify the
slices we are interested in reporting on.

Measures:
• Number of users – additive only across Business Unit dimension
• Number of listings – additive across all dimensions

Example cube matrix dimensions:
• Time: Q3 2017 → Apr 2017 (1 Apr, 2 Apr, …, 30 Apr), May 2017 (1 May,
  2 May, …, 31 May), Jun 2017 (1 Jun, 2 Jun, …, 30 Jun)
• Channel: Total → Desktop web, Mobile web, Mobile apps (Android, iOS)
• Business unit: Total → Europe (PL, PT), LATAM (AR, CO)

Dimension:    Time (non-additive)    Channel (non-additive)   Business unit
Perspective:  Quarter  Month  Day    Total  L1  L2             Total  Region  Country
Cardinality:  1        3      91     1      3   2              1      2       4

# records in cube:
Full cube:   3,990
Sub-cube 1:    364
Sub-cube 2:     28
Sub-cube 3:     80

472 vs. 3,990 records (~12% of the size of the full cube)
Example
100. 100
In this example, the 3 sub-cubes translate into 11 slices:

Dimension:    Time (non-additive)    Channel (non-additive)   Business unit
Perspective:  Quarter  Month  Day    Total  L1  L2             Total  Region  Country
Cardinality:  1        3      91     1      3   2              1      2       4

# records per slice:
Slice 1:  364    Slice 4:   2    Slice 7:   3    Slice 10:   8
Slice 2:    4    Slice 5:   6    Slice 8:  12    Slice 11:  24
Slice 3:   12    Slice 6:   1    Slice 9:  36
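Each slice fixes one perspective per dimension, so its record count is simply the product of the chosen cardinalities (e.g. 364 = 91 days × 1 channel total × 4 countries). A sketch that enumerates slices as cartesian products over the cardinality table above (the slice-to-perspective mapping is inferred from the arithmetic, not stated on the slide):

```python
from itertools import product

# Perspective cardinalities from the example cube matrix
time = {"quarter": 1, "month": 3, "day": 91}
channel = {"total": 1, "l1": 3, "l2": 2}
bus_unit = {"total": 1, "region": 2, "country": 4}

def slice_size(t, c, b):
    """A slice fixes one perspective per dimension; its record count is
    the product of the three perspective cardinalities."""
    return time[t] * channel[c] * bus_unit[b]

# The full cube is the union of every possible slice
full_cube = sum(slice_size(t, c, b)
                for t, c, b in product(time, channel, bus_unit))
print(full_cube)                              # 3990
print(slice_size("day", "total", "country"))  # 364 (Slice 1)
```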
101. 101
Pseudo-SQL OLAP cube implementation
SELECT CASE WHEN cube_matrix.time_quarter THEN dim_time.quarter_name
WHEN cube_matrix.time_month THEN dim_time.month_name
WHEN cube_matrix.time_day THEN dim_time.date
END AS time_value,
CASE WHEN cube_matrix.channel_total THEN 'Total'
WHEN cube_matrix.channel_l1 THEN dim_channel.channel_l1_name
WHEN cube_matrix.channel_l2 THEN dim_channel.channel_l2_name
END AS channel_value,
CASE WHEN cube_matrix.bus_unit_total THEN 'Total'
WHEN cube_matrix.bus_unit_region THEN dim_bus_unit.region_name
WHEN cube_matrix.bus_unit_country THEN dim_bus_unit.country_name
END AS bus_unit_value,
COUNT(1) AS num_listings,
COUNT(DISTINCT user_key) AS num_users
FROM fact_listings
JOIN dim_time USING (time_key)
JOIN dim_channel USING (channel_key)
JOIN dim_bus_unit USING (bus_unit_key)
CROSS JOIN cube_matrix
GROUP BY 1,2,3
CHALLENGE: This CROSS JOIN can be very expensive, as it explodes the
input fact table by the number of slices configured in the cube matrix.
SOLUTION: Aggregate only the non-additive dimensions first (Time and
Channel), then aggregate the additive dimensions (Business Unit) using
the already partially aggregated output.
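The two-phase idea can be sketched as follows (an illustrative toy data model, assuming the listing count is additive across Business Unit while Time and Channel must be aggregated from the raw grain):

```python
from collections import defaultdict

# Raw facts at (day, channel, country) grain: key -> listing count
facts = {
    ("2017-04-01", "android", "PL"): 10,
    ("2017-04-01", "ios",     "PL"): 5,
    ("2017-04-02", "android", "PT"): 7,
}

# Phase 1: aggregate the NON-additive dims (day -> month, channel -> total)
# while keeping the additive dim (country) at its finest grain.
phase1 = defaultdict(int)
for (day, channel, country), cnt in facts.items():
    month = day[:7]
    phase1[(month, "Total", country)] += cnt

# Phase 2: roll up the ADDITIVE dim (country -> region) from the much
# smaller phase-1 output instead of re-scanning the raw fact table.
region_of = {"PL": "Europe", "PT": "Europe"}
phase2 = defaultdict(int)
for (month, channel, country), cnt in phase1.items():
    phase2[(month, channel, region_of[country])] += cnt

print(dict(phase2))  # {('2017-04', 'Total', 'Europe'): 22}
```

The point is that the expensive pass over raw facts happens once (phase 1); every additive roll-up afterwards reads only the partially aggregated output.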
102. 102
Summary of OLX Redshift OLAP framework
Input: Operational data model from the OLX data warehouse –
Facts (~15 tables) + Dimensions (~15 tables), ~250B records

Step 1: Prepare cube pre-aggregates at the smallest grain possible
(i.e. user), driven by the OLAP cube configuration (definition of the
cube dimensional model and cube matrix) → Cube 1, 2, 3, …
pre-aggregates, ~6B records

Step 2: Aggregate all non-additive cube slices → Cube 1, 2, 3, …
non-additive aggregations, ~100M records

Step 3: Aggregate all additive cube slices using the output from the
previous step → Cube 1, 2, 3, … additive aggregations, ~460M records

Step 4: Combine the individual cube outputs into a single cube →
Consolidated cube, ~200M records
109. 109
OLX Redshift technical best practices
• Technical architecture
• Data management
• Recommenders
• Unit testing
• OLAP cubes
• Tableau integration
110. 110
Summary of OLX Redshift OLAP framework
(Input and Steps 1–4 as in the earlier framework summary: ~250B input
records → ~6B cube pre-aggregates → ~100M non-additive aggregations →
~460M additive aggregations → ~200M-record consolidated cube.)
Output: Tableau view → Tableau live dashboard connection to a dedicated
Tableau Redshift cluster.
The Tableau abstraction view adds derivative measures (calculated on
the fly) and formatted values needed to implement the reporting
dashboard.
Strive for 1 query per Tableau interaction, running within max. ~3 sec.
112. 112
Debugging Tableau interaction with Redshift
-- Get the last queries run from Tableau
SELECT DATEDIFF('ms', starttime, endtime) / 1000.0 AS duration,
query, xid, pid, starttime, querytxt
FROM stl_query
WHERE userid = 105
ORDER BY starttime DESC
LIMIT 100;
-- Get the full SQL of one transaction (re-inserting the newlines/tabs
-- that svl_statementtext stores as literal '\n' text)
SELECT LISTAGG(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
text, '\\n', '\n' ),
'"', '' ),
', ', ',\n\t\t' ),
'AND ', '\n AND\t' ),
'FROM ', '\n FROM\t' ),
'WHERE ', '\n WHERE\t' ),
'SELECT ', '\nSELECT\t' ),
'declare', '--declare' ),
'')
WITHIN GROUP (ORDER BY sequence, starttime) AS sql
FROM svl_statementtext
WHERE userid = 105
AND xid = 5650302
AND text NOT LIKE 'begin%'
AND text NOT LIKE 'fetch%'
AND text NOT LIKE 'close%';
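The chain of REPLACEs above just re-inserts line breaks before major SQL clauses so the logged one-line query becomes readable. The same reflow can be sketched in Python (illustrative only; the input string is a made-up query, not real svl_statementtext output):

```python
def reflow_sql(text):
    """Re-insert line breaks before major SQL clauses so a query logged
    as one long line becomes readable."""
    for old, new in [
        ('\\n', '\n'),         # un-escape literal "\n" sequences
        ('"', ''),             # drop quoting noise
        (', ', ',\n\t\t'),     # one select-list item per line
        ('AND ', '\n  AND '),
        ('FROM ', '\nFROM '),
        ('WHERE ', '\nWHERE '),
        ('SELECT ', '\nSELECT '),
    ]:
        text = text.replace(old, new)
    return text

print(reflow_sql('SELECT a, b FROM t WHERE a = 1 AND b = 2'))
```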
113. 113
Contents
• Introduction to Naspers and OLX Group
• OLX capabilities powered by Redshift
• OLX Redshift technical best practices
• Q&A
114. 114
Thank you! Questions?
Dobo Radichkov
Sr. Director, Global Analytics and
Customer Lifecycle Management
Dobo@OLX.com
OLX Group
www.olx.com
Free classifieds
We are hiring!
www.joinolx.com
Roles:
• Data engineers
• Data scientists
• PHP / Java / Android
/ iOS developers
Locations: Berlin,
Lisbon, Buenos Aires,
Dubai, Barcelona,
Moscow, Delhi