OLX Group presentation for the AWS Redshift meetup in London, focusing on using Redshift to power Customer Lifecycle Management, Personalisation & Relevance, and Business Intelligence.
This is a mostly technical presentation covering OLX's best practices for using Redshift to its full potential. Topics discussed include:
- Data architecture
- Data management
- Recommenders
- Unit testing
- OLAP cubes
- Tableau integration
OLX Group presentation for AWS Redshift meetup in London, 5 July 2017
1. Free Classifieds
www.olx.com
Amazon Redshift at OLX Group
Advanced analytics and big data innovation at the
world’s largest online classifieds business
Dobo Radichkov | London, 5 July 2017
2. 2
Contents
• Introduction to Naspers and OLX Group
• OLX capabilities powered by Redshift
• OLX Redshift technical best practices
• Q&A
3. 3
Contents
• Introduction to Naspers and OLX Group
• OLX capabilities powered by Redshift
• OLX Redshift technical best practices
• Q&A
4. 4
Introducing NASPERS
An $83B global internet & entertainment group and one of the largest technology investors in the world
$15B Revenue
$2B Earnings
130 Countries
1.5B Audience reach
27,000 People
5. 5
Introducing OLX GROUP: The world’s largest classifieds business
40 Countries
20+ Offices
3,000 Employees
15+ Brands
6. 6
OLX Group is a powerful global community
Scale
• 1.7B+ monthly visits
• 35B+ monthly page views
• 60M+ monthly listings
• 300M+ monthly active users
Mobile leader
• 4.4 app rating
• #1 app in 22+ countries
• People spend more than twice as long in OLX apps versus competitors
Listed every second
• 2 houses
• 2 cars
• 3 fashion items
• 3 mobile phones
Global footprint
• 40 countries
• 20+ offices
• 3,000 employees
• 15+ brands
7. 7
Contents
• Introduction to Naspers and OLX Group
• OLX capabilities powered by Redshift
• OLX Redshift technical best practices
• Q&A
8. 8
At OLX, Redshift powers 3 important business capabilities
Customer Lifecycle Management
Personalisation & Relevance
Business Intelligence
9. 9
At OLX, Redshift powers 3 important business capabilities
Customer Lifecycle Management: Enhance the user experience and lifetime value via personalised, relevant, targeted and unified omni-channel user communications
Personalisation & Relevance
Business Intelligence
10. 10
Fundamentally, CLM is all about fuelling retention-driven growth by treating our customers the best way possible in their lifecycle stage
Lifecycle stages (activity level in buying and selling over time): Visitor → First-time user → Returning user → Loyal user (buying and selling) → Loyal user (across multiple categories) → Fan!
Increase the customer lifetime value… through the ‘right’ product, marketing and customer care treatments
11. 11
Trans-
actional
Website &
mobile app
Customer
care
Social /
3rd party
Customer
segmen-
tation
4
Insights &
analytics
3
Execution
6
Customer
treatments
5
Single
customer
view
2
Platforms
and data
1
The OLX CLM implementation is enabled by automated, data-driven,
targeted and personalised customer treatments
12. 12
At OLX, Redshift powers 3 important business capabilities
Customer Lifecycle Management: Enhance the user experience and lifetime value via personalised, relevant, targeted and unified omni-channel user communications
Personalisation & Relevance: Grow buyer engagement, seller success and transactions by showing relevant & personalised content to each of our users
Business Intelligence
13. 13
Context: Panamera is tackling the challenges of Search & Relevance across 3 pillars of content discovery, showing the most relevant content to each of our users
Home page:
• Personalised content feed driving buyer engagement and seller success, driven by buyer interests, social relationship, proximity, freshness and ad / user quality
Search experience:
• Core search results experience including text matching, spell checking, synonym mapping, language-specific optimisations, etc.
• Search auto-complete, auto-suggest, instant results and curated content
Recommendations:
• Recommended content (e.g. listings, categories, search) used to personalise elements of the buyer + seller user journey(s) based on past behaviour, preferences and activity
14. 14
At OLX, Redshift powers 3 important business capabilities
Customer Lifecycle Management: Enhance the user experience and lifetime value via personalised, relevant, targeted and unified omni-channel user communications
Personalisation & Relevance: Grow buyer engagement, seller success and transactions by showing relevant & personalised content to each of our users
Business Intelligence: Empower the business with a robust BI data platform, high-quality executive reporting, and actionable customer insights
15. 15
Contents
• Introduction to Naspers and OLX Group
• OLX capabilities powered by Redshift
• OLX Redshift technical best practices
• Q&A
16. 16
Our guiding principle for big data development using Redshift
✓ Use few but powerful technologies (and become world-class at them)
✓ Keep architecture simple and minimise points of failure
✓ Standardise, build on each other, foster continuous improvement
“Everything should be made as simple as possible, but not simpler.” (Einstein)
17. 17
OLX Redshift technical best practices
• Technical architecture
• Data management
• Recommenders
• Unit testing
• OLAP cubes
• Tableau integration
18. 18
OLX Redshift technical best practices
• Technical architecture
• Data management
• Recommenders
• Unit testing
• OLAP cubes
• Tableau integration
19. 19
Let’s walk through our high-level data architecture step by step…
RDL
ODL
Master
64 × ds1.xl
CLM platform
100 × dc1.l
RDL
ODL
RDL+
ODL+
ADL+
Analyst
sandboxes
(read / write
access)
BI platform
64 × ds1.xl
ADL
CLM
APIs
Management
dashboards
Operational
dashboards
SCV ADL
ODL ODL
Ad hoc analytics
LiveSync Hydra Moderation CRMs APIs Crawlers …
Master data infrastructure (data refreshed every ~3-5 hours)
Extended BI infrastructure (data refreshed every 24 hours)
Load / unload Transformation / modelling Replication
Reporting platform
100 × dc1.l
Mktg
channels
20. 20
LiveSync in-house technology enables dynamic synchronisation of
MySQL production databases to Redshift
LiveSync
Platform database
Live DB replica
(MySQL)
Lazarus
MySQL extractor
(Python)
S3 storage
Data lake
LiveSync
Redshift loader
(Python)
LiveSync
21. 21
Ninja / Hydra in-house multi-tracking capability collects structured
clickstream data from each client device
LiveSync
Client device
(desktop,
mobile, apps)
Ninja tracker
Client-side library
(JS, PHP,
Android, iOS)
Hydra tracker
Server-side
Java application
S3 storage
Data lake
Hydra
Ninja / Hydra
22. 22
LiveSync and Hydra raw data form the backbone of our architecture
and are loaded in Raw Data Layer (RDL) of our Master Redshift cluster
LiveSync Hydra
RDL
Master
64 × ds1.xl
Load / unload Transformation / modelling Replication
LiveSync:
• ~1,000 tables
• ~100 billion records
• ~12 TB compressed storage
Hydra:
• ~400 tables
• ~400 billion records
• ~30 TB compressed storage
Raw data layer (RDL)
23. 23
Raw data is transformed and modelled into the core Operational Data
Layer (ODL) used to feed all data applications
LiveSync Hydra
RDL
Master
64 × ds1.xl
ODL
Stats:
• ~100 tables
• ~150 billion records
• ~5 TB compressed storage
Facts:
• Listings
• Listing liquidity
• Replies
• Revenue transactions
• Clickstream events
• Listing impressions
• …
Dimensions:
• Users
• Business units
• Geographies
• Categories
• Channels
• (~40 dimensions in total)
Operational data layer (ODL)
Load / unload Transformation / modelling Replication
24. 24
From here, the ODL is replicated to each data application via our Rzeka
data replication in-house utility
LiveSync Hydra
RDL
Master
64 × ds1.xl
ODL
CLM platform
100 × dc1.l
Reporting platform
100 × dc1.l
ODL ODL
RDL
ODL
BI platform
64 × ds1.xl
Rzeka replication utility
What is it?
• A fully-configurable Python utility enabling incremental
Redshift-to-Redshift data replication
Load / unload Transformation / modelling Replication
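The Rzeka code itself is not shown in the deck; the following is a minimal sketch of the UNLOAD / COPY cycle that an incremental Redshift-to-Redshift replication utility typically wraps. The bucket, IAM role, table name and date filter are all illustrative placeholders, not OLX's actual configuration.

```sql
-- Hypothetical sketch of one incremental replication cycle (not the actual
-- Rzeka implementation). Bucket, role and table names are placeholders.

-- On the source (Master) cluster: export the incremental window to S3
UNLOAD ('SELECT * FROM odl.fact_listings WHERE date_nk >= ''2017-06-01''')
TO 's3://replication-bucket/odl/fact_listings/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-replication'
GZIP DELIMITER '|' ALLOWOVERWRITE;

-- On the target (CLM / Reporting / BI) cluster: replace the same window
DELETE FROM odl.fact_listings WHERE date_nk >= '2017-06-01';
COPY odl.fact_listings
FROM 's3://replication-bucket/odl/fact_listings/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-replication'
GZIP DELIMITER '|';
```

The delete-then-copy of a bounded date window is what makes the replication incremental and idempotent: re-running a failed cycle produces the same target state.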
25. 25
ODL is used as a basis to build the CLM Single Customer View
enabling CLM treatments and recommendations
LiveSync Hydra
RDL
Master
64 × ds1.xl
ODL
CLM platform
100 × dc1.l
ODL ODL
RDL
ODL
BI platform
64 × ds1.xl
CLM data platform
What does it do?
• User mapping, customer lifecycle segmentation,
recommendation & sort order algorithms, CLM treatments
generation & execution, CLM reporting & analytics
CLM
APIs
Mktg
channels
SCV
Load / unload Transformation / modelling Replication
Reporting platform
100 × dc1.l
26. 26
Similarly, ODL is used to generate the Analytical Data Layer (ADL) in
our Reporting platform
LiveSync Hydra
RDL
Master
64 × ds1.xl
ODL
CLM platform
100 × dc1.l
ODL ODL
RDL
ODL
SCV
Management
dashboards
ADL
BI platform
64 × ds1.xl
ADL
Triton mgmt. reporting
What is it?
• Data models and cubes for Top Management KPIs across
seller, buyer, liquidity, revenue and product activity
• Tableau dashboard implementation
Load / unload Transformation / modelling Replication
CLM
APIs
Mktg
channels
Reporting platform
100 × dc1.l
27. 27
The BI data warehouse sources additional raw data into an Extended
Raw Data Layer (RDL+)
LiveSync Hydra
RDL
Master
64 × ds1.xl
ODL
CLM platform
100 × dc1.l
ODL ODL
RDL
ODL
BI platform
64 × ds1.xl
SCV
Management
dashboards
ADL
RDL+
Moderation CRMs APIs Crawlers …
ADL
Extended Raw Data Layer (RDL+)
What is it?
• New raw data sources supporting extended management
and operational analytics across e.g. Performance
Marketing, CS, Competitor, Salesforce, etc.
Load / unload Transformation / modelling Replication
CLM
APIs
Mktg
channels
Reporting platform
100 × dc1.l
28. 28
The RDL+ is modelled into an Extended Operational & Analytical
Data Layers (ODL+ & ADL+) used to power operational reporting
LiveSync Hydra
RDL
Master
64 × ds1.xl
ODL
CLM platform
100 × dc1.l
ODL ODL
RDL
ODL
BI platform
64 × ds1.xl
SCV
Management
dashboards
ADL
RDL+
Moderation CRMs APIs Crawlers …
ADL
ODL+
ADL+
Operational
dashboards
ODL+ and ADL+
What is it?
• Data layers enabling
extended operational
reporting and
analysis
Load / unload Transformation / modelling Replication
CLM
APIs
Mktg
channels
Reporting platform
100 × dc1.l
29. 29
Ad hoc analysis enabled through read-only SQL endpoints and
read/write sandboxes inside the BI platform
LiveSync Hydra
RDL
Master
64 × ds1.xl
ODL
CLM platform
100 × dc1.l
ODL ODL
RDL
ODL
BI platform
64 × ds1.xl
SCV
Management
dashboards
ADL
RDL+
Moderation CRMs APIs Crawlers …
ADL
ODL+
ADL+
Operational
dashboards
Analyst
sandboxes
for ad hoc
analysis
Analyst
sandboxes
(read / write
access)
Ad hoc analytics
Load / unload Transformation / modelling Replication
CLM
APIs
Mktg
channels
Reporting platform
100 × dc1.l
30. 30
Detailed end-state OLX Group central data architecture
RDL
ODL
Master
64 × ds1.xl
CLM platform
100 × dc1.l
RDL
ODL
RDL+
ODL+
ADL+
Analyst
sandboxes
(read / write
access)
BI platform
64 × ds1.xl
ADL
CLM
APIs
Management
dashboards
Operational
dashboards
SCV ADL
ODL ODL
Ad hoc analytics
LiveSync Hydra Moderation CRMs APIs Crawlers …
Master data infrastructure (data refreshed every ~3-5 hours)
Extended BI infrastructure (data refreshed every 24 hours)
Load / unload Transformation / modelling Replication
Reporting platform
100 × dc1.l
Mktg
channels
31. 31
Side note: With Amazon’s new Athena and Spectrum services we are
exploring new architectural possibilities & improvements
LiveSync Hydra Moderation CRMs APIs Crawlers …
Redshift
Athena metadata catalogue (using Hive DDL)
Athena JDBC endpoint
Presto distributed SQL engine
Spectrum
external S3
tables
Redshift
native
tables
Spectrum distributed SQL engine
JOIN
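The Spectrum pattern sketched in the diagram can be expressed in DDL roughly as follows. This is an illustrative sketch only: the schema, bucket, column names and the `user_id` join key are assumptions, not the production setup.

```sql
-- Hypothetical sketch: raw clickstream stays in S3 as a Spectrum external
-- table and is joined against a native Redshift table. All names illustrative.
CREATE EXTERNAL SCHEMA spectrum_rdl
FROM DATA CATALOG DATABASE 'rdl'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum';

CREATE EXTERNAL TABLE spectrum_rdl.clickstream_events (
    user_id         BIGINT,
    event_name      VARCHAR(64),
    event_timestamp TIMESTAMP
)
PARTITIONED BY (event_date DATE)
STORED AS PARQUET
LOCATION 's3://olx-data-lake/clickstream/';

-- JOIN external S3 data with a native Redshift table
SELECT d.country_sk, COUNT(1)
FROM spectrum_rdl.clickstream_events e
JOIN odl.dim_users d ON e.user_id = d.user_id
WHERE e.event_date >= '2017-06-01'
GROUP BY 1;
```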
32. 32
OLX Redshift technical best practices
• Technical architecture
• Data management
• Recommenders
• Unit testing
• OLAP cubes
• Tableau integration
37. 37
We typically organise our data in 3 data layers
RDL
ODL
ADL
Raw Data Layer
Raw disaggregated and
unprocessed clickstream and
production database data
Operational Data Layer
Clean, structured and standardised
dimensional model serving as
foundation for all data applications
Application Data Layer
Data models specific to each data application
– including CLM algorithms, recommenders,
BI metrics, OLAP cubes, etc.
38. 38
Large scale clickstream data management
Optimise
table design
Partition
data
Create
abstraction
views
RDL
39. 39
Large scale clickstream data management
Optimise
table design
Partition
data
Create
abstraction
views
Goal is to achieve best possible distribution, date range
WHERE performance, and compression effectiveness
RDL
40. 40
Large scale clickstream data management
Optimise
table design
Partition
data
Create
abstraction
views
Goal is to achieve best possible distribution, date range
WHERE performance, and compression effectiveness
Option 1
Distribution key user_id
Sort key event_timestamp
Best possible
distribution
Date range
performance
Effective
compression
(poor compression
of text columns)
RDL
41. 41
Large scale clickstream data management
Optimise
table design
Partition
data
Create
abstraction
views
Goal is to achieve best possible distribution, date range
WHERE performance, and compression effectiveness
Option 1 Option 2
Distribution key user_id user_id
Sort key event_timestamp user_id
Best possible
distribution
Date range
performance (always full scan)
Effective
compression
(poor compression
of text columns)
(1/2x table size)
RDL
42. 42
Large scale clickstream data management
Optimise
table design
Partition
data
Create
abstraction
views
Goal is to achieve best possible distribution, date range
WHERE performance, and compression effectiveness
Option 1 Option 2 Option 3
Distribution key user_id user_id user_id
Sort key event_timestamp user_id
event_date,
user_id,
event_timestamp
Best possible
distribution
Date range
performance (always full scan)
Effective
compression
(poor compression
of text columns)
(1/2x table size) (1/2x table size)
RDL
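Option 3 from the table above can be written as a table definition along these lines (the column list is abbreviated and hypothetical): the compound sort key leads with event_date for date-range pruning, then user_id and event_timestamp for locality and compression, with user_id as the distribution key.

```sql
-- Sketch of Option 3 (illustrative columns, not the production schema)
CREATE TABLE rdl.clickstream_events (
    event_date      DATE        NOT NULL,
    user_id         BIGINT      NOT NULL,
    event_timestamp TIMESTAMP   NOT NULL,
    event_name      VARCHAR(64) ENCODE lzo
)
DISTSTYLE KEY DISTKEY (user_id)
SORTKEY (event_date, user_id, event_timestamp);
```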
43. 43
Large scale clickstream data management
Examples:
• europe_android_201706
• latam_web_201703
• asia_ios_201706
• (~650 tables in total)
Benefits:
• Easily DROP older data when no longer needed
• Minimise use of DELETE / VACUUM operations
• Localise points of failure and ringfence data repairs
• Allow for platform and channel-specific table schema
Optimise
table design
Partition
data
Create
abstraction
views
We partition our clickstream data into 1 table per
PLATFORM × CHANNEL × MONTH combination
RDL
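Under this naming convention, partition maintenance is plain DDL. A sketch (month suffixes illustrative): retiring an old month is a metadata-only DROP, which is what avoids the DELETE / VACUUM cost mentioned above.

```sql
-- Retire a month no longer needed
DROP TABLE IF EXISTS rdl.europe_android_201606;

-- Open a new monthly partition, cloning the schema of the previous one
CREATE TABLE rdl.europe_android_201707 (LIKE rdl.europe_android_201706);
```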
44. 44
Large scale clickstream data management
Types of abstraction views
• Current month → europe_android_current_month
• Previous month → latam_web_previous_month
• Last X months (X = 3/6/12) → asia_ios_last_X_months
• All months → africa_android
• View creation is automated via Python script
• (~300 views in total)
Benefits:
• Abstraction of underlying partitioning mechanism
• Time-agnostic ETLs and analytical queries
Optimise
table design
Partition
data
Create
abstraction
views
We create abstraction VIEWs over individual tables
that UNION ALL the data into relevant groups
RDL
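One such abstraction view might look as follows (the month list is illustrative; in practice these definitions are generated by the Python script mentioned above and refreshed as months roll over):

```sql
-- Sketch of a "last 3 months" abstraction view over monthly partitions
CREATE OR REPLACE VIEW rdl.europe_android_last_3_months AS
SELECT * FROM rdl.europe_android_201704
UNION ALL
SELECT * FROM rdl.europe_android_201705
UNION ALL
SELECT * FROM rdl.europe_android_201706;
```

ETLs and analytical queries reference the view, so the underlying partitioning scheme can change without touching any consumer.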
46. 46
Dimensional modelling approach
Surrogate
keys
Hierarchical
dimensions
OLX operates in dozens of markets using many
different classifieds platforms built on different
technologies and using different production databases
It is impossible to ensure system id uniqueness →
Need robust mechanism for surrogate key modelling
ODL
ADL
47. 47
Dimensional modelling approach
Surrogate
keys
Hierarchical
dimensions
OLX operates in dozens of markets using many
different classifieds platforms built on different
technologies and using different production databases
It is impossible to ensure system id uniqueness →
Need robust mechanism for surrogate key modelling
Option 1 Example
Use combination of
system ids that
guarantees uniqueness
Leads to nightmare
spaghetti SQL that is
difficult and
time-consuming to
write, read and maintain
SELECT ...
FROM odl.fact_listings f
JOIN odl.dim_categories d
ON f.platform_id = d.platform_id
AND f.country_id = d.country_id
AND f.brand_id = d.brand_id
AND f.category_id = d.category_id
AND f.category_level = d.category_level
JOIN ...
GROUP BY ...
ODL
ADL
48. 48
Dimensional modelling approach
Surrogate
keys
Hierarchical
dimensions
OLX operates in dozens of markets using many
different classifieds platforms built on different
technologies and using different production databases
It is impossible to ensure system id uniqueness →
Need robust mechanism for surrogate key modelling
Option 1 Option 2
Use combination of
system ids that
guarantees uniqueness
Create globally unique
identifier (GUID) for
each dimension value
• Complex to implement
• Requires GUID
mapping tables to be
able to trace values
back to system ids
• Dimension keys lose
semantic meaning
ODL
ADL
49. 49
Dimensional modelling approach
Surrogate
keys
Hierarchical
dimensions
OLX operates in dozens of markets using many
different classifieds platforms built on different
technologies and using different production databases
It is impossible to ensure system id uniqueness →
Need robust mechanism for surrogate key modelling
Option 1 Option 2 Option 3
Use combination of
system ids that
guarantees uniqueness
Create globally unique
identifier (GUID) for
each dimension value
Create smart &
persistent surrogate
keys that preserve
semantic meaning
ODL
ADL
50. 50
Dimensional modelling approach
Surrogate
keys
Hierarchical
dimensions
SELECT ...
FROM fact_listings
JOIN dim_countries USING (country_sk) -- 'olx|eu|ua'
JOIN dim_categories USING (category_sk) -- 'olx|asia|in|5|84|1531'
JOIN dim_geographies USING (geography_sk) -- 'olx|eu|ua|17|194|194'
JOIN dim_channels USING (channel_sk) -- 'mobile_app|android'
JOIN dim_listing_status USING (listing_status_sk) -- 'inactive|mod|mod_removed'
JOIN dim_listing_types USING (listing_type_sk) -- 'private'
JOIN dim_listing_feeds USING (listing_feed_sk) -- 'normal'
JOIN dim_listing_net USING (listing_net_sk) -- 'net|mod>live>eod'
JOIN dim_currencies USING (currency_sk) -- 'aed'
JOIN dim_users USING (user_sk)
--'olx|latam|pe|platform|email|freddy@gmail.com'
JOIN ...
GROUP BY ...
Example
Benefits:
• Simple and guaranteed JOINs
• ‘Readable’ key values
• Negligible impact on query performance (= Redshift rocks!)
ODL
ADL
51. 51
Dimensional modelling approach
Surrogate
keys
Hierarchical
dimensions
We model most dimensions as hierarchical dimensions
ODL
ADL
Approach:
• Extract parent-child relationship from system dimensions
• For non-system dimensions, define child-parent relationship in
configuration tables
• Create hierarchical dimensions using generic SQL hierarchy
generation script
• Set all dimension key values in fact tables as deepest available
hierarchy level (ideally key of tree leaf values)
Benefits:
• Consistent dimensional modelling approach
• Can easily traverse hierarchy from leaves all the way up to root
• Easy to read & write JOINs with single-key ON conditions
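The flattened-hierarchy shape this approach produces can be sketched as below. The level columns (`category_l1_name`, `category_l2_name`) are hypothetical: the point is that each dimension row carries its full ancestor path, so rolling up from the leaf-level surrogate key to any level is a single-key JOIN plus GROUP BY.

```sql
-- Sketch: roll leaf-level facts up to the top two category levels
SELECT c.category_l1_name,   -- root level, e.g. 'Vehicles'
       c.category_l2_name,   -- e.g. 'Cars'
       COUNT(1) AS listings
FROM odl.fact_listings f
JOIN odl.dim_categories c USING (category_sk)  -- leaf-level surrogate key
GROUP BY 1, 2;
```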
56. 56
Simplest workflow typically takes an input and applies dimensional
modelling & business rules to transform the data into desired output
Input table(s)
Transformation
Transformation
View
Table
SELECT
INSERT
Dimension(s)
Map(s)
Every component uses
the output from other
previously modelled
components as input
Relevant Dimensions
and Maps are included
to apply dimensional
model and required
business rules
SQL transformation logic is
decoupled and stored in a
separate Transformation VIEW
CREATE TABLE fact_output (...)
DISTSTYLE KEY DISTKEY (...) SORTKEY(...);
CREATE OR REPLACE VIEW fact_output_view AS
SELECT ... -- business logic and projections
FROM fact_input
JOIN dimensions
JOIN maps
GROUP BY ...; -- aggregation logic
TRUNCATE fact_output;
INSERT INTO fact_output
SELECT * FROM fact_output_view;
ANALYZE fact_output;
Final table that
contains output of
data workflow
57. 57
For more complex workflows, we use one or more staging steps to
ensure code modularity and have better control over performance
Input table(s)
View
Table
SELECT
INSERT
Transformation
staging step(s)
Transformation
Transformation
staging step(s)
Transformation
Dimension(s)
Map(s)
Intermediate staging logic is
encapsulated in separate VIEWs
Staging output is materialised in
dedicated tables to be used as input into
subsequent transformation steps
CREATE TABLE fact_output_staging_step1 (...)
DISTSTYLE KEY DISTKEY (...) SORTKEY(...);
CREATE TABLE fact_output_staging_step2 (...)
DISTSTYLE KEY DISTKEY (...) SORTKEY(...);
-- ... more staging steps as required ...
CREATE TABLE fact_output (...)
DISTSTYLE KEY DISTKEY (...) SORTKEY(...);
CREATE OR REPLACE VIEW
fact_output_staging_step1_view AS
SELECT ... -- business logic and projections
FROM fact_input
JOIN ...
GROUP BY ...; -- aggregation logic
CREATE OR REPLACE VIEW
fact_output_staging_step2_view AS
SELECT ... -- business logic and projections
FROM fact_output_staging_step1
JOIN ...
GROUP BY ...; -- aggregation logic
-- ... more staging steps as required ...
CREATE OR REPLACE VIEW fact_output_view AS
SELECT ... -- business logic and projections
FROM fact_output_staging_step2
JOIN ...
GROUP BY ...; -- aggregation logic
TRUNCATE fact_output_staging_step1;
INSERT INTO fact_output_staging_step1
SELECT * FROM fact_output_staging_step1_view;
ANALYZE fact_output_staging_step1;
TRUNCATE fact_output_staging_step2;
INSERT INTO fact_output_staging_step2
SELECT * FROM fact_output_staging_step2_view;
ANALYZE fact_output_staging_step2;
-- ... more staging steps as required ...
TRUNCATE fact_output;
INSERT INTO fact_output
SELECT * FROM fact_output_view;
ANALYZE fact_output;
58. 58
We use Feeders to apply the same Transformation to different Inputs
Input table(s)
View
Table
SELECT
INSERT
Feeder
Transformation
Transformation
Dimension(s)
Map(s)
CREATE OR REPLACE VIEW fact_output_feeder1_view AS
-- Combine & apply projections / pre-processing
SELECT ... FROM fact_input1 UNION ALL
SELECT ... FROM fact_input2 UNION ALL
SELECT ... FROM fact_input3;
CREATE OR REPLACE VIEW fact_output_feeder2_view AS
-- Combine & apply projections / pre-processing
SELECT ... FROM fact_input4 UNION ALL
SELECT ... FROM fact_input5 UNION ALL
SELECT ... FROM fact_input6;
CREATE OR REPLACE VIEW fact_output_feeder_view AS
SELECT * FROM fact_output_feeder2_view;
CREATE OR REPLACE VIEW fact_output_view AS
SELECT ... -- business logic and projections
FROM fact_output_feeder_view
...;
TRUNCATE fact_output;
-- Run transformation on fact_input1
CREATE OR REPLACE VIEW fact_output_feeder_view AS
SELECT * FROM fact_output_feeder1_view;
INSERT INTO fact_output
SELECT * FROM fact_output_view;
ANALYZE fact_output;
-- Run transformation on fact_input2
CREATE OR REPLACE VIEW fact_output_feeder_view AS
SELECT * FROM fact_output_feeder2_view;
INSERT INTO fact_output
SELECT * FROM fact_output_view;
ANALYZE fact_output;
Feeders are VIEWs that decouple the
input data from the Transformation logic.
They can be used to collate multiple
inputs (e.g. UNION ALL) & apply basic
pre-processing and projections serving
as basis for rest of data flow
Input table(s)
59. 59
Feeders can be very powerful with incremental data workflows
Input table(s)
Transformation
Transformation
Dimension(s)
Map(s)
View
Table
SELECT
INSERT
Fast incremental
load feeder
Slow incremental
load feeder
Full load feeder
CREATE OR REPLACE VIEW fact_output_feeder_full_load_view AS
SELECT ... -- projections & pre-processing
-- Use full available time range
FROM fact_input;
CREATE OR REPLACE VIEW fact_output_feeder_incr_load_slow_view AS
SELECT ... -- projections & pre-processing
FROM fact_input
-- Use 4-week incremental time window
WHERE date_nk >= (GETDATE() :: DATE - INTERVAL '4 week') :: DATE;
CREATE OR REPLACE VIEW fact_output_feeder_incr_load_fast_view AS
SELECT ... -- projections & pre-processing
FROM fact_input
-- Use 2-day incremental time window
WHERE date_nk >= (GETDATE() :: DATE - INTERVAL '2 day') :: DATE;
CREATE OR REPLACE VIEW fact_output_feeder_view AS
SELECT * FROM fact_output_feeder_incr_load_fast_view;
Typically we switch between 3 load feeders
depending on the ETL processing approach:
(1) Full load – Processes the input in its
entirety. Used when running the
transformation for the first time at full scale.
(2) Incremental Fast load – Processes last
few days / hours of data. Used by
default for scheduled production job.
(3) Incremental Slow load – Processes last
month of data. Used to manually fix
problems from previous runs without
having to re-process years of data.
60. 60
Another great use of Feeders is as a method to switch between
Production and Development environments
Input table(s)
Transformation
Transformation
Dimension(s)
Map(s)
Development feeder Production feeder
View
Table
SELECT
INSERT
We use Production /
Development feeder VIEWs to
switch between full scale (e.g. all
countries) and development
scale (e.g. few small countries).
This enables fast ETL runtimes
during development and testing.
CREATE OR REPLACE VIEW fact_output_feeder_development_view AS
SELECT ... -- projections & pre-processing
FROM fact_input
WHERE country_sk IN ('olx|mea|gh', 'olx|mea|za');
CREATE OR REPLACE VIEW fact_output_feeder_production_view AS
SELECT ... -- projections & pre-processing
FROM fact_input;
CREATE OR REPLACE VIEW fact_output_feeder_view AS
SELECT * FROM fact_output_feeder_development_view;
61. 61
Ultimately, these patterns can be combined in various ways
depending on the requirements of the data workflow
Input table(s)
Development feeder
Transformation
staging step(s)
Transformation
Production feeder
Fast incremental
load feeder
Slow incremental
load feeder
Full load feeder
Transformation
staging step(s)
Transformation
Root feeder
View
Table
SELECT
INSERT
Dimension(s)
Map(s)
62. 62
Individual data workflows add up to our full data management
ecosystem
Component 1
Component 2
Component 3
Component 4
…
…
65. 65
Example repository structure and file naming
We use standard file naming
with special prefixes to
decouple the logical building
blocks of the data workflow:
• Table definition
• View definition (feeders,
transformations, unit tests)
• ETL scripts
• Configuration scripts
• Analysis
66. 66
OLX Redshift technical best practices
• Technical architecture
• Data management
• Recommenders
• Unit testing
• OLAP cubes
• Tableau integration
74. 74
Collaborative filtering implementation
Get and rate all
relevant item
interactions
Compute item
similarity matrix
Compute 1st and 2nd
degree item-to-item
recommendations
Compute personalised
user-to-item
recommendations
75. 75
Collaborative filtering implementation
Get and rate all
relevant item
interactions
Compute item
similarity matrix
Compute 1st and 2nd
degree item-to-item
recommendations
Compute personalised
user-to-item
recommendations
CREATE TABLE item_interactions AS
SELECT user,
band,
SUM(DECODE(action, 'like', 1, 'play', 3)) AS score
FROM clickstream
WHERE action IN ('like', 'play')
GROUP BY 1, 2;
user band score
Jack Depeche Mode 5
Jack The Cure 3
Jack Keane 8
Jill Depeche Mode 4
Jill Keane 10
Jill Suede 7
Jim Keane 15
… … …
76. 76
Collaborative filtering implementation
Get and rate all
relevant item
interactions
Compute item
similarity matrix
Compute 1st and 2nd
degree item-to-item
recommendations
Compute personalised
user-to-item
recommendations
CREATE TABLE similarity_matrix AS
SELECT i1.band AS band1,
i2.band AS band2,
COUNT(1) AS frequency,
SUM(i1.score + i2.score) AS score
FROM item_interactions i1
JOIN item_interactions i2
ON i1.user = i2.user
AND i1.band <> i2.band
GROUP BY 1,2
HAVING COUNT(1) > 1;
user band score
Jack Depeche Mode 5
Jack The Cure 3
Jack Keane 8
Jill Depeche Mode 4
Jill Keane 10
Jill Suede 7
Jim Keane 15
… … …
Depeche
Mode
TheCure
Keane
Placebo
Suede
Depeche Mode 3 4
The Cure 3 3
Keane 4 3 5
Placebo 2
Suede 5 2
77. 77
user band score
Jack Depeche Mode 5
Jack The Cure 3
Jack Keane 8
Jill Depeche Mode 4
Jill Keane 10
Jill Suede 7
Jim Keane 15
… … …
Depeche
Mode
TheCure
Keane
Placebo
Suede
Depeche Mode 3 4
The Cure 3 3
Keane 4 3 5
Placebo 2
Suede 5 2
Collaborative filtering implementation
Get and rate all
relevant item
interactions
Compute item
similarity matrix
Compute 1st and 2nd
degree item-to-item
recommendations
Compute personalised
user-to-item
recommendations
1st degree item-to-item recommendations
band1 band2 frequency sum_score rec_rank
Depeche Mode Keane 4 49 1
Depeche Mode The Cure 3 39 2
Keane Suede 5 80 1
Keane Depeche Mode 4 49 2
Keane The Cure 3 48 3
Placebo Suede 2 62 1
Suede Keane 5 80 1
Suede Placebo 2 62 2
The Cure Keane 3 48 1
The Cure Depeche Mode 3 39 2
CREATE TABLE dobo.rec_item2item AS
SELECT band1, band2, frequency, sum_score,
ROW_NUMBER() OVER (PARTITION BY band1 ORDER BY frequency DESC, sum_score DESC) AS rec_rank
FROM similarity_matrix;
78. 78
Collaborative filtering implementation
Get and rate all
relevant item
interactions
Compute item
similarity matrix
Compute 1st and 2nd
degree item-to-item
recommendations
Compute personalised
user-to-item
recommendations
INSERT INTO dobo.rec_item2item
WITH max_rank AS (
SELECT band1, MAX(rec_rank) AS max_rank_1st_degree
FROM dobo.rec_item2item
GROUP BY 1
)
SELECT rec_1st.band1,
rec_2nd.band2,
NULL AS frequency,
NULL AS sum_score,
ROW_NUMBER() OVER ( PARTITION BY rec_1st.band1
ORDER BY MIN(rec_1st.rec_rank *
rec_2nd.rec_rank),
MIN(rec_1st.rec_rank)
) + max_rank_1st_degree AS rec_rank
FROM dobo.rec_item2item rec_1st
JOIN max_rank USING (band1)
JOIN dobo.rec_item2item rec_2nd
ON rec_1st.band2 = rec_2nd.band1
AND rec_1st.band1 <> rec_2nd.band2
-- exclude items already in 1st degree recommendations
LEFT JOIN dobo.rec_item2item rec_excl
ON rec_1st.band1 = rec_excl.band1
AND rec_2nd.band2 = rec_excl.band2
WHERE rec_excl.band1 IS NULL
GROUP BY 1, 2, max_rank_1st_degree;
1st degree item-to-item recommendations
band1 band2 frequency sum_score rec_rank
Depeche Mode Keane 4 49 1
Depeche Mode The Cure 3 39 2
Keane Suede 5 80 1
Keane Depeche Mode 4 49 2
Keane The Cure 3 48 3
Placebo Suede 2 62 1
Suede Keane 5 80 1
Suede Placebo 2 62 2
The Cure Keane 3 48 1
The Cure Depeche Mode 3 39 2
2nd degree item-to-item recommendations
band1 band2 frequency sum_score rec_rank
Depeche Mode Suede 3
Keane Placebo 4
Placebo Keane 2
Suede Depeche Mode 3
Suede The Cure 4
The Cure Suede 3
79. 79
Collaborative filtering implementation
Get and rate all
relevant item
interactions
Compute item
similarity matrix
Compute 1st and 2nd
degree item-to-item
recommendations
Compute
personalised
user-to-item
recommendations
user band score
Jack Depeche Mode 5
Jack The Cure 3
Jack Keane 8
Jill Depeche Mode 4
Jill Keane 10
Jill Suede 7
Jim Keane 15
… … …
Depeche
Mode
TheCure
Keane
Placebo
Suede
Depeche Mode 3 4
The Cure 3 3
Keane 4 3 5
Placebo 2
Suede 5 2
user-to-item recommendations
user band frequency sum_score rec_rank
Ana Placebo 62 6 1
Ana Depeche Mode 49 5 2
Ana The Cure 48 7 3
Dave Placebo 62 6 1
Dave Depeche Mode 49 5 2
Dave The Cure 48 7 3
Eric Placebo 62 6 1
Eric Depeche Mode 49 5 2
Eric The Cure 48 7 3
Jack Placebo 4 1
Jack Suede 80 7 2
Jen Depeche Mode 3 1
Jen The Cure 4 2
Jen Keane 80 3 3
Jill The Cure 87 9 1
Jill Placebo 62 6 2
Jim Depeche Mode 49 5 1
Jim The Cure 48 7 2
John Placebo 4 1
John Suede 80 7 2
Sam Placebo 4 1
Sam Suede 80 7 2
SELECT ia.user, rec.band2,
SUM(rec.sum_score) AS frequency, SUM(rec.rec_rank) AS sum_score,
ROW_NUMBER() OVER ( PARTITION BY ia.user ORDER BY
SUM(rec.sum_score) DESC, SUM(rec.rec_rank) ASC) AS rec_rank
FROM dobo.item_interactions ia -- 'int' is a reserved word, so use a safe alias
JOIN dobo.rec_item2item rec
ON ia.band = rec.band1
-- Exclude recommendations that the user already interacted with
LEFT JOIN dobo.item_interactions excl
ON rec.band2 = excl.band
AND ia.user = excl.user
WHERE excl.band IS NULL
GROUP BY 1,2;
80. 80
At OLX, we apply this approach to implement a variety of recommenders:
• Item-to-item: People who viewed items A, B, C also viewed items X, Y, Z
• Category-to-category: People who bought Cars were also interested in Car Parts
• Search-to-category: People who searched for ‘black leather sofa’ were interested in Furniture
• Search-to-search: People who searched for ‘porsche’ also searched for ‘bmw’, ‘mercedes’, ‘ferrari'
• Category-to-search: People who browsed Mobile phones searched for ‘iphone 7’, ‘samsung galaxy’, …
• + many more …
82. 82
Other examples
Related search
recommendations for
browsing users
Personalised item
recommendations for
active buyers
Recommended
categories to post for
active sellers
Users like you also liked
Try also these related searches
Here are some other selling ideas
83. 83
OLX Redshift technical best practices
• Technical architecture
• Data management
• Recommenders
• Unit testing
• OLAP cubes
• Tableau integration
85. 85
OLX is developing Qualis – a unit testing framework for Redshift / SQL
ü Switch from reactive to proactive
error handling
ü Enable SQL codebase scale out
ü Reduce maintenance time and ad
hoc data investigations
ü Make data platform more robust
ü Free up time for innovation
86. 86
Qualis includes Redshift-side framework (being piloted) and Python test
automation & visualisation (currently in development)
Test 1 Test 2 Test 3
Test 4 Test 5 Test 6
Test 7 Test 8 Test 9
Test 10 Test 11 …
Redshift database Python
Qualis tests are
implemented in Redshift
using VIEWs that return
standard output &
PASS/FAIL result
Qualis script runs daily
and SELECTs test output
from each VIEW using
flexible configuration and
parses & aggregates
results
Final output is visualised
using plain text and / or
third party visualisation
tool (e.g. Tableau)
Visualisation
87. 87
Example #1: Duplicates detection test
CREATE OR REPLACE VIEW
clm.utest_fact_segmentation_duplication_view AS
WITH test AS (
SELECT country_sk AS country_sk,
COUNT(1) AS cnt,
COUNT(DISTINCT user_sk) AS cnt_distinct
FROM clm.fact_segmentation
GROUP BY 1
)
SELECT 'fact_segmentation' AS test_module,
country_sk,
'duplication' AS test_group,
NULL AS test_subgroup,
NULL AS test_instance,
CASE WHEN cnt = cnt_distinct THEN 'PASS' ELSE
'FAIL' END || ' [cnt: ' || cnt || '; COUNT(DISTINCT): ' ||
cnt_distinct || ']' AS duplicates
FROM test
ORDER BY 1,2,3,4,5;
Compares overall
COUNT to
COUNT(DISTINCT) for
each OLX country to
detect data duplication
88. 88
Example #1: Duplicates detection test
CREATE OR REPLACE VIEW
clm.utest_fact_segmentation_duplication_view AS
WITH test AS (
SELECT country_sk AS country_sk,
COUNT(1) AS cnt,
COUNT(DISTINCT user_sk) AS cnt_distinct
FROM clm.fact_segmentation
GROUP BY 1
)
SELECT 'fact_segmentation' AS test_module,
country_sk,
'duplication' AS test_group,
NULL AS test_subgroup,
NULL AS test_instance,
CASE WHEN cnt = cnt_distinct THEN 'PASS' ELSE
'FAIL' END || ' [cnt: ' || cnt || '; COUNT(DISTINCT): ' ||
cnt_distinct || ']' AS duplicates
FROM test
ORDER BY 1,2,3,4,5;
Compares overall
COUNT to
COUNT(DISTINCT) for
each OLX country to
detect data duplication
89. 89
Example #2: Gap detection in time-series data
Aggregates data into hourly
buckets and identifies any
missing hours, while including
some logic to reduce false
positives (e.g. during night hours
in smaller OLX markets)
CREATE OR REPLACE VIEW clm.utest_fact_event_clickstream_agg_user_mapped_gaps_view AS
WITH
hours AS (
SELECT country_sk,
DATE_TRUNC('HOUR', current_local_time - INTERVAL '1 HOUR' * row_num) AS hour
FROM global_bi.dim_counter
CROSS JOIN clm.fact_country_current_time
WHERE row_num BETWEEN 25 AND 2 * 7 * 24 -- 2 weeks x 7 days x 24 hours (start checking for gaps from 24 hours ago)
),
test AS (
SELECT country_sk,
DATE_TRUNC('HOUR', time_event_local) AS hour,
COUNT(1) AS cnt
FROM (
SELECT *,
-- Use average of first and last event timestamp within aggregated time window to approximate overall timing of event(s)
TIMESTAMP 'epoch' + INTERVAL '1 second' * ((
DATEDIFF('second', 'epoch', time_first_event_local) +
DATEDIFF('second', 'epoch', time_last_event_local)
) / 2) AS time_event_local
FROM clm.fact_event_clickstream_agg_user_mapped
) fc
WHERE date_event_nk >= (GETDATE() :: DATE - INTERVAL '16 DAYS') :: DATE -- Performance filter
GROUP BY 1,2
),
countries_in_scope AS (
SELECT country_sk,
AVG(1.0 * cnt) AS avg_cnt_allday,
AVG(1.0 * CASE WHEN DATE_PART('hour', hour) BETWEEN 0 AND 6 THEN cnt END) AS avg_cnt_night
FROM test
GROUP BY 1
)
SELECT 'fact_event_clickstream_agg_user_mapped' AS test_module,
country_sk,
'gaps' AS test_group,
NULL AS test_subgroup,
DATE_TRUNC('day', hour) :: DATE AS test_instance,
CASE WHEN COUNT(CASE WHEN COALESCE(test.cnt, 0) = 0 THEN 1 END) = 0 THEN 'PASS' ELSE 'FAIL'
|| ' [' || COUNT(CASE WHEN COALESCE(test.cnt, 0) = 0 THEN 1 END) || ' missing hours: '
|| LISTAGG(CASE WHEN COALESCE(test.cnt, 0) = 0 THEN DATE_PART('hour', hour) END, ',')
WITHIN GROUP (ORDER BY DATE_PART('hour', hour)) || '; Avg/hr (all day): ' || avg_cnt_allday :: INT
|| '; Avg/hr (night only): ' || avg_cnt_night :: INT || ']'
END AS gap_test
FROM hours
JOIN countries_in_scope USING (country_sk)
LEFT JOIN test USING (country_sk, hour)
WHERE -- Do not test night hours if night activity is very low on average
CASE WHEN avg_cnt_night <= 500 THEN DATE_PART('hour', hour) NOT BETWEEN 0 AND 6 ELSE 1 :: BOOL END = 1 :: BOOL
-- Do not test countries with super low activity to minimise false FAILs
AND avg_cnt_allday >= 500
GROUP BY 2,5,avg_cnt_allday,avg_cnt_night
ORDER BY 1,2,3,4,5;
91. 91
Example #3: Simple business logic validation
Configuration sheet specifies
test rules (1 cell = 1 test) → in
this example, testing for data
coverage (minimum % of
records with non-NULL values)
per customer segment
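Such a coverage rule can be sketched as follows (a minimal illustration under assumed inputs; the column name, records, and thresholds are hypothetical, not from the configuration sheet):

```python
def coverage_test(records, column, threshold):
    """PASS if at least `threshold` fraction of records have a non-NULL
    value in `column` -- mirroring one cell (= one test) of the sheet."""
    if not records:
        return "FAIL (no records)"
    non_null = sum(1 for r in records if r.get(column) is not None)
    coverage = non_null / len(records)
    return "PASS" if coverage >= threshold else f"FAIL ({coverage:.0%} < {threshold:.0%})"

# One segment: 3 of 4 records populated, tested against a 90% coverage rule
records = [{"email": "a@x.com"}, {"email": "b@x.com"},
           {"email": None}, {"email": "c@x.com"}]
print(coverage_test(records, "email", 0.90))  # FAIL (75% < 90%)
```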
93. 93
Example #4: Complex business logic validation
The most advanced test case (to date) validates segment business rules
using equality / inequality conditions across different segments /
dimensions
96. 96
OLX Redshift technical best practices
• Technical architecture
• Data management
• Recommenders
• Unit testing
• OLAP cubes
• Tableau integration
97. 97
Challenge: OLX has complex global reporting needs
Dimensions (avg. cardinality):
• Time (month / week / day) → ~200
• Business unit (4-level hierarchy) → ~50
• Category (6-level hierarchy) → ~600
• Geography (3-level hierarchy) → ~220
• Channel (3-level hierarchy) → ~10
• Segment (3-way segmentation) → ~27
• Type (3-level hierarchy) → ~7
Measures: ~20 measures (additive & non-additive)
Measure variants: Current, Lag, Year ago, Target
Up to ~200 trillion data points!!
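The headline figure follows directly from multiplying the cardinalities above — a quick back-of-the-envelope check:

```python
# Average cardinality per dimension, from the slide
dims = {"time": 200, "business_unit": 50, "category": 600,
        "geography": 220, "channel": 10, "segment": 27, "type": 7}
measures = 20
variants = 4  # Current, Lag, Year ago, Target

points = measures * variants
for c in dims.values():
    points *= c
print(f"{points:.2e}")  # ~2e14, i.e. ~200 trillion data points
```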
98. 98
3 possible solutions:
1. Query disaggregated data and compute measures in real-time
   1a. Keep data in Redshift → on-the-fly calculations over billions of
       records are not fast enough for a responsive user experience
   1b. Load data into fast columnar storage (e.g. Cassandra or Tableau's
       internal database) → size of disaggregated data exceeds limits
       (e.g. Tableau) & is too large to efficiently enable daily loads
2. Use a standalone OLAP product from a big-name database vendor →
   too expensive, too complex, requires hiring specialists with domain
   knowledge
3. OLX Redshift OLAP framework → pre-aggregated cubes with direct
   Tableau integration offer the most pragmatic & simplest solution
99. 99
Under the framework, we use a configurable cube matrix to specify the
slices we are interested in reporting on.

Measures:
• Number of users – additive only across Business Unit dimension
• Number of listings – additive across all dimensions

Example cube matrix dimensions:
• Time: Q3 2017 → Apr 2017 (1 Apr, 2 Apr, …, 30 Apr), May 2017 (1 May,
  2 May, …, 31 May), Jun 2017 (1 Jun, 2 Jun, …, 30 Jun)
• Channel: Total → Desktop web, Mobile web, Mobile apps (Android, iOS)
• Business unit: Total → Europe (PL, PT), LATAM (AR, CO)

Dimension:    Time (non-additive)    Channel (non-additive)   Business unit
Perspective:  Quarter  Month  Day    Total  L1  L2             Total  Region  Country
Cardinality:  1        3      91     1      3   2              1      2       4

# records in cube:
Full cube:   3,990
Sub-cube 1:    364
Sub-cube 2:     28
Sub-cube 3:     80

472 vs. 3,990 records (~12% of the size of the full cube)
Example
100. 100
In this example, the 3 sub-cubes translate into 11 slices:

Dimension:    Time (non-additive)    Channel (non-additive)   Business unit
Perspective:  Quarter  Month  Day    Total  L1  L2             Total  Region  Country
Cardinality:  1        3      91     1      3   2              1      2       4

# records per slice:
Slice 1:  364    Slice 4:   2    Slice 7:   3    Slice 10:   8
Slice 2:    4    Slice 5:   6    Slice 8:  12    Slice 11:  24
Slice 3:   12    Slice 6:   1    Slice 9:  36
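Each slice fixes one perspective per dimension, so its record count is simply the product of the chosen cardinalities (e.g. 364 = 91 days × 1 channel total × 4 countries). A sketch that enumerates slices as cartesian products over the cardinality table above (the slice-to-perspective mapping is inferred from the arithmetic, not stated on the slide):

```python
from itertools import product

# Perspective cardinalities from the example cube matrix
time = {"quarter": 1, "month": 3, "day": 91}
channel = {"total": 1, "l1": 3, "l2": 2}
bus_unit = {"total": 1, "region": 2, "country": 4}

def slice_size(t, c, b):
    """A slice fixes one perspective per dimension; its record count is
    the product of the three perspective cardinalities."""
    return time[t] * channel[c] * bus_unit[b]

# The full cube is the union of every possible slice
full_cube = sum(slice_size(t, c, b)
                for t, c, b in product(time, channel, bus_unit))
print(full_cube)                              # 3990
print(slice_size("day", "total", "country"))  # 364 (Slice 1)
```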
101. 101
Pseudo-SQL OLAP cube implementation
SELECT CASE WHEN cube_matrix.time_quarter THEN dim_time.quarter_name
WHEN cube_matrix.time_month THEN dim_time.month_name
WHEN cube_matrix.time_day THEN dim_time.date
END AS time_value,
CASE WHEN cube_matrix.channel_total THEN 'Total'
WHEN cube_matrix.channel_l1 THEN dim_channel.channel_l1_name
WHEN cube_matrix.channel_l2 THEN dim_channel.channel_l2_name
END AS channel_value,
CASE WHEN cube_matrix.bus_unit_total THEN 'Total'
WHEN cube_matrix.bus_unit_region THEN dim_bus_unit.region_name
WHEN cube_matrix.bus_unit_country THEN dim_bus_unit.country_name
END AS bus_unit_value,
COUNT(1) AS num_listings,
COUNT(DISTINCT user_key) AS num_users
FROM fact_listings
JOIN dim_time USING (time_key)
JOIN dim_channel USING (channel_key)
JOIN dim_bus_unit USING (bus_unit_key)
CROSS JOIN cube_matrix
GROUP BY 1,2,3
CHALLENGE: This CROSS JOIN can be very expensive, as it explodes the
input fact table by the number of slices configured in the cube matrix.
SOLUTION: Aggregate only the non-additive dimensions first (Time and
Channel), then aggregate the additive dimensions (Business Unit) using
the already partially aggregated output.
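The two-phase idea can be sketched as follows (an illustrative toy data model, assuming the listing count is additive across Business Unit while Time and Channel must be aggregated from the raw grain):

```python
from collections import defaultdict

# Raw facts at (day, channel, country) grain: key -> listing count
facts = {
    ("2017-04-01", "android", "PL"): 10,
    ("2017-04-01", "ios",     "PL"): 5,
    ("2017-04-02", "android", "PT"): 7,
}

# Phase 1: aggregate the NON-additive dims (day -> month, channel -> total)
# while keeping the additive dim (country) at its finest grain.
phase1 = defaultdict(int)
for (day, channel, country), cnt in facts.items():
    month = day[:7]
    phase1[(month, "Total", country)] += cnt

# Phase 2: roll up the ADDITIVE dim (country -> region) from the much
# smaller phase-1 output instead of re-scanning the raw fact table.
region_of = {"PL": "Europe", "PT": "Europe"}
phase2 = defaultdict(int)
for (month, channel, country), cnt in phase1.items():
    phase2[(month, channel, region_of[country])] += cnt

print(dict(phase2))  # {('2017-04', 'Total', 'Europe'): 22}
```

The point is that the expensive pass over raw facts happens once (phase 1); every additive roll-up afterwards reads only the partially aggregated output.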
102. 102
Summary of OLX Redshift OLAP framework
Input: Operational data model from the OLX data warehouse –
Facts (~15 tables) + Dimensions (~15 tables), ~250B records

Step 1: Prepare cube pre-aggregates at the smallest grain possible
(i.e. user), driven by the OLAP cube configuration (definition of the
cube dimensional model and cube matrix) → Cube 1, 2, 3, …
pre-aggregates, ~6B records

Step 2: Aggregate all non-additive cube slices → Cube 1, 2, 3, …
non-additive aggregations, ~100M records

Step 3: Aggregate all additive cube slices using the output from the
previous step → Cube 1, 2, 3, … additive aggregations, ~460M records

Step 4: Combine the individual cube outputs into a single cube →
Consolidated cube, ~200M records
109. 109
OLX Redshift technical best practices
• Technical architecture
• Data management
• Recommenders
• Unit testing
• OLAP cubes
• Tableau integration
110. 110
Summary of OLX Redshift OLAP framework
(Input and Steps 1–4 as in the earlier framework summary: ~250B input
records → ~6B cube pre-aggregates → ~100M non-additive aggregations →
~460M additive aggregations → ~200M-record consolidated cube.)
Output: Tableau view → Tableau live dashboard connection to a dedicated
Tableau Redshift cluster.
The Tableau abstraction view adds derivative measures (calculated on
the fly) and formatted values needed to implement the reporting
dashboard.
Strive for 1 query per Tableau interaction, running within max. ~3 sec.
112. 112
Debugging Tableau interaction with Redshift
-- Get the last queries run from Tableau
SELECT DATEDIFF('ms', starttime, endtime) / 1000.0 AS duration,
query, xid, pid, starttime, querytxt
FROM stl_query
WHERE userid = 105
ORDER BY starttime DESC
LIMIT 100;
-- Get the full SQL of one transaction (re-inserting the newlines/tabs
-- that svl_statementtext stores as literal '\n' text)
SELECT LISTAGG(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
text, '\\n', '\n' ),
'"', '' ),
', ', ',\n\t\t' ),
'AND ', '\n AND\t' ),
'FROM ', '\n FROM\t' ),
'WHERE ', '\n WHERE\t' ),
'SELECT ', '\nSELECT\t' ),
'declare', '--declare' ),
'')
WITHIN GROUP (ORDER BY sequence, starttime) AS sql
FROM svl_statementtext
WHERE userid = 105
AND xid = 5650302
AND text NOT LIKE 'begin%'
AND text NOT LIKE 'fetch%'
AND text NOT LIKE 'close%';
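The chain of REPLACEs above just re-inserts line breaks before major SQL clauses so the logged one-line query becomes readable. The same reflow can be sketched in Python (illustrative only; the input string is a made-up query, not real svl_statementtext output):

```python
def reflow_sql(text):
    """Re-insert line breaks before major SQL clauses so a query logged
    as one long line becomes readable."""
    for old, new in [
        ('\\n', '\n'),         # un-escape literal "\n" sequences
        ('"', ''),             # drop quoting noise
        (', ', ',\n\t\t'),     # one select-list item per line
        ('AND ', '\n  AND '),
        ('FROM ', '\nFROM '),
        ('WHERE ', '\nWHERE '),
        ('SELECT ', '\nSELECT '),
    ]:
        text = text.replace(old, new)
    return text

print(reflow_sql('SELECT a, b FROM t WHERE a = 1 AND b = 2'))
```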
113. 113
Contents
• Introduction to Naspers and OLX Group
• OLX capabilities powered by Redshift
• OLX Redshift technical best practices
• Q&A
114. 114
Thank you! Questions?
Dobo Radichkov
Sr. Director, Global Analytics and
Customer Lifecycle Management
Dobo@OLX.com
OLX Group
www.olx.com
Free classifieds
We are hiring!
www.joinolx.com
Roles:
• Data engineers
• Data scientists
• PHP / Java / Android
/ iOS developers
Locations: Berlin,
Lisbon, Buenos Aires,
Dubai, Barcelona,
Moscow, Delhi