Apache Kylin Meetup
Berlin, October 2019
Agenda
19:15 - Leveraging analytics at OLX Group with Kylin
Victor Fujihara – Data Engineering Manager
Mateusz Jerzyk – Senior Data Engineer
Rafael Correa – Senior System Engineer
20:00 - Apache Kylin: Boost your SQLs on an extremely large dataset
George Ni – Apache Kylin committer
Table of contents
● Introduction to OLX Group
● What problem were we trying to solve?
● Data Engineering with Kylin
● Handling Kylin infrastructure
THE OLX GROUP IS PART OF NASPERS
A collection of leading
companies and exciting
businesses!
US$107Bn
Founded 1915
South Africa
Market Cap: $100B
THE OLX GROUP – NASPERS CLASSIFIEDS
We improve lives by bringing
people together
for win-win exchanges
OLX GROUP TODAY:
THE WORLD'S #1 CLASSIFIEDS BUSINESS
[Brand map; logos not recoverable. Categories shown: Horizontals, Real estate verticals, Car verticals, Other verticals (furniture, heavy machinery, services, jobs), Convenient transactions. Markets labelled: global, Russia, UAE, Africa and Philippines, Portugal, Poland, Romania, Egypt, Europe, South Africa, Latin America, India. *South Africa (convenient transactions): pending approval.]
OLX GROUP - WHO WE ARE
We are a global product and tech group.
★ +30 countries
★ +35 offices
★ +5000 people
★ +1000 in Product & Tech
★ +350 MAU
★ +4B events per day
WE LOVE C2C EXCHANGES
1/4 of the Russian population
uses Avito every month
>80% of the secondhand car trade
in India goes through OLX
500,000 items are listed
every day at OLX Brazil
letgo: fastest growing app
to buy and sell locally in the US
OLX is the most visited
website in Romania
Listed on OLX every second:
2 houses
2 cars
3 fashion items
2.5 mobile phones
People spend 2x more time in OLX
apps than in competitors' apps
Sources: Avito (Mediacorp); Romania (Audience and Traffic Measurement, SATI); Brazil, India and OLX (BI).
Challenges we faced in Global BI
Performance
• Daily night job
• Slow response time for detailed data
Data democracy
• It's difficult for users to get numbers
• Needs access to the database (SQL knowledge)
• Can't access the database: it would compete with our daily job
• Hard to add more metrics to dashboards
Design
• Very complex architecture
• Low-flexibility solution
• Hard to maintain and to add new metrics
Triton Dashboards
Self Service functionality
Technical Requirements
Has very good response time
Is easy to manage and maintain
Scales horizontally
Has good integration with Tableau
Data pipeline
[Pipeline diagram. Labels shown: LISTINGS, REPLIES, ...; UNLOAD ORC; CREATE EXTERNAL TABLES; CREATE PARTITIONS; TRANSFORM THE DATA; Native ODBC]
Stay tuned! Rafael will talk more about this part
One of the challenges
[Diagram: star schema, one fact table surrounded by five dimensions]
➔ Apache Kylin supports only star or snowflake schemas, which means we can use only a single fact table
➔ We can't use separate cubes to aggregate measures from multiple facts in a single query
One of the challenges
[Diagram: three fact tables with their dimensions]
One of the challenges
One big fact table
[Diagram: three fact tables merged into one table surrounded by dimensions; cells that do not apply to a row are filled with NULLs]
Pros:
- we have a star schema
- supposed to be easy to build
Cons:
- the table is big
- difficult to change just part of the data
- cube creation takes a long time
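The "one big fact table" idea can be sketched in a few lines. This is only an illustration in plain Python, not the code shown in the talk: the column names (`day`, `country`, `listings`, `replies`) and the sample rows are hypothetical. Each source fact keeps its own measures, and every column a row does not have is padded with `None`, which plays the role of the NULLs in the diagram.

```python
# Hypothetical fact rows; column names are illustrative only.
listings = [
    {"day": "2019-10-01", "country": "PL", "listings": 120},
    {"day": "2019-10-01", "country": "PT", "listings": 45},
]
replies = [
    {"day": "2019-10-01", "country": "PL", "replies": 300},
]


def union_facts(*fact_tables):
    """Stack several fact tables into one wide table, padding gaps with None."""
    # Collect the union of all column names, preserving first-seen order.
    columns = []
    for table in fact_tables:
        for row in table:
            for col in row:
                if col not in columns:
                    columns.append(col)
    # Rewrite every row against the full column list; missing cells become None.
    wide = [
        {col: row.get(col) for col in columns}
        for table in fact_tables
        for row in table
    ]
    return columns, wide


columns, big_fact = union_facts(listings, replies)
for row in big_fact:
    print(row)
```

The sketch also makes the listed drawbacks visible: the wide table grows with every fact that is added, and changing one source fact means rebuilding the whole table.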
One of the challenges
Simplified Bus Architecture
[Diagram labels: Fact Table (x3), Dimension (x7), Bridge Table, Factless fact table, Common dimensions]
One of the challenges
[Diagram labels: common dimensions, facts, fact-specific dimensions]
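One possible reading of the common-dimensions layout is the following: each fact is first aggregated to the grain of the dimensions all facts share, and the aggregates are then lined up on those keys. The snippet below is a minimal sketch of that reading in plain Python, not the Spark solution from the talk. The dimension names in `COMMON_DIMS`, the fact-specific columns (`category`, `channel`) and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict

COMMON_DIMS = ("day", "country")  # hypothetical common dimensions


def aggregate(rows, measure):
    """Sum one measure per combination of the common dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[dim] for dim in COMMON_DIMS)
        totals[key] += row[measure]
    return dict(totals)


def drill_across(**aggregated):
    """Line up several aggregated facts on the common-dimension keys."""
    keys = sorted(set().union(*(agg.keys() for agg in aggregated.values())))
    return [
        {
            **dict(zip(COMMON_DIMS, key)),
            # A fact with no rows for this key contributes None.
            **{name: agg.get(key) for name, agg in aggregated.items()},
        }
        for key in keys
    ]


# "category" and "channel" stand in for fact-specific dimensions.
listings = [
    {"day": "2019-10-01", "country": "PL", "category": "cars", "listings": 100},
    {"day": "2019-10-01", "country": "PL", "category": "jobs", "listings": 20},
    {"day": "2019-10-01", "country": "PT", "category": "cars", "listings": 45},
]
replies = [
    {"day": "2019-10-01", "country": "PL", "channel": "chat", "replies": 300},
]

result = drill_across(
    listings=aggregate(listings, "listings"),
    replies=aggregate(replies, "replies"),
)
for row in result:
    print(row)
```

Fact-specific dimensions are dropped by the aggregation step, so a combined query of this kind can only be sliced by the common dimensions.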
Solution in Spark
Statistics
Kylin vs SSAS vs Redshift
Apache Kylin
• Cluster size: 2 x r4.2xlarge; 3 x r4.xlarge; 10 x r4.2xlarge ~ 2TB
• Data size: source table size 12 GB (ungzipped); cube size 1 GB
• Processing time: ~ 17 min
• Price: ~ 450 € / month (including whole infrastructure costs)
• Response time in Tableau: 3.7 sec
Microsoft SSAS
• Cluster size: s1 instance, memory 24 GB
• Data size: 2.6 GB
• Processing time: ~ 15 min
• Price: ~ 1232 € / month + additional infrastructure costs
• Response time in Tableau: 6.8 sec
Amazon Redshift
• Cluster size: 13 nodes x dc2.8xlarge (memory 244 GB, storage 2.56 TB SSD)
• Data size: 51 GB
• Processing time: ~ 2 min*
• Price: ~ 2000 € / month**
• Response time in Tableau: 14.1 sec
Input data: 100,000,000 rows, 1 big table
Model: 7 distinct counts, 12 sums, 8 small dimensions
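To make the comparison easier to read, the monthly prices and Tableau response times quoted on this slide can be expressed relative to Apache Kylin. The short Python sketch below only re-uses the slide's own numbers; the dictionary layout and the helper name `relative_to` are illustrative. The prices are not strictly like-for-like, since the Kylin figure includes the whole infrastructure while the SSAS figure excludes additional infrastructure costs.

```python
# Figures copied from the comparison slide (EUR per month, seconds in Tableau).
benchmarks = {
    "Apache Kylin": {"price_eur_month": 450, "tableau_seconds": 3.7},
    "Microsoft SSAS": {"price_eur_month": 1232, "tableau_seconds": 6.8},
    "Amazon Redshift": {"price_eur_month": 2000, "tableau_seconds": 14.1},
}


def relative_to(baseline, data):
    """Express every metric as a multiple of the baseline system's value."""
    base = data[baseline]
    return {
        name: {metric: round(values[metric] / base[metric], 2) for metric in base}
        for name, values in data.items()
    }


ratios = relative_to("Apache Kylin", benchmarks)
for name, values in ratios.items():
    print(name, values)
```

On these numbers, SSAS costs roughly 2.7 times as much and responds about 1.8 times slower than Kylin, while Redshift costs roughly 4.4 times as much and responds about 3.8 times slower.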
Handling Kylin infrastructure - requirements (extended)
Needs to:
● Be easily accessible by analysts and engineers.
○ Implies: user management integrated with the existing company-wide solution.
● Scale for multiple teams and projects, storage-wise and performance-wise.
○ Implies: user authorisation per project.
○ Also implies: diminish the "noisy neighbours" effect as much as possible, for both serving cube queries and building cubes.
● Be highly available, to serve every Tableau dashboard we have in the company.
○ Implies: components need to be resilient (self-healing, fault-tolerant).
● Be maintainable by the SRE team.
○ Implies: observability through existing monitoring, alerting and logging tools.
○ Also implies: leverage existing operational knowledge and simplify it as much as possible.
● Be cost-effective.
Key technical decisions
● HBase on HDFS in EMR (building HBase on EC2 would be great, but would take too long)
○ Limitation: can't read HDFS in S3 (EMRFS) from outside EMR (found out the hard way).
○ Single-AZ failure covered by backup/restore tool (internally developed).
○ Single-AZ deployment to minimize cross-AZ data traffic costs and increase speed.
● Spark on k8s for Kylin and Hive (running in client mode)
○ No Spark master management required.
○ We have our own custom Spark build (prevents jar conflicts, also required by Hive on Spark).
○ More cost-effective and scalable than running in EMR.
○ Leveraged existing k8s knowledge to operate and scale.
○ Limitation: no support for tolerations/pod anti-affinity (yet).
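As a rough illustration of what "Spark on k8s in client mode" can look like in configuration terms, the sketch below renders a handful of standard Spark properties as `kylin.properties` lines, assuming Kylin's `kylin.engine.spark-conf.` prefix is used to pass them to the build engine. It is not the configuration used by the speakers: the API server address, namespace, image name, executor count and driver host are all hypothetical placeholders.

```python
# Hypothetical values: replace the API server, namespace, image and host.
spark_conf = {
    "spark.master": "k8s://https://kubernetes.example.internal:443",
    "spark.submit.deployMode": "client",
    "spark.kubernetes.namespace": "kylin",
    "spark.kubernetes.container.image": "registry.example.internal/spark:custom",
    "spark.executor.instances": "4",
    # In client mode the executors connect back to the driver, so the driver
    # needs an address that is routable from the executor pods.
    "spark.driver.host": "kylin-job.kylin.svc.cluster.local",
}

# Assumed prefix for forwarding Spark properties from kylin.properties.
KYLIN_PREFIX = "kylin.engine.spark-conf."


def to_kylin_properties(conf):
    """Render Spark properties as kylin.properties lines, sorted by key."""
    return [f"{KYLIN_PREFIX}{key}={value}" for key, value in sorted(conf.items())]


for line in to_kylin_properties(spark_conf):
    print(line)
```

With `k8s://` as the master there is no standalone Spark master to run, which matches the "no Spark master management required" point above.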
Key technical decisions
● SAML federation w/ Okta for user management
○ OpenLDAP + custom-developed Okta sync script, which allows creating system users as well.
○ Limitation: we noticed some issues with non-admin users in the UI as of version 2.6.3 (couldn't select projects from the select box on top).
● Kylin in cluster mode
○ Memcached cluster for Tomcat session sharing (enables rolling deployments with zero downtime).
○ Uses ZooKeeper from HBase for job locking.
○ Running jobs won't impact Kylin query-serving performance, thanks to Spark on k8s.
Spark on k8s
[Diagrams: Spark on k8s setup, and bootstrapping in client mode]
HBase backup/restore
In-house developed tool for taking a live backup from an HBase cluster. Currently Kylin-specific.
Components:
● backup-agent
○ written in Python
○ runs in k8s as a single-replica deployment (to auto-restart in case of failures)
○ performs daily backups from HDFS to S3 using distcp
● restore-agent
○ restores the latest finished backup from an S3 bucket path
○ written in JRuby (easier to interact w/ HBase native libraries for restoring a distcp snapshot in a reliable way)
○ runs in HBase, currently as an EMR step
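The two agents' core steps can be sketched as follows. This is not the in-house tool, which is not public: the one-directory-per-day layout, the bucket name and the "finished" flag are assumptions for illustration. The only external command used is `hadoop distcp <source> <target>`, and the sketch builds the argument list without running it.

```python
from datetime import date


def distcp_command(hdfs_source, s3_bucket_path, day):
    """Build the argv for copying one day's backup from HDFS to S3."""
    # Assumed layout: one directory per backup day under the bucket path.
    target = f"{s3_bucket_path.rstrip('/')}/{day.isoformat()}"
    return ["hadoop", "distcp", hdfs_source, target]


def latest_finished_backup(backups):
    """Pick the most recent backup that is marked as finished.

    `backups` maps an ISO date string to True when that backup completed.
    ISO dates sort chronologically as plain strings.
    """
    finished = [day for day, done in backups.items() if done]
    return max(finished) if finished else None


command = distcp_command(
    "hdfs:///hbase", "s3a://example-bucket/kylin-backups/", date(2019, 10, 1)
)
print(" ".join(command))

# The newest backup is still running, so the restore falls back one day.
state = {"2019-10-01": True, "2019-10-02": True, "2019-10-03": False}
print(latest_finished_backup(state))
```

Restoring only from finished backups is what lets the restore-agent ignore a copy that was interrupted halfway.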
References
Kylin in cluster mode
http://kylin.apache.org/docs/install/kylin_cluster.html
SAML federation for user management
http://kylin.apache.org/docs/howto/howto_ldap_and_sso.html
Memcached driver for tomcat
https://github.com/magro/memcached-session-manager
Spark on K8s
https://spark.apache.org/docs/latest/running-on-kubernetes.html
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
Hive on Spark
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
Thank you! Questions?
Victor Fujihara
Data Engineering Manager
victor.Fujihara@olx.com
Mateusz Jerzyk
Senior Data Engineer
mateusz.jerzyk@olx.com
Rafael Correa
Senior System Engineer
rafael.correa@olx.com
OLX Group
www.olx.com
We are hiring!
www.joinolx.com
Roles:
• Data engineers
• Data scientists
• Java / Android / iOS developers
Locations: Berlin, Lisbon, Buenos Aires, Dubai, Barcelona, Moscow, Delhi
37

More Related Content

What's hot

"Smooth Operator" [Bay Area NewSQL meetup]
"Smooth Operator" [Bay Area NewSQL meetup]"Smooth Operator" [Bay Area NewSQL meetup]
"Smooth Operator" [Bay Area NewSQL meetup]
Kevin Xu
 
TiDB + Mobike by Kevin Xu (@kevinsxu)
TiDB + Mobike by Kevin Xu (@kevinsxu)TiDB + Mobike by Kevin Xu (@kevinsxu)
TiDB + Mobike by Kevin Xu (@kevinsxu)
Kevin Xu
 
Micheal Pershyn "Coljure 4 Big Data"
Micheal Pershyn "Coljure 4 Big Data"Micheal Pershyn "Coljure 4 Big Data"
Micheal Pershyn "Coljure 4 Big Data"
Lviv Startup Club
 
Under the hood: SkySQL monitoring
Under the hood: SkySQL monitoringUnder the hood: SkySQL monitoring
Under the hood: SkySQL monitoring
MariaDB plc
 
KAYA 3 Yr Plan Recommendation with Notes
KAYA 3 Yr Plan Recommendation with NotesKAYA 3 Yr Plan Recommendation with Notes
KAYA 3 Yr Plan Recommendation with Notes
Bryan Horton
 
Introducing TiDB Operator [Cologne, Germany]
Introducing TiDB Operator [Cologne, Germany]Introducing TiDB Operator [Cologne, Germany]
Introducing TiDB Operator [Cologne, Germany]
Kevin Xu
 
AthenaX - Unified Stream & Batch Processing using SQL at Uber, Zhenqiu Huang,...
AthenaX - Unified Stream & Batch Processing using SQL at Uber, Zhenqiu Huang,...AthenaX - Unified Stream & Batch Processing using SQL at Uber, Zhenqiu Huang,...
AthenaX - Unified Stream & Batch Processing using SQL at Uber, Zhenqiu Huang,...
Bowen Li
 
How Pixid dropped Oracle and went hybrid with MariaDB
How Pixid dropped Oracle and went hybrid with MariaDBHow Pixid dropped Oracle and went hybrid with MariaDB
How Pixid dropped Oracle and went hybrid with MariaDB
MariaDB plc
 
CCV: migrating our payment processing system to MariaDB
CCV: migrating our payment processing system to MariaDBCCV: migrating our payment processing system to MariaDB
CCV: migrating our payment processing system to MariaDB
MariaDB plc
 
Planes de ejecucion 2016
Planes de ejecucion 2016Planes de ejecucion 2016
Planes de ejecucion 2016
Enrique Catala Bañuls
 
FOSDEM MySQL and Friends Devroom
FOSDEM MySQL and Friends DevroomFOSDEM MySQL and Friends Devroom
FOSDEM MySQL and Friends Devroom
Morgan Tocker
 
TiDB as an HTAP Database
TiDB as an HTAP DatabaseTiDB as an HTAP Database
TiDB as an HTAP Database
PingCAP
 
ClustrixDB at Samsung Cloud
ClustrixDB at Samsung CloudClustrixDB at Samsung Cloud
ClustrixDB at Samsung Cloud
MariaDB plc
 
TiDB for Big Data
TiDB for Big DataTiDB for Big Data
TiDB for Big Data
PingCAP
 
Webinar slides: How to Migrate from Oracle DB to MariaDB
Webinar slides: How to Migrate from Oracle DB to MariaDBWebinar slides: How to Migrate from Oracle DB to MariaDB
Webinar slides: How to Migrate from Oracle DB to MariaDB
Severalnines
 
5 Steps to Smarter, Faster, Simpler Tableau Dashboards.
5 Steps to Smarter, Faster, Simpler Tableau Dashboards.5 Steps to Smarter, Faster, Simpler Tableau Dashboards.
5 Steps to Smarter, Faster, Simpler Tableau Dashboards.
Kinetica
 
TiDB DevCon 2020 Opening Keynote
TiDB DevCon 2020 Opening Keynote TiDB DevCon 2020 Opening Keynote
TiDB DevCon 2020 Opening Keynote
PingCAP
 
TiDB Introduction - San Francisco MySQL Meetup
TiDB Introduction - San Francisco MySQL MeetupTiDB Introduction - San Francisco MySQL Meetup
TiDB Introduction - San Francisco MySQL Meetup
Morgan Tocker
 
Best Practices for Scaling an InfluxEnterprise Cluster
Best Practices for Scaling an InfluxEnterprise ClusterBest Practices for Scaling an InfluxEnterprise Cluster
Best Practices for Scaling an InfluxEnterprise Cluster
InfluxData
 
TiDB Introduction - Boston MySQL Meetup Group
TiDB Introduction - Boston MySQL Meetup GroupTiDB Introduction - Boston MySQL Meetup Group
TiDB Introduction - Boston MySQL Meetup Group
Morgan Tocker
 

What's hot (20)

"Smooth Operator" [Bay Area NewSQL meetup]
"Smooth Operator" [Bay Area NewSQL meetup]"Smooth Operator" [Bay Area NewSQL meetup]
"Smooth Operator" [Bay Area NewSQL meetup]
 
TiDB + Mobike by Kevin Xu (@kevinsxu)
TiDB + Mobike by Kevin Xu (@kevinsxu)TiDB + Mobike by Kevin Xu (@kevinsxu)
TiDB + Mobike by Kevin Xu (@kevinsxu)
 
Micheal Pershyn "Coljure 4 Big Data"
Micheal Pershyn "Coljure 4 Big Data"Micheal Pershyn "Coljure 4 Big Data"
Micheal Pershyn "Coljure 4 Big Data"
 
Under the hood: SkySQL monitoring
Under the hood: SkySQL monitoringUnder the hood: SkySQL monitoring
Under the hood: SkySQL monitoring
 
KAYA 3 Yr Plan Recommendation with Notes
KAYA 3 Yr Plan Recommendation with NotesKAYA 3 Yr Plan Recommendation with Notes
KAYA 3 Yr Plan Recommendation with Notes
 
Introducing TiDB Operator [Cologne, Germany]
Introducing TiDB Operator [Cologne, Germany]Introducing TiDB Operator [Cologne, Germany]
Introducing TiDB Operator [Cologne, Germany]
 
AthenaX - Unified Stream & Batch Processing using SQL at Uber, Zhenqiu Huang,...
AthenaX - Unified Stream & Batch Processing using SQL at Uber, Zhenqiu Huang,...AthenaX - Unified Stream & Batch Processing using SQL at Uber, Zhenqiu Huang,...
AthenaX - Unified Stream & Batch Processing using SQL at Uber, Zhenqiu Huang,...
 
How Pixid dropped Oracle and went hybrid with MariaDB
How Pixid dropped Oracle and went hybrid with MariaDBHow Pixid dropped Oracle and went hybrid with MariaDB
How Pixid dropped Oracle and went hybrid with MariaDB
 
CCV: migrating our payment processing system to MariaDB
CCV: migrating our payment processing system to MariaDBCCV: migrating our payment processing system to MariaDB
CCV: migrating our payment processing system to MariaDB
 
Planes de ejecucion 2016
Planes de ejecucion 2016Planes de ejecucion 2016
Planes de ejecucion 2016
 
FOSDEM MySQL and Friends Devroom
FOSDEM MySQL and Friends DevroomFOSDEM MySQL and Friends Devroom
FOSDEM MySQL and Friends Devroom
 
TiDB as an HTAP Database
TiDB as an HTAP DatabaseTiDB as an HTAP Database
TiDB as an HTAP Database
 
ClustrixDB at Samsung Cloud
ClustrixDB at Samsung CloudClustrixDB at Samsung Cloud
ClustrixDB at Samsung Cloud
 
TiDB for Big Data
TiDB for Big DataTiDB for Big Data
TiDB for Big Data
 
Webinar slides: How to Migrate from Oracle DB to MariaDB
Webinar slides: How to Migrate from Oracle DB to MariaDBWebinar slides: How to Migrate from Oracle DB to MariaDB
Webinar slides: How to Migrate from Oracle DB to MariaDB
 
5 Steps to Smarter, Faster, Simpler Tableau Dashboards.
5 Steps to Smarter, Faster, Simpler Tableau Dashboards.5 Steps to Smarter, Faster, Simpler Tableau Dashboards.
5 Steps to Smarter, Faster, Simpler Tableau Dashboards.
 
TiDB DevCon 2020 Opening Keynote
TiDB DevCon 2020 Opening Keynote TiDB DevCon 2020 Opening Keynote
TiDB DevCon 2020 Opening Keynote
 
TiDB Introduction - San Francisco MySQL Meetup
TiDB Introduction - San Francisco MySQL MeetupTiDB Introduction - San Francisco MySQL Meetup
TiDB Introduction - San Francisco MySQL Meetup
 
Best Practices for Scaling an InfluxEnterprise Cluster
Best Practices for Scaling an InfluxEnterprise ClusterBest Practices for Scaling an InfluxEnterprise Cluster
Best Practices for Scaling an InfluxEnterprise Cluster
 
TiDB Introduction - Boston MySQL Meetup Group
TiDB Introduction - Boston MySQL Meetup GroupTiDB Introduction - Boston MySQL Meetup Group
TiDB Introduction - Boston MySQL Meetup Group
 

Similar to Apache kylin meetup berlin olx v1.0

PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)
Stratebi
 
Rakuten Ichiba_Rakuten Technology Conference 2016
Rakuten Ichiba_Rakuten Technology Conference 2016Rakuten Ichiba_Rakuten Technology Conference 2016
Rakuten Ichiba_Rakuten Technology Conference 2016
Rakuten Group, Inc.
 
Free GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOpsFree GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOps
Weaveworks
 
Bandwidth: Use Cases for Elastic Cloud on Kubernetes
Bandwidth: Use Cases for Elastic Cloud on Kubernetes Bandwidth: Use Cases for Elastic Cloud on Kubernetes
Bandwidth: Use Cases for Elastic Cloud on Kubernetes
Elasticsearch
 
OpenEBS hangout #4
OpenEBS hangout #4OpenEBS hangout #4
OpenEBS hangout #4
OpenEBS
 
Google Cloud - Stand Out Features
Google Cloud - Stand Out FeaturesGoogle Cloud - Stand Out Features
Google Cloud - Stand Out Features
GDG Cloud Bengaluru
 
Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish style
Lars Albertsson
 
Engage 2019 - SUSE Linux and Container update
Engage 2019  - SUSE Linux and Container updateEngage 2019  - SUSE Linux and Container update
Engage 2019 - SUSE Linux and Container update
Christian Holsing
 
DevOps Days Boston 2017: Real-world Kubernetes for DevOps
DevOps Days Boston 2017: Real-world Kubernetes for DevOpsDevOps Days Boston 2017: Real-world Kubernetes for DevOps
DevOps Days Boston 2017: Real-world Kubernetes for DevOps
Ambassador Labs
 
iFood on Delivering 100 Million Events a Month to Restaurants with Scylla
iFood on Delivering 100 Million Events a Month to Restaurants with ScyllaiFood on Delivering 100 Million Events a Month to Restaurants with Scylla
iFood on Delivering 100 Million Events a Month to Restaurants with Scylla
ScyllaDB
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdf
Lars Albertsson
 
Gimel at Teradata Analytics Universe 2018
Gimel at Teradata Analytics Universe 2018Gimel at Teradata Analytics Universe 2018
Gimel at Teradata Analytics Universe 2018
Romit Mehta
 
2018: State of the Dolphin, MySQL Keynote at Percona Live Europe 2018, Frankf...
2018: State of the Dolphin, MySQL Keynote at Percona Live Europe 2018, Frankf...2018: State of the Dolphin, MySQL Keynote at Percona Live Europe 2018, Frankf...
2018: State of the Dolphin, MySQL Keynote at Percona Live Europe 2018, Frankf...
Geir Høydalsvik
 
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
Hiram Fleitas León
 
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Lucas Jellema
 
Oow2016 review-db-dev-bigdata-BI
Oow2016 review-db-dev-bigdata-BIOow2016 review-db-dev-bigdata-BI
Big data & frameworks: no book for you anymore
Big data & frameworks: no book for you anymoreBig data & frameworks: no book for you anymore
Big data & frameworks: no book for you anymore
Stfalcon Meetups
 
Big data & frameworks: no book for you anymore.
Big data & frameworks: no book for you anymore.Big data & frameworks: no book for you anymore.
Big data & frameworks: no book for you anymore.
Roman Nikitchenko
 
Data Architecture at Vente-Exclusive.com - TOTM Exellys
Data Architecture at Vente-Exclusive.com - TOTM ExellysData Architecture at Vente-Exclusive.com - TOTM Exellys
Data Architecture at Vente-Exclusive.com - TOTM Exellys
Wout Scheepers
 
Meetup 2020 - Back to the Basics part 101 : IaC
Meetup 2020 - Back to the Basics part 101 : IaCMeetup 2020 - Back to the Basics part 101 : IaC
Meetup 2020 - Back to the Basics part 101 : IaC
DamienCarpy
 

Similar to Apache kylin meetup berlin olx v1.0 (20)

PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)
 
Rakuten Ichiba_Rakuten Technology Conference 2016
Rakuten Ichiba_Rakuten Technology Conference 2016Rakuten Ichiba_Rakuten Technology Conference 2016
Rakuten Ichiba_Rakuten Technology Conference 2016
 
Free GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOpsFree GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOps
 
Bandwidth: Use Cases for Elastic Cloud on Kubernetes
Bandwidth: Use Cases for Elastic Cloud on Kubernetes Bandwidth: Use Cases for Elastic Cloud on Kubernetes
Bandwidth: Use Cases for Elastic Cloud on Kubernetes
 
OpenEBS hangout #4
OpenEBS hangout #4OpenEBS hangout #4
OpenEBS hangout #4
 
Google Cloud - Stand Out Features
Google Cloud - Stand Out FeaturesGoogle Cloud - Stand Out Features
Google Cloud - Stand Out Features
 
Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish style
 
Engage 2019 - SUSE Linux and Container update
Engage 2019  - SUSE Linux and Container updateEngage 2019  - SUSE Linux and Container update
Engage 2019 - SUSE Linux and Container update
 
DevOps Days Boston 2017: Real-world Kubernetes for DevOps
DevOps Days Boston 2017: Real-world Kubernetes for DevOpsDevOps Days Boston 2017: Real-world Kubernetes for DevOps
DevOps Days Boston 2017: Real-world Kubernetes for DevOps
 
iFood on Delivering 100 Million Events a Month to Restaurants with Scylla
iFood on Delivering 100 Million Events a Month to Restaurants with ScyllaiFood on Delivering 100 Million Events a Month to Restaurants with Scylla
iFood on Delivering 100 Million Events a Month to Restaurants with Scylla
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdf
 
Gimel at Teradata Analytics Universe 2018
Gimel at Teradata Analytics Universe 2018Gimel at Teradata Analytics Universe 2018
Gimel at Teradata Analytics Universe 2018
 
2018: State of the Dolphin, MySQL Keynote at Percona Live Europe 2018, Frankf...
2018: State of the Dolphin, MySQL Keynote at Percona Live Europe 2018, Frankf...2018: State of the Dolphin, MySQL Keynote at Percona Live Europe 2018, Frankf...
2018: State of the Dolphin, MySQL Keynote at Percona Live Europe 2018, Frankf...
 
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
 
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
 
Oow2016 review-db-dev-bigdata-BI
Oow2016 review-db-dev-bigdata-BIOow2016 review-db-dev-bigdata-BI
Oow2016 review-db-dev-bigdata-BI
 
Big data & frameworks: no book for you anymore
Big data & frameworks: no book for you anymoreBig data & frameworks: no book for you anymore
Big data & frameworks: no book for you anymore
 
Big data & frameworks: no book for you anymore.
Big data & frameworks: no book for you anymore.Big data & frameworks: no book for you anymore.
Big data & frameworks: no book for you anymore.
 
Data Architecture at Vente-Exclusive.com - TOTM Exellys
Data Architecture at Vente-Exclusive.com - TOTM ExellysData Architecture at Vente-Exclusive.com - TOTM Exellys
Data Architecture at Vente-Exclusive.com - TOTM Exellys
 
Meetup 2020 - Back to the Basics part 101 : IaC
Meetup 2020 - Back to the Basics part 101 : IaCMeetup 2020 - Back to the Basics part 101 : IaC
Meetup 2020 - Back to the Basics part 101 : IaC
 

Recently uploaded

一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
ecqow
 
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
ydzowc
 
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.pptUnit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
KrishnaveniKrishnara1
 
Welding Metallurgy Ferrous Materials.pdf
Welding Metallurgy Ferrous Materials.pdfWelding Metallurgy Ferrous Materials.pdf
Welding Metallurgy Ferrous Materials.pdf
AjmalKhan50578
 
International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...
gerogepatton
 
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
171ticu
 
john krisinger-the science and history of the alcoholic beverage.pptx
john krisinger-the science and history of the alcoholic beverage.pptxjohn krisinger-the science and history of the alcoholic beverage.pptx
john krisinger-the science and history of the alcoholic beverage.pptx
Madan Karki
 
cnn.pptx Convolutional neural network used for image classication
cnn.pptx Convolutional neural network used for image classicationcnn.pptx Convolutional neural network used for image classication
cnn.pptx Convolutional neural network used for image classication
SakkaravarthiShanmug
 
Software Quality Assurance-se412-v11.ppt
Software Quality Assurance-se412-v11.pptSoftware Quality Assurance-se412-v11.ppt
Software Quality Assurance-se412-v11.ppt
TaghreedAltamimi
 
Certificates - Mahmoud Mohamed Moursi Ahmed
Certificates - Mahmoud Mohamed Moursi AhmedCertificates - Mahmoud Mohamed Moursi Ahmed
Certificates - Mahmoud Mohamed Moursi Ahmed
Mahmoud Morsy
 
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have oneISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
Las Vegas Warehouse
 
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student MemberIEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
VICTOR MAESTRE RAMIREZ
 
Generative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of contentGenerative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of content
Hitesh Mohapatra
 
CEC 352 - SATELLITE COMMUNICATION UNIT 1
CEC 352 - SATELLITE COMMUNICATION UNIT 1CEC 352 - SATELLITE COMMUNICATION UNIT 1
CEC 352 - SATELLITE COMMUNICATION UNIT 1
PKavitha10
 
Null Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAMNull Bangalore | Pentesters Approach to AWS IAM
Null Bangalore | Pentesters Approach to AWS IAM
Divyanshu
 
The Python for beginners. This is an advance computer language.
The Python for beginners. This is an advance computer language.The Python for beginners. This is an advance computer language.
The Python for beginners. This is an advance computer language.
sachin chaurasia
 
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTCHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
jpsjournal1
 
Transformers design and coooling methods
Transformers design and coooling methodsTransformers design and coooling methods
Transformers design and coooling methods
Roger Rozario
 
Seminar on Distillation study-mafia.pptx
Seminar on Distillation study-mafia.pptxSeminar on Distillation study-mafia.pptx
Seminar on Distillation study-mafia.pptx
Madan Karki
 
Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...
Prakhyath Rai
 

Recently uploaded (20)

一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
 
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
 
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.pptUnit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
 
Welding Metallurgy Ferrous Materials.pdf
Welding Metallurgy Ferrous Materials.pdfWelding Metallurgy Ferrous Materials.pdf
Welding Metallurgy Ferrous Materials.pdf
 
International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...
 
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
 
john krisinger-the science and history of the alcoholic beverage.pptx

Apache kylin meetup berlin olx v1.0

  • 2. Agenda
      ● 19:15 - Leveraging analytics at OLX Group with Kylin (Victor Fujihara, Data Engineering Manager; Mateusz Jerzyk, Senior Data Engineer; Rafael Correa, Senior System Engineer)
      ● 20:00 - Apache Kylin: Boost your SQLs on an extremely large dataset (George Ni, Apache Kylin committer)
  • 3. Table of contents ● Introduction to OLX Group ● What problem were we trying to solve? ● Data Engineering with Kylin ● Handling Kylin infrastructure
  • 4. THE OLX GROUP IS PART OF NASPERS. A collection of leading companies and exciting businesses! US$107Bn. Founded 1915, South Africa. Market Cap: $100B
  • 5. THE OLX GROUP – NASPERS CLASSIFIEDS. We improve lives by bringing people together for win-win exchanges
  • 6. OLX GROUP TODAY: THE WORLD'S #1 CLASSIFIEDS BUSINESS
      Segments: HORIZONTALS; REAL ESTATE VERTICALS; OTHER VERTICALS; CAR VERTICALS; CONVENIENT TRANSACTIONS
      Market labels (brand logos omitted): global; global; app-only; Russia; UAE; Africa and Philippines; Russia; Portugal; Poland; Romania, Egypt; Furniture, Europe; Heavy machinery, global; Services, Poland; Poland; South Africa; Romania; Portugal; Global; UAE; Latin America; South Africa*; Jobs, India; Jobs, Poland
      *Pending approval
  • 7. OLX GROUP - WHO WE ARE. We are a global product and tech group. ★ +30 countries ★ +35 offices ★ +5000 people ★ +1000 in Product & Tech ★ +350 MAU ★ +4B events per day
  • 8. WE LOVE C2C EXCHANGES
      ● 1/4 of the Russian population uses Avito every month
      ● >80% of secondhand car trade in India is through OLX
      ● 500,000 items are listed every day at OLX Brazil
      ● letgo: fastest-growing app to buy and sell locally in the US
      ● OLX is the most visited website in Romania
      ● People spend 2x more time in OLX apps versus competitors
      ● Listed on OLX every second: 2 houses, 2 cars, 3 fashion items, 2.5 mobile phones
      Sources: Avito (Mediacorp); Romania (Audience and Traffic Measurement (SATI)); Brazil, India and OLX (BI).
  • 9. Table of contents ● Introduction to OLX Group ● What problem were we trying to solve? ● Data Engineering with Kylin ● Handling Kylin infrastructure
  • 10. Challenges we faced in Global BI (Performance, Data democracy, Design)
      • Daily night job
      • Slow response time for detailed data
      • It’s difficult for users to get numbers
      • Needs access to database (SQL knowledge)
      • Can’t access the database. It will compete with our daily job
      • Hard to add more metrics to dashboards
      • Very complex architecture
      • Low flexibility solution
      • Hard to maintain and to add new metrics
  • 13. Technical Requirements
      ● Has very good response time
      ● Is easy to manage and maintain
      ● Scales horizontally
      ● Has good integration with Tableau
  • 14. Table of contents ● Introduction to OLX Group ● What problem were we trying to solve? ● Data Engineering with Kylin ● Handling Kylin infrastructure
  • 15. Data pipeline (diagram labels: CREATE PARTITIONS; TRANSFORM THE DATA; CREATE EXTERNAL TABLES; Native ODBC; LISTINGS, REPLIES, ...; UNLOAD ORC). Stay tuned! Rafael will talk more about this part
  • 16. One of the challenges (diagram: one fact table surrounded by five dimensions)
      ➔ Apache Kylin supports only star or snowflake schemas, which means we can use only a single fact table
      ➔ We can’t use separate cubes to aggregate measures from multiple facts in a single query
  • 17. One of the challenges (diagram: three fact tables and seven dimensions)
  • 18. One of the challenges: one big fact table (diagram: three fact tables combined into one, padded with NULLS, plus seven dimensions)
      Pros:
      - we have star schema
      - supposed to be easy to build
      Cons:
      - the size of the table is big
      - difficult to change just part of the data
      - cube creation process takes a lot of time
  • 19. One of the challenges: Simplified Bus Architecture (diagram labels: three fact tables; bridge table; factless fact table; common dimensions; seven dimensions)
  • 20. One of the challenges (diagram labels: common dimensions; facts; fact-specific dimensions; facts)
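
The trade-off on slides 18 and 19 can be made concrete with a small sketch. The Python below is an illustration only, not code from the talk: the `listings` and `replies` rows, their column names, and the helpers `merge_facts` and `totals_by` are all hypothetical. It unions two fact tables into one wide table, fills the columns a fact does not have with `None` (the NULLS on slide 18), and then aggregates both measures over the common dimensions.

```python
from collections import defaultdict

# Hypothetical fact rows; names and values are illustrative, not from the talk.
listings = [{"date": "2019-10-01", "country": "PL", "category": "cars", "listings": 120}]
replies = [{"date": "2019-10-01", "country": "PL", "channel": "chat", "replies": 45}]


def merge_facts(*facts):
    """Union several fact tables into one wide table, padding absent columns with None."""
    columns = []
    for rows in facts:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
    merged = [{name: row.get(name) for name in columns} for rows in facts for row in rows]
    return columns, merged


def totals_by(rows, dims, measures):
    """Sum measures per combination of common dimensions, treating None as 0."""
    out = defaultdict(lambda: dict.fromkeys(measures, 0))
    for row in rows:
        key = tuple(row[d] for d in dims)
        for m in measures:
            out[key][m] += row[m] or 0
    return dict(out)


columns, big_fact = merge_facts(listings, replies)
totals = totals_by(big_fact, ["date", "country"], ["listings", "replies"])
```

Every row carries `None` for the dimensions and measures of the other facts, which is why the table grows quickly and why changing one fact means rebuilding the whole thing.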
  • 25. Kylin vs SSAS vs Redshift
      Input data: 100,000,000 rows, 1 big table. Model: 7 distinct counts, 12 sums, 8 small dimensions
      ● Apache Kylin
        ○ Cluster size: 2 x r4.2xlarge; 3 x r4.xlarge; 10 x r4.2xlarge ~ 2TB
        ○ Data size: source table size 12GB (ungzipped); cube size 1 GB
        ○ Processing time: ~ 17 min
        ○ Price: ~ 450 € / month (including whole infrastructure costs)
        ○ Response time in Tableau: 3.7 sec
      ● Microsoft SSAS
        ○ Cluster size: s1 instance, memory 24 GB
        ○ Data size: 2.6 GB
        ○ Processing time: ~ 15 min
        ○ Price: ~ 1232 € / month + additional infrastructure costs
        ○ Response time in Tableau: 6.8 sec
      ● Amazon Redshift
        ○ Cluster size: 13 nodes x dc2.8xlarge (memory 244 GB, storage 2.56TB SSD)
        ○ Data size: 51 GB
        ○ Processing time: ~ 2 min*
        ○ Price: ~ 2000 € / month**
        ○ Response time in Tableau: 14.1 sec
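
To read the comparison at a glance, the quoted prices and Tableau response times can be turned into ratios against the Kylin setup. This is a worked-arithmetic sketch using only the figures on slide 25. The names `results`, `baseline` and `ratios` are ours. The SSAS price excludes its "additional infrastructure costs", and the Redshift footnotes (* and **) are not captured in the transcript, so the ratios are indicative only.

```python
# Figures copied from slide 25; nothing here is newly measured.
results = {
    "Apache Kylin": {"price_eur": 450, "response_s": 3.7},
    "Microsoft SSAS": {"price_eur": 1232, "response_s": 6.8},
    "Amazon Redshift": {"price_eur": 2000, "response_s": 14.1},
}

baseline = results["Apache Kylin"]

# Ratio of each system's monthly price and Tableau response time to Kylin's.
ratios = {
    name: {
        "price_x": round(r["price_eur"] / baseline["price_eur"], 2),
        "response_x": round(r["response_s"] / baseline["response_s"], 2),
    }
    for name, r in results.items()
}

for name, r in ratios.items():
    print(f"{name}: {r['price_x']}x price, {r['response_x']}x response time")
```

On these numbers SSAS and Redshift come out at roughly 2.7x and 4.4x the monthly price of the Kylin setup, and about 1.8x and 3.8x its response time in Tableau.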
  • 26. Table of contents ● Introduction to OLX Group ● What problem were we trying to solve? ● Data Engineering with Kylin ● Handling Kylin infrastructure
  • 27. Handling Kylin infrastructure - requirements (extended). Needs to:
      ● Be easily accessible by analysts and engineers.
        ○ Implies: user management integrated with existing company-wide solution.
      ● Scale for multiple teams and projects, storage wise and performance wise.
        ○ Implies: user authorisation per project
        ○ Also implies: diminish "noisy neighbours" effect as much as possible, for both serving cube queries and building cubes.
      ● Be highly available, to serve every Tableau dashboard we have in the company.
        ○ Implies: components need to be resilient (self healing, fault tolerant)
      ● Be maintainable by the SRE team.
        ○ Implies: observability through existing monitoring, alerting and logging tools
        ○ Also implies: leverage existing operational knowledge and simplify it as much as possible.
      ● Be cost effective.
  • 30. Key technical decisions
      ● HBase on HDFS in EMR (building HBase on EC2 would be great, but would take too long)
        ○ Limitation: Can't read HDFS in S3 (EMRFS) from outside EMR (found out the hard way).
        ○ Single-AZ failure covered by backup/restore tool (internally developed).
        ○ Single-AZ deployment to minimize cross-AZ data traffic costs and increase speed.
      ● Spark on k8s for Kylin and Hive (running in client mode)
        ○ No Spark master management required.
        ○ We have our own custom Spark build (prevents jar conflicts, also required by Hive on Spark).
        ○ More cost-effective and scalable than running in EMR.
        ○ Leveraged existing k8s knowledge to operate and scale.
        ○ Limitation: no support for tolerations/pod anti-affinity (yet).
  • 31. Key technical decisions
      ● SAML federation w/ Okta for user management
        ○ OpenLDAP + custom-developed Okta sync script which allows creating system users as well.
        ○ Limitation: We noticed some issues with non-admin users in the UI as of version 2.6.3 (couldn't select projects from the select box on top).
      ● Kylin in cluster mode
        ○ Memcached cluster for Tomcat session sharing (enables rolling deployments with zero downtime).
        ○ Uses ZooKeeper from HBase for job locking.
        ○ Running jobs won't impact Kylin query serving performance, due to Spark on k8s.
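
As a rough illustration of "Kylin in cluster mode" with ZooKeeper-based job locking, the sketch below renders a `kylin.properties` fragment per node. The property names are the ones we recall from the Kylin cluster-mode documentation listed on the References slide; check them against the Kylin version in use. The helper `kylin_node_properties` and the host names are hypothetical, and this is not the configuration used at OLX.

```python
def kylin_node_properties(mode, servers):
    """Render a kylin.properties fragment for one node of a Kylin cluster.

    mode: "all", "job" or "query"; servers: list of "host:port" strings.
    """
    if mode not in {"all", "job", "query"}:
        raise ValueError(f"unknown Kylin server mode: {mode}")
    props = {
        "kylin.server.mode": mode,
        "kylin.server.cluster-servers": ",".join(servers),
    }
    if mode in {"all", "job"}:
        # Job-engine settings as recalled from the Kylin docs (verify before
        # use): scheduler "2" with a ZooKeeper-backed job lock.
        props["kylin.job.scheduler.default"] = "2"
        props["kylin.job.lock"] = "org.apache.kylin.storage.hbase.util.ZookeeperJobLock"
    return "\n".join(f"{key}={value}" for key, value in props.items())


# Hypothetical hosts, for illustration only.
servers = ["kylin-0.example.internal:7070", "kylin-1.example.internal:7070"]
print(kylin_node_properties("query", servers))
print(kylin_node_properties("job", servers))
```

Splitting nodes into query-only and job-capable roles is one way to limit the "noisy neighbours" effect between cube builds and query serving; the session sharing through Memcached is configured in Tomcat and is not shown here.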
  • 33. Spark on k8s. When bootstrapping (client mode): (diagram)
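
A minimal sketch of what a client-mode submission against Kubernetes involves, assuming plain `spark-submit` and the standard Spark-on-Kubernetes settings. The helper `spark_client_mode_args` and every value passed to it (API server URL, image, namespace, driver host) are hypothetical. The slides do not show how Kylin or Hive actually pass these settings, nor what the custom Spark build changes.

```python
def spark_client_mode_args(api_server, image, namespace, driver_host, executors):
    """Build a spark-submit argument list for Kubernetes in client mode."""
    conf = {
        # Executor pods are created from this image, in this namespace.
        "spark.kubernetes.container.image": image,
        "spark.kubernetes.namespace": namespace,
        # In client mode the driver runs where spark-submit runs, so executors
        # need an address at which they can reach it.
        "spark.driver.host": driver_host,
        "spark.executor.instances": str(executors),
    }
    args = ["spark-submit", "--master", f"k8s://{api_server}", "--deploy-mode", "client"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical values, for illustration only.
args = spark_client_mode_args(
    api_server="https://kubernetes.example.internal:6443",
    image="registry.example.internal/spark:custom",
    namespace="kylin",
    driver_host="kylin-job-0.kylin.svc.cluster.local",
    executors=4,
)
print(" ".join(args))
```

In client mode there is no Spark master to manage: the submitting process is the driver and asks the Kubernetes API for executor pods directly, which matches the "no Spark master management required" point on slide 30.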
  • 34. HBase backup/restore. In-house developed tool for taking a live backup from an HBase cluster. Currently Kylin-specific. Components:
      ● backup-agent
        ○ written in Python
        ○ runs in k8s as a single-replica deployment (to auto-restart in case of failures)
        ○ performs daily backups from HDFS to S3 using distcp
      ● restore-agent
        ○ restores the latest finished backup from an S3 bucket path.
        ○ written in JRuby (easier to interact w/ HBase native libraries for restoring a distcp snapshot in a reliable way).
        ○ runs in HBase, currently as an EMR step
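
The two halves of the tool can be sketched as follows. This is not the in-house code: the function names, the paths, the dated folder layout and the `_FINISHED` marker object are assumptions made for illustration, and the step that makes the backup consistent on a live cluster is omitted. The only real command used is `hadoop distcp <source> <destination>`.

```python
from datetime import date


def distcp_command(hdfs_src, s3_bucket_path, day):
    """Build the distcp call that copies one day's backup from HDFS to S3."""
    dest = f"{s3_bucket_path.rstrip('/')}/{day.isoformat()}"
    return ["hadoop", "distcp", hdfs_src, dest]


def latest_finished_backup(keys, marker="_FINISHED"):
    """Return the newest backup prefix that has a completion marker, else None.

    Assumes backups live under ISO-dated prefixes, so lexical order equals
    chronological order, and that a marker object is written when a copy ends.
    """
    finished = sorted(
        key[: -len(marker)].rstrip("/") for key in keys if key.endswith("/" + marker)
    )
    return finished[-1] if finished else None


# Hypothetical paths and object keys, for illustration only.
cmd = distcp_command("hdfs:///backup/kylin", "s3://example-bucket/kylin-backups/", date(2019, 10, 2))
keys = [
    "kylin-backups/2019-10-01/_FINISHED",
    "kylin-backups/2019-10-01/part-0",
    "kylin-backups/2019-10-02/part-0",  # still copying: no marker yet
]
latest = latest_finished_backup(keys)
```

Restoring only from a prefix with a completion marker is one simple way to honour "restores the latest finished backup": a half-copied backup from the current day is skipped in favour of the previous complete one.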
  • 35. References
      ● Kylin in cluster mode: http://kylin.apache.org/docs/install/kylin_cluster.html
      ● SAML federation for user management: http://kylin.apache.org/docs/howto/howto_ldap_and_sso.html
      ● Memcached driver for Tomcat: https://github.com/magro/memcached-session-manager
      ● Spark on K8s: https://spark.apache.org/docs/latest/running-on-kubernetes.html and https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
      ● Hive on Spark: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
  • 36. Thank you! Questions?
      ● Victor Fujihara, Data Engineering Manager, victor.Fujihara@olx.com
      ● Mateusz Jerzyk, Senior Data Engineer, mateusz.jerzyk@olx.com
      ● Rafael Correa, Senior System Engineer, rafael.correa@olx.com
      OLX Group: www.olx.com
      We are hiring! www.joinolx.com
      Roles: Data engineers; Data scientists; Java / Android / iOS developers
      Locations: Berlin, Lisbon, Buenos Aires, Dubai, Barcelona, Moscow, Delhi