Data analytics at a PB scale
200 researchers using just Presto and a data-lake
Or Koren | Head of Data @ ironSource
About Me ● Married ● 35 ● Tel Aviv ● Coding
Agenda:
ironSource overview
The past: 2016-2017
The present: 2018-2019
The future: 2020-2021
ironSource overview
[Map of office locations with headcount labels]
Offices: San Francisco and New York (United States) ● London (United Kingdom) ● Berlin (Germany) ● Kiev (Ukraine) ● Tel Aviv (Israel) ● Bangalore (India) ● Hong Kong (China) ● Tokyo (Japan) ● Seoul (South Korea) ● Beijing, Shanghai, Shenzhen (China)
Headcount labels on the map, in extraction order: 53, 7, 11, 1, 39, 5, 3, 1, 5, 624, 30
ironSource Overview
● Established: Sep. 2010
● Acquisitions to date: 8
● Employees: 779
● R&D employees: 395
ironSource Solutions & Products
Developer Solutions
In-app advertising network and mediation platform for app developers
● Products & platforms: ironSource Mediation, In-App Advertising Network
Enterprise Solutions
Engagement platform for Carriers & OEMs
● Products & platforms: ironSource Aura
Digital Solutions
Software delivery platform and B2C security products
● Products & platforms: Delivery & Ad Monetization Platform, Security products
*
*In advanced negotiations
The past
Our old data architecture
● 10 Redshift clusters
● 5 RDS clusters
● 1000+ ETLs
● 1 Tableau
● Hard to scale
● Hard to maintain
● Hard to work with
● Limited data
● Expensive
Our data scientist…
To the future
● Lifetime data
● Fast SQL
● Easy scale
● Data science
● Open source
The present
Data Lake
Parquet
● File-based
● Open source
● Column-oriented
● Stored in an S3 bucket
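The "column oriented" point can be illustrated with a toy sketch. This is not the Parquet format itself (no encoding, compression, or row groups), and the record fields are made up for illustration; it only shows why an aggregate over one column touches fewer values when data is laid out by column.

```python
# Toy illustration of row-oriented vs column-oriented layout.
# Field names and values are hypothetical, not ironSource data.
rows = [
    {"user_id": 1, "country": "IL", "revenue": 1.5},
    {"user_id": 2, "country": "US", "revenue": 0.0},
    {"user_id": 3, "country": "IL", "revenue": 2.5},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: summing one field still walks every field of every row.
values_touched_row_layout = sum(len(record) for record in rows)

# Column layout: only the "revenue" column is read.
revenue = columns["revenue"]
values_touched_column_layout = len(revenue)

total_revenue = sum(revenue)
print(total_revenue, values_touched_row_layout, values_touched_column_layout)
```

With wide tables and queries that select a few columns, the gap between the two counts grows with the number of columns that are never read.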
Hive
Apache Hive is a data warehouse project built on top of
Apache Hadoop to provide data query and analysis.
● One place to rule them all
● Hadoop Ecosystem
● Presto
● Spark
● Athena
Data Lake
Presto & Qubole
Qubole delivers a Self-Service Platform for Big Data
Analytics built on Amazon Web Services, Microsoft
and Google Clouds.
Scalable Clusters
Qubole scales clusters up and down based on
the execution plans of the queries.
Spots
Maintenance & Versions
Qubole takes care of new versions & 24/7 support
For Every Query
Auto scaling demo
Presto UI
Our Volume
● 500 TB daily scan (from S3)
● 70K daily queries over Presto
● 200 users
● 500 dashboards
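To put these figures in proportion, the averages they imply can be worked out directly. The sketch below uses only the numbers on the slide; decimal units (1 TB = 1000 GB) are an assumption, and the averages say nothing about how scans are actually distributed across queries, users, or dashboards.

```python
# Figures taken from the slide.
DAILY_SCAN_TB = 500
DAILY_QUERIES = 70_000
USERS = 200

GB_PER_TB = 1000  # assumption: decimal units

# Average data scanned per query, in GB.
avg_scan_gb_per_query = DAILY_SCAN_TB * GB_PER_TB / DAILY_QUERIES

# Queries per user per day; a plain ratio, since many queries
# are likely issued by dashboards rather than typed by people.
avg_queries_per_user = DAILY_QUERIES / USERS

# Average query arrival rate over a 24-hour day.
avg_queries_per_second = DAILY_QUERIES / (24 * 60 * 60)

print(round(avg_scan_gb_per_query, 2))
print(avg_queries_per_user)
print(round(avg_queries_per_second, 2))
```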
Our data scientist …
Our data architecture
Before:
● 10 Redshift clusters
● 5 RDS clusters
● 1000+ ETLs
● 1 Tableau
● Hard to scale
● Hard to maintain
● Hard to work with
● Limited data
● Expensive
After:
● 1 Redshift cluster
● 0 RDS clusters
● 300 ETLs
● 1 Tableau & 1 Redash
● Costs reduced by 50%
● More agile for the business
+
Our new data architecture
The future
● Replace 90% of our ETLs with ELTs
● Help our data science team by making the logic clearer,
reducing their work time by 80%
● Keep raw data without any manipulation
● Reduce ML model deployment time by 50%
● No ETL time, no schedule
The New ETL is ELT
Extract, Load, Transform.
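A minimal sketch of the ELT idea, using Python's built-in sqlite3 module as a stand-in for the query engine. The talk's stack is Presto over a data lake; the table, the event fields, and the `json_field` helper below are hypothetical and exist only for this example. Raw payloads are loaded untouched, and the transformation happens inside the query when it runs.

```python
import json
import sqlite3

# Hypothetical raw events; field names are made up for illustration.
raw_events = [
    '{"user": "a", "event": "impression", "revenue": 0.0}',
    '{"user": "a", "event": "click", "revenue": 0.4}',
    '{"user": "b", "event": "click", "revenue": 0.6}',
]

conn = sqlite3.connect(":memory:")

# Extract + Load: store the payloads exactly as they arrived.
conn.execute("CREATE TABLE raw_events (payload TEXT)")
conn.executemany("INSERT INTO raw_events VALUES (?)", [(e,) for e in raw_events])


def json_field(payload, key):
    """Hypothetical helper: pull one field out of a raw JSON payload."""
    return json.loads(payload)[key]


# Expose the helper to SQL so parsing happens at query time.
conn.create_function("json_field", 2, json_field)

# Transform: parse, filter, and aggregate when the question is asked.
revenue_by_user = conn.execute(
    """
    SELECT json_field(payload, 'user') AS user_id,
           SUM(json_field(payload, 'revenue')) AS revenue
    FROM raw_events
    WHERE json_field(payload, 'event') = 'click'
    GROUP BY user_id
    ORDER BY user_id
    """
).fetchall()

# The raw table is unchanged, so a different question needs no reload.
raw_count = conn.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
print(revenue_by_user, raw_count)
```

Because nothing is reshaped before loading, a new question means a new query rather than a new scheduled pipeline, which is the trade the slide points at.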
Presto Connectors
● Kafka: real-time alerts over Presto
● ScyllaDB: increase our insights with our ML models
● Elasticsearch: join business KPIs with R&D logs
Key takeaways
● Data lake: Keep all your raw data in one place. It will help you in the future with costs, research, resource savings, and ML models.
● Qubole: Enjoy the benefits of third-party services and keep your focus on your business.
● Scale: Reach endless data with big clusters that scale per query.
● ELT: Move 90% of your ETLs to ELTs to reduce lags and costs.
● Agile: Promote your business with quick insights.
● Free to learn: Take 10% of your time to learn! Try things out and play with the data :)
Thank You
Or Koren
or@ironsrc.com
LinkedIn: korenor

Data analytics at a petabyte scale final

  • 1. Data analytics at a PB scale 200 researchers using just Presto and a data-lake Or Koren | Head of Data @ ironSource
  • 2. About Me ● Married ● 35 ● Tel Aviv ● Coding
  • 3. Agenda: ironSource overview The past: 2016-2017 The present: 2018-2019 The future: 2020-2021
  • 5. ironSource Overview ESTABLISHED SEP. 2010 ACQUISITIONS TO DATE 8 EMPLOYEES 779 R&D EMPLOYEES 395 Offices: San Francisco (United States), New York (United States), London (United Kingdom), Berlin (Germany), Kiev (Ukraine), Tel Aviv (Israel), Bangalore (India), Hong Kong (China), Tokyo (Japan), Seoul (South Korea), Beijing, Shanghai, Shenzhen (China) Figures shown on the map: 53 7 11 1 39 5 3 1 5 624 30
  • 6. ironSource Solutions & Products Developer Solutions In-app advertising network and mediation platform for app developers PRODUCTS & PLATFORMS ironSource Mediation In-App Advertising Network PRODUCTS & PLATFORMS Enterprise Solutions Engagement platform for Carriers & OEMs PRODUCTS & PLATFORMS ironSource Aura PRODUCTS & PLATFORMS Digital Solutions Software delivery platform and B2C security products PRODUCTS & PLATFORMS Delivery & Ad Monetization Platform Security products PRODUCTS & PLATFORMS * *In advanced negotiations
  • 8. Our old data architecture ● 10 redshift clusters ● 5 RDS clusters ● 1000+ ETLs ● 1 Tableau ● Hard to scale ● Hard to maintain ● Hard to work ● Limited data ● Expensive
  • 10. To the future ● Lifetime data ● Fast SQL ● Easy scale ● Data science ● Open source
  • 12. Data Lake Parquet ● Files based ● Open Source ● Column oriented ● S3 bucket
  • 13. Hive Apache Hive is a data warehouse software project which was built on top of Apache Hadoop in order to provide data query and analysis. ● One place to rule them all ● Hadoop Ecosystem ● Presto ● Spark ● Athena Data Lake
  • 14. Presto & Qubole Qubole delivers a Self-Service Platform for Big Data Analytics built on Amazon Web Services, Microsoft and Google Clouds. Scalable Clusters Qubole configuration scales clusters up and down by looking over the execution plan of the queries. Spots Maintenance & Versions Qubole takes care of new versions & 24/7 support For Every Query
  • 16. Our Volume 500TB Daily scan (from S3) 70K Daily queries over Presto 200 Users 500 Dashboards
  • 18. Our data architecture ● 10 redshift clusters ● 5 RDS clusters ● 1000+ ETLs ● 1 Tableau ● Hard to scale ● Hard to maintain ● Hard to work ● Limited data ● Expensive + Our new data architecture ● 1 redshift cluster ● 0 RDS clusters ● 300 ETLs ● 1 Tableau & 1 Redash ● Reduce costs by 50% ● Agile to the business
  • 20. The New ETL is ELT: Extract, Load, Transform. ● Replace 90% of our ETLs with ELTs ● Help our data science team by being clearer on the logic, reducing their work time by 80% ● Keep raw data without any manipulation ● Reduce ML model deployment time by 50% ● No ETL time, no schedule
  • 21. Presto Connectors ● Kafka: Real-time alerts over Presto ● ScyllaDB: Increase our insights with our ML models ● Elasticsearch: Join business KPIs with R&D logs
  • 22. Key notes to take home ● Data-Lake: Keep all your raw data in one place. It will help you in the future with costs, research, reduced resources and ML models ● Qubole: Enjoy the benefits of 3rd party companies and continue to work on your business ● Scale: Reach endless data with big clusters that scale per query ● ELT: Move 90% of your ETLs to ELTs, to reduce lags and costs ● Agile: Promote your business with quick insights ● Free to Learn: Take 10% of your time and learn! Try and play with the data :)

Editor's Notes

  1. Good morning everyone! I am Or, and today I will show you how we use Presto and a data lake at PB scale.
  2. Before we start, I want to tell you a bit about myself and my team. This picture was taken at last Purim’s costume party, near here at Hangar 11. For those who have noticed, I hurt my knee 4 weeks ago skiing in Valmorel, France, so I had to sit in the sun, drink and relax. I am married, 35, and live in Tel Aviv. Coding has been my life since I was eleven.
  3. I will tell you a bit about ironSource. Then I will take you on a journey through time, from 2016 (before Presto) until today (with Presto), and on to what we are going to use in the future.
  4. ironSource was founded 10 years ago. We have almost 800 employees, and more than 50% of us are in R&D. Our headquarters and R&D center are located in Tel Aviv, and we have 9 more offices around the world.
  5. ironSource has a few different business divisions. Developer solutions: this division focuses on providing tools and technology to mobile app developers, specifically game developers. We offer an SDK that enables developers to run ads in their apps to make more money. We are very strong in rewarded video, so if any of you are gamers, you may be familiar with the moment in a game when you run out of lives and are offered a rewarded video to watch in order to continue playing. That’s an example of what we do. Enterprise solutions: this division focuses on helping mobile device manufacturers and mobile carriers engage with their customers. Instead of having 20 different applications pre-installed on their device, users have the power to set up their device the way they want to, with the apps they really want and need. Digital solutions: this is my division. We focus on the desktop world (Mac, PC), and we help software developers with technologies that monetize their software and distribute it to new users.
  6. Let’s have a look back at our architecture. We had 10 different Redshift clusters, one each for BI, researchers, R&D, data science, real-time data, historic data, DWH, QA, critical ETLs and backups. As you can imagine, it was really hard to work with. We had 5 RDS clusters, mainly for our applications (like OLAP). We had more than 1000 ETLs. We had 1 Tableau server. And it was really hard. Hard to scale: Redshift scales very slowly, taking from a few hours to days. Hard to maintain: we had to VACUUM tables, delete old data, and move data from one Redshift cluster to another. Hard to work with, in two respects: not all the tables were on the same cluster, and 30% of our clusters’ power went just to inserting the data. Limited data: we could not insert all the data into one cluster. Very expensive.
  7. This is what our data scientist looked like at the time. Or even like this.
  8. So we stopped and thought about where we want to be in the future. First of all, we wanted lifetime data, which is very important to our business. Fast SQL: we wanted SQL that is fast enough for our dashboard usage. We wanted the ability to scale very fast. We focused on our data science team, as we knew we were going to grow the team and the number of ML models. Open source: we did not want to be tied to a particular company.
  9. So we started to build our data lake. We chose Parquet files, which are open source and column oriented. We keep all of our data in S3, converting it from JSON to Parquet in near-real-time batch operations.
  10. Hive and the Hive Metastore: we have a single source of truth for our table definitions, which works perfectly with tools across the Hadoop ecosystem, such as Presto, Spark, Athena and more.
  11. Presto and Qubole. We use Presto to query our data lake via Qubole. Qubole is a self-service platform that lets us configure Presto clusters that scale easily and use spot instances, and they take care of maintenance, new versions and 24-hour support. Once you have configured your cluster, it can grow by itself, from 3 to 50 nodes for example, within seconds. And that is done for every query you run.
  12. Let’s see an example of autoscaling over Presto. I ran around 50 different dashboards that use Presto and saved a snapshot of the Presto UI every few seconds. As you can see, at the start there are 3 nodes and 4 queries. As I run the dashboards, the number of nodes increases along with the number of queries. After all queries have finished, the cluster shrinks back to normal.
  13. A bit about our volume. We have around 70 thousand queries running via Presto every day. We have 200 users and 500 dashboards, and both are growing. And we scan half a petabyte per day just from S3, not counting Presto’s caching.
  14. Remember our data scientist? Well, I think this is the best picture to explain how he feels.
  15. Let’s see what our architecture looks like today. We eliminated 9 of our Redshift clusters and kept only one, for Finance/DWH. We eliminated all our RDS clusters; all the data is stored in the data lake. We cut 70% of our ETLs, as we no longer need to move data from one place to another. We added a Redash server alongside our Tableau server. Redash is an open source BI tool; we use Redash for short-term solutions and Tableau for long-term solutions. By adding the data lake, all of our problems disappeared! In addition, we have reduced our costs by 50%! And most important, we became much more agile for the business: instead of delivering the first insights for a new project in 2 to 8 weeks, we deliver them on the first day, or even in the first hour!
  16. What we expect to use more in the future.
  17. First of all, ELT. The new ETL is ELT. If you don’t know what ELT is: Extract, Load, Transform. It means you put your business logic in one big query (or view). We are going to retire around 90% of our ETLs and move them to ELT. Why ELT? 1. Data science: we see that ELT reduces data science work by 80%. The main reason is that they can create a dataset within minutes by cloning the ELT of a specific business unit and adding more features. 2. Deployment: ELT helps data engineers deploy the ML model, since all the raw data is in one place and the model was built on this data rather than on aggregated data. 3. No lag: no scheduler, so you become more real-time.
  18. We are going to increase our usage of Presto connectors. Kafka: we are going to move our alerting system (for business KPIs) from the data lake to Kafka, to get findings faster, in real time. ScyllaDB: increase our insights with our ML models. Elasticsearch: we use Elasticsearch via Kibana to monitor server logs and R&D logs, and we see a strong need to be able to join those logs with business KPIs.
  19. A few notes to take home. Data lake: keep all your data in one place; it will save you time, effort and money. Qubole: use big data services like Qubole so you can focus on your business and not on maintenance. Scale: Presto scaling just works. ELT: don’t do ETLs; you don’t need them anymore. Agile: with Presto, you can be much more agile for your business. Free to learn: as I always encourage my team to do, take 10% of your time to learn and play with the data.
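
Note 9 describes converting row-oriented JSON into a column-oriented format. The sketch below is only an illustration of that idea in plain Python, not the Parquet format or the pipeline used in the talk. The field names and sample records are hypothetical, and it assumes every record carries the same fields.

```python
import json
from collections import defaultdict

# Hypothetical raw events, one JSON object per line (field names are made up).
RAW_JSON_LINES = """\
{"user_id": "u1", "country": "IL", "revenue": 0.5}
{"user_id": "u2", "country": "US", "revenue": 1.25}
{"user_id": "u3", "country": "IL", "revenue": 0.25}
"""


def rows_to_columns(json_lines: str) -> dict:
    """Pivot row-oriented JSON records into one list of values per column.

    Assumes every record has the same fields; a real columnar writer
    would also handle missing values, types and compression.
    """
    columns = defaultdict(list)
    for line in json_lines.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        for name, value in record.items():
            columns[name].append(value)
    return dict(columns)


columns = rows_to_columns(RAW_JSON_LINES)

# An aggregate over one field reads a single column, not every full record.
total_revenue = sum(columns["revenue"])
print(total_revenue)
```

This is why a column-oriented layout suits dashboard queries: a query that aggregates a handful of fields only has to read those columns.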
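
The volume figures in note 13 can be turned into rough averages. This is back-of-the-envelope arithmetic on the numbers from the talk, assuming decimal units (1 TB = 1000 GB) and an even spread over the day, which real traffic will not follow.

```python
# Figures quoted in the talk.
TB_SCANNED_PER_DAY = 500
QUERIES_PER_DAY = 70_000
USERS = 200

SECONDS_PER_DAY = 24 * 60 * 60

# Average data scanned from S3 per query, in GB (decimal units assumed).
gb_per_query = TB_SCANNED_PER_DAY * 1000 / QUERIES_PER_DAY

# Average query rate if the load were spread evenly across the day.
queries_per_second = QUERIES_PER_DAY / SECONDS_PER_DAY

# Average queries per user per day (dashboards issue many of these).
queries_per_user = QUERIES_PER_DAY / USERS

print(f"{gb_per_query:.1f} GB scanned per query")
print(f"{queries_per_second:.2f} queries per second")
print(f"{queries_per_user:.0f} queries per user per day")
```

On average that is about 7 GB scanned per query, a little under one query per second, and 350 queries per user per day.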
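
Note 17 describes ELT as loading raw data untouched and putting the business logic in a query or view. The sketch below illustrates that pattern with Python's built-in `sqlite3` module standing in for Presto over the data lake. The table, view and column names are hypothetical, and SQLite is used only so the example runs anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: raw events are stored exactly as they arrived.
conn.execute(
    "CREATE TABLE raw_events (event_date TEXT, country TEXT, revenue REAL)"
)
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [
        ("2019-03-01", "IL", 0.5),
        ("2019-03-01", "US", 1.25),
        ("2019-03-02", "IL", 0.25),
    ],
)

# Transform: the business logic lives in a view evaluated at query time,
# so there is no scheduled job that copies aggregated data elsewhere.
conn.execute(
    """
    CREATE VIEW daily_revenue AS
    SELECT event_date, SUM(revenue) AS revenue
    FROM raw_events
    GROUP BY event_date
    """
)


def daily_revenue(connection):
    """Read the view; results always reflect the current raw data."""
    return connection.execute(
        "SELECT event_date, revenue FROM daily_revenue ORDER BY event_date"
    ).fetchall()


before = daily_revenue(conn)

# A newly loaded raw row is visible immediately, with no ETL run in between.
conn.execute("INSERT INTO raw_events VALUES ('2019-03-02', 'US', 1.0)")
after = daily_revenue(conn)

print(before)
print(after)
```

Because the view is only a stored query, cloning it and adding columns is a cheap way to build a new dataset, which is the workflow the note attributes to the data science team.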