© 2019 Adobe. All Rights Reserved. Adobe Confidential.
Optimizing Data for Fast Querying
Andrei Ionescu – Adobe Romania, Data Platform
Agenda
• Data Flow & Architecture
• Some Initial Numbers
• The Problem
• Compaction Service
• On-Demand
• Real-Time
• New Numbers
• What’s Next
• Useful Links
Data Flow & Architecture
[Diagram: three Solution boxes, Ingestion, HDFS, Data Science, Query Service, Query]
Some Initial Numbers
Courtesy of Paul Mackles
Each dataset per tenant has:
• Files per day – 10K of ~4MB each
• Rows per day – 500M
• Byte size per day – 7GB
• Query time – times out (> 10 min)
• File scan – tens of minutes
Over a month, 10 datasets add up to 3M files of ~4MB each.
The Problem – Client Issue
Interactivity when querying the data is mandatory: any query taking over 10 minutes is dropped, even though the process would eventually reach an end.
Because of the way the data is ingested and the files are written, queries hit the 10-minute timeout every time, even simple queries over a month's worth of data.
Example query:
SELECT
    _Y AS Year,
    _M AS Month,
    count(*) AS ECount
FROM midvalues_1
WHERE
    _Y = 2018 OR
    _Y = 2019
GROUP BY 1, 2
ORDER BY 1 DESC, 2 DESC;
The Problem – Technical issue
The problem is known as the “small files problem”.
• Every file, directory and block in HDFS is represented as an object in NameNode memory, each taking about 150 bytes. 10M files, each using one block, amount to roughly 3GB of memory.
• HDFS is not geared towards efficient access to small files: it is primarily designed for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from DataNode to DataNode to retrieve each small file, which is an inefficient data access pattern.
• Each small file is passed to its own map() task, which is inefficient because it creates a large number of mappers. For example, 1,000 files of 2 to 3 MB each need 1,000 mappers.
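The memory figure can be checked with a little arithmetic. The sketch below assumes the commonly quoted rule of thumb of about 150 bytes per NameNode object, with one file object and one block object per small file; directories are ignored, and the function names are purely illustrative.

```python
# Rough NameNode memory estimate for the small files problem.
# Assumption: ~150 bytes per namespace object (rule of thumb, not a measured value).
BYTES_PER_OBJECT = 150


def namenode_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Each file contributes one file object plus one object per block."""
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT


def mappers_needed(num_files: int) -> int:
    """With one input split per small file, every file gets its own map task."""
    return num_files


print(namenode_bytes(10_000_000) / 1e9)  # 3.0 (GB)
print(mappers_needed(1_000))             # 1000
```

The estimate scales linearly with the number of files, which is why reducing the file count matters more than reducing the total number of bytes.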
Compaction Service
Compaction is a service that rewrites the ingested data into files sized to match the block size (128MB, 256MB, 512MB).
Compaction Service takes small files and compacts them into bigger files of a specific size.
Compaction Service has two functions, each mapped to one of its two components:
• On-demand compaction – Compaction Job
• Real-time compaction (any new file or group of files arriving is taken into account and compacted according to defined rules) – Compaction Tracker
On-Demand – Compaction Job
• One-time Spark job that reads the files and compacts them into bigger files
• For example: 7000 small files into 50 files
• Compaction Job is the core of the service
• Processing steps:
    • Given path patterns (e.g. /mystore/mydataset/_Y=2019/_M=1/_D=7/batch=*/)
    • Scan all files and get their file sizes
    • Knowing the total size, target size, schema size, etc., compute n = totalSize / targetSize
• Two modes of running:
    • Load the data and apply repartition(n) to get the proper number of files (one DataSet with all data)
    • Group files into n buckets and apply repartition(1) to each bucket (multiple DataSets)
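The planning part of these steps can be sketched without Spark. The code below is a minimal illustration, not the service's implementation: the function names, the rounding up of n, and the greedy largest-first bucketing are all assumptions, and the file sizes are passed in as a plain dictionary instead of being read from HDFS.

```python
from typing import Dict, List

MB = 1024 * 1024


def plan_output_count(total_size: int, target_size: int) -> int:
    """n = totalSize / targetSize, rounded up so no output exceeds the target by much."""
    return max(1, (total_size + target_size - 1) // target_size)


def group_into_buckets(file_sizes: Dict[str, int], n: int) -> List[List[str]]:
    """Greedy grouping: place each file (largest first) into the lightest bucket."""
    buckets: List[List[str]] = [[] for _ in range(n)]
    loads = [0] * n
    for path, size in sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True):
        lightest = loads.index(min(loads))
        buckets[lightest].append(path)
        loads[lightest] += size
    return buckets


# Hypothetical input: 7000 files of 1MB each, with a 140MB target file size.
sizes = {f"/myDs/batch={i}/File-01.part": 1 * MB for i in range(7000)}
n = plan_output_count(sum(sizes.values()), 140 * MB)
buckets = group_into_buckets(sizes, n)
print(n, len(buckets[0]))
```

In the first running mode only n is needed, since the whole DataSet is repartitioned into n parts. In the second mode each bucket would be loaded as its own DataSet and written with repartition(1).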
On-Demand – Compaction Job – Running Modes
[Diagram: the two running modes, applied to the same input paths]
Input paths:
/myDs/_Y=2019/_M=1/_D=1/batch=1/* (File-01.part, File-02.part, File-03.part)
/myDs/_Y=2019/_M=1/_D=1/batch=12/* (File-01.part)
/myDs/_Y=2019/_M=1/_D=1/batch=22/* (File-01.part)
/myDs/_Y=2019/_M=1/_D=1/batch=13/* (File-01.part)
1 DataSet with repartition n:
input paths → one DataSet → repartition(n) → /myDs/_Y=2019/_M=1/_D=1/batch=199898/
n DataSets with repartition 1:
input paths → DataSet → repartition(1) → /myDs/_Y=2019/_M=1/_D=1/batch=199898/__p1/
input paths → DataSet → repartition(1) → /myDs/_Y=2019/_M=1/_D=1/batch=199898/__p2/
move files to parent folder → /myDs/_Y=2019/_M=1/_D=1/batch=199898/
On-Demand – Compaction Job – Running Modes
1 DataSet with repartition n
• [-] File bloating due to random shuffling after repartition(n)
• [-] Twice the file size
n DataSets with repartition 1
• [+] File size is similar to the source size
• [-] Each DataSet can write only to its own folder, so one more step is needed to move the files back to the parent path and clean up
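The extra move-and-cleanup step of the second mode could look roughly like the sketch below. It uses the local filesystem as a stand-in for HDFS, and both the helper name and the choice to prefix each file with its subfolder name (to avoid name collisions between buckets) are assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def move_parts_to_parent(batch_dir: Path, prefix: str = "__p") -> List[str]:
    """Move files from per-bucket subfolders (__p1, __p2, ...) up into batch_dir,
    then delete the emptied subfolders. Returns the new file names, sorted."""
    moved: List[str] = []
    subfolders = sorted(
        p for p in batch_dir.iterdir() if p.is_dir() and p.name.startswith(prefix)
    )
    for sub in subfolders:
        for f in sorted(sub.iterdir()):
            if f.is_file():
                # Prefix with the subfolder name so equally named parts do not clash.
                target = batch_dir / f"{sub.name}-{f.name}"
                shutil.move(str(f), str(target))
                moved.append(target.name)
        shutil.rmtree(sub)  # cleanup of the per-bucket folder
    return sorted(moved)


# Local demo with a hypothetical layout mirroring batch=199898/__p1 and __p2.
with tempfile.TemporaryDirectory() as tmp:
    batch = Path(tmp) / "batch=199898"
    for bucket in ("__p1", "__p2"):
        (batch / bucket).mkdir(parents=True)
        (batch / bucket / "part-00000").write_text("data")
    print(move_parts_to_parent(batch))
```

On a real cluster the same step would go through the HDFS client API rather than `shutil`, and it is not atomic, so readers may briefly see a partially populated folder.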
Real-Time – Compaction Tracker
Any new batch of files arriving in the storage is taken into account and compacted according to specific rules. Files in a group:
• Should be part of the same tenant
• Should be part of the same partition
• Should be grouped with similar files so that the group approaches the size of an HDFS block
When a group is ready to be compacted, the tracker triggers a Compaction Job (on-demand) for it.
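These rules amount to keeping a small piece of state per (tenant, partition) group. The sketch below is an in-memory stand-in for that state, not the tracker itself: the class and method names are hypothetical, the 128MB target is an assumed default, and returning the list of batches stands in for triggering a Compaction Job.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

BLOCK_SIZE = 128 * 1024 * 1024  # assumed target: one HDFS block


@dataclass
class GroupState:
    batches: List[str] = field(default_factory=list)
    total_bytes: int = 0


class CompactionTracker:
    """Accumulates incoming batches per (tenant, partition) until a group is ready."""

    def __init__(self, target_bytes: int = BLOCK_SIZE) -> None:
        self.target_bytes = target_bytes
        self.state: Dict[Tuple[str, str], GroupState] = defaultdict(GroupState)

    def on_batch(
        self, tenant: str, partition: str, batch_path: str, size_bytes: int
    ) -> Optional[List[str]]:
        """Record a batch; return the group's batches once it reaches the target size."""
        key = (tenant, partition)
        group = self.state[key]
        group.batches.append(batch_path)
        group.total_bytes += size_bytes
        if group.total_bytes >= self.target_bytes:
            del self.state[key]   # reset state for the next group
            return group.batches  # caller would trigger a Compaction Job here
        return None


tracker = CompactionTracker(target_bytes=100)
print(tracker.on_batch("tenantA", "_Y=2019/_M=1/_D=1", "batch=12", 60))
print(tracker.on_batch("tenantA", "_Y=2019/_M=1/_D=1", "batch=13", 60))
```

Keying the state on tenant and partition enforces the first two rules by construction; only the size threshold needs an explicit check.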
Real-Time – Compaction Tracker
[Diagram: Compaction Tracker flow]
Batches arriving on HDFS:
/myDS/_Y=2019/_M=1/_D=1/batch=12/
/myDS/_Y=2019/_M=1/_D=1/batch=13/
/myDS/_Y=2019/_M=1/_D=1/batch=14/
…
/myDS/_Y=2019/_M=1/_D=1/batch=1000/
Batch of files ready → Stateful Streaming → Trigger → One-Time Job (On-Demand)
State:
• Number of batches
• File size
• Partition
• Etc.
Architecture with Compaction Service
[Diagram: three Solution boxes, Ingestion, HDFS, Data Science Workspace, Query Service, Query, Compaction]
New Numbers
Now, each dataset per tenant has:
• Files per day – 50
• Rows per day – 500M
• Byte size per day – 7GB
• Query time – < 10 min
• File scan – under a minute
What’s Next
• Use of a meta-store (Netflix Iceberg)
• Better resource scaling for Compaction Jobs, based on the data
• Use of ML to fit the Compaction Job needs
Useful Links
• http://dataottam.com/2016/09/09/3-solutions-for-big-datas-small-files-problem/
• https://blog.cloudera.com/blog/2009/02/the-small-files-problem/
• https://community.hortonworks.com/questions/167615/what-is-small-file-problem-in-hdfs.html
• https://github.com/Netflix/iceberg
• https://parquet.apache.org/
Keep in Touch
• LinkedIn: https://www.linkedin.com/in/andreiionescu
• Email: aionescu@adobe.com
16
Optimizing Data for Fast Querying

More Related Content

What's hot

S106195 cos-use cases-istanbul-v1902a
S106195 cos-use cases-istanbul-v1902aS106195 cos-use cases-istanbul-v1902a
S106195 cos-use cases-istanbul-v1902aTony Pearson
 
S100295 reporting-monitoring-orlando-v1804a
S100295 reporting-monitoring-orlando-v1804aS100295 reporting-monitoring-orlando-v1804a
S100295 reporting-monitoring-orlando-v1804aTony Pearson
 
S100293 hybrid-cloud-orlando-v1804a
S100293 hybrid-cloud-orlando-v1804aS100293 hybrid-cloud-orlando-v1804a
S100293 hybrid-cloud-orlando-v1804aTony Pearson
 
Big data-denis-rothman
Big data-denis-rothmanBig data-denis-rothman
Big data-denis-rothmanDenis Rothman
 
Spectrum Scale final
Spectrum Scale finalSpectrum Scale final
Spectrum Scale finalJoe Krotz
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache HadoopAjit Koti
 
S100297 ilm-archive-orlando-v1804c
S100297 ilm-archive-orlando-v1804cS100297 ilm-archive-orlando-v1804c
S100297 ilm-archive-orlando-v1804cTony Pearson
 
Why Hadoop is important to Syncsort
Why Hadoop is important to SyncsortWhy Hadoop is important to Syncsort
Why Hadoop is important to Syncsorthuguk
 
Voices presentation
Voices presentationVoices presentation
Voices presentationDhan12
 
Move to Hadoop, Go Faster and Save Millions - Mainframe Legacy Modernization
Move to Hadoop, Go Faster and Save Millions - Mainframe Legacy ModernizationMove to Hadoop, Go Faster and Save Millions - Mainframe Legacy Modernization
Move to Hadoop, Go Faster and Save Millions - Mainframe Legacy ModernizationDataWorks Summit
 
AWS Partner Presentation - StorSimple - Cloud-Integrated Storage, AWS Summit ...
AWS Partner Presentation - StorSimple - Cloud-Integrated Storage, AWS Summit ...AWS Partner Presentation - StorSimple - Cloud-Integrated Storage, AWS Summit ...
AWS Partner Presentation - StorSimple - Cloud-Integrated Storage, AWS Summit ...Amazon Web Services
 
TCO Comparison MongoDB & Oracle
TCO Comparison MongoDB & OracleTCO Comparison MongoDB & Oracle
TCO Comparison MongoDB & OracleEl Taller Web
 
Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the ImprobableDisaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the ImprobableStefan Kupstaitis-Dunkler
 
S de0882 new-generation-tiering-edge2015-v3
S de0882 new-generation-tiering-edge2015-v3S de0882 new-generation-tiering-edge2015-v3
S de0882 new-generation-tiering-edge2015-v3Tony Pearson
 
Hadoop & distributed cloud computing
Hadoop & distributed cloud computingHadoop & distributed cloud computing
Hadoop & distributed cloud computingRajan Kumar Upadhyay
 
IBM general parallel file system - introduction
IBM general parallel file system - introductionIBM general parallel file system - introduction
IBM general parallel file system - introductionIBM Danmark
 
L'agilité du cloud public dans votre datacenter avec ECS & Neutrino
L'agilité du cloud public dans votre datacenter avec ECS & NeutrinoL'agilité du cloud public dans votre datacenter avec ECS & Neutrino
L'agilité du cloud public dans votre datacenter avec ECS & NeutrinoRSD
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basicssaili mane
 

What's hot (20)

S106195 cos-use cases-istanbul-v1902a
S106195 cos-use cases-istanbul-v1902aS106195 cos-use cases-istanbul-v1902a
S106195 cos-use cases-istanbul-v1902a
 
Hadoop Fundamentals I
Hadoop Fundamentals IHadoop Fundamentals I
Hadoop Fundamentals I
 
S100295 reporting-monitoring-orlando-v1804a
S100295 reporting-monitoring-orlando-v1804aS100295 reporting-monitoring-orlando-v1804a
S100295 reporting-monitoring-orlando-v1804a
 
S100293 hybrid-cloud-orlando-v1804a
S100293 hybrid-cloud-orlando-v1804aS100293 hybrid-cloud-orlando-v1804a
S100293 hybrid-cloud-orlando-v1804a
 
Big data-denis-rothman
Big data-denis-rothmanBig data-denis-rothman
Big data-denis-rothman
 
Spectrum Scale final
Spectrum Scale finalSpectrum Scale final
Spectrum Scale final
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
S100297 ilm-archive-orlando-v1804c
S100297 ilm-archive-orlando-v1804cS100297 ilm-archive-orlando-v1804c
S100297 ilm-archive-orlando-v1804c
 
Why Hadoop is important to Syncsort
Why Hadoop is important to SyncsortWhy Hadoop is important to Syncsort
Why Hadoop is important to Syncsort
 
Voices presentation
Voices presentationVoices presentation
Voices presentation
 
Hadoop
Hadoop Hadoop
Hadoop
 
Move to Hadoop, Go Faster and Save Millions - Mainframe Legacy Modernization
Move to Hadoop, Go Faster and Save Millions - Mainframe Legacy ModernizationMove to Hadoop, Go Faster and Save Millions - Mainframe Legacy Modernization
Move to Hadoop, Go Faster and Save Millions - Mainframe Legacy Modernization
 
AWS Partner Presentation - StorSimple - Cloud-Integrated Storage, AWS Summit ...
AWS Partner Presentation - StorSimple - Cloud-Integrated Storage, AWS Summit ...AWS Partner Presentation - StorSimple - Cloud-Integrated Storage, AWS Summit ...
AWS Partner Presentation - StorSimple - Cloud-Integrated Storage, AWS Summit ...
 
TCO Comparison MongoDB & Oracle
TCO Comparison MongoDB & OracleTCO Comparison MongoDB & Oracle
TCO Comparison MongoDB & Oracle
 
Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the ImprobableDisaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
 
S de0882 new-generation-tiering-edge2015-v3
S de0882 new-generation-tiering-edge2015-v3S de0882 new-generation-tiering-edge2015-v3
S de0882 new-generation-tiering-edge2015-v3
 
Hadoop & distributed cloud computing
Hadoop & distributed cloud computingHadoop & distributed cloud computing
Hadoop & distributed cloud computing
 
IBM general parallel file system - introduction
IBM general parallel file system - introductionIBM general parallel file system - introduction
IBM general parallel file system - introduction
 
L'agilité du cloud public dans votre datacenter avec ECS & Neutrino
L'agilité du cloud public dans votre datacenter avec ECS & NeutrinoL'agilité du cloud public dans votre datacenter avec ECS & Neutrino
L'agilité du cloud public dans votre datacenter avec ECS & Neutrino
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basics
 

Similar to Optimizing Data for Fast Querying

Aem asset optimizations & best practices
Aem asset optimizations & best practicesAem asset optimizations & best practices
Aem asset optimizations & best practicesKanika Gera
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation Shivanee garg
 
Data storage for the cloud ce11
Data storage for the cloud ce11Data storage for the cloud ce11
Data storage for the cloud ce11CloudExpoEurope
 
Data storage for the cloud ce11
Data storage for the cloud ce11Data storage for the cloud ce11
Data storage for the cloud ce11aseager
 
Data storage for the cloud ce11
Data storage for the cloud ce11Data storage for the cloud ce11
Data storage for the cloud ce11aseager
 
Hadoop file system
Hadoop file systemHadoop file system
Hadoop file systemJohn Veigas
 
Hadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better StorageHadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better StorageSandeep Patil
 
Cloud storage or computing & its working
Cloud storage or computing & its workingCloud storage or computing & its working
Cloud storage or computing & its workingpiyush mishra
 
New Framework for Improving Bigdata Analaysis Using Mobile Agent
New Framework for Improving Bigdata Analaysis Using Mobile AgentNew Framework for Improving Bigdata Analaysis Using Mobile Agent
New Framework for Improving Bigdata Analaysis Using Mobile AgentMohammed Adam
 
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...dbpublications
 
IBM Cloud Object Storage: How it works and typical use cases
IBM Cloud Object Storage: How it works and typical use casesIBM Cloud Object Storage: How it works and typical use cases
IBM Cloud Object Storage: How it works and typical use casesTony Pearson
 
A Brave new object store world
A Brave new object store worldA Brave new object store world
A Brave new object store worldEffi Ofer
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
Compact, Compress, De-Duplicate (DAOS)
Compact, Compress, De-Duplicate (DAOS)Compact, Compress, De-Duplicate (DAOS)
Compact, Compress, De-Duplicate (DAOS)Ulrich Krause
 
S100299 ibm-cos-orlando-v1804c
S100299 ibm-cos-orlando-v1804cS100299 ibm-cos-orlando-v1804c
S100299 ibm-cos-orlando-v1804cTony Pearson
 
IRJET- A Study of Comparatively Analysis for HDFS and Google File System ...
IRJET-  	  A Study of Comparatively Analysis for HDFS and Google File System ...IRJET-  	  A Study of Comparatively Analysis for HDFS and Google File System ...
IRJET- A Study of Comparatively Analysis for HDFS and Google File System ...IRJET Journal
 
Advantages Of SAMBA
Advantages Of SAMBAAdvantages Of SAMBA
Advantages Of SAMBAAngela Hays
 
Big Data Real Time Analytics - A Facebook Case Study
Big Data Real Time Analytics - A Facebook Case StudyBig Data Real Time Analytics - A Facebook Case Study
Big Data Real Time Analytics - A Facebook Case StudyNati Shalom
 

Similar to Optimizing Data for Fast Querying (20)

Aem asset optimizations & best practices
Aem asset optimizations & best practicesAem asset optimizations & best practices
Aem asset optimizations & best practices
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation
 
Data storage for the cloud ce11
Data storage for the cloud ce11Data storage for the cloud ce11
Data storage for the cloud ce11
 
Data storage for the cloud ce11
Data storage for the cloud ce11Data storage for the cloud ce11
Data storage for the cloud ce11
 
Data storage for the cloud ce11
Data storage for the cloud ce11Data storage for the cloud ce11
Data storage for the cloud ce11
 
final report
final reportfinal report
final report
 
Hadoop file system
Hadoop file systemHadoop file system
Hadoop file system
 
Hadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better StorageHadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better Storage
 
Cloud storage or computing & its working
Cloud storage or computing & its workingCloud storage or computing & its working
Cloud storage or computing & its working
 
New Framework for Improving Bigdata Analaysis Using Mobile Agent
New Framework for Improving Bigdata Analaysis Using Mobile AgentNew Framework for Improving Bigdata Analaysis Using Mobile Agent
New Framework for Improving Bigdata Analaysis Using Mobile Agent
 
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value S...
 
IBM Cloud Object Storage: How it works and typical use cases
IBM Cloud Object Storage: How it works and typical use casesIBM Cloud Object Storage: How it works and typical use cases
IBM Cloud Object Storage: How it works and typical use cases
 
A Brave new object store world
A Brave new object store worldA Brave new object store world
A Brave new object store world
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Compact, Compress, De-Duplicate (DAOS)
Compact, Compress, De-Duplicate (DAOS)Compact, Compress, De-Duplicate (DAOS)
Compact, Compress, De-Duplicate (DAOS)
 
Final Presentation
Final PresentationFinal Presentation
Final Presentation
 
S100299 ibm-cos-orlando-v1804c
S100299 ibm-cos-orlando-v1804cS100299 ibm-cos-orlando-v1804c
S100299 ibm-cos-orlando-v1804c
 
IRJET- A Study of Comparatively Analysis for HDFS and Google File System ...
IRJET-  	  A Study of Comparatively Analysis for HDFS and Google File System ...IRJET-  	  A Study of Comparatively Analysis for HDFS and Google File System ...
IRJET- A Study of Comparatively Analysis for HDFS and Google File System ...
 
Advantages Of SAMBA
Advantages Of SAMBAAdvantages Of SAMBA
Advantages Of SAMBA
 
Big Data Real Time Analytics - A Facebook Case Study
Big Data Real Time Analytics - A Facebook Case StudyBig Data Real Time Analytics - A Facebook Case Study
Big Data Real Time Analytics - A Facebook Case Study
 

Recently uploaded

%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburgmasabamasaba
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfonteinmasabamasaba
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxAnnaArtyushina1
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfkalichargn70th171
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...masabamasaba
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrandmasabamasaba
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...masabamasaba
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...masabamasaba
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...masabamasaba
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech studentsHimanshiGarg82
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...Shane Coughlan
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplatePresentation.STUDIO
 
tonesoftg
tonesoftgtonesoftg
tonesoftglanshi9
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareJim McKeeth
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is insideshinachiaurasa2
 

Recently uploaded (20)

%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
tonesoftg
tonesoftgtonesoftg
tonesoftg
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 

Optimizing Data for Fast Querying

  • 1. © 2019 Adobe. All Rights Reserved. Adobe Confidential. Optimizing Data for Fast Querying Andrei Ionescu – Adobe Romania, Data Platform
  • 2. © 2019 Adobe. All Rights Reserved. Adobe Confidential. Agenda  Data Flow & Architecture  Some Initial Numbers  The Problem  Compaction Service  On-Demand  Real-Time  New Numbers  What’s Next  Useful Links 2
  • 3. © 2019 Adobe. All Rights Reserved. Adobe Confidential. Data Flow & Architecture 3 Solution Solution Solution HDFS Ingestion Data Science Query Service Query
  • 4. © 2019 Adobe. All Rights Reserved. Adobe Confidential. Some Initial Numbers 4 Courtesy of Paul Mackles Each dataset per tenant has:  Files per day – 10K of ~4MB each  Rows per day – 500M  Byte Size per day – 7GB  Query time – time out due to> 10 min  Files scan – tens of minutes For a month for 10 datasets we have 3M files of ~4MB each.
  • 5. © 2019 Adobe. All Rights Reserved. Adobe Confidential. The Problem – Client Issue Interactivity when querying the data is mandatory – any query taking over 10 minutes is dropped, even though the process will eventually reach and end after a while. Due to the way the data is ingested and files are written the queries do reach 10 minutes timeout every time, even for simple queries on a month worth of data. 5 SELECT _Y as Year, _M as Month, count(*) as ECount FROM midvalues_1 WHERE _Y = 2018 OR _Y=2019 GROUP BY 1, 2 ORDER BY 1 Desc, 2 Desc;
  • 6. © 2019 Adobe. All Rights Reserved. Adobe Confidential. The Problem – Technical issue The problem is known as “Small File Problem”.  HDFS allocates for each file about 150 bytes that are stored in the namenode memory. 10M files results in 3GB of memory usage.  HDFS is not geared up to efficiently accessing small files: it is primarily designed for streaming access of large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.  Each small file is passed to map() function, which is not very efficient because it will create a large number of mappers. For example, the 1,000’s files of size (2 to 3 MB) will need 1,000 mappers which very inefficient. 6
  • 7. Compaction Service
      Compaction is a service that rewrites the ingested data into files sized to match the block size (128MB, 256MB, 512MB). It takes small files and compacts them into bigger files of a specific size.
      Compaction Service has two functions, each mapped to one of its two components:
      • On-demand compaction – Compaction Job
      • Real-time compaction (any new file or group of files arriving is taken into account and compacted according to defined rules) – Compaction Tracker
  • 8. On-Demand – Compaction Job
      • One-time Spark job that gets the files and compacts them into bigger files
          • For example: 7000 small files into 50 files
      • Compaction Job is the core of the service
      • Processing steps:
          • Take the given path patterns (e.g. /mystore/mydataset/_Y=2019/_M=1/_D=7/batch=*/)
          • Scan all files and get their file sizes
          • Knowing total size, target size, schema size, etc., compute n = totalSize/targetSize
      • Two modes of running:
          • Load the data and apply repartition(n) to get the proper number of files (one DataSet with all data)
          • Group files into n buckets and apply repartition(1) to each bucket (multiple DataSets)
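The planning part of the processing steps (computing n and grouping files into n buckets) can be sketched in plain Python, without Spark. This is an illustration under stated assumptions, not the service's code: rounding n up with a ceiling and the greedy "largest file into the currently smallest bucket" strategy are choices made here, since the slide only gives n = totalSize/targetSize and does not say how files are assigned to buckets. All function names are hypothetical.

```python
import math
from typing import List, Tuple

MB = 1024 * 1024


def target_file_count(total_size: int, target_size: int) -> int:
    """n = totalSize / targetSize, rounded up (rounding is an assumption)."""
    return max(1, math.ceil(total_size / target_size))


def group_into_buckets(files: List[Tuple[str, int]], n: int) -> List[List[str]]:
    """Greedily spread (path, size) pairs over n buckets of similar total size."""
    buckets: List[List[str]] = [[] for _ in range(n)]
    bucket_sizes = [0] * n
    # Place the largest files first, each into the currently smallest bucket.
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        smallest = bucket_sizes.index(min(bucket_sizes))
        buckets[smallest].append(path)
        bucket_sizes[smallest] += size
    return buckets


# 64 files of 4MB each, compacted towards 128MB output files.
files = [(f"/myDs/_Y=2019/_M=1/_D=1/batch=1/File-{i:02d}.part", 4 * MB) for i in range(64)]
n = target_file_count(sum(size for _, size in files), 128 * MB)
buckets = group_into_buckets(files, n)
print(n, [len(b) for b in buckets])
```

In the second running mode each bucket would then be loaded as its own DataSet and written with repartition(1); in the first mode only n is needed.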
  • 9. On-Demand – Compaction Job – Running Modes
      [Diagram, two panels]
      • 1 DataSet with repartition n: input folders /myDs/_Y=2019/_M=1/_D=1/batch=1/* (File-01.part, File-02.part, File-03.part), batch=12/*, batch=22/* and batch=13/* are loaded into one DataSet, Repartition n is applied, and the output is written to /myDs/_Y=2019/_M=1/_D=1/batch=199898/
      • n DataSets with repartition 1: the same input folders are split across several DataSets, Repartition 1 is applied to each, each DataSet writes to its own folder (/myDs/_Y=2019/_M=1/_D=1/batch=199898/__p1/, /myDs/_Y=2019/_M=1/_D=1/batch=199898/__p2/), and the files are then moved to the parent folder /myDs/_Y=2019/_M=1/_D=1/batch=199898/
  • 10. On-Demand – Compaction Job – Running Modes
      • 1 DataSet with repartition n
          • [-] File bloating due to random shuffling after repartition(n)
          • [-] Output is about twice the source size
      • n DataSets with repartition 1
          • [+] File size is similar to the source size
          • [-] Each DataSet can write only to its own folder, so one more step is needed to move the files back to the parent path and clean up
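The extra "move the files back to the parent path and clean up" step of the second mode might look like the sketch below. It runs against a local temporary directory with pathlib and shutil as a stand-in for HDFS, so it is only an illustration of the idea. The __p prefix follows the folder names in the running-modes diagram; prefixing each moved file with its subfolder name to avoid name collisions is an assumption, as is the function name.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def move_to_parent(batch_dir: Path, prefix: str = "__p") -> List[str]:
    """Move files from per-DataSet subfolders into batch_dir, then delete the subfolders."""
    moved: List[str] = []
    for sub in sorted(batch_dir.glob(f"{prefix}*")):
        if not sub.is_dir():
            continue
        for f in sorted(sub.iterdir()):
            if f.is_file():
                # Assumption: prefix with the subfolder name so that equally
                # named part files from different DataSets do not collide.
                target = batch_dir / f"{sub.name}-{f.name}"
                shutil.move(str(f), str(target))
                moved.append(target.name)
        shutil.rmtree(sub)  # cleanup of the now-empty subfolder
    return moved


# Demo on a throwaway local directory that mimics the output layout.
demo_root = Path(tempfile.mkdtemp())
batch_dir = demo_root / "batch=199898"
for part in ("__p1", "__p2"):
    (batch_dir / part).mkdir(parents=True)
    (batch_dir / part / "part-00000").write_text("data")

moved = move_to_parent(batch_dir)
print(moved)
```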
  • 11. Real-Time – Compaction Tracker
      Any new batch of files arriving in the storage is taken into account and compacted according to specific rules:
      • Files must belong to the same tenant
      • Files must belong to the same partition
      • Files are grouped with similar files so that the group approaches the size of an HDFS block
      • Once a group is ready to be compacted, the tracker triggers a Compaction Job (On-Demand) for it
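The grouping rules on this slide can be sketched as a small in-memory tracker. This is plain Python rather than a streaming job, and the class, method and callback names are hypothetical; the trigger condition (accumulated size reaching the target block size) is one possible reading of the rules, not a statement of how the service decides.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Key = Tuple[str, str]  # (tenant, partition)
MB = 1024 * 1024


class CompactionTracker:
    """Accumulates batches per (tenant, partition) and fires when a group is big enough."""

    def __init__(self, target_bytes: int, trigger: Callable[[Key, List[str]], None]):
        self.target_bytes = target_bytes
        self.trigger = trigger  # stands in for launching an on-demand Compaction Job
        self.pending: Dict[Key, List[str]] = defaultdict(list)
        self.pending_bytes: Dict[Key, int] = defaultdict(int)

    def on_batch(self, tenant: str, partition: str, batch_path: str, size_bytes: int) -> None:
        key = (tenant, partition)  # batches never mix across tenants or partitions
        self.pending[key].append(batch_path)
        self.pending_bytes[key] += size_bytes
        if self.pending_bytes[key] >= self.target_bytes:
            # The group is ready: hand it over and reset the state for this key.
            self.trigger(key, self.pending.pop(key))
            del self.pending_bytes[key]


triggered: List[Tuple[Key, List[str]]] = []
tracker = CompactionTracker(128 * MB, lambda key, batches: triggered.append((key, batches)))
partition = "_Y=2019/_M=1/_D=1"
tracker.on_batch("tenantA", partition, "batch=12", 50 * MB)
tracker.on_batch("tenantB", partition, "batch=7", 50 * MB)
tracker.on_batch("tenantA", partition, "batch=13", 50 * MB)
tracker.on_batch("tenantA", partition, "batch=14", 50 * MB)
print(triggered)
```

Only tenantA's group is handed over here, after its third batch pushes the accumulated size past 128MB; tenantB's single batch stays pending.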
  • 12. Real-Time – Compaction Tracker
      [Diagram: batches arriving in HDFS (/myDS/_Y=2019/_M=1/_D=1/batch=12/, /myDS/_Y=2019/_M=1/_D=1/batch=13/, /myDS/_Y=2019/_M=1/_D=1/batch=14/, …, /myDS/_Y=2019/_M=1/_D=1/batch=1000/) → batch of files ready → Stateful Streaming (state: number of batches, file size, partition, etc.) → Trigger One Time Job (On-Demand)]
  • 13. Architecture with Compaction Service
      [Diagram; labels: Solution (×3), Ingestion, HDFS, Compaction, Data Science Workspace, Query Service, Query]
  • 14. New Numbers
      Now, each dataset per tenant has:
      • Files per day – 50
      • Rows per day – 500M
      • Byte size per day – 7GB
      • Query time – < 10 min
      • Files scan – under a minute
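As a quick sanity check on these numbers, the snippet below derives the average compacted file size and the reduction in file count from the figures in the deck (10K files per day before, 50 after, 7GB per day). It assumes 7GB means 7×10^9 bytes and that the data is spread evenly across files; the variable names are illustrative.

```python
files_per_day_before = 10_000
files_per_day_after = 50
bytes_per_day = 7 * 10**9  # assumption: "7GB" read as decimal gigabytes

avg_file_mb = bytes_per_day / files_per_day_after / 10**6
reduction_factor = files_per_day_before / files_per_day_after

print(f"average compacted file: ~{avg_file_mb:.0f} MB")
print(f"file count reduced {reduction_factor:.0f}x")
```

Under those assumptions the average compacted file is about 140MB, in the range of the block sizes the Compaction Service targets.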
  • 15. What’s Next
      • Use of a meta-store (Netflix Iceberg)
      • Better scaling of resources for Compaction Jobs based on the data
      • Use of ML to fit the Compaction Job needs
  • 16. Useful Links
      • http://dataottam.com/2016/09/09/3-solutions-for-big-datas-small-files-problem/
      • https://blog.cloudera.com/blog/2009/02/the-small-files-problem/
      • https://community.hortonworks.com/questions/167615/what-is-small-file-problem-in-hdfs.html
      • https://github.com/Netflix/iceberg
      • https://parquet.apache.org/
      Keep in Touch
      • LinkedIn: https://www.linkedin.com/in/andreiionescu
      • Email: aionescu@adobe.com