BDaaS On The Cloud:
Challenges And Optimizations
Abhishek Somani
20th January 2017
Why Cloud?
Where Big Data falls short:
• 6-18 month implementation time
• Only 27% of Big Data initiatives are classified as “Successful” in 2014
• Rigid and inflexible infrastructure
• Non-adaptive software services
• Highly specialized systems
• Difficult to build and operate
• Only 13% of organizations achieve full-scale production
• 57% of organizations cite skills gap as a major inhibitor
Why Cloud?
1. Flexible Infrastructure
2. Pay only for what you actually use
3. Shared Storage
4. Heterogeneous Clusters
Agenda
• Cloud Compute (Cluster) management
– Challenges
– Solutions
– Advanced Optimizations
• Cloud Storage
– Challenges
– Solutions and Optimizations
Cloud Compute
1. Properties:
a. Ephemeral
b. Volatile (Spot for AWS, Preemptible for GCP)
2. Challenges:
a. Scale as per workload
b. Separation of compute and storage
c. Job histories, log files, and results all need to be persisted
d. Adapting YARN/HDFS to take ephemeral cloud nodes into account
Up-scaling for MR jobs
[Diagram: a User, the Resource Manager, the Cluster Manager, and Nodes 1-3, each running a NodeManager. Labeled steps: Submit Job; Launches MR AM; Container Request (MR AppMaster); Allocate Resources (containers C1, C2); Task Progress; Up Scale Request; Add Node (Node 3, containers C3, C4).]
Generic Up-scaling
[Diagram: the Resource Manager, the Cluster Manager, Node 2, and three application masters (MR AppMaster, Spark AppMaster, Tez AppMaster). Labeled steps: Up Scale Request; Add Node.]
Down-scaling
[Diagram: a User, the Resource Manager, the Cluster Manager, and three nodes, each running a NodeManager with containers C1-C4 for Job 1, Job 2, and Job 3. Labeled steps: User Submits Job; Allocates container; Status Update; Evaluates cluster is being underutilized and can be down scaled; Selects node whose estimated task completion time is lowest; Graceful Shutdown; Job 1 Completes; Decommission Node; Remove Node.]
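The selection step named in the diagram ("Selects node whose estimated task completion time is lowest") can be sketched as below. This is a minimal illustration, not the scheduler's actual code: the `Node` class, its `remaining_task_seconds` field, and the rule that a node is free when its slowest container finishes are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical view of a cluster node as seen by the autoscaler."""
    name: str
    # Estimated seconds remaining for each container still running on the node.
    remaining_task_seconds: list = field(default_factory=list)

    def estimated_completion(self) -> float:
        # Containers run in parallel, so the node is idle once its slowest task ends.
        return max(self.remaining_task_seconds, default=0.0)


def pick_node_to_decommission(nodes):
    """Return the node expected to become idle soonest, or None if there are no nodes."""
    if not nodes:
        return None
    return min(nodes, key=lambda n: n.estimated_completion())


nodes = [
    Node("node-1", [120.0, 300.0]),
    Node("node-2", [45.0, 60.0]),
    Node("node-3", [10.0, 600.0]),
]
chosen = pick_node_to_decommission(nodes)
print(chosen.name)
```

The chosen node would then be gracefully shut down: it stops accepting new containers and is removed once its running tasks finish.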
Why is it hard?
1. Upscaling
a. Engine-specific algorithms
b. Cannot just look at expected time (parallelism matters)
2. Downscaling
a. Decommissioning takes time
b. Need to consider billing-hour boundaries
c. Nodes get stuck holding mapper output that reducers still need
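The hour-boundary point can be illustrated with a small check. This sketch assumes instances are billed in whole hours from launch, so an idle node is only worth releasing shortly before its next paid hour begins. The function name, the 3600-second billing period, and the 5-minute window are illustrative assumptions, not values from the talk.

```python
def should_terminate(uptime_seconds: float, is_idle: bool,
                     billing_period: float = 3600.0,
                     window: float = 300.0) -> bool:
    """Release an idle node only when it is close to the end of a billed period."""
    if not is_idle:
        return False
    seconds_into_period = uptime_seconds % billing_period
    seconds_left = billing_period - seconds_into_period
    # Earlier in the period the time is already paid for, so keep the node around
    # in case new work arrives.
    return seconds_left <= window


print(should_terminate(3400, is_idle=True))
print(should_terminate(600, is_idle=True))
```

Under these assumptions, a node that went idle 10 minutes into its hour stays available for the remaining 50 minutes at no extra cost.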
Job History – Terminated Cluster
[Diagram: User, Qubole UI, Cluster Proxy, and Job History Server. Labeled steps: Clicks UI link; Proxifies Link; Authenticates the request; Finds cluster is down; Fetches jhist file from cloud; Jhist file Rendered.]
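The fallback in the diagram (serve job history from cloud storage when the cluster is down) might look roughly like this. It is a sketch with in-memory dictionaries standing in for running clusters and for the object store; the `JobHistoryProxy` class and the `logs/<cluster>/history/<job>.jhist` key layout are hypothetical.

```python
class JobHistoryProxy:
    """Toy proxy that prefers a live cluster and falls back to persisted jhist files."""

    def __init__(self, running_clusters, cloud_storage):
        self.running_clusters = running_clusters  # cluster_id -> {job_id: jhist text}
        self.cloud_storage = cloud_storage        # object key -> jhist text

    @staticmethod
    def jhist_key(cluster_id, job_id):
        # Hypothetical object key layout for persisted job history files.
        return f"logs/{cluster_id}/history/{job_id}.jhist"

    def fetch(self, cluster_id, job_id):
        live = self.running_clusters.get(cluster_id)
        if live is not None:
            # Cluster is up: read the history from the cluster itself.
            return "cluster", live[job_id]
        # Cluster is down: read the persisted copy from cloud storage.
        return "cloud", self.cloud_storage[self.jhist_key(cluster_id, job_id)]


proxy = JobHistoryProxy(
    running_clusters={"c1": {"job_1": "history from live cluster"}},
    cloud_storage={"logs/c2/history/job_7.jhist": "history from cloud"},
)
print(proxy.fetch("c1", "job_1"))
print(proxy.fetch("c2", "job_7"))
```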
Advanced Optimizations
1. Volatile Nodes
a. Lower-priced nodes bought in an auction (Spot Nodes in AWS, Preemptible in GCE)
2. Hybrid Clusters
a. Mix of stable and volatile nodes to improve stability
3. Heterogeneous Clusters
a. Preferred machine types may not be available
b. Preferred machine types may be more expensive than larger machines
4. Autoscaling Optimizations
a. Packing of tasks
b. Upload intermediate data to cloud storage
c. Recommission nodes
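One way to read "packing of tasks" is a best-fit placement: new containers go to the busiest node that still has room, so lightly used nodes drain and become candidates for down-scaling. The sketch below assumes that reading. The `place_container` function and the memory-only resource model are illustrative and not taken from the talk.

```python
def place_container(free_memory_by_node, request_mb):
    """Best-fit placement: choose the node with the least free memory that still fits."""
    candidates = {
        node: free for node, free in free_memory_by_node.items() if free >= request_mb
    }
    if not candidates:
        return None  # Nothing fits; a real system might trigger up-scaling here.
    # Tie-break on the node name so the choice is deterministic.
    node = min(candidates, key=lambda n: (candidates[n], n))
    free_memory_by_node[node] -= request_mb
    return node


free = {"node-a": 8192, "node-b": 2048, "node-c": 4096}
placements = [place_container(free, 2048) for _ in range(3)]
print(placements)
print(free)
```

In this run node-a never receives a container, so it could be removed without waiting for any task to finish.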
Agenda
1. Cloud Compute (Cluster) management
a. Challenges
b. Scaling
c. Advanced Optimizations
2. Cloud Storage
a. Challenges
b. Solutions and Optimizations
Cloud Storage
1. Properties:
a. Simple key-value store
b. Inexpensive
c. Accessed via REST APIs/SDK
d. Is the source of truth
2. Challenges:
a. Connection establishment is expensive
b. Copying/moving is expensive; there is no rename
3. Some positives:
a. Prefix listing
b. PUTs are atomic: a file is created only when its upload completes, unlike HDFS, where it is created on first write
c. Multipart upload
Prefix Listing
• Naive: one list request per path
    for path in ['/x/y/a', '/x/y/b', '/x/z/c', ...]:
        result << listObject(path)
• Smart: one prefix listing, filtered on the client
    pathList = listPrefix('/x')
    while (entry = pathList.next()):
        if entry in ['/x/y/a', '/x/y/b', '/x/z/c', ...]:
            result << entry
• Up to 1000x improvement
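The difference between the two approaches can be made concrete by counting requests against a toy object store. This is a simulation under stated assumptions: `FakeObjectStore` is a made-up in-memory class, and the page size of 1000 keys per list request mirrors the S3 listing limit, which is also one plausible source of the "up to 1000x" figure.

```python
class FakeObjectStore:
    """In-memory stand-in for an object store that counts list requests."""

    def __init__(self, keys):
        self._keys = sorted(keys)
        self._key_set = set(keys)
        self.requests = 0

    def list_object(self, path):
        # One request per path that is checked.
        self.requests += 1
        return [path] if path in self._key_set else []

    def list_prefix(self, prefix, page_size=1000):
        # One request per page of results (at least one, even when nothing matches).
        matching = [k for k in self._keys if k.startswith(prefix)]
        for start in range(0, max(len(matching), 1), page_size):
            self.requests += 1
            yield from matching[start:start + page_size]


def naive_listing(store, wanted):
    result = []
    for path in wanted:
        result.extend(store.list_object(path))
    return result


def smart_listing(store, wanted, prefix):
    wanted_set = set(wanted)
    return [key for key in store.list_prefix(prefix) if key in wanted_set]


keys = [f"/x/y/part-{i:04d}" for i in range(2000)]
wanted = keys[:1500]
naive_store, smart_store = FakeObjectStore(keys), FakeObjectStore(keys)
naive_result = naive_listing(naive_store, wanted)
smart_result = smart_listing(smart_store, wanted, "/x/")
print(naive_store.requests, smart_store.requests)
```

Both functions return the same paths; only the number of round trips differs.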
Storage Optimizations
Prefix Listing - Use Cases
1. Split computation: divide input files into tasks for Map-Reduce/Spark/Presto
2. Recovering partitions
3. Listing paths that match a pattern ('/x/y/z/*/*')
4. And many more
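For the pattern-matching use case, one possible approach is to list once under the longest wildcard-free prefix and filter the keys locally. The sketch below assumes the pattern is glob-style, with `*` matching within a single path segment. The helper names are illustrative, and the filtered list stands in for a real prefix-listing call.

```python
import re


def literal_prefix(pattern):
    """Longest leading part of the pattern that contains no wildcard."""
    idx = pattern.find("*")
    return pattern if idx == -1 else pattern[:idx]


def glob_to_regex(pattern):
    # '*' matches any run of characters inside one path segment (never '/').
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile("^" + "[^/]*".join(parts) + "$")


def list_matching(all_keys, pattern):
    prefix = literal_prefix(pattern)
    regex = glob_to_regex(pattern)
    # Stand-in for a single prefix listing against the object store.
    listed = [key for key in sorted(all_keys) if key.startswith(prefix)]
    return [key for key in listed if regex.match(key)]


keys = [
    "/x/y/z/a/1", "/x/y/z/a/2", "/x/y/z/b/1",
    "/x/y/z/b/c/1", "/x/y/w/a/1", "/x/y/z/top",
]
matches = list_matching(keys, "/x/y/z/*/*")
print(matches)
```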
Direct Writes
• Normally:
– Write data to a temporary location, then atomically rename it to the final location
• With S3:
– Write data directly to the final location
– Atomic PUTs deal with speculation/retries
• By default in Hive; DirectFileOutputCommitter in MR/Spark
• Tricky: retries/speculation must use the same path
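The "same path" requirement can be sketched as follows: if the output key depends only on the task, and not on the attempt, a retry or speculative attempt overwrites the earlier object instead of leaving a duplicate. This is an illustration under assumptions, not the DirectFileOutputCommitter implementation. `FakeBucket`, `output_key`, and the key layout are hypothetical.

```python
class FakeBucket:
    """In-memory stand-in for an object store bucket with atomic PUTs."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        # The object becomes visible in one step, replacing any earlier copy.
        self.objects[key] = data


def output_key(table_path, partition, task_id):
    """Final key for a task's output; deliberately independent of the attempt number."""
    return f"{table_path}/{partition}/part-{task_id:05d}"


def run_task_attempt(bucket, table_path, partition, task_id, attempt, rows):
    # 'attempt' is intentionally unused when naming the output.
    bucket.put(output_key(table_path, partition, task_id), "\n".join(rows))


bucket = FakeBucket()
rows = ["a,1", "b,2"]
run_task_attempt(bucket, "warehouse/t", "dt=2017-01-20", 3, attempt=0, rows=rows)
run_task_attempt(bucket, "warehouse/t", "dt=2017-01-20", 3, attempt=1, rows=rows)
print(sorted(bucket.objects))
```

Had the attempt number been part of the key, the second attempt would have produced a second file and the table would contain duplicated rows.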
S3 Optimizations
• Object caches (per bucket): high gain for role-based accounts
• Connection pools
• Read-ahead optimizations
• Streaming upload
Cache! Cache! Cache!
• RubiX: block-level file cache
• Metadata caching for ORC and Parquet
RubiX
• Caches blocks on local disks
• Open source
• Engine-agnostic
• Works well with auto-scaling
• Consistent hashing to assign files or blocks to nodes
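Consistent hashing suits an auto-scaling cluster because removing a node only remaps the blocks that node owned. The sketch below is a generic hash ring, not RubiX's actual implementation. The `HashRing` class, the use of SHA-256, and the 100 virtual points per node are assumptions chosen for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Generic consistent-hash ring with virtual points per node."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self._ring = []  # sorted list of (hash, node) points
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.replicas):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [point for point in self._ring if point[1] != node]

    def node_for(self, key):
        if not self._ring:
            raise ValueError("ring is empty")
        # First point at or after the key's hash, wrapping around the ring.
        idx = bisect.bisect(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["node-1", "node-2", "node-3"])
blocks = [f"s3://bucket/data/file.orc#block-{i}" for i in range(200)]
before = {block: ring.node_for(block) for block in blocks}
ring.remove_node("node-3")  # e.g. the node was down-scaled
after = {block: ring.node_for(block) for block in blocks}
moved = sum(1 for block in blocks if before[block] != after[block])
print(f"{moved} of {len(blocks)} blocks changed owner")
```

Blocks cached on the surviving nodes keep their owner, so their cached copies remain useful after the cluster shrinks.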
RubiX [figure]
Metadata Caching: ORC File Format [figure]
Metadata Caching: Parquet File Format [figure]
Metadata Caching
• Cache on a Redis server running on the master
• Effective and efficient split computation with predicate pushdown (PPD)
• ORC and Parquet
• Engine-agnostic
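A rough sketch of the idea: file footers are read from cloud storage once, kept in a shared key-value cache, and reused during split computation so that predicate pushdown can skip stripes without touching the file again. Everything here is an assumption for illustration: `DictCache` is an in-memory stand-in for Redis, the footer layout is invented, and the cache key includes the modification time so that rewritten files are not served stale metadata.

```python
import json


class DictCache:
    """In-memory stand-in for a key-value cache such as Redis."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class FooterCache:
    def __init__(self, cache, read_footer):
        self.cache = cache
        self.read_footer = read_footer  # expensive read from cloud storage
        self.misses = 0

    def get_footer(self, path, modified_time):
        key = f"footer:{path}:{modified_time}"
        cached = self.cache.get(key)
        if cached is not None:
            return json.loads(cached)
        self.misses += 1
        footer = self.read_footer(path)
        self.cache.set(key, json.dumps(footer))
        return footer


def stripes_to_read(footer, column, lower_bound):
    """Pushdown for `column >= lower_bound` using per-stripe max statistics."""
    return [s["id"] for s in footer["stripes"] if s["max"][column] >= lower_bound]


def fake_read_footer(path):
    # Hypothetical footer: two stripes with a max statistic for column 'ts'.
    return {"stripes": [{"id": 0, "max": {"ts": 100}}, {"id": 1, "max": {"ts": 250}}]}


footers = FooterCache(DictCache(), fake_read_footer)
first = footers.get_footer("s3://bucket/t/part-0.orc", 1484870400)
second = footers.get_footer("s3://bucket/t/part-0.orc", 1484870400)
print(footers.misses, stripes_to_read(second, "ts", 200))
```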
Thank You!
20th January 2017

Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Spark3's new memory model/management
Spark3's new memory model/managementSpark3's new memory model/management
Spark3's new memory model/management
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 

BDAAS on the Cloud

  • 1. BDaaS On The Cloud: Challenges And Optimizations. Abhishek Somani, 20th January 2017
  • 3. Where Big Data falls short:
      – 6-18 month implementation time
      – Only 27% of Big Data initiatives were classified as “Successful” in 2014
      – Only 13% of organizations achieve full-scale production
      – 57% of organizations cite skills gap as a major inhibitor
      Also listed on the slide: rigid and inflexible infrastructure; non-adaptive software services; highly specialized systems; difficult to build and operate
  • 4. Why Cloud?
      1. Flexible Infrastructure
      2. Pay only for what you actually use
      3. Shared Storage
      4. Heterogeneous Clusters
  • 5. Agenda
      – Cloud Compute (Cluster) management: Challenges, Solutions, Advanced Optimizations
      – Cloud Storage: Challenges, Solutions and Optimizations
  • 6. Cloud Compute
      1. Properties:
         a. Ephemeral
         b. Volatile (Spot for AWS, Preemptible for GCP)
      2. Challenges:
         a. Scale as per workload
         b. Separation of compute and storage
         c. Job histories, log files, and results all need to be persisted
         d. Adapting YARN/HDFS to take into account ephemeral cloud nodes
  • 7. Up-scaling for MR jobs [diagram: User, Resource Manager, Cluster Manager, and Nodes 1-3, each running a NodeManager with containers C1-C4 and the MR AppMaster; flow labels: Submit Job, Launches MR AM, Container Request, Allocate Resources, Task Progress, Up Scale Request, Add Node]
  • 9. Down-scaling [diagram: User, Resource Manager, Cluster Manager, and Nodes 1-3, each running a NodeManager with containers C1-C4 for Jobs 1-3; flow labels: Submits Job, Allocates container, Status Update, Job1 Completes, Decommission Node, Graceful Shutdown, Remove Node]
      – Evaluates whether the cluster is being underutilized and can be down-scaled
      – Selects the node whose estimated task completion time is lowest
  • 10. Why is it hard?
      1. Upscaling
         a. Engine-specific algorithms
         b. Cannot just look at expected time (parallelism matters)
      2. Downscaling
         a. Decommissioning takes time
         b. Need to consider hour boundaries
         c. Stuck on mapper output
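The two down-scaling constraints above (hour boundaries, and picking the node whose tasks finish soonest) can be combined into one selection rule. The following is only a minimal sketch, not the scheduler described in the talk: the `Node` fields, the hourly billing period, and the 10-minute `window_s` policy are all assumptions made for illustration.

```python
from dataclasses import dataclass

BILLING_PERIOD_S = 3600  # assumption: nodes are billed in whole hours


@dataclass
class Node:
    name: str
    uptime_s: int              # seconds since the node was launched
    est_task_remaining_s: int  # estimated time until its running tasks finish


def seconds_to_hour_boundary(node: Node) -> int:
    """Seconds left in the node's current (already paid for) billing hour."""
    return BILLING_PERIOD_S - (node.uptime_s % BILLING_PERIOD_S)


def pick_node_to_remove(nodes, window_s=600):
    """Pick a down-scaling candidate, or None if no node qualifies.

    A node qualifies when it is within `window_s` seconds of its hour
    boundary and its tasks are expected to finish before that boundary.
    Among qualifying nodes, the one finishing soonest is chosen.
    """
    candidates = [
        n for n in nodes
        if seconds_to_hour_boundary(n) <= window_s
        and n.est_task_remaining_s <= seconds_to_hour_boundary(n)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.est_task_remaining_s)
```

Removing a node early in its billing hour wastes capacity that is already paid for, which is why this sketch only considers nodes close to their boundary.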
  • 11. Job History – Terminated Cluster
  • 12. Job History – Terminated Cluster [diagram: User, Qubole UI, Cluster Proxy, Job History Server; flow labels: Clicks UI link, Proxifies Link, Authenticates the request, Finds cluster is down, Fetches jhist file from cloud, Jhist file, Rendered JobHist]
  • 13. Advanced Optimizations
      1. Volatile Nodes
         a. Lower-priced nodes bought in an auction (Spot Nodes in AWS, Preemptible in GCE)
      2. Hybrid Clusters
         a. Mix of stable and volatile nodes to improve stability
      3. Heterogeneous Clusters
         a. Preferred machine types may not be available
         b. Preferred machine types may be more expensive than larger machines
      4. Autoscaling Optimizations
         a. Packing of tasks
         b. Upload intermediate data to cloud storage
         c. Recommission nodes
  • 14. Agenda
      1. Cloud Compute (Cluster) management
         a. Challenges
         b. Scaling
         c. Advanced Optimizations
      2. Cloud Storage
         a. Challenges
         b. Solutions and Optimizations
  • 15. Cloud Storage
      1. Properties:
         a. Simple key-value store
         b. Inexpensive
         c. Accessed via REST APIs/SDK
         d. Is the source of truth
      2. Challenges:
         a. Connection establishment is expensive
         b. Copying/moving is expensive; there is no rename
      3. Some positives:
         a. Prefix listing
         b. PUTs are atomic: the file appears only once the upload completes, unlike HDFS, where it is created on first write
         c. Multipart
  • 16. Prefix Listing (Storage Optimizations)
      – Naive:
          for path in ['/x/y/a', '/x/y/b', '/x/z/c', … ]:
              result << listObject(path)
      – Smart:
          pathList = listPrefix('/x')
          while (entry = pathList.next()):
              if entry in ['/x/y/a', '/x/y/b', '/x/z/c', … ]:
                  result << entry
      – Up to 1000x improvement
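The naive and smart listing strategies on this slide can be compared by counting API calls. The sketch below is an illustration only: `FakeObjectStore` is a hypothetical in-memory stand-in, not a real SDK, and its page size of 1000 keys per list call is an assumption chosen to mirror the "up to 1000x" figure.

```python
class FakeObjectStore:
    """Hypothetical in-memory object store that counts API calls."""

    def __init__(self, keys, page_size=1000):
        self._sorted_keys = sorted(keys)
        self._key_set = set(keys)
        self.page_size = page_size  # assumption: keys returned per list call
        self.calls = 0

    def list_object(self, key):
        """One API call per key looked up."""
        self.calls += 1
        return [key] if key in self._key_set else []

    def list_prefix(self, prefix):
        """Yield every key under `prefix`, one API call per page."""
        matching = [k for k in self._sorted_keys if k.startswith(prefix)]
        for start in range(0, max(len(matching), 1), self.page_size):
            self.calls += 1
            yield from matching[start:start + self.page_size]


def naive_list(store, paths):
    """Look up each wanted path with its own request."""
    result = []
    for path in paths:
        result.extend(store.list_object(path))
    return result


def smart_list(store, prefix, paths):
    """List the common prefix once and filter on the client side."""
    wanted = set(paths)
    return [entry for entry in store.list_prefix(prefix) if entry in wanted]
```

With 1500 wanted paths under one prefix of 2000 keys, `naive_list` makes 1500 calls in this model while `smart_list` makes 2, at the cost of fetching and discarding the entries that were not wanted.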
  • 17. Prefix Listing - Use Cases
      1. Split computation: divide input files into tasks for Map-Reduce/Spark/Presto
      2. Recovering partitions
      3. Listing paths that match a pattern ('/x/y/z/*/*')
      4. and many more
  • 18. Direct Writes
      – Normally: write data to a temporary location, then atomically rename to the final location
      – With S3: write data to the final location; atomic PUTs deal with speculation/retries
      – By default in Hive; DirectFileOutputCommitter in MR/Spark
      – Tricky: retries/speculation must use the same path
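The "same path" requirement for direct writes can be illustrated with a toy model. This is a sketch under stated assumptions, not the DirectFileOutputCommitter implementation: `InMemoryStore`, `output_key`, and `run_attempt` are hypothetical names, and the model assumes every attempt of a task produces identical output.

```python
class InMemoryStore:
    """Hypothetical object store whose put() makes an object visible all at once."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        # The whole object appears in a single step; readers never see a partial file.
        self.objects[key] = bytes(data)


def output_key(final_dir, task_id):
    """Derive the key from the task id only, never from the attempt number."""
    return f"{final_dir}/part-{task_id:05d}"


def run_attempt(store, final_dir, task_id, attempt, records):
    """Write one task attempt's output straight to its final location."""
    # `attempt` is deliberately unused in the key, so a retry or a speculative
    # attempt overwrites the same object instead of creating a duplicate.
    data = "\n".join(records).encode()
    store.put(output_key(final_dir, task_id), data)
```

If the key included the attempt number, two attempts of the same task would leave two output files behind, and with no rename step nothing would clean up the extra one.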
  • 19. S3 Optimizations
      – Object caches (per bucket): high gain for role-based accounts
      – Connection pools
      – Read-ahead optimizations
      – Streaming upload
  • 20. Cache! Cache! Cache!
      – RubiX: block-level file cache
      – Metadata caching for ORC and Parquet
  • 21. RubiX
      – Caches blocks on local disks
      – Open source
      – Engine agnostic
      – Works well with auto-scaling
      – Uses consistent hashing to assign files or blocks to nodes
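Consistent hashing fits auto-scaling because adding or removing a node remaps only a fraction of the cached blocks. The ring below is a generic sketch of the technique, not RubiX's actual code: the `HashRing` class, the MD5-based hash, the virtual-node count, and the `path:block` key format are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable 64-bit hash, so placement is identical across processes."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        """Place `vnodes` points for the node on the ring."""
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        """Drop all of the node's points; other nodes' points are untouched."""
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key):
        """Return the node owning the first ring point at or after the key's hash."""
        if not self._ring:
            raise ValueError("ring is empty")
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        if idx == len(self._ring):
            idx = 0  # wrap around to the start of the ring
        return self._ring[idx][1]
```

For example, `HashRing(["node1", "node2", "node3"]).node_for("/x/y/a:0")` returns the node that would cache block 0 of that file. After `add_node("node4")`, every key either keeps its owner or moves to `node4`, so existing caches on the other nodes stay valid.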
  • 25. Metadata Caching
      – Cache on a Redis server running on the master
      – Effective and efficient split computation with PPD (predicate pushdown)
      – ORC and Parquet
      – Engine agnostic