Optimizing Big Data to run in the Public Cloud
April 23, 2015
NYC Hadoop Meetup
A little bit about Qubole
Ashish Thusoo
Founder & CEO
Joydeep Sen Sarma
Founder & CTO
Founded in 2011 by the pioneers of “big data” at
Facebook and the creators of the Apache Hive Project
Based in Mountain View, CA with offices in Bangalore,
India. Investments by Charles River, LightSpeed, Norwest
Ventures.
World class product and engineering team from:
Team
Qubole QDS
Qubole works in:
• Adtech
• Media & Entertainment
• Healthcare
• Retail
• eCommerce
Qubole works best when:
• Born in Cloud
• Commitment to Public Cloud
• Data Driven
• Large scale data
• Lack Hadoop Skills
• Analysts & scientists need access
Standard Hadoop (on premises)
- JobTrackers
- TaskTrackers
- NameNodes
- DataNodes
= Datacenter, servers, VMs, wires…
How about adding more capacity? Dev/Test environments?
Non-Technical Users? Version upgrades?
Hadoop in the Cloud
= Someone else’s datacenter, servers, VMs, wires…
Designed for capacity scaling (and reduction)
Designed for multiple Dev/Test environments
Potential UI for Non-Technical Users
Potential support for version upgrades
Ad hoc queries can spin up a cluster on-demand exactly when needed
= Cost reduction, self-service, custom configurable clusters
*Security (important and production ready, but we won’t focus on it here)
Great, what about S3?
(so what?)
Data stored in HDFS requires nodes to be kept running
continuously (an EC2 instance). This can be expensive.
Data stored in S3 means it is remote from the compute nodes.
S3 performs well in general, but it’s not uncommon to see
significant variance in performance.
Split Computation and File I/O
Multiple map tasks are instantiated and each of these is assigned a split.
Hadoop needs to know the size of input files so that they can be grouped
into equal sized splits.
Input files are spread across many directories.
For example, two years of data, organized into hourly directories, results
in 17520 directories. If each directory contains 6 files, this makes a grand
total of 105,120 files.
Map-Reduce calls the generic Hadoop file listing API against each input
directory to get the size of all files in the directory.
Split Computation and File I/O
In our example, this results in 17,520 API calls. That is not a big
deal on HDFS, but it results in very bad performance on S3.
Every listing call in S3 involves a REST API call and parsing of
XML results, which has very high overhead and latency.
Furthermore, Amazon employs protection mechanisms against a high
rate of API calls. For certain workloads, split computation becomes
a huge bottleneck.
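The arithmetic in this example can be checked with a short sketch. The function and constant names are illustrative assumptions rather than anything from Hadoop or the deck, and the sketch assumes 365-day years and one listing call per input directory.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumption: leap days ignored, matching the slide's numbers


def per_directory_listing(years: int, files_per_directory: int) -> dict:
    """Count directories, files, and listing calls for hourly directories."""
    directories = years * DAYS_PER_YEAR * HOURS_PER_DAY
    return {
        "directories": directories,
        "files": directories * files_per_directory,
        # One generic listing call is issued per input directory.
        "listing_calls": directories,
    }


example = per_directory_listing(years=2, files_per_directory=6)
```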
Data Consistency
From AWS FAQ:
What data consistency model does Amazon S3 employ?
Amazon S3 buckets in the US Standard region provide eventual
consistency. Amazon S3 buckets in all other regions provide read-after-
write consistency for PUTS of new objects and eventual consistency for
overwrite PUTS and DELETES.
Data Consistency
• Eventual Consistency
• Read-After-Write Consistency
• Why is Read-After-Write Consistency Useful?
Data Consistency
• Read-after-update consistency
• Read-after-delete
Data Consistency
Specify named S3 endpoints instead of US Standard.
For example, replace:
http://mybucket.s3.amazonaws.com/somekey.ext
with:
http://mybucket.s3-external-1.amazonaws.com/somekey.ext
http://docs.aws.amazon.com/redshift/latest/dg/managing-data-consistency.html
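The endpoint substitution can be sketched as a small URL rewrite. The helper name `use_named_endpoint` is hypothetical; the sketch only handles virtual-hosted URLs of the form shown on the slide and leaves any other URL untouched.

```python
from urllib.parse import urlsplit, urlunsplit

US_STANDARD_SUFFIX = ".s3.amazonaws.com"
NAMED_ENDPOINT = "s3-external-1.amazonaws.com"  # endpoint named on the slide


def use_named_endpoint(url: str, endpoint: str = NAMED_ENDPOINT) -> str:
    """Rewrite <bucket>.s3.amazonaws.com to <bucket>.<endpoint>."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.endswith(US_STANDARD_SUFFIX):
        bucket = host[: -len(US_STANDARD_SUFFIX)]
        host = bucket + "." + endpoint
    return urlunsplit(parts._replace(netloc=host))


rewritten = use_named_endpoint("http://mybucket.s3.amazonaws.com/somekey.ext")
```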
Qubole DataFlow Diagram
[Diagram. Labeled components: User Access (Qubole UI via browser, SDK, ODBC); Qubole’s AWS Account; Customer’s AWS Account; REST API (HTTPS); SSH; Web Tier (web servers); encrypted result cache; RDS holding Qubole user and account configurations (encrypted credentials); default Hive metastore; ephemeral Hadoop clusters managed by Qubole (master and slaves with encrypted HDFS); Amazon S3 with S3 Server Side Encryption; data flow within the customer’s AWS account; optional custom Hive metastore; optional other RDS and Redshift sources.]
Encryption Options:
a) Qubole can encrypt the result cache
b) Qubole supports encryption of the ephemeral drives used for HDFS
c) Qubole supports S3 Server Side Encryption
QDS Platform Features
Auto-Scaling self managed Hadoop Clusters in Cloud
– Including Amazon EC2, Rackspace, Google Compute & OpenStack
The Fastest Hadoop running on the cloud
– Numerous Optimizations that provide 4 to 8 times faster performance than Amazon
Elastic MapReduce (EMR)
Pre-built connectors
– Traditional RDBMS, MongoDB and other NoSQL solutions
– Incremental Data Scrapes
Job Scheduler
– Dependencies, Workflows, Incremental Jobs
Multi-Platform Support
– Supports AWS, Google & Azure Credentials
QDS Capabilities
Mix and Match Reserved & Spot instances
– To reduce the cost of compute hours on the cloud
Perform data exploration and analysis on raw multi-structured
data formats.
Integration with data visualization & BI tools via ODBC
– Tableau Software, Pentaho, Excel
All functionality also available through APIs and toolkits
QDS Platform Features
Faster split computations on S3
To solve this problem, we modified split computation to invoke listing at the level of the parent directory.
This call returns all files (and their sizes) in all subdirectories, in blocks of 1000.
Some subdirectories and files may not be of interest to the job/query, e.g. partition elimination may have eliminated
some of them.
We take advantage of the fact that the file listing is in lexicographic order and perform a modified merge join of the
list of files and the list of directories of interest.
This allows us to efficiently identify the sizes of the files of interest. The modified algorithm results in only 106 API
calls (each call returns 1000 files) compared to 17,520 API calls in the original implementation.
We compared the two approaches using a simple Hive test. In this test, we take a partitioned table T with 15,000 files
but vary the number of partitions (a partition corresponds to a directory). We compare the performance of
‘select count(*) from T’. In the extreme case, this optimization shows a speedup of 8x.
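The merge join over a sorted listing can be sketched as follows. This is an illustration under stated assumptions, not Qubole's implementation: the function names are hypothetical, the listing is modeled as an in-memory sequence of (key, size) pairs already in lexicographic order, and the directories of interest are sorted, non-nested prefixes ending in "/".

```python
import math
from typing import Dict, Iterable, List, Tuple

PAGE_SIZE = 1000  # block size of one listing call, as quoted above


def listing_calls(total_files: int, page_size: int = PAGE_SIZE) -> int:
    """Number of paged listing calls needed to enumerate all files."""
    return math.ceil(total_files / page_size)


def sizes_for_directories(
    listing: Iterable[Tuple[str, int]], wanted_dirs: List[str]
) -> Dict[str, int]:
    """Single pass over a sorted listing, keeping files under wanted_dirs."""
    result: Dict[str, int] = {}
    dirs = iter(wanted_dirs)
    current = next(dirs, None)
    for key, size in listing:
        # Skip directories that sort entirely before this key.
        while current is not None and current < key and not key.startswith(current):
            current = next(dirs, None)
        if current is None:
            break  # no directories of interest remain
        if key.startswith(current):
            result[key] = size
    return result


listing = [("t/h=00/f1", 10), ("t/h=01/f1", 20), ("t/h=02/f1", 30), ("t/h=02/f2", 40)]
selected = sizes_for_directories(listing, ["t/h=00/", "t/h=02/"])
calls = listing_calls(105120)
```

Because both inputs are sorted, each directory pointer only moves forward, so the cost is linear in the length of the listing plus the number of directories.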
Faster reads from S3
Opening a file takes a significant amount of time: at least 50 milliseconds per file.
This problem becomes pronounced when the input dataset has lots of small files and file open latency
forms a significant portion of overall execution time.
To alleviate this problem, we included an optimization wherein we open an S3 file in a background thread
a little while before it is actually required by the map task. This hides the file open latency.
One thing to be aware of is that if an S3 file is opened but not read from for a while, S3 returns a
RequestTimeout and potentially penalizes the caller.
We tested this optimization with a simple Hive test. Our dataset consisted of 80,000 files, each of size
640 KB. We noticed an improvement of 30% in a count(*) query as a result of this optimization.
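The background-open idea can be sketched with a thread pool. This is a generic illustration, not the code path inside Hadoop or QDS: `prefetch_open` and `open_fn` are hypothetical names, and `open_fn` stands in for whatever call opens an S3 object. The lookahead is kept small on purpose, since a file that is opened long before it is read risks the RequestTimeout mentioned above.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def prefetch_open(
    paths: Iterable[str], open_fn: Callable[[str], T], lookahead: int = 1
) -> Iterator[T]:
    """Yield opened handles in order while the next ones open in background."""
    lookahead = max(1, lookahead)
    remaining = iter(paths)
    with ThreadPoolExecutor(max_workers=lookahead) as pool:
        pending = deque()
        # Start opening the first few files.
        for path in remaining:
            pending.append(pool.submit(open_fn, path))
            if len(pending) >= lookahead:
                break
        while pending:
            handle = pending.popleft().result()
            # Kick off the next open before handing this file to the caller.
            nxt = next(remaining, None)
            if nxt is not None:
                pending.append(pool.submit(open_fn, nxt))
            yield handle


# Stand-in open function so the sketch runs without S3 access.
opened = list(prefetch_open(["a", "b", "c"], lambda p: p.upper()))
```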
Attaching to EBS Volumes
Storage on EC2 Instances:
Instance Type    Instance Storage (GB)
c1.xlarge        1680 [4 x 420]
c3.2xlarge       160 [2 x 80 SSD]
c3.4xlarge       320 [2 x 160 SSD]
c3.8xlarge       640 [2 x 320 SSD]
The amount of storage per instance might not be
sufficient for running Hadoop clusters with high
volumes of data.
Attaching to EBS Volumes
• AWS offers raw block devices called EBS
volumes which can be attached to EC2
instances.
• It is possible to attach multiple EBS volumes
with a size up to 1 TB per volume. This can
easily compensate for the low instance
storage available on the new generation
instances. Also, using EBS volumes for
storage costs much less than adding
cheaper instances with more storage capacity.
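A rough sizing sketch for the point above: given a per-node storage requirement and the instance storage from the table, estimate how many EBS volumes to attach. The function name is hypothetical, and treating 1 TB as 1024 GB is an assumption made for the arithmetic.

```python
import math

MAX_VOLUME_GB = 1024  # assumption: "up to 1 TB per volume" taken as 1024 GB


def ebs_volumes_needed(
    required_gb: int, instance_storage_gb: int, volume_gb: int = MAX_VOLUME_GB
) -> int:
    """Volumes needed to cover storage beyond what the instance provides."""
    shortfall = required_gb - instance_storage_gb
    if shortfall <= 0:
        return 0
    return math.ceil(shortfall / volume_gb)


# c3.2xlarge has 160 GB of instance storage in the table above.
volumes = ebs_volumes_needed(required_gb=2000, instance_storage_gb=160)
```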
Attaching to EBS Volumes
• Configurable Reserved Volumes
• On the Qubole platform, using new generation
instances, users have an option to use
reserved volumes if the data requirement exceeds
the local storage available in the cluster.
• AWS EBS volumes come in various flavors
e.g. magnetic, SSD backed. Users can select
the size and type of EBS volumes based on
the data and performance requirements.
[Diagram labels: SSD, Reserved Disk, Disk Access, HDFS, MapReduce]
Protection Against Bad Jobs
A single Hadoop cluster is usually shared across many users.
- It is a common occurrence that a user issues a bad job which
degrades the performance of the entire cluster.
Running out of Disk:
- Single mapper issuing too much output.
- A map/reduce job may have a lot of mappers outputting too much map data
- Reducer tasks copying a lot of map output data during the shuffle phase
Protection Against Bad Jobs
Qubole’s Hadoop distribution protects clusters against such
jobs. Its clusters periodically monitor the jobs and kill any job that may be
affecting the entire cluster.
Kill job when …
- Total map output of a job is beyond a configurable value
- Any task produces more map output than a set disk percent
- A job produces a lot of logs (configurable value)
- Reducers read a lot of map data (configurable value)
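The four kill conditions can be sketched as a simple rule check. Every name here (`JobStats`, `Limits`, `kill_reasons`, and the field names) is hypothetical and chosen for illustration; the actual configuration keys in Qubole's distribution are not given in the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobStats:
    total_map_output_bytes: int
    max_task_map_output_disk_pct: float  # largest share of disk used by one task
    log_bytes: int
    reducer_input_bytes: int


@dataclass
class Limits:
    max_total_map_output_bytes: int
    max_task_map_output_disk_pct: float
    max_log_bytes: int
    max_reducer_input_bytes: int


def kill_reasons(stats: JobStats, limits: Limits) -> List[str]:
    """Return the rules a job violates; an empty list means it may keep running."""
    reasons = []
    if stats.total_map_output_bytes > limits.max_total_map_output_bytes:
        reasons.append("total map output exceeds limit")
    if stats.max_task_map_output_disk_pct > limits.max_task_map_output_disk_pct:
        reasons.append("a task's map output exceeds disk percent")
    if stats.log_bytes > limits.max_log_bytes:
        reasons.append("job logs exceed limit")
    if stats.reducer_input_bytes > limits.max_reducer_input_bytes:
        reasons.append("reducers read too much map data")
    return reasons


limits = Limits(10**12, 50.0, 10**9, 10**12)
bad_job = JobStats(2 * 10**12, 10.0, 5 * 10**9, 10**6)
violations = kill_reasons(bad_job, limits)
```

A periodic monitor would evaluate such a check for each running job and kill those with a non-empty result.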
Direct output commit to S3
The default S3 code path writes to a temp directory and then moves
the temp directory into its final location.
Move on S3 is really a copy and then delete.
Instead, write into the target directory.
Direct output commit to S3 - Hive
Changed the naming scheme for the files Hive creates.
• Instead of 00000 we use names like UUID_000000 where all files generated by a
single insert into use the same prefix.
• Guarantees that a new insert into will not stomp on data produced by an earlier query.
To support insert overwrite with dynamic partitions the tasks that write into the directory
must delete any existing files.
Before the insert overwrite begins we generate a UUID to use for this statement. All the
mappers/reducers when deleting from a directory will delete all files that don't begin with
this UUID.
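The naming scheme and the overwrite cleanup can be sketched in a few lines. The helper names are hypothetical, and the exact file name layout (prefix, underscore, six-digit task index) is an assumption modeled on the UUID_000000 example above.

```python
import uuid
from typing import List


def new_statement_prefix() -> str:
    """One prefix per insert statement, shared by all of its tasks."""
    return uuid.uuid4().hex


def output_file_name(prefix: str, task_index: int) -> str:
    """File name for one task's output, e.g. <prefix>_000003."""
    return f"{prefix}_{task_index:06d}"


def files_to_delete(existing: List[str], prefix: str) -> List[str]:
    """For insert overwrite: everything not written by this statement."""
    return [name for name in existing if not name.startswith(prefix)]


name = output_file_name("abc", 3)
stale = files_to_delete(["000000_0", "abc_000000", "old_000001"], "abc")
```

Since every task deletes only files that lack the current statement's prefix, tasks writing into the same directory do not remove each other's output.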
Direct output commit to S3 - MapReduce
By default the MR code also writes to a temp location and then moves to the final location. The moves
are done by listing the temp location and moving all the files there.
• To avoid this we track expected file counts for a couple of file formats
• Provide direct committers which avoid the move completely
Spot Instances, Placement Policy, and Fallback to on-demand
Spot Instances
Spot Instances allow users to bid on unused Amazon EC2 capacity and run those instances for as
long as their bid exceeds the Spot Price. QDS enables you to realize cost savings of as much as
50% to 60% by supporting the Spot Instance pricing model in addition to the Reserved Instance
pricing.
Use Qubole Placement Policy:
When using spot instances for slaves, this ensures that at least one replica of each HDFS block is
placed on Stable instances. It is recommended to keep this enabled when using spot instances.
Fallback to on-demand:
When upscaling the cluster, sometimes we may not be able to procure Spot Instances because of
low availability or high price. This option specifies that autoscaling should then fall back to procuring
On-Demand instances. This will increase the cost of running the cluster, but ensures that the
processing completes relatively quickly. Enable this if command processing time is important to you.
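The two options can be sketched as small decision functions. This is an illustration only: `plan_upscale` and `place_replicas` are hypothetical names, the number of Spot Instances obtained is passed in rather than requested from AWS, and the replica choice is a simplification of HDFS block placement.

```python
from typing import Dict, List


def plan_upscale(
    nodes_needed: int, spot_obtained: int, fallback_to_on_demand: bool
) -> Dict[str, int]:
    """Split an upscale request between Spot and On-Demand instances."""
    spot = min(nodes_needed, spot_obtained)
    shortfall = nodes_needed - spot
    on_demand = shortfall if fallback_to_on_demand else 0
    return {"spot": spot, "on_demand": on_demand, "unfilled": shortfall - on_demand}


def place_replicas(
    stable_nodes: List[str], spot_nodes: List[str], replication: int = 3
) -> List[str]:
    """Choose nodes for one block so at least one replica is on a stable node."""
    if not stable_nodes:
        raise ValueError("placement policy needs at least one stable node")
    chosen = [stable_nodes[0]]
    others = spot_nodes + stable_nodes[1:]
    chosen.extend(others[: replication - 1])
    return chosen


with_fallback = plan_upscale(10, 4, fallback_to_on_demand=True)
without_fallback = plan_upscale(10, 4, fallback_to_on_demand=False)
replicas = place_replicas(["stable-1"], ["spot-1", "spot-2", "spot-3"])
```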
DIY vs. Qubole
[Diagram. Users: Operations Analyst, Marketing Ops Analyst, Data Architects, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers]
[Layers: Developer Tools, Service Management, Data Workbench, Cloud Data Platform, BI & DW Systems]
• SDK
• API
• Analysis
• Security
• Job Scheduler
• Data Governance
• Analytics templates
• Monitoring
• Support
• Collaboration
• Workflow & Map/Reduce
• Auto Scaling
• Cloud Optimization
• Data Connectors
• YARN
• Spark & Pig
• Presto & Hive
Features
S3 Caching: S3 caching utilizes resources more efficiently and
brings up clusters faster – up to 10x faster on some client instances.
Variable Spot Instance Pricing: QDS allows you
to vary the number of spots vs. on-demand nodes,
providing the benefits of spot pricing (up to 90%
less expensive) with the certainty of getting your
job done.
Searchable Queries and Log Files: QDS shows
you all of your jobs, allowing you to compare the
efficiency of queries and avoid having to re-create
queries from scratch.
Built-in Job Tracker: QDS provides a job tracker
accessed directly through the UI, allowing you to
identify resources, nodes, and tasks.
Security: QDS has a tight security environment,
with the ability to encrypt data at rest on nodes.
Use Cases and Additional Information
Why Qubole?
“Qubole has enabled more users within Pinterest to get to the
data and has made the data platform lot more scalable and
stable”
Mohammad Shahangian - Lead, Data Science and Infrastructure
Moved to Qubole from Amazon EMR because
of stability and rapidly expanded big data usage by
giving access to data to users beyond developers.
Rapid expansion of big data beyond developers (240 users
out of 600 person company)
Use Cases
User and Query Growth
Rapid expansion in use cases ranging from ETL, search,
adhoc querying, product analytics etc.
Rock solid infrastructure sees 50% fewer failures as
compared to Amazon Elastic MapReduce
Enterprise scale processing and data access
Why Qubole?
“We needed something that was reliable and easy to learn,
setup, use and put into production without the risk and high
expectations that comes with committing millions of dollars in
upfront investment. Qubole was that thing.”
Marc Rosen - Sr. Director, Data Analytics
Moved to Big data on the cloud (from internal Oracle
clusters) because getting to analysis was much
quicker than operating infrastructure themselves.
Used to answer client queries and power client
dashboards.
Use Cases
[Chart: # Commands Per Month; axis: Number of queries, 0 to 5000]
Segment audiences based on their behavior including
such topics as user pathway and multi-dimensional
recency analysis
Build customer profiles (both uni/multivariate) across
thousands of first party (i.e., client CRM files) and third
party (i.e., demographic) segments
Simplify attribution insights showing the effects of upper
funnel prospecting on lower funnel remarketing media
strategies
[Diagram labels: Operations Analyst, Marketing Ops Analyst, Data Architect, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers, Data Infrastructure]
Links for more information
http://www.datacenterknowledge.com/archives/2015/04/02/hybrid-clouds-need-for-speed/
http://engineering.pinterest.com/post/92742371919/powering-big-data-at-pinterest
http://www.itbusinessedge.com/slideshows/six-details-your-big-data-provider-wont-tell-you.html
http://www.marketwired.com/press-release/qubole-reports-rapid-adoption-of-its-self-service-big-data-analytics-platform-1990272.htm

More Related Content

What's hot

Hadoop Ecosystem at a Glance
Hadoop Ecosystem at a GlanceHadoop Ecosystem at a Glance
Hadoop Ecosystem at a GlanceNeev Technologies
 
Hadoop in the Cloud: Common Architectural Patterns
Hadoop in the Cloud: Common Architectural PatternsHadoop in the Cloud: Common Architectural Patterns
Hadoop in the Cloud: Common Architectural PatternsDataWorks Summit
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
Ignite Your Big Data With a Spark!
Ignite Your Big Data With a Spark!Ignite Your Big Data With a Spark!
Ignite Your Big Data With a Spark!Progress
 
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...Cloudera, Inc.
 
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannAzure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannDatabricks
 
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...✔ Eric David Benari, PMP
 
Qubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europeQubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europeJoydeep Sen Sarma
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016StampedeCon
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkMatt Ingenthron
 
MapR-DB Elasticsearch Integration
MapR-DB Elasticsearch IntegrationMapR-DB Elasticsearch Integration
MapR-DB Elasticsearch IntegrationMapR Technologies
 
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, BlazegraphDatabase Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph✔ Eric David Benari, PMP
 
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Spark Summit
 
Data Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringData Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringAnant Corporation
 
Big Data and Hadoop - key drivers, ecosystem and use cases
Big Data and Hadoop - key drivers, ecosystem and use casesBig Data and Hadoop - key drivers, ecosystem and use cases
Big Data and Hadoop - key drivers, ecosystem and use casesJeff Kelly
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightGert Drapers
 
Is Cloud a right Companion for Hadoop
Is Cloud a right Companion for HadoopIs Cloud a right Companion for Hadoop
Is Cloud a right Companion for HadoopDataWorks Summit
 
The Holy Grail of Data Analytics
The Holy Grail of Data AnalyticsThe Holy Grail of Data Analytics
The Holy Grail of Data AnalyticsDan Lynn
 

What's hot (20)

Hadoop Ecosystem at a Glance
Hadoop Ecosystem at a GlanceHadoop Ecosystem at a Glance
Hadoop Ecosystem at a Glance
 
Hadoop in the Cloud: Common Architectural Patterns
Hadoop in the Cloud: Common Architectural PatternsHadoop in the Cloud: Common Architectural Patterns
Hadoop in the Cloud: Common Architectural Patterns
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Hadoop and HBase @eBay
Hadoop and HBase @eBayHadoop and HBase @eBay
Hadoop and HBase @eBay
 
Ignite Your Big Data With a Spark!
Ignite Your Big Data With a Spark!Ignite Your Big Data With a Spark!
Ignite Your Big Data With a Spark!
 
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
 
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannAzure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
 
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
 
Qubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europeQubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europe
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with Spark
 
MapR-DB Elasticsearch Integration
MapR-DB Elasticsearch IntegrationMapR-DB Elasticsearch Integration
MapR-DB Elasticsearch Integration
 
Hadoop at Ebay
Hadoop at EbayHadoop at Ebay
Hadoop at Ebay
 
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, BlazegraphDatabase Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
 
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
 
Data Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringData Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data Engineering
 
Big Data and Hadoop - key drivers, ecosystem and use cases
Big Data and Hadoop - key drivers, ecosystem and use casesBig Data and Hadoop - key drivers, ecosystem and use cases
Big Data and Hadoop - key drivers, ecosystem and use cases
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
 
Is Cloud a right Companion for Hadoop
Is Cloud a right Companion for HadoopIs Cloud a right Companion for Hadoop
Is Cloud a right Companion for Hadoop
 
The Holy Grail of Data Analytics
The Holy Grail of Data AnalyticsThe Holy Grail of Data Analytics
The Holy Grail of Data Analytics
 

Similar to Optimizing Big Data to run in the Public Cloud

Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Precisely
 
Hadoop and object stores can we do it better
Hadoop and object stores  can we do it betterHadoop and object stores  can we do it better
Hadoop and object stores can we do it bettergvernik
 
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?gvernik
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudShubham Tagra
 
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE UNDER AZURE ...
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE   UNDER AZURE ...COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE   UNDER AZURE ...
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE UNDER AZURE ...Megha Shah
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSjavier ramirez
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesAmazon Web Services
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Lucidworks
 
Escalando Aplicaciones Web
Escalando Aplicaciones WebEscalando Aplicaciones Web
Escalando Aplicaciones WebSantiago Coffey
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 
Open Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and CloudOpen Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and CloudAlluxio, Inc.
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Vinoth Chandar
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesVladimir Simek
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environmentDelhi/NCR HUG
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Alluxio, Inc.
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduceAmazon Web Services
 

Similar to Optimizing Big Data to run in the Public Cloud (20)

Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
 
Hadoop and object stores can we do it better
Hadoop and object stores  can we do it betterHadoop and object stores  can we do it better
Hadoop and object stores can we do it better
 
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the Cloud
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE UNDER AZURE ...
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE   UNDER AZURE ...COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE   UNDER AZURE ...
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE UNDER AZURE ...
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
 
Escalando Aplicaciones Web
Escalando Aplicaciones WebEscalando Aplicaciones Web
Escalando Aplicaciones Web
 
HDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the CloudHDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the Cloud
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Open Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and CloudOpen Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and Cloud
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutes
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
 

Optimizing Big Data to run in the Public Cloud

  • 1. Optimizing Big Data to run in the Public Cloud April 23, 2015 NYC Hadoop Meetup
  • 2. A little bit about Qubole. Ashish Thusoo, Founder & CEO. Joydeep Sen Sarma, Founder & CTO. Founded in 2011 by the pioneers of “big data” at Facebook and the creators of the Apache Hive project. Based in Mountain View, CA, with offices in Bangalore, India. Investments by Charles River, LightSpeed, Norwest Ventures. World-class product and engineering team.
  • 3. Qubole QDS. Qubole works in: • Adtech • Media & Entertainment • Healthcare • Retail • eCommerce. Qubole works best when: • Born in Cloud • Commitment to Public Cloud • Data Driven • Large scale data • Lack Hadoop Skills • Analysts & scientists need access
  • 4. Standard Hadoop (on premises) - JobTrackers - TaskTrackers - NameNodes - DataNodes = Datacenter, servers, VMs, wires… How about adding more capacity? Dev/Test environments? Non-Technical Users? Version upgrades?
  • 5. Hadoop in the Cloud = Someone else’s datacenter, servers, VMs, wires… Designed for capacity scaling (and reduction). Designed for multiple Dev/Test environments. Potential UI for Non-Technical Users. Potential support for version upgrades. Ad hoc queries can spin up a cluster on demand exactly when needed = Cost reduction, self-service, custom configurable clusters. *Security (important and production ready, but we won’t focus on it here)
  • 6. Great, what about S3? (so what?)
  • 7. Data stored in HDFS requires nodes (EC2 instances) to be kept running continuously. This can be expensive. Data stored in S3 is remote from the compute nodes. S3 performs well in general, but it’s not uncommon to see significant variance in performance.
  • 8. Split Computation and File I/O. Multiple map tasks are instantiated and each of these is assigned a split. Hadoop needs to know the size of the input files so that they can be grouped into equal-sized splits. Input files are spread across many directories. For example, two years of data, organized into hourly directories, results in 17,520 directories. If each directory contains 6 files, this makes a grand total of 105,120 files. MapReduce calls the generic Hadoop file listing API against each input directory to get the size of all files in the directory.
  • 9. Split Computation and File I/O. In our example, this results in 17,520 API calls. This is not a big deal on HDFS, but results in very bad performance on S3. Every listing call in S3 involves a REST API call and parsing of XML results, which has very high overhead and latency. Furthermore, Amazon employs protection mechanisms against a high rate of API calls. For certain workloads, split computation becomes a huge bottleneck.
  • 10. Data Consistency. From the AWS FAQ: What data consistency model does Amazon S3 employ? Amazon S3 buckets in the US Standard region provide eventual consistency. Amazon S3 buckets in all other regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES.
  • 11. Data Consistency • Eventual Consistency • Read-After-Write Consistency • Why is Read-After-Write Consistency Useful?
  • 12. Data Consistency • Read-after-update consistency • Read-after-delete
  • 13. Data Consistency. Specify named S3 endpoints instead of US Standard. For example, replace: http://mybucket.s3.amazonaws.com/somekey.ext with: http://mybucket.s3-external-1.amazonaws.com/somekey.ext See http://docs.aws.amazon.com/redshift/latest/dg/managing-data-consistency.html
  • 14. Qubole DataFlow Diagram. Diagram components: user access (Qubole UI via browser, SDK, ODBC) over the REST API (HTTPS); Qubole’s AWS account; customer’s AWS account; ephemeral Hadoop clusters managed by Qubole (master and slaves, encrypted HDFS), reached over SSH; ephemeral web tier (web servers, encrypted result cache); RDS holding Qubole user and account configurations (encrypted credentials); Amazon S3 with S3 Server Side Encryption; default Hive metastore; (optional) custom Hive metastore; (optional) data flow within the customer’s AWS to other stores such as RDS and Redshift. Encryption options: a) Qubole can encrypt the result cache; b) Qubole supports encryption of the ephemeral drives used for HDFS; c) Qubole supports S3 Server Side Encryption.
  • 15. QDS Platform Features. Auto-scaling, self-managed Hadoop clusters in the cloud: including Amazon EC2, Rackspace, Google Compute & OpenStack. The fastest Hadoop running on the cloud: numerous optimizations that provide 4 to 8 times faster performance than Amazon Elastic MapReduce (EMR). Pre-built connectors: traditional RDBMS, MongoDB and other NoSQL solutions; incremental data scrapes. Job scheduler: dependencies, workflows, incremental jobs. Multi-platform support: supports AWS, Google & Azure credentials.
  • 16. QDS Capabilities. Mix and match Reserved & Spot instances to reduce the cost of compute hours on the cloud. Perform data exploration and analysis on raw multi-structured data formats. Integration with data visualization & BI tools via ODBC: Tableau Software, Pentaho, Excel. All functionality is also available through APIs and toolkits.
  • 17. Faster split computations on S3. To solve this problem, we modified split computation to invoke listing at the level of the parent directory. This call returns all files (and their sizes) in all subdirectories in blocks of 1000. Some subdirectories and files may not be of interest to the job/query, e.g. partition elimination may have eliminated some of them. We take advantage of the fact that the file listing is in lexicographic order and perform a modified merge join of the list of files and the list of directories of interest. This allows us to efficiently identify the sizes of the files of interest. The modified algorithm results in only 106 API calls (each call returns 1000 files) compared to 17,520 API calls in the original implementation. We compared the two approaches using a simple Hive test. In this test, we take a partitioned table T with 15,000 files but vary the number of partitions (a partition corresponds to a directory). We compare the performance of ‘select count(*) from T’. In the extreme case, this optimization shows a speedup of 8x!
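The merge join on slide 17 can be sketched roughly as follows. This is a minimal illustration, not Qubole's implementation: `sizes_for_dirs` and `pages_needed` are hypothetical names, the listing is assumed to arrive as `(key, size)` pairs already in lexicographic order, and each directory of interest is assumed to be given as a key prefix ending in `/`.

```python
def pages_needed(total_files, page_size=1000):
    """Number of listing calls when each call returns up to page_size keys."""
    return -(-total_files // page_size)  # ceiling division


def sizes_for_dirs(listing, wanted_dirs):
    """Merge a sorted flat listing against sorted directory prefixes.

    listing: iterable of (key, size) pairs in lexicographic key order.
    wanted_dirs: directory prefixes of interest (assumed to end in '/').
    Returns {key: size} for every key under one of the wanted prefixes.
    """
    dirs = sorted(wanted_dirs)
    result = {}
    i = 0
    for key, size in listing:
        # Keys under a prefix are contiguous in sorted order, so once a key
        # sorts after a prefix without matching it, that prefix is finished.
        while i < len(dirs) and dirs[i] < key and not key.startswith(dirs[i]):
            i += 1
        if i == len(dirs):
            break  # no directories of interest remain
        if key.startswith(dirs[i]):
            result[key] = size
    return result


listing = [
    ("t/h=00/f1", 10),
    ("t/h=01/f1", 20),
    ("t/h=01/f2", 30),
    ("t/h=02/f1", 40),
]
selected = sizes_for_dirs(listing, ["t/h=02/", "t/h=00/"])
```

With the deck's numbers, two years of hourly directories is 2 x 365 x 24 = 17,520 directories, and 105,120 files at 1000 per page gives `pages_needed(105120)` listing calls instead of one call per directory.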
  • 18. Faster reads from S3. Opening a file takes a significant amount of time, at least 50 milliseconds per file. This problem becomes pronounced when the input dataset has lots of small files and file open latency forms a significant portion of overall execution time. To alleviate this problem, we included an optimization wherein we open an S3 file in a background thread a little while before it is actually required by the map task. This hides the file open latency. One thing to be aware of is that if an S3 file is opened, but not read from for a while, S3 returns a RequestTimeout and potentially penalizes the caller. We tested this optimization with a simple Hive test. Our dataset consisted of 80,000 files, each of size 640 KB. We noticed an improvement of 30% in a count(*) query as a result of this optimization.
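The background-open idea on slide 18 can be illustrated with a one-file lookahead. This is a sketch under assumptions, not the actual Hadoop-side code: `read_all`, `open_fn` and `read_fn` are hypothetical names, and the lookahead is deliberately limited to a single file so that an opened handle does not sit idle for long, in line with the RequestTimeout caveat above.

```python
from concurrent.futures import ThreadPoolExecutor


def read_all(paths, open_fn, read_fn):
    """Yield read_fn(handle) for each path in order.

    While the current file is being read, the next one is opened in a
    background thread, hiding its open latency.
    """
    paths = list(paths)
    if not paths:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(open_fn, paths[0])
        for nxt in paths[1:] + [None]:
            handle = pending.result()  # blocks only if the open is unfinished
            if nxt is not None:
                pending = pool.submit(open_fn, nxt)  # start the next open now
            yield read_fn(handle)


# Stand-ins for a real S3 open and read, for illustration only.
results = list(read_all(["a", "b", "c"], lambda p: p.upper(), lambda h: h + "!"))
```

In a real reader, `open_fn` would issue the S3 GET and `read_fn` would consume and close the stream.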
  • 19. Attaching to EBS Volumes. Storage on EC2 instances, as instance type: instance storage (GB). c1.xlarge: 1680 [4 x 420]; c3.2xlarge: 160 [2 x 80 SSD]; c3.4xlarge: 320 [2 x 160 SSD]; c3.8xlarge: 640 [2 x 320 SSD]. The amount of storage per instance might not be sufficient for running Hadoop clusters with high volumes of data.
  • 20. Attaching to EBS Volumes • AWS offers raw block devices called EBS volumes which can be attached to EC2 instances. • It is possible to attach multiple EBS volumes with a size of up to 1 TB per volume. This can easily compensate for the low instance storage available on the new generation instances. Also, using EBS volumes for storage costs much less than adding cheaper instances with more storage capacity.
  • 21. Attaching to EBS Volumes • Configurable Reserved Volumes • On the Qubole platform, using new generation instances, users have the option to use reserved volumes if data requirements exceed the local storage available in the cluster. • AWS EBS volumes come in various flavors, e.g. magnetic or SSD-backed. Users can select the size and type of EBS volumes based on their data and performance requirements. Diagram labels: HDFS, MapReduce, Disk Access, SSD, Reserved Disk.
  • 22. Protection Against Bad Jobs. A single Hadoop cluster is usually shared across many users. It is a common occurrence that a user issues a bad job which degrades the performance of the entire cluster. Running out of disk: - A single mapper emitting too much output. - A map/reduce job with a lot of mappers emitting too much map data. - Reducer tasks copying a lot of map output data during the shuffle phase.
  • 23. Protection Against Bad Jobs. Qubole’s Hadoop distribution protects clusters against such jobs. Its clusters periodically monitor the jobs and kill any job that may be affecting the entire cluster. Kill a job when … - The total map output of the job is beyond a configurable value - Any task produces more map output than a set disk percent - The job produces a lot of logs (configurable value) - Reducers read a lot of map data (configurable value)
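The four kill conditions on slide 23 amount to a threshold check run periodically against each job. The sketch below is purely illustrative: the `Limits` and `JobStats` fields and the `kill_reason` function are hypothetical names, not Qubole configuration keys, and the thresholds are made-up numbers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Limits:
    # All names and units here are assumptions for illustration.
    max_total_map_output_bytes: int
    max_task_disk_fraction: float  # share of a node's disk one task may fill
    max_log_bytes: int
    max_reducer_input_bytes: int


@dataclass
class JobStats:
    total_map_output_bytes: int
    largest_task_output_bytes: int
    node_disk_bytes: int
    log_bytes: int
    reducer_input_bytes: int


def kill_reason(stats: JobStats, limits: Limits) -> Optional[str]:
    """Return why the job should be killed, or None if it is within limits."""
    if stats.total_map_output_bytes > limits.max_total_map_output_bytes:
        return "total map output over limit"
    task_cap = limits.max_task_disk_fraction * stats.node_disk_bytes
    if stats.largest_task_output_bytes > task_cap:
        return "single task map output over disk percent"
    if stats.log_bytes > limits.max_log_bytes:
        return "log volume over limit"
    if stats.reducer_input_bytes > limits.max_reducer_input_bytes:
        return "reducer input over limit"
    return None


limits = Limits(100, 0.5, 10, 50)
verdict = kill_reason(JobStats(10, 60, 100, 1, 1), limits)
```

A monitor would call `kill_reason` for each running job on every polling interval and kill the job whenever a reason is returned.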
  • 24. Direct output commit to S3. The default S3 code path writes to a temp directory and then moves the temp directory to its final location. A move on S3 is really a copy followed by a delete. Instead, write directly into the target directory.
  • 25. Direct output commit to S3 - Hive. Changed the naming scheme for the files Hive creates. • Instead of 00000 we use names like UUID_000000, where all files generated by a single insert into use the same prefix. • This guarantees that a new insert into will not stomp on data produced by an earlier query. To support insert overwrite with dynamic partitions, the tasks that write into the directory must delete any existing files. Before the insert overwrite begins, we generate a UUID to use for this statement. All the mappers/reducers, when deleting from a directory, will delete all files that don't begin with this UUID.
  • 26. Direct output commit to S3 - MapReduce. By default the MR code also writes to a temp location and then moves to the final location. The moves are done by listing the temp location and moving all the files there. • To avoid this, we track expected file counts for a couple of file formats. • We provide direct committers which avoid the move completely.
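The naming scheme on slide 25 can be sketched as below. The helper names are hypothetical and the exact file-name layout Hive uses is not confirmed by the slide beyond the `UUID_000000` pattern; the point is only that one prefix per statement lets tasks write straight into the target directory and, for insert overwrite, tell their own output apart from older files.

```python
import uuid


def new_statement_prefix():
    """One prefix per insert statement, shared by all of its tasks."""
    return uuid.uuid4().hex


def output_name(prefix, task_index):
    """File name in the UUID_000000 style described on the slide."""
    return f"{prefix}_{task_index:06d}"


def files_to_delete_for_overwrite(existing_names, prefix):
    """For insert overwrite: everything not written by this statement."""
    return [n for n in existing_names if not n.startswith(prefix + "_")]


prefix = new_statement_prefix()
mine = [output_name(prefix, i) for i in range(2)]
stale = files_to_delete_for_overwrite(mine + ["old_000000", "000000_0"], prefix)
```

Because every task of the statement shares the prefix, a task can safely run the delete step even if other tasks of the same statement have already written their files.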
  • 27. Spot Instances, Placement Policy, and Fallback to on-demand. Spot Instances: Spot Instances allow users to bid on unused Amazon EC2 capacity and run those instances for as long as their bid exceeds the Spot Price. QDS enables you to realize cost savings of as much as 50% to 60% by supporting the Spot Instance pricing model in addition to the Reserved Instance pricing. Use Qubole Placement Policy: When using spot instances for slaves, this ensures that at least one replica of each HDFS block is placed on stable instances. It is recommended to keep this enabled when using spot instances. Fallback to on-demand: When upscaling the cluster, sometimes we may not be able to procure Spot Instances because of low availability or high price. This option specifies that autoscaling should then fall back to procuring On-Demand instances. This will increase the cost of running the cluster, but ensures that the processing completes relatively quickly. Enable this if command processing time is important to you.
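The fallback behaviour on slide 27 reduces to simple arithmetic during an upscale. The function below is a hypothetical sketch, not a QDS API: it assumes the autoscaler already knows how many spot nodes were actually granted and only decides how to cover the shortfall.

```python
def plan_upscale(nodes_needed, spot_granted, fallback_to_on_demand):
    """Split an upscale request between spot and on-demand nodes.

    spot_granted is how many spot nodes could actually be procured
    (it may be low because of availability or price).
    """
    spot = min(max(spot_granted, 0), nodes_needed)
    shortfall = nodes_needed - spot
    on_demand = shortfall if fallback_to_on_demand else 0
    return {
        "spot": spot,
        "on_demand": on_demand,
        "unfilled": shortfall - on_demand,  # capacity the cluster goes without
    }


with_fallback = plan_upscale(10, 4, True)
without_fallback = plan_upscale(10, 4, False)
```

With fallback enabled the cluster reaches the requested size at higher cost; without it, the unfilled nodes translate into longer processing time.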
  • 29. Platform overview diagram. Users: Operations Analyst, Marketing Ops Analyst, Data Architects, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers. Layers: Developer Tools, Service Management, Data Workbench, Cloud Data Platform, BI & DW Systems. Capabilities: • SDK • API • Analysis • Security • Job Scheduler • Data Governance • Analytics templates • Monitoring • Support • Collaboration • Workflow & Map/Reduce • Auto Scaling • Cloud Optimization • Data Connectors • YARN • Spark & Pig • Presto & Hive
  • 30. Features. S3 Caching: S3 caching utilizes resources more efficiently and brings up clusters faster, up to 10x faster on some client instances. Variable Spot Instance Pricing: QDS allows you to vary the number of spot vs. on-demand nodes, providing the benefits of spot pricing (up to 90% less expensive) with the certainty of getting your job done. Searchable Queries and Log Files: QDS shows you all of your jobs, allowing you to compare the efficiency of queries and avoid having to re-create queries from scratch. Built-in Job Tracker: QDS provides a job tracker accessed directly through the UI, allowing you to identify resources, nodes, and tasks. Security: QDS has a tight security environment, with the ability to encrypt data at rest on nodes.
  • 31. Use Cases and Additional Information
  • 32. Why Qubole? “Qubole has enabled more users within Pinterest to get to the data and has made the data platform lot more scalable and stable” Mohammad Shahangian - Lead, Data Science and Infrastructure. Moved to Qubole from Amazon EMR because of stability, and rapidly expanded big data usage by giving access to data to users beyond developers. Rapid expansion of big data beyond developers (240 users out of a 600-person company). Panels: Use Cases; User and Query Growth. Rapid expansion in use cases, ranging from ETL and search to ad hoc querying and product analytics. Rock-solid infrastructure sees 50% fewer failures compared to Amazon Elastic MapReduce. Enterprise-scale processing and data access.
  • 33. Why Qubole? “We needed something that was reliable and easy to learn, setup, use and put into production without the risk and high expectations that comes with committing millions of dollars in upfront investment. Qubole was that thing.” Marc Rosen - Sr. Director, Data Analytics. Moved to big data on the cloud (from internal Oracle clusters) because getting to analysis was much quicker than operating infrastructure themselves. Used to answer client queries and power client dashboards. [Chart: # Commands Per Month; axis: number of queries, 0 to 5000.] Use Cases: Segment audiences based on their behavior, including such topics as user pathway and multi-dimensional recency analysis. Build customer profiles (both uni/multivariate) across thousands of first party (i.e., client CRM files) and third party (i.e., demographic) segments. Simplify attribution insights showing the effects of upper funnel prospecting on lower funnel remarketing media strategies.
  • 34. Data Infrastructure diagram. Users: Operations Analyst, Marketing Ops Analyst, Data Architect, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers. Links for more information: http://www.datacenterknowledge.com/archives/2015/04/02/hybrid-clouds-need-for-speed/ http://engineering.pinterest.com/post/92742371919/powering-big-data-at-pinterest http://www.itbusinessedge.com/slideshows/six-details-your-big-data-provider-wont-tell-you.html http://www.marketwired.com/press-release/qubole-reports-rapid-adoption-of-its-self-service-big-data-analytics-platform-1990272.htm