Optimizing Big Data to run in the Public Cloud
April 23, 2015
NYC Hadoop Meetup
A little bit about Qubole
Ashish Thusoo
Founder & CEO
Joydeep Sen Sarma
Founder & CTO
Founded in 2011 by the pioneers of “big data” @
Facebook and the creators of the Apache Hive Project
Based in Mountain View, CA with offices in Bangalore,
India. Investments by Charles River, LightSpeed, Norwest
Ventures.
World class product and engineering team from:
Team
Qubole QDS
Qubole works in:
• Adtech
• Media & Entertainment
• Healthcare
• Retail
• eCommerce
Qubole works best when:
• Born in Cloud
• Commitment to Public Cloud
• Data Driven
• Large scale data
• Lack Hadoop Skills
• Analysts & scientists need access
Standard Hadoop (on premises)
- JobTrackers
- TaskTrackers
- NameNodes
- DataNodes
= Datacenter, servers, VMs, wires…
How about adding more capacity? Dev/Test environments?
Non-Technical Users? Version upgrades?
Hadoop in the Cloud
= Someone else’s datacenter, servers, VMs, wires…
Designed for capacity scaling (and reduction)
Designed for multiple Dev/Test environments
Potential UI for Non-Technical Users
Potential support for version upgrades
Ad hoc queries can spin up a cluster on-demand exactly when needed
= Cost reduction, self-service, custom configurable clusters
*Security (important and production ready, but we won’t focus on it here)
Great, what about S3?
(so what?)
Data stored in HDFS requires the nodes (EC2 instances) to be kept
running continuously. This can be expensive.
Data stored in S3 is remote from the compute nodes.
S3 performs well in general, but it’s not uncommon to see
significant variance in performance.
Split Computation and File I/O
Multiple map tasks are instantiated and each of these is assigned a split.
Hadoop needs to know the size of input files so that they can be grouped
into equal-sized splits.
Input files are spread across many directories.
For example, two years of data, organized into hourly directories, results
in 17520 directories. If each directory contains 6 files, this makes a grand
total of 105,120 files.
Map-Reduce calls the generic Hadoop file listing API against each input
directory to get the size of all files in the directory.
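The counts in this example follow from simple arithmetic. A minimal sketch, assuming 365-day years (no leap day) and one listing call per input directory as described above; the variable names are illustrative only:

```python
# Two years of hourly directories, assuming 365-day years (no leap day).
YEARS = 2
DAYS_PER_YEAR = 365
HOURS_PER_DAY = 24
FILES_PER_DIRECTORY = 6

# One directory per hour of data.
directories = YEARS * DAYS_PER_YEAR * HOURS_PER_DAY
# Every directory holds the same number of files in this example.
files = directories * FILES_PER_DIRECTORY
# The generic listing API is called once per input directory.
listing_calls = directories

print(directories, files, listing_calls)
```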
Split Computation and File I/O
In our example, this results in 17520 API calls. This is not a big
deal in HDFS, but results in very bad performance in S3.
Every listing call in S3 involves a REST API call and parsing of
XML results, which has very high overhead and latency.
Furthermore, Amazon employs protection mechanisms against a high
rate of API calls. For certain workloads, split computation becomes
a huge bottleneck.
Data Consistency
From AWS FAQ:
What data consistency model does Amazon S3 employ?
Amazon S3 buckets in the US Standard region provide eventual
consistency. Amazon S3 buckets in all other regions provide read-after-
write consistency for PUTS of new objects and eventual consistency for
overwrite PUTS and DELETES.
Data Consistency
• Eventual Consistency
• Read-After-Write Consistency
• Why is Read-After-Write Consistency Useful?
Data Consistency
• Read-after-update consistency
• Read-after-delete
Data Consistency
Specify named S3 endpoints instead of US Standard.
For example, replace:
http://mybucket.s3.amazonaws.com/somekey.ext
with:
http://mybucket.s3-external-1.amazonaws.com/somekey.ext
http://docs.aws.amazon.com/redshift/latest/dg/managing-data-consistency.html
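The endpoint substitution above can be sketched as a small URL rewrite. This is an illustration only: the function name `use_named_endpoint` is hypothetical, and the sketch assumes virtual-hosted-style URLs of the form shown in the example.

```python
from urllib.parse import urlsplit, urlunsplit

# Host suffixes taken from the example URLs on this slide.
GLOBAL_SUFFIX = ".s3.amazonaws.com"
NAMED_SUFFIX = ".s3-external-1.amazonaws.com"


def use_named_endpoint(url: str) -> str:
    """Rewrite a bucket URL on the generic endpoint to the named endpoint.

    Hypothetical helper; URLs on any other host are returned unchanged.
    """
    parts = urlsplit(url)
    host = parts.netloc
    if host.endswith(GLOBAL_SUFFIX):
        host = host[: -len(GLOBAL_SUFFIX)] + NAMED_SUFFIX
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


print(use_named_endpoint("http://mybucket.s3.amazonaws.com/somekey.ext"))
# prints http://mybucket.s3-external-1.amazonaws.com/somekey.ext
```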
Qubole DataFlow Diagram
[Diagram; labels recovered from the slide]
• User Access: Qubole UI via Browser, SDK, ODBC
• REST API (HTTPS), SSH
• Qubole’s AWS Account
• Customer’s AWS Account
• Ephemeral Hadoop Clusters, Managed by Qubole: Master, Slaves with Encrypted HDFS
• Web Tier: Web Servers
• Encrypted Result Cache
• RDS: Qubole User, Account Configurations (Encrypted credentials)
• Amazon S3 w/S3 Server Side Encryption
• Default Hive Metastore
• (optional) Custom Hive Metastore
• (optional) Other RDS, Redshift
• Data Flow within Customer’s AWS
Encryption Options:
a) Qubole can encrypt the result cache
b) Qubole supports encryption of the ephemeral drives used for HDFS
c) Qubole supports S3 Server Side Encryption
QDS Platform Features
Auto-Scaling self managed Hadoop Clusters in Cloud
– Including Amazon EC2, Rackspace, Google Compute & OpenStack
The Fastest Hadoop running on the cloud
– Numerous Optimizations that provide 4 to 8 times faster performance than Amazon
Elastic MapReduce (EMR)
Pre-built connectors
– Traditional RDBMS, MongoDB and other NoSQL solutions
– Incremental Data Scrapes
Job Scheduler
– Dependencies, Workflows, Incremental Jobs
Multi-Platform Support
– Supports AWS, Google & Azure Credentials
QDS Capabilities
Mix and Match Reserved & Spot instances
– To reduce the cost of compute hours on the cloud
Perform data exploration and analysis on raw multi-structured
data formats.
Integration with data visualization & BI tools via ODBC
– Tableau Software, Pentaho, Excel
All functionality also available through APIs and toolkits
QDS Platform Features
Faster split computations on S3
To solve this problem, we modified split computation to invoke listing at the level of the parent directory.
This call returns all files (and their sizes) in all subdirectories in blocks of 1000.
Some subdirectories and files may not be of interest to the job/query, e.g. partition elimination may have
eliminated some of them.
We take advantage of the fact that the file listing is in lexicographic order and perform a modified merge join of
the list of files and the list of directories of interest. This allows us to efficiently identify the file sizes of the
interesting files.
The modified algorithm results in only 106 API calls (each call returns up to 1000 files) compared to 17520 API
calls in the original implementation.
We compared the two approaches using a simple Hive test. In this test, we take a partitioned table T with 15,000
files but vary the number of partitions (a partition corresponds to a directory). We compare the performance of
‘select count(*) from T’. In the extreme case, this optimization shows a speedup of 8x!
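The merge join described above can be sketched as follows. This is not Qubole's implementation; it is a minimal illustration that assumes the listing arrives as (key, size) pairs in lexicographic order, that directory prefixes end with "/" and do not nest, and that keys are ASCII so Python's string ordering matches the listing order. The names `select_files` and `parent_listing_calls` are hypothetical.

```python
import math
from typing import Iterable, Iterator, List, Tuple


def select_files(
    listing: Iterable[Tuple[str, int]], wanted_dirs: List[str]
) -> Iterator[Tuple[str, int]]:
    """Yield (key, size) for keys that fall under one of wanted_dirs.

    Both inputs are walked once, in sorted order, like a merge join.
    """
    dirs = sorted(wanted_dirs)
    i = 0
    for key, size in listing:
        # Skip directories that sort entirely before the current key.
        while i < len(dirs) and dirs[i] < key and not key.startswith(dirs[i]):
            i += 1
        if i == len(dirs):
            break  # No directories of interest remain.
        if key.startswith(dirs[i]):
            yield key, size


def parent_listing_calls(total_files: int, page_size: int = 1000) -> int:
    """Number of listing calls when each call returns up to page_size files."""
    return math.ceil(total_files / page_size)


listing = [
    ("t/dt=01/a", 10),
    ("t/dt=01/b", 20),
    ("t/dt=02/a", 30),
    ("t/dt=03/a", 40),
]
print(list(select_files(listing, ["t/dt=01/", "t/dt=03/"])))
print(parent_listing_calls(105120))
```

With the 105,120 files from the earlier example and pages of 1000, the second call reproduces the 106 listing calls quoted above.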
Faster reads from S3
Opening a file takes a significant amount of time: at least 50 milliseconds per file.
This problem becomes pronounced when the input dataset has lots of small files and file open latency
forms a significant portion of overall execution time.
To alleviate this problem, we included an optimization wherein we open an S3 file in a background thread
a little while before it is actually required by the map task. This hides the file open latency.
One thing to be aware of is that if an S3 file is opened but not read from for a while, S3 returns a
RequestTimeout and potentially penalizes the caller.
We tested this optimization with a simple Hive test. Our dataset consisted of 80000 files, each of size
640KB. We noticed an improvement of 30% in a count(*) query as a result of this optimization.
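The background open can be sketched with a single worker thread that opens the next file while the current one is being processed. This is a minimal illustration, not the actual implementation: `read_with_prefetch`, `open_fn`, and `process_fn` are hypothetical names, and the caller supplies the functions that open and consume a file. Opening only one file ahead is a deliberate choice here, so that an opened handle does not sit idle for long.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

H = TypeVar("H")
R = TypeVar("R")


def read_with_prefetch(
    paths: Sequence[str],
    open_fn: Callable[[str], H],
    process_fn: Callable[[H], R],
) -> List[R]:
    """Process files in order, opening the next one in the background."""
    results: List[R] = []
    if not paths:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(open_fn, paths[0])
        for next_path in list(paths[1:]) + [None]:
            handle = pending.result()  # Wait for the open started earlier.
            if next_path is not None:
                # Start opening the next file before processing this one.
                pending = pool.submit(open_fn, next_path)
            results.append(process_fn(handle))
    return results


# Stand-in functions: "opening" upper-cases the path, "processing" measures it.
print(read_with_prefetch(["a", "bb", "ccc"], str.upper, len))
```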
Attaching to EBS Volumes
Storage on EC2 Instances.
Instance Type    Instance Storage (GB)
c1.xlarge        1680 [4 x 420]
c3.2xlarge       160 [2 x 80 SSD]
c3.4xlarge       320 [2 x 160 SSD]
c3.8xlarge       640 [2 x 320 SSD]
The amount of storage per instance might not be
sufficient for running Hadoop clusters with high
volumes of data.
Attaching to EBS Volumes
• AWS offers raw block devices called EBS
volumes, which can be attached to EC2
instances.
• It is possible to attach multiple EBS volumes
with a size of up to 1 TB per volume. This can
easily compensate for the low instance
storage available on the new generation
instances. Also, using EBS volumes for
storage costs much less than adding
cheaper instances with more storage capacity.
Attaching to EBS Volumes
• Configurable Reserved Volumes
• On the Qubole platform, users of new generation
instances have the option to use
reserved volumes if the data requirement exceeds
the local storage available in the cluster.
• AWS EBS volumes come in various flavors,
e.g. magnetic or SSD backed. Users can select
the size and type of EBS volumes based on
their data and performance requirements.
[Diagram labels: SSD, Reserved Disk, Disk Access, HDFS, MapReduce]
Protection Against Bad Jobs
A single Hadoop cluster is usually shared across many users.
- It is common for one user to issue a bad job that degrades the
performance of the entire cluster.
Running out of Disk:
- A single mapper emitting too much output
- A map/reduce job with a lot of mappers emitting too much map data
- Reducer tasks copying a lot of map output data during the shuffle phase
Protection Against Bad Jobs
Qubole’s Hadoop distribution protects clusters against such
jobs. Its clusters periodically monitor the running jobs and kill any job that may be
affecting the entire cluster.
Kill job when …
- Total map output of a job exceeds a configurable value
- Any task produces more map output than a set percentage of disk
- A job produces a lot of logs (configurable value)
- Reducers read a lot of map data (configurable value)
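The kill conditions above can be sketched as a periodic check against configurable limits. Everything here is hypothetical: the class names, field names, and units are assumptions made for illustration and do not correspond to actual Qubole or Hadoop configuration keys.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobStats:
    """Counters a monitor might sample for one job (hypothetical fields)."""
    total_map_output_bytes: int
    max_task_map_output_disk_pct: float
    log_bytes: int
    reducer_input_bytes: int


@dataclass
class Limits:
    """Configurable thresholds (hypothetical names)."""
    max_total_map_output_bytes: int
    max_task_disk_pct: float
    max_log_bytes: int
    max_reducer_input_bytes: int


def kill_reasons(stats: JobStats, limits: Limits) -> List[str]:
    """Return every limit the job currently violates; empty means keep running."""
    reasons = []
    if stats.total_map_output_bytes > limits.max_total_map_output_bytes:
        reasons.append("total map output too large")
    if stats.max_task_map_output_disk_pct > limits.max_task_disk_pct:
        reasons.append("a task's map output uses too much disk")
    if stats.log_bytes > limits.max_log_bytes:
        reasons.append("too many logs")
    if stats.reducer_input_bytes > limits.max_reducer_input_bytes:
        reasons.append("reducers read too much map data")
    return reasons


limits = Limits(10**12, 50.0, 10**9, 10**12)
noisy_job = JobStats(2 * 10**12, 10.0, 5 * 10**9, 10**6)
print(kill_reasons(noisy_job, limits))
```

A monitor loop would call `kill_reasons` for each running job at a fixed interval and kill any job for which the list is non-empty.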
Direct output commit to S3
The default S3 code path writes to a temp directory and then moves
the temp directory into its final location.
A move on S3 is really a copy followed by a delete.
Instead, write directly into the target directory.
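The cost difference can be illustrated with a toy in-memory object store that counts operations. This is a model only, not the S3 API: `ToyObjectStore`, `commit_via_temp`, `commit_direct`, and the `_temp` path are all hypothetical names chosen for the sketch.

```python
from typing import Dict


class ToyObjectStore:
    """In-memory stand-in for an object store where a move is copy + delete."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}
        self.ops = {"put": 0, "copy": 0, "delete": 0}

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data
        self.ops["put"] += 1

    def move(self, src: str, dst: str) -> None:
        # There is no rename: copy the object, then delete the original.
        self.objects[dst] = self.objects[src]
        self.ops["copy"] += 1
        del self.objects[src]
        self.ops["delete"] += 1


def commit_via_temp(store: ToyObjectStore, target: str, files: Dict[str, bytes]) -> None:
    """Default path: write under a temp prefix, then move every file."""
    for name, data in files.items():
        store.put(f"{target}/_temp/{name}", data)
    for name in files:
        store.move(f"{target}/_temp/{name}", f"{target}/{name}")


def commit_direct(store: ToyObjectStore, target: str, files: Dict[str, bytes]) -> None:
    """Direct path: write straight into the target prefix."""
    for name, data in files.items():
        store.put(f"{target}/{name}", data)


files = {"part-0": b"a", "part-1": b"b", "part-2": b"c"}
via_temp, direct = ToyObjectStore(), ToyObjectStore()
commit_via_temp(via_temp, "out", files)
commit_direct(direct, "out", files)
print(via_temp.ops, direct.ops)
```

Both stores end up with the same objects, but the temp-then-move path performs an extra copy and delete for every file.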
Direct output commit to S3 - Hive
Changed the naming scheme for the files Hive creates.
• Instead of 00000 we use names like UUID_000000 where all files generated by a
single insert into use the same prefix.
• Guarantees that a new insert into will not stomp on data produced by an earlier query.
To support insert overwrite with dynamic partitions the tasks that write into the directory
must delete any existing files.
Before the insert overwrite begins we generate a UUID to use for this statement. All the
mappers/reducers when deleting from a directory will delete all files that don't begin with
this UUID.
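The naming scheme and the insert-overwrite cleanup above can be sketched as follows. The helper names and the six-digit zero padding are assumptions for illustration; only the idea of a per-statement UUID prefix comes from the slide.

```python
import uuid
from typing import Iterable, List


def new_statement_prefix() -> str:
    """Generate one prefix per insert statement (hypothetical helper)."""
    return uuid.uuid4().hex


def output_name(prefix: str, task_index: int) -> str:
    """File name for one task's output, e.g. <prefix>_000003."""
    return f"{prefix}_{task_index:06d}"


def files_to_delete(existing: Iterable[str], prefix: str) -> List[str]:
    """For insert overwrite: everything not written by this statement goes."""
    return [name for name in existing if not name.startswith(prefix)]


prefix = new_statement_prefix()
mine = [output_name(prefix, i) for i in range(2)]
existing = ["000000", "someolderprefix_000000"] + mine
print(files_to_delete(existing, prefix))
```

Because every task of the statement shares the prefix, tasks can clean a directory independently without deleting each other's output.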
Direct output commit to S3 - MapReduce
By default, the MR code also writes to a temp location and then moves the output to its final
location. The moves are done by listing the temp location and moving all the files there.
• To avoid this, we track expected file counts for a couple of file formats
• We provide direct committers, which avoid the move completely
Spot Instances, Placement Policy, and Fallback to on-demand
Spot Instances
Spot Instances allow users to bid on unused Amazon EC2 capacity and run those instances for as
long as their bid exceeds the Spot Price. QDS enables you to realize cost savings of as much as
50% to 60% by supporting the Spot Instance pricing model in addition to the Reserved Instance
pricing.
Use Qubole Placement Policy:
When using spot instances for slaves, this ensures that at least one replica of each HDFS block is
placed on Stable instances. It is recommended to keep this enabled when using spot instances.
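The guarantee described above can be sketched as a placement rule that always picks a stable node first. This is a toy illustration, not HDFS or Qubole placement logic (which also weighs factors such as load and locality); `place_replicas` and its parameters are hypothetical.

```python
from typing import List


def place_replicas(
    stable_nodes: List[str], spot_nodes: List[str], replication: int = 3
) -> List[str]:
    """Choose nodes for one block so at least one replica is on a stable node."""
    if not stable_nodes:
        raise ValueError("at least one stable node is required")
    chosen = [stable_nodes[0]]  # The guaranteed stable replica.
    # Fill the remaining replicas from spot nodes first, then other stable nodes.
    remaining = spot_nodes + stable_nodes[1:]
    chosen.extend(remaining[: replication - 1])
    return chosen


print(place_replicas(["stable-1", "stable-2"], ["spot-1", "spot-2", "spot-3"]))
```

If every spot node is reclaimed at once, each block placed this way still has a surviving replica.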
Fallback to on-demand:
When upscaling the cluster, sometimes we may not be able to procure Spot Instances because of
low availability or high price. This option specifies that autoscaling should then fall back to procuring
On-Demand instances. This will increase the cost of running the cluster, but ensures that the
processing completes relatively quickly. Enable this if command processing time is important to you.
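The fallback option can be sketched as a small planning function. The name `plan_upscale` and its parameters are hypothetical; the sketch only models how a shortfall in Spot capacity is handled with and without the option enabled.

```python
from typing import Dict


def plan_upscale(
    nodes_needed: int, spot_nodes_obtained: int, fallback_to_on_demand: bool
) -> Dict[str, int]:
    """Split an upscaling request between Spot and On-Demand nodes."""
    spot = min(nodes_needed, spot_nodes_obtained)
    shortfall = nodes_needed - spot
    # With fallback enabled, the shortfall is covered by On-Demand nodes.
    on_demand = shortfall if fallback_to_on_demand else 0
    return {"spot": spot, "on_demand": on_demand, "unfilled": shortfall - on_demand}


print(plan_upscale(10, 4, fallback_to_on_demand=True))
print(plan_upscale(10, 4, fallback_to_on_demand=False))
```

With fallback disabled, the cluster stays smaller than requested, which is cheaper but slows processing.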
DIY vs. Qubole
[Diagram; labels recovered from the slide]
Users: Operations Analyst, Marketing Ops Analyst, Data Architects, Business Users, Product Support,
Customer Support, Developer, Sales Ops, Product Managers
Layers: Developer Tools, Service Management, Data Workbench, Cloud Data Platform, BI & DW Systems
• SDK
• API
• Analysis
• Security
• Job Scheduler
• Data Governance
• Analytics templates
• Monitoring
• Support
• Collaboration
• Workflow & Map/Reduce
• Auto Scaling
• Cloud Optimization
• Data Connectors
• YARN
• Spark & Pig
• Presto & Hive
Features
S3 Caching: S3 caching utilizes resources more efficiently and
brings up clusters faster – up to 10x faster on some client instances.
Variable Spot Instance Pricing: QDS allows you
to vary the number of spot vs. on-demand nodes,
providing the benefits of spot pricing (up to 90%
less expensive) with the certainty of getting your
job done.
Searchable Queries and Log Files: QDS shows
you all of your jobs, allowing you to compare the
efficiency of queries and avoid having to re-create
queries from scratch.
Built-in Job Tracker: QDS provides a job tracker
accessed directly through the UI, allowing you to
identify resources, nodes, and tasks.
Security: QDS has a tight security environment,
with the ability to encrypt data at rest on nodes.
Use Cases and Additional Information
Why Qubole?
“Qubole has enabled more users within Pinterest to get to the
data and has made the data platform lot more scalable and
stable”
Mohammad Shahangian - Lead, Data Science and Infrastructure
Moved to Qubole from Amazon EMR because
of stability and rapidly expanded big data usage by
giving access to data to users beyond developers.
Rapid expansion of big data beyond developers (240 users
out of 600 person company)
Use Cases
[Chart: User and Query Growth]
Rapid expansion in use cases ranging from ETL, search,
adhoc querying, product analytics etc.
Rock solid infrastructure sees 50% fewer failures
compared to AWS Elastic MapReduce
Enterprise scale processing and data access
Why Qubole?
“We needed something that was reliable and easy to learn,
setup, use and put into production without the risk and high
expectations that comes with committing millions of dollars in
upfront investment. Qubole was that thing.”
Marc Rosen - Sr. Director, Data Analytics
Moved to Big data on the cloud (from internal Oracle
clusters) because getting to analysis was much
quicker than operating infrastructure themselves.
Used to answer client queries and power client
dashboards.
Use Cases
[Chart: # Commands Per Month; axis: Number of queries, 0 to 5000]
Segment audiences based on their behavior including
such topics as user pathway and multi-dimensional
recency analysis
Build customer profiles (both uni/multivariate) across
thousands of first party (i.e., client CRM files) and third
party (i.e., demographic) segments
Simplify attribution insights showing the effects of upper
funnel prospecting on lower funnel remarketing media
strategies
[Diagram labels: Operations Analyst, Marketing Ops Analyst, Data Architect, Business Users, Product Support,
Customer Support, Developer, Sales Ops, Product Managers; Data Infrastructure]
Links for more information
http://www.datacenterknowledge.com/archives/2015/04/02/hybrid-clouds-need-for-speed/
http://engineering.pinterest.com/post/92742371919/powering-big-data-at-pinterest
http://www.itbusinessedge.com/slideshows/six-details-your-big-data-provider-wont-tell-you.html
http://www.marketwired.com/press-release/qubole-reports-rapid-adoption-of-its-self-service-big-data-analytics-platform-1990272.htm

More Related Content

What's hot

Hadoop Ecosystem at a Glance
Hadoop Ecosystem at a GlanceHadoop Ecosystem at a Glance
Hadoop Ecosystem at a Glance
Neev Technologies
 
Hadoop in the Cloud: Common Architectural Patterns
Hadoop in the Cloud: Common Architectural PatternsHadoop in the Cloud: Common Architectural Patterns
Hadoop in the Cloud: Common Architectural Patterns
DataWorks Summit
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Cloudera, Inc.
 
Hadoop and HBase @eBay
Hadoop and HBase @eBayHadoop and HBase @eBay
Hadoop and HBase @eBay
DataWorks Summit
 
Ignite Your Big Data With a Spark!
Ignite Your Big Data With a Spark!Ignite Your Big Data With a Spark!
Ignite Your Big Data With a Spark!
Progress
 
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Cloudera, Inc.
 
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannAzure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
Databricks
 
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
✔ Eric David Benari, PMP
 
Qubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europeQubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europeJoydeep Sen Sarma
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
StampedeCon
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with Spark
Matt Ingenthron
 
MapR-DB Elasticsearch Integration
MapR-DB Elasticsearch IntegrationMapR-DB Elasticsearch Integration
MapR-DB Elasticsearch IntegrationMapR Technologies
 
Hadoop at Ebay
Hadoop at EbayHadoop at Ebay
Hadoop at Ebay
Aroop Maliakkal
 
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, BlazegraphDatabase Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
✔ Eric David Benari, PMP
 
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Spark Summit
 
Data Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringData Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data Engineering
Anant Corporation
 
Big Data and Hadoop - key drivers, ecosystem and use cases
Big Data and Hadoop - key drivers, ecosystem and use casesBig Data and Hadoop - key drivers, ecosystem and use cases
Big Data and Hadoop - key drivers, ecosystem and use cases
Jeff Kelly
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
Gert Drapers
 
Is Cloud a right Companion for Hadoop
Is Cloud a right Companion for HadoopIs Cloud a right Companion for Hadoop
Is Cloud a right Companion for HadoopDataWorks Summit
 
The Holy Grail of Data Analytics
The Holy Grail of Data AnalyticsThe Holy Grail of Data Analytics
The Holy Grail of Data Analytics
Dan Lynn
 

What's hot (20)

Hadoop Ecosystem at a Glance
Hadoop Ecosystem at a GlanceHadoop Ecosystem at a Glance
Hadoop Ecosystem at a Glance
 
Hadoop in the Cloud: Common Architectural Patterns
Hadoop in the Cloud: Common Architectural PatternsHadoop in the Cloud: Common Architectural Patterns
Hadoop in the Cloud: Common Architectural Patterns
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Hadoop and HBase @eBay
Hadoop and HBase @eBayHadoop and HBase @eBay
Hadoop and HBase @eBay
 
Ignite Your Big Data With a Spark!
Ignite Your Big Data With a Spark!Ignite Your Big Data With a Spark!
Ignite Your Big Data With a Spark!
 
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
 
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannAzure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
 
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
 
Qubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europeQubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europe
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with Spark
 
MapR-DB Elasticsearch Integration
MapR-DB Elasticsearch IntegrationMapR-DB Elasticsearch Integration
MapR-DB Elasticsearch Integration
 
Hadoop at Ebay
Hadoop at EbayHadoop at Ebay
Hadoop at Ebay
 
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, BlazegraphDatabase Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
 
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
 
Data Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringData Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data Engineering
 
Big Data and Hadoop - key drivers, ecosystem and use cases
Big Data and Hadoop - key drivers, ecosystem and use casesBig Data and Hadoop - key drivers, ecosystem and use cases
Big Data and Hadoop - key drivers, ecosystem and use cases
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
 
Is Cloud a right Companion for Hadoop
Is Cloud a right Companion for HadoopIs Cloud a right Companion for Hadoop
Is Cloud a right Companion for Hadoop
 
The Holy Grail of Data Analytics
The Holy Grail of Data AnalyticsThe Holy Grail of Data Analytics
The Holy Grail of Data Analytics
 

Similar to Optimizing Big Data to run in the Public Cloud

Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Precisely
 
Hadoop and object stores can we do it better
Hadoop and object stores  can we do it betterHadoop and object stores  can we do it better
Hadoop and object stores can we do it better
gvernik
 
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?
gvernik
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the Cloud
Shubham Tagra
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
Steve Staso
 
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE UNDER AZURE ...
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE   UNDER AZURE ...COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE   UNDER AZURE ...
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE UNDER AZURE ...
Megha Shah
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
javier ramirez
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Amazon Web Services
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
AmirReza Mohammadi
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Lucidworks
 
Escalando Aplicaciones Web
Escalando Aplicaciones WebEscalando Aplicaciones Web
Escalando Aplicaciones Web
Santiago Coffey
 
HDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the CloudHDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the Cloud
The HDF-EOS Tools and Information Center
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
Chris Purrington
 
Open Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and CloudOpen Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and Cloud
Alluxio, Inc.
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
Vinoth Chandar
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutes
Vladimir Simek
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
Antonio Silveira
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
Delhi/NCR HUG
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Alluxio, Inc.
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
Amazon Web Services
 

Similar to Optimizing Big Data to run in the Public Cloud (20)

Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
 
Hadoop and object stores can we do it better
Hadoop and object stores  can we do it betterHadoop and object stores  can we do it better
Hadoop and object stores can we do it better
 
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the Cloud
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE UNDER AZURE ...
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE   UNDER AZURE ...COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE   UNDER AZURE ...
COMPARING THE PERFORMANCE OF ETL PIPELINE USING SPARK AND HIVE UNDER AZURE ...
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
 
Escalando Aplicaciones Web
Escalando Aplicaciones WebEscalando Aplicaciones Web
Escalando Aplicaciones Web
 
HDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the CloudHDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the Cloud
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Open Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and CloudOpen Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and Cloud
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutes
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
 

Optimizing Big Data to run in the Public Cloud

  • 1. Optimizing Big Data to run in the Public Cloud April 23, 2015 NYC Hadoop Meetup
  • 2. A little bit about Qubole. Ashish Thusoo, Founder & CEO. Joydeep Sen Sarma, Founder & CTO. Founded in 2011 by the pioneers of “big data” at Facebook and the creators of the Apache Hive project. Based in Mountain View, CA, with offices in Bangalore, India. Investments by Charles River, LightSpeed, Norwest Ventures. World-class product and engineering team.
  • 3. Qubole QDS. Qubole works in: • Adtech • Media & Entertainment • Healthcare • Retail • eCommerce. Qubole works best when: • Born in Cloud • Commitment to Public Cloud • Data Driven • Large scale data • Lack Hadoop Skills • Analysts & scientists need access
  • 4. Standard Hadoop (on premises) - JobTrackers - TaskTrackers - NameNodes - DataNodes = Datacenter, servers, VMs, wires… How about adding more capacity? Dev/Test environments? Non-technical users? Version upgrades?
  • 5. Hadoop in the Cloud = Someone else’s datacenter, servers, VMs, wires… Designed for capacity scaling (and reduction). Designed for multiple Dev/Test environments. Potential UI for non-technical users. Potential support for version upgrades. Ad hoc queries can spin up a cluster on demand exactly when needed. = Cost reduction, self-service, custom configurable clusters. *Security (important and production ready, but we won’t focus on it here)
  • 6. Great, what about S3? (so what?)
  • 7. Data stored in HDFS requires the nodes (EC2 instances) to be kept running continuously. This can be expensive. Data stored in S3 is remote from the compute nodes. S3 performs well in general, but it’s not uncommon to see significant variance in performance.
  • 8. Split Computation and File I/O. Multiple map tasks are instantiated and each is assigned a split. Hadoop needs to know the size of the input files so that they can be grouped into equal-sized splits. Input files are spread across many directories. For example, two years of data organized into hourly directories results in 17,520 directories. If each directory contains 6 files, this makes a grand total of 105,120 files. MapReduce calls the generic Hadoop file listing API against each input directory to get the size of all files in the directory.
  • 9. Split Computation and File I/O. In our example, this results in 17,520 API calls. This is not a big deal on HDFS, but results in very bad performance on S3. Every listing call in S3 involves a REST API call and parsing of XML results, which has very high overhead and latency. Furthermore, Amazon employs protection mechanisms against high rates of API calls. For certain workloads, split computation becomes a huge bottleneck.
  • 10. Data Consistency. From the AWS FAQ: What data consistency model does Amazon S3 employ? Amazon S3 buckets in the US Standard region provide eventual consistency. Amazon S3 buckets in all other regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES.
  • 11. Data Consistency • Eventual Consistency • Read-After-Write Consistency • Why is Read-After-Write Consistency Useful?
  • 12. Data Consistency • Read-after-update consistency • Read-after-delete
  • 13. Data Consistency. Specify named S3 endpoints instead of US Standard. For example, replace: http://mybucket.s3.amazonaws.com/somekey.ext with: http://mybucket.s3-external-1.amazonaws.com/somekey.ext See http://docs.aws.amazon.com/redshift/latest/dg/managing-data-consistency.html
  • 14. Qubole DataFlow Diagram. [Diagram labels: User Access (Qubole UI via Browser, SDK, ODBC); REST API (HTTPS); SSH; Qubole’s AWS Account; Customer’s AWS Account; Ephemeral Hadoop Clusters, Managed by Qubole (Master, Slaves, Encrypted HDFS); Data Flow within Customer’s AWS; (optional) Other RDS, Redshift; Ephemeral Web Tier (Web Servers); Encrypted Result Cache; RDS: Qubole User, Account Configurations (Encrypted credentials); Amazon S3 w/ S3 Server Side Encryption; Default Hive Metastore; (optional) Custom Hive Metastore] Encryption options: a) Qubole can encrypt the result cache b) Qubole supports encryption of the ephemeral drives used for HDFS c) Qubole supports S3 Server Side Encryption
  • 15. QDS Platform Features • Auto-scaling, self-managed Hadoop clusters in the cloud: including Amazon EC2, Rackspace, Google Compute & OpenStack • The fastest Hadoop running on the cloud: numerous optimizations that provide 4 to 8 times faster performance than Amazon Elastic MapReduce (EMR) • Pre-built connectors: traditional RDBMS, MongoDB and other NoSQL solutions; incremental data scrapes • Job scheduler: dependencies, workflows, incremental jobs • Multi-platform support: supports AWS, Google & Azure credentials
  • 16. QDS Capabilities • Mix and match Reserved & Spot instances: to reduce the cost of compute hours on the cloud • Perform data exploration and analysis on raw multi-structured data formats • Integration with data visualization & BI tools via ODBC: Tableau Software, Pentaho, Excel • All functionality also available through APIs and toolkits
  • 17. Faster split computations on S3. To solve this problem, we modified split computation to invoke listing at the level of the parent directory. This call returns all files (and their sizes) in all subdirectories, in blocks of 1000. Some subdirectories and files may not be of interest to the job/query, e.g. partition elimination may have eliminated some of them. We take advantage of the fact that the file listing is in lexicographic order and perform a modified merge join of the list of files and the list of directories of interest. This allows us to efficiently identify the sizes of the files of interest. The modified algorithm results in only 106 API calls (each call returns 1000 files) compared to 17,520 API calls in the original implementation. We compared the two approaches using a simple Hive test. In this test, we take a partitioned table T with 15,000 files but vary the number of partitions (a partition corresponds to a directory). We compare the performance of ‘select count(*) from T’. In the extreme case, this optimization shows a speedup of 8x!
  • 18. Faster reads from S3. Opening a file takes a significant amount of time: at least 50 milliseconds per file. This problem becomes pronounced when the input dataset has lots of small files and file open latency forms a significant portion of overall execution time. To alleviate this problem, we included an optimization wherein we open an S3 file in a background thread a little while before it is actually required by the map task. This hides the file open latency. One thing to be aware of is that if an S3 file is opened but not read from for a while, S3 returns a RequestTimeout and potentially penalizes the caller. We tested this optimization with a simple Hive test. Our dataset consisted of 80,000 files, each of size 640 KB. We noticed an improvement of 30% in a count(*) query as a result of this optimization.
  • 19. Attaching to EBS Volumes. Storage on EC2 instances, listed as instance type: instance storage (GB). c1.xlarge: 1680 [4 x 420]; c3.2xlarge: 160 [2 x 80 SSD]; c3.4xlarge: 320 [2 x 160 SSD]; c3.8xlarge: 640 [2 x 320 SSD]. The amount of storage per instance might not be sufficient for running Hadoop clusters with high volumes of data.
  • 20. Attaching to EBS Volumes • AWS offers raw block devices called EBS volumes, which can be attached to EC2 instances. • It is possible to attach multiple EBS volumes, each up to 1 TB in size. This can easily compensate for the low instance storage available on the new generation instances. Also, using EBS volumes for storage costs much less than adding cheaper instances with more storage capacity.
  • 21. Attaching to EBS Volumes • Configurable Reserved Volumes • On the Qubole platform, when using new generation instances, users have the option to use reserved volumes if the data requirement exceeds the local storage available in the cluster. • AWS EBS volumes come in various flavors, e.g. magnetic or SSD-backed. Users can select the size and type of EBS volumes based on their data and performance requirements. [Diagram labels: SSD, Reserved Disk, Disk Access, HDFS, MapReduce]
  • 22. Protection Against Bad Jobs. A single Hadoop cluster is usually shared across many users. It is a common occurrence that a user issues a bad job that degrades the performance of the entire cluster. Running out of disk: - A single mapper producing too much output - A MapReduce job with a lot of mappers producing too much map output - Reducer tasks copying a lot of map output data during the shuffle phase
  • 23. Protection Against Bad Jobs. Qubole’s Hadoop distribution protects clusters against such jobs. Each cluster periodically monitors its jobs and kills any job that may be affecting the entire cluster. Kill a job when: - The total map output of the job exceeds a configurable value - Any task produces more map output than a set percentage of disk - The job produces a lot of logs (configurable value) - Reducers read a lot of map data (configurable value)
  • 24. Direct output commit to S3. The default S3 code path involves writing to a temp directory and then moving the temp directory to its final location. A move on S3 is really a copy followed by a delete. Instead, write directly into the target directory.
  • 25. Direct output commit to S3 - Hive. Changed the naming scheme for the files Hive creates. • Instead of 00000 we use names like UUID_000000, where all files generated by a single insert into use the same prefix. • This guarantees that a new insert into will not stomp on data produced by an earlier query. To support insert overwrite with dynamic partitions, the tasks that write into the directory must delete any existing files. Before the insert overwrite begins, we generate a UUID to use for this statement. All the mappers/reducers, when deleting from a directory, delete all files that don't begin with this UUID.
  • 26. Direct output commit to S3 - MapReduce. By default the MR code also writes to a temp location and then moves to the final location. The moves are done by listing the temp location and moving all the files there. • To avoid this, we track expected file counts for a couple of file formats • We provide direct committers, which avoid the move completely
  • 27. Spot Instances, Placement Policy, and Fallback to on-demand. Spot Instances: Spot Instances allow users to bid on unused Amazon EC2 capacity and run those instances for as long as their bid exceeds the Spot Price. QDS enables you to realize cost savings of as much as 50% to 60% by supporting the Spot Instance pricing model in addition to Reserved Instance pricing. Use Qubole Placement Policy: when using Spot Instances for slaves, this ensures that at least one replica of each HDFS block is placed on stable instances. It is recommended to keep this enabled when using Spot Instances. Fallback to on-demand: when upscaling the cluster, we may sometimes be unable to procure Spot Instances because of low availability or high price. This option specifies that autoscaling should then fall back to procuring On-Demand instances. This increases the cost of running the cluster, but ensures that processing completes relatively quickly. Enable this if command processing time is important to you.
  • 29. [Platform diagram] User roles: Operations Analyst Marketing Ops Analyst Data Architects Business Users Product Support Customer Support Developer Sales Ops Product Managers. Layers: Developer Tools, Service Management, Data Workbench, Cloud Data Platform, BI & DW Systems. Components: • SDK • API • Analysis • Security • Job Scheduler • Data Governance • Analytics templates • Monitoring • Support • Collaboration • Workflow & Map/Reduce • Auto Scaling • Cloud Optimization • Data Connectors • YARN • Spark & Pig • Presto & Hive
  • 30. Features. S3 Caching: S3 caching utilizes resources more efficiently and brings up clusters faster, up to 10x faster on some client instances. Variable Spot Instance Pricing: QDS allows you to vary the number of spot vs. on-demand nodes, providing the benefits of spot pricing (up to 90% less expensive) with the certainty of getting your job done. Searchable Queries and Log Files: QDS shows you all of your jobs, allowing you to compare the efficiency of queries and avoid having to re-create queries from scratch. Built-in Job Tracker: QDS provides a job tracker accessed directly through the UI, allowing you to identify resources, nodes, and tasks. Security: QDS has a tight security environment, with the ability to encrypt data at rest on nodes.
  • 31. Use Cases and Additional Information
  • 32. Why Qubole? “Qubole has enabled more users within Pinterest to get to the data and has made the data platform lot more scalable and stable” Mohammad Shahangian - Lead, Data Science and Infrastructure. Moved to Qubole from Amazon EMR because of stability, and rapidly expanded big data usage by giving access to data to users beyond developers. Rapid expansion of big data beyond developers (240 users out of a 600-person company). [Panel titles: Use Cases; User and Query Growth] Rapid expansion in use cases ranging from ETL, search, ad hoc querying, product analytics, etc. Rock solid infrastructure sees 50% fewer failures compared to AWS Elastic MapReduce. Enterprise-scale processing and data access.
  • 33. Why Qubole? “We needed something that was reliable and easy to learn, setup, use and put into production without the risk and high expectations that comes with committing millions of dollars in upfront investment. Qubole was that thing.” Marc Rosen - Sr. Director, Data Analytics. Moved to big data on the cloud (from internal Oracle clusters) because getting to analysis was much quicker than operating infrastructure themselves. Used to answer client queries and power client dashboards. [Chart: # Commands Per Month; axis: Number of queries, 0 to 5000] Use Cases: • Segment audiences based on their behavior, including such topics as user pathway and multi-dimensional recency analysis • Build customer profiles (both uni/multivariate) across thousands of first party (i.e., client CRM files) and third party (i.e., demographic) segments • Simplify attribution insights showing the effects of upper funnel prospecting on lower funnel remarketing media strategies
  • 34. [Diagram labels: Operations Analyst Marketing Ops Analyst Data Architect Business Users Product Support Customer Support Developer Sales Ops Product Managers; Data Infrastructure] Links for more information: http://www.datacenterknowledge.com/archives/2015/04/02/hybrid-clouds-need-for-speed/ http://engineering.pinterest.com/post/92742371919/powering-big-data-at-pinterest http://www.itbusinessedge.com/slideshows/six-details-your-big-data-provider-wont-tell-you.html http://www.marketwired.com/press-release/qubole-reports-rapid-adoption-of-its-self-service-big-data-analytics-platform-1990272.htm
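
The arithmetic and the merge join behind slides 8, 9 and 17 can be sketched roughly as follows. This is a minimal illustration, not Qubole's implementation: the function names `listing_calls` and `files_of_interest` are hypothetical, and the listing is assumed to arrive as sorted `(key, size)` pairs, with each directory of interest written as a prefix ending in `/`.

```python
import math
from typing import Iterable, Iterator, List, Tuple

PAGE_SIZE = 1000  # files returned per listing call, as stated on slide 17


def listing_calls(num_files: int, page_size: int = PAGE_SIZE) -> int:
    """Number of paged listing calls needed to enumerate num_files keys."""
    return math.ceil(num_files / page_size)


def files_of_interest(
    listing: Iterable[Tuple[str, int]], wanted_dirs: List[str]
) -> Iterator[Tuple[str, int]]:
    """Merge a sorted (key, size) listing with sorted directory prefixes.

    Both inputs must be in lexicographic order and each directory must end
    with "/". Every key and every directory is visited at most once.
    """
    dirs = iter(wanted_dirs)
    current = next(dirs, None)
    for key, size in listing:
        # Drop directories that sort entirely before this key.
        while current is not None and current < key and not key.startswith(current):
            current = next(dirs, None)
        if current is None:
            return  # no directories left, so no later key can match
        if key.startswith(current):
            yield key, size


hourly_dirs = 2 * 365 * 24      # two years of hourly directories
total_files = hourly_dirs * 6   # six files per directory
print(hourly_dirs, total_files, listing_calls(total_files))  # 17520 105120 106
```

Because both lists are sorted, the join is a single pass over each, which is what keeps the parent-level listing cheap even when partition elimination has discarded most directories.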
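
The background file open described on slide 18 might look something like the sketch below. It is an assumption-laden illustration rather than the actual Hadoop change: `prefetch_open` and its `open_fn` and `lookahead` parameters are hypothetical names, and `open_fn` stands in for whatever call opens an S3 object.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator, TypeVar

H = TypeVar("H")
_END = object()  # sentinel marking the end of the path iterator


def prefetch_open(
    paths: Iterable[str], open_fn: Callable[[str], H], lookahead: int = 1
) -> Iterator[H]:
    """Yield opened handles in order, opening the next ones in the background."""
    lookahead = max(1, lookahead)
    remaining = iter(paths)
    pending = deque()
    with ThreadPoolExecutor(max_workers=lookahead) as pool:
        # Start opening the first few files before anything is consumed.
        for path in islice(remaining, lookahead):
            pending.append(pool.submit(open_fn, path))
        while pending:
            handle = pending.popleft().result()
            nxt = next(remaining, _END)
            if nxt is not _END:
                # Kick off the next open before handing the current file
                # to the caller, so the open overlaps with the read.
                pending.append(pool.submit(open_fn, nxt))
            yield handle
```

A small `lookahead` is deliberate here: the slide warns that a file opened long before it is read can trigger a RequestTimeout, so opening only one or two files ahead seems the safer default.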
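
The UUID-prefixed naming scheme from slide 25 can be illustrated with a short sketch. The helper names below are hypothetical and the six-digit task suffix is inferred from the `UUID_000000` example on the slide; the real Hive change may differ in detail.

```python
import uuid
from typing import Iterable, List


def new_statement_prefix() -> str:
    """One prefix per INSERT statement, shared by all of its tasks."""
    return uuid.uuid4().hex


def output_name(prefix: str, task_index: int) -> str:
    """File name written directly into the target directory by one task."""
    return f"{prefix}_{task_index:06d}"


def files_to_delete(existing: Iterable[str], prefix: str) -> List[str]:
    """For insert overwrite: everything not written by the current statement."""
    return [name for name in existing if not name.startswith(prefix + "_")]
```

For example, with a prefix of `abc`, task 3 writes `abc_000003`, and an insert overwrite that finds `old_000000` and `abc_000001` in the directory deletes only `old_000000`.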