Optimizing Big Data to run in the Public Cloud
1. Optimizing Big Data to run in the Public Cloud
April 23, 2015
NYC Hadoop Meetup
2. A little bit about Qubole
Ashish Thusoo
Founder & CEO
Joydeep Sen Sarma
Founder & CTO
Founded in 2011 by the pioneers of “big data” at
Facebook and the creators of the Apache Hive project.
Based in Mountain View, CA, with offices in Bangalore,
India. Backed by Charles River, LightSpeed, and Norwest
Ventures.
World-class product and engineering team.
3. Qubole QDS
Qubole works in:
• Adtech
• Media & Entertainment
• Healthcare
• Retail
• eCommerce
Qubole works best when:
• Born in Cloud
• Commitment to Public Cloud
• Data Driven
• Large scale data
• Lack Hadoop Skills
• Analysts & scientists need access
4. Standard Hadoop (on premises)
- JobTrackers
- TaskTrackers
- NameNodes
- DataNodes
= Datacenter, servers, VMs, wires…
How about adding more capacity? Dev/Test environments?
Non-Technical Users? Version upgrades?
5. Hadoop in the Cloud
= Someone else’s datacenter, servers, VMs, wires…
Designed for capacity scaling (and reduction)
Designed for multiple Dev/Test environments
Potential UI for Non-Technical Users
Potential support for version upgrades
Ad hoc queries can spin up a cluster on-demand exactly when needed
= Cost reduction, self-service, custom configurable clusters
*Security (important and production ready, but we won’t focus on it here)
7. Data stored in HDFS requires nodes (EC2 instances) to be kept
running continuously. This can be expensive.
Data stored in S3 means it is remote from the compute nodes.
S3 performs well in general, but it’s not uncommon to see
significant variance in performance.
8. Split Computation and File I/O
Multiple map tasks are instantiated and each of these is assigned a split.
Hadoop needs to know the size of input files so that they can be grouped
into equal-sized splits.
Input files are spread across many directories.
For example, two years of data, organized into hourly directories, results
in 17520 directories. If each directory contains 6 files, this makes a grand
total of 105,120 files.
Map-Reduce calls the generic Hadoop file listing API against each input
directory to get the size of all files in the directory.
9. Split Computation and File I/O
This is cheap on HDFS. But in our example it results in 17,520 API
calls, which performs very badly against S3.
Every listing call to S3 is a REST API call whose XML results must be
parsed, which adds high overhead and latency.
Furthermore, Amazon employs protection mechanisms against high
rates of API calls. For certain workloads, split computation becomes
a huge bottleneck.
10. Data Consistency
From AWS FAQ:
What data consistency model does Amazon S3 employ?
Amazon S3 buckets in the US Standard region provide eventual
consistency. Amazon S3 buckets in all other regions provide read-after-
write consistency for PUTS of new objects and eventual consistency for
overwrite PUTS and DELETES.
13. Data Consistency
Specify named S3 endpoints instead of US Standard.
For example, replace:
http://mybucket.s3.amazonaws.com/somekey.ext
with:
http://mybucket.s3-external-1.amazonaws.com/somekey.ext
http://docs.aws.amazon.com/redshift/latest/dg/managing-data-consistency.html
14. Qubole DataFlow Diagram
[Diagram: users reach QDS through the Qubole UI in a browser, the SDK, or ODBC. Qubole’s AWS account hosts an ephemeral web tier (web servers), an encrypted result cache, and an RDS database of user and account configurations with encrypted credentials. The customer’s AWS account hosts the ephemeral Hadoop clusters managed by Qubole (master and slave nodes with encrypted HDFS, reached over SSH), Amazon S3 with S3 server-side encryption, the default Hive metastore or an optional custom Hive metastore, and, optionally, other RDS or Redshift data sources. Communication between the accounts uses the REST API over HTTPS and SSH.]
Encryption options:
a) Qubole can encrypt the result cache
b) Qubole supports encryption of the ephemeral drives used for HDFS
c) Qubole supports S3 Server Side Encryption
15. QDS Platform Features
Auto-scaling, self-managed Hadoop clusters in the cloud
– Including Amazon EC2, Rackspace, Google Compute & OpenStack
The fastest Hadoop on the cloud
– Numerous optimizations that provide 4 to 8 times faster performance than Amazon
Elastic MapReduce (EMR)
Pre-built connectors
– Traditional RDBMS, MongoDB and other NoSQL solutions
– Incremental Data Scrapes
Job Scheduler
– Dependencies, Workflows, Incremental Jobs
Multi-Platform Support
– Supports AWS, Google & Azure Credentials
16. QDS Capabilities
Mix and Match Reserved & Spot instances
– To reduce the cost of compute hours on the cloud
Perform data exploration and analysis on raw multi-structured
data formats.
Integration with data visualization & BI tools via ODBC
– Tableau Software, Pentaho, Excel
All functionality is also available through APIs and toolkits
17. Faster split computations on S3
To solve this problem, we modified split computation to invoke listing at the level of the parent
directory. This call returns all files (and their sizes) in all subdirectories, in blocks of 1000.
Some subdirectories and files may not be of interest to the job/query; partition elimination, for
example, may eliminate some of them.
We take advantage of the fact that the file listing is in lexicographic order and perform a modified
merge join of the list of files and the list of directories of interest. This lets us efficiently identify
the sizes of the interesting files.
The modified algorithm results in only 106 API calls (each call returns 1000 files), compared to
17,520 API calls in the original implementation.
We compared the two approaches using a simple Hive test: a partitioned table T with 15,000 files,
varying the number of partitions (a partition corresponds to a directory), measuring ‘select count(*)
from T’. In the extreme case, this optimization shows a speedup of 8x!
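As a rough illustration, the merge join over the lexicographically sorted listing can be sketched in a few lines of Python. This is a minimal simulation, not Qubole’s implementation: the function name, the flat `(key, size)` listing, and the prefix list are stand-ins for the real S3 listing API.

```python
# Illustrative sketch (not Qubole's code): merge-join a lexicographically
# sorted flat bucket listing against a sorted list of directory prefixes of
# interest, collecting file sizes in a single pass.

def sizes_for_prefixes(listing, prefixes):
    """listing: sorted list of (key, size); prefixes: sorted dir prefixes.
    Returns {key: size} for every key under an interesting prefix."""
    out = {}
    i = 0
    for key, size in listing:
        # Advance past prefixes that sort entirely before this key.
        while (i < len(prefixes) and key > prefixes[i]
               and not key.startswith(prefixes[i])):
            i += 1
        if i == len(prefixes):
            break  # no interesting prefixes remain
        if key.startswith(prefixes[i]):
            out[key] = size
    return out
```

Because both inputs are sorted, each listing entry and each prefix is visited at most once, which is what keeps the cost proportional to one bulk listing rather than one call per directory.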
18. Faster reads from S3
Opening a file takes a significant amount of time – at least 50 milliseconds per file.
The problem becomes pronounced when the input dataset has lots of small files, so that file open
latency forms a significant portion of overall execution time.
To alleviate this, we added an optimization that opens an S3 file in a background thread a little
while before it is actually required by the map task. This hides the file open latency.
One thing to be aware of: if an S3 file is opened but not read from for a while, S3 returns a
RequestTimeout and potentially penalizes the caller.
We tested this optimization with a simple Hive test. Our dataset consisted of 80,000 files, each of
size 640 KB. We saw a 30% improvement in a count(*) query as a result of this optimization.
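The prefetch idea can be sketched as a self-contained Python simulation in which `slow_open` stands in for the roughly 50 ms S3 file open; the class and function names are illustrative, not Qubole’s actual code.

```python
# Illustrative sketch (not Qubole's code): open the next file on a background
# thread shortly before the map task needs it, hiding the open latency.
import threading
import time

def slow_open(name):
    """Stand-in for an S3 file open with ~50 ms latency."""
    time.sleep(0.05)
    return f"handle:{name}"

class Prefetcher:
    def __init__(self, name):
        self.handle = None
        self._t = threading.Thread(target=self._open, args=(name,))
        self._t.start()  # start the open before the task asks for it

    def _open(self, name):
        self.handle = slow_open(name)

    def get(self):
        self._t.join()  # usually already finished by the time we get here
        return self.handle

def process(files):
    """Process files in order, overlapping each open with the previous work."""
    handles = []
    nxt = Prefetcher(files[0])
    for i in range(len(files)):
        handle = nxt.get()
        if i + 1 < len(files):
            nxt = Prefetcher(files[i + 1])  # prefetch the next file
        handles.append(handle)              # the map task's "work" happens here
    return handles
```

In a real implementation the prefetch distance has to be tuned, since a file opened too early and left unread can hit the S3 RequestTimeout mentioned above.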
19. Attaching to EBS Volumes
Storage on EC2 Instances.
Instance Type Instance Storage (GB)
c1.xlarge 1680 [4 x 420]
c3.2xlarge 160 [2 x 80 SSD]
c3.4xlarge 320 [2 x 160 SSD]
c3.8xlarge 640 [2 x 320 SSD]
The amount of storage per instance might not be
sufficient for running Hadoop clusters with high
volumes of data.
20. Attaching to EBS Volumes
• AWS offers raw block devices called EBS
volumes, which can be attached to EC2
instances.
• It is possible to attach multiple EBS volumes,
each up to 1 TB in size. This can easily
compensate for the low instance storage
available on the new-generation instances.
Using EBS volumes for storage also costs
much less than adding instances just for their
storage capacity.
21. Attaching to EBS Volumes
• Configurable reserved volumes
• On the Qubole platform, users running
new-generation instances have the option to
use reserved volumes if their data requirements
exceed the local storage available in the cluster.
• AWS EBS volumes come in various flavors,
e.g. magnetic and SSD-backed. Users can select
the size and type of EBS volumes based on their
data and performance requirements.
[Diagram: MapReduce and HDFS performing disk access against local SSD and a reserved EBS disk]
22. Protection Against Bad Jobs
A single Hadoop cluster is usually shared across many users, so it is a
common occurrence that a user issues a bad job that degrades the
performance of the entire cluster.
Running out of disk:
- A single mapper issuing too much output
- A map/reduce job with many mappers outputting too much map data
- Reducer tasks copying a lot of map output data during the shuffle phase
23. Protection Against Bad Jobs
Qubole’s Hadoop distribution protects clusters against such jobs. Its
clusters periodically monitor jobs and kill any job that may be affecting
the entire cluster.
Kill a job when …
- The total map output of the job exceeds a configurable value
- Any task produces more map output than a configured percentage of disk
- The job produces a lot of logs (configurable value)
- Reducers read a lot of map data (configurable value)
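A guard like this amounts to a periodic threshold check over per-job counters. The sketch below is a hedged illustration; the counter names and limits are assumptions, not Qubole’s configuration keys.

```python
# Illustrative sketch (not Qubole's code): flag jobs whose counters exceed
# configurable limits, as a cluster monitor might do on each pass.
LIMITS = {
    "map_output_bytes": 1 << 40,  # total map output per job (assumed limit)
    "log_bytes": 10 << 30,        # logs produced per job (assumed limit)
    "shuffle_bytes": 2 << 40,     # map data read by reducers (assumed limit)
}

def jobs_to_kill(job_stats, limits=LIMITS):
    """job_stats: {job_id: {counter: value}} -> list of job ids to kill."""
    return [
        job_id
        for job_id, counters in job_stats.items()
        if any(counters.get(name, 0) > limit for name, limit in limits.items())
    ]
```

A real monitor would run this check on a timer against live JobTracker counters and issue a kill for each offending job id.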
24. Direct output commit to S3
Using the default S3 code path involves writing to a temp directory and then
moving the temp directory into its final location.
A move on S3 is really a copy followed by a delete.
Instead, write directly into the target directory.
25. Direct output commit to S3 - Hive
Changed the naming scheme for the files Hive creates.
• Instead of 00000 we use names like UUID_000000, where all files generated by a
single insert into use the same prefix.
• This guarantees that a new insert into will not stomp on data produced by an earlier query.
To support insert overwrite with dynamic partitions, the tasks that write into the directory
must delete any existing files.
Before the insert overwrite begins, we generate a UUID for the statement. When deleting
from a directory, the mappers/reducers delete all files that don’t begin with this UUID.
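The UUID-prefixed direct commit can be sketched with local files standing in for S3 objects. `write_outputs` and its arguments are hypothetical names; only the `UUID_000000`-style naming and the delete-everything-without-my-UUID rule come from the slides.

```python
# Illustrative sketch (not Qubole's code): write task outputs directly into
# the target directory with a per-statement UUID prefix; on insert overwrite,
# delete only files that do not carry the current statement's UUID.
import os
import uuid

def write_outputs(target, parts, stmt_uuid=None, overwrite=False):
    stmt_uuid = stmt_uuid or uuid.uuid4().hex  # one UUID per statement
    os.makedirs(target, exist_ok=True)
    if overwrite:
        # Remove results of earlier inserts, never our own new files.
        for name in os.listdir(target):
            if not name.startswith(stmt_uuid):
                os.remove(os.path.join(target, name))
    for i, data in enumerate(parts):
        name = f"{stmt_uuid}_{i:06d}"  # e.g. UUID_000000, as on the slide
        with open(os.path.join(target, name), "w") as fh:
            fh.write(data)
```

Because every task of a statement shares the prefix, concurrent tasks can safely apply the delete rule without racing against each other’s output.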
26. Direct output commit to S3 - MapReduce
By default the MR code also writes to a temp location and then moves files to their final
location. The moves are done by listing the temp location and moving all the files found there.
• To avoid this, we track expected file counts for a couple of file formats
• We provide direct committers that avoid the move completely
27. Spot Instances, Placement Policy, and Fallback to on-demand
Spot Instances
Spot Instances allow users to bid on unused Amazon EC2 capacity and run those instances for as
long as their bid exceeds the Spot Price. QDS enables you to realize cost savings of as much as
50% to 60% by supporting the Spot Instance pricing model in addition to the Reserved Instance
pricing.
Use Qubole Placement Policy:
When using spot instances for slaves, this ensures that at least one replica of each HDFS block is
placed on stable (non-spot) instances. It is recommended to keep this enabled when using spot instances.
Fallback to on-demand:
When upscaling the cluster, sometimes we may not be able to procure Spot Instances because of
low availability or high price. This option specifies that autoscaling should then fall back to procuring
On-Demand instances. This will increase the cost of running the cluster, but ensures that the
processing completes relatively quickly. Enable this if command processing time is important to you.
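The fallback behaviour amounts to a shortfall calculation during upscaling. A minimal sketch, with stand-in provisioning callbacks rather than real AWS calls:

```python
# Illustrative sketch (not Qubole's code): try to procure spot nodes first,
# then fill any shortfall with on-demand nodes when fallback is enabled.
def provision(needed, request_spot, request_on_demand, fallback=True):
    nodes = request_spot(needed)          # may return fewer than requested
    shortfall = needed - len(nodes)
    if shortfall > 0 and fallback:
        nodes += request_on_demand(shortfall)
    return nodes
```

With fallback disabled, the cluster simply upscales by fewer nodes when spot capacity is scarce, trading processing time for cost.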
30. Features
S3 Caching: S3 caching utilizes resources more efficiently and
brings up clusters faster – up to 10x faster on some client instances.
Variable Spot Instance Pricing: QDS allows you
to vary the mix of spot vs. on-demand nodes,
providing the benefits of spot pricing (up to 90%
less expensive) with the certainty of getting your
job done.
Searchable Queries and Log Files: QDS shows
you all of your jobs, allowing you to compare the
efficiency of queries and avoid having to re-create
queries from scratch.
Built in Job Tracker: QDS provides a job tracker
accessed directly through the UI, allowing you to
identify resources, nodes, and tasks.
Security: QDS has a tight security environment,
with the ability to encrypt data at rest on nodes.
32. Why Qubole?
“Qubole has enabled more users within Pinterest to get to the
data and has made the data platform lot more scalable and
stable”
Mohammad Shahangian - Lead, Data Science and Infrastructure
Moved to Qubole from Amazon EMR because of stability, and rapidly
expanded big data usage by giving access to data to users beyond developers.
User and Query Growth:
- Rapid expansion of big data beyond developers (240 users out of a 600-person company)
- Enterprise-scale processing and data access
Use Cases:
- Rapid expansion in use cases ranging from ETL, search, ad hoc querying, product analytics, etc.
- Rock-solid infrastructure sees 50% fewer failures compared to AWS Elastic MapReduce
33. Why Qubole?
“We needed something that was reliable and easy to learn,
setup, use and put into production without the risk and high
expectations that comes with committing millions of dollars in
upfront investment. Qubole was that thing.”
Marc Rosen - Sr. Director, Data Analytics
Moved to big data on the cloud (from internal Oracle
clusters) because getting to analysis was much
quicker than operating infrastructure themselves.
Used to answer client queries and power client
dashboards.
# Commands Per Month:
[Chart: number of queries per month, scale 0–5,000]
Use Cases:
- Segment audiences based on their behavior, including topics such as user pathway and
multi-dimensional recency analysis
- Build customer profiles (both uni/multivariate) across thousands of first-party (e.g., client
CRM files) and third-party (e.g., demographic) segments
- Simplify attribution insights showing the effects of upper-funnel prospecting on lower-funnel
remarketing media strategies