3. Where Big Data falls short:
• 6-18 month implementation time
• Only 27% of Big Data initiatives were classified as “Successful” in 2014
• Rigid and inflexible infrastructure
• Non-adaptive software services
• Highly specialized systems
• Difficult to build and operate
• Only 13% of organizations achieve full-scale production
• 57% of organizations cite the skills gap as a major inhibitor
4. Why Cloud?
1. Flexible Infrastructure
2. Pay only for what you actually use
3. Shared Storage
4. Heterogeneous Clusters
6. Cloud Compute
1. Properties:
a. Ephemeral
b. Volatile (Spot on AWS, Preemptible on GCP)
2. Challenges:
a. Scaling to match the workload
b. Separation of compute and storage
c. Job histories, log files, and results all need to be persisted
d. Adapting YARN/HDFS to take ephemeral cloud nodes into account
7. Up-scaling for MR jobs
[Diagram: the user submits a job to the Resource Manager, which launches the MR AppMaster in a NodeManager on Node 1. The AppMaster sends container requests, and the Resource Manager allocates resources on Node 2 (containers C1, C2). Based on task progress, an up-scale request goes to the Cluster Manager, which adds Node 3; its NodeManager then runs containers C3 and C4.]
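At its core, the up-scale request is a capacity check: if pending container requests exceed the cluster's free slots, ask the Cluster Manager for enough nodes to cover the deficit. A minimal sketch of that decision rule, using hypothetical names (`pending_containers`, `slots_per_node`) rather than any real YARN or Qubole API:

```python
def nodes_to_add(pending_containers, free_slots, slots_per_node):
    """How many nodes to request so all pending container
    requests can be scheduled; 0 if free capacity suffices."""
    deficit = pending_containers - free_slots
    if deficit <= 0:
        return 0
    # Ceiling division: a partially filled node still needs a whole machine.
    return -(-deficit // slots_per_node)
```

For example, `nodes_to_add(10, 2, 4)` returns 2: the 8 unscheduled containers need two 4-slot nodes.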
9. Down-scaling
[Diagram: the user submits Jobs 1-3; the Resource Manager allocates containers (C1-C4) across the NodeManagers on Nodes 1-3. When Job 1 completes, status updates let the Cluster Manager evaluate that the cluster is being underutilized and can be down-scaled. It selects the node whose estimated task completion time is lowest, performs a graceful shutdown, decommissions the node, and removes it from the cluster.]
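The node-selection step in the diagram can be sketched in a few lines: among the candidate nodes, pick the one whose running tasks are estimated to finish soonest, so draining it delays the cluster the least. A toy sketch (the dict-of-estimates input is an assumption for illustration):

```python
def pick_node_to_decommission(estimated_completion):
    """estimated_completion: node_id -> estimated seconds until the
    node's running tasks finish. Pick the node that frees up soonest,
    minimizing how long the graceful shutdown has to wait."""
    return min(estimated_completion, key=estimated_completion.get)
```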
10. Why is it hard?
1. Upscaling
a. Engine-specific algorithms
b. Cannot just look at expected time (parallelism matters)
2. Downscaling
a. Decommissioning takes time
b. Need to consider hour boundaries
c. Stuck on mapper output
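The "hour boundaries" point refers to per-hour instance billing: releasing a node mid-hour gives back time already paid for. One way to sketch the timing check — a hypothetical rule, not Qubole's actual policy — is to down-scale only when the node can drain its work and still be released just before its next billing boundary:

```python
def should_downscale(uptime_seconds, drain_seconds, window=300):
    """True only if the node can finish decommissioning (drain) and
    still be released within `window` seconds of its next hourly
    billing boundary, so no paid-for time is wasted."""
    seconds_to_boundary = 3600 - (uptime_seconds % 3600)
    # Enough time left to drain, but close enough to the boundary
    # that we are not giving back hours we already paid for.
    return drain_seconds <= seconds_to_boundary <= drain_seconds + window
```

So a node up for 3000 s with 400 s of draining left (600 s to the boundary) is released, while a freshly started node is kept.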
12. Job History – Terminated Cluster
[Diagram: the user clicks a UI link; the Qubole UI proxifies the link and authenticates the request. The Cluster Proxy finds that the cluster is down and fetches the jhist file from cloud storage; the Job History Server then renders the job history from that file.]
13. Advanced Optimizations
1. Volatile Nodes
a. Lower-priced nodes bought in an auction (Spot nodes in AWS, Preemptible in GCE)
2. Hybrid Clusters
a. Mix of stable and volatile nodes to improve stability
3. Heterogeneous Clusters
a. Preferred machine types may not be available
b. Preferred machine types may be more expensive than larger machines
4. Autoscaling Optimizations
a. Packing of tasks
b. Upload intermediate data to cloud storage
c. Recommission nodes
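The hybrid-cluster idea can be sketched as a simple split: cap the volatile (spot/preemptible) share of the cluster and keep a stable on-demand core. This is an illustrative rule with made-up parameters (`max_spot_fraction`, `min_stable`), not the deck's actual placement policy:

```python
def hybrid_split(total_nodes, max_spot_fraction=0.5, min_stable=1):
    """Split a node request into (stable, volatile) counts, keeping a
    stable core so losing spot nodes cannot take down the cluster."""
    spot = int(total_nodes * max_spot_fraction)
    stable = max(total_nodes - spot, min_stable)
    return stable, min(spot, total_nodes - stable)
```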
14. Agenda
1. Cloud Compute (Cluster) Management
a. Challenges
b. Scaling
c. Advanced Optimizations
2. Cloud Storage
a. Challenges
b. Solutions and Optimizations
15. Cloud Storage
1. Properties:
a. Simple key-value store
b. Inexpensive
c. Accessed via REST APIs/SDKs
d. Is the source of truth
2. Challenges:
a. Connection establishment is expensive
b. Copying/moving is expensive — there is no rename
3. Some positives:
a. Prefix listing
b. PUTs are atomic: the file is created when the upload completes, unlike HDFS, where it is created on the first write
c. Multipart uploads
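The "no rename" challenge is worth making concrete: in a key-value object store, a move is a full copy of the object's bytes to a new key followed by a delete, not a cheap metadata update as in a filesystem. A toy in-memory model (not a real S3 client) showing both properties:

```python
class ObjectStore:
    """Toy key-value object store: PUTs are atomic (the key appears
    only once the whole object is written), and there is no rename,
    so a 'move' is an expensive copy followed by a delete."""
    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        # Atomic: readers see either no object or the complete object.
        self.objects[key] = data

    def move(self, src, dst):
        # No rename primitive: copy all the bytes, then delete the source.
        self.objects[dst] = self.objects[src]
        del self.objects[src]
```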
16. Prefix Listing
• Naive: one listObject call per path
• Smart: one listPrefix call, filtered client-side
• Up to 1000x improvement

Naive:
for path in ['/x/y/a', '/x/y/b', '/x/z/c', … ]:
    result << listObject(path)

Smart:
pathList = listPrefix('/x')
while (entry = pathList.next()):
    if entry in ['/x/y/a', '/x/y/b', '/x/z/c', … ]:
        result << entry
Storage Optimizations
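The pseudocode above can be made concrete with a dict standing in for the bucket; the point is the API-call count, since each remote call pays connection and latency costs. A runnable sketch (`list_naive`/`list_smart` are illustrative names, not a real SDK):

```python
import os

def list_naive(store, paths):
    """One listObject-style remote call per path: O(len(paths)) round trips."""
    found, calls = [], 0
    for p in paths:
        calls += 1
        if p in store:
            found.append(p)
    return found, calls

def list_smart(store, paths):
    """One listPrefix-style call on the common prefix, filtered client-side."""
    prefix = os.path.commonprefix(paths)
    listing = [k for k in sorted(store) if k.startswith(prefix)]
    wanted = set(paths)
    return [k for k in listing if k in wanted], 1
```

Both return the same entries, but the smart version issues a single remote call regardless of how many paths are checked — the source of the "up to 1000x" improvement.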
17. Prefix Listing – Use Cases
1. Split computation: divide input files into tasks for Map-Reduce/Spark/Presto
2. Recovering partitions
3. Listing paths that match a regex pattern ('/x/y/z/*/*')
4. And many more …
18. Direct Writes
• Normally:
– Write data to a temporary location, then atomically rename it to the final location
• With S3:
– Write data directly to the final location
– Atomic PUTs deal with speculation/retries
• Default in Hive; DirectFileOutputCommitter in MR/Spark
• Tricky: retries/speculation must use the same path
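The "same path" requirement can be illustrated with a deterministic output key: because every retry or speculative attempt of a task derives the same final key, the last atomic PUT simply overwrites earlier attempts instead of leaving duplicates. A sketch with a hypothetical naming scheme (the `/output/{job}/part-NNNNN` layout is illustrative, not the committer's actual format):

```python
def task_output_key(job_id, task_id):
    """Deterministic final location for a task's output, derived only
    from job and task IDs — never from the attempt number — so every
    attempt of the same task PUTs to the same key."""
    return f"/output/{job_id}/part-{task_id:05d}"
```

Two attempts of the same task thus write one object, and atomic PUTs guarantee readers never see a half-written file.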
19. S3 Optimizations
• Object caches (per bucket): high gain for role-based accounts
• Connection pools
• Read-ahead optimizations
• Streaming upload
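Read-ahead pays off because connection establishment and request latency dominate small reads: fetching large chunks ahead of the reader turns many remote requests into one. A toy sketch over an in-memory byte string (the chunk size and counter are illustrative):

```python
class ReadAheadReader:
    """Wrap a byte source and fetch it in large fixed-size chunks, so
    many small sequential read() calls hit the local buffer instead
    of each issuing its own remote request."""
    def __init__(self, data, chunk_size=8):
        self.data = data
        self.chunk_size = chunk_size
        self.pos = 0
        self.buf = b''
        self.fetches = 0            # counts simulated remote requests

    def read(self, n):
        while len(self.buf) < n and self.pos < len(self.data):
            self.fetches += 1       # one "remote" fetch per chunk
            self.buf += self.data[self.pos:self.pos + self.chunk_size]
            self.pos += self.chunk_size
        out, self.buf = self.buf[:n], self.buf[n:]
        return out
```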
20. Cache! Cache! Cache!
• RubiX: block-level file cache
• Metadata caching for ORC and Parquet
21. RubiX
• Cache blocks on local disks
• Open source
• Engine agnostic
• Works well with auto-scaling
• Consistent hashing to assign files or blocks to nodes
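Consistent hashing is what makes the cache play well with auto-scaling: when nodes join or leave, only the blocks adjacent to the new or removed points on the hash ring change owners, so most of the cache stays warm. A minimal ring sketch (virtual-node count and MD5 are illustrative choices, not RubiX's actual implementation):

```python
import hashlib
from bisect import bisect

class ConsistentHashRing:
    """Map block keys to cache nodes; adding or removing a node moves
    only a small fraction of the keys."""
    def __init__(self, nodes, vnodes=100):
        # Each node gets `vnodes` points on the ring for even spread.
        self.ring = sorted(
            (self._hash(f"{n}#{i}"), n)
            for n in nodes for i in range(vnodes))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, block_key):
        # Walk clockwise to the first point at or after the key's hash.
        points = [h for h, _ in self.ring]
        idx = bisect(points, self._hash(block_key)) % len(self.ring)
        return self.ring[idx][1]
```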
25. Metadata Caching
• Cache on a Redis server running on the master
• Effective and efficient split computation with predicate pushdown (PPD)
• ORC and Parquet
• Engine agnostic
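The footer-caching idea can be sketched independently of Redis: key the cached ORC/Parquet footer by path plus modification time, so a rewritten file naturally invalidates its stale entry. Here a dict stands in for the Redis store on the master; names like `FooterCache` and `read_footer` are illustrative:

```python
class FooterCache:
    """Cache file-footer metadata (stripe/row-group stats used for
    split computation and predicate pushdown) so each footer is read
    from cloud storage at most once per file version."""
    def __init__(self):
        self.store = {}     # stand-in for Redis on the master
        self.misses = 0

    def get_footer(self, path, mtime, read_footer):
        key = (path, mtime)          # mtime change invalidates old entries
        if key not in self.store:
            self.misses += 1
            self.store[key] = read_footer(path)   # remote footer read
        return self.store[key]
```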