Strata Data Conference, London, May 2017
Trent Gray-Donald and Gil Vernik explain the challenges of current Hadoop and Apache Spark integration with object stores and discuss Stocator, an open source (Apache License 2.0) object store connector for Hadoop and Apache Spark specifically designed to optimize their performance with object stores. Trent and Gil describe how Stocator works and share real-life examples and benchmarks that demonstrate how it can greatly improve performance and reduce the quantity of resources used.
2. Speakers
§ Gil Vernik
- At IBM Research since 2010
- Architect, 25+ years of development experience
- Active in open source
- Recent interest: Big Data engines and object stores
§ Trent Gray-Donald
- IBM Distinguished Engineer
- Architect on Watson Data Platform
- Historically worked on the IBM Java VM
Twitter: @vernikgil
3. Agenda
§ Storage for unstructured data
§ Introduction to object storage – why is it needed and what is it?
§ HDFS and object stores – differences
§ Real world usage
- How Hadoop accesses object stores
- Understanding the issues
- An alternative approach
§ SETI usage
4. Storage for unstructured data
§ HDFS, distributed file system, or similar
§ Object storage
- On premise, cloud based, hybrid, etc.
- IBM Cloud Object Storage
- Amazon S3
- OpenStack Swift, Azure Blob Storage, etc.
§ NoSQL databases / key-value stores
(Diagram: raw data is ingested into unstructured data storage and read back with a schema.)
5. HDFS - Summary
§ Hadoop Distributed File System (distributed, and Hadoop-native).
§ Stores large amounts of unstructured data in arbitrary formats.
§ Default block size is large – 64MB in Hadoop 1, 128MB in Hadoop 2.
§ Blocks are replicated.
§ Write once – read many (append allowed)
§ (Often) collocated with compute capacity.
§ Need an HDFS client to work with HDFS.
§ Hadoop FS shell is widely used with HDFS.
6. What is an object store?
§ An object store is well suited to storing files (we call them data objects).
§ Each data object holds the data itself plus rich metadata.
§ Capable of storing huge amounts of unstructured data.
§ On premise, cloud based, hybrid, etc.
Object storage
7. Good things about object stores
§ Resilient store: data will not be lost.
§ Fault tolerant: object stores are designed to keep operating during failures.
§ Various security models – data is safe.
§ Can be easily accessed for write or read flows.
§ (effectively) infinitely scalable – EB and beyond.
§ Low-cost, long-term storage solution.
8. Organize data in the object store
§ Data objects are organized inside buckets (S3) or containers (Swift).
§ Each data object may contain a name with delimiters, usually “/”.
§ Conceptual grouping via delimiters allows hierarchical organization – an analogy to directories in file systems, but without the overhead or scalability limits of lots of directories.
mytalks/year=2016/month=5/day=24/data-palooza.pdf
mytalks/year=2017/month=5/day=24/hadoop-strata.pdf
mytalks/year=2017/month=6/day=07/spark-summit.pdf
(Here “mytalks” is the bucket; each full path names a data object.)
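The prefix/delimiter grouping above can be sketched in a few lines of Python. This is a toy illustration of S3-style prefix + delimiter listing, not any vendor's implementation; only the bucket and object names come from the slide:

```python
# Flat object namespace: names contain "/" but there are no real directories.
objects = [
    "mytalks/year=2016/month=5/day=24/data-palooza.pdf",
    "mytalks/year=2017/month=5/day=24/hadoop-strata.pdf",
    "mytalks/year=2017/month=6/day=07/spark-summit.pdf",
]

def list_prefix(names, prefix, delimiter="/"):
    """Mimic an S3-style LIST with prefix + delimiter:
    return the next path component under the prefix."""
    results = set()
    for name in names:
        if name.startswith(prefix):
            rest = name[len(prefix):]
            head, sep, _ = rest.partition(delimiter)
            results.add(head + sep)  # "year=2017/" looks like a directory
    return sorted(results)

print(list_prefix(objects, "mytalks/"))
print(list_prefix(objects, "mytalks/year=2017/"))
```

The pseudo-directories emerge purely from the object names; no directory metadata exists in the store.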
9. Object storage is not a file system
§ Write once – no append in place
§ Usually eventually consistent
§ Accessed via RESTful API, SDKs available for many languages.
§ Each data object has a unique URI.
§ Rename in an object store is not an atomic operation (unlike on file systems).
- Rename = GET and PUT/COPY and DELETE.
§ Object creation is atomic.
- Writing a file is not.
§ Examples
- Store raw data for archive, raw IoT sensor data.
- Export old data from database and store it as objects.
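The rename-as-copy-plus-delete point can be illustrated with a toy in-memory object store (invented for this deck, not a real SDK): PUT of a whole object is atomic, but "rename" is several separate requests, and a failure partway through leaves the store in an intermediate state.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store: a flat name -> bytes map."""
    def __init__(self):
        self.objects = {}
        self.requests = []          # log of REST-style calls

    def put(self, name, data):      # atomic: the object appears whole or not at all
        self.requests.append(("PUT", name))
        self.objects[name] = data

    def get(self, name):
        self.requests.append(("GET", name))
        return self.objects[name]

    def delete(self, name):
        self.requests.append(("DELETE", name))
        del self.objects[name]

    def rename(self, src, dst):
        """Not atomic: GET + PUT (or server-side COPY) + DELETE.
        A crash after put() leaves both names visible."""
        self.put(dst, self.get(src))
        self.delete(src)

store = ToyObjectStore()
store.put("tmp/part-0000", b"results")
store.rename("tmp/part-0000", "result/part-0000")
print(store.requests)   # one "rename" cost three extra requests
```

On a file system the same rename is a single atomic metadata operation; here it is three round trips to the store.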
10. The usual dilemma
(Diagram: workers 1–3 with local HDFS vs. workers 1–3 sharing remote object storage.)
Data locality (HDFS on the workers):
• Impossible to scale storage without scaling compute.
• Difficult to share HDFS data more globally.
No data locality (object storage):
• Storage is separated from compute nodes, so it can be scaled independently from compute.
• Data is easily shared and can be accessed from different locations.
11. The usual dilemma (continued)
• Object storage (no data locality): lower cost, more versatile, fast enough.
• HDFS (data locality): higher cost, less versatile, potentially faster.
15. Hadoop ecosystem
§ The Hadoop FileSystem interface is the common way to interact with underlying storage
§ Hadoop ships with various storage connectors that implement the FileSystem interface
§ Many Big Data engines utilize Hadoop storage connectors
(Diagram: engines reach HDFS or object storage – S3 API, Swift API, Azure API – through the FileSystem interface.)
16. Apache Spark
§ Apache Spark is a fast and general engine for large-scale data processing
§ Written in Scala; APIs in Scala, Python, Java, and R
§ Very active Big Data project
§ Apache Spark combines Spark SQL, streaming, machine learning, graph processing, and complex analytics (MapReduce and beyond) in a single engine, and is able to optimize programs across all of these paradigms
§ Spark can handle multiple object stores as a data source
§ Spark depends on the Hadoop connectors to interact with object stores
17. Example: persist collection as an object
val data = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
val myData = sc.parallelize(data, 9)
myData.saveAsTextFile("s3a://mybucket/data.txt")
API GET HEAD PUT DELETE
Hadoop (s3a) S3
18. Example: persist collection as an object
val data = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
val myData = sc.parallelize(data, 9)
myData.saveAsTextFile("s3a://mybucket/data.txt")
API GET HEAD PUT DELETE
Hadoop (s3a) S3 158 361 26 16
19. The deep dive into the numbers
§ What is wrong?
- We observed that some Hadoop components are highly inefficient when working with object stores
- Two major reasons
- The algorithms Hadoop uses to persist distributed data sets are not optimized for object stores.
- The cost of supporting FS shell operations and treating the object store as a file system. This has a negative effect on the Hadoop connectors.
It doesn’t have to be like this: we can make it much better.
20. Fault tolerance algorithms in the write flows
§ Output committers are the Hadoop components responsible for persisting data sets generated by MapReduce jobs. They are designed to be fault tolerant.
..result/_temporary/0/_temporary/attempt_201702221313_0000_m_000000_0/part-0000
..result/_temporary/0/task_201702221313_0000_m_000000/part-00000
..result/part-00001
(Diagram: a wordcount job over an input data set persists its result as the object “result”.)
21. Output committers and object stores
§ Output committers use temporary files and directories for every write operation and then rename them.
§ These temporary files are how the write flow achieves fault tolerance. Hadoop ships FileOutputCommitter versions 1 and 2.
§ File systems support atomic rename, which fits this paradigm perfectly.
§ Object stores do not support rename natively; it becomes copy and delete instead.
..result/_temporary/0/_temporary/attempt_201702221313_0000_m_000000_0/part-0000
..result/_temporary/0/task_201702221313_0000_m_000000/part-00000
..result/part-00001
This leads to dozens of expensive requests against the object store.
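Under stated assumptions (one part object per task, rename = COPY + DELETE, listing and HEAD traffic ignored), the cost of the two-stage rename flow can be tallied with a toy count. The numbers are illustrative, not measurements from the talk:

```python
def committer_v1_requests(num_tasks):
    """Rough request tally for a FileOutputCommitter-v1-style flow
    on an object store. Per task: PUT the temp part, a task-commit
    rename (COPY + DELETE), and a job-commit rename (COPY + DELETE).
    Plus one _SUCCESS object. Listing/HEAD calls are ignored, so
    real totals are higher still."""
    per_task = 1 + 2 + 2             # PUT + task rename + job rename
    return num_tasks * per_task + 1  # + _SUCCESS

def direct_write_requests(num_tasks):
    """Writing each part exactly once under its final name."""
    return num_tasks + 1             # parts + _SUCCESS

# 9 tasks, as in the saveAsTextFile example earlier in the deck:
print(committer_v1_requests(9), direct_write_requests(9))
```

Even this undercount shows roughly a 5x difference in write-path requests before any listing overhead is added.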
22. Hadoop FS shell operations and Hadoop connectors
§ To be 100% compliant with the Hadoop ecosystem, every Hadoop connector must support the FS shell operations.
§ FS shell operations are frequently used with HDFS
§ FS shell operations are not object store friendly
- Non-native object store operations: file/directory operations such as copy, rename, etc.
- Operations not optimized for object stores: an upload first creates a temporary object, then renames it to the final name.
§ Object store vendors provide CLI tools that are preferable to the Hadoop FS shell commands.
./bin/hadoop fs -mkdir -p hdfs://myhdfs/a/b/c/
./bin/hadoop fs -put mydata.txt hdfs://myhdfs/a/b/c/data.txt
23. Hadoop FS shell operations and analytic flows
§ The code that enables the FS shell indirectly hurts entire analytic flows in the Hadoop connectors by performing operations that are not inherent to those flows
- Recursive directory creation (empty objects), checks for directory existence, etc.
- Supporting move, rename, recursive listing of directories, etc.
§ Analytic flows such as Spark or MapReduce do not directly need these FS shell operations
24. Hadoop FS shell operations and analytic flows
§ What do analytic flows need?
- Object listing
- Create new objects (object name may contain “/” to indicate pseudo-directory)
- Read objects
- Get data partitions (data partition is the unit of data parallelism for each MapReduce task)
- Delete
Analytic flows need only a small subset of the functionality
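The short list above maps onto a deliberately small storage interface. The sketch below uses hypothetical names (it is not Stocator's actual class hierarchy) to show how little of Hadoop's FileSystem surface an analytic engine actually exercises:

```python
from abc import ABC, abstractmethod

class AnalyticObjectStore(ABC):
    """Only what an analytic flow needs: list, create, read,
    partition, delete. No mkdir, no rename, no directory tree."""

    @abstractmethod
    def list_objects(self, prefix): ...
    @abstractmethod
    def create(self, name, data): ...
    @abstractmethod
    def read(self, name): ...
    @abstractmethod
    def partitions(self, name, split_size): ...
    @abstractmethod
    def delete(self, name): ...

class InMemoryStore(AnalyticObjectStore):
    """Toy implementation backed by a flat dict."""
    def __init__(self):
        self.objects = {}
    def list_objects(self, prefix):
        return sorted(n for n in self.objects if n.startswith(prefix))
    def create(self, name, data):
        self.objects[name] = data   # "/" in the name is just a character
    def read(self, name):
        return self.objects[name]
    def partitions(self, name, split_size):
        # Byte ranges handed to parallel tasks: the unit of data parallelism.
        size = len(self.objects[name])
        return [(s, min(s + split_size, size)) for s in range(0, size, split_size)]
    def delete(self, name):
        del self.objects[name]

store = InMemoryStore()
store.create("result/part-00000", b"0123456789")
print(store.partitions("result/part-00000", 4))
```

Five operations cover the listing, creation, reading, partitioning, and deletion the slide enumerates; everything else in the FileSystem contract exists only to emulate a directory tree.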
25. Why does supporting FS shell affect analytic flows?
/data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000000_0/part-0000
/data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000001_1/part-0001
……
/data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000008_8/part-0008
(Diagram: nine partitions of a distributed data set, one per task attempt, are persisted as a single object.)
26. Why does supporting FS shell affect analytic flows?
Operation → File
1. Spark Driver (SD): make directories recursively → ..data.txt/_temporary/0
2. Spark Executor (SE): make directories recursively → ..data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000001_1
3. SE: write task temporary object → ..data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000001_1/part-00001
4. SE: list directory → ..data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000001_1
5. SE: rename task temporary object to job temporary object → ..data.txt/_temporary/0/task_201702221313_0000_m_000001/part-00001
6. SD: list job temporary directories recursively → ..data.txt/_temporary/0/task_201702221313_0000_m_000001
7. SD: rename job temporary object to final name → ..data.txt/part-00001
8. SD: write _SUCCESS object → ..data.txt/_SUCCESS
27. Certain Hadoop components are designed to work with file systems, not object stores
An opinionated object store connector for Spark can provide significant gains
28. Stocator – the next-gen object store connector
§ An advanced connector designed for object stores. It doesn’t create temporary files and directories for write operations, yet still provides fault tolerance, including speculative execution.
§ Doesn’t go through Hadoop’s write path; it interacts with the object store directly. This makes Stocator far faster on write flows and generates many fewer REST calls.
§ Supports analytic flows rather than shell commands.
§ Implements Hadoop FileSystem interface.
§ No need to modify Spark or Hadoop
§ Stocator doesn’t need local HDFS
Stocator adapted for analytic flows
https://github.com/SparkTC/stocator
Released under Apache License 2.0
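The core trick behind skipping temporary files is that Stocator writes each part object directly under its final destination, embedding the task-attempt ID in the object name, and filters out failed or speculative attempts when the output is read back. The Python sketch below illustrates that idea only; the naming format and function names are invented for this illustration and are not Stocator's actual code:

```python
def attempt_object_name(base, part, attempt):
    """Write directly to the final location, embedding the attempt ID
    in the object name instead of using a _temporary directory."""
    return f"{base}/part-{part:05d}-attempt_{attempt}"

def successful_parts(listing, committed_attempts):
    """At read time, keep only objects written by attempts that
    committed successfully; failed/speculative attempts are ignored."""
    keep = []
    for name in listing:
        attempt = name.rsplit("-attempt_", 1)[1]
        if attempt in committed_attempts:
            keep.append(name)
    return keep

names = [
    attempt_object_name("data.txt", 1, "m_000001_0"),  # failed attempt
    attempt_object_name("data.txt", 1, "m_000001_1"),  # retry, succeeded
]
print(successful_parts(names, {"m_000001_1"}))
```

Because object creation is atomic (as slide 9 notes), each part appears whole or not at all, so no rename is ever needed to make the output consistent.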
30. Where to find Stocator
§ IBM Cloud Object Storage
- Based on the open source Stocator
§ Bluemix Spark as a Service
§ IBM Data Science Experience
§ Open source - https://github.com/SparkTC/stocator
- Stocator-core module, stocator-openstack-swift connector
- Apache License 2.0
31. Example: persist collection as object
val data = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
val myData = sc.parallelize(data, 9)
myData.saveAsTextFile("s3d://mybucket/data.txt")
API GET HEAD PUT DELETE
Stocator S3 1 2 11 0
Hadoop (s3a) S3 158 361 26 16
32. Compare performance of Stocator
(Chart: runtime in seconds, 0–800, for Teragen, Copy, Terasort, Wordcount, Read (50GB), Read (500GB), and TPC-DS with Stocator, Hadoop Swift, and S3a. Speedups: 18x, 10x, 9x, 2x, 1x, 1x, 1x**.)
** Comparing Stocator to S3a
* 40Gbps in accesser tier
§ Stocator is much faster for write-intensive workloads
§ Stocator is as good for read-intensive workloads
33. S3a connector is improving*
(Chart: runtime in seconds, 0–800, for the same workloads with Stocator, S3a, S3a CV2, and S3a CV2+FU. Speedups: 1.5x, 1.3x, 1.3x, 1.1x, 1x, 1x, 1x**.)
** Comparing Stocator to S3a with CV2 and FU
§ File Output Committer Algorithm 2 halves the number of renames (CV2)
§ Fast Upload introduces streaming on output (FU)
§ Stocator is still faster for write-intensive workloads and as good for read-intensive ones
34. Compare number of REST operations*
(Chart: number of RESTful operations, 0–45,000, for the same workloads with Stocator, Hadoop Swift, and S3a. Ratios: 21x, 16x, 15x, 16x, 2x, 2x, 2x**.)
** Comparing Stocator to S3a with CV2 and FU
* 40Gbps in accesser tier
§ Stocator performs far fewer REST operations
§ Fewer operations mean lower overhead and lower cost
35. IBM Spark@SETI
§ Headquartered in Mountain View, CA. Founded 1984. 150 Scientists, researchers and staff.
§ The mission of the SETI Institute is to explore the potential for extra-terrestrial life….
§ Allen Telescope Array (ATA): 42 receiving dishes, each 6m in diameter, covering 1GHz to 10GHz
36. The Spark@SETI Project – By the Numbers
§ 200 million signal events
§ 14 million complex amplitude files in Object Store
- Signal of interest
- Each binary file contains a 90-second ‘snapshot’ of raw antenna voltages
- 14M files = 1TB of raw signal data
- Feature extraction for clustering: ~12 hours
§ Long duration observations = 2 beams @ 2.5TB each
- Wide-band analysis: 5TB processed for wideband detection in approximately 13.5 hours wall time.
Visit our joint talk with SETI at Spark Summit San Francisco, Wednesday, June 7 5:40 PM – 6:10 PM
“Very large data files, object stores, and deep learning –
lessons learned while looking for signs of extra-terrestrial life”
37. Lessons learned
Object storage provides a good alternative for HDFS
The existing Hadoop ecosystem doesn’t work efficiently with object stores
There is nothing fundamentally wrong with object stores; the inefficiency comes from software components that are not adapted to them
We demonstrated Stocator – an object store connector
Gil Vernik (gilv@il.ibm.com), Trent Gray-Donald (trent@ca.ibm.com)