Hadoop Vectored IO:
your Data just got Faster!
Mukund Thakur
Software Engineer @Cloudera
mthakur@apache.org
Committer@Apache Hadoop
Hadoop Vectored IO: your data just got faster!
Not just faster, cheaper as well.
Everybody hates slow!
● Slow running jobs/queries.
● Slow websites.
● Slow computer.
● …..
With the new vectored IO API, your ORC and Parquet reads will get faster.
2
Agenda:
● Problem and solution.
● New vectored IO api spec.
● Implementation details.
● How to use?
● Benchmarks.
● Current state.
● Upcoming work.
3
seek() hurts
● HDDs: disk rotations generate latency costs
● SSDs: block-by-block reads still best
● Cloud storage like S3 or Azure: cost and latency of GET requests
against new locations
● Prefetching: connector code assumes sequential IO
● ORC/Parquet reads are random IO, but can read columns
independently, with locations inferred from file and stripe metadata
If the format libraries can pass their "read plan" to the client, it can optimize
data retrieval
4
GET requests on remote object store: costly and slow.
IOStatistics for a task while running TPCDS Q53:
counters=(
(action_http_get_request=30) (stream_read_remote_stream_drain.failures=0)
(stream_read_close_operations=10) (stream_read_bytes=2339593)
(stream_read_bytes_backwards_on_seek=16068978) (stream_read_seek_bytes_discarded=16146)
(action_file_opened=10) (stream_read_opened=30) (stream_read_fully_operations=50)
(stream_read_seek_backward_operations=5) (stream_read_closed=30)
(stream_read_seek_operations=40) (stream_read_total_bytes=2685289)
(stream_read_bytes_discarded_in_close=329614) (stream_write_bytes=2751))
means=((action_file_opened.mean=(samples=10, sum=414, mean=41.4000))
(action_http_get_request.mean=(samples=30, sum=2187, mean=72.9000)));
5
ORC
● Postscript with offset of footer
● Footer includes schemas, stripe info
● Index data in stripes includes column
ranges, encryption, and compression details
● Columns can be read with this data
● Predicate pushdown can dynamically
skip columns if the index data shows they
are not needed.
6
Parquet
● 256+ MB Row Groups
● Fixed 8-byte postscript includes a pointer
to the real footer
● Engines read/process subset of columns
within row groups
7
Solution: Hadoop Vectored IO API
● A vectored read API which allows seek-heavy readers to specify multiple
ranges to read.
● Each of the ranges comes back with a CompletableFuture that completes
when the data for that range is read.
● The default implementation reads each range's data into buffers one by one.
● Object store optimizations.
● Inspired by readv()/writev() in the Linux OS.
Works great everywhere, radical benefit in object stores
8
New method in PositionedReadable interface:
public interface PositionedReadable {
default void readVectored(
List<? extends FileRange> ranges,
IntFunction<ByteBuffer> allocate)
throws IOException {
VectoredReadUtils.readVectored(this, ranges, allocate);
}
}
9
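A minimal call-site sketch, assuming the Hadoop 3.3.5 APIs; the path, offsets
and lengths here are illustrative, not from the talk:

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.util.functional.FutureIO;

static void demoVectoredRead(FileSystem fs, Path path) throws Exception {
  List<FileRange> ranges = new ArrayList<>();
  ranges.add(FileRange.createFileRange(0, 100));        // 100 bytes at offset 0
  ranges.add(FileRange.createFileRange(8 * 1024, 512)); // 512 bytes at 8KB
  try (FSDataInputStream in = fs.open(path)) {
    // one call issues all the reads; heap buffers via ByteBuffer::allocate
    in.readVectored(ranges, ByteBuffer::allocate);
    for (FileRange range : ranges) {
      // each range's future completes when its data has been read
      ByteBuffer buffer = FutureIO.awaitFuture(range.getData());
      // ... process buffer ...
    }
  }
}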
Hints for range merging:
● `minSeekForVectorReads()`
The smallest reasonable seek. Two ranges won't be merged if the gap between
the end of the first range and the start of the next is more than this value.
● `maxReadSizeForVectorReads()`
The maximum number of bytes which can be read in one go after merging.
Two ranges won't be merged if the combined data to be read is more than this
value; setting it to 0 disables range merging entirely. A sketch of the merge
rule follows below.
10
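An illustrative sketch of the merge rule implied by these two hints (not the
actual Hadoop coalescing code); it assumes ranges sorted by offset and
non-overlapping:

static boolean shouldMerge(FileRange first, FileRange next,
                           int minSeek, int maxReadSize) {
  long endOfFirst = first.getOffset() + first.getLength();
  long gap = next.getOffset() - endOfFirst;        // bytes we would seek over
  long mergedSize =
      next.getOffset() + next.getLength() - first.getOffset();
  // merge only if the seek is small enough and the combined
  // read stays within the maximum read size
  return gap <= minSeek && mergedSize <= maxReadSize;
}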
Advantages of the new Vectored Read API:
Asynchronous IO: Clients can perform other operations while waiting for the
data for specific ranges.
Parallel processing of results: Clients can process ranges
independently/out of order.
Efficient: A single vectored read call replaces multiple reads.
Performant: Optimizations like range merging, parallel data fetching,
ByteBuffer pooling, and Java direct buffer IO.
11
Implementations:
Default implementation:
Blocking readFully() calls read each range's data into buffers one by one.
This is exactly the same as using the old read APIs (a simplified sketch
follows below).
Local/Checksum FileSystem:
Merges nearby ranges and uses the Java async channel API to read the
combined data into a large byte buffer, then returns sliced smaller buffers for
each user range.
12
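A simplified sketch of the default fallback path; the real code lives in
VectoredReadUtils, this only shows the shape and assumes a heap-buffer
allocator:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;
import org.apache.hadoop.fs.FileRange;
import org.apache.hadoop.fs.PositionedReadable;

static void readVectoredDefault(PositionedReadable stream,
    List<? extends FileRange> ranges,
    IntFunction<ByteBuffer> allocate) {
  for (FileRange range : ranges) {
    CompletableFuture<ByteBuffer> result = new CompletableFuture<>();
    range.setData(result);
    try {
      ByteBuffer buffer = allocate.apply(range.getLength());
      // blocking positioned read straight into the buffer's backing array
      stream.readFully(range.getOffset(), buffer.array(),
          buffer.arrayOffset(), range.getLength());
      result.complete(buffer);
    } catch (IOException e) {
      result.completeExceptionally(e);
    }
  }
}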
S3A parallelized reads of coalesced ranges
13
Features in S3A:
● Supports both direct and heap byte buffers.
● Introduced a new buffer pool with weak referencing for efficient garbage
collection of direct buffers.
● Terminates already running vectored reads when the input stream is closed
or unbuffer() is called.
● Multiple GET requests are issued to a common thread pool.
● IOStatistics to capture Vectored IO metrics.
● If/when s3 GET supports multiple ranges in future, could be done in a
single request.
14
How to use?
public void testVectoredIOEndToEnd() throws Exception {
FileSystem fs = getFileSystem();
List<FileRange> fileRanges = new ArrayList<>();
fileRanges.add(FileRange.createFileRange(8 * 1024, 100));
fileRanges.add(FileRange.createFileRange(14 * 1024, 100));
fileRanges.add(FileRange.createFileRange(10 * 1024, 100));
fileRanges.add(FileRange.createFileRange(2 * 1024 - 101, 100));
fileRanges.add(FileRange.createFileRange(40 * 1024, 1024));
WeakReferencedElasticByteBufferPool pool = new WeakReferencedElasticByteBufferPool();
ExecutorService dataProcessor = Executors.newFixedThreadPool(5);
CountDownLatch countDown = new CountDownLatch(fileRanges.size());
15
try (FSDataInputStream in = fs.open(path(VECTORED_READ_FILE_NAME))) {
in.readVectored(fileRanges, value -> pool.getBuffer(true, value));
for (FileRange res : fileRanges)
dataProcessor.submit(() -> {
try {
readBufferValidateDataAndReturnToPool(pool, res, countDown);
} catch (Exception e) {
LOG.error("Error while processing result for {} ", res, e);
}});
// user can perform other computations while waiting for IO.
if (!countDown.await(100, TimeUnit.SECONDS))
throw new AssertionError("Error while processing vectored io results");
} finally {
pool.release();
HadoopExecutors.shutdown(dataProcessor, LOG, 100, TimeUnit.SECONDS);
}
}
16
private void readBufferValidateDataAndReturnToPool(
ByteBufferPool pool, FileRange res, CountDownLatch countDownLatch)
throws IOException, TimeoutException {
CompletableFuture<ByteBuffer> data = res.getData();
ByteBuffer buffer = FutureIO.awaitFuture(data,
VECTORED_READ_OPERATION_TEST_TIMEOUT_SECONDS,
TimeUnit.SECONDS);
// Read the data and perform custom operation. Here we are just
// validating it with original data.
assertDatasetEquals((int) res.getOffset(), "vecRead",
buffer, res.getLength(), DATASET);
// return buffer to pool.
pool.putBuffer(buffer);
countDownLatch.countDown();
}
17
Is it faster? Oh yes*.
● JMH benchmark for local fs is 7x faster.
Hive queries with ORC data in S3:
● TPCH => 10-20% reduced execution time.
● TPCDS => 20-40% reduced execution time.
● Synthetic ETL benchmark => 50-70% reduced execution time.
* your results may vary. No improvements for CSV
18
JMH benchmarks
We initially implemented the new vectored IO API in the RawLocal and
Checksum filesystems and wrote a JMH benchmark to compare performance
across:
● Raw and Checksum file systems.
● Direct and heap buffers.
● Vectored read vs. the traditional sync read API vs. Java async file channel
read.
Data is generated locally and stored on local disk.
19
JMH benchmarks result
Benchmark bufferKind fileSystem Mode # Score Error Units
VectoredReadBenchmark.asyncFileChanArray direct N/A avgt 20 1448.505 ± 140.251 us/op
VectoredReadBenchmark.asyncFileChanArray array N/A avgt 20 705.741 ± 10.574 us/op
VectoredReadBenchmark.asyncRead direct checksum avgt 20 1690.764 ± 306.537 us/op
VectoredReadBenchmark.asyncRead direct raw avgt 20 1562.085 ± 242.231 us/op
VectoredReadBenchmark.asyncRead array checksum avgt 20 888.708 ± 13.691 us/op
VectoredReadBenchmark.asyncRead array raw avgt 20 810.721 ± 13.284 us/op
VectoredReadBenchmark.syncRead (array) checksum avgt 20 12627.855 ± 439.369 us/op
VectoredReadBenchmark.syncRead (array) raw avgt 20 1783.479 ± 161.927 us/op
20
JMH benchmark result observations
● The current code (traditional read API) is by far the slowest, and using
the Java native async file channel is the fastest.
● Reading in the raw FS is almost on par with the Java native async file
channel. This was expected, as the vectored read implementation of the raw
FS uses the Java async channel to read data.
● For the checksum FS, vectored read is almost 7x faster than the
traditional sync read.
21
Hive with ORC data in S3:
We modified the ORC file format library to use the new API and ran Hive
TPCH, TPCDS, and synthetic ETL benchmarks.
All applications using this format will automatically get the speedups; no
explicit changes are required.
Special mention to Owen O'Malley @linkedin who co-authored ORC and
designed the initial vectored IO API.
22
Hive TPCH Benchmark:
23
NOTE: Q11 has a small overall runtime (just a few seconds), so it can be neglected
24
But the improvements were not dramatic:
TPCH benchmarks are too short to exercise the full vectored IO improvements.
We created a synthetic benchmark to test vectored IO and mimic ETL-style
scenarios. In this benchmark we created wide string tables with large data
reads, and queries that generate up to 33 different ranges on the same stripe,
asking for all 33 ranges to be read in parallel.
25
26
Hive ETL benchmark result:
● 50-70 percent reduction in mapper runtime with vectored IO.
● More performance gains in long-running queries, because vectored IO fills
each buffer in parallel.
● Queries that read more data show more speedup.
27
Hive TPCDS Benchmark
28
NOTE: Q74 has a small overall runtime (just a few seconds), so it can be neglected for the
sake of comparison; Tez container release time dominated.
29
Current State:
● API plus default, local, and S3A filesystem implementations in the upcoming
Hadoop 3.3.5 release (HADOOP-18103).
30
Upcoming work:
● ORC support in ORC-1251.
● Parquet support in PARQUET-2171. Running Spark benchmarks.
● More Hive benchmarks to figure out the best default values for minSeek
and maxReadSize.
● Load tests to understand more about memory and CPU
consumption.
● ABFS and GCS implementation of Vectored IO.
31
Summary
● Vectored IO APIs can radically improve processing time of columnar data
stored in the cloud.
● ORC and Parquet libraries are the sole places we need to deliver this
speedup
● API in forthcoming Hadoop release
32
Call to Action: Everyone
1. Upgrade to the latest versions of Hadoop
2. And to the ORC and Parquet libraries which use the new API
3. Enjoy the speedups!
33
Call to Action: Apache code contributors
● Make Hadoop 3.3.4 the minimum release; ideally 3.3.5
● If you can't move to 3.3.5, help with
HADOOP-18287. Provide a shim library for modern FS APIs
● Use IOStatistics APIs to print/collect/aggregate IO Statistics
● Use in your application.
● Extend API implementations to other stores.
● Test everywhere!
34
Thanks to everyone involved.
● Owen
● Steve
● Rajesh
● Harshit
● Shwetha
● Mehakmeet
● And many other folks from community.
35
Q/A?
36
Backup Slides
37
You said "Cheaper"?
● Range coalescing: fewer GET requests
● No wasted GETs on prefetch or back-to-back GETs
● Speedup in execution reduces life/cost of VM clusters
● Or even reduces cluster size requirements
38
Spark support?
Comes automatically with vectored IO releases of the ORC/Parquet and Hadoop
libraries.
No changes to Spark needed at all!
Tip: the IOStatistics API gives stats on the operations; InputStream.toString()
prints them (see the sketch below).
39
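A sketch of collecting those stats from an open stream, assuming the
hadoop-common statistics helper classes; in is an open FSDataInputStream:

import org.apache.hadoop.fs.statistics.IOStatistics;
import org.apache.hadoop.fs.statistics.IOStatisticsSupport;
import static org.apache.hadoop.fs.statistics.IOStatisticsLogging.ioStatisticsToPrettyString;

IOStatistics stats = IOStatisticsSupport.retrieveIOStatistics(in);
if (stats != null) {                      // null if the stream exposes no stats
  System.out.println(ioStatisticsToPrettyString(stats));
}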
Apache Avro
sequential file reading/processing ==> no benefit from use of the API
AVRO-3594: FsInput to use openFile() API
fileSystem.openFile(path)
.opt("fs.option.openfile.read.policy", "adaptive")
.withFileStatus(status)
.build()
Eliminates HEAD on s3a/abfs open; read policy of "sequential unless we seek()".
(GCS support of openFile() enhancements straightforward...)
40
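Note that build() returns a future; a hedged completion of the snippet above,
assuming the Hadoop 3.3+ APIs:

// build() returns a CompletableFuture<FSDataInputStream>;
// await it before reading.
FSDataInputStream in = FutureIO.awaitFuture(
    fileSystem.openFile(path)
        .opt("fs.option.openfile.read.policy", "adaptive")
        .withFileStatus(status)
        .build());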
Apache Iceberg
● Iceberg manifest file IO is independent of the format of the data processed
● If the ORC/Parquet library uses vectored IO, query processing improves
● As Iceberg improves query planning time (no directory scans), speedups as a
percentage of query time *may* improve.
● Avro speedups (AVRO-3594, ...) do improve manifest read time.
● Even better: PR#4518: Provide mechanism to cache manifest file content
41
Apache Impala
Impala already has a high performance reader for HDFS, S3 and ABFS:
● Multiple input streams reading stripes of the same file in different threads
in parallel.
● Uses unbuffer() to release all client-side resources.
● Caches the existing unbuffered input streams for the next workers.
42
HDFS Support
● The default PositionedReadable implementation works well, as the latency of
seek()/readFully() is not as bad as for cloud storage.
● Parallel block reads could deliver speedups if Datanode IPC supported them.
● Ozone makes a better target.
● No active work here - yet.
43
Azure and GCS storage
Azure storage through ABFS
● Async 2MB prefetching works OK for Parquet column reads
● Parallel range reads would reduce latency
Google GCS through gs://
● gs in random/adaptive mode already prefetches and caches the footer
● Vectored IO adds parallel column reads
● Plus openFile().withFileStatus()/seek policy for Avro and Iceberg.
Help needed here!
44
CSV
Just stop it!
Especially schema inference in Spark, which duplicates the entire read
val csvDF = spark.read.format("csv")
.option("inferSchema", "true") /* please read my data twice! */
.load("s3a://bucket/data/csvdata.csv")
Use Avro for exchange; ORC/Parquet for tables
45
HADOOP-18028. S3A input stream with prefetching
● Pinterest code being incorporated into hadoop-trunk
● AWS S3 team leading this work
● Block prefetch; cache to memory or local FS
● Caches the entire object into RAM if small enough (nice!)
● No changes to libraries needed for the benefits
● Good for CSV, Avro, distcp, small files.
Stabilization is needed for scale issues, especially in apps (Hive, Spark, Impala)
with many threads.
It will not deliver the same targeted performance improvements as explicit range
prefetching, but it is easier to adopt.
46
HADOOP-18287. Provide a shim library for FS APIs
● Library to allow access to the modern APIs for applications which still need to
work with older 3.2+ Hadoop releases.
● Invoke new APIs on recent releases
● Downgrade to older APIs on older Hadoop releases
Applications which use this shim library will work on older releases but get the
speedups on the new ones.
47
S3A configurations to tune:
<property>
<name>fs.s3a.vectored.read.min.seek.size</name>
<value>4K</value>
<description>
What is the smallest reasonable seek in bytes such
that we group ranges together during vectored
read operation.
</description>
</property>
48
<property>
<name>fs.s3a.vectored.read.max.merged.size</name>
<value>1M</value>
<description>
What is the largest merged read size in bytes such
that we group ranges together during vectored read.
Setting this value to 0 will disable merging of ranges.
</description>
</property>
49
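The same keys can be set programmatically before the filesystem is created;
a sketch, where the values are examples rather than recommended defaults:

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// group ranges whose gap is at most 8K
conf.set("fs.s3a.vectored.read.min.seek.size", "8K");
// never merge beyond 2M of combined data
conf.set("fs.s3a.vectored.read.max.merged.size", "2M");
// FileSystem instances created from this conf pick up the settings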