SlideShare a Scribd company logo
1 of 39
Steve Loughran
stevel@cloudera.com
github: steveloughran
Hadoop Vectored IO:
your data just got faster!
Steve Loughran
 Long-standing open source developer
 Hadoop Committer
 Cloud Connectors (s3a, abfs…)
 PB of data goes through my code every
day
 I get to care when it doesn't or just
takes too long!
 Mountain Biking (+climbing)
Problem: reading large data from cloud storage is slow
● More time waiting for queries.
● Longer cluster lifespans or more VMs
● Bigger bills from AWS, Microsoft, Google.
● Worst case: ETL jobs take so long they can't keep up
Vectored IO transforms ORC and Parquet read times in cloud
3
seek() is a concept from Hard Disks: its time is done
4
● HDDs: seeking was a physical operation
● SSDs: block-by-block reads; you can read > 1 block in parallel.
● Cloud storage: S3, Azure, GCS: latency of GET requests
● Prefetching? Great for sequential IO
● ORC/Parquet reads are considered "random IO", but there is nothing
random about it
If the format readers can pass their "read plan" to the
filesystem client, it can optimise data retrieval
5
What do you mean
"random" IO?
ORC
1. Fixed length postscript bytes contain
offset of footer
2. Footer includes schemas, stripe info
3. Index data in stripes include column
ranges, encryption, compression
4. Column stripes of hundreds of MB
5. Columns can be read with this data
6. Predicate pushdown can dynamically
skip columns if index data shows it is not
needed.
6
Parquet
● Thrift-based columnar format
● 256+ MB Row Groups
● Fixed 8 byte postscript includes
pointer to real footer
● Footer has schema + metadata of
every column in every row group
● Query engines process subset of
columns within row groups
7
Columnar formats need Hadoop Vectored IO
● Inspired from readv()/writev() from Linux.
● Scatter/Gather API where callers request 1+ file range.
● Each range mapped to a CompletableFuture
● Default implementation: sort ranges + sequential
readFully(offset, range) into buffers
● Object store optimizations.
Works great everywhere; radical benefit in object
stores 8
New method in PositionedReadable interface:
public interface PositionedReadable {
readVectored(
List<? extends FileRange> ranges,
IntFunction<ByteBuffer> allocator)
throws IOException
}
9
default implementation: available everywhere
Asynchronous: Clients can perform other operations while waiting for
data for specific ranges.
Parallel Processing of results: Clients can process ranges
independently/Out of Order
Efficient: A single vectored read call will replace multiple reads.
Performant: Optimizations like range merging, parallel data fetching,
bytebuffer pooling and java direct buffer IO.
10
Advantages
Implementations
Default implementation:
Blocking readFully() calls for each range.
Same performance as the classic API.
Local/Checksum FileSystem:
Java NIO Async Channel API
Reads directly into on-heap or off-heap ByteBuffers
Parallel read with zero-kernel-copy, DMA, memory-mapped SSD...
11
S3A: parallelized GETs
12
S3A parallel GET requests
● Multiple GET requests issued into common thread pool.
● Set in fs.s3a.vectored.active.ranged.reads; default = 4
● Results returned as soon as their data is received
● If/when S3 supports multiple ranges in GET, could be done in a single
request.
13
How to use?
Path p = path("s3a://bucket/table1/file.orc");
FileSystem fs = FileSystem(p.getUri(), conf);
List<FileRange> ranges = new ArrayList<>();
ranges.add(createFileRange(8 * 1024, 100));
ranges.add(createFileRange(14 * 1024, 100));
ranges.add(createFileRange(10 * 1024, 100));
ranges.add(createFileRange(2 * 1024 - 101, 100));
ranges.add(createFileRange(40 * 1024, 1024));
14
// open file
try (FSDataInputStream in = fs.open(p)) {
// initiate vectored read
in.readVectored(ranges, size -> pool.getBuffer(false, size));
// process results in worker pool
for (FileRange range : ranges)
dataProcessor.submit(() ->
range.getData().thenAccept(buffer -> {
/* process */
pool.putBuffer(buffer);
countDownLatch.countDown();
15
Is it faster? Oh yes*.
Java Microbench benchmark for local storage is 7x faster.
Hive queries with ORC data in S3.
● TPCH ⇒ 10-20% reduced execution time.
● TPCDS ⇒ 20-40% reduced execution time.
● Synthetic ETL benchmark ⇒ 50-70% reduced execution time.
* your results may vary. No improvements for CSV or Avro
16
Java Microbench Harness
Native IO implementation of vectored IO API in RawLocal and Checksum
filesystem +
JMH benchmark to compare the performance across
● FS: [Raw, Checksum file system].
● Buffers: [Direct, Heap]
● read: [seek() + read(), Vectored, java async file channel read]
see: org.apache.hadoop.benchmark.VectoredReadBenchmark
17
JMH benchmarks result
Benchmark buffer fileSystem Score Error
syncRead (array) checksum 12627.855 ± 439.369
syncRead (array) raw 1783.479 ± 161.927
asyncFileChanArray direct N/A 1448.505 ± 140.251
asyncFileChanArray array N/A 705.741 ± 10.574
asyncRead direct checksum 1690.764 ± 306.537
asyncRead direct raw 1562.085 ± 242.231
asyncRead array checksum 888.708 ± 13.691
asyncRead array raw 810.721 ± 13.284 18
=7x faster
Hive with ORC data in S3:
ORC library updated to use to the new API when reading stripes
All applications using this format automatically get the speedups.
Special mention to Owen O'Malley
20
21
Hive TPCH Benchmark:
10-20% decrease in query runtimes
NOTE: Q11 has a small runtime overall ( just few secs) so can be neglected
22
Hive TPCDS Benchmark @ 300GB
23
NOTE: Q74 has a small runtime overall ( just few secs) so can be neglected for the sake of
comparison. Tez container release time dominated.
24
Synthetic benchmark of ETL & vectored IO
TPCH benchmarks are too short to show full vectored IO improvements.
Synthetic benchmark to test vectored IO and mimic ETL based scenarios.
Wide string tables with large data reads.
Queries generate up to 33 different ranges on the same stripe
25
26
Synthetic benchmark of ETL & vectored IO
Hive ETL benchmark result:
27
● 50-70 percent reduction in the mapper runtime with vectored IO.
● More performance gains in long running queries. This is because
vectored IO fills the data in each buffer in parallel.
● Queries which reads more data shows more speedup.
Shipping in Hadoop 3.3.5!
● Available in Hadoop 3.3.5
● file:// has local filesystem NIO implementation
● S3A does parallel GETs per input stream
● Fallback implementation for everything else
28
Azure and GCS storage: help needed
Azure storage through ABFS
● Async 2MB prefetching doesn't do much for columnar formats
● Parallel range read would reduce latency
Google GCS through gs://
● random/adaptive policies do prefetch and cache footer already
● Vectored IO adds parallel column reads
● +openFile() supports status and seek policy for Avro and Iceberg.
29
30
Problem: the format libraries are stuck
Update the format libraries & the apps get the speedup automatically
● ORC-1251. Support VectoredIO in ORC
● PARQUET-2171. Implement vectored IO in parquet file format
● +Iceberg bundles its own shaded Parquet lib
How to get this into libraries that still want to support Hadoop 2.x?
1. Cloudera: we can ship our own consistent set of libraries.
2. WiP: https://github.com/apache/hadoop-api-shim
3. For now: apply our patches yourself
Conclusions
● Vectored IO APIs radically improve processing time of columnar data stored in
cloud and local SSD.
● Available in Hadoop 3.3.5
● Azure and GCS connectors next targets
● We need to get support into ORC and Parquet libraries
● We need to work together to get support into the ASF releases
31
Questions?
32
Backup Slides
33
Apache Iceberg
● Iceberg manifest file IO independent of format of data processed
● If ORC/Parquet library uses Vector IO, query processing improvements
● As Iceberg improves query planning time (no directory scans), speedups as a
percentage of query time *may* improve.
● Avro speedups (AVRO-3594, ...) do improve manifest read time.
● Even better: PR#4518: Provide mechanism to cache manifest file content
34
Apache Avro
sequential file reading/processing ==> no benefit from use of the API
AVRO-3594: FsInput to use openFile() API
fileSystem.openFile(path)
.opt("fs.option.openfile.read.policy", "adaptive")
.withFileStatus(status)
.build()
Eliminates HEAD on s3a/abfs open; read policy of "sequential unless we seek()".
(GCS support of openFile() enhancements straightforward...)
35
Spark support?
Comes automatically with Vector IO releases of ORC/Parquet and Iceberg
dependencies
Spark master branch on Hadoop 3.3.5, so ready to play
36
Apache Impala
Impala already has a high performance reader for HDFS, S3 and ABFS:
● Multiple input streams reading stripes on same file in different thread.
● Uses unbuffer() to release all client side resources.
● Caches the existing unbuffered input streams for next workers.
37
HDFS Support
● Default PositionedReadable implementation works well as latency of
seek()/readFully() not as bad as for cloud storage.
● Parallel block reads could deliver speedups if Datanode IPC supported this.
● Ozone makes a better target
● No active work here -yet
38
CSV
Just stop it!
Especially schema inference in Spark, which duplicates entire read
val csvDF = spark.read.format("csv")
.option("inferSchema", "true") /* please read my data twice! */
.load("s3a://bucket/data/csvdata.csv")
Use Avro for exchange; ORC/Parquet for tables
39
HADOOP-18287. Provide a shim library for FS APIs
● Library to allow access to the modern APIs for applications which still need to
work with older 3.2+ Hadoop releases.
● Invoke new APIs on recent releases
● Downgrade to older APIs on older Hadoop releases
Applications which use this shim library will work on older releases but get the
speedups on the new ones.
40

More Related Content

Similar to Hadoop Vectored IO

Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Odinot Stanislas
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...javier ramirez
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedwhoschek
 
LCA13: Hadoop DFS Performance
LCA13: Hadoop DFS PerformanceLCA13: Hadoop DFS Performance
LCA13: Hadoop DFS PerformanceLinaro
 
Scale your Alfresco Solutions
Scale your Alfresco Solutions Scale your Alfresco Solutions
Scale your Alfresco Solutions Alfresco Software
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Junping Du
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateDataWorks Summit
 
Datastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobsDatastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobsshanker_uma
 
My Sql Performance In A Cloud
My Sql Performance In A CloudMy Sql Performance In A Cloud
My Sql Performance In A CloudSky Jian
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Guy Harrison
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Lucidworks
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Copper: A high performance workflow engine
Copper: A high performance workflow engineCopper: A high performance workflow engine
Copper: A high performance workflow enginedmoebius
 
Apache Spark and Object Stores
Apache Spark and Object StoresApache Spark and Object Stores
Apache Spark and Object StoresSteve Loughran
 
Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store Ceph Community
 

Similar to Hadoop Vectored IO (20)

Ceph
CephCeph
Ceph
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
LCA13: Hadoop DFS Performance
LCA13: Hadoop DFS PerformanceLCA13: Hadoop DFS Performance
LCA13: Hadoop DFS Performance
 
Scale your Alfresco Solutions
Scale your Alfresco Solutions Scale your Alfresco Solutions
Scale your Alfresco Solutions
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
Datastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobsDatastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobs
 
HDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the CloudHDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the Cloud
 
Cnam azure 2015 storage
Cnam azure 2015  storageCnam azure 2015  storage
Cnam azure 2015 storage
 
My Sql Performance In A Cloud
My Sql Performance In A CloudMy Sql Performance In A Cloud
My Sql Performance In A Cloud
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
H5Coro: The Cloud-Optimized Read-Only Library
H5Coro: The Cloud-Optimized Read-Only LibraryH5Coro: The Cloud-Optimized Read-Only Library
H5Coro: The Cloud-Optimized Read-Only Library
 
Copper: A high performance workflow engine
Copper: A high performance workflow engineCopper: A high performance workflow engine
Copper: A high performance workflow engine
 
Apache Spark and Object Stores
Apache Spark and Object StoresApache Spark and Object Stores
Apache Spark and Object Stores
 
Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store
 

More from Steve Loughran

The age of rename() is over
The age of rename() is overThe age of rename() is over
The age of rename() is overSteve Loughran
 
What does Rename Do: (detailed version)
What does Rename Do: (detailed version)What does Rename Do: (detailed version)
What does Rename Do: (detailed version)Steve Loughran
 
Put is the new rename: San Jose Summit Edition
Put is the new rename: San Jose Summit EditionPut is the new rename: San Jose Summit Edition
Put is the new rename: San Jose Summit EditionSteve Loughran
 
@Dissidentbot: dissent will be automated!
@Dissidentbot: dissent will be automated!@Dissidentbot: dissent will be automated!
@Dissidentbot: dissent will be automated!Steve Loughran
 
PUT is the new rename()
PUT is the new rename()PUT is the new rename()
PUT is the new rename()Steve Loughran
 
Extreme Programming Deployed
Extreme Programming DeployedExtreme Programming Deployed
Extreme Programming DeployedSteve Loughran
 
What does rename() do?
What does rename() do?What does rename() do?
What does rename() do?Steve Loughran
 
Dancing Elephants: Working with Object Storage in Apache Spark and Hive
Dancing Elephants: Working with Object Storage in Apache Spark and HiveDancing Elephants: Working with Object Storage in Apache Spark and Hive
Dancing Elephants: Working with Object Storage in Apache Spark and HiveSteve Loughran
 
Apache Spark and Object Stores —for London Spark User Group
Apache Spark and Object Stores —for London Spark User GroupApache Spark and Object Stores —for London Spark User Group
Apache Spark and Object Stores —for London Spark User GroupSteve Loughran
 
Spark Summit East 2017: Apache spark and object stores
Spark Summit East 2017: Apache spark and object storesSpark Summit East 2017: Apache spark and object stores
Spark Summit East 2017: Apache spark and object storesSteve Loughran
 
Hadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object StoresHadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object StoresSteve Loughran
 
Household INFOSEC in a Post-Sony Era
Household INFOSEC in a Post-Sony EraHousehold INFOSEC in a Post-Sony Era
Household INFOSEC in a Post-Sony EraSteve Loughran
 
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 editionHadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 editionSteve Loughran
 
Hadoop and Kerberos: the Madness Beyond the Gate
Hadoop and Kerberos: the Madness Beyond the GateHadoop and Kerberos: the Madness Beyond the Gate
Hadoop and Kerberos: the Madness Beyond the GateSteve Loughran
 
Slider: Applications on YARN
Slider: Applications on YARNSlider: Applications on YARN
Slider: Applications on YARNSteve Loughran
 
Overview of slider project
Overview of slider projectOverview of slider project
Overview of slider projectSteve Loughran
 

More from Steve Loughran (20)

The age of rename() is over
The age of rename() is overThe age of rename() is over
The age of rename() is over
 
What does Rename Do: (detailed version)
What does Rename Do: (detailed version)What does Rename Do: (detailed version)
What does Rename Do: (detailed version)
 
Put is the new rename: San Jose Summit Edition
Put is the new rename: San Jose Summit EditionPut is the new rename: San Jose Summit Edition
Put is the new rename: San Jose Summit Edition
 
@Dissidentbot: dissent will be automated!
@Dissidentbot: dissent will be automated!@Dissidentbot: dissent will be automated!
@Dissidentbot: dissent will be automated!
 
PUT is the new rename()
PUT is the new rename()PUT is the new rename()
PUT is the new rename()
 
Extreme Programming Deployed
Extreme Programming DeployedExtreme Programming Deployed
Extreme Programming Deployed
 
Testing
TestingTesting
Testing
 
I hate mocking
I hate mockingI hate mocking
I hate mocking
 
What does rename() do?
What does rename() do?What does rename() do?
What does rename() do?
 
Dancing Elephants: Working with Object Storage in Apache Spark and Hive
Dancing Elephants: Working with Object Storage in Apache Spark and HiveDancing Elephants: Working with Object Storage in Apache Spark and Hive
Dancing Elephants: Working with Object Storage in Apache Spark and Hive
 
Apache Spark and Object Stores —for London Spark User Group
Apache Spark and Object Stores —for London Spark User GroupApache Spark and Object Stores —for London Spark User Group
Apache Spark and Object Stores —for London Spark User Group
 
Spark Summit East 2017: Apache spark and object stores
Spark Summit East 2017: Apache spark and object storesSpark Summit East 2017: Apache spark and object stores
Spark Summit East 2017: Apache spark and object stores
 
Hadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object StoresHadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object Stores
 
Household INFOSEC in a Post-Sony Era
Household INFOSEC in a Post-Sony EraHousehold INFOSEC in a Post-Sony Era
Household INFOSEC in a Post-Sony Era
 
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 editionHadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
Hadoop and Kerberos: the Madness Beyond the Gate: January 2016 edition
 
Hadoop and Kerberos: the Madness Beyond the Gate
Hadoop and Kerberos: the Madness Beyond the GateHadoop and Kerberos: the Madness Beyond the Gate
Hadoop and Kerberos: the Madness Beyond the Gate
 
Slider: Applications on YARN
Slider: Applications on YARNSlider: Applications on YARN
Slider: Applications on YARN
 
YARN Services
YARN ServicesYARN Services
YARN Services
 
Datacentre stack
Datacentre stackDatacentre stack
Datacentre stack
 
Overview of slider project
Overview of slider projectOverview of slider project
Overview of slider project
 

Recently uploaded

MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...Jittipong Loespradit
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...SelfMade bd
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrainmasabamasaba
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...masabamasaba
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in sowetomasabamasaba
 
WSO2Con2024 - Hello Choreo Presentation - Kanchana
WSO2Con2024 - Hello Choreo Presentation - KanchanaWSO2Con2024 - Hello Choreo Presentation - Kanchana
WSO2Con2024 - Hello Choreo Presentation - KanchanaWSO2
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benonimasabamasaba
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...masabamasaba
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburgmasabamasaba
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park masabamasaba
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareJim McKeeth
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...Shane Coughlan
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...masabamasaba
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...masabamasaba
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplatePresentation.STUDIO
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2
 

Recently uploaded (20)

MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
WSO2Con2024 - Hello Choreo Presentation - Kanchana
WSO2Con2024 - Hello Choreo Presentation - KanchanaWSO2Con2024 - Hello Choreo Presentation - Kanchana
WSO2Con2024 - Hello Choreo Presentation - Kanchana
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 

Hadoop Vectored IO

  • 1. Steve Loughran stevel@cloudera.com github: steveloughran Hadoop Vectored IO: your data just got faster!
  • 2. Steve Loughran  Long-standing open source developer  Hadoop Committer  Cloud Connectors (s3a, abfs…)  PB of data goes through my code every day  I get to care when it doesn't or just takes too long!  Mountain Biking (+climbing)
  • 3. Problem: reading large data from cloud storage is slow ● More time waiting for queries. ● Longer cluster lifespans or more VMs ● Bigger bills from AWS, Microsoft, Google. ● Worst case: ETL jobs take so long they can't keep up Vectored IO transforms ORC and Parquet read times in cloud 3
  • 4. seek() is a concept from Hard Disks: its time is done 4 ● HDDs: seeking was a physical operation ● SSDs: block-by-block reads; you can read > 1 block in parallel. ● Cloud storage: S3, Azure, GCS: latency of GET requests ● Prefetching? Great for sequential IO ● ORC/Parquet reads are considered "random IO", but there is nothing random about it If the format readers can pass their "read plan" to the filesystem client, it can optimise data retrieval
  • 5. 5 What do you mean "random" IO?
  • 6. ORC 1. Fixed length postscript bytes contain offset of footer 2. Footer includes schemas, stripe info 3. Index data in stripes include column ranges, encryption, compression 4. Column stripes of hundreds of MB 5. Columns can be read with this data 6. Predicate pushdown can dynamically skip columns if index data shows it is not needed. 6
  • 7. Parquet ● Thrift-based columnar format ● 256+ MB Row Groups ● Fixed 8 byte postscript includes pointer to real footer ● Footer has schema + metadata of every column in every row group ● Query engines process subset of columns within row groups 7
  • 8. Columnar formats need Hadoop Vectored IO ● Inspired from readv()/writev() from Linux. ● Scatter/Gather API where callers request 1+ file range. ● Each range mapped to a CompletableFuture ● Default implementation: sort ranges + sequential readFully(offset, range) into buffers ● Object store optimizations. Works great everywhere; radical benefit in object stores 8
  • 9. New method in PositionedReadable interface: public interface PositionedReadable { readVectored( List<? extends FileRange> ranges, IntFunction<ByteBuffer> allocator) throws IOException } 9 default implementation: available everywhere
  • 10. Asynchronous: Clients can perform other operations while waiting for data for specific ranges. Parallel Processing of results: Clients can process ranges independently/Out of Order Efficient: A single vectored read call will replace multiple reads. Performant: Optimizations like range merging, parallel data fetching, bytebuffer pooling and java direct buffer IO. 10 Advantages
  • 11. Implementations Default implementation: Blocking readFully() calls for each range. Same performance as the classic API. Local/Checksum FileSystem: Java NIO Async Channel API Reads directly into on-heap or off-heap ByteBuffers Parallel read with zero-kernel-copy, DMA, memory-mapped SSD... 11
  • 13. S3A parallel GET requests ● Multiple GET requests issued into common thread pool. ● Set in fs.s3a.vectored.active.ranged.reads; default = 4 ● Results returned as soon as their data is received ● If/when S3 supports multiple ranges in GET, could be done in a single request. 13
  • 14. How to use? Path p = path("s3a://bucket/table1/file.orc"); FileSystem fs = FileSystem(p.getUri(), conf); List<FileRange> ranges = new ArrayList<>(); ranges.add(createFileRange(8 * 1024, 100)); ranges.add(createFileRange(14 * 1024, 100)); ranges.add(createFileRange(10 * 1024, 100)); ranges.add(createFileRange(2 * 1024 - 101, 100)); ranges.add(createFileRange(40 * 1024, 1024)); 14
  • 15. // open file try (FSDataInputStream in = fs.open(p)) { // initiate vectored read in.readVectored(ranges, size -> pool.getBuffer(false, size)); // process results in worker pool for (FileRange range : ranges) dataProcessor.submit(() -> range.getData().thenAccept(buffer -> { /* process */ pool.putBuffer(buffer); countDownLatch.countDown(); 15
  • 16. Is it faster? Oh yes*. Java Microbench benchmark for local storage is 7x faster. Hive queries with ORC data in S3. ● TPCH ⇒ 10-20% reduced execution time. ● TPCDS ⇒ 20-40% reduced execution time. ● Synthetic ETL benchmark ⇒ 50-70% reduced execution time. * your results may vary. No improvements for CSV or Avro 16
  • 17. Java Microbench Harness Native IO implementation of vectored IO API in RawLocal and Checksum filesystem + JMH benchmark to compare the performance across ● FS: [Raw, Checksum file system]. ● Buffers: [Direct, Heap] ● read: [seek() + read(), Vectored, java async file channel read] see: org.apache.hadoop.benchmark.VectoredReadBenchmark 17
  • 18. JMH benchmarks result Benchmark buffer fileSystem Score Error syncRead (array) checksum 12627.855 ± 439.369 syncRead (array) raw 1783.479 ± 161.927 asyncFileChanArray direct N/A 1448.505 ± 140.251 asyncFileChanArray array N/A 705.741 ± 10.574 asyncRead direct checksum 1690.764 ± 306.537 asyncRead direct raw 1562.085 ± 242.231 asyncRead array checksum 888.708 ± 13.691 asyncRead array raw 810.721 ± 13.284 18 =7x faster
  • 19. Hive with ORC data in S3: ORC library updated to use to the new API when reading stripes All applications using this format automatically get the speedups. Special mention to Owen O'Malley 20
  • 20. 21 Hive TPCH Benchmark: 10-20% decrease in query runtimes
  • 21. NOTE: Q11 has a small runtime overall ( just few secs) so can be neglected 22
  • 22. Hive TPCDS Benchmark @ 300GB 23
  • 23. NOTE: Q74 has a small runtime overall ( just few secs) so can be neglected for the sake of comparison. Tez container release time dominated. 24
  • 24. Synthetic benchmark of ETL & vectored IO TPCH benchmarks are too short to show full vectored IO improvements. Synthetic benchmark to test vectored IO and mimic ETL based scenarios. Wide string tables with large data reads. Queries generate up to 33 different ranges on the same stripe 25
  • 25. 26 Synthetic benchmark of ETL & vectored IO
  • 26. Hive ETL benchmark result: 27 ● 50-70 percent reduction in the mapper runtime with vectored IO. ● More performance gains in long running queries. This is because vectored IO fills the data in each buffer in parallel. ● Queries which reads more data shows more speedup.
  • 27. Shipping in Hadoop 3.3.5! ● Available in Hadoop 3.3.5 ● file:// has local filesystem NIO implementation ● S3A does parallel GETs per input stream ● Fallback implementation for everything else 28
  • 28. Azure and GCS storage: help needed Azure storage through ABFS ● Async 2MB prefetching doesn't do much for columnar formats ● Parallel range read would reduce latency Google GCS through gs:// ● random/adaptive policies do prefetch and cache footer already ● Vectored IO adds parallel column reads ● +openFile() supports status and seek policy for Avro and Iceberg. 29
  • 29. 30 Problem: the format libraries are stuck Update the format libraries & the apps get the speedup automatically ● ORC-1251. Support VectoredIO in ORC ● PARQUET-2171. Implement vectored IO in parquet file format ● +Iceberg bundles its own shaded Parquet lib How to get this into libraries that still want to support Hadoop 2.x? 1. Cloudera: we can ship our own consistent set of libraries. 2. WiP: https://github.com/apache/hadoop-api-shim 3. For now: apply our patches yourself
  • 30. Conclusions ● Vectored IO APIs radically improve processing time of columnar data stored in cloud and local SSD. ● Available in Hadoop 3.3.5 ● Azure and GCS connectors next targets ● We need to get support into ORC and Parquet libraries ● We need to work together to get support into the ASF releases 31
  • 33. Apache Iceberg ● Iceberg manifest file IO independent of format of data processed ● If ORC/Parquet library uses Vector IO, query processing improvements ● As Iceberg improves query planning time (no directory scans), speedups as a percentage of query time *may* improve. ● Avro speedups (AVRO-3594, ...) do improve manifest read time. ● Even better: PR#4518: Provide mechanism to cache manifest file content 34
  • 34. Apache Avro sequential file reading/processing ==> no benefit from use of the API AVRO-3594: FsInput to use openFile() API fileSystem.openFile(path) .opt("fs.option.openfile.read.policy", "adaptive") .withFileStatus(status) .build() Eliminates HEAD on s3a/abfs open; read policy of "sequential unless we seek()". (GCS support of openFile() enhancements straightforward...) 35
  • 35. Spark support? Comes automatically with Vector IO releases of ORC/Parquet and Iceberg dependencies Spark master branch on Hadoop 3.3.5, so ready to play 36
  • 36. Apache Impala Impala already has a high performance reader for HDFS, S3 and ABFS: ● Multiple input streams reading stripes on same file in different thread. ● Uses unbuffer() to release all client side resources. ● Caches the existing unbuffered input streams for next workers. 37
  • 37. HDFS Support ● Default PositionedReadable implementation works well as latency of seek()/readFully() not as bad as for cloud storage. ● Parallel block reads could deliver speedups if Datanode IPC supported this. ● Ozone makes a better target ● No active work here -yet 38
  • 38. CSV Just stop it! Especially schema inference in Spark, which duplicates entire read val csvDF = spark.read.format("csv") .option("inferSchema", "true") /* please read my data twice! */ .load("s3a://bucket/data/csvdata.csv") Use Avro for exchange; ORC/Parquet for tables 39
  • 39. HADOOP-18287. Provide a shim library for FS APIs ● Library to allow access to the modern APIs for applications which still need to work with older 3.2+ Hadoop releases. ● Invoke new APIs on recent releases ● Downgrade to older APIs on older Hadoop releases Applications which use this shim library will work on older releases but get the speedups on the new ones. 40

Editor's Notes

  1. Who uses orc in cloud store? Who uses parquet in cloud store? Are there any orc/parquet devs here?
  2. With prefetching we make lot of unnecessary get requests and data gets discarded. SSD reads block by block thus sequential, we can think of parallelizing.
  3. Backwards seek() of footer read switches s3a/gs to random IO mode
  4. It takes a list of file ranges and a method to allocate byte buffers and fills the data in each allocated buffer which can be accessed by client from each file range. Any FileSystem already implementing this interface will automatically have the base implementation of Vectored IO. Thanks to Java 8 default methods
  5. This should improve you local IO for example running spark apps locally.
  6. S3A implementation sort the ranges based on start file offset then coalesces adjacent and "nearby" ranges and then fetches each combined range in separate HTTP GET requests, and then populate the buffers for each user requested ranges. Complete operation for each combined range is executed in parallel using a thread pool.
  7. Improvement in query execution time. And the best thing about this no code change required in hive, spark anywhere. Just upgrade the hadoop and orc/parquet.
  8. The current code ( traditional read api) is by far the slowest and using the Java native async file channel is the fastest. Reading in raw fs is almost at par java native async file channel. This was expected as the vectored read implementation of raw fs uses java async channel to read data. For checksum fs vectored read is almost 7X faster than traditional sync read.
  9. Seeing the Jmh benchmark result we got super motivated and went ahead implementing vectored read api in S3A connector which is widely used to read/write data from AWS S3 cloud store.
  10. We saw a reduction of 10-20% at least on the overall runtime of each query as can be seen from the graph above. Q11 has a small runtime overall so can be neglected for the sake of comparison. TPC-H Composite Query-per-Hour Performance Metric (QphH@Size)
  11. We saw a reduction of 10-20% at least on the overall runtime of each query as can be seen from the graph above. Q11 has a small runtime overall so can be neglected for the sake of comparison.
  12. Scale 250 GB with strings.
  13. i believe there may be someone from the Parquet team in the audience, and they know what I'm talking about here: I made a change to parquet which used a Hadoop 2.0 method and then someone complained about it breaking their deployments. the library dev teams don't want to be the ones dictating your updates. But Hadoop 2.x doesn't work completley on java8, and don't even think about java11+. we need to move on here. And while we do internally, I think it cripples the open source stack.
  14. as an exchange format for data it works well; easy to append to. then you can import to orc, parquet, etc. as it adds its schema to the front (and has a JSON format) it is self describing and can be processed even without precompiling classes off the IDL