ORC File – Optimizing Your Big Data
Owen O'Malley, Co-founder, Hortonworks
Apache Hadoop, Hive, ORC, and Incubator
@owen_omalley
Overview
In the Beginning…
 Hadoop applications used text or SequenceFile
– Text is slow and not splittable when compressed
– SequenceFile only supports a key and a value, with user-defined serialization
 Hive added RCFile
– User controls the columns to read and decompress
– No type information and user-defined serialization
– Finding splits was expensive
 Avro files created
– Type information included!
– Had to read and decompress entire row
ORC File Basics
 Columnar format
– Enables readers to read & decompress just the bytes they need
 Fast
– See https://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet
 Indexed
 Self-describing
– Includes all of the information about types and encoding
 Rich type system
– All of Hive’s types including timestamp, struct, map, list, and union
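As a sketch of that type system from the Java library (the field names are invented for illustration), TypeDescription can parse Hive-style type strings, including the nested types:

```java
import org.apache.orc.TypeDescription;

public class TypeExample {
  public static void main(String[] args) {
    // A schema exercising ORC's nested types; names are illustrative.
    TypeDescription schema = TypeDescription.fromString(
        "struct<ts:timestamp,"
        + "point:struct<x:double,y:double>,"
        + "tags:map<string,string>,"
        + "ids:array<int>,"
        + "choice:uniontype<int,string>>");
    System.out.println(schema);  // prints the parsed type string back
  }
}
```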
File Compatibility
 Backwards compatibility
– Automatically detect the version of the file and read it.
 Forward compatibility
– Most changes are made so old readers will read the new files
– Maintain the ability to write old files via orc.write.format
– Always write old version until your last cluster upgrades
 Current file versions
– 0.11 – Original version
– 0.12 – Updated run length encoding (RLE)
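A minimal sketch of pinning the file version from Java (file name and schema are placeholders); the version writer option plays the same role as orc.write.format:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class CompatWrite {
  public static void main(String[] args) throws Exception {
    // Keep emitting 0.11 files until every reading cluster has upgraded.
    Writer writer = OrcFile.createWriter(new Path("compat.orc"),
        OrcFile.writerOptions(new Configuration())
            .setSchema(TypeDescription.fromString("struct<x:int>"))
            .version(OrcFile.Version.V_0_11));
    writer.close();
  }
}
```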
File Structure
 File contains a list of stripes, which are sets of rows
– Default size is 64MB
– Large stripe size enables efficient reads
 Footer
– Contains the list of stripe locations
– Type description
– File and stripe statistics
 Postscript
– Compression parameters
– File format version
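A minimal sketch of walking this structure with the Java reader (the path comes from the command line); the footer supplies the schema, row count, and stripe list:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.StripeInformation;

public class InspectFile {
  public static void main(String[] args) throws Exception {
    Reader reader = OrcFile.createReader(new Path(args[0]),
        OrcFile.readerOptions(new Configuration()));
    System.out.println("schema: " + reader.getSchema());
    System.out.println("rows:   " + reader.getNumberOfRows());
    // One entry per stripe, read from the footer.
    for (StripeInformation stripe : reader.getStripes()) {
      System.out.println("stripe offset=" + stripe.getOffset()
          + " rows=" + stripe.getNumberOfRows());
    }
  }
}
```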
Stripe Structure
 Indexes
– Offsets to jump to the start of each row group
– Row group size defaults to 10,000 rows
– Minimum, maximum, and count of each column
 Data
– Data for the stripe organized by column
 Footer
– List of stream locations
– Column encoding information
File Layout
[Diagram: three ~64 MB stripes, each containing Index Data, Row Data laid out column by column (Column 1 … Column 8), and a Stripe Footer, followed by the File Metadata, File Footer, and Postscript.]
Schema Evolution
 ORC now supports schema evolution
– Hive 2.1 – append columns or type conversion
– Upcoming Hive 2.3 – map columns or inner structures by name
– User passes desired schema to ORC reader
 Type conversions
– Most types will convert although some are ugly.
– If the value doesn’t fit in the new type, it will become null.
 Cautions
– Name mapping requires ORC files written by Hive ≥ 2.0
– Some of the type conversions are slow
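A hedged sketch of passing the desired schema to the reader, as above (file name, fields, and the assumption that 'age' was appended later are all invented for the example):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;

public class EvolvedRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Reader reader = OrcFile.createReader(new Path("my-file.orc"),
        OrcFile.readerOptions(conf));
    // The schema we want; the file may have been written without 'age'.
    TypeDescription desired = TypeDescription.fromString(
        "struct<name:string,address:string,age:int>");
    RecordReader rows = reader.rows(new Reader.Options(conf).schema(desired));
    VectorizedRowBatch batch = desired.createRowBatch();
    while (rows.nextBatch(batch)) {
      // Columns missing from the file (here possibly 'age') read as nulls.
    }
    rows.close();
  }
}
```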
Using ORC
From Hive or Presto
 Modify your table definition:
– create table my_table (
name string,
address string
) stored as orc;
 Import data:
– insert overwrite table my_table select * from my_staging;
 Use either configuration or table properties
– tblproperties ("orc.compress"="NONE")
– set hive.exec.orc.default.compress=NONE;
From Java
 Use the ORC project rather than Hive’s ORC.
– Hive’s master branch uses it.
– Maven group id: org.apache.orc version: 1.4.0
– nohive classifier avoids interfering with Hive’s packages
 Two levels of access
– orc-core – Faster access, but uses Hive’s vectorized API
– orc-mapreduce – Row by row access, simpler OrcStruct API
 MapReduce API implements WritableComparable
– Can be shuffled
– Need to specify type information in configuration for shuffle or output
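A sketch of the orc-core path with the vectorized API (file name and schema are illustrative); rows are buffered into a VectorizedRowBatch and flushed when the batch fills:

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcWrite {
  public static void main(String[] args) throws Exception {
    TypeDescription schema =
        TypeDescription.fromString("struct<name:string,address:string>");
    Writer writer = OrcFile.createWriter(new Path("my-file.orc"),
        OrcFile.writerOptions(new Configuration()).setSchema(schema));
    VectorizedRowBatch batch = schema.createRowBatch();
    BytesColumnVector name = (BytesColumnVector) batch.cols[0];
    BytesColumnVector address = (BytesColumnVector) batch.cols[1];
    for (int i = 0; i < 10_000; ++i) {
      int row = batch.size++;
      name.setVal(row, ("user-" + i).getBytes(StandardCharsets.UTF_8));
      address.setVal(row, ("street " + i).getBytes(StandardCharsets.UTF_8));
      if (batch.size == batch.getMaxSize()) {  // flush a full batch
        writer.addRowBatch(batch);
        batch.reset();
      }
    }
    if (batch.size != 0) {  // flush the partial final batch
      writer.addRowBatch(batch);
    }
    writer.close();
  }
}
```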
From C++
 Pure C++ client library
– No JNI or JDK, so the client can estimate and control memory
 Combine with pure C++ HDFS client from HDFS-8707
– Work ongoing in feature branch, but should be committed soon.
 Reader is stable and in production use.
 Alibaba has created a writer and is contributing it to Apache ORC.
– Should be in the next release, ORC 1.5.0.
Command Line
 Using hive --orcfiledump from Hive
– -j -p – pretty prints the metadata as JSON
– -d – prints data as JSON
 Using java -jar orc-tools-1.4.0-uber.jar from ORC
– meta – print the metadata as JSON
– data – print data as JSON
– convert – convert JSON to ORC
– json-schema – scan a set of JSON documents to find the matching schema
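For example (paths and file names are placeholders): hive --orcfiledump -j -p /warehouse/my_table/000000_0 pretty-prints one file's metadata, while java -jar orc-tools-1.4.0-uber.jar meta my-file.orc prints the same information via the standalone tools jar.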
Optimization
Stripe Size
 Makes a huge difference in performance
– orc.stripe.size or hive.exec.orc.default.stripe.size
– Controls the amount of buffer memory in the writer. Default is 64MB
– Trade off
• Large stripes = larger, more efficient reads
• Small stripes = less memory and more granular processing splits
 Multiple files written at the same time will shrink stripes
– Use Hive’s hive.optimize.sort.dynamic.partition
– Sorting dynamic partitions means only one writer at a time
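A hedged sketch of setting the stripe size programmatically (file name and schema invented); stripeSize is the writer-option counterpart of orc.stripe.size:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class StripeSizeWrite {
  public static void main(String[] args) throws Exception {
    Writer writer = OrcFile.createWriter(new Path("big.orc"),
        OrcFile.writerOptions(new Configuration())
            .setSchema(TypeDescription.fromString("struct<x:int,y:string>"))
            .stripeSize(64L * 1024 * 1024));  // 64 MB write buffer per stripe
    writer.close();
  }
}
```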
HDFS Block Padding
 The stripes don’t align exactly with HDFS blocks
 HDFS scatters blocks around the cluster
 You often want to pad to block boundaries
– Costs space, but improves performance
– hive.exec.orc.default.block.padding – true
– hive.exec.orc.block.padding.tolerance – 0.05
[Diagram: ~64 MB stripes (Index Data, Row Data, Stripe Footer) laid across HDFS blocks, with padding inserted before a block boundary so that no stripe straddles two blocks; File Metadata, File Footer, and Postscript at the end.]
Predicate Push Down
 Reader is given a SearchArg
– A limited set of predicates over a column and a literal value (see the sketch after this list)
– Reader will skip over any parts of file that can’t contain valid rows
 ORC indexes at three levels:
– File
– Stripe
– Row Group (10k rows)
 Reader still needs to apply predicate to filter out single rows
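A hedged sketch of handing the reader a SearchArg from Java (file and column names follow the TPC-H example later in the deck and are illustrative), using the storage-api SearchArgument builder:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class PushDownRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Reader reader = OrcFile.createReader(new Path("lineitem.orc"),
        OrcFile.readerOptions(conf));
    // Equivalent of: where l_orderkey = 1212000001
    SearchArgument sarg = SearchArgumentFactory.newBuilder()
        .startAnd()
        .equals("l_orderkey", PredicateLeaf.Type.LONG, 1212000001L)
        .end()
        .build();
    RecordReader rows = reader.rows(new Reader.Options(conf)
        .searchArgument(sarg, new String[]{"l_orderkey"}));
    // Row groups whose min/max (or bloom filter) rule out the value are
    // skipped; the surviving rows must still be filtered by the caller.
    rows.close();
  }
}
```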
Row Pruning
 Every primitive column has minimum and maximum at each level
– Sorting your data within a file helps a lot
– Consider sorting instead of making lots of partitions
 Writer can optionally include bloom filters (see the sketch after this list)
– Provides a probabilistic bitmap of hashcodes
– Only works with equality predicates at the row group level
– Requires significant space in the file
– Manually enabled by using orc.bloom.filter.columns
– Use orc.bloom.filter.fpp to set the false positive rate (default 0.05)
– Set the default charset in JVM via -Dfile.encoding=UTF-8
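A hedged sketch of enabling bloom filters from the Java writer (file name and schema invented); bloomFilterColumns and bloomFilterFpp mirror the properties above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class BloomFilterWrite {
  public static void main(String[] args) throws Exception {
    Writer writer = OrcFile.createWriter(new Path("lineitem.orc"),
        OrcFile.writerOptions(new Configuration())
            .setSchema(TypeDescription.fromString("struct<l_orderkey:bigint>"))
            .bloomFilterColumns("l_orderkey")  // orc.bloom.filter.columns
            .bloomFilterFpp(0.05));            // orc.bloom.filter.fpp (default)
    writer.close();
  }
}
```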
Row Pruning Example
 TPC-H
– from tpch1000.lineitem where l_orderkey = 1212000001;
 Rows Read
– Nothing – 5,999,989,709
– Min/Max – 540,000
– BloomFilter – 10,000
 Time Taken
– Nothing – 74 sec
– Min/Max – 4.5 sec
– BloomFilter – 1.3 sec
Split Calculation
 Hive’s OrcInputFormat has three strategies for split calculation
– BI
• Small fast queries
• Splits based on HDFS blocks
– ETL
• Large queries
• Read file footer and apply SearchArg to stripes
• Can include footer in splits (hive.orc.splits.include.file.footer)
– Hybrid
• Uses BI when there are small files or lots of files, otherwise ETL
LLAP – Live Long & Process
 Provides a persistent service to speed up Hive
– Caches ORC and text data
– Saves the cost of YARN container & JVM spin-up
– JIT finishes after the first few seconds
 Cache uses ORC’s RLE
– Decompresses zlib or Snappy before caching; only the RLE is kept
– RLE is fast and saves memory
– Automatically caches hot columns and partitions
 Allows Spark to use Hive’s column and row security
Current Work In Progress
Speed Improvements for ACID
 Hive supports ACID transactions on ORC tables
– Uses delta files in HDFS to store changes to each partition
– Delta files store insert/update/delete operations
– Used to support SQL insert commands
 Unfortunately, update operations don’t allow predicate push down on the deltas
 In the upcoming Hive 2.3, we added a new ACID layout
– It changes an update into an insert and a delete
– Allows predicate pushdown even on the delta files
 Also added SQL merge command in Hive 2.2
Column Encryption (ORC-14)
 Allows users to encrypt some of the columns of the file
– Provides column level security even with access to raw files
– Uses Key Management Server from Ranger or Hadoop
– Includes both the data and the index
– Daily key rolling can anonymize data after 90 days
 User specifies how data is masked if the user doesn’t have access
– Nullify
– Redact
– SHA256
Thank You
@owen_omalley
owen@hortonworks.com
