ORC Files

Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 will add a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type-specific readers and writers that provide lightweight compression techniques such as dictionary encoding, bit packing, delta encoding, and run-length encoding, resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files.

However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that a query that needs only one column reads only the bytes of that column. Furthermore, ORC files include lightweight indexes that record the minimum and maximum values of each column for each set of 10,000 rows and for the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that cannot match the query. Finally, ORC works together with the upcoming query vectorization work, providing a high-bandwidth reader/writer interface.
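The row-skipping idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: `IndexEntry`, `build_index`, and `row_groups_to_read` are hypothetical names, not part of Hive or the ORC reader, and the index is kept in memory rather than in the file.

```python
from dataclasses import dataclass
from typing import List

ROWS_PER_INDEX_ENTRY = 10_000  # index granularity stated in the abstract


@dataclass
class IndexEntry:
    """Hypothetical stand-in for one min/max index entry of a single column."""
    first_row: int
    minimum: int
    maximum: int


def build_index(values: List[int], stride: int = ROWS_PER_INDEX_ENTRY) -> List[IndexEntry]:
    """Record the minimum and maximum of every group of `stride` rows."""
    entries = []
    for start in range(0, len(values), stride):
        chunk = values[start:start + stride]
        entries.append(IndexEntry(start, min(chunk), max(chunk)))
    return entries


def row_groups_to_read(index: List[IndexEntry], low: int, high: int) -> List[int]:
    """Return the first row of every group that may hold a value in [low, high].

    A group is skipped when its [minimum, maximum] range cannot overlap the filter.
    """
    return [e.first_row for e in index if e.maximum >= low and e.minimum <= high]


values = list(range(50_000))          # sorted data, where skipping works best
index = build_index(values)           # five entries of 10,000 rows each
selected = row_groups_to_read(index, 12_000, 13_000)
print(selected)                       # only the group starting at row 10,000 survives
```

On unsorted data the ranges of neighbouring groups overlap more, so fewer groups can be skipped; the sketch does not model that cost.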

© Hortonworks Inc. 2012
ORC Files
June 2013
Owen O’Malley
owen@hortonworks.com
@owen_omalley

Who Am I?

History

Remaining Challenges

Requirements

File Structure

Stripe Structure

File Layout
[Figure: an ORC file drawn as three 256 MB stripes, each holding Index Data, Row Data, and a Stripe Footer, followed by the File Footer and the Postscript. Index Data and Row Data are each split into Columns 1 to 8, and Column 2 is expanded into Streams 2.1 to 2.4.]

Compression

Integer Column Serialization
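The slide body did not survive extraction, but the abstract names run-length and delta encoding among the lightweight techniques for typed columns. A minimal sketch of that idea for integers follows; the `(first value, delta, length)` tuple layout and the function names are assumptions made for illustration, not the byte format ORC writes.

```python
from typing import List, Tuple

# (first value, delta between neighbours, number of values in the run)
Run = Tuple[int, int, int]


def encode_runs(values: List[int]) -> List[Run]:
    """Greedily group values into runs that share a constant delta."""
    runs: List[Run] = []
    i = 0
    while i < len(values):
        if i + 1 == len(values):
            runs.append((values[i], 0, 1))  # trailing single value
            break
        delta = values[i + 1] - values[i]
        length = 2
        # Extend the run while the next value continues the same delta.
        while (i + length < len(values)
               and values[i + length] - values[i + length - 1] == delta):
            length += 1
        runs.append((values[i], delta, length))
        i += length
    return runs


def decode_runs(runs: List[Run]) -> List[int]:
    """Expand each run back into its original values."""
    out: List[int] = []
    for first, delta, length in runs:
        out.extend(first + delta * k for k in range(length))
    return out


column = [7, 7, 7, 7, 10, 11, 12, 13, 100]
runs = encode_runs(column)
print(runs)                          # repeated and sequential values collapse into runs
assert decode_runs(runs) == column   # the encoding is lossless
```

A delta of 0 covers plain repeats and a delta of 1 covers sequences such as row identifiers, which is why one representation handles both cases.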

String Column Serialization
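The abstract lists dictionary encoding among the lightweight techniques, and the comparison table notes that ORC stores strings in a dictionary. The sketch below shows the general idea only; the function names are hypothetical, and sorting the dictionary is an assumption chosen to make the output deterministic.

```python
from typing import List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each string with the index of its entry in a dictionary of unique strings."""
    dictionary = sorted(set(values))                 # each distinct string stored once
    position = {s: i for i, s in enumerate(dictionary)}
    return dictionary, [position[s] for s in values]


def dictionary_decode(dictionary: List[str], ids: List[int]) -> List[str]:
    """Look every index up again to recover the original column."""
    return [dictionary[i] for i in ids]


column = ["us", "de", "us", "fr", "de", "us"]
dictionary, ids = dictionary_encode(column)
print(dictionary, ids)
assert dictionary_decode(dictionary, ids) == column
```

The resulting index column holds small integers, so the integer techniques named in the abstract (bit packing, run-length encoding) can be applied to it in turn.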

Hive Compound Types
[Figure: tree of column types with numbered nodes: 0 Struct, 1 Int, 2 Map, 3 String, 4 Struct, 5 String, 6 Double, 7 Time.]
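The node numbers in the figure are consistent with a pre-order walk of the type tree. The sketch below reproduces them under an assumed tree shape (a root struct holding an int, a map from string to a struct of string and double, and a time column); that shape is inferred from the numbering, not stated on the slide, and `TypeNode` and `assign_column_ids` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TypeNode:
    """Hypothetical node of a compound type tree."""
    kind: str
    children: List["TypeNode"] = field(default_factory=list)


def assign_column_ids(root: TypeNode) -> List[Tuple[int, str]]:
    """Number every node in pre-order: a parent first, then its children left to right."""
    numbered: List[Tuple[int, str]] = []

    def visit(node: TypeNode) -> None:
        numbered.append((len(numbered), node.kind))
        for child in node.children:
            visit(child)

    visit(root)
    return numbered


# Assumed shape: struct<int, map<string, struct<string, double>>, time>
schema = TypeNode("Struct", [
    TypeNode("Int"),
    TypeNode("Map", [
        TypeNode("String"),
        TypeNode("Struct", [TypeNode("String"), TypeNode("Double")]),
    ]),
    TypeNode("Time"),
])

column_ids = assign_column_ids(schema)
for column_id, kind in column_ids:
    print(column_id, kind)
```

With this numbering every nested field gets its own column ID, which matches the "Separate complex columns" row of the comparison table.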

Compound Type Serialization

Generic Compression

Column Projection

How Do You Use ORC

Managing Memory

Pavan’s Trick

Looking at ORC File Structures

Looking at ORC File Structures (continued)

TPC-DS File Sizes

TPC-DS Query Performance

Additional Details

Current Work

Vectorization

Vectorization Preliminary Results

Future Work

Thanks!

Comparison

Feature                        RC File  Trevni  Parquet  ORC File
Hive Type Model                N        N       N        Y
Separate complex columns       N        Y       Y        Y
Splits found quickly           N        Y       Y        Y
Default column group size      4MB      64MB*   64MB*    256MB
Files per bucket               1        > 1     1*       1
Store min, max, sum, count     N        N       N        Y
Versioned metadata             N        Y       Y        Y
Run-length data encoding       N        N       Y        Y
Store strings in dictionary    N        N       N        Y
Store row count                N        Y       N        Y
Skip compressed blocks         N        N       N        Y
Store internal indexes         N        N       N        Y

ORC Files

1. © Hortonworks Inc. 2012 ORC Files June 2013 Owen O’Malley owen@hortonworks.com @owen_omalley

2. Who Am I?

3. History

4. Remaining Challenges

5. Requirements

6. File Structure

7. Stripe Structure

8. File Layout [Figure: a file made up of 256MB stripes plus a File Footer and a Postscript. Each stripe contains Index Data, Row Data, and a Stripe Footer. Row Data holds Columns 1 to 8, and Column 2 is shown split into Streams 2.1 to 2.4.]
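
The layout on the File Layout slide can be sketched as a flat list of sections. This is only an illustration: the function name `file_layout` is hypothetical, and the ordering (all stripes first, then the File Footer, then the Postscript at the very end) is an assumption taken from the published ORC format description rather than from the flattened figure text.

```python
from typing import List

# Sections inside one stripe, using the labels from the slide.
STRIPE_SECTIONS = ["Index Data", "Row Data", "Stripe Footer"]


def file_layout(stripe_count: int) -> List[str]:
    """Return section labels in the assumed on-disk order.

    Assumption: stripes come first, followed by the File Footer and
    then the Postscript as the last section of the file.
    """
    sections: List[str] = []
    for n in range(1, stripe_count + 1):
        sections.extend(f"Stripe {n}: {name}" for name in STRIPE_SECTIONS)
    sections.extend(["File Footer", "Postscript"])
    return sections


if __name__ == "__main__":
    for label in file_layout(3):
        print(label)
```

Under that assumption a reader would locate the Postscript at the end of the file first and work backwards to the File Footer before touching any stripe.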

9. Compression

10. Integer Column Serialization

11. String Column Serialization

12. Hive Compound Types [Figure: type tree with numbered columns: 0 Struct, 1 Int, 2 Map, 3 String, 4 Struct, 5 String, 6 Double, 7 Time.]
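
The column numbers on the Hive Compound Types slide are consistent with a pre-order walk of the type tree (parent first, then children left to right). The sketch below is a minimal illustration under that assumption. The `TypeNode` class, the `assign_column_ids` function, and the exact nesting of the example schema are hypothetical: the nesting was chosen so that the walk reproduces the numbers shown on the slide.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TypeNode:
    """One node of a (hypothetical) type tree: a kind plus child types."""
    kind: str
    children: List["TypeNode"] = field(default_factory=list)


def assign_column_ids(root: TypeNode) -> List[Tuple[int, str]]:
    """Number every node in pre-order: parent first, then its children."""
    numbered: List[Tuple[int, str]] = []

    def visit(node: TypeNode) -> None:
        numbered.append((len(numbered), node.kind))
        for child in node.children:
            visit(child)

    visit(root)
    return numbered


# Assumed schema shape: a struct holding an int, a map from string to a
# nested struct of (string, double), and a time column.
schema = TypeNode("Struct", [
    TypeNode("Int"),
    TypeNode("Map", [
        TypeNode("String"),
        TypeNode("Struct", [TypeNode("String"), TypeNode("Double")]),
    ]),
    TypeNode("Time"),
])

if __name__ == "__main__":
    for column_id, kind in assign_column_ids(schema):
        print(column_id, kind)
```

Numbering every node, including the compound ones, gives each nested field its own column ID, which is what lets a reader select a nested field without reading its siblings.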

13. Compound Type Serialization

14. Generic Compression

15. Column Projection

16. How Do You Use ORC

17. Managing Memory

18. Pavan’s Trick

19. Looking at ORC File Structures

20. Looking at ORC File Structures

21. TPC-DS File Sizes

22. TPC-DS Query Performance

23. Additional Details

24. Current work

25. Vectorization

26. Vectorization Preliminary Results

27. Future Work

28. Thanks!

29. Comparison

Feature                        RC File   Trevni   Parquet   ORC File
Hive Type Model                N         N        N         Y
Separate complex columns       N         Y        Y         Y
Splits found quickly           N         Y        Y         Y
Default column group size      4MB       64MB*    64MB*     256MB
Files per bucket               1         > 1      1*        1
Store min, max, sum, count     N         N        N         Y
Versioned metadata             N         Y        Y         Y
Run length data encoding       N         N        Y         Y
Store strings in dictionary    N         N        N         Y
Store row count                N         Y        N         Y
Skip compressed blocks         N         N        N         Y
Store internal indexes         N         N        N         Y