© 2017 IBM Corporation
Ingesting Data at Blazing Speed with Apache ORC
Gustavo Arocena
IBM Toronto Lab
IBM Canada Lab
Toronto
Acknowledgements and Disclaimers
Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries
in which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are
provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice
to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is
provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of,
or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the
effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the
applicable license agreement governing the use of IBM software.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may
have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials
is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales,
revenue growth or other results.
© Copyright IBM Corporation 2017. All rights reserved.
U.S. Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM, the IBM logo, ibm.com, BigInsights, and Big SQL are trademarks or registered trademarks of International Business Machines Corporation in the
United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a
trademark symbol (® or TM), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was
published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the
Web at
▪“Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
▪TPC Benchmark, TPC-DS, and QphDS are trademarks of Transaction Processing Performance Council
▪Cloudera, the Cloudera logo, Cloudera Impala are trademarks of Cloudera.
▪Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other countries.
▪Other company, product, or service names may be trademarks or service marks of others.
Agenda
What is Big SQL?
From Parquet to ORC
Reading ORC Files Fast
Scaling Up
Tuning Big SQL for ORC
Big SQL Background
What is Big SQL?
Data Security
Metastore
Cluster Mgmt.
Administration
Runs on
Data Platform
What’s the Big Deal?
Grace Under Pressure
What’s the Big Deal?
The Big SQL Advantages
Scale
• Only engine to run TPC-DS at 100TB scale
Complex SQL
• Capable of running all 99 TPC-DS queries since 2014
• Complex queries optimized with IBM cost-based optimizer
Concurrency
• Handles highly concurrent workloads gracefully
• 12 stream TPC-DS
Efficient Resource Utilization
• Memory
• CPU
• IO
Metrics for Big SQL 4.2.5 vs. Spark SQL 2.1
▪ Hadoop DS @ 100TB, 4 streams
Metric                  Big SQL   Spark SQL   Big SQL vs. Spark SQL
Elapsed Time (hours)    13.7      43.2        1/3
CPU Utilization (%)     76.4      88.2        -15%
Disk Reads (MB/sec)     107       388         1/3
Disk Writes (MB/sec)    25        237         1/9
https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/
From Parquet to ORC
Big SQL Architecture (as of 2016)
[Diagram labels: Head Node; Worker Node (x3); Parquet IO; Hive Compat. IO; Hive Metastore; HDFS; HDFS NN]
2017: Big SQL on HDP
Most popular data format on HDP is ORC
ORC performance becomes top priority
ORC vs Parquet in Big SQL 4.2
[Chart: Parquet vs ORC v0, 1TB TPC-DS. Elapsed time (sec) for 1 stream and 4 streams; one bar runs off the scale at > 50000. Callouts: 70% slower with ORC; 315% slower with ORC.]
Limitations of Hive Compatibility IO Engine
Slow Ingestion
• Single row at a time
• JIT unfriendly
• Data values as Java objects
Low Scalability
• Large memory footprint per scan
• Excessive CPU use
• Overloaded disks
The Roadmap Towards Fast ORC Ingestion
1st Phase
• Big SQL 4.2.5, Dec ’16
• Fast ORC Ingestion
2nd Phase
• Big SQL 5.0.1, Aug ’17
• ORC at Scale
1st Phase – Fast Ingestion using Apache ORC
[Chart: Parquet vs ORC v1, 1TB TPC-DS. Elapsed time (sec) for 1 stream and 4 streams. Callouts: 2% faster with ORC; 65% slower with ORC.]
Apache ORC libs key benefits
▪ Many-row-at-a-time API
▪ Enable JIT-friendly code
▪ Represent data using primitive Java types
▪ Make projection and selection pushdown very easy
2nd Phase - Managing Resources
[Chart: Parquet vs ORC v2, 1TB TPC-DS. Elapsed time (sec) for 1 stream and 4 streams. Callouts: 15% faster with ORC; 3.4% faster with ORC.]
Resource Manager has global oversight over
▪ Total number of threads
▪ Overall JVM heap consumption
▪ Degree of parallelism per scan
ORC as a First Class Citizen in 5.0.1
[Diagram labels: Head Node; Worker Node (x3); Parquet IO; Hive Compat. IO; ORC IO; Hive Metastore; HDFS; HDFS NN]
ORC Background
What is Apache ORC?
ORC = efficient storage + fast ingestion
Compression
• Type-specific encodings (RLE for numbers, dictionary for strings, etc.)
• Generic compression (Zlib, Snappy)
Data skipping
• Column skipping based on data layout
• Row skipping based on MIN/MAX stats and bloom filters
JIT friendly
• Vectorized APIs (retrieve data as arrays of primitive values)
▪ Engines leverage all these features
▪ Apache ORC libs allow applications to leverage them too
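The two type-specific encodings named above can be illustrated with a minimal sketch. It assumes the simplest possible forms: plain (value, run length) pairs for numbers and a first-seen-order dictionary for strings. The class and method names are hypothetical, and the encodings ORC actually writes are more elaborate than this.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class EncodingSketch {
    /** Run-length encodes values as {value, runLength} pairs. */
    static List<long[]> runLengthEncode(long[] values) {
        List<long[]> runs = new ArrayList<>();
        for (long v : values) {
            if (!runs.isEmpty() && runs.get(runs.size() - 1)[0] == v) {
                runs.get(runs.size() - 1)[1]++;      // extend the current run
            } else {
                runs.add(new long[] {v, 1});         // start a new run
            }
        }
        return runs;
    }

    /** Replaces each string by its code in a dictionary of distinct values. */
    static int[] dictionaryEncode(String[] values, Map<String, Integer> dictionary) {
        int[] codes = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            Integer code = dictionary.get(values[i]);
            if (code == null) {
                code = dictionary.size();            // next unused code
                dictionary.put(values[i], code);
            }
            codes[i] = code;
        }
        return codes;
    }

    public static void main(String[] args) {
        List<long[]> runs = runLengthEncode(new long[] {7, 7, 7, 3, 3, 9});
        System.out.println(runs.size() + " runs for 6 values");

        Map<String, Integer> dictionary = new LinkedHashMap<>();
        int[] codes = dictionaryEncode(new String[] {"CA", "US", "CA", "CA"}, dictionary);
        System.out.println(dictionary.size() + " dictionary entries for " + codes.length + " strings");
    }
}
```

Both encodings pay off when values repeat, which is one reason sorted or low-cardinality columns tend to compress well.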
ORC Physical Data Layout
CREATE HADOOP TABLE SALES(id INTEGER,
                          quantity INTEGER,
                          amount DOUBLE)
[Diagram: a file made of stripes, with stripe stats (one per stripe, three shown) and file stats. Each stripe (HDFS block) holds row groups (10K rows); a row group holds 10K id values, 10K quantity values, 10K amount values, and row group stats.]
Leveraging the Apache ORC Libraries
Dependencies and Classes
▪ Java Dependencies (org.apache.orc group id in Maven)
orc-core-1.4.0-nohive.jar
aircompressor-0.3.jar
▪ Java Classes for “vectorized” processing
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.storage.ql.exec.vector.LongColumnVector;
import org.apache.orc.storage.ql.exec.vector.DoubleColumnVector;
import org.apache.orc.storage.ql.io.sarg.PredicateLeaf;
import org.apache.orc.storage.ql.io.sarg.SearchArgument;
import org.apache.orc.storage.ql.io.sarg.SearchArgumentFactory;
Using Vectorized ORC APIs
Reader r = OrcFile.createReader(path, OrcFile.readerOptions(conf));
RecordReader rr = r.rows();
VectorizedRowBatch batch = r.getSchema().createRowBatch(1000);
[Diagram: the batch holds one array of 1000 values per column: ID (long[1000]), QUANTITY (long[1000]), AMOUNT (double[1000])]
▪ A vectorized row batch is a Java object that holds up to 1000 decoded rows
JIT friendly code
▪ Get the total for sales involving fewer than 500 items
SELECT sum(amount)
FROM sales
WHERE quantity < 500
// Compute sum(amount)
// Simplified: ignores the isRepeating and isNull flags of the column vectors
double sum = 0;
while (rr.nextBatch(batch)) {
    long[] qty = ((LongColumnVector) batch.cols[1]).vector;
    double[] amt = ((DoubleColumnVector) batch.cols[2]).vector;
    for (int i = 0; i < batch.size; i++) {
        if (qty[i] < 500) {
            sum += amt[i];
        }
    }
}
• No objects
• No method calls
• Tight loop compiles to machine code
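To make the contrast with a row-at-a-time reader concrete, here is a self-contained sketch that needs no ORC libraries. `SalesRow` is a hypothetical stand-in for an API that hands back one row of boxed Java objects at a time; `sumVectorized` mirrors the loop above over plain arrays. Both compute the same answer, and only the data representation differs.

```java
import java.util.ArrayList;
import java.util.List;

final class VectorizedSumSketch {
    /** Hypothetical row object: every value is a boxed Java object. */
    static final class SalesRow {
        final Long quantity;
        final Double amount;

        SalesRow(Long quantity, Double amount) {
            this.quantity = quantity;
            this.amount = amount;
        }
    }

    /** Row-at-a-time: one object per row, unboxing on every access. */
    static double sumRowAtATime(List<SalesRow> rows) {
        double sum = 0;
        for (SalesRow row : rows) {
            if (row.quantity < 500) {
                sum += row.amount;
            }
        }
        return sum;
    }

    /** Vectorized: primitive arrays and a tight loop with no allocation. */
    static double sumVectorized(long[] qty, double[] amt, int size) {
        double sum = 0;
        for (int i = 0; i < size; i++) {
            if (qty[i] < 500) {
                sum += amt[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int size = 1000;
        long[] qty = new long[size];
        double[] amt = new double[size];
        List<SalesRow> rows = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            qty[i] = i;        // quantities 0..999, so 500 rows qualify
            amt[i] = 1.5;
            rows.add(new SalesRow(qty[i], amt[i]));
        }
        System.out.println(sumRowAtATime(rows));          // 750.0
        System.out.println(sumVectorized(qty, amt, size)); // 750.0
    }
}
```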
Column Skipping/Pruning (a.k.a. Projection Pushdown)
▪ If we don’t say otherwise, ORC will read all the columns
▪ But our query is using only two columns
ID
QUANTITY
AMOUNT
SELECT sum(amount)
FROM sales
WHERE quantity < 500
Column Skipping/Pruning (a.k.a. Projection Pushdown)
// Projection: one flag per column id; id 0 is the root struct, then id, quantity, amount
boolean[] projection = new boolean[] {true, false, true, true};
// ORC RecordReader with projection pushdown
RecordReader rr = r.rows(
    new Reader.Options()
        .include(projection));
// Compute sum(amount)
double sum = 0;
while (rr.nextBatch(batch)) { … }
SELECT sum(amount)
FROM sales
WHERE quantity < 500
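For wider tables, writing the projection flags by hand is error-prone. The helper below is a small sketch of deriving them from column names. It assumes a flat schema (no nested types) and that the flags are indexed by ORC column id, with id 0 reserved for the root struct; `includeFor` is a hypothetical name, not part of the ORC API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class ProjectionSketch {
    /**
     * Builds an include array for a flat schema.
     * Assumption: index 0 is the root struct, index i + 1 is the i-th field.
     */
    static boolean[] includeFor(String[] fieldNames, Set<String> wanted) {
        boolean[] include = new boolean[fieldNames.length + 1];
        include[0] = true; // root struct
        for (int i = 0; i < fieldNames.length; i++) {
            include[i + 1] = wanted.contains(fieldNames[i]);
        }
        return include;
    }

    public static void main(String[] args) {
        String[] fields = {"id", "quantity", "amount"};
        Set<String> wanted = new HashSet<>(Arrays.asList("quantity", "amount"));
        // prints [true, false, true, true]
        System.out.println(Arrays.toString(includeFor(fields, wanted)));
    }
}
```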
Column Skipping/Pruning (a.k.a. Projection Pushdown)
create external hadoop table web_sales (
ws_sold_date_sk int, ws_sold_time_sk int, ws_ship_date_sk int, ws_item_sk int not null, ws_bill_customer_sk int,
ws_bill_cdemo_sk int, ws_bill_hdemo_sk int, ws_bill_addr_sk int, ws_ship_customer_sk int, ws_ship_cdemo_sk int,
ws_ship_hdemo_sk int, ws_ship_addr_sk int, ws_web_page_sk int, ws_web_site_sk int, ws_ship_mode_sk int, ws_warehouse_sk int,
ws_promo_sk int, ws_order_number bigint not null, ws_quantity bigint, ws_wholesale_cost double, ws_list_price double,
ws_sales_price double, ws_ext_discount_amt double, ws_ext_sales_price double, ws_ext_wholesale_cost double,
ws_ext_list_price double, ws_ext_tax double, ws_coupon_amt double, ws_ext_ship_cost double, ws_net_paid double,
ws_net_paid_inc_tax double, ws_net_paid_inc_ship double, ws_net_paid_inc_ship_tax double, ws_net_profit double
)
STORED AS ORC;
SELECT max(ws_sold_date_sk)
FROM tpcds_1tb.web_sales
Elapsed time (seconds): 3.5 with projection pushdown, 24.9 without projection pushdown
Row Skipping/Pruning (a.k.a. Predicate Pushdown)
▪ Row skipping leverages the MIN/MAX stats
Stripes:
quantity MIN:274, MAX:590
quantity MIN:603, MAX:3000
Row groups:
quantity MIN:510, MAX:540
quantity MIN:330, MAX:420
SELECT sum(amount)
FROM sales
WHERE quantity < 500
• Data must be sorted for pruning to be effective!
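The skipping decision itself is a simple range test. The sketch below applies it to the MIN/MAX values shown on this slide for the predicate quantity < 500. The class and method names are hypothetical; this illustrates the idea only, not the ORC reader's internal logic.

```java
final class MinMaxPruningSketch {
    /** MIN/MAX statistics for one stripe or row group. */
    static final class Stats {
        final long min;
        final long max;

        Stats(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /** True when no value in [min, max] can satisfy "value < bound". */
    static boolean canSkipLessThan(Stats stats, long bound) {
        return stats.min >= bound;
    }

    public static void main(String[] args) {
        long bound = 500;
        Stats[] ranges = {
            new Stats(274, 590),   // stripe: may contain matches, must be read
            new Stats(603, 3000),  // stripe: every value >= 500, skipped
            new Stats(510, 540),   // row group: every value >= 500, skipped
            new Stats(330, 420)    // row group: must be read
        };
        for (Stats s : ranges) {
            System.out.println("[" + s.min + ", " + s.max + "] skip="
                + canSkipLessThan(s, bound));
        }
    }
}
```

If the data is not sorted on quantity, most row groups end up with a wide range that straddles 500, and almost nothing can be skipped. That is why sorting matters for pruning.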
Row Skipping/Pruning (a.k.a. Predicate Pushdown)
// Predicates
SearchArgument selection = SearchArgumentFactory.newBuilder()
    .lessThan("quantity", PredicateLeaf.Type.LONG, 500L).build();
// ORC RecordReader with projection and selection pushdown
RecordReader rr = r.rows(
    new Reader.Options()
        .include(projection)
        .searchArgument(selection, new String[] {}));
// Compute sum(amount)
double sum = 0;
while (rr.nextBatch(batch)) { ... }
SELECT sum(amount)
FROM sales
WHERE quantity < 500
Scaling Up
Scaling Up
▪ Reading too many files at once hurts performance and can cause OOM
▪ How many? It depends on
  • Number of disks
  • Java heap size
▪ Must have multiple disks AND the files must be evenly distributed across the disks
▪ But the number of threads must be limited by the Java heap size!
  • Need 30 to 80 MB of Java heap per open ORC file/split
  • Biggest consumers
    ◦ RecordReader
    ◦ VectorizedRowBatch
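The two limits above can be combined into a back-of-the-envelope estimate. In the sketch below, the heap budget, the number of disks, and the readers-per-disk factor are hypothetical inputs chosen for illustration; only the 30 to 80 MB per open file figure comes from the slide.

```java
final class OpenFileBudgetSketch {
    /**
     * Upper bound on concurrently open ORC files: the smaller of what the
     * heap can hold and what the disks can usefully serve in parallel.
     */
    static int maxOpenFiles(long heapBudgetMb, long mbPerOpenFile,
                            int disks, int readersPerDisk) {
        long byHeap = heapBudgetMb / mbPerOpenFile;
        long byDisks = (long) disks * readersPerDisk;
        return (int) Math.min(byHeap, byDisks);
    }

    public static void main(String[] args) {
        // Worst case of 80 MB per open file, 12 disks, 4 readers per disk (assumed).
        System.out.println(maxOpenFiles(4096, 80, 12, 4)); // disk-bound: 48
        System.out.println(maxOpenFiles(1024, 80, 12, 4)); // heap-bound: 12
    }
}
```

Sizing against the 80 MB end of the range is the conservative choice, since running out of heap is worse than leaving some parallelism unused.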
Scaling Up In an Engine
▪ An engine must handle concurrency (multiple queries/users)
▪ A fixed number of open files per scan leads to OOMs
▪ Must gracefully degrade performance instead of running out of memory
▪ Need to limit total open files
▪ Multiple issues to deal with
  • Starvation (all queries must make reasonable progress)
  • Stragglers (work for a scan must be evenly balanced across nodes)
  • Adapting parallelism to concurrency
    ◦ Single query must take full advantage of the available resources
    ◦ As concurrency increases, parallelism decreases for ongoing scans
    ◦ When concurrency decreases, parallelism per scan increases
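One simple way to picture the adaptive behaviour is to divide a global open-file limit evenly among the scans that are currently active. The sketch below does exactly that. It is an assumption-laden illustration with hypothetical names, not a description of how the Big SQL resource manager is implemented.

```java
final class ScanParallelismSketch {
    private final int maxOpenFiles; // global limit across all scans
    private int activeScans;

    ScanParallelismSketch(int maxOpenFiles) {
        this.maxOpenFiles = maxOpenFiles;
    }

    synchronized void scanStarted() {
        activeScans++;
    }

    synchronized void scanFinished() {
        if (activeScans > 0) {
            activeScans--;
        }
    }

    /** Even share of the global limit; never below one file per scan. */
    synchronized int parallelismPerScan() {
        int scans = Math.max(1, activeScans);
        return Math.max(1, maxOpenFiles / scans);
    }

    public static void main(String[] args) {
        ScanParallelismSketch manager = new ScanParallelismSketch(48);
        manager.scanStarted();
        System.out.println(manager.parallelismPerScan()); // single scan gets all 48
        for (int i = 0; i < 3; i++) {
            manager.scanStarted();
        }
        System.out.println(manager.parallelismPerScan()); // 4 scans share: 12 each
        manager.scanFinished();
        manager.scanFinished();
        System.out.println(manager.parallelismPerScan()); // 2 scans left: 24 each
    }
}
```

Because every scan is guaranteed at least one file, the global limit can be exceeded once there are more active scans than the limit allows. A real engine would also need to queue scans at that point, and to address the starvation and straggler issues listed above.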
Big SQL Tuning for ORC
▪ The “fundamentals” (regardless of the data format)
  • Ensure Big SQL has enough resources
    ◦ % memory
    ◦ % CPU
    ◦ Temp storage spread across multiple disks (e.g. same disks as HDFS)
  • Use partitioning
  • Run ANALYZE on all your tables (or ensure AUTO ANALYZE is enabled)
▪ ORC-specific tuning (Big SQL 5.0.1 and up)
  • Property bigsql.java.orc.preftp.size controls the max number of open ORC files
▪ ORC file creation
  • Data sorted by filtering columns
  • Stripe and row group size
  • Bloom filters
▪ For more Big SQL tuning tips, see
https://developer.ibm.com/hadoop/2016/11/16/top-6-big-sql-v4-2-performance-tips/
Summary
▪ ORC = Storage Efficiency + Fast Ingestion
▪ Fast ingestion using Vectorized APIs in Apache ORC
▪ Big SQL performs best with data in ORC format!
Questions?
Thank you!

Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
Vlad Stirbu
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
UiPathCommunity
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 

Recently uploaded (20)

Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 

Ingesting Data at Blazing Speed Using Apache ORC

  • 1. © 2017 IBM Corporation Ingesting Data at Blazing Speed with Apache ORC Gustavo Arocena IBM Toronto Lab
  • 2. IBM Canada Lab Toronto
  • 3. Acknowledgements and Disclaimers
  • 4. Agenda
      ▪ What is Big SQL?
      ▪ From Parquet to ORC
      ▪ Reading ORC Files Fast
      ▪ Scaling Up
      ▪ Tuning Big SQL for ORC
  • 5. Big SQL Background
  • 6. What is Big SQL? Runs on Data Platform (diagram labels: Data Security, Metastore, Cluster Mgmt., Administration)
  • 7. What’s the Big Deal? Grace Under Pressure
  • 8. The Big SQL Advantages (What’s the Big Deal?)
      ▪ Scale
          • Only engine to run TPC-DS at 100TB scale
      ▪ Complex SQL
          • Capable of running all 99 TPC-DS queries since 2014
          • Complex queries optimized with IBM Cost-based optimizer
      ▪ Concurrency
          • Handles highly concurrent workloads gracefully
          • 12 stream TPC-DS
      ▪ Efficient Resource Utilization
          • Memory
          • CPU
          • IO
  • 9. Metrics for Big SQL 4.2.5 vs. Spark SQL 2.1
      ▪ Hadoop DS @ 100TB, 4 streams
      Metric                | Big SQL | Spark SQL | Big SQL relative to Spark SQL
      Elapsed time (hours)  | 13.7    | 43.2      | 1/3
      CPU utilization (%)   | 76.4    | 88.2      | -15%
      Disk reads (MB/sec)   | 107     | 388       | 1/3
      Disk writes (MB/sec)  | 25      | 237       | 1/9
      https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/
  • 10. From Parquet to ORC
  • 11. Big SQL Architecture (as of 2016). Diagram components: Head Node; three Worker Nodes; Parquet IO; Hive Compat. IO; Hive Metastore; HDFS; HDFS NN
  • 12. 2017: Big SQL on HDP
      ▪ Most popular data format on HDP is ORC
      ▪ ORC performance becomes top priority
  • 13. ORC vs Parquet in Big SQL 4.2
      [Chart: Parquet vs ORC v0, 1TB TPC-DS; elapsed time (sec) for 1 stream and 4 streams. Annotations: 70% slower with ORC; 315% slower with ORC (> 50000 sec).]
  • 14. Limitations of Hive Compatibility IO Engine
      ▪ Slow Ingestion
          • Single row at a time
          • JIT unfriendly
          • Data values as Java objects
      ▪ Low Scalability
          • Large memory footprint per scan
          • Excessive CPU use
          • Overloaded disks
  • 15. The Roadmap Towards Fast ORC Ingestion
      ▪ 1st Phase
          • Big SQL 4.2.5, Dec ’16
          • Fast ORC Ingestion
      ▪ 2nd Phase
          • Big SQL 5.0.1, Aug ’17
          • ORC at Scale
  • 16. 1st Phase – Fast Ingestion using Apache ORC
      Apache ORC libs key benefits
      ▪ Many-row-at-a-time API
      ▪ Enable JIT-friendly code
      ▪ Represent data using primitive Java types
      ▪ Make projection and selection pushdown very easy
      [Chart: Parquet vs ORC v1, 1TB TPC-DS; elapsed time (sec) for 1 stream and 4 streams. Annotations: 2% faster with ORC; 65% slower with ORC.]
  • 17. 2nd Phase - Managing Resources
      Resource Manager has global oversight over
      ▪ Total number of threads
      ▪ Overall JVM heap consumption
      ▪ Degree of parallelism per scan
      [Chart: Parquet vs ORC v2, 1TB TPC-DS; elapsed time (sec) for 1 stream and 4 streams. Annotations: 15% faster with ORC; 3.4% faster with ORC.]
  • 18. ORC as a First Class Citizen in 5.0.1. Diagram components: Head Node; three Worker Nodes; Parquet IO; Hive Compat. IO; ORC IO; Hive Metastore; HDFS; HDFS NN
  • 19. ORC Background
  • 20. What is Apache ORC? ORC = efficient storage + fast ingestion
      ▪ Compression
          • Type-specific encodings (RLE for numbers, dictionary for strings, etc.)
          • Generic compression (Zlib, Snappy)
      ▪ Data skipping
          • Column skipping based on data layout
          • Row skipping based on MIN/MAX stats and bloom filters
      ▪ JIT friendly
          • Vectorized APIs (retrieve data as arrays of primitive values)
      ▪ Engines leverage all these features
      ▪ Apache ORC libs allow applications to leverage them too
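The two type-specific encodings named on slide 20 can be illustrated with a small, self-contained sketch. This is a simplification for intuition only: the class and method names are hypothetical, and ORC's real on-disk encodings are more elaborate than what is shown here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: simplified encodings, not ORC's actual on-disk formats.
final class ColumnEncodingSketch {

    // Run-length encode a numeric column as {value, runLength} pairs.
    static List<long[]> runLengthEncode(long[] values) {
        List<long[]> runs = new ArrayList<>();
        for (long v : values) {
            if (!runs.isEmpty() && runs.get(runs.size() - 1)[0] == v) {
                runs.get(runs.size() - 1)[1]++;   // extend the current run
            } else {
                runs.add(new long[] {v, 1});      // start a new run
            }
        }
        return runs;
    }

    // Dictionary-encode a string column: each distinct string is stored once
    // in dictionaryOut, and the column becomes an array of integer ids.
    static int[] dictionaryEncode(String[] values, List<String> dictionaryOut) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        int[] encoded = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            Integer id = ids.get(values[i]);
            if (id == null) {
                id = ids.size();
                ids.put(values[i], id);
                dictionaryOut.add(values[i]);
            }
            encoded[i] = id;
        }
        return encoded;
    }
}
```

Both encodings pay off on repetitive columns: a long run of equal numbers collapses to one pair, and a low-cardinality string column stores each string once.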
  • 21. ORC Physical Data Layout
      CREATE HADOOP TABLE SALES(id INTEGER, quantity INTEGER, amount DOUBLE)
      Diagram labels:
      ▪ File stats
      ▪ Stripe stats (one per stripe)
      ▪ Stripe (HDFS block)
      ▪ Row group (10K rows): 10K id values, 10K quantity values, 10K amount values
      ▪ Row group stats
  • 22. Leveraging the Apache ORC Libraries
  • 23. Dependencies and Classes
      ▪ Java Dependencies (org.apache.orc group id in Maven)
          orc-core-1.4.0-nohive.jar
          aircompressor-0.3.jar
      ▪ Java Classes for “vectorized” processing
          import org.apache.orc.OrcFile;
          import org.apache.orc.Reader;
          import org.apache.orc.RecordReader;
          import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;
          import org.apache.orc.storage.ql.exec.vector.LongColumnVector;
          import org.apache.orc.storage.ql.exec.vector.DoubleColumnVector;
          import org.apache.orc.storage.ql.io.sarg.PredicateLeaf;
          import org.apache.orc.storage.ql.io.sarg.SearchArgument;
          import org.apache.orc.storage.ql.io.sarg.SearchArgumentFactory;
  • 24. Using Vectorized ORC APIs
      Reader r = OrcFile.createReader(path, OrcFile.readerOptions(conf));
      RecordReader rr = r.rows();
      VectorizedRowBatch batch = r.getSchema().createRowBatch(1000);
      ▪ A vectorized row batch is a Java object that contains 1000 decoded rows
      ▪ Each column (ID, QUANTITY, AMOUNT) is held as an array of 1000 values: long id[1000]; long quantity[1000]; double amount[1000];
  • 25. JIT friendly code
      ▪ Get the total for sales involving fewer than 500 items
      SELECT sum(amount) FROM sales WHERE quantity < 500
      // Compute sum(amount)
      double sum = 0;
      while (rr.nextBatch(batch)) {
          long[] qty = ((LongColumnVector) batch.cols[1]).vector;
          double[] amt = ((DoubleColumnVector) batch.cols[2]).vector;
          for (int i = 0; i < batch.size; i++)
              if (qty[i] < 500)
                  sum += amt[i];
      }
      • No objects
      • No method calls
      • Tight loop compiles to machine code
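The loop on slide 25 reads the value arrays directly, which is only safe when the columns contain no NULLs and are not flagged as repeating. The sketch below shows one way the same aggregation could account for both cases while staying a tight loop over primitives. The LongColumn and DoubleColumn classes are hypothetical stand-ins so the example runs without the ORC jars; their field names (vector, isNull, noNulls, isRepeating) are assumed to mirror the column vector classes and should be checked against the library version in use.

```java
// Hypothetical stand-ins for ORC column vectors; only the fields used here.
final class NullAwareSumSketch {

    static final class LongColumn {
        final long[] vector;
        final boolean[] isNull;
        boolean noNulls = true;       // when true, isNull can be ignored
        boolean isRepeating = false;  // when true, entry 0 stands for every row

        LongColumn(int capacity) {
            vector = new long[capacity];
            isNull = new boolean[capacity];
        }
    }

    static final class DoubleColumn {
        final double[] vector;
        final boolean[] isNull;
        boolean noNulls = true;
        boolean isRepeating = false;

        DoubleColumn(int capacity) {
            vector = new double[capacity];
            isNull = new boolean[capacity];
        }
    }

    // SELECT sum(amount) FROM sales WHERE quantity < limit, over one batch.
    // Rows with a NULL quantity fail the predicate; NULL amounts are ignored.
    static double sumAmount(LongColumn qty, DoubleColumn amt, int size, long limit) {
        double sum = 0;
        for (int i = 0; i < size; i++) {
            int q = qty.isRepeating ? 0 : i;
            int a = amt.isRepeating ? 0 : i;
            boolean qtyNull = !qty.noNulls && qty.isNull[q];
            boolean amtNull = !amt.noNulls && amt.isNull[a];
            if (!qtyNull && !amtNull && qty.vector[q] < limit) {
                sum += amt.vector[a];
            }
        }
        return sum;
    }
}
```

An engine would typically check the flags once per batch and pick a specialized loop, so the common no-NULLs case keeps the exact shape shown on the slide.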
  • 26. Column Skipping/Pruning (a.k.a. Projection Pushdown)
      SELECT sum(amount) FROM sales WHERE quantity < 500
      ▪ If we don’t say otherwise, ORC will read all the columns (ID, QUANTITY, AMOUNT)
      ▪ But our query is using only two columns
  • 27. Column Skipping/Pruning (a.k.a. Projection Pushdown)
      SELECT sum(amount) FROM sales WHERE quantity < 500
      // Projection
      boolean[] projection = new boolean[] {false, true, true};
      // ORC RecordReader with projection pushdown
      RecordReader rr = r.rows(
          new Reader.Options()
              .include(projection));
      // Compute sum(amount)
      double sum = 0;
      while (rr.nextBatch(batch)) { … }
  • 28. Column Skipping/Pruning (a.k.a. Projection Pushdown)
      create external hadoop table web_sales (
          ws_sold_date_sk int, ws_sold_time_sk int, ws_ship_date_sk int, ws_item_sk int not null,
          ws_bill_customer_sk int, ws_bill_cdemo_sk int, ws_bill_hdemo_sk int, ws_bill_addr_sk int,
          ws_ship_customer_sk int, ws_ship_cdemo_sk int, ws_ship_hdemo_sk int, ws_ship_addr_sk int,
          ws_web_page_sk int, ws_web_site_sk int, ws_ship_mode_sk int, ws_warehouse_sk int,
          ws_promo_sk int, ws_order_number bigint not null, ws_quantity bigint,
          ws_wholesale_cost double, ws_list_price double, ws_sales_price double,
          ws_ext_discount_amt double, ws_ext_sales_price double, ws_ext_wholesale_cost double,
          ws_ext_list_price double, ws_ext_tax double, ws_coupon_amt double, ws_ext_ship_cost double,
          ws_net_paid double, ws_net_paid_inc_tax double, ws_net_paid_inc_ship double,
          ws_net_paid_inc_ship_tax double, ws_net_profit double
      ) STORED AS ORC;
      SELECT max(ws_sold_date_sk) FROM tpcds_1tb.web_sales
      Elapsed time (seconds): 3.5 with projection pushdown; 24.9 without projection pushdown
  • 29. Row Skipping/Pruning (a.k.a. Predicate Pushdown)
      SELECT sum(amount) FROM sales WHERE quantity < 500
      ▪ Row skipping leverages the MIN/MAX stats
          • Stripes: quantity MIN:274, MAX:590; quantity MIN:603, MAX:3000
          • Row groups: quantity MIN:510, MAX:540; quantity MIN:330, MAX:420
      • Data must be sorted for pruning to be effective!
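The skipping decision on slide 29 can be sketched in a few lines of plain Java. For a predicate of the form quantity < limit, a stripe or row group can be skipped when its recorded minimum is already at or above the limit. The class and method names below are hypothetical, and the second method recomputes the minimum from raw values only to show why sorted data prunes better; a real reader would take it from the stored statistics.

```java
// Illustrative min/max pruning for a "value < limit" predicate.
final class MinMaxPruningSketch {

    // A chunk whose smallest value is >= limit cannot contain a matching row.
    static boolean canSkipLessThan(long chunkMin, long limit) {
        return chunkMin >= limit;
    }

    // Split a column into row groups of groupSize rows and count how many
    // groups a "value < limit" scan could skip based on each group's minimum.
    static int skippableGroups(long[] column, int groupSize, long limit) {
        int skipped = 0;
        for (int start = 0; start < column.length; start += groupSize) {
            int end = Math.min(start + groupSize, column.length);
            long min = Long.MAX_VALUE;
            for (int i = start; i < end; i++) {
                min = Math.min(min, column[i]);
            }
            if (canSkipLessThan(min, limit)) {
                skipped++;
            }
        }
        return skipped;
    }
}
```

With the ranges from the slide and a limit of 500, the chunks with minimums 603 and 510 can be skipped, while those with minimums 274 and 330 must be read. When small and large values are interleaved, every group's minimum stays low and nothing is skipped; after sorting, whole groups fall above the limit.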
  • 30. Row Skipping/Pruning (a.k.a. Predicate Pushdown)
      SELECT sum(amount) FROM sales WHERE quantity < 500
      // Predicates
      SearchArgument selection = SearchArgumentFactory.newBuilder()
          .lessThan("quantity", PredicateLeaf.Type.LONG, 500L).build();
      // ORC RecordReader with projection and selection pushdown
      RecordReader rr = r.rows(
          new Reader.Options()
              .include(projection)
              .searchArgument(selection, new String[] {}));
      // Compute sum(amount)
      double sum = 0;
      while (rr.nextBatch(batch)) { ... }
  • 31. Scaling Up
  • 32. Scaling Up
      ▪ Reading too many files hurts performance and can cause OOM
      ▪ How many?
          • Number of disks
          • Java heap size
      ▪ Must have multiple disks AND the files must be evenly distributed across the disks
      ▪ But number of threads must be limited by the Java heap size!
          • Need 30 to 80 MB of Java heap per open ORC file/split
          • Biggest consumers: RecordReader, VectorizedRowBatch
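The sizing rule on slide 32 can be turned into a rough back-of-the-envelope helper. This is a sketch under stated assumptions: the class name, the readersPerDisk parameter, and the idea of taking the smaller of a disk-based and a heap-based limit are illustrative choices, not Big SQL's actual policy. Only the 30 to 80 MB per open file figure comes from the slide.

```java
// Illustrative sizing helper; the readers-per-disk rule is an assumption.
final class ReaderBudgetSketch {

    // Upper bound on concurrently open ORC files/splits: enough readers to keep
    // the disks busy, but never more than the heap budget can accommodate.
    static int maxOpenFiles(int disks, int readersPerDisk,
                            long heapBudgetMb, int heapPerFileMb) {
        if (disks <= 0 || readersPerDisk <= 0 || heapPerFileMb <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        long byDisk = (long) disks * readersPerDisk;
        long byHeap = heapBudgetMb / heapPerFileMb;
        // Always allow at least one open file so a scan can make progress.
        return (int) Math.max(1, Math.min(byDisk, byHeap));
    }
}
```

For example, with 12 disks, 2 readers per disk, and 1024 MB of heap set aside for scans, the pessimistic 80 MB per file gives a heap-bound limit of 12 open files, while the optimistic 30 MB per file leaves the disk-based limit of 24 in control.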
  • 33. Scaling Up In an Engine
      ▪ An engine must handle concurrency (multiple queries/users)
      ▪ Fixed # of open files per scan leads to OOMs
      ▪ Must gracefully degrade performance instead of OOM
      ▪ Need to limit total open files
      ▪ Multiple issues to deal with
          • Starvation (all queries must make reasonable progress)
          • Stragglers (work for a scan must be evenly balanced across nodes)
          • Adapting parallelism to concurrency
              • Single query must take full advantage of the available resources
              • As concurrency increases, parallelism decreases for ongoing scans
              • When concurrency decreases, parallelism per scan increases
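The "adapting parallelism to concurrency" idea on slide 33 can be sketched as an even split of a global open-file cap across the scans that are currently active. The class and method names are hypothetical, and the even split is an assumption made for illustration; the slides do not describe how the Big SQL resource manager actually divides the budget.

```java
// Illustrative policy: divide a global open-file cap evenly across active scans.
final class ScanParallelismSketch {

    // Number of ORC files each scan may keep open at once. A single scan gets
    // the whole cap; as more scans arrive, each one gets a smaller share, but
    // never less than one file, so every query keeps making progress.
    static int filesPerScan(int totalOpenFileCap, int activeScans) {
        if (totalOpenFileCap <= 0 || activeScans <= 0) {
            throw new IllegalArgumentException("cap and scan count must be positive");
        }
        return Math.max(1, totalOpenFileCap / activeScans);
    }
}
```

With a cap of 32, one scan may open 32 files, four scans get 8 each, and the share grows back as scans finish. Once there are more scans than the cap, the floor of one file per scan would exceed the cap, so a real engine would also need to queue the excess scans.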
  • 34. Big SQL Tuning for ORC
      ▪ The “fundamentals” (regardless of the data format)
          • Ensure Big SQL has enough resources
              • % memory
              • % CPU
              • Temp storage spread across multiple disks (e.g. same disks as HDFS)
          • Use Partitioning
          • Run ANALYZE on all your tables (or ensure AUTO ANALYZE is enabled)
      ▪ ORC-specific tuning (Big SQL 5.0.1 & up)
          • Property bigsql.java.orc.preftp.size controls the max number of open ORC files
      ▪ ORC file creation
          • Data sorted by filtering columns
          • Stripe and row group size
          • Bloom filters
      ▪ For more Big SQL tuning tips, see https://developer.ibm.com/hadoop/2016/11/16/top-6-big-sql-v4-2-performance-tips/
  • 35. Summary
      ▪ ORC = Storage Efficiency + Fast Ingestion
      ▪ Fast ingestion using Vectorized APIs in Apache ORC
      ▪ Big SQL performs best with data in ORC format!
  • 36. Questions?
  • 37. Thank you!