SlideShare a Scribd company logo
HUAWEI TECHNOLOGIES CO., LTD.
CarbonData : A New Hadoop File
Format For Faster Data Analysis
2
Outline
 Use Case & Motivation : Why introducing a new file format?
 CarbonData File Format Deep Dive
 Framework Integrated with CarbonData
 Performance
 Demo
 Future Plan
3
 Full table scan
 Big scan & fast batch processing
 Only fetch a few columns of the table
 Common usage scenario:
 ETL job
 Log Analysis
Use case: Sequential scan
C1 C2 C3 C4 C5 C6 C7
R1
R2
R3
R4
R5
R6
R7
R8
R9
R10
…..
4
 Multi-dimensional data analysis
 Involves aggregation / join
 Roll-up, Drill-down, Slicing and Dicing
 Low-latency ad-hoc query
 Common usage scenario:
 Dash-board reporting
 Fraud & Ad-hoc Analysis
Use case: OLAP-Style Query
C1 C2 C3 C4 C5 C6 C7
R1
R2
R3
R4
R5
R6
R7
R8
R9
R10
R11
5
 Predicate filtering on range of columns
 Full row keys or range of keys lookup
 Narrow scan but might fetch all columns
 Requires second/sub-second level low-latency
 Common usage scenario:
 Operational query
 User profiling
Use case: Random Access
C1 C2 C3 C4 C5 C6 C7
R1
R2
R3
R4
R5
R6
R7
R8
R9
R10
……
6
Motivation
Random Access
(narrow scan)
Sequential Access
(big scan)
OLAP Style Query
(multi-dimensional analysis) CarbonData: A Single File Format
suits for different types of access
7
Design Goals
 Low-Latency for various types of data access pattern
 Allow fast query on fast data
 Ensure Space Efficiency
 General format available on Hadoop-ecosystem
 Read-optimized columnar storage
 Leveraging multi-level Index for low-latency
 Support column group to leverage the benefit of row-based
 Enables dictionary encoding for deferred decoding for aggregation
 Optimized streaming ingestion support
 Broader Integration across Hadoop-ecosystem
CarbonData:
8
Outline
 Use cases & Motivation: Why introducing a new file format?
 CarbonData File Format Deep Dive
 Framework Integrated with CarbonData
 Performance
 Demo
 Future Plan
9
Carbon File
CarbonData File Structure
 Blocklet : A set of rows in columnar format
 Default blocklet size: ~120k rows
 Balance between efficient scan and compression
 Column chunk : Data for one column/column group in a Blocklet
 Allow multiple columns forms a column group & stored as row-based
 Column data stored as sorted index
 Footer : Metadata information
 File level metadata & statistics
 Schema
 Blocklet Index & Blocklet level Metadata
Blocklet 1
Col1 Chunk
Col2 Chunk
…
Colgroup1 Chunk
Colgroup2 Chunk
…
Blocklet N
…
Footer
10
Carbon Data File
Blocklet 1
Column 1 Chunk
Column 2 Chunk
…
ColumnGroup 1 Chunk
ColumnGroup 2 Chunk
…
Blocklet N
File Footer
Blocklet Index
Blocklet 1 Index Node
•Minmax index: min, max
•Multi-dimensional index: startKey,
endKey
Blocklet N Index Node
…
…
Blocklet Info
Blocklet 1 Info
Blocklet N Info
•Column 1 Chunk Info
•Compression scheme
•ColumnFormat
•ColumnID list
•ColumnChunk length
•ColumnChunk offset
…
File Metadata
Version, No. Row, …
Segment Info
Schema
Schema for each column
Blocklet Index
Blocklet Info
ColumnGroup1 Chunk Info
…
…
Format
11
Years Quarters Months Territory Country Quantity Sales
2003 QTR1 Jan EMEA Germany 142 11,432
2003 QTR1 Jan APAC China 541 54,702
2003 QTR1 Jan EMEA Spain 443 44,622
2003 QTR1 Feb EMEA Denmark 545 58,871
2003 QTR1 Feb EMEA Italy 675 56,181
2003 QTR1 Mar APAC India 52 9,749
2003 QTR1 Mar EMEA UK 570 51,018
2003 QTR1 Mar Japan Japan 561 55,245
2003 QTR2 Apr APAC Australia 525 50,398
2003 QTR2 Apr EMEA Germany 144 11,532
[1,1,1,1,1] : [142,11432]
[1,1,1,3,2] : [541,54702]
[1,1,1,1,3] : [443,44622]
[1,1,2,1,4] : [545,58871]
[1,1,2,1,5] : [675,56181]
[1,1,3,3,6] : [52,9749]
[1,1,3,1,7] : [570,51018]
[1,1,3,2,8] : [561,55245]
[1,2,4,3,9] : [525,50398]
[1,2,4,1,1] : [144,11532]
Blocklet
• Data are sorted along MDK (multi-dimensional keys)
• data stored as index in columnar format
Encoding
Blocklet Logical View
Sort
(MDK Index)
[1,1,1,1,1] : [142,11432]
[1,1,1,1,3] : [443,44622]
[1,1,1,3,2] : [541,54702]
[1,1,2,1,4] : [545,58871]
[1,1,2,1,5] : [675,56181]
[1,1,3,1,7] : [570,51018]
[1,1,3,2,8] : [561,55245]
[1,1,3,3,6] : [52,9749]
[1,2,4,1,1] : [144,11532]
[1,2,4,3,9] : [525,50398]
Sorted MDK Index
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
2
2
3
3
3
4
4
1
1
3
1
1
1
2
3
1
3
142
443
541
545
675
570
561
52
144
525
11432
44622
54702
58871
56181
51018
55245
9749
11532
50398
C1 C2 C3 C4 C5 C6 C7
1
3
2
4
5
7
8
6
1
9
12
File Level Blocklet Index
Block 1
1 1 1 1 1 1 12000
1 1 1 2 1 2 5000
1 1 2 1 1 1 12000
1 1 2 2 1 2 5000
1 1 3 1 1 1 12000
1 1 3 2 1 2 5000
Block 2
1 2 1 3 2 3 11000
1 2 2 3 2 3 11000
1 2 3 3 2 3 11000
1 3 1 4 3 4 2000
1 3 1 5 3 4 1000
1 3 2 4 3 4 2000
Block 3
1 3 2 5 3 4 1000
1 3 3 4 3 4 2000
1 3 3 5 3 4 1000
1 4 1 4 1 1 20000
1 4 2 4 1 1 20000
1 4 3 4 1 1 20000
Block 4
2 1 1 1 1 1 12000
2 1 1 2 1 2 5000
2 1 2 1 1 1 12000
2 1 2 2 1 2 5000
2 1 3 1 1 1 12000
2 1 3 2 1 2 5000
Blocklet Index
Block1
Start Key1
End Key1 Start Key1
End Key4
Start Key1
End Key2
Start Key3
End Key4
Start Key1
End Key1
Start Key2
End Key2
Start Key3
End Key3
Start Key4
End Key4
File FooterBlocklet
• Build in-memory file level MDK index tree for filtering
• Major optimization for efficient scan
C1(Min, Max)
….
C7(Min, Max)
Block4
Start Key4
End Key4
C1(Min, Max)
….
C7(Min, Max)
C1(Min,Max)
…
C7(Min,Max)
C1(Min,Max)
…
C7(Min,Max)
C1(Min,Max)
…
C7(Min,Max)
C1(Min,Max)
…
C7(Min,Max)
13
Blocklet Rows
[1|1] :[1|1] :[1|1] :[1|1] :[1|1] : [142]:[11432]
[1|2] :[1|2] :[1|2] :[1|2] :[1|9] : [443]:[44622]
[1|3] :[1|3] :[1|3] :[1|4] :[2|3] : [541]:[54702]
[1|4] :[1|4] :[2|4] :[1|5] :[3|2] : [545]:[58871]
[1|5] :[1|5] :[2|5] :[1|6] :[4|4] : [675]:[56181]
[1|6] :[1|6] :[3|6] :[1|9] :[5|5] : [570]:[51018]
[1|7] :[1|7] :[3|7] :[2|7] :[6|8] : [561]:[55245]
[1|8] :[1|8] :[3|8] :[3|3] :[7|6] : [52]:[9749]
[1|9] :[2|9] :[4|9] :[3|8] :[8|7] : [144]:[11532]
[1|10]:[2|10]:[4|10]:[3|10] :[9|10] : [525]:[50398]
Blocklet
( sort column within column chunk)
Run Length Encoding & Compression
Dim1 Block
1(1-10)
Dim2 Block
1(1-8)
2(9-10)
Dim3 Block
1(1-3)
2(4-5)
3(6-8)
4(9-10)
Dim4 Block
1(1-2,4-6,9)
2(7)
3(3,8,10)
Measure1
Block
Measure2
Block
Dim5 Block
1(1,9)
2(3)
3(2)
4(4)
5(5)
6(8)
7(6)
8(7)
9(10)
Columnar Store
Column chunk Level
inverted Index
[142]:[11432]
[443]:[44622]
[541]:[54702]
[545]:[58871]
[675]:[56181]
[570]:[51018]
[561]:[55245]
[52]:[9749]
[144]:[11532]
[525]:[50398]
Column Chunk Inverted Index
• Optionally store column data as inverted index
within column chunk
• suitable to low cardinality column
• better compression & fast predicate filtering
Blocklet Physical View
1
10
142
443
541
545
675
570
561
52
144
525
11432
44622
54702
58871
56181
51018
55245
9749
11532
50398
C1
d r d r d r d r d r d r
1
10
1
8
2
2
1
10
1
3
2
2
3
3
4
2
1
10
1
6
2
1
3
3
1
2
4
3
9
1
7
1
3
1
…
1
2
2
1
3
1
4
1
5
1
…
1
1
9
1
3
1
2
1
4
1
…
C2 C3 C4 C5 C6 C7
14
10 2 23 23 38 15.2
10 2 50 15 29 18.5
10 3 51 18 52 22.8
11 6 60 29 16 32.9
12 8 68 32 18 21.6
Blocklet 1
C1 C2 C3 C4 C6C5
Col
Chunk
Col
Chunk
Col
Chunk
Col
Chunk
Column Group
• Allow multiple columns form a column group
• stored as a single column chunk in row-
based format
• suitable to set of columns frequently
fetched together
• saving stitching cost for reconstructing
row
Col
Chunk
15
Nested Data Type Representation
• Represented as a composite of two columns
• One column for the element value
• One column for start_index & length of Array
Arrays
• Represented as a composite of finite number
of columns
• Each struct element is a separate column
Struts
Name Array<Ph_Number>
John [192,191]
Sam [121,345,333]
Bob [198,787]
Name Array
[start,len]
Ph_Number
John 0,2 192
Sam 2,3 191
Bob 5,2 121
345
333
198
787
Name Info Strut<age,gender>
John [31,M]
Sam [45,F]
Bob [16,M]
Name Info.age Info.gender
John 31 M
Sam 45 F
Bob 16 M
16
Encoding & Compression
• Efficient encoding scheme supported:
• DELTA, RLE, BIT_PACKED
• Dictionary:
• medium high cardinality: file level dictionary
• very low cardinality: table level global dictionary
• CUSTOM
• Compression Scheme: Snappy
•Speedup Aggregation
•Reduce run-time memory footprint
•Enable deferred decoding
•Enable fast distinct count
Big Win:
17
Outline
 Use Case & Motivation: Why introducing a new file format?
 CarbonData File Format Deep Dive
 Framework Integrated with CarbonData
 Performance
 Demo
 Future Plan
18
CarbonData Modules
Carbon-format
Carbon-core
Reader/Writer
Thrift definition
Carbon-Spark
Integration
Carbon-Hadoop
Input/Output Format
Language Agnostic Format Specification
Core component of format implementation for
reading/writing Carbon data
Provide Hadoop Input/Output Format interface
Integration of Carbon with Spark including
query optimization
19
Spark Integration
• Query CarbonData Table
• DataFrame API
• Spark SQL Statement
• Support schema evolution of Carbon table via ALTER TABLE
• Add, Delete or Rename Column
• schema update only, data stored on disk is untouched
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name [(col_name
data_type [COMMENT col_comment], ...)] [COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment],
...)] STORED BY ‘org.carbondata.hive.CarbonHanlder’
[TBLPROPERTIES (property_name=property_value, ...)] [AS
select_statement];
20
Blocklet
Spark Integration
Table
Block
Footer + Index
Blocklet
Blocklet
…
…
C1 C2 C3 C4 C5 C6 C7 C9
Table Level MDK Tree Index
Inverted
Index
• Query optimization
• Vectorized record reading
• Predicate push down by leveraging multi-level index
• Column Pruning
• Defer decoding for aggregation
Block
Blocklet
Blocklet
Footer + Index
Block
Footer + Index
Blocklet
Blocklet
Block
Blocklet
Blocklet
Footer + Index
21
Data Ingestion
• Bulk Data Ingestion
• CSV file conversion
• MDK clustering level: load level vs. node level
• Save Spark dataframe as Carbon data file
df.write
.format("org.apache.spark.CarbonSource")
.options(Map("dbName" -> "db1", "tableName" ->
"tbl1"))
.mode(SaveMode.Overwrite)
.save(“/path”)
LOAD DATA [LOCAL] INPATH 'folder path' [OVERWRITE]
INTO TABLE tablename
OPTIONS(property_name=property_value, ...)
INSERT INTO TABLE tablennme AS select_statement1
FROM table1;
22
Data Compaction
• Data compaction is used to merge small files
• Re-clustering across loads
• Two types of compactions
- Minor compaction
• Compact adjacent files into a single big file (~HDFS block size)
- Major compaction
• Reorganize adjacent loads to achieve better clustering along MDK index
23
Outline
 Use Case & Motivation: Why introducing a new file format?
 CarbonData File Format Deep Dive
 Framework Integrated with CarbonData
 Performance
 Demo
 Future Plan
24
26.28
12.71
9.82 10.38 11.21
23.05
17.33
15.49
17.82
24.64
107.39
101.62
111.86
9.45
4.41
1.62 2.54
8.16
0.89 0.55 0.52 0.54 1.19 0.16
2.24
4.28
0.00
20.00
40.00
60.00
80.00
100.00
120.00
SQL1 SQL2 SQL3 SQL4 SQL5 SQL6 SQL7 SQL8 SQL9 SQL10 SQL11 SQL12 SQL13
ResponseTime(Seconds)
Benchmark Queries
Carbon vs Popular Columnar Stores
Popular
Columnar Stores
Carbon
Performance comparison
High Throughput/Full Scan Query OLAP/Interactive Query Random Access Query
Data Size : 2TB
1.4x to 6x faster 20x – 33x faster 26x – 688x faster
25
Performance comparison - Observations
High Throughput/Full Scan Query
1.4 to 6 times faster
Deferred decoding enables faster aggregation on the fly.
OLAP/Interactive Query
20 to 33 times faster
MDK, Min-Max and Inverted indices enable block pruning
Deferred decoding enables faster aggregation on the fly.
Random Access Query
26 to 688 times faster
Inverted index enables faster row reconstruction.
Column group eliminates implicit joins for row reconstruction.
26
Outline
 Motivation: Why introducing a new file format?
 CarbonData File Format Deep Dive
 Framework Integrated with CarbonData
 Performance
 Demo
 Future Plan
27
Live Demo
Demo Environment
Number of Nodes 5 VM (AWS r3.4xlarge)
vCPU 80 (16/node)
Memory 500 GiB (100 GiB/node)
#Columns 300
Data Size 600GB
#Records 300M
High Throughput/Full Scan Query
SELECT PROD_BRAND_NAME, SUM(STR_ORD_QTY) FROM
oscon_demo GROUP BY PROD_BRAND_NAME;
OLAP/Interactive query
SELECT PROD_COLOR, SUM(STR_ORD_QTY) FROM oscon_demo
WHERE CUST_COUNTRY ='New Zealand' AND CUST_CITY =
'Auckland' AND PRODUCT_NAME = 'Huawei Honor 4X' GROUP BY
PROD_COLOR;
Random Access Query
SELECT * FROM oscon_demo WHERE CUST_PRFRD_FLG= "Y" AND
PROD_BRAND_NAME = "Huawei" AND PROD_COLOR = "BLACK"
AND CUST_LAST_RVW_DATE = "2015-12-11 00:00:00" AND
CUST_COUNTRY ='New Zealand' AND CUST_CITY = 'Auckland' AND
PRODUCT_NAME = 'Huawei Honor 4X' ;
28
Outline
 Motivation: Why introducing a new file format?
 CarbonData File Format Deep Dive
 Framework Integrated with CarbonData
 Performance
 Demo
 Future Plan
29
Future Plan
• Upgrade to Spark 2.0
• Add append support
• Support pre-aggregated table
• Enable offline IUD support
• Broader Integration across Hadoop-ecosystem
30
Community
• CarbonData is open sourced & will become Apache Incubator project
• Welcome contribution to our Github @:
https://github.com/HuaweiBigData/carbondata
• Main Contributors:
• Jihong MA, Vimal, Raghu, Ramana, Ravindra, Vishal, Aniket, Liang Chenliang, Jacky Likun,
Jarry Qiuheng, David Caiqiang, Eason Linyixin, Ashok, Sujith, Manish, Manohar, Shahid,
Ravikiran, Naresh, Krishna, Babu, Ayush, Santosh, Zhangshunyu, Liujunjie, Zhujing (Huawei)
• Jean-Baptiste Onofre (Talend, ASF member), Henry Saputra (eBay, ASF member),
Uma Maheswara Rao G(Intel, Hadoop PMC)
Thank you
www.huawei.com
Copyright©2014 Huawei Technologies Co., Ltd. All Rights Reserved.
The information in this document may contain predictive statements including, without limitation, statements regarding the
future financial and operating results, future product portfolio, new technology, etc. There are a number of factors that could
cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore,
such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change
the information at any time without notice.

More Related Content

What's hot

Planning for Disaster Recovery (DR) with Galera Cluster
Planning for Disaster Recovery (DR) with Galera ClusterPlanning for Disaster Recovery (DR) with Galera Cluster
Planning for Disaster Recovery (DR) with Galera Cluster
Codership Oy - Creators of Galera Cluster
 
Log Structured Merge Tree
Log Structured Merge TreeLog Structured Merge Tree
Log Structured Merge Tree
University of California, Santa Cruz
 
[REPEAT 1] Deep Dive on Amazon Aurora with MySQL Compatibility (DAT304-R1) - ...
[REPEAT 1] Deep Dive on Amazon Aurora with MySQL Compatibility (DAT304-R1) - ...[REPEAT 1] Deep Dive on Amazon Aurora with MySQL Compatibility (DAT304-R1) - ...
[REPEAT 1] Deep Dive on Amazon Aurora with MySQL Compatibility (DAT304-R1) - ...
Amazon Web Services
 
[오픈소스컨설팅] SELinux : Stop Disabling SELinux
[오픈소스컨설팅] SELinux : Stop Disabling SELinux[오픈소스컨설팅] SELinux : Stop Disabling SELinux
[오픈소스컨설팅] SELinux : Stop Disabling SELinux
Open Source Consulting
 
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin
C* Summit 2013: The World's Next Top Data Model by Patrick McFadinC* Summit 2013: The World's Next Top Data Model by Patrick McFadin
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin
DataStax Academy
 
FOSDEM 2022 MySQL Devroom: MySQL 8.0 - Logical Backups, Snapshots and Point-...
FOSDEM 2022 MySQL Devroom:  MySQL 8.0 - Logical Backups, Snapshots and Point-...FOSDEM 2022 MySQL Devroom:  MySQL 8.0 - Logical Backups, Snapshots and Point-...
FOSDEM 2022 MySQL Devroom: MySQL 8.0 - Logical Backups, Snapshots and Point-...
Frederic Descamps
 
ETL With Cassandra Streaming Bulk Loading
ETL With Cassandra Streaming Bulk LoadingETL With Cassandra Streaming Bulk Loading
ETL With Cassandra Streaming Bulk Loading
alex_araujo
 
My SYSAUX tablespace is full - please help
My SYSAUX tablespace is full - please helpMy SYSAUX tablespace is full - please help
My SYSAUX tablespace is full - please help
Markus Flechtner
 
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Databricks
 
Pgday bdr 천정대
Pgday bdr 천정대Pgday bdr 천정대
Pgday bdr 천정대
PgDay.Seoul
 
Percona Live 2022 - The Evolution of a MySQL Database System
Percona Live 2022 - The Evolution of a MySQL Database SystemPercona Live 2022 - The Evolution of a MySQL Database System
Percona Live 2022 - The Evolution of a MySQL Database System
Frederic Descamps
 
GraphConnect 2022 - Top 10 Cypher Tuning Tips & Tricks.pptx
GraphConnect 2022 - Top 10 Cypher Tuning Tips & Tricks.pptxGraphConnect 2022 - Top 10 Cypher Tuning Tips & Tricks.pptx
GraphConnect 2022 - Top 10 Cypher Tuning Tips & Tricks.pptx
jexp
 
Kafka tiered-storage-meetup-2022-final-presented
Kafka tiered-storage-meetup-2022-final-presentedKafka tiered-storage-meetup-2022-final-presented
Kafka tiered-storage-meetup-2022-final-presented
Sumant Tambe
 
Unit testing of spark applications
Unit testing of spark applicationsUnit testing of spark applications
Unit testing of spark applications
Knoldus Inc.
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
SeungYong Oh
 
MariaDB Galera Cluster - Simple, Transparent, Highly Available
MariaDB Galera Cluster - Simple, Transparent, Highly AvailableMariaDB Galera Cluster - Simple, Transparent, Highly Available
MariaDB Galera Cluster - Simple, Transparent, Highly Available
MariaDB Corporation
 
Understanding my database through SQL*Plus using the free tool eDB360
Understanding my database through SQL*Plus using the free tool eDB360Understanding my database through SQL*Plus using the free tool eDB360
Understanding my database through SQL*Plus using the free tool eDB360
Carlos Sierra
 
Linux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performanceLinux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performance
PostgreSQL-Consulting
 
Introduction to redis
Introduction to redisIntroduction to redis
Introduction to redis
Tanu Siwag
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detail
MIJIN AN
 

What's hot (20)

Planning for Disaster Recovery (DR) with Galera Cluster
Planning for Disaster Recovery (DR) with Galera ClusterPlanning for Disaster Recovery (DR) with Galera Cluster
Planning for Disaster Recovery (DR) with Galera Cluster
 
Log Structured Merge Tree
Log Structured Merge TreeLog Structured Merge Tree
Log Structured Merge Tree
 
[REPEAT 1] Deep Dive on Amazon Aurora with MySQL Compatibility (DAT304-R1) - ...
[REPEAT 1] Deep Dive on Amazon Aurora with MySQL Compatibility (DAT304-R1) - ...[REPEAT 1] Deep Dive on Amazon Aurora with MySQL Compatibility (DAT304-R1) - ...
[REPEAT 1] Deep Dive on Amazon Aurora with MySQL Compatibility (DAT304-R1) - ...
 
[오픈소스컨설팅] SELinux : Stop Disabling SELinux
[오픈소스컨설팅] SELinux : Stop Disabling SELinux[오픈소스컨설팅] SELinux : Stop Disabling SELinux
[오픈소스컨설팅] SELinux : Stop Disabling SELinux
 
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin
C* Summit 2013: The World's Next Top Data Model by Patrick McFadinC* Summit 2013: The World's Next Top Data Model by Patrick McFadin
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin
 
FOSDEM 2022 MySQL Devroom: MySQL 8.0 - Logical Backups, Snapshots and Point-...
FOSDEM 2022 MySQL Devroom:  MySQL 8.0 - Logical Backups, Snapshots and Point-...FOSDEM 2022 MySQL Devroom:  MySQL 8.0 - Logical Backups, Snapshots and Point-...
FOSDEM 2022 MySQL Devroom: MySQL 8.0 - Logical Backups, Snapshots and Point-...
 
ETL With Cassandra Streaming Bulk Loading
ETL With Cassandra Streaming Bulk LoadingETL With Cassandra Streaming Bulk Loading
ETL With Cassandra Streaming Bulk Loading
 
My SYSAUX tablespace is full - please help
My SYSAUX tablespace is full - please helpMy SYSAUX tablespace is full - please help
My SYSAUX tablespace is full - please help
 
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
 
Pgday bdr 천정대
Pgday bdr 천정대Pgday bdr 천정대
Pgday bdr 천정대
 
Percona Live 2022 - The Evolution of a MySQL Database System
Percona Live 2022 - The Evolution of a MySQL Database SystemPercona Live 2022 - The Evolution of a MySQL Database System
Percona Live 2022 - The Evolution of a MySQL Database System
 
GraphConnect 2022 - Top 10 Cypher Tuning Tips & Tricks.pptx
GraphConnect 2022 - Top 10 Cypher Tuning Tips & Tricks.pptxGraphConnect 2022 - Top 10 Cypher Tuning Tips & Tricks.pptx
GraphConnect 2022 - Top 10 Cypher Tuning Tips & Tricks.pptx
 
Kafka tiered-storage-meetup-2022-final-presented
Kafka tiered-storage-meetup-2022-final-presentedKafka tiered-storage-meetup-2022-final-presented
Kafka tiered-storage-meetup-2022-final-presented
 
Unit testing of spark applications
Unit testing of spark applicationsUnit testing of spark applications
Unit testing of spark applications
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
 
MariaDB Galera Cluster - Simple, Transparent, Highly Available
MariaDB Galera Cluster - Simple, Transparent, Highly AvailableMariaDB Galera Cluster - Simple, Transparent, Highly Available
MariaDB Galera Cluster - Simple, Transparent, Highly Available
 
Understanding my database through SQL*Plus using the free tool eDB360
Understanding my database through SQL*Plus using the free tool eDB360Understanding my database through SQL*Plus using the free tool eDB360
Understanding my database through SQL*Plus using the free tool eDB360
 
Linux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performanceLinux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performance
 
Introduction to redis
Introduction to redisIntroduction to redis
Introduction to redis
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detail
 

Similar to Introducing Apache Carbon Data - Hadoop Native Columnar Data Format

Apache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysisApache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysis
liang chen
 
Druid at naver.com - part 1
Druid at naver.com - part 1Druid at naver.com - part 1
Druid at naver.com - part 1
Jungsu Heo
 
Web analytics at scale with Druid at naver.com
Web analytics at scale with Druid at naver.comWeb analytics at scale with Druid at naver.com
Web analytics at scale with Druid at naver.com
Jungsu Heo
 
Apache Cassandra at Macys
Apache Cassandra at MacysApache Cassandra at Macys
Apache Cassandra at Macys
DataStax Academy
 
Virtual Knowledge Graphs for Federated Log Analysis
Virtual Knowledge Graphs for Federated Log AnalysisVirtual Knowledge Graphs for Federated Log Analysis
Virtual Knowledge Graphs for Federated Log Analysis
Kabul Kurniawan
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
Amazon Web Services
 
Amazon Redshift
Amazon Redshift Amazon Redshift
Amazon Redshift
Amazon Web Services
 
VMworld 2013: Deep Dive into vSphere Log Management with vCenter Log Insight
VMworld 2013: Deep Dive into vSphere Log Management with vCenter Log InsightVMworld 2013: Deep Dive into vSphere Log Management with vCenter Log Insight
VMworld 2013: Deep Dive into vSphere Log Management with vCenter Log Insight
VMworld
 
Advanced Index, Partitioning and Compression Strategies for SQL Server
Advanced Index, Partitioning and Compression Strategies for SQL ServerAdvanced Index, Partitioning and Compression Strategies for SQL Server
Advanced Index, Partitioning and Compression Strategies for SQL Server
Confio Software
 
How to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and AnalyticsHow to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and Analytics
DataWorks Summit
 
Introdução ao data warehouse Amazon Redshift
Introdução ao data warehouse Amazon RedshiftIntrodução ao data warehouse Amazon Redshift
Introdução ao data warehouse Amazon Redshift
Amazon Web Services LATAM
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
Steve Caron
 
SQL Server 2014 for Developers (Cristian Lefter)
SQL Server 2014 for Developers (Cristian Lefter)SQL Server 2014 for Developers (Cristian Lefter)
SQL Server 2014 for Developers (Cristian Lefter)
ITCamp
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
Databricks
 
optimizing_ceph_flash
optimizing_ceph_flashoptimizing_ceph_flash
optimizing_ceph_flash
Vijayendra Shamanna
 
Hpverticacertificationguide 150322232921-conversion-gate01
Hpverticacertificationguide 150322232921-conversion-gate01Hpverticacertificationguide 150322232921-conversion-gate01
Hpverticacertificationguide 150322232921-conversion-gate01
Anvith S. Upadhyaya
 
Hp vertica certification guide
Hp vertica certification guideHp vertica certification guide
Hp vertica certification guide
neinamat
 
MySQL 5.6 - Operations and Diagnostics Improvements
MySQL 5.6 - Operations and Diagnostics ImprovementsMySQL 5.6 - Operations and Diagnostics Improvements
MySQL 5.6 - Operations and Diagnostics Improvements
Morgan Tocker
 
Making sense of your data jug
Making sense of your data   jugMaking sense of your data   jug
Making sense of your data jug
Gerald Muecke
 

Similar to Introducing Apache Carbon Data - Hadoop Native Columnar Data Format (20)

Apache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysisApache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysis
 
Druid at naver.com - part 1
Druid at naver.com - part 1Druid at naver.com - part 1
Druid at naver.com - part 1
 
Web analytics at scale with Druid at naver.com
Web analytics at scale with Druid at naver.comWeb analytics at scale with Druid at naver.com
Web analytics at scale with Druid at naver.com
 
Apache Cassandra at Macys
Apache Cassandra at MacysApache Cassandra at Macys
Apache Cassandra at Macys
 
Virtual Knowledge Graphs for Federated Log Analysis
Virtual Knowledge Graphs for Federated Log AnalysisVirtual Knowledge Graphs for Federated Log Analysis
Virtual Knowledge Graphs for Federated Log Analysis
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
 
Amazon Redshift
Amazon Redshift Amazon Redshift
Amazon Redshift
 
VMworld 2013: Deep Dive into vSphere Log Management with vCenter Log Insight
VMworld 2013: Deep Dive into vSphere Log Management with vCenter Log InsightVMworld 2013: Deep Dive into vSphere Log Management with vCenter Log Insight
VMworld 2013: Deep Dive into vSphere Log Management with vCenter Log Insight
 
Advanced Index, Partitioning and Compression Strategies for SQL Server
Advanced Index, Partitioning and Compression Strategies for SQL ServerAdvanced Index, Partitioning and Compression Strategies for SQL Server
Advanced Index, Partitioning and Compression Strategies for SQL Server
 
How to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and AnalyticsHow to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and Analytics
 
Introdução ao data warehouse Amazon Redshift
Introdução ao data warehouse Amazon RedshiftIntrodução ao data warehouse Amazon Redshift
Introdução ao data warehouse Amazon Redshift
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
 
SQL Server 2014 for Developers (Cristian Lefter)
SQL Server 2014 for Developers (Cristian Lefter)SQL Server 2014 for Developers (Cristian Lefter)
SQL Server 2014 for Developers (Cristian Lefter)
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
optimizing_ceph_flash
optimizing_ceph_flashoptimizing_ceph_flash
optimizing_ceph_flash
 
Hpverticacertificationguide 150322232921-conversion-gate01
Hpverticacertificationguide 150322232921-conversion-gate01Hpverticacertificationguide 150322232921-conversion-gate01
Hpverticacertificationguide 150322232921-conversion-gate01
 
Hp vertica certification guide
Hp vertica certification guideHp vertica certification guide
Hp vertica certification guide
 
MySQL 5.6 - Operations and Diagnostics Improvements
MySQL 5.6 - Operations and Diagnostics ImprovementsMySQL 5.6 - Operations and Diagnostics Improvements
MySQL 5.6 - Operations and Diagnostics Improvements
 
Making sense of your data jug
Making sense of your data   jugMaking sense of your data   jug
Making sense of your data jug
 

Recently uploaded

Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
74nqk8xf
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 

Recently uploaded (20)

Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 

Introducing Apache Carbon Data - Hadoop Native Columnar Data Format

  • 1. HUAWEI TECHNOLOGIES CO., LTD. CarbonData : A New Hadoop File Format For Faster Data Analysis
  • 2. 2 Outline  Use Case & Motivation : Why introducing a new file format?  CarbonData File Format Deep Dive  Framework Integrated with CarbonData  Performance  Demo  Future Plan
  • 3. 3  Full table scan  Big scan & fast batch processing  Only fetch a few columns of the table  Common usage scenario:  ETL job  Log Analysis Use case: Sequential scan C1 C2 C3 C4 C5 C6 C7 R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 …..
  • 4. 4  Multi-dimensional data analysis  Involves aggregation / join  Roll-up, Drill-down, Slicing and Dicing  Low-latency ad-hoc query  Common usage scenario:  Dash-board reporting  Fraud & Ad-hoc Analysis Use case: OLAP-Style Query C1 C2 C3 C4 C5 C6 C7 R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11
  • 5. 5  Predicate filtering on range of columns  Full row keys or range of keys lookup  Narrow scan but might fetch all columns  Requires second/sub-second level low-latency  Common usage scenario:  Operational query  User profiling Use case: Random Access C1 C2 C3 C4 C5 C6 C7 R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 ……
  • 6. 6 Motivation Random Access (narrow scan) Sequential Access (big scan) OLAP Style Query (multi-dimensional analysis) CarbonData: A Single File Format suits for different types of access
  • 7. 7 Design Goals  Low-Latency for various types of data access pattern  Allow fast query on fast data  Ensure Space Efficiency  General format available on Hadoop-ecosystem  Read-optimized columnar storage  Leveraging multi-level Index for low-latency  Support column group to leverage the benefit of row-based  Enables dictionary encoding for deferred decoding for aggregation  Optimized streaming ingestion support  Broader Integration across Hadoop-ecosystem CarbonData:
  • 8. 8 Outline  Use cases & Motivation: Why introducing a new file format?  CarbonData File Format Deep Dive  Framework Integrated with CarbonData  Performance  Demo  Future Plan
  • 9. 9 Carbon File CarbonData File Structure  Blocklet : A set of rows in columnar format  Default blocklet size: ~120k rows  Balance between efficient scan and compression  Column chunk : Data for one column/column group in a Blocklet  Allow multiple columns forms a column group & stored as row-based  Column data stored as sorted index  Footer : Metadata information  File level metadata & statistics  Schema  Blocklet Index & Blocklet level Metadata Blocklet 1 Col1 Chunk Col2 Chunk … Colgroup1 Chunk Colgroup2 Chunk … Blocklet N … Footer
  • 10. 10 Carbon Data File Blocklet 1 Column 1 Chunk Column 2 Chunk … ColumnGroup 1 Chunk ColumnGroup 2 Chunk … Blocklet N File Footer Blocklet Index Blocklet 1 Index Node •Minmax index: min, max •Multi-dimensional index: startKey, endKey Blocklet N Index Node … … Blocklet Info Blocklet 1 Info Blocklet N Info •Column 1 Chunk Info •Compression scheme •ColumnFormat •ColumnID list •ColumnChunk length •ColumnChunk offset … File Metadata Version, No. Row, … Segment Info Schema Schema for each column Blocklet Index Blocklet Info ColumnGroup1 Chunk Info … … Format
  • 11. 11 Years Quarters Months Territory Country Quantity Sales 2003 QTR1 Jan EMEA Germany 142 11,432 2003 QTR1 Jan APAC China 541 54,702 2003 QTR1 Jan EMEA Spain 443 44,622 2003 QTR1 Feb EMEA Denmark 545 58,871 2003 QTR1 Feb EMEA Italy 675 56,181 2003 QTR1 Mar APAC India 52 9,749 2003 QTR1 Mar EMEA UK 570 51,018 2003 QTR1 Mar Japan Japan 561 55,245 2003 QTR2 Apr APAC Australia 525 50,398 2003 QTR2 Apr EMEA Germany 144 11,532 [1,1,1,1,1] : [142,11432] [1,1,1,3,2] : [541,54702] [1,1,1,1,3] : [443,44622] [1,1,2,1,4] : [545,58871] [1,1,2,1,5] : [675,56181] [1,1,3,3,6] : [52,9749] [1,1,3,1,7] : [570,51018] [1,1,3,2,8] : [561,55245] [1,2,4,3,9] : [525,50398] [1,2,4,1,1] : [144,11532] Blocklet • Data are sorted along MDK (multi-dimensional keys) • data stored as index in columnar format Encoding Blocklet Logical View Sort (MDK Index) [1,1,1,1,1] : [142,11432] [1,1,1,1,3] : [443,44622] [1,1,1,3,2] : [541,54702] [1,1,2,1,4] : [545,58871] [1,1,2,1,5] : [675,56181] [1,1,3,1,7] : [570,51018] [1,1,3,2,8] : [561,55245] [1,1,3,3,6] : [52,9749] [1,2,4,1,1] : [144,11532] [1,2,4,3,9] : [525,50398] Sorted MDK Index 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 2 3 3 3 4 4 1 1 3 1 1 1 2 3 1 3 142 443 541 545 675 570 561 52 144 525 11432 44622 54702 58871 56181 51018 55245 9749 11532 50398 C1 C2 C3 C4 C5 C6 C7 1 3 2 4 5 7 8 6 1 9
  • 12. 12 File Level Blocklet Index Block 1 1 1 1 1 1 1 12000 1 1 1 2 1 2 5000 1 1 2 1 1 1 12000 1 1 2 2 1 2 5000 1 1 3 1 1 1 12000 1 1 3 2 1 2 5000 Block 2 1 2 1 3 2 3 11000 1 2 2 3 2 3 11000 1 2 3 3 2 3 11000 1 3 1 4 3 4 2000 1 3 1 5 3 4 1000 1 3 2 4 3 4 2000 Block 3 1 3 2 5 3 4 1000 1 3 3 4 3 4 2000 1 3 3 5 3 4 1000 1 4 1 4 1 1 20000 1 4 2 4 1 1 20000 1 4 3 4 1 1 20000 Block 4 2 1 1 1 1 1 12000 2 1 1 2 1 2 5000 2 1 2 1 1 1 12000 2 1 2 2 1 2 5000 2 1 3 1 1 1 12000 2 1 3 2 1 2 5000 Blocklet Index Block1 Start Key1 End Key1 Start Key1 End Key4 Start Key1 End Key2 Start Key3 End Key4 Start Key1 End Key1 Start Key2 End Key2 Start Key3 End Key3 Start Key4 End Key4 File FooterBlocklet • Build in-memory file level MDK index tree for filtering • Major optimization for efficient scan C1(Min, Max) …. C7(Min, Max) Block4 Start Key4 End Key4 C1(Min, Max) …. C7(Min, Max) C1(Min,Max) … C7(Min,Max) C1(Min,Max) … C7(Min,Max) C1(Min,Max) … C7(Min,Max) C1(Min,Max) … C7(Min,Max)
  • 13. 13 Blocklet Rows [1|1] :[1|1] :[1|1] :[1|1] :[1|1] : [142]:[11432] [1|2] :[1|2] :[1|2] :[1|2] :[1|9] : [443]:[44622] [1|3] :[1|3] :[1|3] :[1|4] :[2|3] : [541]:[54702] [1|4] :[1|4] :[2|4] :[1|5] :[3|2] : [545]:[58871] [1|5] :[1|5] :[2|5] :[1|6] :[4|4] : [675]:[56181] [1|6] :[1|6] :[3|6] :[1|9] :[5|5] : [570]:[51018] [1|7] :[1|7] :[3|7] :[2|7] :[6|8] : [561]:[55245] [1|8] :[1|8] :[3|8] :[3|3] :[7|6] : [52]:[9749] [1|9] :[2|9] :[4|9] :[3|8] :[8|7] : [144]:[11532] [1|10]:[2|10]:[4|10]:[3|10] :[9|10] : [525]:[50398] Blocklet ( sort column within column chunk) Run Length Encoding & Compression Dim1 Block 1(1-10) Dim2 Block 1(1-8) 2(9-10) Dim3 Block 1(1-3) 2(4-5) 3(6-8) 4(9-10) Dim4 Block 1(1-2,4-6,9) 2(7) 3(3,8,10) Measure1 Block Measure2 Block Dim5 Block 1(1,9) 2(3) 3(2) 4(4) 5(5) 6(8) 7(6) 8(7) 9(10) Columnar Store Column chunk Level inverted Index [142]:[11432] [443]:[44622] [541]:[54702] [545]:[58871] [675]:[56181] [570]:[51018] [561]:[55245] [52]:[9749] [144]:[11532] [525]:[50398] Column Chunk Inverted Index • Optionally store column data as inverted index within column chunk • suitable to low cardinality column • better compression & fast predicate filtering Blocklet Physical View 1 10 142 443 541 545 675 570 561 52 144 525 11432 44622 54702 58871 56181 51018 55245 9749 11532 50398 C1 d r d r d r d r d r d r 1 10 1 8 2 2 1 10 1 3 2 2 3 3 4 2 1 10 1 6 2 1 3 3 1 2 4 3 9 1 7 1 3 1 … 1 2 2 1 3 1 4 1 5 1 … 1 1 9 1 3 1 2 1 4 1 … C2 C3 C4 C5 C6 C7
  • 14. 14 10 2 23 23 38 15.2 10 2 50 15 29 18.5 10 3 51 18 52 22.8 11 6 60 29 16 32.9 12 8 68 32 18 21.6 Blocklet 1 C1 C2 C3 C4 C6C5 Col Chunk Col Chunk Col Chunk Col Chunk Column Group • Allow multiple columns form a column group • stored as a single column chunk in row- based format • suitable to set of columns frequently fetched together • saving stitching cost for reconstructing row Col Chunk
  • 15. 15 Nested Data Type Representation • Represented as a composite of two columns • One column for the element value • One column for start_index & length of Array Arrays • Represented as a composite of finite number of columns • Each struct element is a separate column Struts Name Array<Ph_Number> John [192,191] Sam [121,345,333] Bob [198,787] Name Array [start,len] Ph_Number John 0,2 192 Sam 2,3 191 Bob 5,2 121 345 333 198 787 Name Info Strut<age,gender> John [31,M] Sam [45,F] Bob [16,M] Name Info.age Info.gender John 31 M Sam 45 F Bob 16 M
  • 16. 16 Encoding & Compression • Efficient encoding scheme supported: • DELTA, RLE, BIT_PACKED • Dictionary: • medium high cardinality: file level dictionary • very low cardinality: table level global dictionary • CUSTOM • Compression Scheme: Snappy •Speedup Aggregation •Reduce run-time memory footprint •Enable deferred decoding •Enable fast distinct count Big Win:
  • 17. 17 Outline  Use Case & Motivation: Why introducing a new file format?  CarbonData File Format Deep Dive  Framework Integrated with CarbonData  Performance  Demo  Future Plan
  • 18. 18 CarbonData Modules Carbon-format Carbon-core Reader/Writer Thrift definition Carbon-Spark Integration Carbon-Hadoop Input/Output Format Language Agnostic Format Specification Core component of format implementation for reading/writing Carbon data Provide Hadoop Input/Output Format interface Integration of Carbon with Spark including query optimization
  • 19. 19 Spark Integration • Query CarbonData Table • DataFrame API • Spark SQL Statement • Support schema evolution of Carbon table via ALTER TABLE • Add, Delete or Rename Column • schema update only, data stored on disk is untouched CREATE TABLE [IF NOT EXISTS] [db_name.]table_name [(col_name data_type [COMMENT col_comment], ...)] [COMMENT table_comment] [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] STORED BY ‘org.carbondata.hive.CarbonHanlder’ [TBLPROPERTIES (property_name=property_value, ...)] [AS select_statement];
  • 20. 20 Blocklet Spark Integration Table Block Footer + Index Blocklet Blocklet … … C1 C2 C3 C4 C5 C6 C7 C9 Table Level MDK Tree Index Inverted Index • Query optimization • Vectorized record reading • Predicate push down by leveraging multi-level index • Column Pruning • Defer decoding for aggregation Block Blocklet Blocklet Footer + Index Block Footer + Index Blocklet Blocklet Block Blocklet Blocklet Footer + Index
  • 21. 21 Data Ingestion • Bulk Data Ingestion • CSV file conversion • MDK clustering level: load level vs. node level • Save Spark dataframe as Carbon data file df.write .format("org.apache.spark.CarbonSource") .options(Map("dbName" -> "db1", "tableName" -> "tbl1")) .mode(SaveMode.Overwrite) .save(“/path”) LOAD DATA [LOCAL] INPATH 'folder path' [OVERWRITE] INTO TABLE tablename OPTIONS(property_name=property_value, ...) INSERT INTO TABLE tablennme AS select_statement1 FROM table1;
  • 22. 22 Data Compaction • Data compaction is used to merge small files • Re-clustering across loads • Two types of compactions - Minor compaction • Compact adjacent files into a single big file (~HDFS block size) - Major compaction • Reorganize adjacent loads to achieve better clustering along MDK index
  • 23. 23 Outline  Use Case & Motivation: Why introducing a new file format?  CarbonData File Format Deep Dive  Framework Integrated with CarbonData  Performance  Demo  Future Plan
  • 24. 24 26.28 12.71 9.82 10.38 11.21 23.05 17.33 15.49 17.82 24.64 107.39 101.62 111.86 9.45 4.41 1.62 2.54 8.16 0.89 0.55 0.52 0.54 1.19 0.16 2.24 4.28 0.00 20.00 40.00 60.00 80.00 100.00 120.00 SQL1 SQL2 SQL3 SQL4 SQL5 SQL6 SQL7 SQL8 SQL9 SQL10 SQL11 SQL12 SQL13 ResponseTime(Seconds) Benchmark Queries Carbon vs Popular Columnar Stores Popular Columnar Stores Carbon Performance comparison High Throughput/Full Scan Query OLAP/Interactive Query Random Access Query Data Size : 2TB 1.4x to 6x faster 20x – 33x faster 26x – 688x faster
  • 25. 25 Performance comparison - Observations High Throughput/Full Scan Query 1.4 to 6 times faster Deferred decoding enables faster aggregation on the fly. OLAP/Interactive Query 20 to 33 times faster MDK, Min-Max and Inverted indices enable block pruning Deferred decoding enables faster aggregation on the fly. Random Access Query 26 to 688 times faster Inverted index enables faster row reconstruction. Column group eliminates implicit joins for row reconstruction.
  • 26. 26 Outline  Motivation: Why introducing a new file format?  CarbonData File Format Deep Dive  Framework Integrated with CarbonData  Performance  Demo  Future Plan
  • 27. 27 Live Demo Demo Environment Number of Nodes 5 VM (AWS r3.4xlarge) vCPU 80 (16/node) Memory 500 GiB (100 GiB/node) #Columns 300 Data Size 600GB #Records 300M High Throughput/Full Scan Query SELECT PROD_BRAND_NAME, SUM(STR_ORD_QTY) FROM oscon_demo GROUP BY PROD_BRAND_NAME; OLAP/Interactive query SELECT PROD_COLOR, SUM(STR_ORD_QTY) FROM oscon_demo WHERE CUST_COUNTRY ='New Zealand' AND CUST_CITY = 'Auckland' AND PRODUCT_NAME = 'Huawei Honor 4X' GROUP BY PROD_COLOR; Random Access Query SELECT * FROM oscon_demo WHERE CUST_PRFRD_FLG= "Y" AND PROD_BRAND_NAME = "Huawei" AND PROD_COLOR = "BLACK" AND CUST_LAST_RVW_DATE = "2015-12-11 00:00:00" AND CUST_COUNTRY ='New Zealand' AND CUST_CITY = 'Auckland' AND PRODUCT_NAME = 'Huawei Honor 4X' ;
  • 28. 28 Outline  Motivation: Why introducing a new file format?  CarbonData File Format Deep Dive  Framework Integrated with CarbonData  Performance  Demo  Future Plan
  • 29. 29 Future Plan • Upgrade to Spark 2.0 • Add append support • Support pre-aggregated table • Enable offline IUD support • Broader Integration across Hadoop-ecosystem
  • 30. 30 Community • CarbonData is open sourced & will become Apache Incubator project • Welcome contribution to our Github @: https://github.com/HuaweiBigData/carbondata • Main Contributors: • Jihong MA, Vimal, Raghu, Ramana, Ravindra, Vishal, Aniket, Liang Chenliang, Jacky Likun, Jarry Qiuheng, David Caiqiang, Eason Linyixin, Ashok, Sujith, Manish, Manohar, Shahid, Ravikiran, Naresh, Krishna, Babu, Ayush, Santosh, Zhangshunyu, Liujunjie, Zhujing (Huawei) • Jean-Baptiste Onofre (Talend, ASF member), Henry Saputra (eBay, ASF member), Uma Maheswara Rao G(Intel, Hadoop PMC)
  • 31. Thank you www.huawei.com Copyright©2014 Huawei Technologies Co., Ltd. All Rights Reserved. The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice.