Apache CarbonData+Spark to realize data convergence and Unified high performance Data Analytics - Raghunandan Subramanya (Huawei Technologies India Pvt Ltd)

Data convergence and
Unified high performance
Data Analytics with Apache CarbonData + Spark
Presented By : Raghunandan S

I am …
Raghunandan S
• Head of the technical team working on Bigdata technologies including Hadoop, Spark, HBase, CarbonData and ZooKeeper
• Systems Group leader of the team which developed a BI solution based on MOLAP
• Project Manager of the CarbonData team, responsible for cleaning up and re-architecting the code, incubating it into Apache and supporting it till it reaches Top Level Project
• Prior experience in NMS, Softswitch products
Chief Architect at Huawei’s India R&D centre’s BigData group
raghunandan@apache.org

Use cases & Challenges
Unified Data Analytics
CarbonData Technology & Solution
CarbonData Roadmap
CarbonData Community

Use Cases and Challenges
Background of CarbonData project

Network
• 54B records per day
• 750TB per month
• Complex correlated data
Consumer
• 100 thousands of sensors
• >2 million events per second
• Time series, geospatial data
Enterprise & Banking
• 100 GB to TB per day
• Data across different regions
Data
Report & Dashboard
OLAP & Ad-hoc
Batch processing
Machine learning
Real Time Analytics
Graph analysis
Time series analysis
Geo spatial and position analysis
Textual matching and analysis
Fraud detection
Traffic analysis
User behavior
Advertising
Failure detection
Home automation
Data analysis use cases

Big Data and Analytics Challenges
➢ Data Size
• Single Table >10 billion rows
➢ Multi-dimensional
• Wide table: > 100 dimensions
➢ Rich in Detail
• High cardinality (> 1 billion)
• 1B terminal * 200K cell / minute
• Complex & Nested Structure
➢ Real Time
• Time Series streaming data
➢ Reporting
• Big Scan
• Distributed Compute
➢ Searching
• Located Scan
• Isolated Compute
➢ AI
• Iterative Scan
• Iterative Compute

[Diagram labels: Input Sources; ETL pipelines; Data stores (No-SQL, Document store, TXT, Data Warehouse, HOT, COLD, Time series database); Storage; Real time; Query Processing; Analysis (chart, 2015 to 2018)]
• Multiple Stores for specific analysis
• Complex data pipelines and high maintenance cost

• Coordination across multiple teams is needed, making Self-BI tough to achieve for complex and big data
[Diagram labels: IT, Business, Data Engineers, Data Analysts; each faces Complex Pipelines, Various stores, Business needs]

• Data lake
• Performance not comparable to data warehouse
• Variety of Scans
• Corrections/Updates to data
• Real time and history data analysis
• Consistency of query results
• With all the above challenges, the data still needs to be simple to analyze
• Store now and analyze later.
• Will also facilitate Self assist BI.
Data lake

• Real time and History data
• Data consistency and Duplication
• Materialized Views or indexes
[Diagram: Real time data and History data feed Query Processing (History + Real time), followed by Analysis (chart, 2015 to 2018)]

Unified Data Analytics
Solution for data silos; data redundancy; Multiple use cases

Unified data analytics
Store in CarbonData format and Analyze on need basis
What is Apache CarbonData?
[Diagram labels: Real time data; CarbonData format (Data, Indexes, Data maps); Query Processing (History + Real time); Analysis (charts)]
Unified data analysis and Unified storage format

CarbonData Solution
Technology, Features, Design adopted for Unified Storage

CarbonData Solution & Features
High Scalability
• Storage and processing separated
• Suitable for cloud
Data consistency
• No intermediate data on failures
• Supported for every operation on data (Query, IUD, Indexes, Data maps, Streaming)
ACID
Insert - Update - Delete
• Support bulk Insert, Update, Deletes
Fast Query Execution
• Filters based on Multi-level Indexes
• Columnar format
• Query processing on dictionary encoded values
AI, Reporting, Searching
Unified Storage
Fast: Construct multi-dimension indexes
Faster: Based on fast intelligent scanning
Efficient data compression: dictionary encoding
Concurrent data import: Spark parallel task
CarbonData Solution & Features
Decoupled Storage & Processing
• Data in CarbonData format
• Processing in Distributed frameworks
Real time + History data
• Real time and history data in the same table
• Query can combine results automatically
• Auto handoff to columnar format
Multi level Index
• Supports Multi-dimensional Btree
• Min-max and inverted indexes
External Indexes for Optimizations
• External Indexes like Lucene for text
• MV Data maps for aggregate tables, time series queries
• Query auto selects the required data map
Carbon Data File
File Header: Version, Schema
Blocklet 1: Column 1 Chunk, Column 2 Chunk, …, Column n Chunk (each chunk: header, Page 1, Page 2, …)
…
Blocklet N
File Footer: Blocklet Offset, Index & Stats
CarbonData File Layout
➢ Data Layout
• Block : A HDFS file
• Blocklet : A set of rows in columnar format
o Column chunk : Data for one column in a blocklet; smallest IO
unit
o Page : Data Page in the Column chunk; smallest decoding unit
➢ Metadata Information
• Header : Version, Schema
• Footer : Blocklet Offset, Index & File level statistics
➢ Built-in indexes and statistics
• Blocklet Index : B-Tree start key, end key
• Blocklet & Page level statistics : min, max etc
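The layout above can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the real on-disk format: the class names, integer-only columns and the `blocklets_to_scan` helper are hypothetical. It shows how min/max statistics kept in the footer let a reader skip blocklets without touching their column chunks.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ColumnChunk:
    pages: List[List[int]]  # each page is the smallest decoding unit

    def min_max(self) -> Tuple[int, int]:
        values = [v for page in self.pages for v in page]
        return min(values), max(values)

@dataclass
class Blocklet:
    chunks: Dict[str, ColumnChunk]  # column name -> column chunk

@dataclass
class CarbonFileSketch:
    version: int                    # header
    schema: List[str]               # header
    blocklets: List[Blocklet]
    footer_stats: List[Dict[str, Tuple[int, int]]] = field(default_factory=list)

    def build_footer(self) -> None:
        """Collect per-blocklet (min, max) for every column into the footer."""
        self.footer_stats = [
            {name: chunk.min_max() for name, chunk in b.chunks.items()}
            for b in self.blocklets
        ]

    def blocklets_to_scan(self, column: str, value: int) -> List[int]:
        """Indexes of blocklets whose [min, max] range may contain value."""
        return [i for i, stats in enumerate(self.footer_stats)
                if stats[column][0] <= value <= stats[column][1]]

carbon_file = CarbonFileSketch(
    version=3,
    schema=["c1", "c2"],
    blocklets=[
        Blocklet({"c1": ColumnChunk([[1, 2], [3, 4]]), "c2": ColumnChunk([[10, 20], [30, 40]])}),
        Blocklet({"c1": ColumnChunk([[5, 6], [7, 8]]), "c2": ColumnChunk([[50, 60], [70, 80]])}),
    ],
)
carbon_file.build_footer()
print(carbon_file.blocklets_to_scan("c1", 6))
```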

Timeseries, OLAP, Graph, Text, Geospatial, AI
Data
CarbonData
CarbonTable DataMap
DataMap DSL
CarbonData Engine
• Stores Data / Metadata / Index / Statistics for Query optimization.
• Auto selects the required data map during Query.
• Updated Inline or offline based on configuration.
• Supports custom data maps.
DataMaps

Index Datamap(s):
• Lucene Datamap: Text search
• Rtree Datamap: Spatial Queries
• Min-Max Datamap, BloomFilter Datamap: Point and Filter Queries
MV Datamap(s):
• Time series Cube Datamap: Time series Queries
• Aggregate table Datamap, Materialized Views: OLAP
CREATE DATAMAP agg_sales ON TABLE sales USING "preaggregate" AS SELECT country, sum(quantity), avg(price) FROM sales GROUP BY country
CREATE DATAMAP datamap_name ON TABLE main_table USING 'lucene' DMPROPERTIES ('index_columns'='city, name', ...)
DataMaps
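The idea behind the agg_sales pre-aggregate datamap can be sketched in Python. This is an illustration only: the function names and sample rows are assumptions, and it does not describe how CarbonData stores or maintains the aggregate. The sketch keeps sum and count per group so that avg(price) can be derived without rescanning the sales table.

```python
from collections import defaultdict

def build_preaggregate(rows):
    """country -> [sum(quantity), sum(price), row count]."""
    agg = defaultdict(lambda: [0, 0.0, 0])
    for country, quantity, price in rows:
        entry = agg[country]
        entry[0] += quantity
        entry[1] += price
        entry[2] += 1
    return dict(agg)

def query_from_datamap(agg):
    """Answer SELECT country, sum(quantity), avg(price) ... GROUP BY country
    from the aggregate alone; avg is derived as sum / count."""
    return {country: (qty, price_sum / n) for country, (qty, price_sum, n) in agg.items()}

# Hypothetical rows of (country, quantity, price)
sales = [("IN", 2, 10.0), ("US", 1, 30.0), ("IN", 3, 20.0)]
agg_sales = build_preaggregate(sales)
result = query_from_datamap(agg_sales)
print(result)
```

Storing sum and count rather than the average itself keeps the aggregate mergeable when new rows are loaded.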

DataMap Execution Model
[Diagram: a Query reaches the Spark Driver; DataMap Execution is centralized or distributed; Executors run on Data Nodes, each holding Carbon Files and their Data Maps]
• Mainly includes Index DataMap and MV DataMap
• Push filter and projection to DataMap
• DataMap can be centralized or distributed (avoiding huge memory issues)
• DataMap can be cached in memory or stored on disk
[Diagram: the Spark Driver holds Catalyst and the Table Level Index; each Executor holds a File Level Index & Scanner over Carbon Files (Data, Footer); Inverted Index]
Rich Multi-Level Index Support
• Using the index info in the footer, a two level B+ tree index can be built:
  • File level index: local B+ tree, efficient blocklet level filtering
  • Table level index: global B+ tree, efficient file level filtering
• Column chunk inverted index: efficient column chunk scan
[Diagram: the Spark Driver prunes files and blocklets using the file footers on HDFS; Spark Executor tasks scan column chunks C1 … Cn of the remaining blocklets]
1. File pruning
2. Blocklet pruning
3. Read and decompress filter column
4. Binary search using inverted index, skip to next blocklet if nothing matches
5. Decompress projection column
6. Return decoded data to Spark
SELECT c3, c4 FROM t1 WHERE c2='boston'
Efficient Filtering via Index
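Steps 3 to 6 can be sketched in Python for the query above. The blocklet contents are made up, and sorting at query time stands in for reading a stored, already sorted column chunk with its inverted index; none of the names below come from CarbonData.

```python
from bisect import bisect_left, bisect_right

def build_chunk(values):
    """Sorted values plus the original row ids (a simple inverted index)."""
    order = sorted(range(len(values)), key=lambda r: values[r])
    return [values[r] for r in order], order

def scan_blocklet(blocklet, filter_col, literal, projection):
    sorted_vals, row_ids = build_chunk(blocklet[filter_col])  # step 3: filter column
    lo = bisect_left(sorted_vals, literal)                    # step 4: binary search
    hi = bisect_right(sorted_vals, literal)
    if lo == hi:
        return []                                             # nothing matches: skip blocklet
    rows = sorted(row_ids[lo:hi])
    # steps 5 and 6: read only the projection columns for matching rows
    return [tuple(blocklet[c][r] for c in projection) for r in rows]

# Hypothetical blocklets of table t1, one dict of column -> values each
blocklets = [
    {"c2": ["austin", "boston", "denver"], "c3": [1, 2, 3], "c4": ["a", "b", "c"]},
    {"c2": ["miami", "austin"], "c3": [4, 5], "c4": ["d", "e"]},
    {"c2": ["boston", "boston"], "c3": [6, 7], "c4": ["f", "g"]},
]
result = [row for b in blocklets
          for row in scan_blocklet(b, "c2", "boston", ["c3", "c4"])]
print(result)
```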

Hybrid format
– Streaming: appendable format
– Batch: columnar
Query merging on both streaming and batch
segments
Auto data conversion on Handoff
– streaming segment size
– Updates Indexes and data maps
Support Materialized View!
Streaming with Kafka & Spark
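The hybrid format can be pictured with a small Python sketch. It is an assumption-laden model rather than CarbonData's streaming code: `HybridTable` is a hypothetical class, and a row-count threshold stands in for the streaming segment size that triggers handoff.

```python
class HybridTable:
    def __init__(self, columns, handoff_rows):
        self.columns = columns
        self.handoff_rows = handoff_rows  # hypothetical stand-in for streaming segment size
        self.streaming = []               # appendable, row-oriented segment
        self.batch = []                   # columnar segments: dict of column -> values

    def append(self, row):
        self.streaming.append(row)
        if len(self.streaming) >= self.handoff_rows:
            self.handoff()

    def handoff(self):
        """Convert the streaming segment into a columnar batch segment."""
        if not self.streaming:
            return
        segment = {c: [row[i] for row in self.streaming]
                   for i, c in enumerate(self.columns)}
        self.batch.append(segment)
        self.streaming = []

    def scan(self, column):
        """One query reads both batch and streaming segments."""
        i = self.columns.index(column)
        values = [v for seg in self.batch for v in seg[column]]
        values.extend(row[i] for row in self.streaming)
        return values

table = HybridTable(["ts", "value"], handoff_rows=3)
for n in range(5):
    table.append((n, n * 10))
print(len(table.batch), len(table.streaming), table.scan("value"))
```

A real handoff would also refresh indexes and data maps, as the slide notes; the sketch omits that.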

Delete Delta:
• Store RowId that are deleted
• Bitmap file format
Insert Delta:
• Store newly added row
• CarbonData file format
Insert Delta (Base)
Update flow:
1. Find all rows that need update, by executing the sub query
2. Write the “Delete Delta” file
3. Write the “Insert Delta” file
Read flow:
1. Read “Base” file
2. Read “Delete Delta” and exclude RowId in the file
3. Read “Update Delta” and merge new row
Data Update / Delete
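The update and read flows can be sketched in Python as follows. This is a simplified model under assumptions: `DeltaTable` is a hypothetical class, a Python set replaces the bitmap file, and rows already in the insert delta cannot be updated again.

```python
class DeltaTable:
    def __init__(self, base_rows):
        self.base = list(base_rows)  # "Base" file; row id is the list position
        self.delete_delta = set()    # deleted row ids (a bitmap file on the slide)
        self.insert_delta = []       # newly written rows

    def _live(self):
        return [(rid, row) for rid, row in enumerate(self.base)
                if rid not in self.delete_delta]

    def delete(self, predicate):
        self.delete_delta.update(rid for rid, row in self._live() if predicate(row))

    def update(self, predicate, change):
        hits = [(rid, row) for rid, row in self._live() if predicate(row)]  # 1. find rows
        self.delete_delta.update(rid for rid, _ in hits)                    # 2. delete delta
        self.insert_delta.extend(change(row) for _, row in hits)            # 3. insert delta

    def read(self):
        """Base rows minus deleted row ids, merged with the insert delta."""
        return [row for _, row in self._live()] + self.insert_delta

table = DeltaTable([("a", 1), ("b", 2), ("c", 3)])
table.update(lambda r: r[0] == "b", lambda r: (r[0], r[1] * 10))
table.delete(lambda r: r[0] == "c")
print(table.read())
```

The base file is never rewritten, so both operations only append small delta files.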

CarbonData Core Engine: Table, Segment, Index, Caching
Search Acceleration: Multi-Dimensional detail data query
OLAP Acceleration: BI & Batch data analytics
Streaming Acceleration: Near real-time data analytics
AI Acceleration: AI-enabled data analytics
Tools
Open Format
CarbonFile
Searchlet
Cloud Storage
CarbonData Access: API, RPC, REST, Parallel Framework
CarbonData Technology Stack

• Bank
– Fraud detection
– Risk analysis
• Telco
– Churn Analysis
– VIP Care
• Monitoring
– IOV
– Unusual Human behavior analysis
• Internet
– Video access analysis
– Device size; resolution
– Server loads
Scenario: Half year of telecom data in one state
Cluster: 70 machines, 1120 cores
Data: 200 to 1000 Billion Rows, 80 columns
Carbon Table Index built on c1~c8
Workload:
Q1: filter (c1~c4), select *
Q2: filter (c1,c5), big result set
Q3: filter (c3)
Q4: filter (c1) and aggregate
Q5: full scan aggregate
[Chart: Response Time (% of 200B records), 0% to 500%, for Q1 to Q5 at 200B, 400B, 600B, 800B and 1000B rows]
Observation
When data grows:
⚫ Index is efficient to reduce response time: Q1, Q3
⚫ When selectivity is low or full scan, query response time is linear: Q2, Q5
⚫ Spark compute time scales linearly: Q4
Cluster: 178 machines, 1368 cores, 5550 GB Mem
Data Size: 3 PB, 10+ trillion rows
Largest Deployment
CarbonData use cases in Production

Roadmap
Releases, road Ahead, usage in Huawei Public Cloud

Versions: 0.1.0-incubating, 0.1.1-incubating, 0.2.0-incubating, 1.0.0-incubating, 1.1.0, 1.1.1, 1.2.0, 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0
Aug-2016
• Indexed Columnar Store on HDFS
• Integration with Apache Spark
• Bulk load support
Nov-2016
• MR support
• Spark DataFrame support
• Configurable block size
• Performance Improvements
Jan-2017
• Kettle Dependency removal
• Spark 2.1 support
• IUD support for Spark 1.6
• Adaptive compression
• V2 Format of CarbonData
• Vectorized readers
• Off-heap memory support
• Single Pass loading
May-2017
• CarbonData V3 format
• Alter Table
• Range Filter
• Large cluster optimizations
• Code refactoring
Sep-2017
• Presto support
• Sort columns configuration
• Partition support
• Datamaps
• IUD support for Spark 2.1
• Dynamic property configuration
• RLE codec support
• Compaction performance
Feb-2018
• Spark 2.2.1 support
• Streaming
• Pre-aggregate tables
• CTAS
• Performance improvements
• Read from S3 support
Jun-2018
• Query Performance improvements
• Compaction performance improvements
• Data loading performance improvements
• SDK support
• Streaming on pre-agg, partitioned tables
• MV support
• Bloom filter
• Write to S3 support
Aug-2018
• Local Dictionary support
• Query performance improvements
• Custom compaction
• Support varchar
• Hadoop 2.8.3 support
Oct-2018
• Spark 2.3.2 support
• Hadoop 3.1.1 support
• C++ reader through SDK
• StreamingSQL from Kafka
• Data Summary Tool
• Better Avro compliance with more datatypes
• Adaptive encoding for all numeric columns
• LightWeight integration with Spark (fileformat)
Release milestones, Features

Towards Objectives / Goals
➢ Unify CarbonData for more use cases
➢ Performance enhancements
➢ Broader ecosystem Integration
➢ Auto tuning storage
In near future
➢ DataMaps and MVs to strengthen more use cases (GeoSpatial, Graph, Rtrees, Time Series)
➢ Easy Data Maintenance
➢ In-Memory Caching
➢ Compliance to Spark Data Source V2
➢ Continuous Streaming
➢ Improve cloud storage performance
➢ More encodings
➢ ML SQL
Road Ahead

[Diagram: Input Sources; Ingestion Services (CDM, DIS, CS); DataLake based on Object Store (CarbonData Store, OBS); Intelligence Services (DLI, DWS, MRS, FRS, MLS, GES)]
CarbonData in Huawei Cloud

Store in CarbonData format and Analyze on need basis
What is Apache CarbonData?
[Diagram labels: Real time data; CarbonData format (Data, Indexes, Data maps); Query Processing (History + Real time); Analysis (charts)]
Unified data analysis and Unified storage format
Recap …

Community
Community Contributions, Usage in Production

Companies Contributing
Apache CarbonData Graduated in April 2017
12 Stable releases
130+ Contributors in a very short span
CarbonData Incubated into Apache in June 2016
Started working on CarbonData from 2015
Apache CarbonData Community

We love community involvement & feedback
• Subscribe to dev mailing list
• Mail list: dev@carbondata.apache.org, user@carbondata.apache.org
• Mailing list Archive: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
• Welcome any type of contribution: feature, documentation or bug report:
• Code: https://github.com/apache/carbondata
• JIRA: https://issues.apache.org/jira/browse/CARBONDATA
• Website: http://carbondata.apache.org
• cwiki: https://cwiki.apache.org/confluence/display/CARBONDATA/CarbonData+Home
Apache CarbonData community is very open for fresh contributions. We welcome anyone who brings
new perspective to use cases, design, requirements.
Community welcomes all kinds of contributions including test, documentation enhancements, website
enhancements, examples implementation, Integration with other execution engines.
Contributions …
