Apache CarbonData+Spark to realize data convergence and Unified high performance Data Analytics - Raghunandan Subramanya (Huawei Technologies India Pvt Ltd)

Challenges in Data Analytics:
Different application scenarios need different storage solutions: HBase is ideal for point-query scenarios but unsuitable for multi-dimensional queries. MPP systems suit data-warehouse scenarios, but the engine and data are coupled together, which hampers scalability. OLAP stores used in BI applications perform best for aggregate queries, but full-scan queries perform sub-optimally. Moreover, they are not suitable for real-time analysis. These distinct systems lead to low resource sharing and require separate pipelines for data and application management.

Slide 1: Data convergence and Unified high performance Data Analytics with Apache CarbonData + Spark
Presented by: Raghunandan S
Slide 2: I am … Raghunandan S
• Head of the technical team working on Big Data technologies including Hadoop, Spark, HBase, CarbonData and ZooKeeper
• Systems group leader of the team that developed a BI solution based on MOLAP
• Project manager of the CarbonData team, responsible for cleaning up and re-architecting the code, incubating it into Apache, and supporting it until it reached Top Level Project status
• Prior experience in NMS and Softswitch products
Chief Architect at Huawei's India R&D centre's Big Data group
raghunandan@apache.org
Slide 3:
• Use cases & Challenges
• Unified Data Analytics
• CarbonData Technology & Solution
• CarbonData Roadmap
• CarbonData Community
Slide 4: Use Cases and Challenges
Background of the CarbonData project
Slide 5: Data analysis use cases
Network
• 54B records per day
• 750TB per month
• Complex correlated data
Consumer
• 100 thousand sensors
• >2 million events per second
• Time series, geospatial data
Enterprise & Banking
• 100 GB to TB per day
• Data across different regions
Use cases: Report & Dashboard, OLAP & Ad-hoc, Batch processing, Machine learning, Real Time Analytics, Graph analysis, Time series analysis, Geospatial and position analysis, Textual matching and analysis, Fraud detection, Traffic analysis, User behavior, Advertising, Failure detection, Home automation
Slide 6: Big Data and Analytics Challenges
➢ Data Size
• Single table > 10 billion rows
➢ Multi-dimensional
• Wide table: > 100 dimensions
➢ Rich in Detail
• High cardinality (> 1 billion)
• 1B terminals × 200K cells / minute
• Complex & nested structure
➢ Real Time
• Time series streaming data
➢ Reporting
• Big scan
• Distributed compute
➢ Searching
• Located scan
• Isolated compute
➢ AI
• Iterative scan
• Iterative compute
Slide 7: Big Data and Analytics Challenges
Typical pipeline: input sources feed ETL pipelines into multiple stores (No-SQL, document store, TXT, data warehouse with HOT/COLD tiers, time series database, real-time store), each serving its own query processing and analysis.
• Multiple stores for specific analysis
• Complex data pipelines and high maintenance cost
Slide 8: Big Data and Analytics Challenges
• Multiple-team coordination makes it tough to achieve self-service BI for complex and big data
Each group (IT, Business, Data Engineers, Data Analysts) maintains its own complex pipelines and various stores for its business needs.
Slide 9: Big Data and Analytics Challenges (Data lake)
• Performance not comparable to a data warehouse
• Variety of scans
• Corrections/updates to data
• Real time and history data analysis
• Consistency of query results
• With all the above challenges, the system still needs to be simple to analyze
• Store now, analyze later
• Should also facilitate self-assist BI
Slide 10: Big Data and Analytics Challenges
• Real time and history data
• Data consistency and duplication
• Materialized views or indexes
Query processing must span both history and real-time data for analysis.
Slide 11: Unified Data Analytics
A solution for data silos, data redundancy, and multiple use cases
Slide 12: Unified data analytics
What is Apache CarbonData? Store in CarbonData format and analyze on an as-needed basis.
• Query processing over history and real-time data
• A single CarbonData format holding data, indexes, and data maps
• Unified data analysis and unified storage format
Slide 13: CarbonData Solution
Technology, features, and design adopted for unified storage
Slide 14: CarbonData Solution & Features
High Scalability
• Storage and processing separated
• Suitable for cloud
Data consistency (ACID)
• No intermediate data on failures
• Supported for every operation on data (query, IUD, indexes, data maps, streaming)
Insert / Update / Delete
• Supports bulk inserts, updates and deletes
Fast Query Execution
• Filters based on multi-level indexes
• Columnar format
• Query processing on dictionary-encoded values
Unified storage for AI, reporting and search workloads:
• Fast: construct multi-dimension indexes
• Faster: based on fast intelligent scanning
• Efficient data compression: dictionary encoding
• Concurrent data import: Spark parallel tasks
Slide 15: CarbonData Solution & Features
Decoupled Storage & Processing
• Data in CarbonData format
• Processing in distributed frameworks
Real time + History data
• Real-time and history data land in the same table
• Queries combine results from both automatically
• Auto handoff to columnar format
Multi-level Index
• Supports multi-dimensional B-tree
• Min-max and inverted indexes
External Indexes for Optimizations
• External indexes like Lucene for text
• MV data maps for aggregate tables and time-series queries
• Queries auto-select the required data map
Slide 16: CarbonData File Layout
File structure: File Header (version, schema), then Blocklet 1 … Blocklet N (each holding Column 1 Chunk … Column n Chunk, each chunk made of pages), then File Footer (blocklet offsets, index & stats).
➢ Data Layout
• Block: an HDFS file
• Blocklet: a set of rows in columnar format
  o Column chunk: data for one column in a blocklet; smallest IO unit
  o Page: data page in the column chunk; smallest decoding unit
➢ Metadata Information
• Header: version, schema
• Footer: blocklet offsets, index & file-level statistics
➢ Built-in indexes and statistics
• Blocklet index: B-tree start key, end key
• Blocklet & page level statistics: min, max, etc.
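The layout above nests pages inside column chunks inside blocklets inside a file. A minimal Python sketch of that containment hierarchy (an illustrative data model only, not the actual CarbonData reader classes):

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

@dataclass
class Page:                # smallest decoding unit
    values: List[Any]

@dataclass
class ColumnChunk:         # data for one column in a blocklet; smallest IO unit
    pages: List[Page]

@dataclass
class Blocklet:            # a set of rows stored column-wise
    chunks: List[ColumnChunk]

@dataclass
class Footer:              # written last: offsets plus index & statistics
    blocklet_offsets: List[int]
    min_max: List[Tuple[Any, Any]]   # per-blocklet min/max for pruning

@dataclass
class CarbonFile:          # a "Block": one HDFS file
    header: Dict[str, Any]           # version, schema
    blocklets: List[Blocklet]
    footer: Footer
```

Because the footer carries offsets and statistics for every blocklet, a reader can open just the footer and decide which blocklets (and which column chunks within them) to touch.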
Slide 17: DataMaps
CarbonData engine layers: data is stored as a CarbonTable, with DataMaps and a DataMap DSL serving Timeseries, OLAP, Graph, Text, Geospatial and AI workloads.
• Stores data / metadata / index / statistics for query optimization
• Auto-selects the required data map during query
• Updated inline or offline based on configuration
• Supports custom data maps
Slide 18: DataMaps
Index DataMaps:
• Min-Max DataMap, BloomFilter DataMap: point and filter queries
• Lucene DataMap: text search
• R-tree DataMap: spatial queries
MV DataMaps:
• Time series cube DataMap: OLAP, time series queries
• Aggregate table DataMap: materialized views
Examples:
CREATE DATAMAP agg_sales ON TABLE sales
USING "preaggregate"
AS SELECT country, sum(quantity), avg(price)
FROM sales GROUP BY country

CREATE DATAMAP datamap_name ON TABLE main_table
USING 'lucene'
DMPROPERTIES ('index_columns'='city, name', ...)
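To illustrate how a BloomFilter DataMap serves point queries: one filter per blocklet answers "might this blocklet contain the value?", guaranteeing no false negatives, so any blocklet the filter rules out can be skipped safely. A toy Python sketch (the filter and all names here are assumptions for illustration, not CarbonData's implementation):

```python
import hashlib

class ToyBloomFilter:
    """Tiny bloom filter: no false negatives, rare false positives."""
    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0                      # bit array packed into an int

    def _positions(self, value):
        # Derive `hashes` deterministic bit positions for a value.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, value):
        for p in self._positions(value):
            self.bits |= 1 << p

    def might_contain(self, value):
        return all(self.bits >> p & 1 for p in self._positions(value))

def prune_blocklets(filters, value):
    """Keep only the blocklets whose filter cannot rule the value out."""
    return [i for i, f in enumerate(filters) if f.might_contain(value)]
```

A point query then scans only the surviving blocklets; a false positive merely costs an extra scan, never a wrong result.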
Slide 19: DataMap Execution Model
The Spark driver issues the query; executors on data nodes read Carbon files together with their DataMaps.
• Mainly includes Index DataMaps and MV DataMaps
• Pushes filter and projection down to the DataMap
• DataMap execution can be centralized or distributed (avoiding huge memory issues)
• A DataMap can be cached in memory or stored on disk
Slide 20: Rich Multi-Level Index Support
The Spark driver holds a table-level index; each executor holds file-level indexes and scanners over Carbon files (data + footer), integrated with Catalyst.
• Using the index info in the footer, a two-level B+ tree index can be built:
  • File-level index: local B+ tree, efficient blocklet-level filtering
  • Table-level index: global B+ tree, efficient file-level filtering
• Column chunk inverted index: efficient column chunk scan
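The two index levels compose into a pruning cascade: first drop whole files, then drop blocklets inside the survivors. A hedged Python sketch using per-file and per-blocklet min/max statistics in place of the real B+ trees (all names are illustrative):

```python
def overlaps(stat, lo, hi):
    """Can a [min, max] range possibly intersect the filter range [lo, hi]?"""
    return not (stat["max"] < lo or stat["min"] > hi)

def prune(table_index, lo, hi):
    # Level 1: table-level index on the driver -> keep only candidate files
    files = [f for f in table_index if overlaps(f["stats"], lo, hi)]
    # Level 2: file-level index (from each footer) -> keep only candidate blocklets
    return [(f["name"], i)
            for f in files
            for i, b in enumerate(f["blocklets"])
            if overlaps(b, lo, hi)]

table = [
    {"name": "part0", "stats": {"min": 0, "max": 99},
     "blocklets": [{"min": 0, "max": 49}, {"min": 50, "max": 99}]},
    {"name": "part1", "stats": {"min": 100, "max": 199},
     "blocklets": [{"min": 100, "max": 199}]},
]
```

For a filter on the range [60, 70], level 1 eliminates `part1` entirely and level 2 keeps only the second blocklet of `part0`, so a single blocklet is read.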
Slide 21: Efficient Filtering via Index
Example query: SELECT c3, c4 FROM t1 WHERE c2='boston'
1. File pruning (Spark driver, using file footers)
2. Blocklet pruning
3. Read and decompress the filter column
4. Binary search using the inverted index; skip to the next blocklet if nothing matches
5. Decompress the projection columns
6. Return decoded data to Spark
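Steps 3-6 can be sketched for a single blocklet: binary-search the sorted filter column, map hits back through the inverted index to row ids, and decode only the projection columns for those rows. A simplification in Python (real CarbonData operates on compressed pages; the dict layout here is an assumption):

```python
import bisect

def scan_blocklet(blocklet, filter_col, value, proj_cols):
    """Return projected tuples for rows where filter_col == value."""
    col = blocklet[filter_col]
    keys = col["sorted"]                       # filter column, sorted
    lo = bisect.bisect_left(keys, value)
    hi = bisect.bisect_right(keys, value)
    if lo == hi:
        return []                              # no match: skip this blocklet
    row_ids = col["rowid"][lo:hi]              # inverted index -> original row ids
    return [tuple(blocklet[c]["data"][r] for c in proj_cols)
            for r in row_ids]

# Rows 0..3 of c2 are: boston, miami, austin, boston.
blocklet = {
    "c2": {"sorted": ["austin", "boston", "boston", "miami"],
           "rowid": [2, 0, 3, 1]},
    "c3": {"data": [10, 11, 12, 13]},
    "c4": {"data": ["a", "b", "c", "d"]},
}
```

For WHERE c2='boston' this touches only the two matching rows and decodes only c3 and c4, which is the point of steps 5-6.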
Slide 22: Streaming with Kafka & Spark
Hybrid format
– Streaming: appendable format
– Batch: columnar
Query merging across both streaming and batch segments
Auto data conversion on handoff
– Triggered by streaming segment size
– Updates indexes and data maps
Supports materialized views
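The handoff rule and the merged scan can be sketched as follows (a toy model; the threshold and field names are assumptions, not CarbonData configuration keys):

```python
def maybe_handoff(segment, threshold_bytes):
    """Once an appendable streaming segment reaches the configured size,
    rewrite it into the columnar batch format (indexes and data maps
    would be updated as part of the conversion)."""
    if segment["status"] == "streaming" and segment["size"] >= threshold_bytes:
        segment["status"] = "batch"
    return segment

def scan(segments):
    """Query merging: a query reads streaming and batch segments alike
    and combines their rows into one result."""
    return [row for seg in segments for row in seg["rows"]]
```

Because queries merge both segment kinds, the handoff is transparent: a row is visible whether it still sits in the appendable segment or has already been converted to columnar.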
Slide 23: Data Update / Delete
Delete delta:
• Stores the RowIds that are deleted
• Bitmap file format
Insert delta:
• Stores newly added rows
• CarbonData file format
Update flow:
1. Find all rows that need the update by executing the sub-query
2. Write the "delete delta" file
3. Write the "insert delta" file
Read flow:
1. Read the base file
2. Read the "delete delta" and exclude the RowIds it lists
3. Read the "insert delta" and merge the new rows
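The read flow above can be sketched in a few lines of Python (plain lists and sets stand in for the binary delta files):

```python
def read_with_deltas(base, delete_delta, insert_delta):
    """Merge a base file with its deltas:
    1. read the base rows,
    2. drop every RowId listed in the delete delta,
    3. append the rows from the insert delta."""
    kept = [row for rid, row in enumerate(base) if rid not in delete_delta]
    return kept + insert_delta
```

An update is then just the two writes from the update flow: deleting RowId 1 and inserting its new version makes `read_with_deltas(["a", "b", "c"], {1}, ["B"])` return the updated table without rewriting the base file.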
Slide 24: CarbonData Technology Stack
• Tools layer: CarbonData access API, RPC, REST, parallel framework
• Acceleration layers on top of the CarbonData core engine (table, segment, index, caching):
  – Search acceleration: multi-dimensional detail data query
  – OLAP acceleration: BI & batch data analytics
  – Streaming acceleration: near real-time data analytics
  – AI acceleration: AI-enabled data analytics
• Open format: CarbonFile, Searchlet, cloud storage
Slide 25: CarbonData use cases in Production
• Bank: fraud detection, risk analysis
• Telco: churn analysis, VIP care
• Monitoring: IOV, unusual human behavior analysis
• Internet: video access analysis; device size/resolution; server loads
Benchmark scenario: half a year of telecom data in one state
• Cluster: 70 machines, 1120 cores
• Data: 200 to 1000 billion rows, 80 columns; Carbon table index built on c1~c8
• Workload:
  Q1: filter (c1~c4), select *
  Q2: filter (c1, c5), big result set
  Q3: filter (c3)
  Q4: filter (c1) and aggregate
  Q5: full scan aggregate
Observation as data grows from 200B to 1000B records (response time measured relative to 200B records):
• Indexes are effective at keeping response time down: Q1, Q3
• When selectivity is low or the query is a full scan, response time grows linearly: Q2, Q5
• Spark compute time scales linearly: Q4
Largest deployment: 178 machines, 1368 cores, 5550 GB memory; 3 PB of data, 10+ trillion rows
Slide 26: Roadmap
Releases, road ahead, and usage in Huawei Public Cloud
Slide 27: Release milestones and Features
0.1.0-incubating (Aug-2016)
• Indexed columnar store on HDFS
• Integration with Apache Spark
• Bulk load support
0.1.1-incubating / 0.2.0-incubating (Nov-2016)
• MR support
• Spark DataFrame support
• Configurable block size
• Performance improvements
1.0.0-incubating (Jan-2017)
• Kettle dependency removal
• Spark 2.1 support
• IUD support for Spark 1.6
• Adaptive compression
• V2 format of CarbonData
• Vectorized readers
• Off-heap memory support
• Single-pass loading
1.1.0 / 1.1.1 (May-2017)
• CarbonData V3 format
• Alter table
• Range filter
• Large cluster optimizations
• Code refactoring
1.2.0 (Sep-2017)
• Presto support
• Sort columns configuration
• Partition support
• DataMaps
• IUD support for Spark 2.1
• Dynamic property configuration
• RLE codec support
• Compaction performance
1.3.0 / 1.3.1 (Feb-2018)
• Spark 2.2.1 support
• Streaming
• Pre-aggregate tables
• CTAS
• Code refactoring
• Performance improvements
• Read from S3 support
1.4.0 (Jun-2018)
• Query performance improvements
• Compaction performance improvements
• Data loading performance improvements
• SDK support
• Streaming on pre-aggregate and partitioned tables
• MV support
• Bloom filter
• Write to S3 support
1.4.1 (Aug-2018)
• Local dictionary support
• Query performance improvements
• Custom compaction
• Varchar support
• Code refactoring
• Hadoop 2.8.3 support
1.5.0 (Oct-2018)
• Spark 2.3.2 support
• Hadoop 3.1.1 support
• C++ reader through SDK
• StreamingSQL from Kafka
• Data summary tool
• Better Avro compliance with more datatypes
• Adaptive encoding for all numeric columns
• Lightweight integration with Spark (fileformat)
Slide 28: Road Ahead
Towards objectives / goals:
➢ Unify CarbonData for more use cases
➢ Performance enhancements
➢ Broader ecosystem integration
➢ Auto-tuning storage
In the near future:
➢ DataMaps and MVs to strengthen more use cases (geospatial, graph, R-trees, time series)
➢ Easy data maintenance
➢ In-memory caching
➢ Compliance with Spark Data Source V2
➢ Continuous streaming
➢ Improved cloud storage performance
➢ More encodings
➢ ML SQL
Slide 29: CarbonData in Huawei Cloud
Data lake based on Object Store: input sources feed ingestion services (CDM, DIS, CS) into the CarbonData store on OBS, which in turn backs intelligence services (DLI, DWS, MRS, FRS, MLS, GES).
Slide 30: Recap …
What is Apache CarbonData? Store in CarbonData format and analyze on an as-needed basis.
• Query processing over history and real-time data
• A single CarbonData format holding data, indexes, and data maps
• Unified data analysis and unified storage format
Slide 31: Community
Community contributions, usage in production
Slide 32: Apache CarbonData Community
• Started working on CarbonData in 2015
• CarbonData incubated into Apache in June 2016
• Apache CarbonData graduated in April 2017
• 12 stable releases
• 130+ contributors in a very short span
• Many companies contributing
Slide 33: Contributions …
We'd love more community involvement & feedback.
• Subscribe to the dev mailing list
• Mailing lists: dev@carbondata.apache.org, user@carbondata.apache.org
• Mailing list archive: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
• We welcome any type of contribution: features, documentation or bug reports
• Code: https://github.com/apache/carbondata
• JIRA: https://issues.apache.org/jira/browse/CARBONDATA
• Website: http://carbondata.apache.org
• cwiki: https://cwiki.apache.org/confluence/display/CARBONDATA/CarbonData+Home
The Apache CarbonData community is very open to fresh contributions. We welcome anyone who brings a new perspective on use cases, design, or requirements, and all kinds of contributions: tests, documentation enhancements, website enhancements, example implementations, and integration with other execution engines.
Slide 34: Thank You
