Data convergence and Unified high performance Data Analytics with Apache CarbonData + Spark
Presented By : Raghunandan S
I am …
Raghunandan S
• Head of the technical team working on Big Data technologies including Hadoop, Spark, HBase, CarbonData and ZooKeeper
• Systems Group leader of the team that developed a BI solution based on MOLAP
• Project Manager of the CarbonData team, responsible for cleaning up and re-architecting the code, incubating it into Apache and supporting it through to Top Level Project status
• Prior experience in NMS and Softswitch products
Chief Architect at Huawei’s India R&D centre, BigData group
raghunandan@apache.org
Use cases & Challenges
Unified Data Analytics
CarbonData Technology & Solution
CarbonData Roadmap
CarbonData Community
Use Cases and Challenges
Background of CarbonData project
Network
• 54B records per day
• 750TB per month
• Complex correlated data
Consumer
• 100 thousands of sensors
• >2 million events per second
• Time series, geospatial data
Enterprise & Banking
• 100 GB to TB per day
• Data across different regions
[Diagram: data feeding analysis types and applications]
Analysis types: Report & Dashboard, OLAP & Ad-hoc, Batch processing, Machine learning, Real Time Analytics, Graph analysis, Time series analysis, Geo spatial and position analysis, Textual matching and analysis
Applications: Fraud detection, Traffic analysis, User behavior, Advertising, Failure detection, Home automation
Data analysis use cases
Big Data and Analytics Challenges
➢ Data Size
• Single Table >10 billion rows
➢ Multi-dimensional
• Wide table: > 100 dimensions
➢ Rich in Detail
• High cardinality (> 1 billion)
• 1B terminal * 200K cell / minute
• Complex & Nested Structure
➢ Real Time
• Time Series streaming data
➢ Reporting
• Big Scan
• Distributed Compute
➢ Searching
• Located Scan
• Isolated Compute
➢ AI
• Iterative Scan
• Iterative Compute
[Diagram: input sources flow through ETL pipelines into several data stores (NoSQL, document store, text files, hot and cold data warehouse, time series database, real-time store); each store feeds its own query processing and analysis. Chart axis: 2015 to 2018]
• Multiple stores, each for a specific analysis
• Complex data pipelines and high maintenance cost
Big Data and Analytics Challenges
• Coordination across multiple teams makes self-service BI hard to achieve for complex, big data
[Diagram: IT, Business, Data Engineers and Data Analysts, each dealing with complex pipelines, various stores and business needs]
Big Data and Analytics Challenges
• Data lake
• Performance not comparable to data warehouse
• Variety of Scans
• Corrections/Updates to data
• Real time and history data analysis
• Consistency of query results
• With all the above challenges, analysis still needs to be simple
• Store now and analyze later
• Will also facilitate self-service BI
Big Data and Analytics Challenges
• Real time and History data
• Data consistency and Duplication
• Materialized Views or indexes
[Diagram: query processing over history and real-time data feeds analysis. Chart axis: 2015 to 2018]
Big Data and Analytics Challenges
Unified Data Analytics
Solution for data silos; data redundancy; Multiple use cases
Unified data analytics
Store in CarbonData format and Analyze on need basis
What is Apache CarbonData?
[Diagram: real-time data is stored in CarbonData format (data, indexes, data maps); query processing over history and real-time data feeds analysis]
Unified data analysis and Unified storage format
CarbonData Solution
Technology, Features, Design adopted for Unified Storage
CarbonData Solution & Features
High Scalability
• Storage and processing separated
• Suitable for cloud
Data consistency (ACID)
• No intermediate data on failures
• Supported for every operation on data (Query, IUD, Indexes, Data maps, Streaming)
Insert - Update - Delete
• Supports bulk Insert, Update and Delete
Fast Query Execution
• Filters based on multi-level indexes
• Columnar format
• Query processing on dictionary-encoded values
Unified Storage for AI, Reporting and Searching
• Fast: construct multi-dimension indexes
• Faster: based on fast intelligent scanning
• Efficient data compression: dictionary encoding
• Concurrent data import: Spark parallel tasks
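A minimal Python sketch of the dictionary-encoding idea mentioned above, under stated assumptions: the function names `dictionary_encode` and `filter_rows` are hypothetical and are not CarbonData APIs. It only illustrates why a filter can be evaluated on small integer codes instead of the original strings.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[Dict[str, int], List[int]]:
    """Map each distinct value to a small integer code (sorted for stable codes)."""
    dictionary = {value: code for code, value in enumerate(sorted(set(values)))}
    encoded = [dictionary[value] for value in values]
    return dictionary, encoded


def filter_rows(dictionary: Dict[str, int], encoded: List[int], literal: str) -> List[int]:
    """Return row ids matching `column = literal`, comparing integer codes only."""
    code = dictionary.get(literal)
    if code is None:  # the literal never occurs, so no row can match
        return []
    return [row_id for row_id, value in enumerate(encoded) if value == code]


# Illustrative data only.
cities = ["boston", "austin", "boston", "denver", "austin", "boston"]
city_dict, city_codes = dictionary_encode(cities)
matching_rows = filter_rows(city_dict, city_codes, "boston")
```

The literal is translated to its code once, so the scan itself never touches the strings; repeated values also shrink to repeated small integers, which compress well.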
CarbonData Solution & Features
Decoupled Storage & Processing
• Data in CarbonData format
• Processing in distributed frameworks
Real time + History data
• Real-time and history data go into the same table
• Query can combine results automatically
• Auto handoff to columnar format
Multi-level Index
• Supports multi-dimensional B-tree
• Min-max and inverted indexes
External Indexes for Optimizations
• External indexes like Lucene for text
• MV data maps for aggregate tables and time series queries
• Query auto-selects the required data map
[Diagram: Carbon Data File]
File Header: Version, Schema
Blocklet 1 … Blocklet N: Column 1 Chunk, Column 2 Chunk, …, Column n Chunk (each column chunk holds Page 1, Page 2, …, each with a header)
File Footer: Blocklet Offset, Index & Stats
CarbonData File Layout
➢ Data Layout
• Block : An HDFS file
• Blocklet : A set of rows in columnar format
o Column chunk : Data for one column in a blocklet; smallest IO unit
o Page : Data page in the column chunk; smallest decoding unit
➢ Metadata Information
• Header : Version, Schema
• Footer : Blocklet Offset, Index & File level statistics
➢ Built-in indexes and statistics
• Blocklet Index : B-Tree start key, end key
• Blocklet & Page level statistics : min, max etc
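The layout above can be pictured with a small Python sketch. This is an illustration under assumptions, not the on-disk format: `Blocklet`, `CarbonFileSketch` and `prune_blocklets` are hypothetical names, column values are plain integers, and the footer statistics are computed on the fly rather than read from a file.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Blocklet:
    """A set of rows stored column by column (one chunk per column)."""
    column_chunks: Dict[str, List[int]]

    def stats(self) -> Dict[str, Tuple[int, int]]:
        # Per-column (min, max): the kind of statistic kept at blocklet level.
        return {name: (min(chunk), max(chunk)) for name, chunk in self.column_chunks.items()}


@dataclass
class CarbonFileSketch:
    """Hypothetical in-memory stand-in for one block (file)."""
    schema: List[str]
    blocklets: List[Blocklet]

    def footer(self) -> List[Dict[str, Tuple[int, int]]]:
        # Stand-in for the footer: one statistics entry per blocklet.
        return [blocklet.stats() for blocklet in self.blocklets]


def prune_blocklets(carbon_file: CarbonFileSketch, column: str, value: int) -> List[int]:
    """Return indexes of blocklets whose [min, max] range may contain `value`."""
    return [
        index for index, stats in enumerate(carbon_file.footer())
        if stats[column][0] <= value <= stats[column][1]
    ]


# Illustrative data only.
carbon_file = CarbonFileSketch(
    schema=["c1", "c2"],
    blocklets=[
        Blocklet({"c1": [1, 2, 3], "c2": [10, 20, 30]}),
        Blocklet({"c1": [4, 5, 6], "c2": [40, 50, 60]}),
        Blocklet({"c1": [7, 8, 9], "c2": [5, 70, 90]}),
    ],
)
candidates = prune_blocklets(carbon_file, "c2", 50)
```

Min/max statistics can only rule blocklets out; a blocklet that survives pruning (such as the third one here) may still contain no matching row and has to be scanned.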
[Diagram: Time series, OLAP, Graph, Text, Geospatial and AI workloads run on the CarbonData engine, which exposes CarbonTable and DataMap (with a DataMap DSL) on top of the data]
• Stores Data / Metadata / Index / Statistics for Query optimization.
• Auto selects the required data map during Query.
• Updated Inline or offline based on configuration.
• Supports custom data maps.
DataMaps
Index DataMap(s):
• Lucene DataMap: text search
• Min-Max DataMap, BloomFilter DataMap: point and filter queries
• Rtree DataMap: spatial queries
MV DataMap(s):
• Time series Cube DataMap: time series queries
• Aggregate table DataMap, Materialized Views: OLAP
CREATE DATAMAP agg_sales ON TABLE sales USING 'preaggregate' AS
  SELECT country, sum(quantity), avg(price) FROM sales GROUP BY country
CREATE DATAMAP datamap_name ON TABLE main_table USING 'lucene'
  DMPROPERTIES ('index_columns'='city, name', ...)
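To illustrate what "auto selects the required data map" could mean for the agg_sales example, here is a hedged Python sketch. The matching rule (same GROUP BY columns, requested aggregates are a subset of the stored ones) is a simplifying assumption for illustration and is not taken from CarbonData's planner; `AggDataMap`, `Query` and `select_datamap` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class AggDataMap:
    name: str
    group_by: FrozenSet[str]
    aggregates: FrozenSet[str]


@dataclass(frozen=True)
class Query:
    group_by: FrozenSet[str]
    aggregates: FrozenSet[str]


def select_datamap(query: Query, datamaps: List[AggDataMap]) -> Optional[str]:
    """Return the name of the first datamap that can answer the query, else None."""
    for datamap in datamaps:
        # Assumed rule: identical grouping, and every requested aggregate is precomputed.
        if query.group_by == datamap.group_by and query.aggregates <= datamap.aggregates:
            return datamap.name
    return None  # fall back to scanning the main table


agg_sales = AggDataMap(
    name="agg_sales",
    group_by=frozenset({"country"}),
    aggregates=frozenset({"sum(quantity)", "avg(price)"}),
)

covered = Query(frozenset({"country"}), frozenset({"sum(quantity)"}))
not_covered = Query(frozenset({"city"}), frozenset({"sum(quantity)"}))

chosen = select_datamap(covered, [agg_sales])
fallback = select_datamap(not_covered, [agg_sales])
```

A query grouped by country can be served from the pre-aggregated rows, while a query grouped by city cannot and falls back to the main table.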
DataMaps
DataMap Execution Model
[Diagram: a query enters the Spark Driver; DataMap execution is either centralized or distributed to Executors, each running next to a Data Node that holds Carbon files and their DataMaps]
• Mainly includes Index DataMap and MV DataMap
• Pushes filter and projection to the DataMap
• DataMap can be centralized or distributed (avoiding huge memory issues)
• DataMap can be cached in memory or stored on disk
[Diagram: Catalyst and the table level index in the Spark Driver; a file level index and scanner in each Executor; Carbon files, each with data and a footer; an inverted index inside the column chunks]
Rich Multi-Level Index Support
• Using the index info in the footer, a two-level B+ tree index can be built:
  • File level index: local B+ tree, efficient blocklet level filtering
  • Table level index: global B+ tree, efficient file level filtering
• Column chunk inverted index: efficient column chunk scan
[Diagram: the Spark Driver and Spark Executor tasks reading HDFS files; each file holds blocklets (columns C1 … Cn) and a footer]
Example query: SELECT c3, c4 FROM t1 WHERE c2='boston'
1. File pruning
2. Blocklet pruning
3. Read and decompress the filter column
4. Binary search using the inverted index; skip to the next blocklet if nothing matches
5. Decompress the projection columns
6. Return decoded data to Spark
Efficient Filtering via Index
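The six steps can be walked through with a small Python sketch. It is a simplification under assumptions: files and blocklets are plain in-memory lists and dicts, min/max values are computed on the fly instead of being read from footers, and `scan`, `min_max` and `build_inverted_index` are hypothetical helper names.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Tuple

Blocklet = Dict[str, List[str]]  # column name -> values in row order
File = List[Blocklet]


def min_max(blocklet: Blocklet, column: str) -> Tuple[str, str]:
    values = blocklet[column]
    return min(values), max(values)


def build_inverted_index(values: List[str]) -> Tuple[List[str], List[int]]:
    """Sort the column values and remember each value's original row id."""
    order = sorted(range(len(values)), key=lambda row_id: values[row_id])
    return [values[row_id] for row_id in order], order


def scan(files: List[File], filter_col: str, literal: str,
         projection: List[str]) -> List[Tuple[str, ...]]:
    results = []
    for file in files:
        # 1. File pruning: skip files whose overall min/max excludes the literal.
        file_min = min(min_max(b, filter_col)[0] for b in file)
        file_max = max(min_max(b, filter_col)[1] for b in file)
        if not file_min <= literal <= file_max:
            continue
        for blocklet in file:
            # 2. Blocklet pruning with blocklet-level min/max.
            low, high = min_max(blocklet, filter_col)
            if not low <= literal <= high:
                continue
            # 3-4. Read the filter column and binary-search its inverted index.
            sorted_values, row_ids = build_inverted_index(blocklet[filter_col])
            start = bisect_left(sorted_values, literal)
            end = bisect_right(sorted_values, literal)
            if start == end:
                continue  # no match: skip to the next blocklet
            # 5-6. Read only the projected columns for the matching rows.
            for row_id in sorted(row_ids[start:end]):
                results.append(tuple(blocklet[col][row_id] for col in projection))
    return results


# Illustrative data for: SELECT c3, c4 FROM t1 WHERE c2='boston'
t1 = [
    [
        {"c2": ["austin", "boston"], "c3": ["a1", "b1"], "c4": ["x1", "y1"]},
        {"c2": ["chicago", "denver"], "c3": ["c1", "d1"], "c4": ["z1", "w1"]},
        {"c2": ["atlanta", "chicago"], "c3": ["e1", "f1"], "c4": ["u1", "v1"]},
    ],
    [
        {"c2": ["miami", "seattle"], "c3": ["m1", "s1"], "c4": ["p1", "q1"]},
    ],
]
rows = scan(t1, "c2", "boston", ["c3", "c4"])
```

In this data the second file is dropped at step 1, the chicago/denver blocklet at step 2, and the atlanta/chicago blocklet passes the min/max check but is skipped at step 4 because the binary search finds no match.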
Hybrid format
– Streaming: appendable format
– Batch: columnar
Query merging on both streaming and batch segments
Auto data conversion on handoff
– Based on streaming segment size
– Updates indexes and data maps
Supports Materialized Views
Streaming with Kafka & Spark
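A minimal Python sketch of the hybrid-format idea, under stated assumptions: `HybridTable` and its `handoff_rows` threshold are hypothetical stand-ins (the slide only says handoff depends on streaming segment size), and index or data map maintenance during handoff is left out.

```python
from typing import Dict, List

Row = Dict[str, int]


class HybridTable:
    """Hypothetical table with one appendable streaming segment and columnar batch segments."""

    def __init__(self, handoff_rows: int) -> None:
        self.handoff_rows = handoff_rows  # stand-in for the streaming segment size limit
        self.streaming_segment: List[Row] = []  # row-oriented, cheap to append
        self.batch_segments: List[Dict[str, List[int]]] = []  # column-oriented

    def append(self, row: Row) -> None:
        self.streaming_segment.append(row)
        if len(self.streaming_segment) >= self.handoff_rows:
            self._handoff()

    def _handoff(self) -> None:
        # Convert the buffered rows into one columnar segment, then start a fresh streaming segment.
        columns = {name: [row[name] for row in self.streaming_segment]
                   for name in self.streaming_segment[0]}
        self.batch_segments.append(columns)
        self.streaming_segment = []

    def total(self, column: str) -> int:
        # A query merges results from both batch and streaming segments.
        batch_sum = sum(sum(segment[column]) for segment in self.batch_segments)
        stream_sum = sum(row[column] for row in self.streaming_segment)
        return batch_sum + stream_sum


table = HybridTable(handoff_rows=3)
for quantity in [1, 2, 3, 4, 5]:
    table.append({"quantity": quantity})
```

After five appends the first three rows sit in a columnar segment and the last two are still in the streaming segment, yet `table.total("quantity")` sees all of them.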
[Diagram: a Base file plus Delete Delta and Insert Delta files]
Delete Delta:
• Stores the RowIds that are deleted
• Bitmap file format
Insert Delta:
• Stores newly added rows
• CarbonData file format
Update flow:
1. Find all rows that need update, by executing the sub query
2. Write the “Delete Delta” file
3. Write the “Insert Delta” file
Read flow:
1. Read the “Base” file
2. Read the “Delete Delta” and exclude the RowIds in the file
3. Read the “Insert Delta” and merge the new rows
Data Update / Delete
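The update and read flows can be sketched in Python as follows. This is an illustration under assumptions: `SegmentSketch` is a hypothetical class, the deltas are in-memory collections rather than files, and updates only target rows in the base data.

```python
from typing import Callable, Dict, List, Set, Tuple

Row = Tuple[str, int]


class SegmentSketch:
    """Hypothetical segment: immutable base rows plus delete and insert deltas."""

    def __init__(self, base_rows: List[Row]) -> None:
        self.base: Dict[int, Row] = dict(enumerate(base_rows))  # RowId -> row
        self.delete_delta: Set[int] = set()   # RowIds marked as deleted
        self.insert_delta: List[Row] = []     # newly written rows

    def delete(self, predicate: Callable[[Row], bool]) -> None:
        self.delete_delta.update(
            row_id for row_id, row in self.base.items() if predicate(row))

    def update(self, predicate: Callable[[Row], bool],
               new_row: Callable[[Row], Row]) -> None:
        # 1. Find all rows that need the update.
        matches = [(row_id, row) for row_id, row in self.base.items()
                   if row_id not in self.delete_delta and predicate(row)]
        for row_id, row in matches:
            self.delete_delta.add(row_id)           # 2. write the delete delta
            self.insert_delta.append(new_row(row))  # 3. write the insert delta

    def read(self) -> List[Row]:
        # Base rows minus deleted RowIds, merged with the inserted rows.
        live = [row for row_id, row in self.base.items()
                if row_id not in self.delete_delta]
        return live + self.insert_delta


segment = SegmentSketch([("boston", 10), ("austin", 20), ("denver", 30)])
segment.update(lambda r: r[0] == "austin", lambda r: (r[0], r[1] + 5))
segment.delete(lambda r: r[0] == "denver")
current = segment.read()
```

The base rows are never rewritten: an update becomes one entry in the delete delta plus one row in the insert delta, and the reader reconciles the three at query time.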
[Diagram: layered stack]
• Search Acceleration: multi-dimensional detail data query
• OLAP Acceleration: BI & batch data analytics
• Streaming Acceleration: near real-time data analytics
• AI Acceleration: AI-enabled data analytics
CarbonData Access: API, RPC, REST, Parallel Framework
CarbonData Core Engine: Table, Segment, Index, Caching
Also shown: Tools, Open Format (CarbonFile), Searchlet, Cloud Storage
CarbonData Technology Stack
• Bank
– Fraud detection
– Risk analysis
• Telco
– Churn Analysis
– VIP Care
• Monitoring
– IOV
– Unusual Human behavior analysis
• Internet
– Video access analysis
– Device size; resolution
– Server loads
Scenario: Half year of telecom data in one state
Cluster: 70 machines, 1120 cores
Data: 200 to 1000 Billion Rows, 80 columns
Carbon Table Index built on c1~c8
Workload:
Q1: filter (c1~c4), select *
Q2: filter (c1,c5), big result set
Q3: filter (c3)
Q4: filter (c1) and aggregate
Q5: full scan aggregate
[Chart: response time of Q1 to Q5 as a percentage of the response time at 200B records, for 200B, 400B, 600B, 800B and 1000B rows; y-axis from 0% to 500%]
Observation
When data grows:
⚫ Indexes are effective at limiting response time: Q1, Q3
⚫ When selectivity is low or the query is a full scan, response time grows linearly: Q2, Q5
⚫ Spark compute time scales linearly: Q4
Cluster: 178 machines, 1368 cores, 5550 GB memory
Data size: 3 PB, 10+ trillion rows
Largest Deployment
CarbonData use cases in Production
Roadmap
Releases, road Ahead, usage in Huawei Public Cloud
Releases on the timeline: 0.1.0-incubating, 0.1.1-incubating, 0.2.0-incubating, 1.0.0-incubating, 1.1.0, 1.1.1, 1.2.0, 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0
Aug-2016
• Indexed columnar store on HDFS
• Integration with Apache Spark
• Bulk load support
Nov-2016
• MR support
• Spark DataFrame support
• Configurable block size
• Performance improvements
Jan-2017
• Kettle dependency removal
• Spark 2.1 support
• IUD support for Spark 1.6
• Adaptive compression
• V2 format of CarbonData
• Vectorized readers
• Off-heap memory support
• Single-pass loading
May-2017
• CarbonData V3 format
• Alter Table
• Range Filter
• Large cluster optimizations
• Code refactoring
Sep-2017
• Presto support
• Sort columns configuration
• Partition support
• Datamaps
• IUD support for Spark 2.1
• Dynamic property configuration
• RLE codec support
• Compaction performance
Feb-2018
• Spark 2.2.1 support
• Streaming
• Pre-aggregate tables
• CTAS
• Code refactoring
• Performance improvements
• Read from S3 support
Jun-2018
• Query performance improvements
• Compaction performance improvements
• Data loading performance improvements
• SDK support
• Streaming on pre-aggregate and partitioned tables
• MV support
• Bloom filter
• Write to S3 support
Aug-2018
• Local Dictionary support
• Query performance improvements
• Custom compaction
• Support varchar
• Code refactoring
• Hadoop 2.8.3 support
Oct-2018
• Spark 2.3.2 support
• Hadoop 3.1.1 support
• C++ reader through SDK
• StreamingSQL from Kafka
• Data Summary Tool
• Better Avro compliance with more datatypes
• Adaptive encoding for all numeric columns
• Lightweight integration with Spark (fileformat)
Release milestones, Features
Towards Objectives / Goals
➢ Unify CarbonData for more use cases
➢ Performance enhancements
➢ Broader ecosystem Integration
➢ Auto tuning storage
In near future
➢ DataMaps and MVs to strengthen more use cases (GeoSpatial, Graph, Rtrees, Time Series)
➢ Easy Data Maintenance
➢ In-Memory Caching
➢ Compliance to Spark Data Source V2
➢ Continuous Streaming
➢ Improve cloud storage performance
➢ More encodings
➢ ML SQL
Road Ahead
[Diagram: input sources feed ingestion services (CDM, DIS, CS), which load a data lake based on object store (CarbonData Store on OBS), which in turn serves intelligence services (DLI, DWS, MRS, FRS, MLS, GES)]
CarbonData in Huawei Cloud
Store in CarbonData format and Analyze on need basis
What is Apache CarbonData?
[Diagram: real-time data is stored in CarbonData format (data, indexes, data maps); query processing over history and real-time data feeds analysis]
Unified data analysis and Unified storage format
Recap …
Community
Community Contributions, Usage in Production
Companies Contributing
Apache CarbonData Graduated in April 2017
12 Stable releases
130+ Contributors in a very short span
CarbonData Incubated into Apache in June 2016
Started working on CarbonData from 2015
Apache CarbonData Community
We would love more community involvement and feedback
• Subscribe to the dev mailing list
• Mail list: dev@carbondata.apache.org, user@carbondata.apache.org
• Mailing list Archive: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
• Welcome any type of contribution: feature, documentation or bug report:
• Code: https://github.com/apache/carbondata
• JIRA: https://issues.apache.org/jira/browse/CARBONDATA
• Website: http://carbondata.apache.org
• cwiki: https://cwiki.apache.org/confluence/display/CARBONDATA/CarbonData+Home
The Apache CarbonData community is very open to fresh contributions. We welcome anyone who brings a new perspective to use cases, design or requirements.
The community welcomes all kinds of contributions, including tests, documentation enhancements, website enhancements, example implementations and integration with other execution engines.
Contributions …
Thank You

More Related Content

What's hot

Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at FacebookScaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
Databricks
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
Databricks
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
Databricks
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxData
 
Care and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst OptimizerCare and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst Optimizer
Databricks
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
Owen O'Malley
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
Databricks
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs Pig
Lester Martin
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptx
Andrew Lamb
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Spark Summit
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
Databricks
 
Operating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionOperating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in Production
Databricks
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
Databricks
 
Apache Spark for Library Developers with William Benton and Erik Erlandson
 Apache Spark for Library Developers with William Benton and Erik Erlandson Apache Spark for Library Developers with William Benton and Erik Erlandson
Apache Spark for Library Developers with William Benton and Erik Erlandson
Databricks
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
DataWorks Summit/Hadoop Summit
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5
Yan Zhou
 
How To Connect Spark To Your Own Datasource
How To Connect Spark To Your Own DatasourceHow To Connect Spark To Your Own Datasource
How To Connect Spark To Your Own Datasource
MongoDB
 
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan ZhangExperiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Databricks
 
Building Operational Data Lake using Spark and SequoiaDB with Yang Peng
Building Operational Data Lake using Spark and SequoiaDB with Yang PengBuilding Operational Data Lake using Spark and SequoiaDB with Yang Peng
Building Operational Data Lake using Spark and SequoiaDB with Yang Peng
Databricks
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Databricks
 

What's hot (20)

Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at FacebookScaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
 
Care and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst OptimizerCare and Feeding of Catalyst Optimizer
Care and Feeding of Catalyst Optimizer
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs Pig
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptx
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 
Operating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionOperating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in Production
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
Apache Spark for Library Developers with William Benton and Erik Erlandson
 Apache Spark for Library Developers with William Benton and Erik Erlandson Apache Spark for Library Developers with William Benton and Erik Erlandson
Apache Spark for Library Developers with William Benton and Erik Erlandson
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5
 
How To Connect Spark To Your Own Datasource
How To Connect Spark To Your Own DatasourceHow To Connect Spark To Your Own Datasource
How To Connect Spark To Your Own Datasource
 
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan ZhangExperiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
 
Building Operational Data Lake using Spark and SequoiaDB with Yang Peng
Building Operational Data Lake using Spark and SequoiaDB with Yang PengBuilding Operational Data Lake using Spark and SequoiaDB with Yang Peng
Building Operational Data Lake using Spark and SequoiaDB with Yang Peng
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
 

Similar to Apache CarbonData+Spark to realize data convergence and Unified high performance Data Analytics - Raghunandan Subramanya (Huawei Technologies India Pvt Ltd)

FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
Amazon Web Services
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
DATAVERSITY
 
Big data meet_up_08042016
Big data meet_up_08042016Big data meet_up_08042016
Big data meet_up_08042016
Mark Smith
 
Knowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data ScienceKnowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data Science
Cambridge Semantics
 
Suburface 2021 IBM Cloud Data Lake
Suburface 2021 IBM Cloud Data LakeSuburface 2021 IBM Cloud Data Lake
Suburface 2021 IBM Cloud Data Lake
Torsten Steinbach
 
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
DATAVERSITY
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
David P. Moore
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big Data
Frank Kienle
 
L'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo BrignoliL'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo Brignoli
Data Driven Innovation
 
StreamCentral Technical Overview
StreamCentral Technical OverviewStreamCentral Technical Overview
StreamCentral Technical Overview
Raheel Retiwalla
 
Big data analytics and machine intelligence v5.0
Big data analytics and machine intelligence   v5.0Big data analytics and machine intelligence   v5.0
Big data analytics and machine intelligence v5.0
Amr Kamel Deklel
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017
Databricks
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Amazon Web Services
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
James Serra
 
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
Webinar: Enterprise Data Management in the Era of MongoDB and Data LakesWebinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
MongoDB
 
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and CassandraLow-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Caserta
 
Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Big Data part 2
Skillwise Big Data part 2
Skillwise Group
 
Skilwise Big data
Skilwise Big dataSkilwise Big data
Skilwise Big data
Skillwise Group
 
Dynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the flyDynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the fly
DataWorks Summit
 
Data & Analytics - Session 1 - Big Data Analytics
Data & Analytics - Session 1 -  Big Data AnalyticsData & Analytics - Session 1 -  Big Data Analytics
Data & Analytics - Session 1 - Big Data Analytics
Amazon Web Services
 

Similar to Apache CarbonData+Spark to realize data convergence and Unified high performance Data Analytics - Raghunandan Subramanya (Huawei Technologies India Pvt Ltd) (20)

FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Big data meet_up_08042016
Big data meet_up_08042016Big data meet_up_08042016
Big data meet_up_08042016
 
Knowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data ScienceKnowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data Science
 
Suburface 2021 IBM Cloud Data Lake
Suburface 2021 IBM Cloud Data LakeSuburface 2021 IBM Cloud Data Lake
Suburface 2021 IBM Cloud Data Lake
 
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big Data
 
L'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo BrignoliL'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo Brignoli
 
StreamCentral Technical Overview
StreamCentral Technical OverviewStreamCentral Technical Overview
StreamCentral Technical Overview
 
Big data analytics and machine intelligence v5.0
Apache CarbonData+Spark to realize data convergence and Unified high performance Data Analytics - Raghunandan Subramanya (Huawei Technologies India Pvt Ltd)

  • 1. Data convergence and Unified high performance Data Analytics with Apache CarbonData + Spark Presented By : Raghunandan S
  • 2. I am … Raghunandan S • Head of the technical team working on Bigdata technologies including Hadoop, Spark, HBase, CarbonData and ZooKeeper • Systems Group leader of the team which developed a BI solution based on MOLAP • Project Manager of the CarbonData team with responsibility to clean up, re-architect, incubate into Apache and support till it reaches Top Level Project • Prior experience in NMS, Softswitch products Chief Architect at Huawei’s India R&D centre’s BigData group raghunandan@apache.org
  • 3. Use cases & Challenges Unified Data Analytics CarbonData Technology & Solution CarbonData Roadmap CarbonData Community
  • 4. Use Cases and Challenges Background of CarbonData project
  • 5. Network • 54B records per day • 750TB per month • Complex correlated data Consumer • 100 thousands of sensors • >2 million events per second • Time series, geospatial data Enterprise & Banking • 100 GB to TB per day • Data across different regions Data Report & Dashboard OLAP & Ad-hoc Batch processing Machine learning Real Time Analytics Graph analysis Time series analysis Geo spatial and position analysis Textual matching and analysis Fraud detection Traffic analysis User behavior Advertising Failure detection Home automation Data analysis use cases
  • 6. Big Data and Analytics Challenges ➢ Data Size • Single Table >10 billion rows ➢ Multi-dimensional • Wide table: > 100 dimensions ➢ Rich in Detail • High cardinality (> 1 billion) • 1B terminal * 200K cell / minute • Complex & Nested Structure ➢ Real Time • Time Series streaming data ➢ Reporting • Big Scan • Distributed Compute ➢ Searching • Located Scan • Isolated Compute ➢ AI • Iterative Scan • Iterative Compute
  • 7. [Diagram labels: Input Sources, ETL pipelines, Data stores, No-SQL, Document store, TXT, Data Warehouse, HOT, COLD, Time series database, Real time, Storage, Query Processing, Analysis; chart x-axis: 2015 to 2018] • Multiple stores for specific analysis • Complex data pipelines and high maintenance cost Big Data and Analytics Challenges
  • 8. • Coordination across multiple teams makes Self-BI tough to achieve for complex and big data [Diagram labels: IT, Business, Data Engineers, Data Analysts, Complex pipelines, Various stores, Business needs] Big Data and Analytics Challenges
  • 9. • Data lake • Performance not comparable to data warehouse • Variety of Scans • Corrections/Updates to data • Real time and history data analysis • Consistency of query results • With all the above challenges, still need to be simple to analyze • Store now and analyze later. • Will also facilitate Self assist BI. Data lake Big Data and Analytics Challenges
  • 10. • Real time and History data • Data consistency and Duplication • Materialized Views or indexes [Diagram labels: Query Processing (History + Real time), Real time data, History data, Analysis; chart x-axis: 2015 to 2018] Big Data and Analytics Challenges
  • 11. Unified Data Analytics Solution for data silos; data redundancy; Multiple use cases
  • 12. Unified data analytics • What is Apache CarbonData? Store in CarbonData format and analyze on a need basis • Unified data analysis and unified storage format [Diagram labels: Query Processing (History + Real time), Analysis, Real time data, CarbonData format, Data, Indexes, Data maps]
  • 13. CarbonData Solution Technology, Features, Design adopted for Unified Storage
  • 14. CarbonData Solution & Features High Scalability • Storage and processing separated • Suitable for cloud Data consistency • No intermediate data on failures • Supported for every operation on data (Query, IUD, Indexes, Data maps, Streaming) ACID Insert - Update - Delete • Support bulk Insert, Update, Deletes Fast Query Execution • Filters based on Multi-level Indexes • Columnar format • Query processing on dictionary encoded values. AI Reporting Searching Unified Storage Fast: Construct multi-dimension indexes Faster: Based on fast intelligent scanning Efficient data compression: dictionary encoding Concurrent data import: Spark parallel task
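The "query processing on dictionary encoded values" bullet above can be illustrated with a minimal Python sketch. It is not CarbonData's implementation; the function names `dictionary_encode` and `filter_equals` are hypothetical, and the dictionary here is simply the sorted set of distinct values. The point it shows is that an equality filter translates the literal into a code once and then compares small integers instead of strings.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[Dict[str, int], List[int]]:
    """Map each distinct value to a small integer code (assigned in sorted order)."""
    dictionary = {value: code for code, value in enumerate(sorted(set(values)))}
    return dictionary, [dictionary[value] for value in values]


def filter_equals(dictionary: Dict[str, int], codes: List[int], literal: str) -> List[int]:
    """Return row positions equal to `literal`, comparing integer codes only."""
    code = dictionary.get(literal)
    if code is None:
        # The literal never occurs in this column, so no row can match.
        return []
    return [row for row, c in enumerate(codes) if c == code]


cities = ["boston", "austin", "boston", "denver"]
dictionary, codes = dictionary_encode(cities)
matches = filter_equals(dictionary, codes, "boston")
```

A literal that is absent from the dictionary is rejected without touching the column data at all, which is one reason encoded columns filter cheaply.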
  • 15. CarbonData Solution & Features Decoupled Storage & Processing • Data in CarbonData format • Processing in distributed frameworks Real time + History data • Real time and history data in the same table • Query can combine results automatically • Auto handoff to columnar format Multi level Index • Supports multi-dimensional B-tree • Min-max and inverted indexes External Indexes for Optimizations • External indexes like Lucene for text • MV Data maps for aggregate tables, time series queries • Query auto selects the required data map
  • 16. [Diagram: Carbon Data File = File Header (Version, Schema); Blocklet 1 … Blocklet N, each holding Column 1 Chunk, Column 2 Chunk, … Column n Chunk, each chunk made of Page1, Page2, …; File Footer (Blocklet Offset, Index & Stats)] CarbonData File Layout ➢ Data Layout • Block : An HDFS file • Blocklet : A set of rows in columnar format o Column chunk : Data for one column in a blocklet; smallest IO unit o Page : Data Page in the Column chunk; smallest decoding unit ➢ Metadata Information • Header : Version, Schema • Footer : Blocklet Offset, Index & File level statistics ➢ Built-in indexes and statistics • Blocklet Index : B-Tree start key, end key • Blocklet & Page level statistics : min, max etc
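The nesting described on this slide (file, blocklet, column chunk, page, plus header and footer) can be modelled in memory as below. This is only an illustrative Python sketch: the class names and `build_file` are hypothetical, the footer keeps just per-blocklet min/max statistics, and blocklet offsets and the B-tree keys are left out. It says nothing about the actual on-disk encoding.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class Page:
    """Smallest decoding unit: a run of values for one column."""
    values: List[Any]


@dataclass
class ColumnChunk:
    """All pages of one column inside a blocklet; smallest IO unit."""
    pages: List[Page]

    def min_max(self) -> Tuple[Any, Any]:
        values = [v for page in self.pages for v in page.values]
        return min(values), max(values)


@dataclass
class Blocklet:
    """A set of rows stored column by column."""
    chunks: Dict[str, ColumnChunk]


@dataclass
class Footer:
    """One {column: (min, max)} dict per blocklet (offsets omitted in this sketch)."""
    blocklet_stats: List[Dict[str, Tuple[Any, Any]]]


@dataclass
class CarbonFileSketch:
    version: int          # header: format version
    schema: List[str]     # header: column names
    blocklets: List[Blocklet]
    footer: Footer


def build_file(version: int, schema: List[str], blocklets: List[Blocklet]) -> CarbonFileSketch:
    """Assemble a file and derive the footer statistics from the data."""
    stats = [{name: chunk.min_max() for name, chunk in b.chunks.items()} for b in blocklets]
    return CarbonFileSketch(version, schema, blocklets, Footer(stats))


blocklet = Blocklet({
    "city": ColumnChunk([Page(["austin", "boston"]), Page(["denver"])]),
    "sales": ColumnChunk([Page([10, 30]), Page([20])]),
})
carbon_file = build_file(version=3, schema=["city", "sales"], blocklets=[blocklet])
```

Because the statistics sit in the footer, a reader can decide whether a blocklet is worth opening before reading any column chunk.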
  • 17. TimeseriesOLAP GraphTextGeospatial AI Data CarbonData CarbonTable DataMap DataMap DSL CarbonData Engine • Stores Data / Metadata / Index / Statistics for Query optimization. • Auto selects the required data map during Query. • Updated Inline or offline based on configuration. • Supports custom data maps. DataMaps
  • 18. Index Datamap(s): • Lucene Datamap: Text search • Rtree Datamap: Spatial Queries • Min-Max Datamap, BloomFilter Datamap: Point and Filter Queries MV Datamap(s): • Time series Cube Datamap: OLAP Time series Queries • Aggregate table Datamap: Materialized Views Examples: CREATE DATAMAP agg_sales ON TABLE sales USING "preaggregate" AS SELECT country, sum(quantity), avg(price) FROM sales GROUP BY country; CREATE DATAMAP datamap_name ON TABLE main_table USING 'lucene' DMPROPERTIES ('index_columns'='city, name', ...) DataMaps
  • 19. DataMap Execution Model Data Node Spark Driver DataMap Execution (centralized, Distributed) Spark Driver Query Executor Carbon File Data Map Data Node Carbon File Data Map Data Node Carbon File Data Map Executor Executor • Mainly includes Index DataMap and MV DataMap • Push filter and projection to DataMap • DataMap can be centralized or distributed (avoiding huge memory issues) • DataMap can be cached in memory or stored on disk
  • 20. Spark Driver Executor Carbon File Data Footer Carbon File Data Footer Carbon File Data Footer File Level Index & Scanner Table Level Index Executor File Level Index & Scanner Catalyst Inverted Index Rich Multi-Level Index Support • Using the index info in footer, two level B+ tree index can be built: • File level index: local B+ tree, efficient blocklet level filtering • Table level index: global B+ tree, efficient file level filtering • Column chunk inverted index: efficient column chunk scan
  • 21. [Diagram labels: Spark Driver, Spark Executor, Task, HDFS, File, Blocklet, Footer, columns C1 C2 C3 C4 … Cn] 1. File pruning 2. Blocklet pruning 3. Read and decompress filter column 4. Binary search using inverted index, skip to next blocklet if nothing matches 5. Decompress projection column 6. Return decoded data to Spark SELECT c3, c4 FROM t1 WHERE c2=’boston’ Efficient Filtering via Index
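The six steps on this slide can be walked through with a small Python sketch for the example query `SELECT c3, c4 FROM t1 WHERE c2='boston'`. It rests on several simplifying assumptions: a file is a list of blocklets, a blocklet is a dict of column name to values, min/max bounds are recomputed on the fly instead of being read from the footer, and a linear scan stands in for the binary search over the inverted index. The names `query`, `min_max` and `may_contain` are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Blocklet = Dict[str, List[Any]]  # column name -> column values


def min_max(blocklet: Blocklet, column: str) -> Tuple[Any, Any]:
    values = blocklet[column]
    return min(values), max(values)


def may_contain(bounds: Tuple[Any, Any], literal: Any) -> bool:
    """A range can only hold the literal if min <= literal <= max."""
    low, high = bounds
    return low <= literal <= high


def query(files: List[List[Blocklet]], filter_col: str, literal: Any,
          projection: List[str]) -> List[tuple]:
    rows = []
    for blocklets in files:
        # 1. File pruning: skip the whole file if its range excludes the literal.
        file_low = min(min_max(b, filter_col)[0] for b in blocklets)
        file_high = max(min_max(b, filter_col)[1] for b in blocklets)
        if not may_contain((file_low, file_high), literal):
            continue
        for b in blocklets:
            # 2. Blocklet pruning with the same min/max test.
            if not may_contain(min_max(b, filter_col), literal):
                continue
            # 3-4. Scan only the filter column for matching row positions.
            hits = [i for i, v in enumerate(b[filter_col]) if v == literal]
            # 5-6. Touch the projection columns only for the matching rows.
            rows.extend(tuple(b[c][i] for c in projection) for i in hits)
    return rows


files = [
    [
        {"c2": ["austin", "boston"], "c3": [1, 2], "c4": ["a", "b"]},
        {"c2": ["denver", "elpaso"], "c3": [3, 4], "c4": ["c", "d"]},
    ],
    [
        {"c2": ["miami", "newyork"], "c3": [5, 6], "c4": ["e", "f"]},
    ],
]
result = query(files, "c2", "boston", ["c3", "c4"])
```

In this toy data set the second file and the second blocklet of the first file are both skipped without reading `c3` or `c4`.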
  • 22. Hybrid format – Streaming: appendable format – Batch: columnar Query merging on both streaming and batch segments Auto data conversion on Handoff – streaming segment size – Updates Indexes and data maps Support Materialized View! Streaming with Kafka & Spark
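The hybrid format and handoff idea on this slide can be sketched in Python as follows. This is a toy model under stated assumptions, not CarbonData's streaming code: the class `HybridTable` and its `handoff_rows` threshold are hypothetical, rows are plain dicts, and the handoff is triggered by row count rather than by segment size in bytes.

```python
from typing import Any, Dict, List


class HybridTable:
    """Appendable row-format streaming segment plus columnar batch segments."""

    def __init__(self, handoff_rows: int) -> None:
        self.handoff_rows = handoff_rows
        self.streaming: List[Dict[str, Any]] = []            # row format, cheap to append
        self.batch_segments: List[Dict[str, List[Any]]] = []  # columnar format

    def append(self, row: Dict[str, Any]) -> None:
        self.streaming.append(row)
        if len(self.streaming) >= self.handoff_rows:
            self._handoff()

    def _handoff(self) -> None:
        # Convert the full streaming segment into one columnar segment.
        columns = {name: [r[name] for r in self.streaming] for name in self.streaming[0]}
        self.batch_segments.append(columns)
        self.streaming = []

    def scan(self, column: str) -> List[Any]:
        """Query merging: read batch segments first, then the live streaming rows."""
        values: List[Any] = []
        for segment in self.batch_segments:
            values.extend(segment[column])
        values.extend(r[column] for r in self.streaming)
        return values


table = HybridTable(handoff_rows=2)
for v in (1, 2, 3):
    table.append({"v": v, "source": "kafka"})
```

After three appends with a threshold of two, the first two rows live in a columnar segment and the third is still in the streaming segment, yet `table.scan("v")` sees all of them.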
  • 23. Delete Delta: • Stores the RowIds that are deleted • Bitmap file format Insert Delta: • Stores newly added rows • CarbonData file format Update flow: 1. Find all rows that need update, by executing the sub query 2. Write the “Delete Delta” file 3. Write the “Insert Delta” file Read flow: 1. Read the “Base” file 2. Read “Delete Delta” and exclude the RowIds in the file 3. Read “Insert Delta” and merge the new rows Data Update / Delete
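The update and read flows on this slide can be sketched in Python as below. The sketch makes several assumptions: the base file is a list of row dicts, the "Delete Delta" is a Python set of row ids standing in for the bitmap file, the "Insert Delta" is a list of new rows, and only base rows are considered for update. The function names `update` and `read_with_deltas` are hypothetical.

```python
from typing import Any, Callable, Dict, List, Set

Row = Dict[str, Any]


def update(base_rows: List[Row], deleted_row_ids: Set[int], inserted_rows: List[Row],
           predicate: Callable[[Row], bool], change: Callable[[Row], Row]) -> None:
    """Update = mark the old row as deleted and write the new version as an insert."""
    for row_id, row in enumerate(base_rows):
        # 1. Find the rows that need an update (still live and matching).
        if row_id not in deleted_row_ids and predicate(row):
            deleted_row_ids.add(row_id)          # 2. record in the delete delta
            inserted_rows.append(change(row))    # 3. record in the insert delta


def read_with_deltas(base_rows: List[Row], deleted_row_ids: Set[int],
                     inserted_rows: List[Row]) -> List[Row]:
    """Read the base, drop deleted row ids, then merge the newly inserted rows."""
    live = [row for row_id, row in enumerate(base_rows) if row_id not in deleted_row_ids]
    return live + inserted_rows


base = [{"id": 1, "city": "boston"}, {"id": 2, "city": "austin"}]
deleted: Set[int] = set()
inserted: List[Row] = []
update(base, deleted, inserted,
       predicate=lambda r: r["id"] == 2,
       change=lambda r: {**r, "city": "denver"})
current = read_with_deltas(base, deleted, inserted)
```

The base rows are never rewritten in place; the update is expressed entirely through the two delta structures, which is what keeps the base file immutable.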
  • 24. CarbonData Core Engine Table, Segment, Index, Caching Search Acceleration Multi-Dimensional detail data query OLAP Acceleration BI & Batch data analytics Streaming Acceleration Near real-time data analytics AI Acceleration AI-enabled data analytics Tools Open Format CarbonFile Searchlet Cloud Storage CarbonData Access API, RPC, REST, Parallel Framework CarbonData Technology Stack
  • 25. • Bank – Fraud detection – Risk analysis • Telco – Churn Analysis – VIP Care • Monitoring – IOV – Unusual Human behavior analysis • Internet – Video access analysis – Device size; resolution – Server loads Scenario: Half year of telecom data in one state Cluster: 70 machines, 1120 cores Data: 200 to 1000 Billion Rows, 80 columns Carbon Table Index built on c1~c8 Workload: Q1: filter (c1~c4), select * Q2: filter (c1,c5), big result set Q3: filter (c3) Q4: filter (c1) and aggregate Q5: full scan aggregate [Chart: Response Time (% of 200B records) for Q1 to Q5 at 200B, 400B, 600B, 800B and 1000B rows; y-axis 0% to 500%] Observation When data grows: • Index is efficient to reduce response time: Q1, Q3 • When selectivity is low or full scan, query response time is linear: Q2, Q5 • Spark compute time scales linearly: Q4 Largest Deployment Cluster: 178 machines, 1368 cores, 5550 GB Mem Data Size: 3 PB, 10+ trillion rows CarbonData use cases in Production
  • 26. Roadmap Releases, road Ahead, usage in Huawei Public Cloud
  • 27. Releases: 0.1.0-incubating, 0.1.1-incubating, 0.2.0-incubating, 1.0.0-incubating, 1.1.0, 1.1.1, 1.2.0, 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0 Aug-2016 • Indexed Columnar Store on HDFS • Integration with Apache Spark • Bulk load support Nov-2016 • MR support • Spark DataFrame support • Configurable block size • Performance Improvements Jan-2017 • Kettle Dependency removal • Spark 2.1 support • IUD support for Spark 1.6 • Adaptive compression • V2 Format of CarbonData • Vectorized readers • Off-heap memory support • Single Pass loading May-2017 • CarbonData V3 format • Alter Table • Range Filter • Large cluster optimizations • Code refactoring Sep-2017 • Presto support • Sort columns configuration • Partition support • Datamaps • IUD support for Spark 2.1 • Dynamic property configuration • RLE codec support • Compaction performance Feb-2018 • Spark 2.2.1 support • Streaming • Pre-aggregate tables • CTAS • Code refactoring • Performance improvements • Read from S3 support Jun-2018 • Query Performance improvements • Compaction performance improvements • Data loading performance improvements • SDK support • Streaming on pre-agg, partitioned tables • MV support • Bloom filter • Write to S3 support Aug-2018 • Local Dictionary support • Query performance improvements • Custom compaction • Support varchar • Code refactoring • Hadoop 2.8.3 support Oct-2018 • Spark 2.3.2 support • Hadoop 3.1.1 support • C++ reader through SDK • StreamingSQL from Kafka • Data Summary Tool • Better Avro compliance with more datatypes • Adaptive encoding for all numeric columns • LightWeight integration with Spark (fileformat) Release milestones, Features
  • 28. Road Ahead
      Objectives / Goals:
      ➢ Unify CarbonData for more use cases
      ➢ Performance enhancements
      ➢ Broader ecosystem integration
      ➢ Auto-tuning storage
      In the near future:
      ➢ DataMaps and MVs to strengthen more use cases (geospatial, graph, R-trees, time series)
      ➢ Easy data maintenance
      ➢ In-memory caching
      ➢ Compliance with Spark Data Source V2
      ➢ Continuous streaming
      ➢ Improved cloud storage performance
      ➢ More encodings
      ➢ ML SQL
  • 29. CarbonData in Huawei Cloud
      [Diagram layers: Input Sources; Ingestion Services; DataLake based on Object Store; Intelligence Services. Components shown: CDM, DIS, CS, CarbonData Store, OBS, DLI, DWS, MRS, FRS, MLS, GES]
  • 30. Recap … What is Apache CarbonData?
      • Unified data analysis and unified storage format
      • Store in CarbonData format and analyze on a need basis
      [Diagram labels: Real time data; CarbonData format (Data, Indexes, Data maps); Query Processing (History + Real time); Analysis]
  • 32. Apache CarbonData Community
      • Started working on CarbonData in 2015
      • CarbonData incubated into Apache in June 2016
      • Graduated in April 2017
      • 12 stable releases
      • 130+ contributors in a very short span
      • Companies contributing
  • 33. Contributions …
      We would love more community involvement and feedback.
      • Subscribe to the dev mailing list
      • Mailing lists: dev@carbondata.apache.org, user@carbondata.apache.org
      • Mailing list archive: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
      • Any type of contribution is welcome: feature, documentation or bug report
      • Code: https://github.com/apache/carbondata
      • JIRA: https://issues.apache.org/jira/browse/CARBONDATA
      • Website: http://carbondata.apache.org
      • cwiki: https://cwiki.apache.org/confluence/display/CARBONDATA/CarbonData+Home
      The Apache CarbonData community is very open to fresh contributions. We welcome anyone who brings a new perspective on use cases, design or requirements. The community welcomes all kinds of contributions, including tests, documentation enhancements, website enhancements, example implementations and integration with other execution engines.