Apache Doris
(incubating)
A simple and single tightly coupled OLAP system
lide@apache.org
About me
 I’m Reed, an Apache committer from China.
 I work at Baidu Inc., “China’s Google”, as a senior software engineer and a core developer of Doris.
 10+ years of experience in software development, and 5 years of experience managing a team.
 Previously worked for Naver, South Korea’s largest web search engine, and Qihoo 360, China’s largest internet security company.
Overview
1 INTRODUCTION – What is Doris
2 ARCHITECTURE – Key techniques
3 FEATURES – Features and functions
4 STORAGE – Data storage model
What is Doris
 An MPP-based interactive SQL data warehouse for reporting and analysis
 A simple, single, tightly coupled system that does not depend on any other system
2014: project started
2018: entered the Apache Incubator
What is Doris based on
 Doris is a technology combination of Google Mesa and Apache Impala.
Used in Baidu Inc.
 1000+ deployed machines
 200+ business lines
 1PB maximum data in one cluster
Companies using Doris
Use case – JD.com
-- Provided by JD.com
Why JD.com chose Doris
 Google’s Mesa theory combined with Baidu’s engineering practice
 Compatible with the MySQL protocol
 Support for high concurrency and high QPS
 Convenient operation and maintenance thanks to a clear architecture
-- Provided by JD.com
Overview
1 INTRODUCTION – What is Doris
2 ARCHITECTURE – Key techniques
3 FEATURES – Features and functions
4 STORAGE – Data storage model
Doris Architecture
Doris' implementation consists of two daemons: frontend (FE) and backend (BE).
 The frontend consists of the query coordinator and the catalog manager; it sends query fragments to the backends.
 The backend stores the data and executes the query fragments, sending results back to the frontend.
Key Technology
Doris = Google Mesa + Apache Impala + Apache ORCFile
 From Mesa: aggregate data model, versioned data management, prefix index, online schema change, ...
 From Impala: query engine, ...
 From ORCFile: file format, indexes, compression, encoding, ...
Overview
1 INTRODUCTION – What is Doris
2 ARCHITECTURE – Key techniques
3 FEATURES – Features and functions
4 STORAGE – Data storage model
Compatible MySQL Protocol
 Compatible with the MySQL protocol, so standard ODBC/JDBC drivers work
 Multiple frontends
 Distributed queries
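Because Doris speaks the MySQL protocol, any stock MySQL client or ODBC/JDBC driver can talk to a frontend directly. A minimal sketch (the FE host, query port, and user are deployment-specific assumptions; 9030 is a common default query port):

```sql
-- Connect with a standard MySQL client (host/port/user are assumptions):
--   mysql -h <fe_host> -P 9030 -u root
-- Once connected, ordinary MySQL-style statements work unchanged:
SHOW DATABASES;
USE example_db;
SELECT user_id, sum(cost) FROM example_tbl GROUP BY user_id;
```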
Two-Level Partitioning
1. Range partitioning: the user specifies a column (usually the time-series column) and ranges of its values to partition the data.
2. Hash partitioning: the user specifies one or more columns and a number of buckets for hash bucketing.
Two-Level Partitioning
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
`user_id` LARGEINT NOT NULL COMMENT "user id",
`date` DATE NOT NULL COMMENT "date and time",
... /* Omit other columns */
) AGGREGATE KEY(`user_id`, `date`, `timestamp`, `city`, `age`, `sex`)
PARTITION BY RANGE(`date`)
(
PARTITION `p201801` VALUES LESS THAN ("2018-02-01"),
PARTITION `p201802` VALUES LESS THAN ("2018-03-01"),
PARTITION `p201803` VALUES LESS THAN ("2018-04-01")
)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16
... /* Omit other information */
;
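Since range partitions are usually keyed on time, a typical follow-up operation is adding the next partition as new data arrives. A sketch against the schema above (the partition name and bound are illustrative):

```sql
-- Add the next month's range partition (name and bound are illustrative):
ALTER TABLE example_db.example_tbl
ADD PARTITION `p201804` VALUES LESS THAN ("2018-05-01");

-- Inspect the partitions of the table:
SHOW PARTITIONS FROM example_db.example_tbl;
```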
MVCC and Compaction
Materialized view
Materialized view
ALTER TABLE example_tbl ADD ROLLUP rollup_cost(user_id, cost);
SHOW ALTER TABLE ROLLUP;
EXPLAIN SELECT user_id, sum(cost) FROM example_tbl GROUP BY user_id;
+-----------------------------------------+
| Explain String |
+-----------------------------------------+
| PLAN FRAGMENT 0 |
...
| 1:AGGREGATE (update serialize) |
...
| 0:OlapScanNode |
| TABLE: example_tbl |
| rollup: rollup_cost |
...
Columnar storage
Overview
1 INTRODUCTION – What is Doris
2 ARCHITECTURE – Key techniques
3 FEATURES – Features and functions
4 STORAGE – Data storage model
Data Model
 In Mesa, a table schema specifies a key space K for the table and a corresponding value space V.
 The table schema also specifies an aggregation function F : V × V → V.
 Doris combines Mesa's data model with ORCFile / Parquet storage technology.
Data Model
Types of Model
1. AGGREGATE KEY (with aggregation functions SUM, REPLACE, MAX and MIN)
2. UNIQUE KEY
3. DUPLICATE KEY
Aggregate key model
CREATE TABLE IF NOT EXISTS example_db.example_tbl
(
`user_id` LARGEINT NOT NULL COMMENT "user ID",
`date` DATE NOT NULL COMMENT "date and time",
`city` VARCHAR(20) COMMENT "the city the user lives in",
`age` SMALLINT COMMENT "age of user",
`sex` TINYINT COMMENT "sex of user",
`last_visit_date` DATETIME REPLACE COMMENT "the date last visited",
`cost` BIGINT SUM DEFAULT "0" COMMENT "total cost",
`max_dwell_time` INT MAX DEFAULT "0" COMMENT "maximum dwell time",
`min_dwell_time` INT MIN DEFAULT "99999" COMMENT "minimum dwell time"
) AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
... /* Omit partition and distribution info */
;
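To see what the aggregate key model does: whenever loaded rows share the same values for all key columns, their value columns are merged by the declared aggregation functions, and queries only ever see the merged row. A sketch with made-up rows:

```sql
-- Suppose two rows are loaded with the same key (10000, "2018-01-01", "Beijing", 20, 0):
--   row 1: last_visit_date = "2018-01-01 06:00:00", cost = 20, dwell time 10/10
--   row 2: last_visit_date = "2018-01-01 07:00:00", cost = 15, dwell time 2/2
-- Doris keeps a single row per key:
--   last_visit_date -> "2018-01-01 07:00:00"  (REPLACE keeps the newest value)
--   cost            -> 35                     (SUM)
--   max_dwell_time  -> 10 (MAX), min_dwell_time -> 2 (MIN)
SELECT * FROM example_db.example_tbl WHERE user_id = 10000;
```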
Duplicate key model
CREATE TABLE IF NOT EXISTS example_db.example_tb2
(
`timestamp` DATETIME NOT NULL COMMENT "log time",
`type` INT NOT NULL COMMENT "log type",
`error_code` INT COMMENT "error code",
`error_msg` VARCHAR(1024) COMMENT "error message",
`op_id` BIGINT COMMENT "operator ID",
`op_time` DATETIME COMMENT "operation time"
)
DUPLICATE KEY(`timestamp`, `type`)
... /* Omit partition and distribution info */
;
Unique key model
CREATE TABLE IF NOT EXISTS example_db.example_tb3
(
`user_id` LARGEINT NOT NULL COMMENT "user id",
`username` VARCHAR(50) NOT NULL COMMENT "user name",
`city` VARCHAR(20) COMMENT "the city the user lives in",
`age` SMALLINT COMMENT "age of user",
`sex` TINYINT COMMENT "sex of user",
`phone` LARGEINT COMMENT "phone number",
`address` VARCHAR(500) COMMENT "address of user",
`register_time` DATETIME COMMENT "registration time"
)
UNIQUE KEY(`user_id`, `username`)
... /* Omit partition and distribution info */
;
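The unique key model can be read as a special case of the aggregate model in which every value column behaves like REPLACE: loading a row whose key already exists overwrites the old row, which gives upsert semantics. A sketch with illustrative rows:

```sql
-- First load:  (10000, "Tom", "Beijing",  ...)
-- Second load: (10000, "Tom", "Shanghai", ...)
-- Queries see only the latest row per (user_id, username) key:
SELECT user_id, username, city FROM example_db.example_tb3 WHERE user_id = 10000;
```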
Contact us
• Official website and source code
• http://doris.apache.org
• https://github.com/apache/incubator-doris
• E-Mail
• dev@doris.apache.org
• Document
• http://doris.incubator.apache.org/documentation/en/
Thank you
More Related Content

What's hot

Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...DataWorks Summit
 
Incremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and IcebergIncremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and IcebergWalaa Eldin Moustafa
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...HostedbyConfluent
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBill Liu
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & DeltaDatabricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Databricks
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...StreamNative
 
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQLTop 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQLJim Mlodgenski
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...Dremio Corporation
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...DataWorks Summit/Hadoop Summit
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowJulien Le Dem
 

What's hot (20)

Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
 
Incremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and IcebergIncremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and Iceberg
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQLTop 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 

Similar to Apache Doris (incubating) - A simple and single tightly coupled OLAP system

When to no sql and when to know sql javaone
When to no sql and when to know sql   javaoneWhen to no sql and when to know sql   javaone
When to no sql and when to know sql javaoneSimon Elliston Ball
 
Simplifying & accelerating application development with MongoDB's intelligent...
Simplifying & accelerating application development with MongoDB's intelligent...Simplifying & accelerating application development with MongoDB's intelligent...
Simplifying & accelerating application development with MongoDB's intelligent...Maxime Beugnet
 
Database Management System
Database Management SystemDatabase Management System
Database Management SystemAbishek V S
 
Python business intelligence (PyData 2012 talk)
Python business intelligence (PyData 2012 talk)Python business intelligence (PyData 2012 talk)
Python business intelligence (PyData 2012 talk)Stefan Urbanek
 
Probabilistic Data Structures (Edmonton Data Science Meetup, March 2018)
Probabilistic Data Structures (Edmonton Data Science Meetup, March 2018)Probabilistic Data Structures (Edmonton Data Science Meetup, March 2018)
Probabilistic Data Structures (Edmonton Data Science Meetup, March 2018)Kyle Davis
 
Where are yours vertexes and what are they talking about?
Where are yours vertexes and what are they talking about?Where are yours vertexes and what are they talking about?
Where are yours vertexes and what are they talking about?Roberto Franchini
 
NoSQL Endgame DevoxxUA Conference 2020
NoSQL Endgame DevoxxUA Conference 2020NoSQL Endgame DevoxxUA Conference 2020
NoSQL Endgame DevoxxUA Conference 2020Thodoris Bais
 
Full-stack Web Development with MongoDB, Node.js and AWS
Full-stack Web Development with MongoDB, Node.js and AWSFull-stack Web Development with MongoDB, Node.js and AWS
Full-stack Web Development with MongoDB, Node.js and AWSMongoDB
 
01 nosql and multi model database
01   nosql and multi model database01   nosql and multi model database
01 nosql and multi model databaseMahdi Atawneh
 
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital.AI
 
OrientDB for real & Web App development
OrientDB for real & Web App developmentOrientDB for real & Web App development
OrientDB for real & Web App developmentLuca Garulli
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...Amazon Web Services
 
Application development with Oracle NoSQL Database 3.0
Application development with Oracle NoSQL Database 3.0Application development with Oracle NoSQL Database 3.0
Application development with Oracle NoSQL Database 3.0Anuj Sahni
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptxAndrew Lamb
 
Data stores: beyond relational databases
Data stores: beyond relational databasesData stores: beyond relational databases
Data stores: beyond relational databasesJavier García Magna
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
ADL/U-SQL Introduction (SQLBits 2016)
ADL/U-SQL Introduction (SQLBits 2016)ADL/U-SQL Introduction (SQLBits 2016)
ADL/U-SQL Introduction (SQLBits 2016)Michael Rys
 

Similar to Apache Doris (incubating) - A simple and single tightly coupled OLAP system (20)

When to no sql and when to know sql javaone
When to no sql and when to know sql   javaoneWhen to no sql and when to know sql   javaone
When to no sql and when to know sql javaone
 
Simplifying & accelerating application development with MongoDB's intelligent...
Simplifying & accelerating application development with MongoDB's intelligent...Simplifying & accelerating application development with MongoDB's intelligent...
Simplifying & accelerating application development with MongoDB's intelligent...
 
The Heterogeneous Data lake
The Heterogeneous Data lakeThe Heterogeneous Data lake
The Heterogeneous Data lake
 
Database Management System
Database Management SystemDatabase Management System
Database Management System
 
Python business intelligence (PyData 2012 talk)
Python business intelligence (PyData 2012 talk)Python business intelligence (PyData 2012 talk)
Python business intelligence (PyData 2012 talk)
 
Probabilistic Data Structures (Edmonton Data Science Meetup, March 2018)
Probabilistic Data Structures (Edmonton Data Science Meetup, March 2018)Probabilistic Data Structures (Edmonton Data Science Meetup, March 2018)
Probabilistic Data Structures (Edmonton Data Science Meetup, March 2018)
 
Where are yours vertexes and what are they talking about?
Where are yours vertexes and what are they talking about?Where are yours vertexes and what are they talking about?
Where are yours vertexes and what are they talking about?
 
NoSQL Endgame DevoxxUA Conference 2020
NoSQL Endgame DevoxxUA Conference 2020NoSQL Endgame DevoxxUA Conference 2020
NoSQL Endgame DevoxxUA Conference 2020
 
Full-stack Web Development with MongoDB, Node.js and AWS
Full-stack Web Development with MongoDB, Node.js and AWSFull-stack Web Development with MongoDB, Node.js and AWS
Full-stack Web Development with MongoDB, Node.js and AWS
 
01 nosql and multi model database
01   nosql and multi model database01   nosql and multi model database
01 nosql and multi model database
 
Introduction to azure document db
Introduction to azure document dbIntroduction to azure document db
Introduction to azure document db
 
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
 
OrientDB for real & Web App development
OrientDB for real & Web App developmentOrientDB for real & Web App development
OrientDB for real & Web App development
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
 
Application development with Oracle NoSQL Database 3.0
Application development with Oracle NoSQL Database 3.0Application development with Oracle NoSQL Database 3.0
Application development with Oracle NoSQL Database 3.0
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptx
 
Data stores: beyond relational databases
Data stores: beyond relational databasesData stores: beyond relational databases
Data stores: beyond relational databases
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
ADL/U-SQL Introduction (SQLBits 2016)
ADL/U-SQL Introduction (SQLBits 2016)ADL/U-SQL Introduction (SQLBits 2016)
ADL/U-SQL Introduction (SQLBits 2016)
 

Recently uploaded

IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are successPratikSingh115843
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etclalithasri22
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Non Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfNon Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfPratikPatil591646
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformationAnnie Melnic
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 

Recently uploaded (17)

IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are success
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Apache Doris (incubating) - A simple and single tightly coupled OLAP system
  • 1. Apache Doris (incubating) A simple and single tightly coupled OLAP system lide@apache.org
  • 2. About me  I'm Reed, an Apache committer from China.  I'm working for Baidu Inc., China's Google, as a senior software engineer and a core developer of Doris.  10+ years' experience in software development, and 5 years' experience in managing a team.  Once worked for Naver, South Korea's largest web search engine, and Qihoo 360, China's largest internet security company.
  • 3. Overview 2 ARCHITECTURE – Key techniques 3 FEATURES – Features and functions 4 STORAGE – Data storage model 1 INTRODUCTION – What is Doris
  • 4. Overview 2 ARCHITECTURE – Key techniques 3 FEATURES – Features and functions 4 STORAGE – Data storage model 1 INTRODUCTION – What is Doris
  • 5. What is Doris  An MPP-based interactive SQL data warehouse for reporting and analysis  A simple and single tightly coupled system, not depending on other systems. Timeline: project started in 2014; contributed to the Apache incubator in 2018
  • 6. What it is Based on  Doris is the technology combination of Google Mesa and Apache Impala.
  • 7. Used in Baidu Inc.  1000+ deployed machines  200+ business lines  1 PB maximum in one cluster
  • 9. Use case – JD.com -- Provided by JD.com
  • 10. Why Doris by JD  Google's Mesa theory with Baidu's engineering practice  MySQL-compatible protocol  High concurrency and high QPS support  Convenient operation and maintenance with a clear structure -- Provided by JD.com
  • 11. Overview 3 FEATURES – Features and functions 4 STORAGE – Data storage model 1 INTRODUCTION – What is Doris 2 ARCHITECTURE – Key techniques
  • 13. Frontend and Backend  Doris' implementation consists of two daemons: frontend (FE) and backend (BE).  The FE consists of the query coordinator and the catalog manager, and sends query fragments to the BEs.  The BE stores the data, executes the query fragments, and sends results back to the FE.
  • 14. Key Technology  Doris = Google Mesa (aggregate data model, versioned data management, prefix index, online schema change, ...) + Apache Impala (query engine, ...) + Apache ORCFile (file format, indexes, compression encoding, ...)
  • 15. Overview 2 ARCHITECTURE – Key techniques 4 STORAGE – Data storage model 1 INTRODUCTION – What is Doris 3 FEATURES – Features and functions
  • 19. Two-Level Partitioning  1. Range partitioning: the user can specify a column (usually the time-series column) and ranges of values for the data partitions.  2. Hash partitioning: within a partition, the user can also specify one or more columns and a number of buckets to do the hash partitioning.
  • 20. Two-Level Partitioning
        CREATE TABLE IF NOT EXISTS example_db.example_tbl (
            `user_id` LARGEINT NOT NULL COMMENT "user id",
            `date` DATE NOT NULL COMMENT "date and time",
            ... /* Omit other columns */
        )
        AGGREGATE KEY(`user_id`, `date`, `timestamp`, `city`, `age`, `sex`)
        PARTITION BY RANGE(`date`) (
            PARTITION `p201801` VALUES LESS THAN ("2018-02-01"),
            PARTITION `p201802` VALUES LESS THAN ("2018-03-01"),
            PARTITION `p201803` VALUES LESS THAN ("2018-04-01")
        )
        DISTRIBUTED BY HASH(`user_id`) BUCKETS 16
        ... /* Omitted other information */ ;
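The DDL above can be mirrored by a small sketch of how one row is routed: first to the range partition whose upper bound exceeds its `date`, then to a hash bucket of `user_id`. This is only an illustration, not Doris's actual hash function; the partition names and bucket count come from the example table.

```python
from datetime import date

# Range partitions mirroring the DDL above: each partition holds rows
# whose `date` is strictly less than its upper bound.
PARTITIONS = [
    ("p201801", date(2018, 2, 1)),
    ("p201802", date(2018, 3, 1)),
    ("p201803", date(2018, 4, 1)),
]
BUCKETS = 16  # DISTRIBUTED BY HASH(`user_id`) BUCKETS 16

def route(user_id: int, d: date) -> tuple[str, int]:
    """Return (range partition name, hash bucket) for one row."""
    for name, upper in PARTITIONS:
        if d < upper:
            # Doris uses its own hash function; plain modulo is enough
            # to illustrate bucketing here.
            return name, user_id % BUCKETS
    raise ValueError(f"no partition covers {d}")

print(route(10001, date(2018, 1, 15)))  # ('p201801', 1)
```

Note that the two levels are independent: the range partition is chosen by `date` alone, and the bucket by `user_id` alone, which is why each level can be tuned separately.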
  • 23. Materialized view
        ALTER TABLE example_tbl ADD ROLLUP rollup_cost(user_id, cost);
        SHOW ALTER TABLE ROLLUP;
        EXPLAIN SELECT user_id, sum(cost) FROM example_tbl GROUP BY user_id;
        +-----------------------------------------+
        | Explain String                          |
        +-----------------------------------------+
        | PLAN FRAGMENT 0                         |
        ...
        | 1:AGGREGATE (update serialize)          |
        ...
        | 0:OlapScanNode                          |
        |   TABLE: example_tbl                    |
        |   rollup: rollup_cost                   |
        ...
  • 25. Overview 2 ARCHITECTURE – Key techniques 3 FEATURES – Features and functions 1 INTRODUCTION – What is Doris 4 STORAGE – Data storage model
  • 26. Data Model  In Mesa, a table schema specifies a key space K for the table and a corresponding value space V.  The table schema also specifies the aggregation function F : V × V → V.  Doris combines Mesa's data model with ORCFile / Parquet storage technology.
  • 28. Types of Model  01 AGGREGATE KEY (SUM, REPLACE, MAX and MIN)  02 DUPLICATE KEY  03 UNIQUE KEY
  • 29. Aggregate key model
        CREATE TABLE IF NOT EXISTS example_db.example_tbl (
            `user_id` LARGEINT NOT NULL COMMENT "user ID",
            `date` DATE NOT NULL COMMENT "date and time",
            `city` VARCHAR(20) COMMENT "the city user lives in",
            `age` SMALLINT COMMENT "age of user",
            `sex` TINYINT COMMENT "sex of user",
            `last_visit_date` DATETIME REPLACE COMMENT "the date last visited",
            `cost` BIGINT SUM DEFAULT "0" COMMENT "total cost",
            `max_dwell_time` INT MAX DEFAULT "0" COMMENT "maximum dwell time",
            `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "minimum dwell time"
        )
        AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
        ... /* Omitted the information of Partition and Distribution */ ;
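The aggregation declared in this DDL can be simulated with a hypothetical sketch (not Doris's implementation): rows sharing the same key tuple are merged by applying each value column's function.

```python
# Value columns and their aggregation functions, mirroring the DDL above.
AGGS = {
    "last_visit_date": lambda old, new: new,  # REPLACE keeps the latest
    "cost": lambda old, new: old + new,       # SUM
    "max_dwell_time": max,                    # MAX
    "min_dwell_time": min,                    # MIN
}
KEY = ("user_id", "date", "city", "age", "sex")

def aggregate(rows):
    """Merge rows that share the same key tuple, column by column."""
    merged = {}
    for row in rows:
        k = tuple(row[c] for c in KEY)
        if k not in merged:
            merged[k] = dict(row)
        else:
            for col, f in AGGS.items():
                merged[k][col] = f(merged[k][col], row[col])
    return list(merged.values())
```

For example, two visits by the same user on the same day collapse into one row whose `cost` is the sum and whose `last_visit_date` is the later value.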
  • 30. Duplicate key model
        CREATE TABLE IF NOT EXISTS example_db.example_tb2 (
            `timestamp` DATETIME NOT NULL COMMENT "log time",
            `type` INT NOT NULL COMMENT "log type",
            `error_code` INT COMMENT "error code",
            `error_msg` VARCHAR(1024) COMMENT "error message",
            `op_id` BIGINT COMMENT "operator ID",
            `op_time` DATETIME COMMENT "operation time"
        )
        DUPLICATE KEY(`timestamp`, `type`)
        ... /* Omitted the information of Partition and Distribution */ ;
  • 31. Unique key model
        CREATE TABLE IF NOT EXISTS example_db.example_tb3 (
            `user_id` LARGEINT NOT NULL COMMENT "user id",
            `username` VARCHAR(50) NOT NULL COMMENT "user name",
            `city` VARCHAR(20) COMMENT "the city user lives in",
            `age` SMALLINT COMMENT "age of user",
            `sex` TINYINT COMMENT "sex of user",
            `phone` LARGEINT COMMENT "phone number",
            `address` VARCHAR(500) COMMENT "address of user",
            `register_time` DATETIME COMMENT "registered time"
        )
        UNIQUE KEY(`user_id`, `username`)
        ... /* Omitted the information of Partition and Distribution */ ;
  • 32. Contact us • Official website and source code • http://doris.apache.org • https://github.com/apache/incubator-doris • E-Mail • dev@doris.apache.org • Document • http://doris.incubator.apache.org/documentation/en/

Editor's Notes

  1. Hello, I'm very happy to have this opportunity to share Apache Doris with you. Thank you so much for coming. Today my topic is "Apache Doris (incubating) - A simple and single tightly coupled OLAP system".
  2. Well, firstly, please let me make a self-introduction. I'm Reed, an Apache committer from China. I'm working for Baidu Inc., China's Google, as a senior software engineer and a core developer of Doris. I have more than ten years' experience in software development, and five years' experience in managing a team. I once worked for Naver, South Korea's largest web search engine, and Qihoo 360, China's largest internet security company.
  3. There are four parts of this topic, firstly, I will make an introduction for Doris. Secondly, I will explain the architecture and key techniques. thirdly, I will list the features and functions of Doris. Lastly, I will let you know what is the data storage model of Doris.
  4. 1. The introduction, what is Doris.
  5. Doris is an MPP-based interactive SQL data warehouse for reporting and analysis. It was originally started by Baidu Inc. in two thousand fourteen, and it was contributed to the Apache foundation by Baidu in July two thousand eighteen; it is currently in its incubation phase. Unlike other popular SQL-on-Hadoop systems, Doris is designed to be a simple and single tightly coupled system, not depending on other systems. Doris not only provides high-concurrency, low-latency point query performance, but also provides high-throughput['θrʊ'pʊt] queries for ad-hoc analysis. Doris not only provides batch data loading, but also provides real-time stream data loading. Doris also provides high availability, reliability, fault tolerance, and scalability. The simplicity (of developing, deploying and using) and meeting many data serving requirements in a single system are the main features of Doris.
  6. Generally speaking, Doris is the technology combination of Google Mesa and Apache Impala. Mesa is a highly scalable analytic data storage system that stores critical measurement data related to Google's Internet advertising business. Mesa is designed to satisfy a complex and challenging set of user and system requirements, including near real-time data ingestion and queryability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes. Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Mesa can satisfy many of our storage requirements; however, Mesa itself does not provide a SQL query engine. Impala is a very good MPP SQL query engine, but it lacked a mature distributed storage engine at that time. So in the end we chose the combination of these two technologies.
  7. Doris has been used in Baidu on more than 1000 machines, across 200 business lines including Baidu Fengchao and Baidu Statistics. The maximum single business data volume exceeds one PB. Meanwhile, it has also been highly recognized in Baidu's public cloud and toB business.
  8. Since being open-sourced, more than ten companies including Sina Weibo, Sohu, Meituan, Lianjia, Guazi, iPinYou, Jingdong, Sichuan Airlines, Goldwind, Shanghai Electric, Xiaomi, etc. have used Doris in their online businesses.
  9. Take JD.com for example. JD.com is one of the biggest e-commerce platforms in China. In two thousand eighteen, JD Group's market transaction volume was close to 1.7 trillion yuan. JD created the famous six eighteen shopping festival. In the store celebration month, JD launches a series of large-scale promotional activities. For JD.com, advertising is one of its important businesses, especially during six-eighteen each year. In this year's six-eighteen festival, Doris showed strong, stable and efficient performance, according to the advertising platform team of JD.com, who applied Doris to their system.
  10. Why did JD.com decide to use Doris in their online business? There are four reasons, according to JD.com. 1. Google Mesa theory with Baidu's engineering practice. And Doris has entered the Apache Foundation incubation, so its future is promising and subsequent development and maintenance are guaranteed. 2. Perfect feature support and the standard MySQL protocol. Many existing peripheral MySQL function modules can be used, and the overall use is very convenient. And the migration cost for users of an existing MySQL database is very low. 3. High concurrency and high QPS support. Because the core code is all implemented in C++, performance is better than implementations in other languages. On the other hand, good design also guarantees that Apache Doris performs better than other open-source products when dealing with high concurrency. 4. Convenient operation and maintenance with a clear structure. There are only the FE and BE modules, with few external dependencies, so you can concentrate on maintaining the Doris system. The work of other ETLs can be handled by the business department, freeing up manpower.
  11. 2. The ARCHITECTURE and Key techniques of Doris.
  12. Doris' implementation consists of two daemons: frontend (FE) and backend (BE). It is so simple that it just has two processes: FE and BE. Users can use any client compatible with the MySQL protocol to connect to the FE daemon. Of course, users can use the native C API, JDBC, ODBC, PHP, Python, Perl, Ruby, etc. to connect to the FE daemon. A typical Doris cluster generally consists of several frontend daemons and dozens to hundreds of backend daemons.
  13. The frontend daemon consists of the query coordinator and the catalog manager. The query coordinator is responsible for receiving users' SQL queries, compiling queries and managing query execution. The catalog manager is responsible for managing metadata such as databases, tables, partitions, replicas, etc. Several frontend daemons can be deployed to guarantee fault tolerance and load balancing. The backend daemon stores the data and executes the query fragments. Many backend daemons can also be deployed to provide scalability and fault tolerance. Clients can use MySQL-related tools to connect to any frontend daemon to submit a SQL query. The frontend receives the query and compiles it into query plans executable by the backends. Then the frontend sends the query plan fragments to the backends. The backends build a query execution DAG. Data is fetched and pipelined into the DAG. The final result response is sent to the client via the frontend.
  14. As I said before, generally speaking, Doris is the combination of technologies from Google Mesa and Apache Impala, plus Apache ORCFile. Actually, we developed a distributed storage engine based on Google Mesa and Apache ORCFile. To be more exact, the techniques learned from Mesa include the aggregate data model, versioned data management, prefix index, online schema change, etc. For the data format, indexes, and compression encoding, we referred to Apache ORCFile. Unlike Mesa, the storage engine of Doris does not rely on any distributed file system. We also deeply integrated this storage engine with the Impala query engine. Query compiling, query execution coordination and the catalog management of the storage engine are integrated into the frontend daemon; query execution and data storage are integrated into the backend daemon. With this integration, we implemented a single, full-featured, high-performance MPP database, while maintaining simplicity. As I said before, the simplicity (of developing, deploying and using) and meeting many data serving requirements in a single system are the main features of Doris.
  15. 3. Features and functions
  16. As an OLAP system, Doris can ingest data from local files, real-time data from Kafka, and HDFS files, and when this data enters Doris, it can be pre-aggregated. A MySQL-compatible networking protocol is implemented in Doris' frontend. The main reasons for using a MySQL-compatible protocol are as follows: firstly, a SQL interface is preferred by engineers; secondly, compatibility with the MySQL protocol makes integrating with existing BI software, such as Tableau, easier; lastly, rich MySQL client libraries and tools not only reduce our development costs, but also reduce the user's cost of adoption.
  17. The frontends of Doris can be configured into three kinds of roles: leader, follower and observer. Through a voting protocol, the follower frontends first elect a leader frontend. All metadata write requests are forwarded to the leader, and the leader writes the operation into the replicated log file. Once the new log entry has been replicated to at least a quorum of followers successfully, the leader commits the operation into memory and responds to the write request. Followers continuously replay the replicated logs to apply them to their in-memory metadata. If the leader crashes, a new leader will be elected from the remaining followers. Leader and followers mainly solve the problem of write availability and partly solve the problem of read scalability. The leader replicates the log stream to observers asynchronously. Observers are not involved in leader election. The replicated state machine is implemented based on the BerkeleyDB Java Edition (in brief, BDB-JE). BDB-JE achieves high availability by implementing a Paxos-like consensus algorithm. We use BDB-JE to implement Doris' log replication and leader election. The in-memory catalog storage has three functional modules: real-time in-memory data structures, memory checkpoints on local disk, and an operation replay log. When modifying the catalog, the mutation operation is written into the log file first. Then, the mutation operation is applied to the in-memory data structures. Periodically, a thread does a checkpoint that dumps the in-memory data structure image to local disk. The checkpoint mechanism enables fast startup of the frontend and reduces disk storage occupancy. Actually, the in-memory catalog also simplifies the implementation of multiple frontends.
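The quorum rule described above can be sketched as a one-line predicate. This is illustrative only; observers are excluded from the count since they do not vote.

```python
def can_commit(acks: int, electable: int) -> bool:
    """acks: frontends (leader included) that hold the new log entry;
    electable: leader + followers (observers do not vote)."""
    return acks > electable // 2

# 3-frontend group: leader plus one follower ack form a majority.
assert can_commit(2, 3)
# The leader alone cannot commit a metadata write.
assert not can_commit(1, 3)
```

The same predicate explains why a group of 2f+1 electable frontends tolerates f failures: f+1 surviving members still form a majority.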
  18. In the FE, a query from a client is translated into a distributed query plan, that is, a distribution of query fragments. These query fragments are sent to each backend node for execution. The distribution of query fragment execution takes minimizing data movement and maximizing scan locality as the main goals. Because Doris is designed to provide interactive analysis, the average execution time of queries is short. Considering this, we adopt query re-execution to provide fault tolerance for query execution.
  19. Like most distributed database systems, data in Doris is horizontally[,hɑrə'zɑntli] partitioned. However, a single-level partitioning rule (hash partitioning or range partitioning) may not be a good solution for all scenarios[sə'nærɪos]. For example, consider a user-based fact table that stores rows of the form (date, userid, metric). Choosing only hash partitioning by column userid may lead to uneven[ʌn'ivən] distribution of data when one user's data is very large. Choosing range partitioning according to column date will also lead to uneven distribution of data, due to the likely data explosion in a certain period of time. Therefore we support a two-level partitioning rule. The first level is range partitioning: the user can specify a column (usually the time-series column) and ranges of values for the data partitions. In one partition, the user can also specify one or more columns and a number of buckets to do hash partitioning. Users can combine different partitioning rules to better divide the data.
  20. In one partition, the user can also specify one or more columns and a number of buckets to do hash partitioning. Users can combine different partitioning rules to better divide the data. Three benefits are gained by using the two-level partitioning mechanism. Firstly, old and new data can be separated and stored on different storage mediums. Secondly, the storage engine of the backend can reduce the IO and CPU consumption of unnecessary data merging, because the data in some partitions is no longer updated. Lastly, every partition's bucket count can be different and adjusted according to changes in data size.
  21. To achieve high update throughput, Doris only applies updates in batches, at the smallest interval of one minute. Each update batch is assigned an increasing version number and generates a delta data file, and the version is committed when the updates of a quorum of replicas are complete. You can query all committed data using the committed version, and uncommitted versions are not used in queries. All update versions are strictly in increasing order. If an update contains more than one table, the versions of these tables are committed atomically [ə'tɑmɪkly]. The MVCC mechanism allows Doris to guarantee multi-table atomic updates and query consistency. In addition, Doris uses compaction policies to merge delta files to reduce the delta count, and also to reduce the cost of delta merging during queries for higher performance.
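A minimal sketch of the versioned delta management described here, assuming a single tablet and ignoring replicas and compaction; class and method names are hypothetical.

```python
class Tablet:
    """Toy model: batches become versioned deltas; queries see only
    deltas up to the highest committed version."""

    def __init__(self):
        self.deltas = {}    # version -> list of rows in that delta
        self.committed = 0  # highest committed version

    def write_batch(self, rows):
        # Each batch gets the next strictly increasing version number.
        v = max(self.deltas, default=0) + 1
        self.deltas[v] = rows
        return v

    def commit(self, version):
        # In the real system this happens once a quorum of replicas
        # has completed the update.
        self.committed = max(self.committed, version)

    def query(self):
        # Merge every delta up to the committed version; uncommitted
        # batches stay invisible to readers.
        out = []
        for v in sorted(self.deltas):
            if v <= self.committed:
                out.extend(self.deltas[v])
        return out
```

A batch that has been written but not committed is simply skipped at read time, which is what gives readers a consistent snapshot.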
  22. Mesa also supports creating materialized rollups, which contain a column subset of the schema, to gain a better aggregation effect. In Doris, a rollup is a materialized view that contains a column subset of the schema. A table may contain multiple rollups with columns in different orders. According to the sort key index and the column coverage of the rollups, Doris can select the best rollup for each query. Because most rollups only contain a few columns, the size of aggregated data is typically much smaller and query performance can be greatly improved. All the rollups in the same table are updated atomically. Because rollups are materialized, users should make a trade-off between query latency and storage space when using them.
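The rollup selection idea can be sketched as picking the smallest candidate whose columns cover the query. The real planner also weighs sort-key prefix matching; the rollup names and column sets below are invented for illustration.

```python
def pick_rollup(query_cols, rollups):
    """rollups: {name: set of columns}. Return the name of the
    narrowest rollup covering every referenced column, or None."""
    candidates = [(len(cols), name)
                  for name, cols in rollups.items()
                  if set(query_cols) <= cols]
    return min(candidates)[1] if candidates else None

# Hypothetical table: base data plus one two-column rollup.
rollups = {
    "base": {"user_id", "date", "city", "cost", "max_dwell_time"},
    "rollup_cost": {"user_id", "cost"},
}
# SELECT user_id, sum(cost) ... can be answered by the small rollup.
assert pick_rollup({"user_id", "cost"}, rollups) == "rollup_cost"
# A query touching `city` must fall back to the base data.
assert pick_rollup({"user_id", "city"}, rollups) == "base"
```

This matches the EXPLAIN output shown on the materialized-view slide, where the scan node reports `rollup: rollup_cost` for the covered aggregation query.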
  23. Actually, we can create a materialized view, namely a rollup, with the ALTER TABLE statement. After it is created, we can view it with the SHOW ALTER TABLE statement. If we want to know whether the rollup takes effect, we can use the EXPLAIN statement to view the query plan, and from the query plan we can see the rollup table name.
  24. The storage engine of Doris uses columnar storage technology. Compared with row-oriented databases, column-oriented organization is more efficient when an aggregate needs to be computed over many rows but only for a small subset of all columns of data, because reading that smaller subset of data can be faster than reading whole rows. Columnar storage is also space-friendly, due to the high compression ratio of each column. Further, columns support block-level index technology such as min/max indexes and Bloom filter indexes. The query executor can filter out many blocks that do not satisfy the predicate, to further improve query performance.
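The min/max block filtering mentioned here can be sketched as keeping only the blocks whose value range overlaps the predicate's range. The block layout and values below are invented for illustration.

```python
def prune_blocks(blocks, lo, hi):
    """blocks: list of (block_min, block_max, rows). Keep a block only
    if its [min, max] range can contain a value in [lo, hi]."""
    return [rows for bmin, bmax, rows in blocks
            if not (bmax < lo or bmin > hi)]

# Three hypothetical column blocks with their min/max statistics.
blocks = [
    (1, 10, ["block1-rows"]),
    (11, 20, ["block2-rows"]),
    (21, 30, ["block3-rows"]),
]
# Predicate `col BETWEEN 12 AND 18` only needs to read the middle block.
assert prune_blocks(blocks, 12, 18) == [["block2-rows"]]
```

A Bloom filter index works the same way for equality predicates: it lets the executor skip a block when the filter reports the value definitely absent.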
  25. 4. STORAGE – Data storage model
  26. Doris combines Google Mesa's data model and ORCFile / Parquet storage technology. Data in Mesa inherently forms multi-dimensional fact tables. The facts in a table typically consist of two types of attributes: dimensional attributes (which we call keys) and measure attributes (which we call values). The table schema also specifies the aggregation function F : V × V → V, which is used to aggregate the values corresponding to the same key.
  27. To achieve high update throughput, Mesa loads data in batches. Each batch of data is converted to a delta file. Mesa uses an MVCC approach to manage these delta files and to enforce update atomicity.
  28. Like traditional databases, Doris stores structured data represented as tables. Each table has a well-defined schema consisting of a finite number of columns. Users can create three types of tables to meet different needs in interactive query scenarios: aggregate key, duplicate key and unique key. Data in all three types of tables is sorted by the KEY.
  29. Type one, the aggregate key model. We can define this model by specifying the AGGREGATE KEY keywords, the red words here. According to Mesa's data model, all columns of a table can be divided into two parts: key and value. The columns listed in the parentheses[pə‘rɛnθəsɪz] are the key, and the rest of the columns are values, such as last visit date, cost, max dwell time and min dwell time. We can define aggregation functions for the values, such as SUM, MIN, MAX and REPLACE, and each function aggregates the corresponding value by key.
  30. Next, type two, the duplicate key model. We can define this model by specifying the DUPLICATE KEY keywords, the red words here. Like the aggregate key model, the columns listed in the parentheses[pə‘rɛnθəsɪz] are the key, and the rest of the columns are values, such as error code, error message, op id and op time. In this type, Doris keeps every row, even rows with identical keys; the key columns are only used for sorting.
  31. Next, type three, the unique key model. We can define this model by specifying the UNIQUE KEY keywords, the red words here. Like the aggregate key model, the columns listed in the parentheses[pə‘rɛnθəsɪz] are the key, and the rest of the columns are values, such as city, age, etc. In this type, Doris replaces rows that have the same key; actually, this model is completely equivalent to an aggregate key model that defines only the REPLACE function.
  32. At last, let me show you how to contact us: the official website, the source code on GitHub, the email address and the documentation.
  33. Thank you!