Apache IoTDB: a Time Series Database for Industrial IoT
Xiangdong Huang (1) and Julian Feinauer (2) (on behalf of the IoTDB community)
1 Tsinghua University, Beijing, China
2 Pragmatic Minds, Stuttgart, Germany
Berlin, Germany, 2019.10.23
Outline
• Who We Are
• Why IoTDB Was Born
• Overview of Apache IoTDB (incubating): Main Features
• Working with Current Ecosystems
• Performance Evaluation
• Use Cases
• Future Works
IoTDB
• IoTDB = IoT + DB, a database for managing (Industrial) IoT data
• IoTDB is an IoT DB. (Search Google for the keyword "IoTDB", not "IoT DB".)
• "You can find many things related to IoTDB in Germany":
  • IIoT: turbines, excavators, trucks, modern cars
  • DB: Deutsche Bahn (the real meaning)
Who We Are (The community)
• We come from the Apache IoTDB (incubating) Community
• A young community: entered the Apache Incubator on 2018-11-18.
• Mentors: Christofer Dutz, Justin Mclean, Kevin A. McGrail (Champion), Willem Jiang
• Devoted to building the world's best time series database for the IoT domain
Who We Are (Individual)
• Xiangdong Huang (sainthxd@gmail.com)
• PhD, PostDoc, and (now) Assistant Researcher at Tsinghua University, Beijing, China
• Has used Apache Cassandra (for managing time series data) since 2012
• Has developed IoTDB since 2017
• One of the initial committers of Apache IoTDB (incubating)
Who We Are (Individual)
• Julian Feinauer (j.feinauer@pragmaticminds.de)
• Founder of the startup pragmatic minds in Germany
• The first committer who was not an initial committer
• Release Manager of the first release of IoTDB
• Committer on Apache PLC4X, Edgent, etc.
The 4th Industrial Revolution
[Figure: national strategies in Germany, China, and the USA. Labels: Industry 4.0; Industry Internet; "Data analytics and utility is the key"; "Advanced data analytics"; "Data + Model"]
Data is becoming the most important aspect of this era
Machine Data (Time Series Data): the Largest Volume in Industrial Data
[Figure: composition of Industrial Big Data]
• Machine Data
• Other Domain Data: Environment, Meteorology, Geography
• Manufacturing Enterprise Data: Video, Model, Doc, Drawings
How to Manage Time Series Data
[Figure: pipeline diagram. Labels: Network, MQ, PI System (PI Server), insertion, query, save data locally, RDBMS]
How to Manage Time Series Data
[Figure: pipeline diagram. Labels: Network, MQ, Database, insertion, query, save data locally, Network, analysis]
The Problems
[Figure: Network, MQ, Database (insertion)]
● Millions of data points per second?
● Tens of millions of data points per second?
● Billions of data points per second?
Big Data
• 50 Hz, 500 points/machine, 20K wind turbines: up to 500 million points/sec in total
• Data is produced 24/7, with high frequency and large volume
More Features
• Sometimes out of order
• Sparse table (different machines have different sensors)
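The 500 million points/sec figure follows directly from the three numbers on the slide. A minimal sketch of the arithmetic (the constant and function names are illustrative, not part of IoTDB):

```python
# Back-of-the-envelope ingestion rate for the wind-turbine example.
SAMPLE_RATE_HZ = 50        # samples per second for each measurement point
POINTS_PER_MACHINE = 500   # measurement points per wind turbine
MACHINES = 20_000          # number of wind turbines


def points_per_second(rate_hz: int, points: int, machines: int) -> int:
    """Total data points produced per second by the whole fleet."""
    return rate_hz * points * machines


total = points_per_second(SAMPLE_RATE_HZ, POINTS_PER_MACHINE, MACHINES)
print(f"{total:,} points/sec")           # 500,000,000 points/sec
print(f"{total * 86_400:,} points/day")  # running 24/7
```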
The Problems
[Figure: Network, MQ, Database (query, analysis)]
Features of Data Query
• The time dimension is always accessed
• Aggregation is a first-class citizen
  ■ Sometimes we do not need the raw data; knowing the count/min/max/avg value is enough.
  ■ For visualization, the screen resolution is limited, e.g., 1024*768, so there is no point in fetching more than 1024 points (use aggregation for downsampling).
• Time-series-specific query and analysis
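The downsampling argument can be sketched as follows: group the raw points into as many time buckets as the screen has horizontal pixels and keep one aggregate per bucket. This is an illustrative sketch only; `downsample` is a hypothetical helper, not an IoTDB function, and integer timestamps are assumed.

```python
from statistics import mean


def downsample(points, start, end, buckets, agg=mean):
    """Aggregate (timestamp, value) pairs in [start, end) into equal time buckets.

    Returns one (bucket_start, aggregate) pair per non-empty bucket.
    Timestamps, start and end are assumed to be integers.
    """
    span = end - start
    grouped = {}
    for t, v in points:
        if start <= t < end:
            idx = (t - start) * buckets // span  # exact integer bucket index
            grouped.setdefault(idx, []).append(v)
    return [(start + i * span // buckets, agg(vs)) for i, vs in sorted(grouped.items())]


# 10,000 raw points reduced to at most 1024 points for a 1024-pixel-wide chart.
raw = [(t, float(t % 100)) for t in range(10_000)]
screen = downsample(raw, 0, 10_000, 1024)
print(len(raw), "->", len(screen))
```

Passing `agg=max`, `agg=min`, or `agg=len` gives the max, min, or count per bucket instead of the average.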
The Problems
[Figure: Network, MQ, Database (query, analysis)]
● Get a mass of data QUICKLY (ETL)
● Then convert it into an analysis-friendly file format
● Time consuming
What we want
Challenges
• Large Volume
• High Throughput
• Low Cost (historical data)
• Low Latency for Query
• Fast Aggregation
• Query-Analysis hybrid workloads
Different Solutions for Managing Time Series
RDB
• Limited number of columns (1600 columns in a table)
• Limited number of rows (<=10M rows is better)
• Manual sharding
KVDB
• Supports big data
• Limited queries
  • Lacks time filtering
  • Lacks value filtering
  • Lacks alignment of multiple time series
TSDB
• Based on PG
  • Auto sharding
  • Query optimization
  • Performance degrades sharply after writing data for a long time
• LSM based
  • Efficient file structure
  • More query functions
  • Not optimized for some application scenarios
• HBase/Cassandra based
  • Partition by TS-UID and time range
  • Storage inefficiency
  • Limited queries
Time Series DB for Industrial Internet
(original name: "Tsinghua Shuwei" [清华数为] Industrial Internet Time Series Database)
now called: Apache IoTDB (incubating)
Each node can manage:
★ Tens of millions of time series
★ Trillions of data points
★ Tens of TB of data
Supports Hadoop, Spark, Matlab, Grafana, etc.
Apache IoTDB Features
Persist data efficiently
• Millions of points ingested per sec per node
• Tens of millions of time series
Query data with low latency
• Efficiently filter data: millions of points per sec
• Aggregation: tens of ms latency on billions of points
Exclusive operations of time series
• Segmentation
• Representation
• Subsequence matching
• Time-frequency transform
• Visualization
Integration with existing ecosystem
• Kafka
• MatLab
• Spark
• MapReduce
• Grafana
• Connecting Edge to the Cloud
• Powerful query engine
• User-friendly analytics
Cover the life cycle of data: Collection, Storage, Process, Learning, Application
Architecture
[Figure: IoTDB architecture diagram. Components:]
• IoTDB: outlier detection, machine learning, UDF
• Hadoop/Spark: big data framework cluster
• TsFile: time series optimized file format
• TsFile-CLI: interactive client command line
• IoTDB-JDBC
• Grafana-Adaptor: web dashboard to visualize time series data
• IoTDB-CLI: interactive client command line
• I/E Tool: batch load and export data
• Connected systems: other databases, applications, message queue, DevOps, device
• IoTDB, IoTDBSync
Concepts in IoTDB (The Schema)
Device (i.e., Data source)
• A machine instance
Measurement (e.g., sensor)
• A device can have many measurements
Time Series
• Device + Measurement
• Is represented as a path that begins with root, like "root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain"
Storage Group (SG)
• A storage group can have many devices
• Storage groups have independent resources (threads and files) to increase parallelism and reduce lock contention.
Cadillac XT5
The schema mapping
root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain
root.Cadillac_XT5.USA.CA.7BTC409.speed
root.Cadillac_XT5.USA.NV.6BAC321.speed
Table Name: Cadillac_XT5 (called a Measurement in InfluxDB)
country   state   device name   timestamp   fuelRemain   speed
USA       CA      7BTC409       t1          5.0          120
USA       CA      7BTC409       t2          4.9          109
USA       CA      6BAC321       t1          NULL         50
USA       CA      6BAC321       t3          NULL         65
The columns correspond to Tags and Fields in InfluxDB, KairosDB, OpenTSDB…
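The mapping between paths and table rows can be sketched in a few lines: the last path segment is the measurement, everything before it is the device, and measurements a device does not have show up as NULL. This is an illustrative sketch; `split_path` and `to_rows` are hypothetical helpers, and the sample values are the ones on the slide, keyed by the three paths listed there.

```python
from collections import defaultdict


def split_path(path):
    """Split a time series path into (device path, measurement name)."""
    parts = path.split(".")
    if len(parts) < 3 or parts[0] != "root":
        raise ValueError(f"not a valid time series path: {path}")
    return ".".join(parts[:-1]), parts[-1]


# (time series path, timestamp, value) triples taken from the slide.
POINTS = [
    ("root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain", "t1", 5.0),
    ("root.Cadillac_XT5.USA.CA.7BTC409.speed", "t1", 120),
    ("root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain", "t2", 4.9),
    ("root.Cadillac_XT5.USA.CA.7BTC409.speed", "t2", 109),
    ("root.Cadillac_XT5.USA.NV.6BAC321.speed", "t1", 50),
    ("root.Cadillac_XT5.USA.NV.6BAC321.speed", "t3", 65),
]


def to_rows(points):
    """Pivot points into {(device, timestamp): {measurement: value or None}}."""
    measurements = sorted({split_path(path)[1] for path, _, _ in points})
    rows = defaultdict(lambda: dict.fromkeys(measurements))  # None plays the role of NULL
    for path, ts, value in points:
        device, measurement = split_path(path)
        rows[(device, ts)][measurement] = value
    return dict(rows)


for (device, ts), columns in to_rows(POINTS).items():
    print(device, ts, columns)
```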
SQL in IoTDB
Set Storage Group
SET STORAGE GROUP TO root.laptop.d1.s1;
Create Timeseries
CREATE TIMESERIES root.laptop.d1.s1 WITH DATATYPE=INT32, ENCODING=RLE;
Insert Data
INSERT INTO (d1.s1,d1.s2,time) VALUES (1000,2000,14735235234);
Delete Data
DELETE FROM d1.s1 WHERE time < 1000;
Update Data
UPDATE d1.s1 SET VALUE = 2000 WHERE time < 2000 and time > 1000;
Query Data (Filter, Aggregation, Group by time interval)
SELECT d1.s1,d2.* FROM BJ.WF1 WHERE d1.s1 < 2000 and d2.s2 > 1000 and freq(d2.s3) > 0.5;
SELECT count(status), max_value(temperature) FROM root.ln.wf01.wt01;
SELECT count(status) FROM root.ln.wf01.wt01 GROUP BY (1h, [2017-11-03T00:00:00, 2017-11-03T23:00:00]);
Supported data type
• Boolean
• Int
• Long
• Float
• Double
• String
• GPS (TODO) -> for trajectory data management
• Array (TODO) -> for unstructured data management
TsFile: Zip File Born for Time Series Data
Columnar Store
- Reduce Disk I/O
- Improve Compression
Compression & Encoding
- Improve Compression Greatly
- 15% Better than InfluxDB in Real Applications
Time-domain Statistics Info Natively
- Support Fast Query in
  - Time Domain
  - Value Domain
  - Freq Domain (TODO)
detailed specification:
http://iotdb.apache.org/#/Documents/0.8.0/chap7/sec3
https://cwiki.apache.org/confluence/display/IOTDB/TsFile+Format
TsFile: comparison with Parquet
You say, "tomato"...
Parquet        TsFile        Target in TsFile
Row Group      Chunk Group   The data that belongs to a device instance
Column Chunk   Chunk         The data that belongs to the device's measurement
Page           Page          A part of the data that belongs to a Chunk
The differences
❏ Each Page actually has two columns: the time column and the value column
❏ No repetition or definition fields needed
❏ More summary info for a Page/Chunk
  ❏ min/max timestamp
  ❏ min/max value
  ❏ count
❏ FileMetadata: devices info at Level 1 and Level 2
[Figure: TsFile page layout. Page Header (statistics, the difference in TsFile) and Page Data (Timestamps, Values)]
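Why the per-page summary info matters can be shown with a small sketch: a count over a time range can skip pages whose min/max timestamps fall outside the range and answer from the header alone when a page is fully covered. This is only an illustration under assumed names (`Page`, `count_in_range`); it is not the TsFile on-disk layout or API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Page:
    """A page holding a time column and a value column plus header statistics."""
    timestamps: List[int]
    values: List[float]
    min_time: int = field(init=False)
    max_time: int = field(init=False)
    min_value: float = field(init=False)
    max_value: float = field(init=False)
    count: int = field(init=False)

    def __post_init__(self) -> None:
        # Summary info that would be stored in the page header.
        self.min_time, self.max_time = min(self.timestamps), max(self.timestamps)
        self.min_value, self.max_value = min(self.values), max(self.values)
        self.count = len(self.timestamps)


def count_in_range(pages: List[Page], t_lo: int, t_hi: int) -> Tuple[int, int]:
    """Count points with t_lo <= t <= t_hi; also return how many pages were scanned."""
    total, scanned = 0, 0
    for page in pages:
        if page.max_time < t_lo or page.min_time > t_hi:
            continue  # disjoint: skip the page using header statistics only
        if t_lo <= page.min_time and page.max_time <= t_hi:
            total += page.count  # fully covered: answer from the header
        else:
            scanned += 1  # partial overlap: page data must be read
            total += sum(1 for t in page.timestamps if t_lo <= t <= t_hi)
    return total, scanned


pages = [Page([1, 2, 3], [0.1, 0.2, 0.3]),
         Page([4, 5, 6], [0.4, 0.5, 0.6]),
         Page([7, 8, 9], [0.7, 0.8, 0.9])]
print(count_in_range(pages, 2, 6))  # only the first page needs to be scanned
```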
TsFile: comparison with Parquet
[Figure: Apache Parquet is a general file format; TsFile targets time series data. TsFile layout labels: Chunk Group, Chunk, Time Series (Time1, Value1, Time2, Value2), File Metadata]
TsFile: Encoding and Compression
TS2Diff encoding (Int or Long, timestamps)
• Unified support of fixed-frequency and irregular-frequency time series
128, 136, 144, 152, 160, …
8, 8, 8, 8 → 1st difference is constant.
0, 0, 0 → 2nd difference needs 1-bit storage!
128, 135, 143, 154, 163, …
7, 8, 11, 9 → 1st difference is not constant, though.
1, 3, -2 → 2nd difference needs 2-bit storage!
Adaptive Delta encoding (Int or Long) (TODO)
• An adaptive enhancement of TS2Diff.
• See next page.
RLE encoding (repeated Int or Long)
• For repeated sequences: store a value and its count
Bit-Packing encoding (Int or Long)
• Store data in compact form
• Squeeze out wasteful bits
Gorilla encoding (Float or Double)
• XOR consecutive data points
• Store with a variable-length encoding scheme
Compression Algorithm: Snappy, Gzip (TODO), LZO (TODO)
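The two number sequences on this slide can be reproduced with a short sketch of second-order differencing, plus run-length encoding of the result. It only illustrates the idea; it is not TsFile's actual TS2Diff or RLE bit layout, and all function names are hypothetical.

```python
from itertools import groupby


def diffs(xs):
    """First-order differences between consecutive values."""
    return [b - a for a, b in zip(xs, xs[1:])]


def ts2diff_encode(xs):
    """Keep the first value, the first delta, and the second-order differences."""
    if len(xs) < 2:
        raise ValueError("need at least two values")
    first_order = diffs(xs)
    return xs[0], first_order[0], diffs(first_order)


def ts2diff_decode(first, first_delta, second_order):
    """Rebuild the original sequence from the encoded triple."""
    values, delta = [first, first + first_delta], first_delta
    for dd in second_order:
        delta += dd
        values.append(values[-1] + delta)
    return values


def rle(xs):
    """Run-length encoding: (value, count) for each run of repeated values."""
    return [(value, len(list(run))) for value, run in groupby(xs)]


regular = [128, 136, 144, 152, 160]
irregular = [128, 135, 143, 154, 163]
print(diffs(regular), ts2diff_encode(regular))      # constant 1st difference, zero 2nd difference
print(diffs(irregular), ts2diff_encode(irregular))  # varying 1st difference, small 2nd difference
print(rle(ts2diff_encode(regular)[2]))              # the run of zeros collapses to one pair
```

Decoding with `ts2diff_decode(*ts2diff_encode(xs))` returns the original list, which is a quick way to check that the transform is lossless.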
TsFile: Encoding and Compression
Adaptive TS2Diff encoding (Int or Long) (TODO)
• For time series with outliers or missing points
• Stores second-order delta values and a boolean flag array.
Data Query
Fast Aggregation Method for Time Series
• Only records root nodes in memory and builds virtual trees, to reduce memory cost and disk I/O
IoTDB-SQL: Concise TS Operations Language
[Figure: IoTDB-SQL feature tree]
• DDL
• DML: C, R, U, D
  • select: raw, aggregate
  • filter: device (single, across), metric (single, across), time (certain, range)
  • group by: time interval, series
  • order by: ASC, DESC
  • fill: interpolation, latest
  • limit, slimit
  • index
✔ 8 types of sub-clause
✔ ≥1052 kinds of query
JDBC: Reduce the Cost of Learning
Interfaces: JDBC, TsFile API, CLI, etc.
Time Series Specific Operations (TODO)
Pattern Matching for Streaming Time Series Data
✔ Split the pattern and the data stream into equal-length fragments
✔ Extract features to reduce the dimension
✔ Accelerate the search by using features
✔ Scenario: fault alarm in real time
SELECT wind_3s FROM china.farm1.tb2
WHERE time > t1 AND time < t2
AND wind_3s LIKE PATTERN(7.2,..,20.3,..,6.0)
Similarity Search of Sub-series
✔ Indexing data using Key-Value form
✔ Scenarios:
  ✔ Outlier detection
  ✔ Historical data analysis
  ✔ …
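The three pattern-matching steps can be sketched as follows, assuming the fragment mean as the extracted feature and Euclidean distance as the similarity measure; the slides do not say which feature or distance IoTDB uses, so both are assumptions, and all names and sample values are hypothetical. With this choice, the feature distance scaled by the square root of the fragment length never exceeds the true distance, so pruning on it cannot drop a real match.

```python
import math


def fragment_means(seq, n_frag):
    """Split seq into n_frag equal-length fragments and keep the mean of each."""
    size = len(seq) // n_frag
    return [sum(seq[i * size:(i + 1) * size]) / size for i in range(n_frag)]


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_matches(stream, pattern, n_frag, eps):
    """Return (start offsets of windows within eps of pattern, windows pruned by features).

    len(pattern) is assumed to be divisible by n_frag.
    """
    m = len(pattern)
    size = m // n_frag
    pattern_features = fragment_means(pattern, n_frag)
    hits, pruned = [], 0
    for start in range(len(stream) - m + 1):
        window = stream[start:start + m]
        # Cheap lower bound on the true distance, computed on n_frag features only.
        lower_bound = math.sqrt(size) * euclid(fragment_means(window, n_frag), pattern_features)
        if lower_bound > eps:
            pruned += 1
            continue
        if euclid(window, pattern) <= eps:  # full check only for the survivors
            hits.append(start)
    return hits, pruned


pattern = [7.0, 7.4, 20.0, 20.6, 6.0, 6.0]
stream = [1.0] * 8 + [7.1, 7.3, 20.2, 20.4, 6.1, 5.9] + [1.0] * 8
hits, pruned = find_matches(stream, pattern, n_frag=3, eps=0.5)
print(hits, pruned)
```

A streaming implementation would update the fragment means incrementally instead of recomputing them for every window; that optimization is left out to keep the sketch short.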
From Edge to Cloud: Run IoTDB Everywhere
TsFile (a component of IoTDB): a zip file of time series
• Time series data files: high-performance writes, high compression ratio, support for simple queries. Simply put, TsFile is a zip file for time series data.
• Suitable for embedded devices, general servers, data centers, etc.
IoTDB: a database of time series
• Freely operate on time series across multiple TsFiles, including CRUD and advanced queries such as max, min, avg, and temporal alignment.
• Scene: embedded equipment, on-site industrial computers, general servers, etc.
3rd Systems: a data warehouse of time series
• Easy to use and integrate for complex analysis (data fusion, collaborative recommendation, machine learning)
• Scene: cloud data center
A Process to Manage Time Series Data
[Figure: data flow diagram]
• Data source writes via JDBC / Session API
• Grafana-Adaptor: Visualization (manual data exploration)
• Spark-TsFile-Adaptor: Analysis with Big Data Framework (big data set)
• JDBC: Analysis with Matlab (small data set)
Using JDBC to write data
set storage group
create timeseries
insert data
https://iotdb.apache.org/#/Documents/0.8.0/chap6/sec1
Using Session API to write Data
(more efficient)
set storage group
create timeseries
insert data
Using JDBC to Query Data
raw data query
aggregation query
down sampling query
print result
https://iotdb.apache.org/#/Documents/0.8.0/chap6/sec1
Using Grafana to Visualize Data
https://iotdb.apache.org/#/Tools/Grafana
• Install the simple-json-datasource plugin
• Configure iotdb-grafana-connector
  • application.properties
• Start iotdb-grafana-connector
  • java -jar iotdb-grafana-0.8.0.war
• Add the IoTDB data source (SimpleJson)
  • choose the connector IP
• Configure the dashboard and enjoy!
Using Matlab to Analyze Data
read IoTDB by JDBC
fast Fourier
transform
plot
Using Spark to Analyze Data
create table
sql query
read TsFile
write to TsFile
https://iotdb.apache.org/#/Tools/Spark
Demo
• Writing Data Locally
• Show data with Grafana
• Analyze data using SparkSQL
• https://github.com/jixuan1989/iotdb-tutorial
Demo Video
• Writing Data on HDFS directly
• using Hive to analyze it
• Video
Language
• Written in Java
• But the RPC is implemented with Thrift
• So it is easy to generate APIs for other languages.
Say Hi to the Apache Ecosystem
IoTDB repository:
RocketMQ: https://github.com/apache/incubator-iotdb/tree/master/example/rocketmq
Kafka: https://github.com/apache/incubator-iotdb/tree/master/example/kafka
Third party:
EMQx (MQTT server): https://github.com/jixuan1989/iotdb-tutorial
Spark: https://github.com/jixuan1989/iotdb-tutorial
Calcite: https://github.com/EJTTianYu/iotdb-calcite
PLC4X:
Mapreduce:
Application 1: The Next Generation of Big Data Platform for Meteorology
• 1073 kinds of meteorological data
• ~150K stations, each collecting more than 100 metrics every 5 minutes
• The platform is deployed across China
• After the upgrade, performance improved by two orders of magnitude
Application 2:
Data Management for Equipment Monitoring
The data records the operational status of the equipment, e.g., the vehicle's speed, fuel consumption, and malfunctions.
[Figure: Komatsu excavator data loop. Labels: execute, collect, decision, transfer]
TIANYUAN (with Komatsu)
[Table headers without recoverable values: #devices (excavators etc.), #metrics, collection times per minute]
Before:
• sharding every day
• only 3 months of data stored
• more than 10 minutes for some queries
After:
• the whole data set is stored
• several seconds for complex queries
Application 3:
Shanghai METRO Monitoring
Before the upgrade:
• 144 trains
• 9 KairosDB + Cassandra
• 3200 points/500 ms/train
After the upgrade:
• 14 RESTful services (KDB compatible), just to avoid modifying current programs
• ONE IoTDB instance
• 300 trains
• 3200 points/200 ms/train
• 414 billion data points per day, using just ONE IoTDB instance
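The 414 billion points/day figure can be checked from the train count and sampling interval given on the slide. A minimal sketch of the arithmetic (`points_per_day` is an illustrative helper, not part of IoTDB):

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 ms


def points_per_day(trains: int, points_per_sample: int, interval_ms: int) -> int:
    """Data points per day when every train reports points_per_sample values each interval."""
    samples_per_day = MS_PER_DAY // interval_ms
    return trains * points_per_sample * samples_per_day


before = points_per_day(trains=144, points_per_sample=3200, interval_ms=500)
after = points_per_day(trains=300, points_per_sample=3200, interval_ms=200)
print(f"before: {before:,} points/day")  # about 79.6 billion
print(f"after:  {after:,} points/day")   # about 414.7 billion
```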
Application 4:
Future Works
• Make it easy to use!
• Relational Model: Integration with Calcite
• step 1: supports relational SQL
• step 2: standard JDBC
• Big Data!
• better integration with Hive, etc.
• Cluster!
• writing data to HDFS is supported now, but a shared-nothing architecture is wanted.
• Advanced functions!
• integration with data streaming engines, etc.
Join Us
• Mailing list:
• subscribe: dev-subscribe@iotdb.incubator.apache.org
• discussion: dev@iotdb.apache.org
• bug report: https://issues.apache.org/jira/projects/IOTDB/issues/IOTDB
• Website: https://iotdb.apache.org
• Ecosystem target:
IoTDB v0.8.0 is released! (the first Apache release version)
