Apache IoTDB: a Time Series Database for Industrial IoT
1. Apache IoTDB: a Time Series Database for Industrial IoT
Xiangdong Huang (1) and Julian Feinauer (2) (on behalf of the IoTDB community)
(1) Tsinghua University, Beijing, China
(2) Pragmatic Minds, Stuttgart, Germany
Berlin, Germany, 2019.10.23
2. Outline
• Who We Are
• Why IoTDB Was Born
• Overview of Apache IoTDB (incubating): Main Features
• Working with Current Ecosystems
• Performance Evaluation
• Use Cases
• Future Works
3. IoTDB
• IoTDB = IoT + DB, a database for managing (Industrial) IoT data
• IoTDB is an IoT DB. (Search Google for “IoTDB” as one keyword, not “IoT DB”.)
• “You can find many things about IoTDB in Germany”
• IIoT: turbines, excavators, trucks, modern cars
• DB: Deutsche Bahn (the real meaning)
7. Who We Are (The community)
• We come from the Apache IoTDB (incubating) Community
• A young community: entered the Apache Incubator on 2018.11.18.
• Mentors: Christofer Dutz, Justin Mclean, (Champion) Kevin A. McGrail, Willem Jiang
• Devoted to building the best time series database (in IoT area) in the world
8. Who We Are (Individual)
• Xiangdong Huang (sainthxd@gmail.com)
• PhD, PostDoc, and now Assistant Researcher at Tsinghua University, Beijing, China
• Has used Apache Cassandra (for managing time series data) since 2012
• Has developed IoTDB since 2017
• One of the initial committers of Apache IoTDB (incubating)
9. Who We Are (Individual)
• Julian Feinauer (j.feinauer@pragmaticminds.de)
• Founder of the startup pragmatic minds in Germany
• The first committer who was not an initial committer
• Release Manager of the first release of IoTDB
• Committer of Apache PLC4X, Edgent, etc.
10. Outline
• Who We Are
• Why IoTDB Was Born
• Overview of Apache IoTDB (incubating): Main Features
• Working with Current Ecosystems
• Performance Evaluation
• Use Cases
• Future Works
11. The 4th Industrial Revolution
[Figure: national initiatives of Germany, China, and USA. Labels: Industry 4.0; Industry Internet; “Data analytics and utility is the key”; “Advanced data analytics”; “Data + Model”]
Data is becoming the most important aspect of this era
12. Machine Data (Time Series Data): the Largest Volume in Industrial Data
Industrial Big Data
• Machine Data
• Manufacturing Enterprise Data: Video, Model, Doc, Drawings
• Other Domain Data: Environment, Meteorology, Geography
13. How to Manage Time Series Data
[Figure: data pipeline. Labels: save data locally; Network; MQ; PI System (Pi Server); RDBMS; insertion; query]
14. How to Manage Time Series Data
[Figure: data pipeline. Labels: save data locally; Network; MQ; Database; insertion; query; Network; analysis]
15. The Problems
[Figure: Network, MQ, Database; insertion of Big Data]
● millions of data points per second?
● tens of millions of data points per second?
● billions of data points per second?
Data is produced 24/7 with high frequency and large volume:
50 Hz, 500 points/machine, 20K wind turbines, up to 500 million points/sec in total
• More Features
  • Sometimes out of order
  • Sparse table (different machines have different sensors)
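The 500 million points/sec figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes each of the 500 points per machine is sampled at 50 Hz; the function and constant names are illustrative only.

```python
# Assumed reading of the slide: every machine reports 500 points, 50 times per second.
SAMPLING_HZ = 50
POINTS_PER_MACHINE = 500
MACHINES = 20_000  # 20K wind turbines


def points_per_second(hz, points_per_machine, machines):
    """Total ingestion rate across all machines, in points per second."""
    return hz * points_per_machine * machines


total = points_per_second(SAMPLING_HZ, POINTS_PER_MACHINE, MACHINES)
print(f"{total:,} points/sec")
```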
17. The Problems
[Figure: Network, MQ, Database; query and analysis]
• Features of Data Query
  • The time dimension is always accessed
  • Aggregation is a first-class citizen
    ■ Sometimes we do not need the raw data; knowing the count/min/max/avg value is enough.
    ■ For visualization, the screen resolution is limited, e.g., 1024*768, so fetching more than 1024 points is pointless (use aggregation for downsampling).
  • Time-series-specific query and analysis
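Downsampling by aggregation can be sketched as follows: the raw points are averaged into at most as many time buckets as the screen has horizontal pixels. This is an illustration of the idea only, not IoTDB code; `downsample_avg` and the sample data are hypothetical.

```python
def downsample_avg(points, start, end, buckets):
    """Average (timestamp, value) points into equal-width time buckets over [start, end)."""
    width = (end - start) / buckets
    sums = [0.0] * buckets
    counts = [0] * buckets
    for ts, value in points:
        if start <= ts < end:
            idx = min(int((ts - start) / width), buckets - 1)
            sums[idx] += value
            counts[idx] += 1
    # Empty buckets are dropped, so the result never exceeds `buckets` points.
    return [(start + i * width, sums[i] / counts[i]) for i in range(buckets) if counts[i]]


# 100,000 raw points reduced to at most 1024 points for a 1024-pixel-wide chart.
raw = [(t, float(t % 10)) for t in range(100_000)]
screen = downsample_avg(raw, 0, 100_000, 1024)
print(len(raw), "->", len(screen))
```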
18. The Problems
[Figure: Network, MQ, Database; query and analysis]
● get a mass of data QUICKLY (ETL)
● then convert it into an analysis-friendly file format
● time consuming
19. What we want
• Challenges
  • Large Volume
  • High Throughput
  • Low Cost (historical data)
  • Low Latency for Query
  • Fast Aggregation
  • Query-Analysis hybrid workloads
20. Different Solutions for Managing Time Series
RDB
• Limited number of columns (1600 columns in a table)
• Limited number of rows (<=10M rows is better)
• Manual sharding
KVDB
• Supports big data
• Limited queries: lacks time filtering, value filtering, and multiple time series alignment
TSDB
• LSM based: efficient file structure and more query functions, but not optimized for some application scenarios
• Based on PG: auto sharding and query optimization, but performance degrades sharply after writing data for a long time
• HBase/Cassandra based: partitioned by TS-UID and time range, but storage is inefficient and queries are limited
21. Outline
• Who We Are
• Why IoTDB Was Born
• Overview of Apache IoTDB (incubating): Main Features
• Working with Current Ecosystems
• Performance Evaluation
• Use Cases
• Future Works
22. Time Series DB for Industrial Internet
“清华数为” Industrial Internet Time Series Database
now called: Apache IoTDB (incubating)
Each node can manage:
★ Tens of millions of time series
★ Trillions of data points
★ Tens of TB of data
Supports Hadoop, Spark, Matlab, Grafana, etc.
23. Apache IoTDB Features
Persist data efficiently
• Millions of points ingested per second per node
• Tens of millions of time series
Query data with low latency
• Efficiently filter data: millions of points per sec
• Aggregation: tens of ms latency on billions of points
Exclusive operations of time series
• Segmentation
• Representation
• Subsequence matching
• Time-frequency transform
• Visualization
Integration with existing ecosystem
• Kafka
• MatLab
• Spark
• MapReduce
• Grafana
Cover the life cycle of data: Collection, Storage, Process, Learning, Application
• Connecting Edge to the Cloud
• Powerful query engine
• User-friendly analytics
25. Concepts in IoTDB (The Schema)
Device (i.e., Data source)
• A machine instance
Measurement (e.g., sensor)
• A device can have many measurements
Time Series
• Device + Measurement
• is represented as a path that begins with root, like
“root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain”
Storage Group (SG)
• A storage group can have many devices
• Storage groups have independent resources (threads and files) to increase parallelism and reduce lock contention.
Cadillac XT5
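The "Device + Measurement" rule can be illustrated with a small helper. This is a sketch, assuming the last node of the path is the measurement and everything before it is the device; `split_series_path` is a hypothetical name, not part of any IoTDB API.

```python
def split_series_path(path):
    """Split a time series path into (device path, measurement name)."""
    if not path.startswith("root."):
        raise ValueError("a time series path begins with root")
    # Assumption: the final node is the measurement, the prefix is the device.
    device, _, measurement = path.rpartition(".")
    return device, measurement


device, measurement = split_series_path("root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain")
print(device)       # root.Cadillac_XT5.USA.CA.7BTC409
print(measurement)  # fuelRemain
```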
27. SQL in IoTDB
Set Storage Group
SET STORAGE GROUP TO root.laptop.d1.s1;
Create Timeseries
CREATE TIMESERIES root.laptop.d1.s1 WITH DATATYPE=INT32, ENCODING=RLE;
Insert Data
INSERT INTO (d1.s1,d1.s2,time) VALUES (1000,2000,14735235234);
Delete Data
DELETE FROM d1.s1 WHERE time < 1000;
Update Data
UPDATE d1.s1 SET VALUE = 2000 WHERE time < 2000 and time > 1000;
Query Data (Filter, Aggregation, Group by time interval)
SELECT d1.s1,d2.* FROM BJ.WF1 WHERE d1.s1 < 2000 and d2.s2 > 1000 and freq(d2.s3) > 0.5;
SELECT count(status), max_value(temperature) FROM root.ln.wf01.wt01;
SELECT count(status) FROM root.ln.wf01.wt01 GROUP BY (1h, [2017-11-03T00:00:00, 2017-11-03T23:00:00]);
28. Supported data types
• Boolean
• Int
• Long
• Float
• Double
• String
• GPS (TODO) -> for trajectory data management
• Array (TODO) -> for unstructured data management
29.
30. TsFile: Zip File Born for Time Series Data
Columnar Store
- Reduce disk I/O
- Improve compression
Compression & Encoding
- Improve compression greatly
- 15% better than InfluxDB in real applications
Time-domain Statistics Info Natively
- Support fast query in the
  - Time domain
  - Value domain
  - Freq domain (TODO)
Detailed specification:
http://iotdb.apache.org/#/Documents/0.8.0/chap7/sec3
https://cwiki.apache.org/confluence/display/IOTDB/TsFile+Format
31. TsFile: comparison with Parquet
You say, “tomato”...
Parquet | TsFile | Target in TsFile
Row Group | Chunk Group | The data that belongs to a device instance
Column | Chunk | The data that belongs to the device’s measurement
Page | Page | A part of the data that belongs to a Chunk
The differences in TsFile
❏ Each Page actually has two columns
  ❏ The time column and the value column
❏ No Repeat and Duplication Field Needed
❏ More summary info (statistics) for a Page/Chunk
  ❏ min/max timestamp
  ❏ min/max value
  ❏ count
❏ FileMetadata
[Figure: TsFile layout. Page Header (statistics); Page Data (Timestamps, Values); FileMetadata (Devices info, Level 1 and Level 2)]
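Why per-page summary info helps can be sketched in a few lines: a count query over a time range can skip pages that lie outside the range and answer from the statistics alone for pages that lie fully inside it. This is an illustration under assumed names (`PageStats`, `summarize`, `count_in_range`), not the TsFile format or its reader.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Summary info kept for a page (hypothetical in-memory form)."""
    min_time: int
    max_time: int
    min_value: float
    max_value: float
    count: int


def summarize(timestamps, values):
    return PageStats(min(timestamps), max(timestamps), min(values), max(values), len(values))


def count_in_range(pages, t_start, t_end):
    """Count points with t_start <= time <= t_end, reading page data only when needed."""
    total = 0
    for stats, timestamps, _values in pages:
        if stats.max_time < t_start or stats.min_time > t_end:
            continue  # page is entirely outside the range: skip it
        if t_start <= stats.min_time and stats.max_time <= t_end:
            total += stats.count  # page is entirely inside: the summary is enough
        else:
            total += sum(t_start <= t <= t_end for t in timestamps)
    return total


pages = []
for first in (1, 5, 9):
    ts = list(range(first, first + 4))
    vals = [float(t) for t in ts]
    pages.append((summarize(ts, vals), ts, vals))

print(count_in_range(pages, 3, 9))
```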
32. TsFile: comparison with Parquet
[Figure: Apache Parquet is a general file format; TsFile is for time series data. TsFile labels: Chunk Group, Chunk, File Metadata; a time series stored as Time1, Value1, Time2, Value2]
33. TsFile: Encoding and Compression
TS2Diff encoding – Int or Long (timestamps)
• Unified support of fixed-frequency and irregular-frequency time series
• 128, 136, 144, 152, 160, …
  • 1st difference: 8, 8, 8, 8 (constant)
  • 2nd difference: 0, 0, 0 (1-bit storage needed)
• 128, 135, 143, 154, 163, …
  • 1st difference: 7, 8, 11, 9 (not constant)
  • 2nd difference: 1, 3, -2 (2-bit storage needed)
Adaptive Delta encoding – Int or Long (TODO)
• An adaptive enhancement of TS2Diff.
• See next page.
RLE encoding – repeated Int or Long
• For repeated sequences: store a value and its count
Bit-Packing encoding – Int or Long
• Store data in compact form
• Squeeze out wasteful bits
Gorilla encoding – Float or Double
• XOR consecutive data points
• Store with a variable-length encoding scheme
Compression Algorithm
• Snappy
• Gzip (TODO)
• LZO (TODO)
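The two timestamp sequences on this slide can be reproduced with a short sketch of second-order differencing, together with the value-and-count idea behind RLE. It only demonstrates the arithmetic; it is not the TsFile encoder, and all function names are illustrative.

```python
def differences(values):
    """First-order differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]


def second_order(values):
    """Second-order differences (difference of the differences)."""
    return differences(differences(values))


def decode_second_order(first, first_delta, second_deltas):
    """Rebuild a sequence from its first value, first delta, and second-order deltas."""
    out = [first, first + first_delta]
    delta = first_delta
    for dd in second_deltas:
        delta += dd
        out.append(out[-1] + delta)
    return out


def rle_encode(values):
    """Run-length encoding: store each repeated value once, with its count."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(run) for run in runs]


regular = [128, 136, 144, 152, 160]
irregular = [128, 135, 143, 154, 163]
print(differences(regular), second_order(regular))
print(differences(irregular), second_order(irregular))
print(rle_encode([1, 1, 1, 0, 0, 1]))
```

Only the first value, the first delta, and the small second-order deltas need to be stored; `decode_second_order(128, 7, [1, 3, -2])` rebuilds the irregular sequence.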
34. TsFile: Encoding and Compression
Adaptive TS2Diff encoding – Int or Long (TODO)
• For time series with outliers or missing points
• Stores second-order delta values and a boolean flag array.
35. Data Query
Fast Aggregation Method for Time Series
• Only records root nodes in memory and builds virtual trees, to reduce memory cost and disk I/O
IoTDB-SQL: Concise TS Operations Language
• DDL
• DML: C, R, U, D
• Sub-clauses of a query (R):
  • select: raw, aggregate
  • filter: device (single, across), metric (single, across), time (certain, range)
  • group by: time interval, series
  • order by: ASC, DESC
  • fill: interpolation, latest
  • limit
  • slimit
  • index
✔ 8 types of sub-clause
✔ ≥1052 kinds of query
JDBC: Reduce the Cost of Learning
Interfaces: JDBC, TsFile API, CLI, etc.
36. Time Series Specific Operations (TODO)
Pattern Matching for Streaming Time Series Data
✔ Split the pattern and data stream into
equal length fragments
✔ Extract features to reduce the dimension
✔ Accelerate the search by using features
✔ Scenario: fault alarm in real time
SELECT wind_3s FROM china.farm1.tb2
WHERE time > t1 AND time < t2
AND wind_3s LIKE PATTERN(7.2,..,20.3,..,6.0)
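The fragment-and-feature idea can be sketched as follows. The slide does not say which feature is extracted, so this sketch assumes the mean of each fragment; it only reports fragment-aligned candidate offsets, which would still need to be verified against the raw data. All names and sample values are hypothetical.

```python
from statistics import fmean


def fragment_means(values, fragment_len):
    """Split values into equal-length fragments and return the mean of each one."""
    usable = len(values) - len(values) % fragment_len
    return [fmean(values[i:i + fragment_len]) for i in range(0, usable, fragment_len)]


def candidate_offsets(stream, pattern, fragment_len, tolerance):
    """Offsets in the stream whose fragment features are close to the pattern features."""
    pattern_features = fragment_means(pattern, fragment_len)
    stream_features = fragment_means(stream, fragment_len)
    width = len(pattern_features)
    hits = []
    for start in range(len(stream_features) - width + 1):
        window = stream_features[start:start + width]
        # Comparing a few features is cheaper than comparing every raw point.
        if all(abs(a - b) <= tolerance for a, b in zip(window, pattern_features)):
            hits.append(start * fragment_len)
    return hits


stream = [5.0, 5.2, 7.0, 7.4, 20.1, 20.5, 6.1, 5.9, 5.0, 5.1]
pattern = [7.2, 7.2, 20.3, 20.3, 6.0, 6.0]
print(candidate_offsets(stream, pattern, fragment_len=2, tolerance=0.5))
```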
Similarity Search of Sub-series
✔ Indexing data using Key-Value form
✔ Scenarios:
✔ Outlier detection
✔ Historical data analysis
✔ …
37. From Edge to Cloud: Run IoTDB Everywhere
TsFile (a component of IoTDB): a zip file of time series
• Time series data files: high-performance writes, high compression ratio, support for simple queries. Simply put, TsFile is a zip file for time series data.
• Suitable for embedded devices, general servers, data centers, etc.
IoTDB: a database of time series
• Freely operate on the time series of multiple TsFiles, including CRUD and advanced queries like max, min, avg, and temporal alignment.
• Scene: embedded equipment, on-site industrial computers, general servers, etc.
3rd-party systems: a data warehouse of time series
• Easy to use and integrate for complex analysis (data fusion, collaborative recommendation, machine learning)
• Scene: cloud data center
38. Outline
• Who We Are
• Why IoTDB Was Born
• Overview of Apache IoTDB (incubating): Main Features
• Working with Current Ecosystems
• Performance Evaluation
• Use Cases
• Future Works
39. A Process to Manage Time Series Data
[Figure: data flow. Data sources write via JDBC / Session API; data is read out via Grafana-Adaptor, Spark-TsFile-Adaptor, and JDBC for visualization (manual data exploration), analysis with a big data framework (big data set), and analysis with Matlab (small data set)]
40. Using JDBC to write data
set storage group
create timeseries
insert data
https://iotdb.apache.org/#/Documents/0.8.0/chap6/sec1
41. Using Session API to Write Data (more efficient)
set storage group
create timeseries
insert data
42. Using JDBC to Query Data
raw data query
aggregation query
down sampling query
print result
https://iotdb.apache.org/#/Documents/0.8.0/chap6/sec1
43. Using Grafana to Visualize Data
https://iotdb.apache.org/#/Tools/Grafana
• Install the simple-json-datasource plugin
• Configure iotdb-grafana-connector
  • application.properties
• Start iotdb-grafana-connector
  • java -jar iotdb-grafana-0.8.0.war
• Add an IoTDB data source (SimpleJson)
  • choose the connector IP
• Configure the dashboard and enjoy!
44. Using Matlab to Analyze Data
read IoTDB by JDBC
fast Fourier
transform
plot
45. Using Spark to Analyze Data
create table
sql query
read TsFile
write to TsFile
https://iotdb.apache.org/#/Tools/Spark
46. Demo
• Writing Data Locally
• Show data with Grafana
• Analyze data using SparkSQL
• https://github.com/jixuan1989/iotdb-tutorial
48. Language
• Written in Java
• The RPC layer is implemented with Thrift
• So it is easy to generate APIs for other languages.
49. Say Hi to the Apache Ecosystem
IoTDB repository:
RocketMQ: https://github.com/apache/incubator-iotdb/tree/master/example/rocketmq
Kafka: https://github.com/apache/incubator-iotdb/tree/master/example/kafka
Third party:
EMQx (MQTT server): https://github.com/jixuan1989/iotdb-tutorial
Spark: https://github.com/jixuan1989/iotdb-tutorial
Calcite: https://github.com/EJTTianYu/iotdb-calcite
PLC4X:
Mapreduce:
50. Outline
• Who We Are
• Why IoTDB Was Born
• Overview of Apache IoTDB (incubating): Main Features
• Working with Current Ecosystems
• Performance Evaluation
• Use Cases
• Future Works
51. Application 1: The Next Generation of Big Data Platform for Meteorology
• 1073 kinds of meteorological data
• ~150K stations collect more than 100 metrics / 5 minutes
• The platform is deployed across China
• Upgrade: performance improved by two orders of magnitude
53. Application 3: Shanghai METRO Monitoring
Before the upgrade
• 144 trains
• 3200 points/500 ms/train
• 9 KairosDB + Cassandra
After the upgrade
• 300 trains
• 3200 points/200 ms/train
• ONE IoTDB instance
• 14 KDB-compatible Restful services, just to avoid modifying current programs
• 414 billion data points per day, using just ONE IoTDB instance
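The 414 billion figure follows from the train count and reporting interval. A minimal sketch of the arithmetic, assuming every train reports continuously for 24 hours; the function name is illustrative.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def points_per_day(trains, points_per_report, report_interval_ms):
    """Daily data points if every train reports continuously."""
    reports_per_day = MS_PER_DAY // report_interval_ms
    return trains * points_per_report * reports_per_day


before = points_per_day(144, 3200, 500)  # 144 trains, one report per 500 ms
after = points_per_day(300, 3200, 200)   # 300 trains, one report per 200 ms
print(f"before: {before:,} points/day")
print(f"after:  {after:,} points/day")
```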
56. Outline
• Who We Are
• Why IoTDB Was Born
• Overview of Apache IoTDB (incubating): Main Features
• Working with Current Ecosystems
• Performance Evaluation
• Use Cases
• Future Works
57. Future Works
• Make it easy to use!
  • Relational Model: Integration with Calcite
    • step 1: support relational SQL
    • step 2: standard JDBC
• Big Data!
  • better integration with Hive, etc.
• Cluster!
  • writing data to HDFS is supported now, but a shared-nothing architecture is wanted.
• Advanced functions!
  • integration with data streaming engines, etc.
58. Join Us
• Mailing list:
  • subscribe: dev-subscribe@iotdb.incubator.apache.org
  • discussion: dev@iotdb.apache.org
• bug report: https://issues.apache.org/jira/projects/IOTDB/issues/IOTDB
• Website: https://iotdb.apache.org
• Ecosystem target:
IoTDB v0.8.0 is released! (the first Apache release version)