Apache IoTDB: a Time Series Database for Industrial IoT
Xiangdong Huang (1) and Julian Feinauer (2) (on behalf of the IoTDB community)
(1) Tsinghua University, Beijing, China
(2) Pragmatic Minds, Stuttgart, Germany
Berlin, Germany, 2019.10.23

Outline
• Who We Are
• Why IoTDB Was Born
• Overview of Apache IoTDB (incubating): Main Features
• Working with Current Ecosystems
• Performance Evaluation
• Use Cases
• Future Works

IoTDB
• IoTDB = IoT + DB, a database for managing (Industrial) IoT data
• IoTDB is an IoT DB. (Search Google for the keyword “IoTDB”, not “IoT DB”.)
• “You can find many substances about IoTDB in Germany”
• IIoT: turbines, excavators, trucks, modern cars
• DB: Deutsche Bahn (the real meaning)

Who We Are (The community)
• We come from the Apache IoTDB (incubating) Community
• A young community: entered the Apache Incubator on 2018-11-18.
• Mentors: Christofer Dutz, Justin Mclean, (Champion) Kevin A. McGrail, Willem Jiang
• Devoted to building the best time series database (in IoT area) in the world

Who We Are (Individual)
• Xiangdong Huang (sainthxd@gmail.com)
• PhD, PostDoc and (now) Assistant Researcher at Tsinghua University, Beijing, China
• Has used Apache Cassandra (for managing time series data) since 2012
• Has developed IoTDB since 2017
• One of the initial committers of Apache IoTDB (incubating)

Who We Are (Individual)
• Julian Feinauer (j.feinauer@pragmaticminds.de)
• Founder of the startup pragmatic minds in Germany
• The first committer who was not an initial committer
• The Release Manager of the first release version of IoTDB
• Committer of Apache PLC4X, Edgent, etc.

The 4th Industrial Revolution
[Figure: Industry 4.0 and the Industry Internet in Germany, China and the USA. Labels: “Data analytics and utility is the key”, “Advanced data analytics”, “Data + Model”]
Data is becoming the most important aspect of this era

Machine Data (Time Series Data): the Largest Volume in Industrial Data
[Figure: Industrial Big Data is made up of Machine Data, Enterprise Data and Other Domain Data. Labels: Manufacturing, Video, Model, Doc, Drawings, Environment, Meteorology, Geography]

How to Manage Time Series Data
[Figure labels: save data locally; Network; MQ; PI System (PI Server); RDBMS; insertion; query]

How to Manage Time Series Data
[Figure labels: save data locally; Network; MQ; Database; insertion; query; Network; analysis]

The Problems
[Figure labels: Network; MQ; Database; insertion]
● millions of data points per second?
● 10 million data points per second?
● billions of data points per second?
Big Data
• 50 Hz, 500 points/machine, 20K wind turbines: up to 500 million points/sec in total
• Data is produced 24/7 with high frequency and large volume
• More features
  • Sometimes out of order
  • Sparse table (different machines have different sensors)
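The 500 million points/sec figure follows directly from the three numbers on this slide. A short check of the arithmetic (the variable names are illustrative, not from the slides):

```python
# Figures taken from the slide; the variable names are only illustrative.
sampling_hz = 50            # samples per second for each point
points_per_machine = 500    # measured points per machine
machines = 20_000           # wind turbines

points_per_second = sampling_hz * points_per_machine * machines
print(f"{points_per_second:,} points/sec")
```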

The Problems
[Figure labels: Network; MQ; Database; query; analysis]
• Features of data queries
  • The time dimension is always accessed
  • Aggregation is a first-class citizen
    ■ Sometimes we do not need the raw data; knowing the count/min/max/avg value is enough.
    ■ (For visualization) the screen resolution is limited, e.g., 1024*768, so there is no point in fetching more than 1024 points (use aggregation for downsampling).
  • Time-series-specific query and analysis
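The downsampling idea above can be sketched in a few lines: group the raw points into fixed-width time buckets and return one aggregate per bucket, so a chart never receives more points than it has pixels. This is a minimal illustration, not IoTDB's implementation; the function name and the sample data are assumptions.

```python
from collections import defaultdict

def downsample_avg(points, bucket_ms):
    """Average (timestamp, value) pairs per fixed-width time bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts // bucket_ms * bucket_ms].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]

# 10,000 raw points, one per millisecond, reduced to at most 1024 buckets.
raw = [(t, float(t % 100)) for t in range(10_000)]
bucket_ms = -(-len(raw) // 1024)   # ceiling division
reduced = downsample_avg(raw, bucket_ms)
```

The same shape of query appears later in the deck as a group-by-time-interval clause in IoTDB-SQL.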

The Problems
[Figure labels: Network; MQ; Database; query; analysis]
● get a mass of data QUICKLY (ETL)
● then convert it into an analysis-friendly file format
● time consuming

What we want
• Challenges
  • Large Volume
  • High Throughput
  • Low Cost (historical data)
  • Low Latency for Query
  • Fast Aggregation
  • Query-Analysis hybrid workloads

Different Solutions for Managing Time Series
RDB
• Limited number of columns (1600 columns in a table)
• Limited number of rows (<=10M rows is better)
• Manual sharding
KVDB
• Supports big data
• Limited queries
  • Lacks time filtering
  • Lacks value filtering
  • Lacks alignment of multiple time series
TSDB
• Based on PG
  • Auto sharding
  • Query optimization
  • Performance degrades sharply after writing data for a long time
• HBase/Cassandra based
  • Partition by TS-UID and time range
  • Storage inefficiency
  • Limited queries
• LSM based
  • Efficient file structure
  • More query functions
  • Not optimized for some application scenarios

Time Series DB for Industrial Internet
“清华数为” Industrial Internet Time Series Database, now called: Apache IoTDB (incubating)
Each node can manage:
★ Tens of millions of time series
★ Trillions of data points
★ Tens of TB of data
Supports Hadoop, Spark, Matlab, Grafana, etc.

Apache IoTDB Features
Persist data efficiently
• Ingestion of millions of points per sec per node
• Tens of millions of time series
Query data with low latency
• Efficiently filter data: millions of points per sec
• Aggregation: tens of ms latency on billions of points
Exclusive operations of time series
• Segmentation
• Representation
• Subsequence matching
• Time-frequency transform
• Visualization
Integration with existing ecosystem
• Kafka
• Matlab
• Spark
• MapReduce
• Grafana
• Connecting Edge to the Cloud
• Powerful query engine
• User Friendly analytics
Cover the life cycle of data: Collection, Storage, Process, Learning, Application

Architecture
[Figure: IoTDB architecture]
• TsFile: time series optimized file format
• TsFile-CLI: interactive client command line
• IoTDB-JDBC
• Grafana-Adaptor: web dashboard to visualize time series data
• IoTDB-CLI: interactive client command line
• I/E Tool: batch load and export data
• IoTDB Sync
• Analysis: outlier detection, machine learning, UDF
• Big data framework cluster: Hadoop/Spark
• Connected systems: other databases, applications, message queue, DevOps, devices

Concepts in IoTDB (The Schema)
Device (i.e., Data source)
• A machine instance
Measurement (e.g., sensor)
• A device can have many measurements
Time Series
• Device + Measurement
• Represented as a path that begins with root, like “root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain”
Storage Group (SG)
• A storage group can have many devices
• Storage groups have independent resources (threads and files) to increase parallelism and reduce lock contention.
[Figure: Cadillac XT5]
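Since a time series is a device plus a measurement written as one dotted path, the two parts can be recovered by splitting at the last dot. The sketch below only illustrates that naming convention; `SeriesPath` and `split_series_path` are hypothetical helpers, not part of any IoTDB API.

```python
from typing import NamedTuple

class SeriesPath(NamedTuple):
    device: str
    measurement: str

def split_series_path(path: str) -> SeriesPath:
    """Split 'root.a.b.sensor' into device 'root.a.b' and measurement 'sensor'."""
    if not path.startswith("root."):
        raise ValueError("a time series path must begin with 'root'")
    # Everything before the last dot names the device; the last node is the measurement.
    device, _, measurement = path.rpartition(".")
    return SeriesPath(device, measurement)

series = split_series_path("root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain")
```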

The schema mapping
Time series paths in IoTDB:
root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain
root.Cadillac_XT5.USA.CA.7BTC409.speed
root.Cadillac_XT5.USA.NV.6BAC321.speed
Equivalent relational table (Table Name: Cadillac_XT5):
country | state | device name | timestamp | fuelRemain | speed
USA     | CA    | 7BTC409     | t1        | 5.0        | 120
USA     | CA    | 7BTC409     | t2        | 4.9        | 109
USA     | NV    | 6BAC321     | t1        | NULL       | 50
USA     | NV    | 6BAC321     | t3        | NULL       | 65
The columns correspond to Tags and Fields in InfluxDB, KairosDB, OpenTSDB…; the table name is called a Measurement in InfluxDB.
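The mapping can be made concrete by pivoting individual (path, timestamp, value) points into one row per device and timestamp, leaving a missing measurement as NULL (`None` here). This is a sketch under the assumption that every path has exactly the layout root.<table>.<country>.<state>.<device>.<measurement>; `to_rows` is a hypothetical helper.

```python
def to_rows(points):
    """Pivot (path, timestamp, value) points into one row per (device, timestamp)."""
    rows = {}
    for path, ts, value in points:
        # Assumed layout: root.<table>.<country>.<state>.<device>.<measurement>
        _root, _table, country, state, device, measurement = path.split(".")
        row = rows.setdefault(
            (country, state, device, ts),
            {"fuelRemain": None, "speed": None},   # sparse: unset columns stay NULL
        )
        row[measurement] = value
    return [key + (cols["fuelRemain"], cols["speed"]) for key, cols in rows.items()]

points = [
    ("root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain", "t1", 5.0),
    ("root.Cadillac_XT5.USA.CA.7BTC409.speed", "t1", 120),
    ("root.Cadillac_XT5.USA.NV.6BAC321.speed", "t1", 50),
]
rows = to_rows(points)
```

A device that never reports fuelRemain simply has no such time series in the path model, whereas the relational table must carry a NULL column for it.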

SQL in IoTDB
Set Storage Group
SET STORAGE GROUP TO root.laptop;
Create Timeseries
CREATE TIMESERIES root.laptop.d1.s1 WITH DATATYPE=INT32, ENCODING=RLE;
Insert Data
INSERT INTO (d1.s1,d1.s2,time) VALUES (1000,2000,14735235234);
Delete Data
DELETE FROM d1.s1 WHERE time < 1000;
Update Data
UPDATE d1.s1 SET VALUE = 2000 WHERE time < 2000 and time > 1000;
Query Data (Filter, Aggregation, Group by time interval)
SELECT d1.s1,d2.* FROM BJ.WF1 WHERE d1.s1 < 2000 and d2.s2 > 1000 and freq(d2.s3) > 0.5;
SELECT count(status), max_value(temperature) FROM root.ln.wf01.wt01;
SELECT count(status) FROM root.ln.wf01.wt01 GROUP BY (1h, [2017-11-03T00:00:00, 2017-11-03T23:00:00]);

Supported data types
• Boolean
• Int
• Long
• Float
• Double
• String
• GPS (TODO) -> for trajectory data management
• Array (TODO) -> for unstructured data management

TsFile: Zip File Born for Time Series Data
Columnar Store
- Reduce Disk I/O
- Improve Compression
Compression & Encoding
- Improve Compression Greatly
- 15% Better than InfluxDB in Real Applications
Time-domain Statistics Info Natively
- Support Fast Query in
  - Time Domain
  - Value Domain
  - Freq Domain (TODO)
detailed specification:
http://iotdb.apache.org/#/Documents/0.8.0/chap7/sec3
https://cwiki.apache.org/confluence/display/IOTDB/TsFile+Format

TsFile: comparison with Parquet
You say, “tomato”...
Parquet      | TsFile      | Target in TsFile
Row Group    | Chunk Group | The data that belongs to a device instance
Column Chunk | Chunk       | The data that belongs to the device’s measurement
Page         | Page        | A part of the data that belongs to a Chunk
The differences
❏ Each Page actually has two columns
  ❏ The time column and the value column
❏ No repetition and definition fields needed
❏ More summary info for a Page/Chunk
  ❏ min/max timestamp
  ❏ min/max value
  ❏ count
❏ FileMetadata
[Figure: differences in TsFile. Page = Page Header (statistics) + Page Data (Timestamps, Values); FileMetadata = Devices info Level 1, Devices info Level 2]
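The per-page summary info listed above is what makes time-range queries cheap: a page whose min/max timestamps fall outside the range is skipped, and a page fully inside the range can answer a count from its header alone. The sketch below illustrates that idea only; the `Page` class and `count_in_range` are assumptions for illustration and do not reflect the actual TsFile layout or API.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    timestamps: list
    values: list
    # Summary info, computed once when the page is written.
    min_time: int = field(init=False)
    max_time: int = field(init=False)
    count: int = field(init=False)

    def __post_init__(self):
        self.min_time = min(self.timestamps)
        self.max_time = max(self.timestamps)
        self.count = len(self.timestamps)

def count_in_range(pages, start, end):
    """Count points with start <= t <= end; return (count, pages_decoded)."""
    total, decoded = 0, 0
    for page in pages:
        if page.max_time < start or page.min_time > end:
            continue                      # no overlap: skipped using the header only
        if start <= page.min_time and page.max_time <= end:
            total += page.count           # fully covered: answered from the header
            continue
        decoded += 1                      # partial overlap: the page data must be read
        total += sum(start <= t <= end for t in page.timestamps)
    return total, decoded

pages = [Page(list(range(i, i + 100)), [0.0] * 100) for i in (0, 100, 200, 300)]
result = count_in_range(pages, 150, 299)
```

In this example only one of the four pages has to be decoded to count the 150 matching points.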

TsFile: comparison with Parquet
[Figure: Apache Parquet is a general file format; TsFile is a format for time series data. Labels: Chunk Group, Chunk, File Metadata, Time Series, (Time1, Value1), (Time2, Value2)]

TsFile: Encoding and Compression
TS2Diff encoding – Int or Long (timestamps)
• Unified support of fixed frequency time series or irregular frequency time series
• 128, 136, 144, 152, 160, …
  • 8, 8, 8, 8 → 1st difference is constant.
  • 0, 0, 0 → 2nd difference is 1-bit storage needed!
• 128, 135, 143, 154, 163, …
  • 7, 8, 11, 9 → 1st difference is not constant though
  • 1, 3, -2 → 2nd difference is 2-bit storage needed!
Adaptive Delta encoding – Int or Long (TODO)
• An adaptive enhancement of TS2Diff.
• See next page.
RLE encoding – repeated Int or Long
• For repeated sequences: store a value and its count
Bit-Packing encoding – Int or Long
• Store data in compact form
• Squeeze out wasteful bits
Gorilla encoding – Float or Double
• XOR consecutive data points
• Store with variable length encoding scheme
Compression Algorithm
• Snappy, Gzip (TODO), LZO (TODO)
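The two example sequences on this slide can be reproduced with a small second-order differencing routine, and the resulting run of zeros shows why run-length encoding pairs well with it. This is a sketch of the general technique, not the TS2Diff or RLE code in TsFile; the function names are assumptions and the input needs at least two values.

```python
def delta2_encode(values):
    """Return (first value, first delta, second-order deltas); needs >= 2 values."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    second = [b - a for a, b in zip(deltas, deltas[1:])]
    return values[0], deltas[0], second

def delta2_decode(first, first_delta, second):
    """Rebuild the original sequence from its second-order deltas."""
    values, delta = [first, first + first_delta], first_delta
    for d2 in second:
        delta += d2
        values.append(values[-1] + delta)
    return values

def rle_encode(values):
    """Collapse runs of equal values into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

regular = [128, 136, 144, 152, 160]
irregular = [128, 135, 143, 154, 163]
regular_encoded = delta2_encode(regular)      # second-order deltas are all zero
irregular_encoded = delta2_encode(irregular)  # small, but non-zero, deltas
```

For fixed-frequency timestamps the second-order deltas collapse to a single run, `rle_encode([0, 0, 0])`, which is one value and one count.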

TsFile: Encoding and Compression
Adaptive TS2Diff encoding – Int or Long (TODO)
• For time series with outliers or missing points
• Stores second-order delta values and a boolean flag array.

Data Query
Fast Aggregation Method for Time Series
• Only records root nodes in memory and builds virtual trees, reducing memory cost and disk I/O
IoTDB-SQL: Concise TS Operations Language
• DML
  • C, U, D
  • R
    • select: raw, aggregate
    • filter: device (single, across), metric (single, across), time (certain, range)
    • group by: time interval, series
    • order by: ASC, DESC
    • fill: interpolation, latest
    • limit
    • slimit
    • index
• DDL
✔ 8 types of sub-clause
✔ ≥1052 kinds of query
JDBC: Reduce the Cost of Learning
Interfaces: JDBC, TsFile API, CLI, etc.

Time Series Specific Operations (TODO)
Pattern Matching for Streaming Time Series Data
✔ Split the pattern and data stream into equal length fragments
✔ Extract features to reduce the dimension
✔ Accelerate the search by using features
✔ Scenario: fault alarm in real time
SELECT wind_3s FROM china.farm1.tb2
WHERE time > t1 AND time < t2
AND wind_3s LIKE PATTERN(7.2,..,20.3,..,6.0)
Similarity Search of Sub-series
✔ Indexing data using Key-Value form
✔ Scenarios:
  ✔ Outlier detection
  ✔ Historical data analysis
  ✔ …
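The fragment-and-feature steps for pattern matching can be sketched as follows. The slides do not say which features are extracted, so this example assumes the simplest choice, the mean of each fragment, and compares windows of the stream against the pattern on those features only. All names, the sample data and the tolerance are hypothetical.

```python
def fragment_means(series, fragment_len):
    """Split a series into equal-length fragments and keep each fragment's mean."""
    usable = len(series) - len(series) % fragment_len
    return [
        sum(series[i:i + fragment_len]) / fragment_len
        for i in range(0, usable, fragment_len)
    ]

def find_matches(stream, pattern, fragment_len, tolerance):
    """Return start offsets whose fragment means are all within tolerance of the pattern's."""
    target = fragment_means(pattern, fragment_len)
    hits = []
    for start in range(len(stream) - len(pattern) + 1):
        window = stream[start:start + len(pattern)]
        feats = fragment_means(window, fragment_len)
        # Compare the reduced features instead of every raw point.
        if all(abs(a - b) <= tolerance for a, b in zip(feats, target)):
            hits.append(start)
    return hits

stream = [6.0, 6.1, 7.2, 7.4, 20.3, 20.1, 6.0, 6.2, 6.1, 6.0]
pattern = [7.0, 7.5, 20.0, 20.5, 6.0, 6.0]
hits = find_matches(stream, pattern, fragment_len=2, tolerance=0.5)
```

A production version would update the features incrementally as points arrive rather than recomputing them per window; this sketch favors clarity over speed.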

From Edge to Cloud: Run IoTDB Everywhere
TsFile (a component of IoTDB): a zip file of time series
• Time series data files: high-performance writes, high compression ratio, support for simple queries. Simply put, TsFile is a zip file for time series data.
• Scene: suitable for embedded devices, general servers, data centers, etc.
IoTDB: a database of time series
• Freely operate on time series across multiple TsFiles, including CRUD and advanced queries like max, min, avg and temporal alignment.
• Scene: embedded equipment, on-site industrial computers, general servers, etc.
3rd Systems: a data warehouse of time series
• Easy to use and integrate for complex analysis (data fusion, collaborative recommendation, machine learning)
• Scene: cloud data center

A Process to Manage Time Series Data
[Figure: data sources write through JDBC or the Session API; the stored data is then used through:]
• Grafana-Adaptor: visualization (manual data exploration)
• Spark-TsFile-Adaptor: analysis with a big data framework (big data set)
• JDBC: analysis with Matlab (small data set)

Using JDBC to write data
set storage group
create timeseries
insert data
https://iotdb.apache.org/#/Documents/0.8.0/chap6/sec1

Using Session API to write Data
(more efficient)
set storage group
create timeseries
insert data

Using JDBC to Query Data
raw data query
aggregation query
down sampling query
print result
https://iotdb.apache.org/#/Documents/0.8.0/chap6/sec1

Using Grafana to Visualize Data
https://iotdb.apache.org/#/Tools/Grafana
• Install the simple-json-datasource plugin
• Configure iotdb-grafana-connector
  • application.properties
• Start iotdb-grafana-connector
  • java -jar iotdb-grafana-0.8.0.war
• Add an IoTDB data source (SimpleJson)
  • choose the connector IP
• Configure the dashboard and enjoy!

Using Matlab to Analyze Data
read IoTDB by JDBC
fast Fourier transform
plot

Using Spark to Analyze Data
create table
sql query
read TsFile
write to TsFile
https://iotdb.apache.org/#/Tools/Spark

Demo
• Writing Data Locally
• Show data with Grafana
• Analyze data using SparkSQL
• https://github.com/jixuan1989/iotdb-tutorial

Demo Video
• Writing Data on HDFS directly
• using Hive to analyze it
• Video

Language
• Written in Java
• But the RPC is implemented with Thrift
• Easy to get APIs for other languages.

Say Hi to the Apache Ecosystem
IoTDB-repository:
RocketMQ: https://github.com/apache/incubator-iotdb/tree/master/example/rocketmq
Kafka: https://github.com/apache/incubator-iotdb/tree/master/example/kafka
Third party:
EMQx (MQTT server): https://github.com/jixuan1989/iotdb-tutorial
Spark: https://github.com/jixuan1989/iotdb-tutorial
Calcite: https://github.com/EJTTianYu/iotdb-calcite
PLC4X:
MapReduce:

Application 1: The Next Generation of Big Data Platform for Meteorology
• 1073 kinds of meteorological data
• ~150K stations collect more than 100 metrics every 5 minutes
• The platform is deployed across China
• After the upgrade, performance improved by two orders of magnitude

Application 2: Data Management for Equipment Monitoring
The data records the operational status of the equipment, e.g., the vehicle’s speed, fuel consumption and malfunctions.
[Figure: Komatsu excavator data loop: collect, transfer, decision, execute]
TIANYUAN (with Komatsu)
[Table headers: #devices (excavators etc.), #metrics, collection times per minute]
Before:
• sharding every day
• only 3 months of data stored
• more than 10 minutes for some queries
After:
• all of the data stored
• several seconds for complex queries

Application 3: Shanghai METRO Monitoring
Before:
• 144 trains
• 3200 points/500 ms/train
• 9 KairosDB + Cassandra
After the upgrade:
• 300 trains
• 3200 points/200 ms/train
• ONE IoTDB instance
• 14 KDB-compatible RESTful services, kept just to avoid modifying current programs
• 414 billion data points per day using just ONE IoTDB instance
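The 414 billion points per day follows from the train count and sampling rate after the upgrade. A short check of the arithmetic, using only the slide's figures (variable names are illustrative):

```python
# Figures from the slide for the upgraded system.
trains = 300
points_per_batch = 3200          # points reported by each train per batch
batches_per_second = 1000 // 200  # one batch every 200 ms

points_per_second = trains * points_per_batch * batches_per_second
points_per_day = points_per_second * 24 * 60 * 60
print(f"{points_per_second:,} points/sec, {points_per_day:,} points/day")
```

That is 4.8 million points per second, or about 414.7 billion points per day.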

Future Works
• Make it easy to use!
  • Relational Model: Integration with Calcite
    • step 1: support relational SQL
    • step 2: standard JDBC
• Big Data!
  • better integration with Hive, etc.
• Cluster!
  • writing data on HDFS is supported now, but a shared-nothing architecture is wanted.
• Advanced functions!
  • integration with data streaming engines, etc.

Join Us
• Mailing list:
  • subscribe: dev-subscribe@iotdb.incubator.apache.org
  • discussion: dev@iotdb.apache.org
• bug report: https://issues.apache.org/jira/projects/IOTDB/issues/IOTDB
• Website: https://iotdb.apache.org
• Ecosystem target:
IoTDB v0.8.0 is released! (the first Apache release version)
