From a Student to an Apache Committer: Practice of Apache IoTDB

Apache Open Source Practice of University Students,
as Seen from Apache IoTDB
Developing Apache IoTDB:
Practice Experience from Young Students
Xiangdong Huang
Tsinghua University, Beijing, China
2019.11.09

Outline
• Who am I
• The Start
• Dream Disillusion
• A New Hope

Who am I
• Xiangdong Huang (sainthxd@gmail.com)
• Was a PhD student and PostDoc at Tsinghua University
• One of the initial committers of Apache IoTDB (incubating)

The Start
• The story began in 2011, when I knocked on the door of my supervisor’s office…
My supervisor (Jianmin Wang)
me

The Start
My supervisor (Jianmin Wang): “Xiangdong, why do you want to do a PhD at the School of Software?”
Me: “I want to develop something that is used by millions of people!”
Come on!
Build some cool software that can be used by many, many people.

As an Individual Developer
• Wrote a lot of small “tools”
• But never maintained them
• Just for fun/self-use

Developer as a Student
• Many courses
• No need to write much code (in some homework)
• Good for improving skills, but hard to get full marks (because some assignments are really hard!)
Data Mining: 100 lines?
Modern Database: innovation

The figure is from the Internet and is unrelated to the text.
Homework magic weapons:
- Bootstrap
- Django
- MySQL
A beautiful web DEMO is done

Homework magic weapons:
- Bootstrap
- Django
- MySQL
Student: “To use the demo, we can: Step 1, click.. Step 2, click.. …”
Reviewer: “What if I click here first?”
Student: “STOP! YOU CANNOT!”

We keep writing demo after demo after demo…
• Complex project management?
• Makefile? POM? Gradle?
• Agile? Scrum? Sprint?
• CI? CD?
A POM file example from Apache PLC4X

At the same time, Big Data + Apache ..
• Hadoop
• HBase
• Cassandra
• ~200k lines of code
• 2.2.0, 2.2.1, … 2.2.5
• Patch
“Please implement some functions.”
“Ah, Hadoop + Hive can do that! Let me download it.”
“Oops, an exception!”
“Why can Cassandra update so frequently?”
“Wow, someone shared a patch file to fix a bug!”
Yes, you are growing! You now know JIRA, etc.

• When can I stop writing demos and build real software like Apache Cassandra, Hadoop, etc.?

A New Hope
• Be active in an existing open source community
• Hadoop, Cassandra, Spark etc..
• Be active in a new open source community
• IoTDB etc..

Time series data is everywhere
Wearable devices, autonomous driving

A good DB can improve the whole process
(Figure: data pipeline with the stages: save data locally, Network, MQ, Database (insertion, query), Network, analysis.)

And no good software
RDB
• Limited number of columns: 1600 columns in a table
• Limited number of rows: <=10M rows is better
• Manual sharding
KVDB
• Support big data
• Limited queries
- Lack time filtering
- Lack value filtering
- Lack multiple time series alignment
TSDB
• LSM based
- Efficient file structure
- More query functions
- Not optimized for some application scenarios
• Based on PG
- Auto sharding
- Query optimization
- Performance degrades sharply after writing data for a long time
• HBase/Cassandra based
- Partition by TS-UID and time range
- Storage inefficiency
- Limit of queries

Do it ourselves
Supervisor: “Let’s develop a time series DB!”
Students: “Can we?”
Supervisor: “You can! And you can do it in an open source way.”
And then learn a lot…

1. Teamwork
• Git with a team of 10+ people
• Commit log
• Conflict, merge, squash…
• Branches… (dev, release, stable…)
Let your software grow to >= 100K lines.

2. Learn skills
• Project structure
Make your software powerful.

3. Stability/Agile
• CI/CD
• Jenkins, Travis CI
Make your software something that can really, really be used.

4. Open your mind
• CI/CD
• Jenkins, Travis CI
• Issue -> PR -> Release
Open your minds.
Improve your communication skills.

5. Research and Project
• User requirements -> Implementation -> IoTDB -> User
• Idea -> Implementation -> IoTDB -> Evaluation -> Paper -> User
• Paper -> Implementation -> IoTDB -> Evaluation -> User

OK….
• Past
• I can write a demo
• I like to write something
• I like to write something used by myself
• Now
• I/We know how to write complex software
• I/We know how to write software used by people

Do it ourselves
• Learn a lot about how Apache projects are developed!
• How is the website of an Apache project built?
• Who can be a committer of an Apache project?
• How to release projects?
• Who decides the new features of an Apache project?
• Etc..

Time Series DB for Industrial Internet
“清华数为” time series database --> Apache IoTDB (incubating)
• Apache IoTDB (incubating) is a highly efficient database for managing time series data, especially in Industrial Internet applications.
• A young community. Donated by Tsinghua University; entered the Apache Incubator on 2018-11-18.
• Devoted to building the best time series database (in the IoT area) in the world.
• Apache IoTDB v0.8.1 is released! v0.9.0 is coming!

Concepts in IoTDB (The Schema)
Device (i.e., Data source)
• A machine instance
Measurement (e.g., sensor)
• A device can have many measurements
Time Series
• Device + Measurement
• is represented as a path that begins with root, like
“root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain”
Storage Group (SG)
• A storage group can have many devices
• Storage groups have independent resources (threads and files) to increase parallelism and reduce lock contention.
Cadillac XT5

The schema mapping
root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain
root.Cadillac_XT5.USA.CA.7BTC409.speed
root.Cadillac_XT5.USA.NV.6BAC321.speed
Table Name: Cadillac_XT5 (called a Measurement in InfluxDB)
country  state  device name  timestamp  fuelRemain  speed
USA      CA     7BTC409      t1         5.0         120
USA      CA     7BTC409      t2         4.9         109
USA      NV     6BAC321      t1         NULL        50
USA      NV     6BAC321      t3         NULL        65
The columns correspond to Tags and Fields in InfluxDB, KairosDB, OpenTSDB…
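The mapping from paths to table rows can be sketched in a few lines of Python. This is an illustration only, not IoTDB code: the helper names `path_to_row` and `group_by_device` are hypothetical, and the assumption that the levels between the table name and the measurement are exactly country, state, and device name is taken from this one example.

```python
from collections import defaultdict

# Tag names taken from the example table; other schemas may use other levels.
TAG_NAMES = ("country", "state", "device name")


def path_to_row(path):
    """Split a path into (table name, tag columns, measurement column)."""
    parts = path.split(".")
    # Expected shape: root . table . <one level per tag> . measurement
    if parts[0] != "root" or len(parts) != len(TAG_NAMES) + 3:
        raise ValueError(f"unexpected path: {path}")
    table = parts[1]
    tags = dict(zip(TAG_NAMES, parts[2:-1]))
    measurement = parts[-1]
    return table, tags, measurement


def group_by_device(paths):
    """Collect the measurement columns that each device contributes."""
    devices = defaultdict(list)
    for path in paths:
        table, tags, measurement = path_to_row(path)
        devices[(table, *tags.values())].append(measurement)
    return dict(devices)


paths = [
    "root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain",
    "root.Cadillac_XT5.USA.CA.7BTC409.speed",
    "root.Cadillac_XT5.USA.NV.6BAC321.speed",
]
for device, columns in group_by_device(paths).items():
    print(device, columns)
```

A device that has no `fuelRemain` series, like 6BAC321 here, is what produces the NULL cells in the table view.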

SQL in IoTDB
Set Storage Group
SET STORAGE GROUP TO root.laptop.d1.s1;
Create Timeseries
CREATE TIMESERIES root.laptop.d1.s1 WITH DATATYPE=INT32, ENCODING=RLE;
Insert Data
INSERT INTO (d1.s1,d1.s2,time) VALUES (1000,2000,14735235234);
Delete Data
DELETE FROM d1.s1 WHERE time < 1000;
Update Data
UPDATE d1.s1 SET VALUE = 2000 WHERE time < 2000 and time > 1000;
Query Data (Filter, Aggregation, Group by time interval)
SELECT d1.s1,d2.* FROM BJ.WF1 WHERE d1.s1 < 2000 and d2.s2 > 1000 and freq(d2.s3) > 0.5;
SELECT count(status), max_value(temperature) from root.ln.wf01.wt01;
SELECT count(status) from root.ln.wf01.wt01 group by(1h, [2017-11-03T00:00:00, 2017-11-03T23:00:00]);
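What the group-by-time-interval query computes can be illustrated without a database. The sketch below assumes that the clause splits the given range into consecutive 1-hour buckets starting at the range start and counts the points in each bucket; the function name `count_by_interval` and the sample points are hypothetical, and the exact boundary semantics in IoTDB may differ.

```python
from datetime import datetime, timedelta


def count_by_interval(points, start, end, step):
    """Count points per bucket [t, t + step) for buckets covering [start, end)."""
    counts = {}
    t = start
    while t < end:
        counts[t] = 0
        t += step
    for ts, _value in points:
        if start <= ts < end:
            # Align the timestamp to the start of its bucket.
            bucket = start + ((ts - start) // step) * step
            counts[bucket] += 1
    return counts


start = datetime(2017, 11, 3, 0, 0, 0)
end = datetime(2017, 11, 3, 23, 0, 0)
step = timedelta(hours=1)

# Hypothetical "status" points as (timestamp, value) pairs.
points = [
    (datetime(2017, 11, 3, 0, 10), True),
    (datetime(2017, 11, 3, 0, 50), False),
    (datetime(2017, 11, 3, 1, 30), True),
    (datetime(2017, 11, 3, 23, 30), True),  # outside the queried range
]

for bucket, n in count_by_interval(points, start, end, step).items():
    if n:
        print(bucket.isoformat(), n)
```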

Supported data types
• Boolean
• Int
• Long
• Float
• Double
• String
• GPS (TODO) -> for trajectory data management
• Array (TODO) -> for unstructured data management

TsFile: Zip File Born for Time Series Data
Columnar Store
- Reduce Disk I/O
- Improve Compression
Compression & Encoding
- Improve Compression Greatly
- 15% Better than InfluxDB in Real Applications
Time-domain Statistics Info Natively
- Support Fast Query in
  - Time Domain
  - Value Domain
  - Freq Domain (TODO)
detailed specification:
http://iotdb.apache.org/#/Documents/0.8.0/chap7/sec3
https://cwiki.apache.org/confluence/display/IOTDB/TsFile+Format

TsFile: Encoding and Compression
TS2Diff encoding – Int or Long (timestamps)
• 128, 136, 144, 152, 160, …
  1st difference: 8, 8, 8, 8 (constant)
  2nd difference: 0, 0, 0 (1-bit storage needed!)
• 128, 135, 143, 154, 163, …
  1st difference: 7, 8, 11, 9 (not constant)
  2nd difference: 1, 3, -2 (2-bit storage needed!)
• Unified support of fixed-frequency and irregular-frequency time series
Adaptive Delta encoding – Int or Long (TODO)
• An adaptive enhancement of TS2Diff.
• See next page.
Gorilla encoding – Float or Double
• XOR consecutive data points
• Store with variable length encoding scheme
RLE encoding – repeated Int or Long
• For repeated sequence: store a value and its count
Bit-Packing encoding – Int or Long
• Store data in compact form
• Squeeze out wasteful bits
Compression Algorithm
• Snappy
• Gzip (TODO)
• LZO (TODO)
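The two worked sequences on this slide can be reproduced with a small Python sketch of second-order differencing. It only computes the differences and shows that they are enough to rebuild the input; it is not the actual TS2Diff implementation, which also packs the values into bits, and the function names are hypothetical.

```python
def diffs(values):
    """Differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]


def second_order_encode(values):
    """Return (first value, first delta, second-order deltas).

    Needs at least two input values.
    """
    first_order = diffs(values)
    return values[0], first_order[0], diffs(first_order)


def second_order_decode(first_value, first_delta, second_order):
    """Rebuild the original sequence from the encoded triple."""
    values = [first_value, first_value + first_delta]
    delta = first_delta
    for dd in second_order:
        delta += dd
        values.append(values[-1] + delta)
    return values


regular = [128, 136, 144, 152, 160]
irregular = [128, 135, 143, 154, 163]

print(second_order_encode(regular))    # (128, 8, [0, 0, 0])
print(second_order_encode(irregular))  # (128, 7, [1, 3, -2])
```

For a fixed-frequency series the second-order deltas are all zero, which is why so few bits per point are needed.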

Adaptive TS2Diff encoding – Int or Long (TODO)
• For time series with outliers or missing points
• Storing second-order delta values and a boolean flag array.
TsFile: Encoding and Compression

Time Series Specific Operations (TODO)
Pattern Matching for Streaming Time Series Data
• Split the pattern and data stream into equal length fragments
• Extract features to reduce the dimension
• Accelerate the search by using features
• Scenario: fault alarm in real time
SELECT wind_3s FROM china.farm1.tb2
WHERE time > t1 AND time < t2
AND wind_3s LIKE PATTERN(7.2,..,20.3,..,6.0)
Similarity Search of Sub-series
• Indexing data using Key-Value form
• Scenarios:
  - Outlier detection
  - Historical data analysis
  - …
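The three pattern-matching steps (split into equal-length fragments, extract features, use the features to speed up the search) can be sketched as follows. The slide does not say which features are used, so this sketch assumes the mean of each fragment and a Euclidean distance threshold; all names are hypothetical and this is not the planned IoTDB implementation.

```python
def fragment_means(series, frag_len):
    """Split a series into equal-length fragments and keep each fragment's mean."""
    n = len(series) // frag_len
    return [
        sum(series[i * frag_len:(i + 1) * frag_len]) / frag_len
        for i in range(n)
    ]


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def find_matches(stream, pattern, frag_len, threshold):
    """Return start offsets where a stream window is within threshold of the pattern."""
    m = len(pattern)
    pattern_features = fragment_means(pattern, frag_len)
    matches = []
    for start in range(len(stream) - m + 1):
        window = stream[start:start + m]
        window_features = fragment_means(window, frag_len)
        # sqrt(frag_len) * feature distance never exceeds the full distance,
        # so windows rejected here cannot be true matches.
        if euclidean(window_features, pattern_features) * frag_len ** 0.5 > threshold:
            continue
        # Only the surviving candidates are compared point by point.
        if euclidean(window, pattern) <= threshold:
            matches.append(start)
    return matches


stream = [0, 0, 0, 0, 7, 8, 20, 21, 6, 5, 0, 0]
pattern = [7, 8, 20, 21, 6, 5]
print(find_matches(stream, pattern, frag_len=2, threshold=1.0))
```

The feature comparison touches only one value per fragment, so most windows are discarded before the full comparison runs.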

From Edge to Cloud: Run IoTDB Everywhere
TsFile (a component of IoTDB): a zip file of time series
• Time series data files: high-performance writes, high compression ratio, support for simple queries. Simply put, TsFile is a zip file for time series data.
• Suitable for embedded devices, general servers, data centers, etc.
IoTDB: a database of time series
• Freely operate on time series across multiple TsFiles, including CRUD and advanced queries like max, min, avg and temporal alignment.
• Scene: embedded equipment, on-site industrial computers, general servers, etc.
3rd Systems: a data warehouse of time series
• Easy to use and integrate for complex analysis (data fusion, collaborative recommendation, machine learning)
• Scene: cloud data center

A Process to Manage Time Series Data
Ingestion: data source -> JDBC / Session API
Connectors: Grafana-Adaptor, Spark-TsFile-Adaptor, JDBC
Uses:
• Analysis with Big Data Framework (big data set)
• Analysis with Matlab (small data set)
• Visualization (manual data exploration)
https://github.com/jixuan1989/iotdb-tutorial

Latest version v0.8 (0.9.0-snapshot)
Apache IoTDB-incubating v0.9.0-SNAPSHOT
Xeon E5v4
256G Mem
HDD Disk
Setting    #Client  #Storage Group  #Device  #Measurement per Device  DataType  Encoding  Compression  BatchSize  #Point per Time Series
Insertion  10       50              1000     100                      Float     RLE       Snappy       100        100000
Query      50       1               1        10                       Float     RLE       Snappy       100        100000000

Compression
Xeon E5v4
256G Mem
HDD Disk
Raw data:
- 12 Bytes per point
- 112 GB in total
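The raw data size can be checked with simple arithmetic. The sketch below assumes that the compression experiment uses the insertion setting from the benchmark table (1000 devices, 100 measurements per device, 100000 points per time series); the slide does not state this explicitly.

```python
BYTES_PER_POINT = 12

# Assumed to match the insertion setting of the benchmark.
devices = 1000
measurements_per_device = 100
points_per_series = 100_000

total_points = devices * measurements_per_device * points_per_series
raw_bytes = total_points * BYTES_PER_POINT
raw_gib = raw_bytes / 2**30

print(f"{total_points:,} points")        # 10,000,000,000 points
print(f"{raw_gib:.1f} GiB of raw data")  # 111.8 GiB of raw data
```

Under that assumption, 10 billion points at 12 bytes each come to about 112 GB when counted in units of 2^30 bytes, which agrees with the figure on the slide.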

Write Performance: points/s (single node)
Xeon E5v4
256G Mem
HDD Disk
* In this experiment, we do not use IoTDB’s JDBC API and SQL interface.
Instead, we use a raw API, similar to Cassandra’s raw Thrift API.

Query Performance: aggregation count()
InfluxDB failed to return
any answers in the
100,000,000 setting.
Xeon E5v4
256G Mem
HDD Disk

Shanghai METRO Monitoring
Before the upgrade:
• 144 trains
• 3200 points/500 ms/train
• 9 KairosDB + Cassandra
After the upgrade:
• 300 trains
• 3200 points/200 ms/train
• 14 KDB-compatible RESTful services, just to avoid modifying current programs
• ONE IoTDB instance
• 414 billion data points per day, using just ONE IoTDB instance
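The daily volume follows directly from the number of trains and the sampling rate. A short check, using only the figures from this slide (the function name is hypothetical):

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def points_per_day(trains, points_per_batch, interval_ms):
    """Total points written per day when every train sends one batch per interval."""
    batches_per_day = MS_PER_DAY // interval_ms
    return trains * points_per_batch * batches_per_day


before = points_per_day(trains=144, points_per_batch=3200, interval_ms=500)
after = points_per_day(trains=300, points_per_batch=3200, interval_ms=200)

print(f"before: {before:,}")  # before: 79,626,240,000
print(f"after:  {after:,}")   # after:  414,720,000,000
```

The result after the upgrade is about 414.7 billion points per day, matching the 414 billion quoted on the slide and roughly five times the earlier load.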

Join Us
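• Mailing list:
• subscribe: dev-subscribe@iotdb.incubator.apache.org
• discussion: dev@iotdb.apache.org
• Both Chinese and English are welcome! (English recommended)
• bug report: https://s.apache.org/iotdb-issues
• Both Chinese and English are welcome! (English recommended)
• Website: https://iotdb.apache.org
DingTalk user group
Official website
Building the IoTDB community:
• Invite more developers/users/students to join the community and grow together
• One of the best choices for undergraduate graduation projects and graduate internships!
• Students/developers from outside Beijing are welcome (invited to attend >=1 Beijing meetup)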
