31. Data Writing
Raw data (in arrival order):
d1, s1, t1, v
d1, s1, t3, v
d1, s1, t2, v
d1, s1, t4, v
d1, s1, t_, v
In-memory data:
d1 -> s1: (t1, v), (t3, v), (t2, v), (t4, v)
d2 -> s1: (t1, v), (t2, v)
d2 -> s2: (t1, v), (t2, v)
Figure labels: Disk, cleaned, memory
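A large amount of out-of-order data exists
Out-of-order delays range from 0 to 300 minutes
Most out-of-order data arrives less than 30 minutes late
• WritableMemChunk tolerates a certain degree of disorder and does not sort on write
• Data that is too old is placed in the memtable reserved for overflow data
* Some systems use a B-tree structure in memory
• When flushing to disk or serving a query, the data is copied and sorted
• CopyOnRead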
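The write path described above can be sketched as follows. This is a minimal illustration, not IoTDB's implementation: the class names `UnsortedChunk` and `SeriesBuffer` are hypothetical, and the rule used for "too old" (a timestamp at or before the last flushed timestamp) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class UnsortedChunk:
    """Append-only list of (timestamp, value) pairs; never sorted in place."""
    points: list = field(default_factory=list)

    def write(self, timestamp, value):
        self.points.append((timestamp, value))  # arrival order is kept

    def sorted_copy(self):
        # Copy-on-read: sort a copy, leave the write buffer untouched.
        return sorted(self.points, key=lambda p: p[0])


class SeriesBuffer:
    """Hypothetical per-series buffer with a separate overflow chunk."""

    def __init__(self, flushed_up_to):
        # Assumption: flushed_up_to is the largest timestamp already on disk.
        self.flushed_up_to = flushed_up_to
        self.chunk = UnsortedChunk()
        self.overflow = UnsortedChunk()

    def write(self, timestamp, value):
        # Points at or before the flushed boundary count as "too old".
        too_old = timestamp <= self.flushed_up_to
        target = self.overflow if too_old else self.chunk
        target.write(timestamp, value)


buf = SeriesBuffer(flushed_up_to=10)
for t in (11, 13, 12, 14, 5):
    buf.write(t, f"v{t}")

print(buf.chunk.points)         # arrival order
print(buf.chunk.sorted_copy())  # time order, produced only when read
print(buf.overflow.points)      # the stale point
```

Appending without sorting keeps each write cheap; the sorting cost is paid once, on a copy, when the data is flushed or queried.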
Raw data (d2):
d2, s1, t1, v
d2, s1, t2, v
d2, s2, t1, v
d2, s2, t2, v
33. Flushing to Disk
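Figure labels: three task queues (MC M M; MC M; M M M M) for files F1, F2, F3, served by a ThreadPool
Parallel and serial execution work together:
• Multiple threads run in parallel
• All M tasks of a single file execute serially, preserving order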
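A minimal sketch of this scheduling idea, assuming one pool task per file: files are flushed in parallel, while the memtables of any one file are handled strictly in queue order. The function `flush_file` and the `pending` mapping are hypothetical stand-ins, not IoTDB APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def flush_file(file_name, memtables):
    """Flush one file's memtables one after another, in queue order."""
    written = []
    for memtable in memtables:              # serial within a file
        written.append(f"{file_name}:{memtable}")
    return written


# Hypothetical pending work: file name -> memtables in arrival order.
pending = {
    "F1": ["M1", "M2", "M3"],
    "F2": ["M1", "M2"],
    "F3": ["M1", "M2", "M3", "M4"],
}

with ThreadPoolExecutor(max_workers=3) as pool:   # parallel across files
    futures = {name: pool.submit(flush_file, name, tables)
               for name, tables in pending.items()}
    results = {name: future.result() for name, future in futures.items()}

print(results["F1"])
```

Because each file is owned by exactly one task at a time, no locking is needed to keep its memtables in order.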
WritableMemChunk: increasing task-scheduling parallelism
Steps per chunk: 1 sort, 2 encode, 3 I/O
Serial: 1 2 3 1 2 3 1 2 3
Pipelined:
1 2 3
  1 2 3
    1 2 3
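The sort, encode, I/O pipeline can be sketched with one thread per stage connected by FIFO queues, so that one chunk is being encoded while the next is being sorted. This is an illustrative sketch only: the encoder is a plain-text stand-in and the "disk" is a Python list.

```python
import queue
import threading

DONE = object()  # sentinel that tells the next stage to stop


def stage(func, inbox, outbox):
    """Apply func to every item from inbox and pass the result on."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(func(item))


def sort_points(chunk):          # step 1: sort by timestamp
    return sorted(chunk)


def encode(chunk):               # step 2: stand-in encoder (plain text)
    return ";".join(f"{t},{v}" for t, v in chunk).encode()


disk = []                        # step 3: stand-in for file I/O


def write(blob):
    disk.append(blob)
    return blob


q1, q2, q3, q4 = (queue.Queue() for _ in range(4))
workers = [
    threading.Thread(target=stage, args=(sort_points, q1, q2)),
    threading.Thread(target=stage, args=(encode, q2, q3)),
    threading.Thread(target=stage, args=(write, q3, q4)),
]
for worker in workers:
    worker.start()

for chunk in ([(3, "c"), (1, "a"), (2, "b")], [(9, "z"), (8, "y")]):
    q1.put(chunk)
q1.put(DONE)

for worker in workers:
    worker.join()

print(disk)  # chunks land in submission order, each sorted by time
```

With a single thread per stage and FIFO queues between stages, chunks reach the I/O step in the order they were submitted, so the pipelining does not break write order.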
34. Data Files: Problems with Existing File Formats
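Apache Parquet
Figure labels: Time1 Value1, Time2 Value2
• Domain semantics
  - The time column is always read
    (otherwise the time dimension is missing)
  - Data is always sorted by time
  - Statistics matter a great deal
    (max/min time/value, for accelerating queries)
• Null values
  - Parquet must store null values so that results can be returned row by row
  - Parquet has to add R (repetition level) and D (definition level) fields to support nesting, which is meaningless for time series data.
A general-purpose file format for time series data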
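The null-value point can be illustrated with a small example: when series sampled at different times are forced into shared rows, every timestamp that one series lacks becomes a null cell. This sketches the row-alignment problem only; it does not reproduce Parquet's actual storage layout, and the data is made up.

```python
def align_rows(series):
    """Merge per-series {timestamp: value} maps into time-ordered rows."""
    names = sorted(series)
    timestamps = sorted({t for points in series.values() for t in points})
    return [(t, *[series[n].get(t) for n in names]) for t in timestamps]


# Two hypothetical series sampled at different times.
series = {
    "s1": {1: 10.0, 2: 10.5, 4: 11.0},
    "s2": {3: 7.0},
}

rows = align_rows(series)
nulls = sum(cell is None for row in rows for cell in row[1:])
print(rows)
print(nulls)
```

Here four real values are stored alongside four nulls. Giving each series its own time column, as the Time1/Value1 and Time2/Value2 labels suggest, avoids the padding.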
36. Data File: TsFile
TsFile layout (from start to end of file, as labeled in the figure):
Magic String: 12 bytes
ChunkGroup d1: [marker, Chunk Header, Chunk data] x 3, then marker, ChunkGroup Footer
ChunkGroup d2: [marker, Chunk Header, Chunk data] x 3, then marker, ChunkGroup Footer
marker
TsDeviceMetadata(d1)
TsDeviceMetadata(d2)
TsFileMetaData(d1,d2)
TsFileMetadata Size
Magic String: 12 bytes
37. Data File: TsFile
Self-describing
Self-repairing
Vectorized reads
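To make the role of the trailing metadata size and magic string concrete, here is a toy file format that borrows the same overall shape: magic, marked chunks, metadata, metadata size, magic. Everything in it is hypothetical (the magic value, the marker bytes, the text-based index); it is not the real TsFile byte layout.

```python
import io
import struct

MAGIC = b"TOYFILE_v001"   # 12 bytes, hypothetical; not the real TsFile magic
CHUNK_MARKER = b"\x01"     # hypothetical marker values
META_MARKER = b"\x02"


def write_toy_file(chunks):
    """chunks: list of (device, payload bytes). Returns the file bytes."""
    out = io.BytesIO()
    out.write(MAGIC)
    index = []
    for device, payload in chunks:
        index.append((device, out.tell(), len(payload)))
        out.write(CHUNK_MARKER + struct.pack(">I", len(payload)) + payload)
    out.write(META_MARKER)
    meta = ";".join(f"{d}:{off}:{n}" for d, off, n in index).encode()
    out.write(meta)
    out.write(struct.pack(">I", len(meta)))   # metadata size
    out.write(MAGIC)                          # tail magic
    return out.getvalue()


def read_index(data):
    """Locate the metadata from the tail; fail if the file is incomplete."""
    if len(data) < 28 or data[:12] != MAGIC or data[-12:] != MAGIC:
        raise ValueError("missing magic: incomplete or not a toy file")
    (meta_size,) = struct.unpack(">I", data[-16:-12])
    meta = data[-16 - meta_size:-16].decode()
    return [tuple(entry.split(":")) for entry in meta.split(";")]


data = write_toy_file([("d1", b"abc"), ("d2", b"hello")])
print(read_index(data))
```

A reader needs nothing outside the file to find the index, which is one sense of "self-describing". A missing tail magic signals an interrupted write; in such a layout, the per-chunk markers would let a repair tool scan forward from the head and keep the chunks that were written completely.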
38. Encoding
128, 136, 144, 152, 160, …
8, 8, 8, 8 → the 1st difference is constant.
0, 0, 0 → the 2nd difference needs only 1 bit of storage!
128, 135, 143, 154, 163, …
7, 8, 11, 9 → the 1st difference is not constant.
1, 3, -2 → the 2nd difference needs only 2 bits of storage!
Unified support for fixed-frequency and irregular-frequency time series
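The differencing in the two examples above can be reproduced in a few lines. This sketch covers only the second-order difference and its lossless reconstruction; it is not the TS2Diff codec, and the helper names `diffs` and `decode` are hypothetical.

```python
def diffs(values):
    """Differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]


def decode(first, first_delta, second):
    """Rebuild the sequence from its first value, first delta and 2nd diffs."""
    out, delta = [first, first + first_delta], first_delta
    for d in second:
        delta += d
        out.append(out[-1] + delta)
    return out


regular = [128, 136, 144, 152, 160]
irregular = [128, 135, 143, 154, 163]

for ts in (regular, irregular):
    d1 = diffs(ts)
    d2 = diffs(d1)
    print(d1, d2, max(abs(d) for d in d2))
    assert decode(ts[0], d1[0], d2) == ts   # nothing is lost
```

In both cases the second differences are far smaller than the raw timestamps, which is why the same scheme serves fixed-frequency and irregular series alike.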
TS2Diff encoding - optimized for timestamps
Bitmap encoding - for enum-type values
RLE encoding - for consecutively repeated values
BitPacking encoding - for squeezing out wasted bits when storing switch values
Gorilla encoding
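Two of the listed ideas are simple enough to show directly. The sketch below gives a textbook run-length encoder and a one-bit-per-value packer for switch (0/1) values; these are generic illustrations under assumed data, not IoTDB's encoder implementations.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encoding: [(value, run length), ...]."""
    return [(v, len(list(run))) for v, run in groupby(values)]


def rle_decode(pairs):
    return [v for v, n in pairs for _ in range(n)]


def pack_bits(flags):
    """Pack 0/1 switch values eight per byte, most significant bit first."""
    out = bytearray()
    for i in range(0, len(flags), 8):
        group = flags[i:i + 8]
        byte = 0
        for flag in group:
            byte = (byte << 1) | flag
        byte <<= 8 - len(group)     # left-align a short final group
        out.append(byte)
    return bytes(out)


def unpack_bits(data, count):
    bits = [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]
    return bits[:count]


status = [5, 5, 5, 5, 2, 2, 9]
switches = [1, 0, 0, 1, 1, 1, 0, 0, 1, 0]

print(rle_encode(status))          # [(5, 4), (2, 2), (9, 1)]
print(pack_bits(switches).hex())   # 9c80
```

Ten switch values fit in two bytes here, whereas storing each as a full integer would waste all but one bit per value.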