This is not a contribution
FiloDB:
Reactive, Real-time, In-Memory 

Time Series at Scale
Evan Chan (@evanfchan)
Apple
October 2018
Solving The Time
Series Problem
Operational Metrics
Requirements
• Massive scale, billions of metrics
• Resiliency and maximum uptime
• Real time (seconds, not minutes)
• Low latency querying
• High concurrency (thousands of dashboards, alerts)
• Easy debugging - flexible ad-hoc queries
What Users Wanted
• Flexible data model and queries, tag-based querying
• User-defined “tags” on metrics and data
• Prevents abuse of hierarchical system
• Can query across regions, other boundaries
• Flexible rollups
• Longer views of fine grained data
• or flexible retention policies
Design for the Cloud
• Internal cloud @Apple similar to public cloud
• Containers and “stateless” apps
• Use of Docker, etc. promotes more containers = more metrics
• Stateless = more frequent restarts, more UUIDs => more
metrics
• Leverage hosted cloud services
• Hosted Cassandra, Kafka, other data services
• Let someone else manage persistent storage
Where are we going?
[Diagram: Events, Metrics, and Tracing feed Dashboards and Real-time Debugging today, moving toward Real-time ML/AI and Actionable Insights.]
(Re)Introducing FiloDB
A Prometheus-compatible, Distributed, In-Memory
Time Series Database
OPEN SOURCE!
http://www.github.com/filodb/FiloDB
Built on the proven reactive SMACK stack.
Core Principles
• Designed for Cloud Infrastructure
• Built for Scale and Resiliency
• Flexible Data Model
• Multi-Tenant
Proudly built on the
Reactive Stack
In-Memory Time Series
Facebook Gorilla
• Keep most recent time series data IN MEMORY,
stored using efficient time series encoding
techniques
• Serve queries using separate process
• Allows dense, massively scalable TS storage +
very fast, rich queries of recent data
• https://github.com/facebookarchive/beringei
Operational Metrics Flow
[Diagram: Apps emit metrics via a Collector and Gateways into Kafka; FiloDB Nodes ingest from Kafka and persist to Cassandra; Dashboards and Real-time Debugging query FiloDB over HTTP.]
Data Flow on a Node
[Diagram: incoming Records pass through Monix / RX ingestion with back pressure into per-shard state (Shard 0, Shard 1): an Index plus Write buffers, which Encoding turns into compressed Chunks.]
Columnar Compression
Row-based: timestamp, value | timestamp, value | … (fields interleaved per sample)
Column-based: t1 t2 t3 t4 t5 t6 t7 t8 and v1 v2 v3 v4 v5 v6 v7 v8 (each column stored together)
• Compressing all timestamps together is much
more efficient
Delta-Delta Encoding
• Encode increasing numbers (timestamps,
counters) as deltas from the slope (delta-of-deltas)
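The idea can be sketched in Python (illustrative only; FiloDB's real encoders are Scala and bit-pack the output, but the arithmetic is the same):

```python
# Illustrative delta-of-delta codec. For regularly spaced samples
# the encoded output is mostly zeros, which compresses extremely well.
def dod_encode(timestamps):
    if not timestamps:
        return []
    out = [timestamps[0]]                # first value stored raw
    prev, prev_delta = timestamps[0], 0
    for t in timestamps[1:]:
        delta = t - prev
        out.append(delta - prev_delta)   # deviation from the running slope
        prev, prev_delta = t, delta
    return out

def dod_decode(encoded):
    if not encoded:
        return []
    out, prev_delta = [encoded[0]], 0
    for dod in encoded[1:]:
        prev_delta += dod
        out.append(out[-1] + prev_delta)
    return out

# Samples every 10s with one late arrival: almost all zeros.
print(dod_encode([1000, 1010, 1020, 1030, 1041]))   # [1000, 10, 0, 0, 1]
```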
Results
• Millions of time series and billions of samples per
node
• Up to 1 million samples/sec per node ingestion rate
peak (measured during recovery)
• Up to 8x better than previous system (Storm/HBase)
• Storage density of ~3 bytes per metric sample
• About 10x better than previous system (HBase)
Tackling Heap Issues
• 60+ second GC pauses / OOM
• Filled up old gen, GC stuck finding tiny bit of free space
• Solution: move as many permanent objects offheap as
possible
• Too high rate of allocation on ingest
• Temporary objects only, but producing too many
• Solution: Switch from Protobuf to custom, no-allocation
BinaryRecord
Off-heap Data Structures
• BinaryVector - one compressed column of
data (say timestamps, or values)
• BinaryRecord - one ingestion data record,
variable schema
• OffheapLFSortedIDMap - offheap lightweight
sorted map
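The BinaryRecord idea of writing fields into a flat, preallocated buffer instead of allocating per-record objects can be sketched in Python (a hypothetical two-field layout for illustration, not FiloDB's real format):

```python
import struct

# Hypothetical flat layout: one int64 timestamp + one float64 value per
# record, packed in place into a preallocated buffer. No per-record or
# per-field objects are created on append, mimicking the BinaryRecord idea.
REC = struct.Struct("<qd")

class FlatRecordBuffer:
    def __init__(self, capacity):
        self.buf = bytearray(capacity * REC.size)
        self.count = 0

    def append(self, ts, value):
        REC.pack_into(self.buf, self.count * REC.size, ts, value)
        self.count += 1

    def read(self, i):
        return REC.unpack_from(self.buf, i * REC.size)

buf = FlatRecordBuffer(capacity=1024)
buf.append(1_500_000_000, 42.5)
buf.append(1_500_000_010, 43.0)
print(buf.read(1))   # (1500000010, 43.0)
```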
Moving Object Graphs Offheap
[Diagram, before: an on-heap TSPartition holds a ConcurrentSkipListMap of ID → ChunkSetInfo entries, a WriteBufferObject, and VectorObjects; only the write buffer and chunk Blocks sit offheap.]
Moving Object Graphs Offheap
[Diagram, after: TSPartition keeps little more than a PartID on heap; the ChunkMap of ChunkSetInfo entries and the Ptr pointers to write buffers and chunk Blocks all move offheap, replacing the on-heap ConcurrentSkipListMap and VectorObjects.]
Flexible Distributed
Queries
Prometheus Compatible
• Don’t reinvent a popular time series query language
• Prom HTTP API gives out of box Grafana support
sum(http_requests{partition="P2",dc="DC0",job="A0"}) by (host)
• Filtering/indexing on many time series
• Time windowing-based aggregation with multiple windows
• Group by
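What `sum(...) by (host)` computes can be sketched in Python (an illustrative model of the query semantics, not FiloDB code):

```python
from collections import defaultdict

def sum_by(samples, by):
    """Aggregate instant samples the way `sum(...) by (host)` would:
    group series by the `by` labels, then sum values per group.
    `samples` is a list of (labels_dict, value) pairs for one instant."""
    groups = defaultdict(float)
    for labels, value in samples:
        key = tuple((k, labels.get(k, "")) for k in by)
        groups[key] += value
    return dict(groups)

samples = [
    ({"__name__": "http_requests", "host": "a", "dc": "DC0"}, 3.0),
    ({"__name__": "http_requests", "host": "a", "dc": "DC1"}, 2.0),
    ({"__name__": "http_requests", "host": "b", "dc": "DC0"}, 5.0),
]
print(sum_by(samples, by=["host"]))   # one summed value per host
```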
Queries to Logical Plan
sum(http_requests{partition="P2",dc="DC0",job="A0"}) by (host)
AST:
Aggregate(Sum,
  PeriodicSeries(
    RawSeries(
      IntervalSelector(t1, t2, step),
      List(ColumnFilter(partition, Equals(P2)),
           ColumnFilter(dc, Equals(DC0)),
           ColumnFilter(job, Equals(A0)),
           ColumnFilter(__name__, Equals(http_requests))))))
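The plan tree above can be modeled with a few stand-in classes (a Python sketch; the names mirror the slide, but the real FiloDB plan nodes are Scala and carry more fields):

```python
from dataclasses import dataclass, field

# Minimal stand-ins for the logical plan nodes named on the slide.
@dataclass
class ColumnFilter:
    column: str
    value: str            # Equals(...) collapsed to a plain value here

@dataclass
class RawSeries:
    start: int
    end: int
    step: int
    filters: list

@dataclass
class PeriodicSeries:
    child: RawSeries

@dataclass
class Aggregate:
    op: str
    child: PeriodicSeries
    by: list = field(default_factory=list)

plan = Aggregate(
    op="sum",
    by=["host"],
    child=PeriodicSeries(RawSeries(
        start=0, end=600, step=60,
        filters=[ColumnFilter("partition", "P2"),
                 ColumnFilter("dc", "DC0"),
                 ColumnFilter("job", "A0"),
                 ColumnFilter("__name__", "http_requests")])))
print(plan.op, plan.by, len(plan.child.child.filters))
```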
Physical Plan Execution
• Location transparency of Akka actors is crucial here
[Diagram: a ReduceAggregateExec (Sum) root fans out to per-shard (Shard 0, Shard 1) SelectRawPartitionsExec plans, each reading Chunks through PeriodicSamplesMapper and AggregateMapReduce transformers.]
Physical Plan Execution
• Look ma, plan change, no code changes!
[Diagram: the same per-shard (Shard 0, Shard 1) SelectRawPartitionsExec subplans, but the ReduceAggregateExec (Sum) root now runs in a separate Query Service.]
Actor Hierarchy
[Diagram: on each node (Node 1, Node 2), a NodeCoordinatorActor fronts an IngestionActor and a QueryActor over the MemStore; clients connect via HTTP or CLI / Akka Remote.]
Comparisons
• Queries possible on FiloDB and not on old system:
• Tag-based querying (filter, group by, etc. based on
flexible tags)
• Histograms and quantiles
• Group by and topK queries
• Flexible time series joins
• 100s of millions of samples queried/sec
Datasets and Data
Model
What Kind of Data Works?
• High cardinality of individual time series (operational metrics,
devices, business metrics)
• Many data points in each series, append only
[Diagram: Series1 {k1=v1, k2=v2}, Series2 {k1=v3, k2=v4}, Series3 {k1=v5, k2=v6}, each a long stream of points along the Time axis.]
Flexible Tags
• Each time series is defined by a metric name and
a unique combination of key-value pairs
• Index on tags allows filter/search by any combination of tags
memstore_partitions_queried {
dataset=timeseries,
host=MacBook-Pro-229.local,
shard=0
}
memstore_partitions_queried {
dataset=timeseries,
host=MacBook-Pro-229.local,
shard=1
}
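The tag index can be sketched as a toy inverted index in Python (illustrative only; FiloDB's real index is built on Apache Lucene):

```python
from collections import defaultdict

class TagIndex:
    """Toy inverted index from tag=value pairs to series ids."""
    def __init__(self):
        self.postings = defaultdict(set)   # (tag, value) -> series ids
        self.series = {}

    def add(self, series_id, tags):
        self.series[series_id] = tags
        for kv in tags.items():
            self.postings[kv].add(series_id)

    def query(self, **filters):
        """Series matching ALL given tag=value filters (set intersection)."""
        sets = [self.postings[kv] for kv in filters.items()]
        return set.intersection(*sets) if sets else set()

idx = TagIndex()
idx.add(1, {"dataset": "timeseries", "host": "MacBook-Pro-229.local", "shard": "0"})
idx.add(2, {"dataset": "timeseries", "host": "MacBook-Pro-229.local", "shard": "1"})
print(idx.query(dataset="timeseries", shard="1"))   # {2}
```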
Flexible Schemas and
Datasets
• Datasets allow for namespacing different schemas, ingestion
sources, SLAs, # shards, and offheap memory isolation
• Main dataset with 2-day retention
• Pre-aggregates dataset with 1 week retention
• Histograms dataset with schema for efficient histogram storage
• OpenTracing dataset — start, end, span duration, etc.
• Historical data using different schema
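The namespacing idea can be sketched as data (hypothetical shapes only; real FiloDB dataset configuration differs): each dataset bundles its own schema, retention, and shard count.

```python
# Hypothetical dataset definitions, one entry per namespace.
datasets = {
    "main":           {"schema": ["timestamp", "value"],   "retention_days": 2, "shards": 64},
    "pre_aggregates": {"schema": ["timestamp", "value"],   "retention_days": 7, "shards": 16},
    "histograms":     {"schema": ["timestamp", "buckets"], "retention_days": 2, "shards": 64},
}

# e.g. find which dataset keeps data the longest
longest = max(datasets, key=lambda d: datasets[d]["retention_days"])
print(longest)   # pre_aggregates
```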
The Hard Stuff: Recovery
and Persistence
What is persisted?
• Raw time series data - using a custom format
designed for efficient ingestion and recovery - is
stored in and ingested from Apache Kafka
• Compressed, columnar time series data is written
periodically to a ColumnStore, typically
Cassandra
• Time series metadata for reconstructing each
node’s index is persisted as well
Ingestion and Sharding
[Diagram: a Gateway writes to Kafka partitions Shard0 through Shard4; an Akka Cluster of FiloDB Nodes divides ownership of shards S0 to S4 across the nodes.]
Recovery
[Diagram: when a FiloDB Node fails, a new Filo Node takes over its shard (S2) and replays that shard's Kafka partition; older chunks come back from the CassandraChunkSink via on-demand paging, and queries can be routed to another DC in the meantime.]
Recovery
• The most recent raw data - before encoding - is
recovered by replaying Apache Kafka partitions
• Index metadata is recovered
• Compressed data is loaded on-demand from
Cassandra. This works because most data
written is never queried.
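The replay step can be sketched in Python, with a plain list standing in for one Kafka partition and a checkpointed offset marking where replay starts (a hypothetical, simplified model):

```python
# Checkpointed recovery sketch: on restart, re-apply every record
# after the last checkpointed offset to rebuild in-memory state.
def replay(partition_log, checkpoint_offset, apply):
    """Re-apply records newer than the checkpoint; return the new offset."""
    for offset in range(checkpoint_offset + 1, len(partition_log)):
        apply(partition_log[offset])
    return len(partition_log) - 1

log = [("series1", 1000, 1.0), ("series1", 1010, 2.0), ("series1", 1020, 3.0)]
state = []
new_offset = replay(log, checkpoint_offset=0, apply=state.append)
print(state, new_offset)   # the two samples after offset 0, new offset 2
```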
FiloDB vs Alternatives
vs Prometheus
• FiloDB supports PromQL, HTTP query API
• Prometheus is single-node only
• FiloDB is multi-schema and multi-tenant
• FiloDB designed to run as a resilient, distributed,
high-uptime cloud service
• FiloDB open source is not yet as feature-rich
vs InfluxDB
• FiloDB data model is very close to Influx: multi-schema,
multiple columns, namespaces, tags on series
• Clustering: FiloDB is peer-to-peer distributed; InfluxDB is single node (OSS), clustered ($$)
• Query language: FiloDB uses PromQL; InfluxDB uses SQL (PromQL coming)
• Maturity: FiloDB is new; InfluxDB is established
vs Cassandra
• C*: Very well established and widely used, robust
• Like FiloDB: real time, distributed, low-latency
• C*: Very simple queries ideally to one partition
• FiloDB: complex PromQL queries, topK,
groupBy, time series joins and windowing
• FiloDB: much higher storage density and
ingestion throughput for time series
vs Druid
• Druid and FiloDB have different data models
• Druid is an OLAP database with an explicit time
dimension. Dimensions are fixed.
• FiloDB supports millions/billions of time series
with flexible tags
• FiloDB stores raw data, Druid stores roll ups
Tradeoffs and Lessons
Tradeoffs of using the JVM
• Pluses: solid, proven libraries for building
distributed and data systems
• Apache Lucene
• Akka Cluster
• Minuses: Lack of low-level memory layout and
control
• The devil you know best
JVM Production Tips
• Get to know different GCs, Eden, OldGen, G1GC, etc.
really really well
• SJK (https://github.com/aragozin/jvm-tools)
• Runtime visibility
• Multiple APIs to access cluster state
• JMXbeans
• Measure measure measure!! (Use JMH)
Current Status
• Development at github.com/filodb/FiloDB
• Time/value schema ingestion and querying is
stable
• Looking for partners to work together, add
integrations, etc.
Try it out today
• Ingest data using https://github.com/influxdata/telegraf
• Expose a Prometheus HTTP read endpoint in
your apps
• Use Grafana to visualize metrics
Roadmap
• Speed and efficiency improvements in core FiloDB
database
• Histogram optimizations
• Improved cluster state management
• Support for Spark/ML/AI jobs and metrics. How can we
improve observability for data engineers?
• Support for non-metrics schemas
• Long term storage
Thank you!
• Note: We are hiring! If you love reactive systems,
distributed systems, and pushing the performance
envelope… there's a place for you.
Extra Slides
On Heap vs Off Heap
[Diagram: on heap, only slim TSPartition objects and the PartitionMap remain; the incoming Records, Write buffers, Chunks, and ChunkMaps live offheap, and the Lucene index is served from memory-mapped (MMap) index files.]
