NYC* Big Tech Day 2013: Financial Time Series
1. Financial Time Series
Cassandra 1.2
Jake Luciani and Carl Yeksigian
BlueMountain Capital
2. Know your problem.
1000s of consumers
..creating and reading data as fast as possible
..consistent to all readers
..and handle ad-hoc user queries
..quickly
..across datacenters.
6. Know your queries.
● Cross sections are for random data
● Storing for cross sections means thousands of writes, inconsistent queries
● We also need bitemporality, but it's hard, so let's ignore it in the query
7. Know your users.
A million, billion writes per second
..and reads are fast and happen at the same time
..and we can answer everything consistently
..and it scales to new use cases quickly
..and it's all done yesterday
9. Data Model (in C* 1.1)
AAPL  lastPrice:2013-03-18:2013-03-19 → 0E-34-88-FF-26-E3-2C
      lastPrice:2013-03-19:2013-03-19 → 0E-34-88-FF-26-E3-3D
      lastPrice:2013-03-19:2013-03-20 → 0E-34-88-FF-26-E3-4E
10. But we're using C* 1.2.
● CQL3
● V-nodes
● JBOD
● Pooled decompression buffers
● SSD Aware
● Parallel Compaction
● Off-Heap Bloom Filters
● Metrics!
● Concurrent Schema Creation
11. Data Model (CQL 3)
CREATE TABLE tsdata (
    id blob,
    property text,
    asof_ticks bigint,
    knowledge_ticks bigint,
    value blob,
    PRIMARY KEY (id, property, asof_ticks, knowledge_ticks)
)
WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (property ASC, asof_ticks DESC, knowledge_ticks DESC);
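For illustration, a single observation in this schema could be written like so (hypothetical values; the two tick columns carry the asof and knowledge timestamps):

```sql
INSERT INTO tsdata (id, property, asof_ticks, knowledge_ticks, value)
VALUES (0x12345, 'lastPrice', 1234567890, 1234567999, 0x0e3488ff26e32c);
```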
12. CQL3 Queries: Time Series
SELECT * FROM tsdata
WHERE id = 0x12345
AND property = 'lastPrice'
AND asof_ticks >= 1234567890
AND asof_ticks <= 2345678901
13. CQL3 Queries: Cross Section
SELECT * FROM tsdata
WHERE id = 0x12345
AND property = 'lastPrice'
AND asof_ticks = 1234567890
AND knowledge_ticks < 2345678901
LIMIT 1
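The descending clustering order plus `LIMIT 1` is what makes this a bitemporal point lookup: among all versions recorded before the given knowledge time, it returns the most recently known one. A minimal in-memory sketch of the same logic, with made-up data:

```python
# Each stored point is (asof_ticks, knowledge_ticks, value). The cross-section
# question is: for a fixed asof time, what was the latest value we knew
# as of a given knowledge time?
def latest_known(points, asof_ticks, knowledge_ticks):
    candidates = [
        (k, v) for a, k, v in points
        if a == asof_ticks and k < knowledge_ticks
    ]
    # Descending knowledge order + LIMIT 1 == max knowledge time.
    return max(candidates)[1] if candidates else None

points = [
    (100, 110, "first estimate"),
    (100, 120, "corrected value"),   # same asof time, learned later
    (200, 210, "next day's value"),
]
```

Asking with a later knowledge cutoff returns the correction; asking with an earlier cutoff returns only what was known then.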
14. Data Overload!
All points between start and end
Even though we have a periodicity
All knowledge times
Even though we only want latest
15. A Service, not an app
[Diagram: C* sits at the center, ringed by Olympus service nodes, which are in turn ringed by client apps; apps talk only to Olympus, never to Cassandra directly.]
16. Filtration
Filter everything by knowledge time
Filter time series by periodicity
200k points filtered down to 300
[Diagram: Cassandra reads return every AAPL:lastPrice:(asof):(knowledge) version; the service's filter passes through only the latest-known point for each asof time.]
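The service-side filtering described above can be sketched roughly as follows (a toy version with integer ticks and a simple bucket-based periodicity, not the production code):

```python
# Reads return every (asof, knowledge, value) version in the range; the
# service keeps only the latest-known version of each asof point, then
# thins the series to the requested periodicity.
def filter_points(points, knowledge_limit, period):
    latest = {}  # asof_ticks -> (knowledge_ticks, value)
    for asof, knowledge, value in points:
        if knowledge < knowledge_limit:
            if asof not in latest or knowledge > latest[asof][0]:
                latest[asof] = (knowledge, value)
    # Keep one point per period bucket (e.g. one bar per day from tick data).
    out, seen_buckets = [], set()
    for asof in sorted(latest):
        bucket = asof // period
        if bucket not in seen_buckets:
            seen_buckets.add(bucket)
            out.append((asof, latest[asof][1]))
    return out
```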
17. Pushdown Filters
● To provide periodicity on raw data, downsample on write
● There are still cases where we don't know how to sample
● This filtering should be pushed to C*
● The coordinator node should apply a filter to the result set
18. Complex Value Types
Not every value is a double
Some values belong together
Bid and Ask should come back together
19. Thrift
Thrift structures as values
Typed, extensible schema
Union types give us a way to deserialize any type
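The union-type idea can be sketched without Thrift (the tags and wire format below are hypothetical, not the actual IDL used here): each stored value carries a type tag, so a reader can decode any value, including composite ones like a bid/ask pair, without knowing its type in advance.

```python
import struct

# Hypothetical type tags for the tagged-union encoding.
DOUBLE, BID_ASK = 1, 2

def encode(tag, *values):
    # One tag byte followed by big-endian doubles.
    return struct.pack(f">B{len(values)}d", tag, *values)

def decode(blob):
    tag = blob[0]
    doubles = struct.unpack(f">{(len(blob) - 1) // 8}d", blob[1:])
    if tag == DOUBLE:
        return doubles[0]
    if tag == BID_ASK:
        return {"bid": doubles[0], "ask": doubles[1]}
    raise ValueError(f"unknown type tag {tag}")
```

Adding a new value type means adding a tag; old readers can still skip or reject values they don't understand, which is the extensibility the schema needs.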
23. Scaling...
Step 1 - Fast Machines for your workload
Step 2 - Avoid Java GC for your workload
Step 3 - Tune Cassandra for your workload
Step 4 - Prefetch and cache for your workload
24. Can't fix what you can't measure
Riemann (http://riemann.io)
Easily push application and system metrics into a single system
We push 4k metrics per second to a single Riemann instance
28. Scaling Reads: Machines
SSDs for hot data
JBOD config
As many cores as possible (> 16)
10GbE network
Bonded network cards
Jumbo frames
29. JBOD is a lifesaver
SSDs are great until they aren't anymore
JBOD allowed passive recovery in the face of simultaneous disk failures (the SSDs had bad firmware)
30. Scaling Reads: JVM
JVM Magic!
-Xmx12G
-Xmn1600M
-XX:SurvivorRatio=16
-XX:+UseCompressedOops
-XX:+UseTLAB yields ~15% boost!
(thread-local allocators, good for SEDA architectures)
37. Leveled Compaction
Wide rows means data can be spread across a
huge number of SSTables
Leveled Compaction puts a bound on the worst
case (*)
Fewer SSTables to read means lower latency
[Diagram: SSTable levels L0–L5; the highlighted (orange) SSTables are the ones a read touches]
* In theory
40. Better Compression: New LZ4Compressor
LZ4 compression is 40% faster than Google's Snappy in benchmarks.
[Chart: decompression throughput of LZ4 JNI vs. Snappy JNI vs. LZ4 (sun.misc.Unsafe)]
Blocks in Cassandra are so small that we don't see the same gain in production, but the 95th-percentile latency is improved, and it works with Java 7.
41. CRC Check Chance
CRC check of each compressed block causes
reads to be 2x SLOWER.
Lowered crc_check_chance to 10% of reads.
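In the Cassandra 1.2-era schema, `crc_check_chance` lives inside the table's compression options, so the change above amounts to something like the following (option names as of that era; verify against your version's documentation):

```sql
ALTER TABLE tsdata
WITH compression = { 'sstable_compression' : 'LZ4Compressor',
                     'crc_check_chance' : 0.1 };
```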
A move to JNI would cause a 30x boost
42. Current Stats
● 12 nodes
● 2 DataCenters
● RF=6
● 150k Writes/sec at EACH_QUORUM
● 100k Reads/sec at LOCAL_QUORUM
● > 6 Billion points (without replication)
● 2TB on disk (compressed)
● Read Latency 50%/95% is 1ms/10ms