NYC* Big Tech Day 2013: Financial Time Series
1. Financial Time Series
Cassandra 1.2
Jake Luciani and Carl Yeksigian
BlueMountain Capital
2. Know your problem.
1000s of consumers
..creating and reading data as fast as possible
..consistent to all readers
..and handle ad-hoc user queries
..quickly
..across datacenters.
6. Know your queries.
● Cross sections are for random data
● Storing for cross sections means thousands of writes, inconsistent queries
● We also need bitemporality, but it's hard, so let's ignore it in the query
7. Know your users.
A million, billion writes per second
..and reads are fast and happen at the same time
..and we can answer everything consistently
..and it scales to new use cases quickly
..and it's all done yesterday
9. Data Model (in C* 1.1)
AAPL  lastPrice:2013-03-18:2013-03-19 → 0E-34-88-FF-26-E3-2C
      lastPrice:2013-03-19:2013-03-19 → 0E-34-88-FF-26-E3-3D
      lastPrice:2013-03-19:2013-03-20 → 0E-34-88-FF-26-E3-4E
10. But we're using C* 1.2.
● CQL3
● V-nodes
● JBOD
● Pooled decompression buffers
● SSD Aware
● Parallel Compaction
● Off-Heap Bloom Filters
● Metrics!
● Concurrent Schema Creation
11. Data Model (CQL 3)
CREATE TABLE tsdata (
    id blob,
    property text,
    asof_ticks bigint,
    knowledge_ticks bigint,
    value blob,
    PRIMARY KEY (id, property, asof_ticks, knowledge_ticks)
)
WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (property ASC, asof_ticks DESC, knowledge_ticks DESC);
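For illustration, a single observation in this schema could be written like so (hypothetical values; the two tick columns carry the asof and knowledge timestamps):

```sql
INSERT INTO tsdata (id, property, asof_ticks, knowledge_ticks, value)
VALUES (0x12345, 'lastPrice', 1234567890, 1234567999, 0x0e3488ff26e32c);
```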
12. CQL3 Queries: Time Series
SELECT * FROM tsdata
WHERE id = 0x12345
AND property = 'lastPrice'
AND asof_ticks >= 1234567890
AND asof_ticks <= 2345678901
13. CQL3 Queries: Cross Section
SELECT * FROM tsdata
WHERE id = 0x12345
AND property = 'lastPrice'
AND asof_ticks = 1234567890
AND knowledge_ticks < 2345678901
LIMIT 1
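The descending clustering order plus `LIMIT 1` is what makes this a bitemporal point lookup: among all versions recorded before the given knowledge time, it returns the most recently known one. A minimal in-memory sketch of the same logic, with made-up data:

```python
# Each stored point is (asof_ticks, knowledge_ticks, value). The cross-section
# question is: for a fixed asof time, what was the latest value we knew
# as of a given knowledge time?
def latest_known(points, asof_ticks, knowledge_ticks):
    candidates = [
        (k, v) for a, k, v in points
        if a == asof_ticks and k < knowledge_ticks
    ]
    # Descending knowledge order + LIMIT 1 == max knowledge time.
    return max(candidates)[1] if candidates else None

points = [
    (100, 110, "first estimate"),
    (100, 120, "corrected value"),   # same asof time, learned later
    (200, 210, "next day's value"),
]
```

Asking with a later knowledge cutoff returns the correction; asking with an earlier cutoff returns only what was known then.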
14. Data Overload!
All points between start and end
Even though we have a periodicity
All knowledge times
Even though we only want latest
15. A Service, not an app
[Diagram: C* sits at the center, ringed by Olympus service nodes, which are in turn ringed by client apps; apps talk only to Olympus, never to Cassandra directly.]
16. Filtration
Filter everything by knowledge time
Filter time series by periodicity
200k points filtered down to 300
[Diagram: Cassandra reads return every AAPL:lastPrice:(asof):(knowledge) version; the service's filter passes through only the latest-known point for each asof time.]
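The service-side filtering described above can be sketched roughly as follows (a toy version with integer ticks and a simple bucket-based periodicity, not the production code):

```python
# Reads return every (asof, knowledge, value) version in the range; the
# service keeps only the latest-known version of each asof point, then
# thins the series to the requested periodicity.
def filter_points(points, knowledge_limit, period):
    latest = {}  # asof_ticks -> (knowledge_ticks, value)
    for asof, knowledge, value in points:
        if knowledge < knowledge_limit:
            if asof not in latest or knowledge > latest[asof][0]:
                latest[asof] = (knowledge, value)
    # Keep one point per period bucket (e.g. one bar per day from tick data).
    out, seen_buckets = [], set()
    for asof in sorted(latest):
        bucket = asof // period
        if bucket not in seen_buckets:
            seen_buckets.add(bucket)
            out.append((asof, latest[asof][1]))
    return out
```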
17. Pushdown Filters
● To provide periodicity on raw data, downsample on write
● There are still cases where we don't know how to sample
● This filtering should be pushed to C*
● The coordinator node should apply a filter to the result set
18. Complex Value Types
Not every value is a double
Some values belong together
Bid and Ask should come back together
19. Thrift
Thrift structures as values
Typed, extensible schema
Union types give us a way to deserialize any type
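The union-type idea can be sketched without Thrift (the tags and wire format below are hypothetical, not the actual IDL used here): each stored value carries a type tag, so a reader can decode any value, including composite ones like a bid/ask pair, without knowing its type in advance.

```python
import struct

# Hypothetical type tags for the tagged-union encoding.
DOUBLE, BID_ASK = 1, 2

def encode(tag, *values):
    # One tag byte followed by big-endian doubles.
    return struct.pack(f">B{len(values)}d", tag, *values)

def decode(blob):
    tag = blob[0]
    doubles = struct.unpack(f">{(len(blob) - 1) // 8}d", blob[1:])
    if tag == DOUBLE:
        return doubles[0]
    if tag == BID_ASK:
        return {"bid": doubles[0], "ask": doubles[1]}
    raise ValueError(f"unknown type tag {tag}")
```

Adding a new value type means adding a tag; old readers can still skip or reject values they don't understand, which is the extensibility the schema needs.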
23. Scaling...
Step 1 - Fast Machines for your workload
Step 2 - Avoid Java GC for your workload
Step 3 - Tune Cassandra for your workload
Step 4 - Prefetch and cache for your workload
24. Can't fix what you can't measure
Riemann (http://riemann.io)
Easily push application and system metrics into a single system
We push 4k metrics per second to a single Riemann instance
28. Scaling Reads: Machines
SSDs for hot data
JBOD config
As many cores as possible (> 16)
10GbE network
Bonded network cards
Jumbo frames
29. JBOD is a lifesaver
SSDs are great until they aren't anymore
JBOD allowed passive recovery in the face of simultaneous disk failures (the SSDs had bad firmware)
30. Scaling Reads: JVM
JVM Magic!
-Xmx12G
-Xmn1600M
-XX:SurvivorRatio=16
-XX:+UseCompressedOops
-XX:+UseTLAB yields ~15% boost!
(thread-local allocators, good for SEDA architectures)
37. Leveled Compaction
Wide rows means data can be spread across a
huge number of SSTables
Leveled Compaction puts a bound on the worst
case (*)
Fewer SSTables to read means lower latency
[Diagram: SSTable levels L0–L5; the highlighted (orange) SSTables are the ones a read touches]
* In theory
40. Better Compression: New LZ4Compressor
LZ4 compression is 40% faster than Google's Snappy in benchmarks.
[Chart: decompression throughput of LZ4 JNI vs. Snappy JNI vs. LZ4 (sun.misc.Unsafe)]
Blocks in Cassandra are so small that we don't see the same gain in production, but the 95th-percentile latency is improved, and it works with Java 7.
41. CRC Check Chance
CRC check of each compressed block causes
reads to be 2x SLOWER.
Lowered crc_check_chance to 10% of reads.
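In the Cassandra 1.2-era schema, `crc_check_chance` lives inside the table's compression options, so the change above amounts to something like the following (option names as of that era; verify against your version's documentation):

```sql
ALTER TABLE tsdata
WITH compression = { 'sstable_compression' : 'LZ4Compressor',
                     'crc_check_chance' : 0.1 };
```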
A move to JNI would cause a 30x boost
42. Current Stats
● 12 nodes
● 2 DataCenters
● RF=6
● 150k Writes/sec at EACH_QUORUM
● 100k Reads/sec at LOCAL_QUORUM
● > 6 Billion points (without replication)
● 2TB on disk (compressed)
● Read Latency 50%/95% is 1ms/10ms