Arctic: High-performance IoT and financial data storage with Python and MongoDB

https://github.com/manahl/arctic

Arctic is a high-performance Python library for storing time series and numeric data in MongoDB. It natively supports Pandas and numpy numeric data types.

The Arctic library is pluggable. It ships with a versioned data store and a tick store.

Transcript:

  1. Arctic: High-performance IoT and financial data storage – the Python + MongoDB TimeSeries store. July 2015. James Blackburn – @ManAHLTech, @jimmybb
  2. Overview: 1. Data @ AHL; 2. Arctic: a fast Python library for TimeSeries (and anything else); 3. Demo; 4. Performance
  3. Love – “… we”
  4. More love – “… and we ♥ data”
  5. AHL’s Data Pipeline
  6. Users – Quant researchers: • Interactive work – latency sensitive • Batch jobs run on a cluster – total throughput • Historical data • New data • … want control of storing their own data. Trading system: • Auditable – SVN for data • Stable
  7. Data sizes we deal with… • ~1MB x 1000s – 1x a day price data, 10k rows (30 years) • ~0.5GB x 1000s – 1-minute data, 4M rows (20 years) • ~1GB x 1000s – 10k x 10k data matrices, 100M cells (30 years) • ~30TB – tick data, 100k msgs/s, 800M msgs/day. … and different shapes: • Time series of prices • Event data • News data • Metadata • What’s next?
  8. Problems – Scale:

      lib.read('US Equity Adjusted Prices')
      Out[4]:
      <class 'pandas.core.frame.DataFrame'>
      DatetimeIndex: 9605 entries, 1983-01-31 21:30:00 to 2014-02-14 21:30:00
      Columns: 8103 entries, AST10000 to AST9997
      dtypes: float64(8103)
  9. Problems – Scale (continued): that one read is 77M float64s – 600MB of data ~= 5 Gbits!
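A quick sanity check of those numbers (back-of-the-envelope only, using the row and column counts from the read above):

      rows, cols = 9605, 8103       # entries reported by lib.read
      n_bytes = rows * cols * 8     # float64 = 8 bytes each
      print(n_bytes / 1e6)          # ~620 MB of raw data
      print(n_bytes * 8 / 1e9)      # ~5 Gbits to move over the network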
  10. SQL – cx_Oracle: ~200us per row; 10k rows => 2.2s
  11. SQL – cx_Oracle: ~200us per row; 10k rows => 2.2s. How do we query 4 million rows?
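For context, a minimal timing sketch of the kind of per-row cost being quoted here (the connection string, table and column names are hypothetical, not from the deck):

      import time
      import cx_Oracle

      conn = cx_Oracle.connect('user/password@dsn')   # hypothetical DSN
      cur = conn.cursor()

      start = time.time()
      cur.execute('SELECT dt, price FROM daily_prices WHERE symbol = :sym', sym='SYM')
      rows = cur.fetchall()
      elapsed = time.time() - start
      print('%d rows in %.2fs (~%.0fus/row)' % (len(rows), elapsed, 1e6 * elapsed / len(rows)))

At ~200us a row, 4 million one-minute rows would take over 13 minutes, which is the problem the following slides address.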
  12. Local files: fast!
  13. Local files: fast! But what about multi-machine access? Versioning? Non-timeseries data?
  14. Databases – using all the technology: • Relational databases • Tick databases • Flat files • HDF5 files • Caches
  15. Databases (the same list): can we build one system to rule them all?
  16. Arctic
  17. Project Requirements: • Scalable – unbounded in data-size and number of clients • Agile – any data shape; new shapes; iterative development • Fast – as fast as local files • Easy to use – and we mean easy
  18. Project Goals: • 20 years of 1-minute data in <1s • 200 instruments x all history x once-a-day data in <1s • Single data store for all data types • 1x day data → tick data • Data versioning + audit
  19. Arctic Libraries – data bucketed into named Libraries: • One minute • Daily • User-data: jbloggs.EOD • Metadata Index. Pluggable Library types: • VersionStore • TickStore • PickleStore • … pluggable … https://github.com/manahl/arctic/blob/master/howtos/how_to_custom_arctic_library.py
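The pluggable-type mechanism behind that howto link looks roughly like this (a minimal sketch, not the deck's code; the class name and method bodies are illustrative):

      from arctic import Arctic, register_library_type

      class KeyValueStore(object):
          """Illustrative custom library type; see the linked howto for the real pattern."""
          def __init__(self, arctic_lib):
              self._collection = arctic_lib.get_top_level_collection()

          @classmethod
          def initialize_library(cls, arctic_lib, **kwargs):
              pass  # one-off setup: create indexes etc.

      register_library_type('KeyValueStore', KeyValueStore)
      Arctic('localhost').initialize_library('jbloggs.custom', 'KeyValueStore')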
  20. (image-only slide)
  21. Why MongoDB: • Document ~= Python dictionary • Flexible schema → rapid prototyping • Open-source database • Great support • #1 NoSQL DB (#3 overall): http://db-engines.com/en/ranking
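The “document ~= Python dictionary” point in one sketch (database and field names are made up):

      from pymongo import MongoClient

      db = MongoClient('localhost')['demo']   # hypothetical database
      db.quotes.insert_one({'symbol': 'SYM', 'bid': 100.25, 'tags': ['demo']})  # store a plain dict
      print(db.quotes.find_one({'symbol': 'SYM'}))  # it comes back as a dict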
  22. Demo
  23. Arctic API – Arctic key-value store:

      from arctic import Arctic
      a = Arctic('research')                        # Connect to the data store
      a.list_libraries()                            # What data libraries are available
      library = a['jbloggs.EOD']                    # Get a Library
      library.list_symbols()                        # List symbols
      library.write('SYMBOL', <TS or other data>)   # Write
      library.read('SYMBOL', version=…)             # Read, with an optional version
      library.snapshot('snapshot-name')             # Create a named snapshot of the library
      library.list_snapshots()

      https://github.com/manahl/arctic/blob/master/howtos/how_to_use_arctic.py
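Putting that API together, a minimal end-to-end sketch (the library name and DataFrame contents are made up):

      import pandas as pd
      from arctic import Arctic

      a = Arctic('localhost')
      a.initialize_library('jbloggs.EOD')   # one-off setup
      library = a['jbloggs.EOD']

      ts = pd.DataFrame({'price': [100.0, 101.5]},
                        index=pd.date_range('2015-07-01', periods=2))
      library.write('SYMBOL', ts)           # creates version 1
      item = library.read('SYMBOL')         # a VersionedItem
      print(item.version)
      print(item.data)                      # the DataFrame round-trips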
  24. (image-only slide)
  25. Arctic – TickStore: Arctic('localhost').initialize_library('tickdb', 'TickStoreV3')
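A minimal TickStore usage sketch (symbol and prices are made up; this assumes TickStore's list-of-dicts write format, where each dict carries a tz-aware 'index' timestamp):

      from datetime import datetime
      import pytz
      from arctic import Arctic
      from arctic.date import DateRange

      tickdb = Arctic('localhost')['tickdb']
      ticks = [{'index': datetime(2013, 6, 3, 12, 0, 0, tzinfo=pytz.UTC), 'bid': 1630.25, 'ask': 1630.50},
               {'index': datetime(2013, 6, 3, 12, 0, 1, tzinfo=pytz.UTC), 'bid': 1630.50, 'ask': 1630.75}]
      tickdb.write('ES', ticks)   # append ticks for a symbol
      df = tickdb.read('ES', date_range=DateRange(datetime(2013, 6, 3, tzinfo=pytz.UTC),
                                                  datetime(2013, 6, 4, tzinfo=pytz.UTC)))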
  26. Tick Rate – Kibana (dashboard screenshot)
  27. Implementation – TickStore (diagram of per-symbol chunks: Sym1, Sym2)
  28. Implementation – a chunk:

      {
          ID: ObjectId('52b1d39eed5066ab5e87a56d'),
          SYMBOL: 'symbol',
          INDEX: Binary('...', 0),
          COLUMNS: {
              ASK: {
                  DATA: Binary('...', 0),
                  DTYPE: '<f8',
                  ROWMASK: Binary('...', 0)
              },
              ...
          },
          START: DateTime(...),
          END: DateTime(...),
          SEGMENT: 1386933906826L,
          SHA: 1386933906826L,
          VERSION: 3,
      }
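Reading one column back out of such a chunk amounts to decompressing the blob and reinterpreting it with the stored dtype. Roughly (a sketch only; that both DATA and ROWMASK are lz4-compressed, and the bit order of the row mask, are assumptions here, not taken from the deck):

      import lz4
      import numpy as np

      def decode_column(col_doc, n_rows):
          # Dense values present in this chunk, in the column's stored dtype
          values = np.frombuffer(lz4.decompress(col_doc['DATA']), dtype=col_doc['DTYPE'])
          # ROWMASK marks which of the chunk's rows have a value for this column
          bits = np.unpackbits(np.frombuffer(lz4.decompress(col_doc['ROWMASK']), dtype='uint8'))
          mask = bits[:n_rows].astype(bool)
          dense = np.full(n_rows, np.nan)
          dense[mask] = values
          return dense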
  29. Arctic – VersionStore: Arctic('localhost').initialize_library('library')
  30. (image-only slide)
  31. Implementation – VersionStore (diagram: snapshots Snap A and Snap B pointing at symbol versions Sym1 v1 and Sym2 v3, v4, v5, v6)
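In API terms the diagram above reads as follows (a minimal sketch; symbol, snapshot and library names are made up):

      import pandas as pd
      from arctic import Arctic

      library = Arctic('localhost')['jbloggs.EOD']
      idx = pd.date_range('2015-07-01', periods=2)
      library.write('Sym2', pd.DataFrame({'price': [1.0, 2.0]}, index=idx))  # a new version per write
      library.snapshot('Snap-A')                                             # pin the current versions
      library.write('Sym2', pd.DataFrame({'price': [1.0, 2.5]}, index=idx))  # later writes leave Snap-A intact
      print(library.read('Sym2', as_of='Snap-A').data)                       # data as it stood at Snap-A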
  32. https://github.com/manahl/arctic/blob/master/arctic/store/_pickle_store.py
  33. Performance
  34. Results – performance, once-a-day data (chart: flat files on NFS, random market)
  35. Results – performance, one-minute data (chart: HDF5 files, random instrument)
  36. Results – TickStore (chart: random E-Mini S&P contract from 2013)
  37. Results – TickStore throughput (chart: random E-Mini S&P contract from 2013)
  38. Results – system load (chart: random E-Mini S&P contract from 2013; OtherTick vs. Mongo (x2), N tasks = 32)
  39. Conclusions – Low latency: • 1xDay data: 4ms for 10,000 rows (vs. 2,210ms from SQL) • OneMinute/Tick data: 1s for 3.5M rows in Python (vs. 15s–40s+ from OtherTick); 1s for 15M rows in Java. Parallel access: • Cluster with 256+ concurrent data accesses • Consistent throughput – little load on the Mongo server. Efficient: • 10–15x reduction in network load • Negligible decompression cost (lz4: 1.8Gb/s)
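The lz4 decompression figure is easy to sanity-check (a micro-benchmark sketch using the same lz4 module as the PickleStore code below; the highly compressible payload makes this a best case):

      import time
      import lz4

      payload = 'price,size,exchange\n' * 1000000   # ~20MB of repetitive, made-up data
      compressed = lz4.compressHC(payload)

      start = time.time()
      for _ in range(50):
          lz4.decompress(compressed)
      elapsed = time.time() - start
      print('%.2f GB/s' % (50 * len(payload) / elapsed / 1e9))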
  40. Like data? We’re hiring! recruitment@ahl.com Questions? @ManAHLTech @jimmybb – James Blackburn
  41. (closing slide)
  42–44. Implementation – PickleStore (the write path, shown across three highlight slides):

      import cPickle
      import lz4
      from bson.binary import Binary

      _CHUNK_SIZE = 15 * 1024 * 1024  # 15MB

      class PickleStore(object):
          def write(self, collection, version, symbol, item):
              # Try to pickle it. This is best effort
              pickled = lz4.compressHC(cPickle.dumps(item))
              # Split into <=15MB segments: MongoDB caps documents at 16MB
              for i in xrange(len(pickled) / _CHUNK_SIZE + 1):
                  segment = {'data': Binary(pickled[i * _CHUNK_SIZE:(i + 1) * _CHUNK_SIZE])}
                  segment['segment'] = i
                  # checksum() is arctic's segment-hashing helper
                  sha = checksum(symbol, segment)
                  # Upsert keyed on (symbol, sha): unchanged segments are shared
                  # between versions via the 'parent' set of version ids
                  collection.update({'symbol': symbol, 'sha': sha},
                                    {'$set': segment,
                                     '$addToSet': {'parent': version['_id']}},
                                    upsert=True)
  45. Implementation – PickleStore (the read path):

      import cPickle
      import lz4
      import pymongo

      class PickleStore(object):
          def read(self, collection, version, symbol):
              # Fetch this version's segments in index order and reassemble the blob
              data = ''.join([x['data'] for x in collection.find(
                  {'symbol': symbol, 'parent': version['_id']},
                  sort=[('segment', pymongo.ASCENDING)])])
              return cPickle.loads(lz4.decompress(data))

      https://github.com/manahl/arctic/blob/master/arctic/store/_pickle_store.py
