Python and MongoDB as a Market Data Platform
Scalable storage of time series data
James Blackburn, 2014
1. Python and MongoDB as a Market Data Platform – Scalable storage of time series data (2014)
2. Legalese…

Opinions expressed are those of the author and may not be shared by all personnel of Man Group plc (‘Man’). These opinions are subject to change without notice, and are for information purposes only and do not constitute an offer or invitation to make an investment in any financial instrument or in any product to which any member of Man’s group of companies provides investment advisory or any other services. Any forward-looking statements speak only as of the date on which they are made and are subject to risks and uncertainties that may cause actual results to differ materially from those contained in the statements. Unless stated otherwise this information is communicated by Man Investments Limited and AHL Partners LLP which are both authorised and regulated in the UK by the Financial Conduct Authority.
3. The Problem
4. Overview – Data shapes

Financial data comes in different sizes…
• ~1MB: once-a-day price data
• ~1GB x 1000s: 9,000 x 9,000 data matrices
• ~40GB: 1-minute data
• ~30TB: tick data
• > even larger data sets (options, …)

… and different shapes
• Time series of prices
• Event data
• News data
• What’s next?
5. Overview – Data consumers

Quant researchers
• Interactive work – latency sensitive
• Batch jobs run on a cluster – maximize throughput
• Historical data
• New data
• … want control of storing their own data

Trading system
• Auditable – SVN for data
• Stable
• Performant
6. The Research Problem – Scale

lib.read('Equity Prices')
Out[4]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 9605 entries, 1983-01-31 21:30:00 to 2014-02-14 21:30:00
Columns: 8103 entries, AST10000 to AST9997
dtypes: float64(8103)

Equity Prices: 77M float64s = 593MB (~600MB) of data = 4,744Mbits!
7. Overview – Databases

Many different existing data stores
• Relational databases
• Tick databases
• Flat files
• HDF5 files
• Caches
8. Overview – Databases

Many different existing data stores
• Relational databases
• Tick databases
• Flat files
• HDF5 files
• Caches

Can we build one system to rule them all?
9. Project Goals

Goals
• 10 years of 1-minute data in <1s
• 200 instruments x all history x once-a-day data in <1s
• Single data store for all data types: 1x day data → tick data
• Data versioning + audit

Requirements
• Fast – most data in-memory
• Complete – all data in a single location
• Scalable – unbounded in size and number of clients
• Agile – rapid iterative development
10. Implementation
11. Implementation – Choosing MongoDB

Impedance mismatch between Python/Pandas/NumPy and existing databases:
- A machine cluster operating on data blocks
vs
- The database doing the analytical work

MongoDB:
- Developer productivity
- Document → Python dictionary
- Fast out of the box: low latency, high throughput, predictable performance
- Sharding / replication for growth and scale-out
- Free
- Great support
- Most widely used NoSQL DB
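The "Document → Python dictionary" point is the heart of the impedance argument: pymongo maps BSON documents directly onto plain dicts, with no ORM layer in between. A minimal illustration (not from the deck; assumes a local mongod and the pymongo 3+ API):

    import pymongo

    client = pymongo.MongoClient('localhost')
    coll = client.research.example

    # A document goes in as a plain dict...
    coll.insert_one({'symbol': 'AST1209', 'price': 101.5})

    # ...and comes back out as a plain dict
    doc = coll.find_one({'symbol': 'AST1209'})
    print(doc['price'])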
12. Implementation – System Architecture

[Diagram: multiple Python clients connect through mongos routers (x3), backed by config servers (x3), to a sharded cluster of replica sets rs0–rs4, each mongod holding 500GB.]

Example document:

{'_id': ObjectId('…'),
 'c': 47,
 'columns': {'PRICE': {'data': Binary('...', 0),
                       'dtype': 'float64',
                       'rowmask': Binary('...', 0)},
             'SIZE': {'data': Binary('...', 0),
                      'dtype': 'int64',
                      ...}},
 'endSeq': -1L,
 'index': Binary('...', 0),
 'segment': 1296568173000L,
 'sha': 'abcd123456',
 'start': 1296568173000L,
 'end': 1298569664000L,
 'symbol': 'AST1209',
 'v': 2}
13. Implementation – Mongoose

Data bucketed into named Libraries
• One minute
• Daily
• User data: jbloggs.EOD
• Metadata index

Pluggable library types:
• VersionStore
• TickStore
• Metadata store
• … others …
14. Implementation – Mongoose API

Mongoose is a key-value store:

    from ahl.mongo import Mongoose

    m = Mongoose('research')                     # Connect to the data store
    m.list_libraries()                           # What data libraries are available
    library = m['jbloggs.EOD']                   # Get a Library

    library.list_symbols()                       # List symbols
    library.write('SYMBOL', <TS or other data>)  # Write
    library.read('SYMBOL', version=…)            # Read, with an optional version

    library.snapshot('snapshot-name')            # Create a named snapshot of the library
    library.list_snapshots()
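A hypothetical round trip through this API, combining only the calls shown above. The symbol name, the DataFrame payload, and the exact return value of read() are assumptions, not confirmed by the deck:

    import pandas as pd
    from ahl.mongo import Mongoose

    m = Mongoose('research')
    library = m['jbloggs.EOD']

    df = pd.DataFrame({'price': [100.0, 101.5]},
                      index=pd.date_range('2014-01-01', periods=2))
    library.write('AST1209', df)                   # creates version 1
    library.snapshot('end-of-jan')                 # pin the current versions
    library.write('AST1209', df * 1.01)            # creates version 2

    latest = library.read('AST1209')               # newest version
    as_of_v1 = library.read('AST1209', version=1)  # older versions stay readable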
15. Implementation – Version Store

[Diagram: named snapshots (Snap A, Snap B) each pin specific versions of symbols, e.g. Sym1 v1, Sym2 v3, Sym2 v4.]
16. Implementation – VersionStore: A chunk

[Screenshot: an example chunk document.]
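As a stand-in for the screenshot, here is a hypothetical chunk document assembled from the fields that appear on slides 12 and 19–22 (the exact layout is an assumption):

    chunk = {
        'symbol': 'AST1209',                   # which symbol this chunk belongs to
        'segment': 0,                          # ordinal position within the blob
        'sha': '<checksum(symbol, segment)>',  # content hash, used to de-duplicate
        'data': b'<up to 15MB of lz4-compressed bytes>',
        'parent': ['<version _id>'],           # every version referencing this chunk
    }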
17. Implementation – VersionStore: A version

[Screenshot: an example version document.]
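Again a screenshot in the deck; a hypothetical version document follows. Only '_id' (referenced by chunks via 'parent') and 'dtype' (written by do_write on slide 23) are confirmed by the code slides; the other fields are assumptions:

    version = {
        '_id': '<ObjectId>',     # chunks point back here via their 'parent' set
        'symbol': 'AST1209',
        'version': 4,            # incremented on every write of this symbol
        'dtype': "[('index', '<M8[ns]'), ('price', '<f8')]",  # stored by do_write
    }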
18. Implementation – VersionStore: Bringing it together

[Diagram: how snapshots, versions and chunks combine.]
19–21. Implementation – Arbitrary Data (the same code, shown three times as a slide build)

    import cPickle
    import lz4
    from bson.binary import Binary

    _CHUNK_SIZE = 15 * 1024 * 1024  # 15MB

    class PickleStore(object):

        def write(self, collection, version, symbol, item):
            # Try to pickle it. This is best effort
            pickled = lz4.compressHC(cPickle.dumps(item))
            # Split the compressed blob into <=15MB segments
            for i in xrange(len(pickled) / _CHUNK_SIZE + 1):
                segment = {'data': Binary(pickled[i * _CHUNK_SIZE:(i + 1) * _CHUNK_SIZE])}
                segment['segment'] = i
                # checksum() is a helper defined elsewhere in the codebase
                sha = checksum(symbol, segment)
                collection.update({'symbol': symbol, 'sha': sha},
                                  {'$set': segment,
                                   '$addToSet': {'parent': version['_id']}},
                                  upsert=True)

Because segments are upserted on (symbol, sha), identical content is stored only once: a new version simply adds its _id to the 'parent' set of chunks that haven't changed.
22. Implementation – Arbitrary Data

    class PickleStore(object):

        def read(self, collection, version, symbol):
            # Fetch this version's segments in order and reassemble the blob
            data = ''.join([x['data'] for x in
                            collection.find({'symbol': symbol, 'parent': version['_id']},
                                            sort=[('segment', pymongo.ASCENDING)])])
            return cPickle.loads(lz4.decompress(data))
23. Implementation – DataFrames

    def do_write(df, version):
        records = df.to_records()
        version['dtype'] = str(records.dtype)
        chunk_size = _CHUNK_SIZE / records.dtype.itemsize
        ... chunk_and_store ...

    def do_read(version):
        ... read_chunks ...
        data = ''.join(chunks)
        dtype = np.dtype(version['dtype'])
        recs = np.fromstring(data, dtype=dtype)
        return DataFrame.from_records(recs)
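The chunk_and_store / read_chunks steps are elided on the slide. Below is a minimal, self-contained sketch of the round trip, modernised to Python 3 and using an in-memory list in place of MongoDB; it stores records.dtype.descr rather than str(records.dtype) so np.dtype() can reconstruct the type without eval. This is an assumption about the elided code, not the deck's implementation:

    import numpy as np
    import pandas as pd

    _CHUNK_SIZE = 15 * 1024 * 1024  # 15MB, as on slides 19-21

    def do_write(df, version, chunks):
        records = df.to_records()
        version['dtype'] = records.dtype.descr           # list of (name, format)
        rows_per_chunk = _CHUNK_SIZE // records.dtype.itemsize
        step = rows_per_chunk * records.dtype.itemsize   # chunk on row boundaries
        raw = records.tobytes()
        for i in range(0, len(raw), step):
            chunks.append(raw[i:i + step])

    def do_read(version, chunks):
        data = b''.join(chunks)
        recs = np.frombuffer(data, dtype=np.dtype(version['dtype']))
        return pd.DataFrame.from_records(recs)

    # Round trip:
    version, chunks = {}, []
    df = pd.DataFrame({'price': [100.0, 101.5, 102.0]},
                      index=pd.date_range('2014-01-01', periods=3))
    do_write(df, version, chunks)
    print(do_read(version, chunks))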
24. Results
25. Results – Performance Once a Day Data

[Chart: read performance vs. flat files on NFS, for a random market.]
26. Results – Performance One Minute Data

[Chart: read performance vs. HDF5 files, for a random instrument.]
27. Results – TickStore – 8 parallel

[Chart: random E-Mini S&P contract from 2013.]
28. Results – TickStore

[Chart: random E-Mini S&P contract from 2013.]
29. Results – TickStore Throughput

[Chart: random E-Mini S&P contract from 2013.]
30. Results – System Load

[Chart: system load with N tasks = 32, comparing OtherTick and Mongo (x2); random E-Mini S&P contract from 2013.]
31. Conclusions

Built a system to store data of any shape and size
- Reduced impedance between the Python language and the data store

Low latency:
- 1x day data: 4ms for 10,000 rows (vs. 2,210ms from SQL)
- One-minute / tick data: 1s for 3.5M rows in Python (vs. 15s – 40s+ from OtherTick)
- 1s for 15M rows in Java

Parallel access:
- Cluster with 256+ concurrent data accesses
- Consistent throughput – little load on the Mongo server

Efficient:
- 10-15x reduction in network load
- Negligible decompression cost (lz4: 1.8Gb/s)
32. Questions?
