Jean Maynier (Kpler): “Python, SQLAlchemy and Scrapy for real-time data processing at Kpler”
Bio: I lead the engineering and operational activities of Kpler (formerly eCO2market).
Founded in Paris in 2009, Kpler is an intelligence company providing transparency solutions in energy markets. We develop proprietary technologies that systematically aggregate data from hundreds of sources, ranging from logistics and commercial to governmental and shipping databases. By connecting the dots across fragmented information landscapes, we are able to provide our clients with unique real-time market coverage.
4. Why Python?
Simple, fast to prototype
Data libraries (SQLAlchemy, pandas, scikit-learn)
Data-oriented community
Scrapy: best scraping framework
Step 1
9. Metrics trends
Present
DBs < 100 GB
50 sources
500 vessels
Positions every 3 min
Future
1-10 TB
100 sources
10k to 100k vessels
Positions every 30 s
Performance problem!
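A back-of-the-envelope check of the figures above, assuming every vessel reports exactly once per interval (a simplification that is not stated on the slide; the function name is illustrative):

```python
def positions_per_second(vessels: int, interval_s: float) -> float:
    """Average position messages per second if each vessel reports once per interval."""
    return vessels / interval_s


# Present: 500 vessels, one position every 3 minutes.
present = positions_per_second(500, 3 * 60)

# Future: 10k to 100k vessels, one position every 30 seconds.
future_low = positions_per_second(10_000, 30)
future_high = positions_per_second(100_000, 30)

print(f"present: {present:.1f} positions/s")
print(f"future:  {future_low:.0f} to {future_high:.0f} positions/s")
print(f"growth:  x{future_low / present:.0f} to x{future_high / present:.0f}")
```

Under that assumption the ingest rate goes from about 3 positions per second to somewhere between a few hundred and a few thousand, a 120x to 1200x increase, which is the scale of the performance problem.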
10. Batch to data streaming
Parallelization
Granularity: item vs. source
Handle failure, no data loss, monitoring
Akka Streams, Spark Streaming, Celery, Storm
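The item vs. source trade-off can be illustrated with a minimal sketch using only the Python standard library. This is not Storm and not Kpler's code; `parse_item`, `process_source` and `process_items` are hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_item(raw: str) -> int:
    """Hypothetical per-item step; raises ValueError on malformed input."""
    return int(raw)


def process_source(items):
    """Source-level granularity: one bad item fails the whole batch."""
    return [parse_item(raw) for raw in items]


def process_items(items, workers=4):
    """Item-level granularity: items run in parallel, failures are set aside."""

    def safe(raw):
        try:
            return ("ok", parse_item(raw))
        except ValueError:
            return ("failed", raw)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order in its results.
        results = list(pool.map(safe, items))

    done = [value for status, value in results if status == "ok"]
    failed = [value for status, value in results if status == "failed"]
    return done, failed


batch = ["1", "2", "oops", "4"]
done, failed = process_items(batch)
print(done, failed)
```

With item-level granularity the malformed record ends up in `failed`, where it could be retried or reported to monitoring, while the other items go through. Calling `process_source(batch)` on the same input raises `ValueError` and returns nothing for the whole source.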
Step 5
11. Storm at Kpler: POC
Website sources
API sources
Scraping
Python/Scrapy
ETL + models
Position Dock/undock
Custom prices Match prices
Next
Destination
Trades
Cargo volumes