Who is Adfin?
What special sauce did we build … very large OLAP DB.
Goals:
Have you take a look at CephFS … I might be one of the few people talking about it.
Realize that it’s possible for your organization to develop some expertise in-house… and contribute.
Name implies a combination of Advertising + Finance Markets. Two home town industries (Madison Ave and Wall St)
Using tools and knowledge pioneered by the financial industry.
Most media (by volume) is bought and sold programmatically, à la HFT.
It’s an opaque marketplace.
Bloomberg … Information Platform, S&P… Indices, Market … aggregating market data (CDS)
I am going to keep butchering these analogies.
Pictures of some of the tools we’ve built.
Real time analysis into your own data and market data.
Run a query, get a result… lots of variables.
Forecasting
The advertising market is larger than the financial market… in terms of volume of transactions.
Each impression is worth a tiny fraction of a penny.
When I looked at the number of transactions for an exchange like NASDAQ… it’s around 50 million; NYSE, 100 million.
A lot of duct tape, but also a lot of efficiency.
This number is not getting smaller. All advertising is going to be digitally bought and sold and that day is coming.
Distributed, relational database for running real-time analytics queries on very large time series data. KDB on many, many nodes.
Some fun things. It’s a relational model, but not SQL. 90% of queries are sums or group-bys.
Data is sharded into partitions by time. Spread across many nodes.
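A minimal sketch of that model, not our actual engine (all names here are hypothetical): rows bucketed into daily partitions pinned to nodes, with the group-by-sum workhorse query scattered to each partition and the partial sums merged.

```python
from collections import defaultdict
from datetime import datetime

def partition_key(ts):
    # Shard rows into daily partitions by timestamp.
    return ts.strftime("%Y-%m-%d")

def shard(rows, nodes):
    # Bucket rows into daily partitions, then pin each partition to a node.
    partitions = defaultdict(list)
    for ts, key, value in rows:
        partitions[partition_key(ts)].append((key, value))
    return {day: (hash(day) % nodes, part) for day, part in partitions.items()}

def group_by_sum(partition):
    # The per-partition "90% case": a sum grouped by key.
    sums = defaultdict(float)
    for key, value in partition:
        sums[key] += value
    return sums

def query(sharded):
    # Scatter the aggregation to every partition, then merge the partials.
    total = defaultdict(float)
    for _node, part in sharded.values():
        for key, s in group_by_sum(part).items():
            total[key] += s
    return dict(total)

rows = [
    (datetime(2014, 1, 1, 9),  "adx",  0.0001),  # one impression's worth
    (datetime(2014, 1, 1, 17), "adx",  0.0002),
    (datetime(2014, 1, 2, 9),  "nyse", 0.5),
]
print(query(shard(rows, nodes=4)))
```

Because each partition aggregates independently, the merge step only sees one small partial result per partition, which is what makes the scatter/gather cheap.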
We get pretty amazing single-node performance: 100s of millions of rows a second per partition.
There’s been a lot of research into this stuff. Based on research into compression, indexing, and query execution, all from the last 3 to 4 years.
For large datasets our goal is to answer really large queries in under 10 seconds. In reality, most things we do answer in under 1 second.
Why? Because the dataset is huge.
Also, we’re a bit crazy.
Before, we were storing it all on local disks.
Couple problems:
Redundancy?
Can’t grow computation without storage, and vice versa.
Looked into Ceph:
Scalable storage, just throw more machines at it… don’t worry about topology too much.
We could separate storage from computation.
No SPOF, redundancy everywhere.
Pretty good speed for DFS.
We can leverage the kernel. The kernel client versus doing it directly: page cache, etc… A common theme.
“Beta company, okay using a beta product.” We can get under the hood.
The early start was a bit rough. There were lots of bugs. We found lots of bugs.
Community was great, esp Yan.
Yan fixed our last bug around the end of 2013… haven’t had a single problem since.
We’re not storing multi-PB yet, but we’ve processed multi-PB and haven’t had a problem.
We lost some performance as a result of this. Network latency, overhead, Ceph overhead.
We can also go even cheaper without Ceph nodes / network.
Our access pattern, write once read many (mostly true).
Most recent data is most often used (working set larger than RAM, smaller than the full DFS).
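That access pattern is exactly what a read-through local cache exploits. A toy sketch of the fscache-style idea (hypothetical class, not the real fscache API): files from the slow backing store are copied to fast local disk on first read, with LRU eviction so the local copy tracks the hot working set.

```python
import os
import shutil
from collections import OrderedDict

class ReadThroughCache:
    # Write-once-read-many files from a slow backing store (the DFS) are
    # copied to fast local disk on first read; LRU eviction keeps only
    # the hot working set local.
    def __init__(self, backing_dir, cache_dir, max_files=2):
        self.backing = backing_dir
        self.cache = cache_dir
        self.max_files = max_files
        self.lru = OrderedDict()  # file name -> local cached path

    def read(self, name):
        if name in self.lru:
            self.lru.move_to_end(name)  # cache hit: mark as recently used
        else:
            # Cache miss: fetch from the backing store onto local disk.
            dst = os.path.join(self.cache, name)
            shutil.copyfile(os.path.join(self.backing, name), dst)
            self.lru[name] = dst
            if len(self.lru) > self.max_files:
                _, coldest = self.lru.popitem(last=False)  # evict LRU entry
                os.unlink(coldest)
        with open(self.lru[name], "rb") as f:
            return f.read()
```

The write-once property is what makes this simple: cached copies never go stale, so there is no invalidation problem, only eviction.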
The Linux kernel people have really put hundreds of man-years into scalability.
I don’t want to discourage anybody … we did something not smart, picked the hardest problem.
It required us to know a lot of things about Ceph, kernel, concurrency.
I would pick something simpler next time.
There are bugs in the other parts of the kernel?
So one of the reasons we wanted to do this work in the kernel was concurrency, so our benefit was also our PITA.
We got it into the kernel’s Ceph code base around 3.13.
A bunch of bug fixes from external folks. We’ve exposed issues with the FSCache code.
We’ve fixed a bunch of concurrency bugs that only happen in the error path of FSCache under VMA pressure. A lot of filesystems benefit.
We’re really happy with performance… we’ve made a good bet on the kernel.
We’re able to really drive fscache up to the speed of the disks we have.
So despite the initial learning curve … we want to contribute work.
Where we can leverage our knowledge … performance.
We’ve built a lot of things in our system for improving latency. Learned what to do, what not to do, where to apply lockless algorithms.
Readv2 syscall… Helps all applications that do both IO-bound and CPU-bound work.
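The idea (a first-class non-blocking read, so a thread can try the page cache before dispatching to a worker pool) later landed in mainline as the preadv2() syscall with a per-call RWF_NOWAIT flag. A rough sketch of the pattern via Python’s os.preadv (Python 3.7+, Linux); the helper name read_nowait is mine, not a real API.

```python
import os

def read_nowait(fd, nbytes, offset=0):
    # Hypothetical helper: try a non-blocking read first (RWF_NOWAIT only
    # succeeds if the data is already in the page cache), then fall back
    # to a normal blocking read -- which a real application would instead
    # hand off to a thread pool so the event loop never stalls on disk.
    buf = bytearray(nbytes)
    nowait = getattr(os, "RWF_NOWAIT", None)  # absent on older platforms
    if nowait is not None:
        try:
            n = os.preadv(fd, [buf], offset, nowait)
            return bytes(buf[:n])
        except OSError:
            pass  # would block (or flag unsupported): fall through
    n = os.preadv(fd, [buf], offset)  # blocking fallback
    return bytes(buf[:n])
```

The win for mixed IO/CPU workloads is that the cheap cached case never pays the cost of a thread-pool round trip.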
Thanks for listening to me.
Hopefully it was a good story of what we’re up to… how we’re leveraging Ceph.
Motivating to help and contribute.
It’s nice to have a vendor you can call up and yell at when things aren’t working, but it’s even better to be able to guide the tool to do what you want.
The Ceph community is great, there’s so many people contributing to so many different projects.