A bit about MediaMath: we’re an ad tech company, and we write software to buy digital media. So say you go to a site and see a banner ad at the top: there was an auction to decide who would get that space and how much they would pay. We build systems to ingest, analyze, and make decisions on those bid opportunities, and we use machine learning to optimize our bidding.
This story starts in 2013, with a description of our data warehouse at the time.
Three Netezza servers would store and process all of our logs into standard reports
The Netezza servers held separate copies of the source data for standard reports
Push reports to Oracle data marts
All the lines here, the glue of this system, are SQL, executed by shell scripts
Netezza servers would store 7-13 days of logs before purging
The architecture diagram is pretty true to life, so you may have noticed that there’s no dedicated QA environment
We have a dev server or two, but with the amount of data we deal with, it’s costly to keep an up-to-date QA environment, which leads to a mismatch with production.
Similarly, we had no environment for ad-hoc analytics. Simply selecting fields – no aggregations, nothing fancy – would cause reporting delays
And so with these in mind, the question of scaling was a frightening one. Updating workflows and creating new ones was frustrating; we couldn’t just keep copying our logs from server to server (we needed to scale vertically as well); and adding more shell and SQL would only lead to more problems.
This is the organizational data flow.
The data warehouse team held most of the data engineers at MediaMath. They would push reports to where the reporting team could lay an API over them, but the reporting team was mostly DBAs and API developers, and only after reporting did the rest of the company get its first crack at the data. We had unofficial links to bypass reporting, but those were very tightly controlled
The “productized” version of our log-level data was custom FTP transfers.
Would compete for resources with production workflows
The FTP server would run out of space, usually after hours, and you’d get into the office the next day to a client who was upset that you had deleted their data.
All of this led to a heavy reliance on canned reports, served via our API. Some of these reports were updated three times a day, some were updated once a day. Canned reports are great, but with the aforementioned developer difficulties, we just couldn’t keep pace.
Log-level data is the lifeblood of our reporting. But for the longest time our logs, the greatest source of insights, were also one of the hardest things to get at.
So this was the state of affairs around 2013, and these are the issues that led to this process of “data liberation”. The name accurately describes our goal: we wanted to break down the silos that existed within our company (and outside our data warehouse). In short, we needed to remove infrastructure as a limiting factor in data sharing, both internally and externally, and this drove our transition from data warehouse to data platform
Need to leave behind our monolithic, big-box data warehouse
No more single-machine processing, much more fault-tolerant
Standardize access to data and make it easier for folks of all backgrounds to get real value
Those were the super-high-level goals, and we saw that central to all of them would be decoupling storage and computation
Need to make sure extracting data doesn’t interfere with processing data.
We did this along two axes: technically and organizationally.
Technically: we decided to move our data warehouse to the cloud, and in the process move to more of a platform
A little later on I’ll discuss an organizational decoupling of storage and computation
If we need 40 nodes for 2 hours, we can get that.
Spot instances: leftover inventory that you bid on
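To make the elasticity concrete, here’s a minimal sketch of what a spot request could look like through the AWS SDK for Java (v1), called from Scala; the AMI, instance type, and bid price are all hypothetical placeholders, not our actual configuration:

```scala
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder
import com.amazonaws.services.ec2.model.{LaunchSpecification, RequestSpotInstancesRequest}
import scala.collection.JavaConverters._

object SpotRequestSketch extends App {
  val ec2 = AmazonEC2ClientBuilder.defaultClient()

  // Hypothetical launch spec: 40 worker nodes for a short-lived job.
  val spec = new LaunchSpecification()
    .withImageId("ami-12345678")   // placeholder AMI
    .withInstanceType("r3.xlarge") // placeholder instance type

  val request = new RequestSpotInstancesRequest()
    .withSpotPrice("0.10") // max bid in USD/hour for the leftover inventory
    .withInstanceCount(40)
    .withLaunchSpecification(spec)

  // Returns request IDs; the instances come up if the bid clears.
  val result = ec2.requestSpotInstances(request)
  result.getSpotInstanceRequests.asScala.foreach(r => println(r.getSpotInstanceRequestId))
}
```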
Redshift is marketed as Amazon’s “data warehouse” solution, but we saw it as a more suitable replacement for our Oracle data marts, since it allowed us to de-aggregate some of our reports (i.e., allow custom date ranges instead of pre-aggregating by “yesterday”, “last 7 days”, etc.)
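To make “de-aggregate” concrete, here’s a minimal sketch, with a hypothetical cluster endpoint and schema, of the kind of arbitrary-date-range query this enables (plain JDBC from Scala; assumes a Redshift-compatible JDBC driver on the classpath):

```scala
import java.sql.DriverManager

object CustomRangeQuery extends App {
  // Hypothetical endpoint, credentials, and schema.
  val conn = DriverManager.getConnection(
    "jdbc:redshift://example.abc123.us-east-1.redshift.amazonaws.com:5439/reports",
    "reporting_user", "secret")

  // The same rollup a canned report would pre-compute, but over
  // whatever date range the caller asks for.
  val stmt = conn.prepareStatement(
    """SELECT campaign_id, SUM(spend) AS spend
      |FROM impressions
      |WHERE log_date BETWEEN ? AND ?
      |GROUP BY campaign_id""".stripMargin)
  stmt.setDate(1, java.sql.Date.valueOf("2015-03-01"))
  stmt.setDate(2, java.sql.Date.valueOf("2015-03-19"))

  val rs = stmt.executeQuery()
  while (rs.next()) println(s"${rs.getLong("campaign_id")}\t${rs.getBigDecimal("spend")}")
  conn.close()
}
```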
Direct S3 access is our solution to the data access problem, so I’ll zoom in on that a little more.
Data is generated by various teams within MediaMath; we enrich the logs and store them partitioned by organization. Identity and Access Management (IAM) is the Amazon service we use for access control, and from there clients can safely read their data (and only their data) and process it however they like. This, essentially, is our replacement for the FTP transfers we used to set up.
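Here’s a sketch of what that looks like from a client’s side, with a hypothetical bucket and prefix layout; the point is that a client’s IAM credentials are scoped to their organization’s prefix, so listing their own data works and anything else is denied:

```scala
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import scala.collection.JavaConverters._

object ClientExportReader extends App {
  // Credentials come from the environment; an IAM policy on this
  // client's user limits s3:ListBucket / s3:GetObject to the
  // "org=acme/" prefix (bucket and layout here are hypothetical).
  val s3 = AmazonS3ClientBuilder.defaultClient()

  val listing = s3.listObjectsV2("mm-log-exports", "org=acme/impressions/2015/03/19/")
  listing.getObjectSummaries.asScala.foreach { obj =>
    println(s"${obj.getKey}\t${obj.getSize} bytes")
  }
  // The same call against another organization's prefix would fail
  // with AccessDenied.
}
```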
So that’s data access, and the same setup sets the stage for the developer experience. First: we get to say “yes” more
Much like two clients can run processes side by side, we can run our QA jobs side by side
With the maturation of the Hadoop ecosystem, there seems to be a new “big data analytics framework” every couple of months, so we don’t force developers to be dogmatic about a single system
Part of lowering the barrier to entry was making it easier to bring in users from more backgrounds.
Again, select what you want from a dropdown and then hit “launch”
Altogether this lets our platform serve as the foundation for data-driven applications, or act as “big data for dummies”
To be clear, this is not Qubole’s official “greatest hits” compilation, but rather what we use at MediaMath
So that’s where Data Liberation led us, but in reality we bridge the two systems
Here’s a look at how our old and new architectures sit side by side, with load balancing done at the service layer to point to either AWS or our own data center. AWS lets us open up access along the way. Sproxy updates a DynamoDB table with filenames and upload times; similarly, we keep a table in Netezza with filenames and batch numbers
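As an illustration of what that bookkeeping buys us, here’s a sketch of a consistency check between the two sides; the table and column names are hypothetical, and DynamoDB scan pagination is ignored for brevity:

```scala
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.ScanRequest
import java.sql.DriverManager
import scala.collection.JavaConverters._

object BridgeConsistencyCheck extends App {
  // Filenames Sproxy recorded on the AWS side.
  val dynamo = AmazonDynamoDBClientBuilder.defaultClient()
  val awsFiles = dynamo.scan(new ScanRequest().withTableName("uploaded_files"))
    .getItems.asScala.map(_.get("filename").getS).toSet

  // Filenames recorded on the Netezza side.
  val conn = DriverManager.getConnection("jdbc:netezza://dw-host:5480/logs", "user", "secret")
  val rs = conn.createStatement().executeQuery("SELECT filename FROM loaded_files")
  val dcFiles = scala.collection.mutable.Set.empty[String]
  while (rs.next()) dcFiles += rs.getString(1)
  conn.close()

  println(s"on AWS but not in the data center: ${(awsFiles -- dcFiles).size}")
  println(s"in the data center but not on AWS: ${(dcFiles.toSet -- awsFiles).size}")
}
```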
Not without new challenges
The effort to keep old and new systems consistent meant that we could migrate in pieces, not just our code but our people too. We could take time to properly learn new things.
Migrate from SQL to Scala
Migrate from RDBMS to Hadoop
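As a taste of that shift, here’s a minimal Spark sketch, with hypothetical paths and schema, of the kind of daily rollup that used to live in a SQL script plus shell glue:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object DailyRollup extends App {
  val spark = SparkSession.builder().appName("daily-rollup").getOrCreate()

  // Raw logs on S3, partitioned by organization (paths hypothetical).
  val impressions = spark.read.parquet("s3a://mm-log-exports/impressions/")

  // The same aggregation the old SQL performed, now a distributed job.
  impressions
    .filter("log_date = '2015-03-19'")
    .groupBy("advertiser_id", "campaign_id")
    .agg(sum("spend").as("spend"))
    .write.parquet("s3a://mm-reports/daily/2015-03-19/")

  spark.stop()
}
```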
So that’s where we are today. I discussed the goals of data liberation and how we solved (or tried to solve) for them, and now I’m going to discuss the challenges and questions we face moving forward. I’ll start by talking about life after liberation.
This is where our organizational decoupling of storage and computation happened
Decoupling storage (data platform) from processing (anywhere)
The cloud isn’t what’s important in itself; what was really important was decoupling storage and computation
S3 is a great touchpoint to help break down the walled garden of AWS and help bridge the gap between on-premises hardware and the cloud
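For example, with the s3a connector on the classpath, the same Hadoop FileSystem API a job would use against on-premises HDFS also works against the platform’s bucket, so the compute can live in either place (bucket and path hypothetical):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object CrossEnvListing extends App {
  val conf = new Configuration()
  // Swap the URI for hdfs://... and the same code runs against the
  // data center's cluster instead of S3.
  val fs = FileSystem.get(new URI("s3a://mm-log-exports/"), conf)
  fs.listStatus(new Path("s3a://mm-log-exports/impressions/")).foreach(st => println(st.getPath))
}
```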