Welcome to the 4th edition of Data Driven Rijnmond! Glad you all survived the storm and were still hyped enough to come by and listen to a talk about Airflow ;-) We are very proud to welcome you all at our new office building and we hope it will be the place of many more meetups in the future. For this occasion the meetup is somewhat Datlinq themed, but we will stay away from sales pitches. Tonight we’d like to share with you some of the tools and ideas Datlinq is using, and why.
As always we’ll have both an engineering and a data science talk. As one of the data engineers at Datlinq I’ll start you off with the engineering talk. After a short break, Andrew Ho, our product manager for apps, will give a short, energising talk about collecting data in the food domain with apps. Finally, our data scientist Martijn Spitters will finish the evening with a talk about outlet matching & enrichment.
For these talks to make sense you probably have to know a little about what it is we do here at Datlinq. As promised, no sales talk, but it will give the context necessary to follow the overarching story of the talks by my colleagues and me.
Datlinq is a company that operates in the food service domain. In short: we help food service professionals by informing them with data and supporting them with tools about opportunities in this domain.
The data we use and supply is comprehensive location data for most of Europe, like restaurants, coffee bars, stores, bakeries and other places that are potential outlets for food service brands. We work for brands like … and we use our data (combined with theirs) to make matches between their brands and locations.
We gather, process and enrich this location data from a wide range of (digital) online sources.
So without further ado, let’s jump into this data gathering & enriching process.
I want to take you on a journey of building a Spark pipeline in Google Cloud, orchestrated via Airflow. In the first half of this talk I will present slides about how we came to use Spark & Airflow in Google Cloud; in the second half I’ll try to give a real-life demo of the stuff I just described. It’s ok if at this point you have no idea what these tools and systems are. I hope to explain bottom up what our challenges are, how we decided to solve them and how these tools fit in.
Our journey starts with data.
Everybody is in love with data; big data is the new oil, they say. But I’m inclined to believe that these people know as much about working with big data as I do about working with oil.
Data is in itself completely and utterly useless. Data is garbage. One of the problems with data is that it’s stale the moment you get it. Your source says it’s new, but who says they know? There is no chain of custody, nor any indication that the data you receive is accurate, up to date or even usable. Different sources may even copy from each other, perpetuating the problem. So you store data from different sources somewhere in some files, a database, or maybe even Hadoop. Maybe you’ll use it at some point, maybe you won’t. But with the price of storage plummeting continuously you never throw it away. That would be wasteful…
It’s not hard to get data nowadays. We use many open data sources and APIs to ingest bulks of data. Think for example about … data, which we’ll use in the demo. The moment you get your JSON response with a like count and some detailed information, it’s dead data with a half-life that determines its usability in the future. But this data will also contain information typed by the owner, which can contain errors (wrong zipcodes or misspelled street names), lies (best pizza in town) or inaccuracies (out-of-date menus and pricing). There may also be confusion caused by duplicated data: event locations that duplicate their location on … for each event. So the data we get from sources is in itself quite worthless.
Then why work with data at all? We do believe that somewhere in these mountains of data garbage some useful nuggets of data are hidden that we can recycle out of this dump and turn into information.
This information can be used to generate knowledge, which in turn can be used to create insights.
Doing this requires huge amounts of preprocessing, cleaning and transforming of the data. In the demo I’ll show you how you can build these ETL jobs (Extract, Transform, Load) and how a … JSON data source can be turned into a Datlinq Location with basic location information (address, geo code, phone, email, website, etc.), extended with informational tags, scores for the likelihood of existence and classifications of certain properties. This is the first step towards creating information out of this data. But as mentioned, processed data that is inaccurate is just nicely structured data that is inaccurate. Now it’s time to improve this accuracy.
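To make this a bit more concrete, here is a minimal sketch in plain Python (so not our actual Spark job) of what such an ETL step could look like. The input field names, the Dutch-style zipcode rule and the "popular" tag with its threshold are all made up for illustration; a real source will look different.

```python
import json
import re

# Hypothetical raw record; these field names are assumptions, not a real source schema.
RAW = ('{"name": " Pizzeria Napoli ", "street": "Coolsingel 1", '
       '"zip": "3011ad", "phone": "010-123 45 67", "likes": 120}')


def normalize_zipcode(value):
    """Return a Dutch-style zipcode as '1234 AB', or None if it does not fit that pattern."""
    match = re.fullmatch(r"\s*(\d{4})\s*([A-Za-z]{2})\s*", value or "")
    return f"{match.group(1)} {match.group(2).upper()}" if match else None


def to_location(raw_json):
    """Extract one raw JSON record and transform it into a cleaned location dict."""
    record = json.loads(raw_json)
    phone_digits = re.sub(r"\D", "", record.get("phone", ""))
    return {
        "name": record.get("name", "").strip(),
        "street": record.get("street", "").strip(),
        "zipcode": normalize_zipcode(record.get("zip")),
        "phone": phone_digits or None,
        # Illustrative tag only: the threshold of 100 likes is an arbitrary assumption.
        "tags": ["popular"] if record.get("likes", 0) >= 100 else [],
    }


location = to_location(RAW)
print(location)
```

Note that a zipcode that cannot be parsed becomes None instead of being guessed. That keeps the "what to keep, what to change" decision explicit for later steps.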
The trick is that data is better combined. If we can combine data from different sources that describe the same entity, we can reduce the risk of one of those sources being stale or incorrect. The more combinations we can make, the more trustworthy our data becomes. It is then ready to be processed into information: Datlinq Locations.
We call these combinations ‘crosswalks’ and one of our goals is to give every location as many crosswalks as possible, both to gather more detailed information (some sources provide reviews, others menus, etc.) and to verify whether the location is still in business (we check these crosswalks periodically).
In the demo we’ll use a different source of data that overlaps somewhat with the … data. We ETL this data into a similar structure before combining the two.
Even though the solution of combining this data seems obvious, the meticulous part is processing and cleaning this data so it is ready for combining, because with each transformation you are ‘irreversibly’ changing the data down the line. What to keep, what to change, what to merge and what to split are the hard questions.
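A toy illustration of the crosswalk idea, again in plain Python. This sketch matches on an exact key (normalized name plus zipcode), which is a deliberate simplification and not how our matching works; the source names, ids and records are invented.

```python
from collections import defaultdict


def match_key(location):
    """Crude matching key: lowercased alphanumeric name plus zipcode without spaces."""
    name = "".join(ch for ch in location["name"].lower() if ch.isalnum())
    zipcode = location["zipcode"].replace(" ", "").upper()
    return (name, zipcode)


def build_crosswalks(*sources):
    """Group ids from several sources that appear to describe the same location.

    Each source is a (source_name, records) pair. The result maps a match key
    to a dict of source_name -> id in that source, i.e. the crosswalks.
    """
    grouped = defaultdict(dict)
    for source_name, records in sources:
        for record in records:
            grouped[match_key(record)][source_name] = record["id"]
    return dict(grouped)


# Invented example records from two hypothetical sources.
source_a = [
    {"id": "a1", "name": "De Unie", "zipcode": "3012 CN"},
    {"id": "a2", "name": "Bakkerij Jansen", "zipcode": "3011 AD"},
]
source_b = [{"id": "b7", "name": "de unie!", "zipcode": "3012cn"}]

crosswalks = build_crosswalks(("source_a", source_a), ("source_b", source_b))
for key, ids in crosswalks.items():
    print(key, ids)
```

A location that shows up in both sources ends up with two crosswalks, while a location known to only one source has a single crosswalk and is therefore harder to verify.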
Fortunately this is something we have been doing at Datlinq for a long time. We have a lot of experience with gathering, cleaning and matching data.
So far I have not mentioned any of the tools that were advertised for this talk. So if we have all this experience and all this data and all these great clients, why do we need any of these tools at all? The problem is that in the last few years the floodgates have been opened and data keeps pouring in from all kinds of sources into our data lake, aka data garbage dump. Our challenge was to change our semi-automatic cleaning & combining process into a fully automated one, based on machine learning, that can handle the volume and variety of data that flows through our system.
It’s not feasible any more to check all this data by hand or with small scripts that run sporadically.
No, we need a tool that can effortlessly process and store high volumes of data in a scalable way. The best tool on the market these days seems to be Apache Spark. Spark offers a way to distribute (map) your workload in a fault-tolerant way across many machines and combine the results back into a single data source.
Spark is the engine that runs all processes. You could build one huge monolithic Spark job that contains your entire data pipeline. Even though that seems easy, and will probably be the fastest solution, it’s a horrible idea, because a failure at the end means failure of the entire pipeline. It’s also hard to split out certain jobs so they can run on different clusters. You are just replacing your single-threaded opaque pipeline with a distributed, scalable opaque pipeline.
So the best approach is in fact to build smaller jobs.
The best practice in building these Spark jobs is to build many small ones that work together, where the output of one is the input of another. This way you can build single-responsibility jobs that do one specific thing without worrying about the entire pipeline.
You just have to define different types of Spark jobs: ETL jobs that turn raw data into a semi-structured clean dataset, matching jobs that combine these datasets into combined datasets, and enrichment jobs that turn these combined datasets into enriched data by adding new features. There are also ML jobs, like classification jobs that use these features to predict.
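The "output of one is the input of another" idea can be sketched without Spark at all. Below, each job is a plain Python function that only knows its input and output path. The JSON-lines files, the job names and the `name_length` feature are stand-ins I made up; in a real pipeline these would be Spark jobs reading and writing datasets in cloud storage.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Read a JSON-lines file into a list of dicts."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def write_jsonl(path, records):
    """Write a list of dicts as a JSON-lines file."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def etl_job(raw_path, clean_path):
    """ETL: trim names and drop records that have no usable name."""
    clean = [
        {**record, "name": record["name"].strip()}
        for record in read_jsonl(raw_path)
        if record.get("name", "").strip()
    ]
    write_jsonl(clean_path, clean)


def enrichment_job(clean_path, enriched_path):
    """Enrichment: add a (deliberately trivial) derived feature."""
    enriched = [
        {**record, "name_length": len(record["name"])}
        for record in read_jsonl(clean_path)
    ]
    write_jsonl(enriched_path, enriched)


def run_pipeline(raw_records):
    """Chain the jobs: every job reads what the previous one wrote."""
    with tempfile.TemporaryDirectory() as workdir:
        raw, clean, enriched = (
            Path(workdir) / f"{name}.jsonl" for name in ("raw", "clean", "enriched")
        )
        write_jsonl(raw, raw_records)
        etl_job(raw, clean)
        enrichment_job(clean, enriched)
        return read_jsonl(enriched)


result = run_pipeline([{"name": "  Bar Lokaal "}, {"name": "   "}])
print(result)
```

Because every job only depends on files, you can rerun or replace a single step without touching the rest, which is exactly what the monolithic job does not give you.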
In the demo I’ll try to convey how you could approach this problem, but bear in mind that there are better ways. I just kept it simple for the demo.
The downside of all these separate jobs is that all these loose components have to be orchestrated into a single functional pipeline. In a flock of birds this happens through naturally emerging (flocking) behaviour. Unfortunately big data tools are not ready for that (yet).
In the olden days we would create huge lists of cronjobs that trigger certain jobs at certain time intervals, but this has many issues. Cronjobs run regardless of what happened in the past. They are hard to schedule, since you have to estimate the time each job takes to schedule the next. If one fails, the rest will keep running and probably fail too. Logging is hard and scaling across multiple machines seems a guarantee for headaches.
No, what we actually need is a tool that allows for easy composition and scheduling of complex workflows with dependencies, and that also monitors these workflows, retries a number of times in case of failure and notifies us of the status of each job.
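To show what that wish list boils down to, here is a tiny orchestrator sketch in plain Python (3.9 or newer, for `graphlib`). It is not how any real scheduler is implemented; it only illustrates dependency ordering, retries and status reporting. The task names and the failing tasks are invented, and every dependency is assumed to be a known task.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies, max_retries=2):
    """Run tasks in dependency order, retrying failures and skipping downstream tasks.

    tasks: task name -> callable without arguments.
    dependencies: task name -> set of upstream task names (all assumed to be in tasks).
    Returns: task name -> "success", "failed" or "skipped".
    """
    status = {}
    graph = {name: dependencies.get(name, set()) for name in tasks}
    for name in TopologicalSorter(graph).static_order():
        # A task only runs when everything it depends on has succeeded.
        if any(status[upstream] != "success" for upstream in graph[name]):
            status[name] = "skipped"
            continue
        for attempt in range(1 + max_retries):
            try:
                tasks[name]()
                status[name] = "success"
                break
            except Exception as error:
                # Stand-in for a real notification (mail, chat message, ...).
                print(f"{name} failed on attempt {attempt + 1}: {error}")
        else:
            status[name] = "failed"
    return status


attempts = {"etl": 0}


def flaky_etl():
    """Fails the first time, succeeds on the retry."""
    attempts["etl"] += 1
    if attempts["etl"] < 2:
        raise RuntimeError("temporary source outage")


def broken_matching():
    """Always fails, so everything downstream should be skipped."""
    raise ValueError("bad input")


status = run_workflow(
    tasks={"etl": flaky_etl, "matching": broken_matching, "enrichment": lambda: None},
    dependencies={"matching": {"etl"}, "enrichment": {"matching"}},
)
print(status)
```

Compare this with cron: here a failed job stops everything that depends on it, and nothing has to guess how long the previous job will take.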
One such tool is Apache Airflow. Written in Python and maturing pretty fast, it meets all our requirements and has a growing plugin library that allows it to work with Amazon, Azure and Google Cloud.
In the demo I’ll show you how we use Spark job plugins for Airflow to trigger our individual jobs and have them depend on each other.
Now that we have our jobs in a row we need a place to run them (besides our laptop). The best solution is, in my opinion, the cloud. And yes, the cloud is just somebody else’s computer, but it offers us precisely what we need and would not get if we hosted these machines ourselves: flexibility and on-demand scalability.
For your information: it is important that our data is up to date, but it’s nowhere near real time (yet), so our pipeline runs once a day for a couple of hours. In these few hours it uses massive machines to run all our jobs, but at the end (thanks to Airflow) it kills everything besides the original data lake, the resulting database and the Kubernetes cluster running our (scalable) API. We only pay for used CPU cycles. Don’t get me wrong: it’s only cheaper compared to owning similar resources that would be idle 70% of the time, but in return you get many services that are controllable via the CLI out of the box.
We’ll see some of these during our demo, which runs on Google Cloud (and my laptop).
I’d have loved to show you the entire current pipeline as is, but that posed a couple of difficulties, mainly that the current flow would take more than the allotted time to explain, let alone comprehend. So I built a very, very simple pipeline using Spark, Airflow and Google Cloud. The idea is that you get inspired to build your own pipelines. I’ve tested it once, so what could go wrong?