WELCOME!
4TH DATA DRIVEN RIJNMOND
PROGRAM
‣ Apache Airflow & Apache Spark data pipelines in the cloud
‣ Collecting data in the food domain with apps
‣ Large-scale outlet matching and enrichment in the food service domain
DATLINQ
AIRFLOW & SPARK IN THE CLOUD
DATA IS GARBAGE
DATA
INFORMATION
KNOWLEDGE
INSIGHT
BETTER COMBINED
CLEANING DATA IS HARD
CONTINUOUS INFLOW
APACHE SPARK
DECENTRALISE & ATOMICISE
APACHE AIRFLOW
GOOGLE CLOUD PLATFORM
DEMO
QUESTIONS?
SLIDES & LINKS
WILL BE POSTED
ONLINE

Apache Airflow & Apache Spark data pipelines in the cloud

  • 1. WELCOME! 4TH DATA DRIVEN RIJNMOND
  • 2. PROGRAM ‣ Apache Airflow & Apache Spark data pipelines in the cloud ‣ Collecting data in the food domain with apps ‣ Large-scale outlet matching and enrichment in the food service domain
  • 3. DATLINQ
  • 4. AIRFLOW & SPARK IN THE CLOUD
  • 5. DATA IS GARBAGE
  • 6. DATA, INFORMATION, KNOWLEDGE, INSIGHT
  • 7. BETTER COMBINED
  • 8. CLEANING DATA IS HARD
  • 9. CONTINUOUS INFLOW
  • 10. APACHE SPARK
  • 11. DECENTRALISE & ATOMICISE
  • 12. APACHE AIRFLOW
  • 13. GOOGLE CLOUD PLATFORM
  • 14. DEMO
  • 15. QUESTIONS? SLIDES & LINKS WILL BE POSTED ONLINE

Editor's Notes

  1. Welcome to the 4th edition of Data Driven Rijnmond! Glad you all survived the storm and were still hyped enough to come by and listen to a talk about Airflow ;-) We are very proud to welcome you all to our new office building, and we hope it will host many more meetups in the future. For this occasion the meetup is somewhat Datlinq-themed, but we will stay away from sales pitches. Tonight we would like to share with you some of the tools and ideas Datlinq is using, and why.
  2. As always, we'll have both an engineering talk and a data science talk. As one of the data engineers at Datlinq, I'll start you off with the engineering talk. After a short break, Andrew Ho, our product manager for apps, will give a short, energising talk about collecting data in the food domain with apps. Finally, our data scientist Martijn Spitters will finish the evening with a talk about outlet matching & enrichment.
  3. For these talks to make sense you probably need to know a little about what we do here at Datlinq. As promised, no sales talk, but this gives the context needed to follow the overarching story of my colleagues' talks and mine. Datlinq is a company that operates in the food service domain. In short: we help food service professionals by informing them with data about opportunities in this domain and supporting them with tools. The data we use and supply is comprehensive location data for most of Europe: restaurants, coffee bars, stores, bakeries and other places that are potential outlets for food service brands. We work for brands like … and we use our data (combined with theirs) to make matches between their brands and locations. We gather, process and enrich this location data from a wide range of (digital) online sources. So without further ado, let's jump into this data gathering & enriching process.
  4. I want to take you on a journey of building a Spark pipeline in Google Cloud, orchestrated via Airflow. In the first half of this talk I will present slides about how we came to use Spark & Airflow in Google Cloud; in the second half I'll try to give a live demo of what I just described. It's OK if at this point you have no idea what these tools and systems are. I hope to explain, bottom-up, what our challenges are, how we set out to solve them, and how these tools fit into solving them. Our journey starts with data.
  5. Everybody is in love with data; big data is the new oil, they say. But I'm inclined to believe that these people know as much about working with big data as I do about oil. Data in itself is completely and utterly useless. Data is garbage. One of the problems with data is that it's stale the moment you get it. Your source says it's new, but who says they know? There is no chain of custody, nor any indication that the data you receive is accurate, up to date or even usable. Different sources may even copy from each other, perpetuating the problem. So you store data from different sources somewhere: in some files, a database, or maybe even Hadoop. Maybe you'll use it at some point, maybe you won't. But with the price of storage plummeting continuously, you never throw it away. That would be wasteful… It's not hard to get data nowadays. We use many open data sources and APIs to ingest bulk data. Think for example of … data, which we'll use in the demo. The moment you get your JSON response with a like count and some detailed information, it's dead data, with a half-life that determines its usability in the future. This data will also contain information typed in by the owner, which can contain errors (wrong zip codes or misspelled street names), lies ("best pizza in town") or inaccuracies (out-of-date menus and pricing). There may also be confusion caused by duplicated data, such as event locations that duplicate their location on … for each event. So the data we get from sources is in itself quite worthless.
  6. Then why work with data at all? We believe that somewhere in these mountains of data garbage some useful nuggets are hidden that we can recycle out of the dump and turn into information. This information can be used to generate knowledge, which in turn can be used to create insights. Doing this requires huge amounts of preprocessing, cleaning and transforming of the data. In the demo I'll show you how you can build these ETL (Extract, Transform, Load) jobs, and how a … data JSON source can be turned into a Datlinq Location with basic location information (address, geocode, phone, email, website, etc.), extended with informational tags, scores for the likelihood of existence, and classifications of certain properties. This is the first step in creating information out of this data. But as mentioned, processed data that is inaccurate is just nicely structured data that is inaccurate. Now it's time to improve this accuracy.
  7. The trick is that data is better combined. If we can combine data from different sources that describe the same entity, we reduce the risk of one of those sources being stale or incorrect. The more combinations we can make, the more trustworthy our data becomes, and the readier it is to be processed into information: Datlinq Locations. We call these combinations 'crosswalks', and one of our goals is to give every location as many crosswalks as possible. This serves both to gather more detailed information (some sources provide reviews, others menus, etc.) and as a verification tool for whether the location is still in business (we check these crosswalks periodically). In the demo we'll use a different source of data that overlaps somewhat with the … data, ETL'ing it into a similar structure before combining the two.
  8. Even though the solution of combining this data seems obvious, the meticulous part is processing and cleaning the data so it is ready for combining, because each transformation 'irreversibly' changes the data down the line. What to keep, what to change, what to merge and what to split are the hard questions. Fortunately this is something we have been doing at Datlinq for a long time. We have a lot of experience with gathering, cleaning and matching data.
  9. So far I have not mentioned any of the tools advertised for this talk. If we have all this experience, all this data and all these great clients, why do we need any of these tools at all? The problem is that in the last few years the floodgates have opened, and data keeps pouring in from all kinds of sources into our data lake, a.k.a. data garbage dump. Our challenge was to change our semi-automatic cleaning & combining process into a fully automated one, based on machine learning, that can handle the volume and variety of data flowing through our system. It's no longer feasible to check all this data by hand or with small scripts that run sporadically.
  10. No, we need a tool that can effortlessly process and store high volumes of data in a scalable way. The best tool on the market these days seems to be Apache Spark. Spark offers a way to distribute (map) your workload across many machines in a fault-tolerant way and combine the results back into a single data source. Spark is the engine that runs all our processes. You could build one huge monolithic Spark job that contains your entire data pipeline. Even though that seems easy, and will probably be the fastest solution, it's a horrible idea, because a failure at the end means failure of the entire pipeline. It's also hard to split out certain jobs so they can run on different clusters. You are just replacing your single-threaded opaque pipeline with a distributed, scalable opaque pipeline. So the best approach is in fact to build smaller jobs.
  11. The best practice for building these Spark jobs is to build many small ones that work together, allowing the output of one to be the input of another. This way you can build single-responsibility jobs that each do one specific thing, without worrying about the entire pipeline. You just have to define different types of Spark jobs: ETL jobs that turn raw data into a semi-structured, clean dataset; matching jobs that combine these datasets; enrichment jobs that turn the combined datasets into enriched data by adding new features; and ML jobs, such as classification jobs that use these features to predict. In the demo I'll try to convey how you could approach this problem, but bear in mind that there are better ways. I just kept it simple for the demo. The downside of all these separate jobs is that the loose components have to be orchestrated into a single functional pipeline. In a flock of birds this happens through naturally emerging (flocking) behaviour. Unfortunately, big data tools are not ready for that (yet).
  12. In the olden days we would create huge lists of cron jobs that trigger certain jobs at certain time intervals, but this has many issues. Cron jobs run regardless of what happened before. They are hard to schedule, since you have to estimate how long each job takes in order to schedule the next. If one fails, the rest keep running and probably fail too. Logging is hard, and scaling across multiple machines seems a guarantee for headaches. No, what we actually need is a tool that allows easy composition and scheduling of complex workflows with dependencies, and that also monitors these workflows, retries a number of times in case of failure, and reports the status of each job. One such tool is Apache Airflow. Written in Python and maturing pretty fast, it meets all our requirements and has a growing plugin library that allows it to work with Amazon, Azure and Google Cloud. In the demo I'll show you how we use Spark job plugins for Airflow to trigger our individual jobs and have them depend on each other.
  13. Now that we have our jobs in a row, we need a place to run them (besides our laptop). The best solution is, in my opinion, the cloud. And yes, the cloud is just somebody else's computer, but it offers us precisely what we need and would not get if we hosted these machines ourselves: flexibility and on-demand scalability. For your information: it is important that our data is up to date, but it's nowhere near real time (yet), so our pipeline runs once a day for a couple of hours. In these few hours it uses massive machines to run all our jobs, but at the end (thanks to Airflow) it kills everything except the original data lake, the resulting database and the Kubernetes cluster running our (scalable) API. We only pay for used CPU cycles. Don't get me wrong: it's only cheaper compared to owning similar resources that would be idle 70% of the time, but you get many services, controllable via CLI, out of the box. We'll see some of them during our demo, which runs on Google Cloud (and my laptop).
  14. I'd have loved to show you the entire current pipeline as is, but that posed a couple of difficulties, mainly that the current flow would take more than the allotted time to explain, let alone comprehend. So I built a very, very simple pipeline using Spark, Airflow and Google Cloud. The idea is that you get inspired to build your own pipelines. I've tested it once, so what could go wrong?
  15. Questions? https://github.com/TomLous/meetup-spark-airflow-demo
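
The ETL step described in note 6 can be sketched in plain Python. This is a minimal illustration, not the demo code: the input field names (`id`, `name`, `address`, `zip`, `phone`, `categories`), the output shape, and the Dutch-style zip code pattern are all assumptions made for the example, and the sample records are invented.

```python
import json
import re
from typing import Any, Dict, Iterable, List, Optional

# Assumption: Dutch-style zip codes (four digits, two letters).
ZIP_PATTERN = re.compile(r"^\d{4}[A-Z]{2}$")


def clean_zipcode(raw: Optional[str]) -> Optional[str]:
    """Normalise '3011 ad' to '3011AD'; return None when the value does not look like a zip code."""
    if not raw:
        return None
    candidate = re.sub(r"\s+", "", raw).upper()
    return candidate if ZIP_PATTERN.match(candidate) else None


def extract(line: str) -> Dict[str, Any]:
    """Extract: parse one JSON line from the raw source."""
    return json.loads(line)


def transform(record: Dict[str, Any]) -> Dict[str, Any]:
    """Transform: map a raw record onto a small, hypothetical location structure."""
    address = record.get("address") or {}
    return {
        "source_id": str(record["id"]),
        "name": " ".join(str(record.get("name", "")).split()),  # collapse whitespace
        "street": (address.get("street") or "").strip() or None,
        "zipcode": clean_zipcode(address.get("zip")),
        "phone": re.sub(r"[^\d+]", "", record.get("phone") or "") or None,
        "tags": sorted({t.strip().lower() for t in record.get("categories", []) if t.strip()}),
    }


def load(records: Iterable[Dict[str, Any]], sink: List[Dict[str, Any]]) -> None:
    """Load: append cleaned records to a sink (a list here, a database in practice)."""
    sink.extend(records)


raw_lines = [
    '{"id": 1, "name": "  Pizzeria   Napoli ", "address": {"street": "Coolsingel 1", "zip": "3011 ad"}, '
    '"phone": "+31 (0)10-1234567", "categories": ["Pizza", "italian "]}',
    '{"id": 2, "name": "Cafe Zuid", "address": {"zip": "not a zip"}, "categories": []}',
]
locations: List[Dict[str, Any]] = []
load((transform(extract(line)) for line in raw_lines), locations)
```

Invalid values become `None` rather than being passed through, so later matching steps can tell "missing" apart from "wrong".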
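
The 'crosswalks' idea from note 7 can be illustrated with a deliberately naive matcher. The talk describes matching based on machine learning; the exact-key approach below (zip code plus a normalised name) is a swapped-in simplification, and the source names, record fields and sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def match_key(location: dict) -> Tuple[str, str]:
    """Crude matching key: zip code plus the lower-cased, alphanumeric-only name."""
    name = "".join(ch for ch in location["name"].lower() if ch.isalnum())
    return (location["zipcode"], name)


def build_crosswalks(*sources: Tuple[str, List[dict]]) -> List[dict]:
    """Group records from several sources that share a key and collect their source ids."""
    grouped: Dict[Tuple[str, str], dict] = {}
    for source_name, source_locations in sources:
        for loc in source_locations:
            entry = grouped.setdefault(
                match_key(loc),
                {"name": loc["name"], "zipcode": loc["zipcode"], "crosswalks": {}},
            )
            # One crosswalk per source: a pointer back to the record in that source.
            entry["crosswalks"][source_name] = loc["source_id"]
    return list(grouped.values())


source_a = [
    {"source_id": "a1", "name": "Pizzeria Napoli", "zipcode": "3011AD"},
    {"source_id": "a2", "name": "Cafe Zuid", "zipcode": "3072AP"},
]
source_b = [
    {"source_id": "b9", "name": "pizzeria napoli!", "zipcode": "3011AD"},
]
combined = build_crosswalks(("source_a", source_a), ("source_b", source_b))
```

A location that appears in both sources ends up with two crosswalks, which is the signal the note relies on: the more independent sources point at the same entity, the more trustworthy that entity is.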
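
Note 10 describes Spark as distributing (mapping) a workload and combining the partial results, with fault tolerance. The sketch below imitates that shape on one machine using only the Python standard library; it does not use Spark's API. The helper names, the thread pool standing in for a cluster, and the sample records are assumptions for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def partition(items: Sequence[dict], n: int) -> List[Sequence[dict]]:
    """Split the input into n roughly equal partitions."""
    return [items[i::n] for i in range(n)]


def run_with_retry(task: Callable[[], Counter], attempts: int = 3) -> Counter:
    """Re-run a failed partition instead of failing the whole job."""
    last_error = None
    for _ in range(attempts):
        try:
            return task()
        except Exception as error:  # broad on purpose in this sketch
            last_error = error
    raise RuntimeError("partition failed after retries") from last_error


def count_by_city(records: Sequence[dict]) -> Counter:
    """The 'map' side: work done independently on each partition."""
    return Counter(r["city"] for r in records)


def distributed_count(records: Sequence[dict], workers: int = 3) -> Counter:
    """Map over partitions in parallel, then combine the partial counts."""
    parts = partition(records, workers)
    total: Counter = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda part: run_with_retry(lambda: count_by_city(part)), parts)
        for partial in partials:
            total += partial  # the 'combine' side
    return total


records = [
    {"city": "Rotterdam"}, {"city": "Delft"}, {"city": "Rotterdam"},
    {"city": "Schiedam"}, {"city": "Rotterdam"},
]
result = distributed_count(records)
```

Because each partition is processed independently, a failure only costs a retry of that partition, and the final result does not depend on how many workers were used.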
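
Note 12 lists what cron lacks: dependencies between jobs, retries, and a status per job. The sketch below shows those three ideas with the standard library only (`graphlib`, Python 3.9+). It is not Airflow's API; the function `run_pipeline`, the status strings and the job names are hypothetical stand-ins.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set


def run_pipeline(
    tasks: Dict[str, Callable[[], None]],
    depends_on: Dict[str, Set[str]],
    retries: int = 2,
) -> Dict[str, str]:
    """Run tasks in dependency order, retry failures, and skip tasks whose upstream failed."""
    status: Dict[str, str] = {}
    graph = {name: depends_on.get(name, set()) for name in tasks}
    for name in TopologicalSorter(graph).static_order():
        if any(status.get(dep) != "success" for dep in depends_on.get(name, set())):
            status[name] = "upstream_failed"  # unlike cron, do not run blindly
            continue
        for _ in range(1 + retries):
            try:
                tasks[name]()
                status[name] = "success"
                break
            except Exception:  # broad on purpose in this sketch
                status[name] = "failed"
    return status


calls = []
attempts = {"etl_b": 0, "enrich": 0}


def etl_a() -> None:
    calls.append("etl_a")


def etl_b() -> None:
    attempts["etl_b"] += 1
    if attempts["etl_b"] == 1:
        raise RuntimeError("transient failure")  # succeeds on the retry
    calls.append("etl_b")


def match() -> None:
    calls.append("match")


def enrich() -> None:
    attempts["enrich"] += 1
    raise RuntimeError("always fails")


def classify() -> None:
    calls.append("classify")


status = run_pipeline(
    tasks={"etl_a": etl_a, "etl_b": etl_b, "match": match, "enrich": enrich, "classify": classify},
    depends_on={"match": {"etl_a", "etl_b"}, "enrich": {"match"}, "classify": {"enrich"}},
)
```

Here `etl_b` recovers after one retry, `enrich` exhausts its retries and is marked failed, and `classify` never runs because its upstream failed. A real orchestrator adds scheduling, logging and notifications on top of this core.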