BUILDING A DATA INGESTION & PROCESSING PIPELINE WITH SPARK & AIRFLOW
DATA DRIVEN RIJNMOND
Tom Lous
@tomlous
DATLINQ BY NUMBERS
WHO ARE WE?
• 14 countries

Netherlands, Belgium, France, UK, Germany, Spain, Italy, Luxembourg
• 100 customers across Europe

Coca-Cola, Unilever, FrieslandCampina, Red Bull, Spa, AB InBev & Heineken
• 2000 active users

Working with our SaaS, insights & mobile solutions
• 1.2M outlets

Restaurants, bars, hotels, bakeries, supermarkets, kiosks, hospitals
• 5000 mutations per day

Out of business, starters, ownership change, name change, etc
• 30 students

Manually checking many changes
DATLINQ COLLECTS DATA
WHAT DO WE DO?
• Outlet features

Address, segmentation, opening hours, contact info, lat/long, terrace, parking, pricing
• Outlet surroundings
Nearby public transportation, school, going out area, shopping, high traffic
• Demographics
Median income, age, density, marital status
• Social Media

Presence, traffic, likes, posts
DATLINQ’S MISSION
WHY DO WE DO IT?
• Make the food service market transparent
• Effective marketing & sales
• Which food products are relevant where?
WHY SPARK?
SOLUTION
• Scaling is hard

Mo’ outlets, mo’ students
• Data != information
One-off analysis projects should be done continuously and immediately
• Work smarter not harder
Have students train AI instead of repetitive work
• Faster results

Distributed computing allows for increased processing speeds
SOLUTION?
HOW TO START
SOLUTION
• Focus on MVP

Minimum Viable Product that acts like a Proof of Concept
• Declare riskiest assumption
We can automatically match data from different sources
• Prove it
Small scale ML models
BUILDING THE PIPELINE
USE THE CLOUD
BUILDING THE PIPELINE
• Flexible usage

Pay for what you use
• Scaling out of the box
The only limit seems to be your credit card
• CLI > UI
CLI tools are your friend. UI is limited
DEVELOP LOCALLY
BUILDING THE PIPELINE
• Check versions

Your cloud provider sets the max version (e.g. Spark 2.0.2 instead of 2.1.0)
• Same code smaller data
Use samples of your data to test on
• Use at least 2 threads
Otherwise parallel execution cannot be truly tested
• Exclude libraries from deployment
Some libraries are provided on the cloud, don’t add them to your final build
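
The "same code, smaller data" idea can be sketched as follows. This is a minimal illustration in plain Python, not the Spark job from the talk: the function name `in_sample`, the outlet ids and the 10% fraction are all assumptions. Hashing the key makes the sample deterministic, so every local run tests against the same records.

```python
import hashlib


def in_sample(key: str, fraction: float) -> bool:
    """Return True if `key` falls into a deterministic sample of roughly `fraction` of all keys."""
    # Hash the key so the decision is stable across runs and machines
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    # Map the first 8 hex digits to a number in [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < fraction


# Hypothetical outlet ids standing in for real records
outlet_ids = [f"outlet-{i}" for i in range(10_000)]
sample = [oid for oid in outlet_ids if in_sample(oid, 0.10)]
print(len(sample))
```

In Spark itself, `sample` with a fixed seed is the closer equivalent, and a `local[2]` master URL gives the two threads mentioned above.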
CONTINUOUS INTEGRATION
BUILDING THE PIPELINE
• Deploy code

Manual uploads or deployments are asking for trouble.

Use polling instead of pushing for security

Create pipeline script to stage builds

Automated tests
• Commit config
Store the CI configuration in version control
SPARK JOBS
BUILDING THE PIPELINE
• All jobs in one project

Unless there is a very good reason, keep all your Jobs in one file
• Use Datasets as much as possible
Avoid RDDs and DataFrames
• Use Case Class (encoders) as much as possible
Generic Row object will only get you so far
• Verbosity only in main class
Log messages from nodes won’t make it to the master
• Look at the execution plan
Remove temp files, clean the slate
PARQUET
BUILDING THE PIPELINE
• What is it

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem
• Compressed
Small footprint.
• Columnar
Easier to drill down into data based on fields
• Metadata
Stores metadata that Spark can use easily (counts, etc.)
• Partitioning
Allows for partitioning on values for faster access
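
To make the partitioning bullet concrete, here is a small sketch of Hive-style directory partitioning, the `column=value` layout that Spark produces when writing Parquet with `partitionBy`. The code is plain Python and does not read or write Parquet; the column names `country` and `segment` and the file names are made up for illustration. It only shows why a filter on a partition column can skip whole directories.

```python
from pathlib import PurePosixPath

# Hypothetical file listing of a dataset partitioned on country and segment
files = [
    "outlets/country=NL/segment=bar/part-0000.parquet",
    "outlets/country=NL/segment=hotel/part-0000.parquet",
    "outlets/country=BE/segment=bar/part-0000.parquet",
    "outlets/country=FR/segment=bakery/part-0000.parquet",
]


def partition_values(path: str) -> dict:
    """Extract the column=value pairs encoded in the directory names."""
    values = {}
    for part in PurePosixPath(path).parts:
        if "=" in part:
            column, value = part.split("=", 1)
            values[column] = value
    return values


def prune(paths: list, **filters: str) -> list:
    """Keep only files whose partition values match every filter."""
    return [
        p for p in paths
        if all(partition_values(p).get(col) == val for col, val in filters.items())
    ]


# Only the NL directories need to be opened for this filter
print(prune(files, country="NL"))
```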
ELASTICSEARCH
BUILDING THE PIPELINE
• NxN matching is computationally expensive

Querying Elasticsearch / Lucene / Solr is less so
Fuzzy searching using n-grams and other tokenizers
• Cluster out of the box
Auto replication
• Config differs from normal use
It’s not a search feature for a website. 

Don’t care about update performance

Don’t load balance queries (_local)
• Discovery Plugins for GC
So nodes can find each other on local network without Zen
Create VMs with --scopes=compute-rw
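
A minimal sketch of the idea behind n-gram matching, assuming character trigrams and Jaccard similarity. This is plain Python and not how Elasticsearch scores documents internally; the outlet names, the `0.4` threshold and all function names are illustrative. The inverted index is what avoids the NxN comparison: a query is only scored against names that share at least one trigram with it.

```python
from collections import defaultdict


def trigrams(text: str) -> set:
    """Split a normalised string into overlapping character trigrams."""
    padded = f"  {text.lower().strip()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def build_index(names: list) -> dict:
    """Inverted index: trigram -> positions of the names containing it."""
    index = defaultdict(set)
    for pos, name in enumerate(names):
        for gram in trigrams(name):
            index[gram].add(pos)
    return index


def match(query: str, names: list, index: dict, threshold: float = 0.4) -> list:
    """Return (name, score) pairs whose Jaccard similarity reaches the threshold."""
    q = trigrams(query)
    # Candidates share at least one trigram, so most names are never compared
    candidates = set()
    for gram in q:
        candidates |= index.get(gram, set())
    scored = []
    for pos in candidates:
        grams = trigrams(names[pos])
        score = len(q & grams) / len(q | grams)
        if score >= threshold:
            scored.append((names[pos], score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical outlet names from one source, queried with a misspelt name from another
names = ["Cafe De Witte Aap", "Hotel New York", "Bakkerij Jansen"]
index = build_index(names)
print(match("Cafe de Wite Aap", names, index))
```

The misspelt query still finds its outlet, while the two unrelated names are never scored because they share no trigram with it.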
ANSIBLE
BUILDING THE PIPELINE
• Automate VMs / agentless

Elasticsearch cluster
• Dependent on Ansible GC modules

gcloud keeps updating. Ansible GC doesn’t directly follow suit
AIRFLOW
BUILDING THE PIPELINE
• Create workflows

Define all tasks and dependencies
• Cron on steroids
Starts every day
• Dependencies
Tasks are dependent on tasks upstream
• Backfill / rerun
Run as if in the past
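
The dependency and backfill bullets can be illustrated without Airflow itself. The sketch below is plain Python using the standard library's `graphlib` (Python 3.9+); it is not Airflow's API, and the task names and dates are invented. It orders tasks so that every upstream task runs first, then replays that order for each day in a past date range, which is the idea behind a backfill.

```python
from datetime import date, timedelta
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on
dependencies = {
    "ingest": set(),
    "to_parquet": {"ingest"},
    "index_elasticsearch": {"to_parquet"},
    "match_outlets": {"index_elasticsearch"},
}


def run_order(deps: dict) -> list:
    """Return the tasks in an order that respects all upstream dependencies."""
    return list(TopologicalSorter(deps).static_order())


def backfill(deps: dict, start: date, end: date) -> list:
    """List (execution date, task) pairs for every day from start to end inclusive."""
    order = run_order(deps)
    runs = []
    day = start
    while day <= end:
        runs.extend((day.isoformat(), task) for task in order)
        day += timedelta(days=1)
    return runs


for execution_date, task in backfill(dependencies, date(2017, 1, 1), date(2017, 1, 2)):
    print(execution_date, task)
```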
BUILDING THE PIPELINE
FINALLY
CONCLUSION
• What’s next?

Improving MVP

Adding more machine learning

Adding more data sources

Adding more countries

Improving architecture
• Production
Create an API for use in our products

Create UI for exploring the data

Exporting to standard databases for further use
• Team
Hire Data Engineer(s)
Hire Data Scientist(s)
THANKS FOR WATCHING
CONCLUSION
• Questions?

AMA
• Vacancies!
Ask me or check our site http://www.datlinq.com/en/vacancies
• Break
Grab a drink and let’s reconvene after a 15 min break

More Related Content

What's hot

Hadoop summit-ams-2014-04-03
Hadoop summit-ams-2014-04-03Hadoop summit-ams-2014-04-03
Hadoop summit-ams-2014-04-03
SDanzanvilliersCriteo
 
Streaming Sensor Data with Grafana and InfluxDB | Ryan Mckinley | Grafana
Streaming Sensor Data with Grafana and InfluxDB | Ryan Mckinley | GrafanaStreaming Sensor Data with Grafana and InfluxDB | Ryan Mckinley | Grafana
Streaming Sensor Data with Grafana and InfluxDB | Ryan Mckinley | Grafana
InfluxData
 
Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for Hadoop
Joe Crobak
 
Managing data workflows with Luigi
Managing data workflows with LuigiManaging data workflows with Luigi
Managing data workflows with Luigi
Teemu Kurppa
 
The IDI Digital Transformation - OpenStack Day Israel 2016
The IDI Digital Transformation - OpenStack Day Israel 2016The IDI Digital Transformation - OpenStack Day Israel 2016
The IDI Digital Transformation - OpenStack Day Israel 2016
Cloud Native Day Tel Aviv
 
Scalable Eventing Over Apache Mesos
Scalable Eventing Over Apache MesosScalable Eventing Over Apache Mesos
Scalable Eventing Over Apache Mesos
Olivier Paugam
 
Workers and Worker Patterns at Scale
Workers and Worker Patterns at ScaleWorkers and Worker Patterns at Scale
Workers and Worker Patterns at Scale
Chad Arimura
 
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
Itai Yaffe
 
Quarterly Technology Briefing, Manchester, UK September 2013
Quarterly Technology Briefing, Manchester, UK September 2013Quarterly Technology Briefing, Manchester, UK September 2013
Quarterly Technology Briefing, Manchester, UK September 2013
Thoughtworks
 
Agile sre fly beyond the clouds olx core went aws- Wojciech Krysmann
Agile sre fly beyond the clouds   olx core went aws- Wojciech KrysmannAgile sre fly beyond the clouds   olx core went aws- Wojciech Krysmann
Agile sre fly beyond the clouds olx core went aws- Wojciech Krysmann
PROIDEA
 
Promise of a better future by Rahul Goma Phulore and Pooja Akshantal, Thought...
Promise of a better future by Rahul Goma Phulore and Pooja Akshantal, Thought...Promise of a better future by Rahul Goma Phulore and Pooja Akshantal, Thought...
Promise of a better future by Rahul Goma Phulore and Pooja Akshantal, Thought...
Thoughtworks
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
Laura Lorenz
 
Go With The Flow
Go With The FlowGo With The Flow
Go With The Flow
PhilWinstanley
 
Spark Workflow Management
Spark Workflow ManagementSpark Workflow Management
Spark Workflow Management
Romi Kuntsman
 
From Python to smartphones: neural nets @ Saint-Gobain, François Sausset
From Python to smartphones: neural nets @ Saint-Gobain, François SaussetFrom Python to smartphones: neural nets @ Saint-Gobain, François Sausset
From Python to smartphones: neural nets @ Saint-Gobain, François Sausset
Pôle Systematic Paris-Region
 
So Your OpenStack Cloud is Built... Now What's Next - Walter Bentley - OpenSt...
So Your OpenStack Cloud is Built... Now What's Next - Walter Bentley - OpenSt...So Your OpenStack Cloud is Built... Now What's Next - Walter Bentley - OpenSt...
So Your OpenStack Cloud is Built... Now What's Next - Walter Bentley - OpenSt...
Cloud Native Day Tel Aviv
 
Hyperloglog Lightning Talk
Hyperloglog Lightning TalkHyperloglog Lightning Talk
Hyperloglog Lightning Talk
Simon Prickett
 
Testing and Monitoring and Broken Things | Nikki Attea | Sensu
Testing and Monitoring and Broken Things | Nikki Attea | SensuTesting and Monitoring and Broken Things | Nikki Attea | Sensu
Testing and Monitoring and Broken Things | Nikki Attea | Sensu
InfluxData
 
Start Flying with Python & Apache TinkerPop
Start Flying with Python & Apache TinkerPopStart Flying with Python & Apache TinkerPop
Start Flying with Python & Apache TinkerPop
Jason Plurad
 
FUTURESTACK13: Software analytics with Project Rubicon from Alex Kroman Engin...
FUTURESTACK13: Software analytics with Project Rubicon from Alex Kroman Engin...FUTURESTACK13: Software analytics with Project Rubicon from Alex Kroman Engin...
FUTURESTACK13: Software analytics with Project Rubicon from Alex Kroman Engin...
New Relic
 

What's hot (20)

Hadoop summit-ams-2014-04-03
Hadoop summit-ams-2014-04-03Hadoop summit-ams-2014-04-03
Hadoop summit-ams-2014-04-03
 
Streaming Sensor Data with Grafana and InfluxDB | Ryan Mckinley | Grafana
Streaming Sensor Data with Grafana and InfluxDB | Ryan Mckinley | GrafanaStreaming Sensor Data with Grafana and InfluxDB | Ryan Mckinley | Grafana
Streaming Sensor Data with Grafana and InfluxDB | Ryan Mckinley | Grafana
 
Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for Hadoop
 
Managing data workflows with Luigi
Managing data workflows with LuigiManaging data workflows with Luigi
Managing data workflows with Luigi
 
The IDI Digital Transformation - OpenStack Day Israel 2016
The IDI Digital Transformation - OpenStack Day Israel 2016The IDI Digital Transformation - OpenStack Day Israel 2016
The IDI Digital Transformation - OpenStack Day Israel 2016
 
Scalable Eventing Over Apache Mesos
Scalable Eventing Over Apache MesosScalable Eventing Over Apache Mesos
Scalable Eventing Over Apache Mesos
 
Workers and Worker Patterns at Scale
Workers and Worker Patterns at ScaleWorkers and Worker Patterns at Scale
Workers and Worker Patterns at Scale
 
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
 
Quarterly Technology Briefing, Manchester, UK September 2013
Quarterly Technology Briefing, Manchester, UK September 2013Quarterly Technology Briefing, Manchester, UK September 2013
Quarterly Technology Briefing, Manchester, UK September 2013
 
Agile sre fly beyond the clouds olx core went aws- Wojciech Krysmann
Agile sre fly beyond the clouds   olx core went aws- Wojciech KrysmannAgile sre fly beyond the clouds   olx core went aws- Wojciech Krysmann
Agile sre fly beyond the clouds olx core went aws- Wojciech Krysmann
 
Promise of a better future by Rahul Goma Phulore and Pooja Akshantal, Thought...
Promise of a better future by Rahul Goma Phulore and Pooja Akshantal, Thought...Promise of a better future by Rahul Goma Phulore and Pooja Akshantal, Thought...
Promise of a better future by Rahul Goma Phulore and Pooja Akshantal, Thought...
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
 
Go With The Flow
Go With The FlowGo With The Flow
Go With The Flow
 
Spark Workflow Management
Spark Workflow ManagementSpark Workflow Management
Spark Workflow Management
 
From Python to smartphones: neural nets @ Saint-Gobain, François Sausset
From Python to smartphones: neural nets @ Saint-Gobain, François SaussetFrom Python to smartphones: neural nets @ Saint-Gobain, François Sausset
From Python to smartphones: neural nets @ Saint-Gobain, François Sausset
 
So Your OpenStack Cloud is Built... Now What's Next - Walter Bentley - OpenSt...
So Your OpenStack Cloud is Built... Now What's Next - Walter Bentley - OpenSt...So Your OpenStack Cloud is Built... Now What's Next - Walter Bentley - OpenSt...
So Your OpenStack Cloud is Built... Now What's Next - Walter Bentley - OpenSt...
 
Hyperloglog Lightning Talk
Hyperloglog Lightning TalkHyperloglog Lightning Talk
Hyperloglog Lightning Talk
 
Testing and Monitoring and Broken Things | Nikki Attea | Sensu
Testing and Monitoring and Broken Things | Nikki Attea | SensuTesting and Monitoring and Broken Things | Nikki Attea | Sensu
Testing and Monitoring and Broken Things | Nikki Attea | Sensu
 
Start Flying with Python & Apache TinkerPop
Start Flying with Python & Apache TinkerPopStart Flying with Python & Apache TinkerPop
Start Flying with Python & Apache TinkerPop
 
FUTURESTACK13: Software analytics with Project Rubicon from Alex Kroman Engin...
FUTURESTACK13: Software analytics with Project Rubicon from Alex Kroman Engin...FUTURESTACK13: Software analytics with Project Rubicon from Alex Kroman Engin...
FUTURESTACK13: Software analytics with Project Rubicon from Alex Kroman Engin...
 

Viewers also liked

How to write maintainable code
How to write maintainable codeHow to write maintainable code
How to write maintainable code
Peter Hilton
 
Data Driven Action : A Primer on Data Science
Data Driven Action : A Primer on Data ScienceData Driven Action : A Primer on Data Science
Data Driven Action : A Primer on Data Science
Srivatsan Ramanujam
 
Transforming Data to Unlock Its Latent Value
Transforming Data to Unlock Its Latent ValueTransforming Data to Unlock Its Latent Value
Transforming Data to Unlock Its Latent Value
Tony Ojeda
 
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
Ilkay Altintas, Ph.D.
 
Big datalab
Big datalabBig datalab
Big datalab
David Chen
 
Gartner Predictions for Hadoop
Gartner Predictions for HadoopGartner Predictions for Hadoop
Gartner Predictions for Hadoop
Bruno Aziza
 
Big Data Analytics Principles
Big Data Analytics PrinciplesBig Data Analytics Principles
Big Data Analytics Principles
Bruno Aziza
 
DataLab DataQuality Dimensions
DataLab DataQuality DimensionsDataLab DataQuality Dimensions
DataLab DataQuality Dimensions
Carlos Guerreiro
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Srivatsan Ramanujam
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
Jonathan Holloway
 
Marlabs Capabilities Overview: DWBI, Analytics and Big Data Services
Marlabs Capabilities Overview: DWBI, Analytics and Big Data ServicesMarlabs Capabilities Overview: DWBI, Analytics and Big Data Services
Marlabs Capabilities Overview: DWBI, Analytics and Big Data Services
Marlabs
 
The Laws of Data Science Gravity
The Laws of Data Science GravityThe Laws of Data Science Gravity
The Laws of Data Science Gravity
Bruno Aziza
 
Big Data for the CMO
Big Data for the CMOBig Data for the CMO
Big Data for the CMO
Bruno Aziza
 
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
scalaconfjp
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
Databricks
 
From Business Intelligence to Predictive Analytics
From Business Intelligence to Predictive AnalyticsFrom Business Intelligence to Predictive Analytics
From Business Intelligence to Predictive Analytics
Decision Management Solutions
 
Building Scalable Big Data Pipelines
Building Scalable Big Data PipelinesBuilding Scalable Big Data Pipelines
Building Scalable Big Data Pipelines
Christian Gügi
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with Luigi
Growth Intelligence
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons Learned
Databricks
 
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesAirflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
DataWorks Summit
 

Viewers also liked (20)

How to write maintainable code
How to write maintainable codeHow to write maintainable code
How to write maintainable code
 
Data Driven Action : A Primer on Data Science
Data Driven Action : A Primer on Data ScienceData Driven Action : A Primer on Data Science
Data Driven Action : A Primer on Data Science
 
Transforming Data to Unlock Its Latent Value
Transforming Data to Unlock Its Latent ValueTransforming Data to Unlock Its Latent Value
Transforming Data to Unlock Its Latent Value
 
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
 
Big datalab
Big datalabBig datalab
Big datalab
 
Gartner Predictions for Hadoop
Gartner Predictions for HadoopGartner Predictions for Hadoop
Gartner Predictions for Hadoop
 
Big Data Analytics Principles
Big Data Analytics PrinciplesBig Data Analytics Principles
Big Data Analytics Principles
 
DataLab DataQuality Dimensions
DataLab DataQuality DimensionsDataLab DataQuality Dimensions
DataLab DataQuality Dimensions
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
 
Marlabs Capabilities Overview: DWBI, Analytics and Big Data Services
Marlabs Capabilities Overview: DWBI, Analytics and Big Data ServicesMarlabs Capabilities Overview: DWBI, Analytics and Big Data Services
Marlabs Capabilities Overview: DWBI, Analytics and Big Data Services
 
The Laws of Data Science Gravity
The Laws of Data Science GravityThe Laws of Data Science Gravity
The Laws of Data Science Gravity
 
Big Data for the CMO
Big Data for the CMOBig Data for the CMO
Big Data for the CMO
 
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
From Business Intelligence to Predictive Analytics
From Business Intelligence to Predictive AnalyticsFrom Business Intelligence to Predictive Analytics
From Business Intelligence to Predictive Analytics
 
Building Scalable Big Data Pipelines
Building Scalable Big Data PipelinesBuilding Scalable Big Data Pipelines
Building Scalable Big Data Pipelines
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with Luigi
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons Learned
 
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesAirflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
 

Similar to Building a Data Ingestion & Processing Pipeline with Spark & Airflow

RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
Databricks
 
Transforming to OpenStack: a sample roadmap to DevOps
Transforming to OpenStack: a sample roadmap to DevOpsTransforming to OpenStack: a sample roadmap to DevOps
Transforming to OpenStack: a sample roadmap to DevOps
Nicolas (Nick) Barcet
 
The 12 Factor App
The 12 Factor AppThe 12 Factor App
The 12 Factor App
rudiyardley
 
Scaling AI in production using PyTorch
Scaling AI in production using PyTorchScaling AI in production using PyTorch
Scaling AI in production using PyTorch
geetachauhan
 
OSDC 2018 | Puppet and the Road to Pervasive Automation by Walter Gildersleeve
OSDC 2018 | Puppet and the Road to Pervasive Automation by Walter GildersleeveOSDC 2018 | Puppet and the Road to Pervasive Automation by Walter Gildersleeve
OSDC 2018 | Puppet and the Road to Pervasive Automation by Walter Gildersleeve
NETWAYS
 
200,000 Lines Later: Our Journey to Manageable Puppet Code
200,000 Lines Later: Our Journey to Manageable Puppet Code200,000 Lines Later: Our Journey to Manageable Puppet Code
200,000 Lines Later: Our Journey to Manageable Puppet Code
David Danzilio
 
Challenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in ProductionChallenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in Production
iguazio
 
So Your OpenStack Cloud is Built...Now What?
So Your OpenStack Cloud is Built...Now What? So Your OpenStack Cloud is Built...Now What?
So Your OpenStack Cloud is Built...Now What?
Tesora
 
The NRB Group mainframe day 2021 - DevOps on Z - Jerome Klimm - Benoit Ebner
The NRB Group mainframe day 2021 - DevOps on Z - Jerome Klimm - Benoit EbnerThe NRB Group mainframe day 2021 - DevOps on Z - Jerome Klimm - Benoit Ebner
The NRB Group mainframe day 2021 - DevOps on Z - Jerome Klimm - Benoit Ebner
NRB
 
How to Transform Into a Data-Driven Organization
How to Transform Into a Data-Driven OrganizationHow to Transform Into a Data-Driven Organization
How to Transform Into a Data-Driven Organization
WarrenCruz3
 
The domino maze
The domino mazeThe domino maze
The domino maze
Julian Woodward
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Philip Filleul
 
Metrics 4 faster feedback
Metrics 4 faster feedbackMetrics 4 faster feedback
Metrics 4 faster feedback
Kris Buytaert
 
Migrating Core Enterprise Applications to the Cloud
Migrating Core Enterprise Applications to the CloudMigrating Core Enterprise Applications to the Cloud
Migrating Core Enterprise Applications to the Cloud
Roger Valade
 
Devops For Drupal
Devops  For DrupalDevops  For Drupal
Devops For Drupal
Kris Buytaert
 
Seeing RED: Monitoring and Observability in the Age of Microservices
Seeing RED: Monitoring and Observability in the Age of MicroservicesSeeing RED: Monitoring and Observability in the Age of Microservices
Seeing RED: Monitoring and Observability in the Age of Microservices
Dave McAllister
 
Big Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesBig Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil Games
Rob Winters
 
Consolidating MLOps at One of Europe’s Biggest Airports
Consolidating MLOps at One of Europe’s Biggest AirportsConsolidating MLOps at One of Europe’s Biggest Airports
Consolidating MLOps at One of Europe’s Biggest Airports
Databricks
 
Deployments in one click!
Deployments in one click!Deployments in one click!
Deployments in one click!
Manuel de la Peña Peña
 
Tools and best practices for sustainable software
Tools and best practices for sustainable softwareTools and best practices for sustainable software
Tools and best practices for sustainable software
Green Software Development
 

Similar to Building a Data Ingestion & Processing Pipeline with Spark & Airflow (20)

RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
 
Transforming to OpenStack: a sample roadmap to DevOps
Transforming to OpenStack: a sample roadmap to DevOpsTransforming to OpenStack: a sample roadmap to DevOps
Transforming to OpenStack: a sample roadmap to DevOps
 
The 12 Factor App
The 12 Factor AppThe 12 Factor App
The 12 Factor App
 
Scaling AI in production using PyTorch
Scaling AI in production using PyTorchScaling AI in production using PyTorch
Scaling AI in production using PyTorch
 
OSDC 2018 | Puppet and the Road to Pervasive Automation by Walter Gildersleeve
OSDC 2018 | Puppet and the Road to Pervasive Automation by Walter GildersleeveOSDC 2018 | Puppet and the Road to Pervasive Automation by Walter Gildersleeve
OSDC 2018 | Puppet and the Road to Pervasive Automation by Walter Gildersleeve
 
200,000 Lines Later: Our Journey to Manageable Puppet Code
200,000 Lines Later: Our Journey to Manageable Puppet Code200,000 Lines Later: Our Journey to Manageable Puppet Code
200,000 Lines Later: Our Journey to Manageable Puppet Code
 
Challenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in ProductionChallenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in Production
 
So Your OpenStack Cloud is Built...Now What?
So Your OpenStack Cloud is Built...Now What? So Your OpenStack Cloud is Built...Now What?
So Your OpenStack Cloud is Built...Now What?
 
The NRB Group mainframe day 2021 - DevOps on Z - Jerome Klimm - Benoit Ebner
The NRB Group mainframe day 2021 - DevOps on Z - Jerome Klimm - Benoit EbnerThe NRB Group mainframe day 2021 - DevOps on Z - Jerome Klimm - Benoit Ebner
The NRB Group mainframe day 2021 - DevOps on Z - Jerome Klimm - Benoit Ebner
 
How to Transform Into a Data-Driven Organization
How to Transform Into a Data-Driven OrganizationHow to Transform Into a Data-Driven Organization
How to Transform Into a Data-Driven Organization
 
The domino maze
The domino mazeThe domino maze
The domino maze
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FS
 
Metrics 4 faster feedback
Metrics 4 faster feedbackMetrics 4 faster feedback
Metrics 4 faster feedback
 
Migrating Core Enterprise Applications to the Cloud
Migrating Core Enterprise Applications to the CloudMigrating Core Enterprise Applications to the Cloud
Migrating Core Enterprise Applications to the Cloud
 
Devops For Drupal
Devops  For DrupalDevops  For Drupal
Devops For Drupal
 
Seeing RED: Monitoring and Observability in the Age of Microservices
Seeing RED: Monitoring and Observability in the Age of MicroservicesSeeing RED: Monitoring and Observability in the Age of Microservices
Seeing RED: Monitoring and Observability in the Age of Microservices
 
Big Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesBig Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil Games
 
Consolidating MLOps at One of Europe’s Biggest Airports
Consolidating MLOps at One of Europe’s Biggest AirportsConsolidating MLOps at One of Europe’s Biggest Airports
Consolidating MLOps at One of Europe’s Biggest Airports
 
Deployments in one click!
Deployments in one click!Deployments in one click!
Deployments in one click!
 
Tools and best practices for sustainable software
Tools and best practices for sustainable softwareTools and best practices for sustainable software
Tools and best practices for sustainable software
 


Building a Data Ingestion & Processing Pipeline with Spark & Airflow

  • 1. BUILDING A DATA INGESTION & PROCESSING PIPELINE WITH SPARK & AIRFLOW (DATA DRIVEN RIJNMOND)
      Tom Lous @tomlous
      https://cdn.mg.co.za/crop/content/images/2013/07/18/Coffee_Machine_-_Truth_-_Cape_Town_DH_4271_i2e.jpg/1680x1080/
  • 2. DATLINQ BY NUMBERS (WHO ARE WE?)
      • 14 countries
        Netherlands, Belgium, France, UK, Germany, Spain, Italy, Luxembourg
      • 100 customers across Europe
        Coca-Cola, Unilever, FrieslandCampina, Red Bull, Spa, AB InBev & Heineken
      • 2000 active users
        Working with our SaaS, insights & mobile solutions
      • 1.2M outlets
        Restaurants, bars, hotels, bakeries, supermarkets, kiosks, hospitals
      • 5000 mutations per day
        Out of business, starters, ownership change, name change, etc.
      • 30 students
        Manually checking many changes
      https://www.pexels.com/photo/people-on-dining-chair-inside-the-glass-wall-room-230736/
  • 3. DATLINQ COLLECTS DATA (WHAT DO WE DO?)
      • Outlet features
        Address, segmentation, opening hours, contact info, lat/long, terrace, parking, pricing
      • Outlet surroundings
        Nearby public transportation, school, going out area, shopping, high traffic
      • Demographics
        Median income, age, density, marital status
      • Social Media
        Presence, traffic, likes, posts
      http://www.wandeltourrotterdam.nl/wp-content/uploads/photo-gallery/1506-Rotterdam-Image-Bank.jpg
  • 4. DATLINQ’S MISSION (WHY DO WE DO IT?)
      • Food service market transparent
      • Effective marketing & sales
      • Which food products are relevant where?
      http://frietwinkel.nl/wp-content/uploads/2014/08/Frietwinkel-openingsfoto.jpg
  • 5. WHY SPARK? (SOLUTION)
      • Scaling is hard
        Mo’ outlets, mo’ students
      • Data != information
        One-off analysis projects should be done continuously and immediately
      • Work smarter not harder
        Have students train AI instead of doing repetitive work
      • Faster results
        Distributed computing allows for increased processing speeds
      https://s-media-cache-ak0.pinimg.com/originals/7e/63/11/7e6311d96eb4bb28a41ea34378de4182.jpg
      SOLUTION?
  • 6. HOW TO START (SOLUTION)
      • Focus on MVP
        Minimum Viable Product that acts like a Proof of Concept
      • Declare riskiest assumption
        We can automatically match data from different sources
      • Prove it
        Small scale ML models
      http://realcorfu.com/wp-content/uploads/2013/09/kiosk3.jpg
  • 8. USE THE CLOUD (BUILDING THE PIPELINE)
      • Flexible usage
        Pay for what you use
      • Scaling out of the box
        The only limit seems to be your credit card
      • CLI > UI
        CLI tools are your friend. UI is limited
      https://s-media-cache-ak0.pinimg.com/originals/00/f8/fa/00f8fafea39a7b185de0e3c6c7bbaaec.jpg
  • 9. DEVELOP LOCALLY (BUILDING THE PIPELINE)
      • Check versions
        Your cloud provider sets the max version (e.g. Spark 2.0.2 instead of 2.1.0)
      • Same code, smaller data
        Use samples of your data to test on
      • Use at least 2 threads
        Otherwise parallel execution cannot be truly tested
      • Exclude libraries from deployment
        Some libraries are provided on the cloud, don’t add them to your final build
      https://s-media-cache-ak0.pinimg.com/originals/00/f8/fa/00f8fafea39a7b185de0e3c6c7bbaaec.jpg
  • 10. CONTINUOUS INTEGRATION (BUILDING THE PIPELINE)
      • Deploy code
        Manual uploads or deployments are asking for trouble.
        Use polling instead of pushing for security
        Create pipeline script to stage builds
        Automated tests
      • Commit config
        Store CI integration
  • 11. SPARK JOBS (BUILDING THE PIPELINE)
      • All jobs in one project
        Unless there is a very good reason, keep all your Jobs in one file
      • Use Datasets as much as possible
        Avoid RDDs and DataFrames
      • Use Case Class (encoders) as much as possible
        Generic Row object will only get you so far
      • Verbosity only in main class
        Log messages from nodes won’t make it to the master
      • Look at the execution plan
        Remove temp files, clean the slate
  • 12. PARQUET (BUILDING THE PIPELINE)
      • What is it
        Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem
      • Compressed
        Small footprint.
      • Columnar
        Easier to drill down into data based on fields
      • Metadata
        Stores meta information easy to use by Spark (counts, etc.)
      • Partitioning
        Allows for partitioning on values for faster access
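
The partitioning point can be illustrated with a small in-memory sketch. It is not Parquet itself and uses no Parquet library: it only mimics the `column=value` directory naming that partitioned datasets commonly use, to show why a filter on the partition column has to touch just one group of records. The outlet records and field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical outlet records; the field names are assumptions for illustration.
OUTLETS = [
    {"id": 1, "country": "NL", "segment": "bar"},
    {"id": 2, "country": "NL", "segment": "hotel"},
    {"id": 3, "country": "BE", "segment": "bar"},
    {"id": 4, "country": "FR", "segment": "bakery"},
]


def partition_by(records, column):
    """Group records under 'column=value' keys, mimicking partition directories."""
    partitions = defaultdict(list)
    for record in records:
        partitions[f"{column}={record[column]}"].append(record)
    return dict(partitions)


def read_partition(partitions, column, value):
    """Return one partition's records and how many records had to be scanned."""
    selected = partitions.get(f"{column}={value}", [])
    return selected, len(selected)


partitions = partition_by(OUTLETS, "country")
dutch_outlets, scanned = read_partition(partitions, "country", "NL")
```

Here a query for Dutch outlets scans only the records stored under `country=NL` and never looks at the other groups, which is the "faster access" the slide refers to.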
  • 13. ELASTICSEARCH (BUILDING THE PIPELINE)
      • NxN matching is computationally expensive
        Querying Elasticsearch / Lucene / Solr is less so
        Fuzzy searching using n-grams and other tokenizers
      • Cluster out of the box
        Auto replication
      • Config differs from normal use
        It’s not a search feature for a website.
        Don’t care about update performance
        Don’t load balance queries (_local)
      • Discovery Plugins for GC
        So nodes can find each other on local network without Zen
        Create VMs with --scopes=compute-rw
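
Why an n-gram index avoids the NxN comparison can be sketched in plain Python. This is a conceptual illustration only, not Elasticsearch or its query API: names are split into character trigrams, an inverted index maps each trigram to the names containing it, and a query is scored (with Jaccard similarity, an assumption made here) only against names that share at least one trigram. The outlet names and ids are hypothetical.

```python
from collections import defaultdict


def ngrams(text, n=3):
    """Return the set of character n-grams of a lower-cased, space-padded string."""
    padded = f"{' ' * (n - 1)}{text.lower().strip()}{' ' * (n - 1)}"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class NgramIndex:
    """Tiny inverted index: n-gram -> ids of the names containing it."""

    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)
        self.grams = {}

    def add(self, doc_id, name):
        grams = ngrams(name, self.n)
        self.grams[doc_id] = grams
        for gram in grams:
            self.postings[gram].add(doc_id)

    def search(self, name, threshold=0.5):
        """Score only the entries that share at least one n-gram with the query."""
        query = ngrams(name, self.n)
        candidates = set()
        for gram in query:
            candidates |= self.postings.get(gram, set())
        scored = [(doc_id, jaccard(query, self.grams[doc_id])) for doc_id in candidates]
        hits = [(doc_id, score) for doc_id, score in scored if score >= threshold]
        return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical outlet names from one source ...
index = NgramIndex()
index.add("a1", "Cafe De Witte Aap")
index.add("a2", "Bakkerij Jansen")
index.add("a3", "Hotel New York")

# ... matched against a slightly different spelling from another source.
matches = index.search("Café de Witte Aap")
```

Only the first entry shares trigrams with the query, so the other two names are never scored at all. A search engine applies the same idea at scale, with its own tokenizers and scoring.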
  • 14. ANSIBLE (BUILDING THE PIPELINE)
      • Automate VMs / agentless
        Elasticsearch cluster
      • Dependent on Ansible GC Modules
        gcloud keeps updating. Ansible GC doesn’t directly follow suit
  • 15. AIRFLOW (BUILDING THE PIPELINE)
      • Create workflows
        Define all tasks and dependencies
      • Cron on steroids
        Starts every day
      • Dependencies
        Tasks are dependent on tasks upstream
      • Backfill / rerun
        Run as if in the past
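
The two ideas on this slide, upstream dependencies and backfilling, can be sketched with the Python standard library alone. This is deliberately not the Airflow API: the task names, the dates, and the helper functions are all hypothetical, and a real scheduler adds retries, state tracking, and parallelism on top.

```python
from datetime import date, timedelta
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of upstream tasks it waits for.
DEPENDENCIES = {
    "ingest": set(),
    "clean": {"ingest"},
    "match": {"clean"},
    "export": {"match"},
}


def execution_order(dependencies):
    """Return the tasks in an order where every upstream task comes first."""
    return list(TopologicalSorter(dependencies).static_order())


def run_for(execution_date, dependencies, log):
    """Run all tasks for one logical day, recording (day, task) pairs."""
    for task in execution_order(dependencies):
        log.append((execution_date.isoformat(), task))


def backfill(start, end, dependencies):
    """Run the daily workflow for every day in [start, end], as if in the past."""
    log = []
    day = start
    while day <= end:
        run_for(day, dependencies, log)
        day += timedelta(days=1)
    return log


# Hypothetical date range: replay three past days of the daily workflow.
runs = backfill(date(2017, 1, 1), date(2017, 1, 3), DEPENDENCIES)
```

Each task receives the logical day it is running for rather than the current clock time, which is what makes rerunning a past day possible.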
  • 17. FINALLY (CONCLUSION)
      • What’s next?
        Improving MVP
        Adding more machine learning
        Adding more data sources
        Adding more countries
        Improving architecture
      • Production
        Create an API for use in our products
        Create UI for exploring the data
        Exporting to standard databases for further use
      • Team
        Hire Data Engineer(s)
        Hire Data Scientist(s)
      https://static.pexels.com/photos/14107/pexels-photo-14107.jpeg
  • 18. THANKS FOR WATCHING (CONCLUSION)
      • Questions?
        AMA
      • Vacancies!
        Ask me or check our site http://www.datlinq.com/en/vacancies
      • Break
        Grab a drink and let’s reconvene after a 15 min break