Building a Distributed Data Pipeline
Tom Lous
Freelance Big Data & Machine Learning Software Engineer
Spark, Akka, MLlib, Kafka, Spray. Presentation & demo for http://www.daysofcode.nl/ (@daysofcode)
1.
BUILDING A DISTRIBUTED MACHINE LEARNING
AT SCALE
2.
BACKGROUND: DATA
▸ Data is everywhere
▸ Data, unapplied, is useless
▸ How can we turn high volume & velocity data into value?
3.
BACKGROUND: PIPELINE
▸ Process the data continuously
▸ Apply several processing steps
Steps: COLLECT, MODEL, DEPLOY, INTEGRATE
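The four steps named on this slide (collect, model, deploy, integrate) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the talk or its demo: the function names, the input format (price strings), and the toy "model" (a mean over the last few prices). It only shows the shape of a pipeline that re-runs every step as new data arrives.

```python
from statistics import mean
from typing import Callable, Iterable, Iterator, List


def collect(source: Iterable[str]) -> List[float]:
    """COLLECT: parse raw price strings into floats (input format is assumed)."""
    return [float(value) for value in source]


def model(prices: List[float], window: int = 3) -> float:
    """MODEL: a toy stand-in for training, the mean of the last `window` prices."""
    return mean(prices[-window:])


def deploy(estimate: float) -> Callable[[], float]:
    """DEPLOY: expose the estimate behind a callable, standing in for a service."""
    return lambda: estimate


def integrate(predict: Callable[[], float]) -> str:
    """INTEGRATE: a client consumes the deployed prediction."""
    return f"next price estimate: {predict():.2f}"


def run_pipeline(batches: Iterable[Iterable[str]]) -> Iterator[str]:
    """Process data continuously: every incoming batch passes through all steps."""
    for batch in batches:
        yield integrate(deploy(model(collect(batch))))


# Hypothetical sample batches, not real market data.
results = list(run_pipeline([
    ["10.0", "11.0", "12.0"],
    ["12.0", "13.0", "14.0", "15.0"],
]))
```

Because `run_pipeline` is a generator, it can sit on top of an unbounded stream of batches and produce one result per batch.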
4.
SOLUTION: ANALYSE THE STOCK MARKET
Diagram components: YAHOO.COM, YAHOO.COM (PREFETCHED), COLLECTOR, MESSAGE BROKER, STREAMING, STORAGE, MODEL, MACHINE LEARNING (MLlib), WEBSERVICE, USER / CLIENTS
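As a rough illustration of how some of the components on this slide could fit together, the sketch below wires a collector, a message broker, a streaming job, storage, and a webservice in plain Python. The data flow between them is an assumption, since the slide only lists the components. An in-memory `queue.Queue` stands in for the message broker and a dict for storage, the symbols and prices are made-up sample values, and the model / MLlib part is left out entirely.

```python
import queue

broker = queue.Queue()  # stand-in for the message broker (assumption)
storage = {}            # stand-in for the storage layer (assumption)


def collector(quotes):
    """Publish each (symbol, price) quote to the broker."""
    for symbol, price in quotes:
        broker.put({"symbol": symbol, "price": price})


def streaming_job():
    """Drain the broker and keep the latest price per symbol in storage."""
    while not broker.empty():
        message = broker.get()
        storage[message["symbol"]] = message["price"]


def webservice(symbol):
    """Answer a client lookup from storage; returns None for unknown symbols."""
    return storage.get(symbol)


# Hypothetical quotes, not real market data.
collector([("ABC", 100.0), ("XYZ", 50.0), ("ABC", 101.5)])
streaming_job()
latest = webservice("ABC")
```

Keeping the broker between the collector and the streaming job means either side can be replaced or scaled without the other knowing about it.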
5.
DEMO
DEMO (FINGERS CROSSED)
6.
DONE
QUESTIONS?