Data Transformation made easy
Building a successful agile data transformation stack
Martin Magdinier
March 2014
Building an Agile Data Transformation Stack
Martin Magdinier
Agile Data Transformation Stack is the Key for Success
If Data is the new oil
Where are the gas stations?!
● Data is not (yet?) a standardized good:
– Environment with evolving technology and formats
– Unique needs by:
● Industry
● Department
● Business case
The Data Transformation Process
Your data transformation stack should help you to:
– Explore and search new data
– Identify and Extract relevant data
– Refine/Turn data into usable information
– Store & distribute to business users
The Agile Data Transformation Stack
● Is a combination of complementary tools,
technology and processes,
● Supporting rapid iteration of ideas,
processes and products
● Focused on value creation for the customer
(internal or external)
The Data Transformation Stack
[Diagram of the stack: Platform, Data Processing Solutions, Storage]
● Free
● Open Source
● Suit your needs
● All software is cross-platform
Agile Data Transformation Iteration
● Data Discovery & Profiling
– Mine existing data
– Add new data
● Data Transformation
– Process & Code
– Prototype (MVP)
– Semi automated
– Automation
● Track / Measure
– Collect feedback
– Learn from your experience
● Data Consumption
– Create value
– Generate new need
● Progress in small incremental steps
Data Discovery & Profiling
Data Discovery
● Seek:
– New data sources
– New usage for existing data
● Validate
– Does the data match my quality criteria?
– Can I create value out of it?
Data Profiling
● Understand your data and make sense of it
– Mine
– Explore
– Interact
– Transform
● Combine with visualization and reporting tools
● Iterate and explore various vantage points
Data Transformation
[Iteration diagram repeated, now with "Refine requirements" listed under Data Discovery & Profiling]
Role of a Working Prototype
● Minimize project cost and development time
● Focus on core functions of the transformation
process (packaging will come later)
● Define your transformation strategy in a
sandbox mode
– Validate your assumptions
– Identify roadblocks on the path to automation
Iterate - Iterate - Iterate
● Improve and grow by incremental steps
● Start feeding your business with data
– Validate if there is value in this data
– Collect feedback from the users
● Iterate as much as necessary
Discovery, Profiling & Prototyping
● Designed for technical and business users
● Supports a variety of input formats
● Allows easy and safe interaction with the data:
– Somewhere between Excel
● Point-and-click, user-friendly interface
● Preview of changes
● Undo / Redo functions
– and SQL
● Query-oriented language
● Handles large amounts of data
OpenRefine Interface
● Facets for fast filtering
● Expression builder
● Instant preview of the transformation
Prototyping & Automation
● Extract – Transform – Load solution
● Process focus with
– Drag and drop component graphical interface
– Java based
● Compile your job to run it on your server
– Java (Talend Open Studio)
– MapReduce (Talend for Big Data)
● Connect to anything
● Open Source: easy to add or customize your own components / libraries
Talend Open Studio Interface
● Drag, drop, connect and configure components
● Process-oriented interface
● List of available components
Semi Automated Cleaning
● Intelligent Meta Crowd-sourcing Platform
● Build your job for data:
– clean up
– analysis
– categorization
– collection ...
● Ensure quality output
– Check consistency of results
– Select best worker
● Web Interface to
– Build Prototype
– Test job
● API for automation
– OpenRefine extension
– Talend Internet component
Lessons Learned
Don't repeat yourself
● 1 process = 1 independent component / job
● Reuse your existing components
● Maintain your code in one place
● Add a few new items at each iteration
Name Splitting
● Split FullName into FirstName and LastName
– John Doe / John Van de Doe / John Della Doe
1. Define the logic and exception list in OpenRefine
2. Translate the logic into a Talend component (tJavaRow)
3. Move the Talend component to a routine
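The splitting logic can be prototyped in a few lines before it is ported anywhere. The sketch below is a plain-Python illustration, not the OpenRefine expression or the tJavaRow code used in the talk; the `PARTICLES` exception list and the function name `split_full_name` are hypothetical and would need to be tuned on real data.

```python
# Hypothetical exception list: lower-cased name particles that belong to the last name.
PARTICLES = {"van", "de", "der", "den", "della", "di", "da", "du", "la", "le", "von"}


def split_full_name(full_name: str) -> tuple[str, str]:
    """Split a full name into (first_name, last_name).

    The last name is the final token, extended to the left over any
    particles from the exception list. The first token is always kept
    in the first name.
    """
    tokens = full_name.split()
    if not tokens:
        return "", ""
    if len(tokens) == 1:
        return tokens[0], ""
    start = len(tokens) - 1
    while start > 1 and tokens[start - 1].lower() in PARTICLES:
        start -= 1
    return " ".join(tokens[:start]), " ".join(tokens[start:])


for name in ["John Doe", "John Van de Doe", "John Della Doe"]:
    print(name, "->", split_full_name(name))
```

Keeping the exception list as data rather than code is what makes the later move to a shared routine cheap: every job that calls the routine picks up a new particle as soon as it is added.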
Garbage in - Garbage out
● Catch errors early
– The sooner, the easier
– Do not build the next step on erroneous data
● Independent processes
– Make it easier to track and debug
– When a bug is fixed, every process / job benefits from it
Know where the value is
● A poorly planned data cleaning process is a never-ending job (and a depressing experience)
● Prototyping helps to
– Anticipate how dirty the data is
● Plan appropriate strategy
● Discard the source early on if too dirty
– Set quality level of acceptance
● Level of granularity
● Data format
● ...
Example: Address parsing
Example:
91 King Street East
305 – 1055, 20 TH ST SW
● Option A:
– Address Line 1
– Address Line 2
● Option B:
– Street Number
– Street Name
– Unit / PO Box
– Unit / PO Box Number
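To make the cost of Option B concrete, here is a minimal Python sketch that parses the two sample addresses into finer-grained fields. It rests on assumptions not stated on the slide: that a leading "305 –" is a unit number, and that the rest of the line is the street name. The pattern and field names are illustrative only, and anything that does not match (PO boxes, for instance) is returned as `None` so it can be routed to manual cleaning.

```python
import re
from typing import Optional

# Assumed layout: [unit, dash] street number [comma] street name.
ADDRESS_PATTERN = re.compile(
    r"^\s*(?:(?P<unit>\d+)\s*[–-]\s*)?(?P<number>\d+),?\s+(?P<street>.+?)\s*$"
)


def parse_address(line: str) -> Optional[dict]:
    """Return Option B style fields, or None when the line does not fit the pattern."""
    match = ADDRESS_PATTERN.match(line)
    if match is None:
        return None
    return {
        "unit_number": match.group("unit"),      # None when no unit is present
        "street_number": match.group("number"),
        "street_name": match.group("street"),
    }


for sample in ["91 King Street East", "305 – 1055, 20 TH ST SW"]:
    print(sample, "->", parse_address(sample))
```

Option A needs none of this, which is the point of setting the level of granularity before building: every extra field is a pattern to write, test and maintain.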
Know when to stop
● Plan your process keeping in mind the effort to
– Build
– Operate
– Maintain
● Balance fully automated vs semi-automated
process
– Manual Cleaning - Crowdflower API
– OpenRefine Redo / Apply function
– Talend job
Undo / redo in OpenRefine
● History to undo previous steps
● Extract and re-apply transformation steps on a different project
● JSON code to copy / paste in a different project
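For readers who have not seen the extracted history, the Python sketch below builds a one-step example of the kind of JSON that is copied between projects. The field names follow the shape of a text-transform step as exported by OpenRefine, but treat them as an assumption and check them against an actual extract from your own version; the column name "City" and the expression are made up for illustration.

```python
import json

# One illustrative operation; a real history is a list of such steps.
operations = [
    {
        "op": "core/text-transform",
        "engineConfig": {"facets": [], "mode": "row-based"},
        "columnName": "City",  # hypothetical column
        "expression": "grel:value.trim().toTitlecase()",
        "onError": "keep-original",
        "repeat": False,
        "repeatCount": 10,
        "description": "Trim and title-case cells in column City",
    }
]

# This text is what would be pasted into the Apply dialog of another project.
history_json = json.dumps(operations, indent=2)
print(history_json)
```

Because the history is plain JSON, it can be kept under version control next to the Talend jobs and reviewed like any other code.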
Know when to stop
Build your job in Crowdflower
Cleaning Typos
● How do you spell:
– Mississagua
– mississauga
– Mississauga
– Mississuaga
– Misssisauga
● Algorithms
– Levenshtein
– Fingerprint
– n-gram
– Metaphone
– PPM
● Process followed
– Test and explore various algorithms in OpenRefine
– Automate in Talend with tFuzzyMatch
– Add human validation over a certain threshold
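The process above can be sketched in plain Python to show how two of the listed algorithms behave on the Mississauga variants. This is a simplified illustration, not the OpenRefine or tFuzzyMatch implementation: `fingerprint` is a reduced version of the key-collision idea, `levenshtein` is the textbook edit distance, and the reference list and the `auto_threshold` of 2 are assumptions chosen for the example.

```python
import string
import unicodedata


def fingerprint(value: str) -> str:
    """Simplified key: trim, lowercase, drop accents and punctuation, sort unique tokens."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def match_city(value: str, reference: list[str], auto_threshold: int = 2):
    """Return (best reference value, distance, needs_human_review)."""
    key = fingerprint(value)
    best = min(reference, key=lambda ref: levenshtein(key, fingerprint(ref)))
    distance = levenshtein(key, fingerprint(best))
    return best, distance, distance > auto_threshold


reference = ["Mississauga", "Toronto"]  # hypothetical reference list
for raw in ["Mississagua", "mississauga", "Mississuaga", "Misssisauga", "Brampton"]:
    print(raw, "->", match_city(raw, reference))
```

The fingerprint alone only merges values that differ by case, spacing or punctuation; the transposed letters need the edit distance, and anything beyond the threshold is flagged for a person to check rather than corrected automatically.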
Cleaning Typos
1. OpenRefine cluster interface to test different algorithms
2. tFuzzyMatch in Talend to automate the transformation
Conclusion
● Think Agile!
● Iterate as often as you can
– Start small and build on it
– Confirm your assumptions
– Focus on value creation
● Build a data friendly environment
– Choose your tools carefully
– Leave room for learning and growing
Contact
Ask me questions!
Martin Magdinier
● LinkedIn: www.linkedin.com/in/magdinier/en
● Twitter: @magdmartin
● Email
– martin.magdinier@gmail.com
– mmagdinier@alleyneinc.net

Building a successful agile data transformation stack

  • 1. Data Transformation made easy Building a successful agile data transformation stack Martin Magdinier March 2014
  • 2. Building an Agile Data Transformation Stack Martin Magdinier Agile Data Transformation Stack isAgile Data Transformation Stack is the Key for Successthe Key for Success
  • 3. Building an Agile Data Transformation Stack Martin Magdinier If Data is the new oilIf Data is the new oil Where are the gas station !?!Where are the gas station !?! ● Data is not (yet?) a standardized good: - Environment with evolving technology and formats ● - Unique need: ● Industry, ● Department, ● Business case
  • 4. Building an Agile Data Transformation Stack Martin Magdinier The Data Transformation ProcessThe Data Transformation Process Your data transformation stack should help you to: – Explore and search new data – Identify and Extract relevant data – Refine/Turn data into usable information – Store & distribute to business users
  • 5. Building an Agile Data Transformation Stack Martin Magdinier The Agile Data Transformation Stack ● Is a combination of complementary tools, technology and processes, ● Supporting rapid iteration of ideas, processes and products ● Focused on value creation for the customer (internal or external)
  • 6. Building an Agile Data Transformation Stack Martin Magdinier The Data Transformation Stack ...... Platform Data Processing Solutions Storage Free Open Source Suit your needsAll Software are cross platform
  • 7. Building an Agile Data Transformation Stack Martin Magdinier Data Discovery & Profiling Mine existing data Add new data Data Transformation Process & Code Prototype (MVP) Semi automated Automation Track / Measure Collect feedback Learn from your experience Progress in small incremental steps Data Consumption Create value Generate new need Agile Data Transformation Iteration
  • 8. Building an Agile Data Transformation Stack Martin Magdinier Data Discovery & Profiling Data Discovery & Profiling Mine existing data Add new data Data Transformation Process & Code Prototype (MVP) Semi automated Automation Track / Measure Collect feedback Learn from your experience Progress in small incremental steps Data Consumption Create value Generate new need
  • 9. Building an Agile Data Transformation Stack Martin Magdinier Data Discovery ● Seek: – New data sources – New usage for existing data ● Validate – Does the data match my quality criteria? – Can I create value out of it?
  • 10. Building an Agile Data Transformation Stack Martin Magdinier Data Profiling ● Understand your data and make sense of it – Mine – Explore – Interact – Transform ● Combine with visualization and reporting tool ● Iterate and explore various vantage points
  • 11. Building an Agile Data Transformation Stack Martin Magdinier Data Discovery & Profiling Mine existing data Add new data Refine requirements Data Transformation Process & Code Prototype (MVP) Semi automated Automation Track / Measure Collect feedback Learn from your experience Progress in small incremental steps Data Consumption Create value Generate new need Data Transformation
  • 12. Building an Agile Data Transformation Stack Martin Magdinier Role of a Working Prototype ● Minimize project cost and development time ● Focus on core functions of the transformation process (packaging will come later) ● Define your transformation strategy in a sandbox mode – Validate your assumption – Identify road block on the path to automation
  • 13. Building an Agile Data Transformation Stack Martin Magdinier Iterate - Iterate - Iterate ● Improve and grow by incremental steps ● Start feeding your business with data – Validate if there is value in this data – Collect feedback from the users ● Iterate as much as necessary
  • 14. Building an Agile Data Transformation Stack Martin Magdinier Discovery, Profiling & Prototyping ● Designed for technical and business users ● Support a variety of input format ● Allow easy and safe interaction with the data: – Somewhere between Excel ● Point and click user friendly interface ● Changes Preview ● Undo / Redo functions – and SQL ● Query oriented language ● Handling large amount of data
  • 15. OpenRefine Interface [screenshot]
    ● Facet for fast filtering
    ● Expression builder
    ● Instant preview of the transformation
  • 16. Prototyping & Automation
    ● Extract – Transform – Load solution
    ● Process-focused, with
      – Drag-and-drop graphical component interface
      – Java based
    ● Compile your job to run it on your server
      – Java (Talend Open Studio)
      – MapReduce (Talend for Big Data)
    ● Connect to anything
    ● Open Source: easy to add or customize your own components / libraries
  • 17. Talend Open Studio Interface [screenshot]
    ● Drag, drop, connect and configure components
    ● Process-oriented interface
    ● List of components available
  • 18. Semi Automated Cleaning
    ● Intelligent Meta Crowd-sourcing Platform
    ● Build your job for data:
      – clean up
      – analysis
      – categorization
      – collection ...
    ● Ensure quality output
      – Check consistency of results
      – Select best worker
    ● Web Interface to
      – Build Prototype
      – Test job
    ● API for automation
      – OpenRefine extension
      – Talend Internet component
  • 19. Lesson Learned [iteration diagram]
    ● Data Discovery & Profiling: mine existing data, add new data, refine requirements
    ● Data Transformation (process & code): prototype (MVP), semi-automated, automation
    ● Data Consumption: create value, generate new need
    ● Track / measure, collect feedback, learn from your experience
    ● Progress in small incremental steps
  • 20. Don't repeat yourself
    ● 1 process = 1 independent component / job
    ● Reuse your existing components
    ● Maintain your code in one place
    ● Add a few new items at each iteration
  • 21. Name Splitting
    ● Split FullName into FirstName and LastName
      – John Doe / John Van de Doe / John Della Doe
    1. Define the logic and exception list in OpenRefine
    2. Translate the logic into a Talend component (tJavaRow)
    3. Move the Talend component to a routine
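The splitting logic on the name-splitting slide can be sketched as follows. This is a minimal illustration in Python, not the author's OpenRefine expression or tJavaRow code, which the slides do not show. The function name `split_full_name` and the contents of `SURNAME_PARTICLES` are assumptions; only "Van de" and "Della" come from the slide's examples.

```python
# Hypothetical exception list: particles that belong to the last name.
# Only "van", "de" and "della" are taken from the slide's examples.
SURNAME_PARTICLES = {"van", "de", "della"}


def split_full_name(full_name):
    """Split a full name into (first_name, last_name).

    The last token always goes to the last name. Tokens just before it
    are absorbed into the last name while they are known particles,
    but the first token is always kept as the first name.
    """
    tokens = full_name.split()
    if not tokens:
        return "", ""
    if len(tokens) == 1:
        return tokens[0], ""
    start = len(tokens) - 1
    while start > 1 and tokens[start - 1].lower() in SURNAME_PARTICLES:
        start -= 1
    return " ".join(tokens[:start]), " ".join(tokens[start:])


for name in ("John Doe", "John Van de Doe", "John Della Doe"):
    print(name, "->", split_full_name(name))
```

Keeping the exception list in one place mirrors the "maintain your code in one place" advice: a new particle is added once and every job that calls the routine picks it up.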
  • 22. Garbage in - Garbage out
    ● Catch errors early
      – The sooner, the easier
      – Do not build the next step on erroneous data
    ● Independent processes
      – Make it easier to track and debug
      – When the bug is fixed, every process / job benefits from it
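One way to catch errors early is to separate rejected rows from valid ones before the next step runs. The sketch below is illustrative only: the function `split_valid_rejected`, the `CHECKS` dictionary, and the column names `FullName` and `City` are assumptions, not part of the original stack.

```python
def split_valid_rejected(rows, checks):
    """Separate rows that pass every check from rows that fail at least one.

    Returns (valid_rows, rejected), where rejected is a list of
    (row, names_of_failed_checks) tuples for later debugging.
    """
    valid, rejected = [], []
    for row in rows:
        failed = [name for name, check in checks.items() if not check(row)]
        if failed:
            rejected.append((row, failed))
        else:
            valid.append(row)
    return valid, rejected


# Hypothetical checks; the column names are assumptions for illustration.
CHECKS = {
    "has_full_name": lambda row: bool((row.get("FullName") or "").strip()),
    "has_city": lambda row: bool((row.get("City") or "").strip()),
}

rows = [
    {"FullName": "John Doe", "City": "Mississauga"},
    {"FullName": " ", "City": "Toronto"},
]
valid, rejected = split_valid_rejected(rows, CHECKS)
print(len(valid), "valid;", len(rejected), "rejected")
```

Only the valid rows are passed on, and each rejected row carries the names of the checks it failed, which keeps tracking and debugging in one independent step.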
  • 23. Know where the value is
    ● A poorly planned data cleaning process is a never-ending job (and a depressing experience)
    ● Prototyping helps to
      – Anticipate how dirty the data is
        ● Plan an appropriate strategy
        ● Discard the source early on if it is too dirty
      – Set the quality level of acceptance
        ● Level of granularity
        ● Data format
        ● ...
  • 24. Example: Address parsing
    ● Examples:
      – 91 King Street East
      – 305 – 1055, 20 TH ST SW
    ● Option A:
      – Address Line 1
      – Address Line 2
    ● Option B:
      – Street Number
      – Street Name
      – Unit / PO Box
      – Unit / PO Box Number
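A finer-grained split such as Option B could be prototyped with a regular expression. The sketch below is an assumption-heavy illustration: it reads "305 – 1055, 20 TH ST SW" as unit 305 at street number 1055 on "20 TH ST SW", which the slide does not state explicitly. The pattern, the function `parse_address`, and the output keys are hypothetical, and real address data would need many more cases.

```python
import re

# Assumed layout: optional "<unit> - " prefix (hyphen or en dash),
# then "<street number>", an optional comma, and the street name.
ADDRESS_PATTERN = re.compile(
    r"^\s*(?:(?P<unit>\w+)\s*[-\u2013]\s*)?"
    r"(?P<number>\d+)\s*,?\s+"
    r"(?P<street>.+?)\s*$"
)


def parse_address(raw):
    """Return a dict of address parts, or None if the layout is not recognized."""
    match = ADDRESS_PATTERN.match(raw)
    if match is None:
        return None
    return {
        "unit_number": match.group("unit"),
        "street_number": match.group("number"),
        "street_name": match.group("street"),
    }


for raw in ("91 King Street East", "305 \u2013 1055, 20 TH ST SW"):
    print(raw, "->", parse_address(raw))
```

Rows that return `None` are the ones to inspect during prototyping: their share of the source gives a quick estimate of how dirty the data is and whether this level of granularity is worth the effort.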
  • 25. Know when to stop
    ● Plan your process keeping in mind the effort to
      – Build
      – Operate
      – Maintain
    ● Balance fully automated vs semi-automated process
      – Manual Cleaning - Crowdflower API
      – OpenRefine Redo / Apply function
      – Talend job
  • 26. Undo / redo in OpenRefine [screenshot]
    ● History to undo previous steps
    ● Extract and reapply transformation steps on a different project
    ● JSON code to copy / paste in a different project
  • 27. Know when to stop [screenshot]
    ● Build your job in Crowdflower
  • 28. Cleaning Typos
    ● How do you spell:
      – Mississagua
      – mississauga
      – Mississauga
      – Mississuaga
      – Misssisauga
    ● Algorithms
      – Levenshtein
      – Fingerprint
      – n-gram
      – Metaphone
      – PPM
    ● Process followed
      – Test and explore various algorithms in OpenRefine
      – Automate in Talend with tFuzzyMatch
      – Add human validation over a certain threshold
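Two of the listed algorithms, fingerprint keying and Levenshtein distance, can be sketched in a few lines of Python to show why both are useful on the Mississauga variants. This is an illustration of the general techniques, not the OpenRefine or tFuzzyMatch implementation. The function `match_reference` and its thresholds `auto_accept` and `max_distance` are assumptions standing in for the "human validation over a certain threshold" step.

```python
import string
import unicodedata


def fingerprint(value):
    """Key-collision key: trim, lowercase, drop accents and punctuation,
    then sort the unique whitespace-separated tokens."""
    value = unicodedata.normalize("NFKD", value.strip().lower())
    value = value.encode("ascii", "ignore").decode("ascii")
    value = value.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(value.split())))


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_reference(value, reference_values, auto_accept=1, max_distance=2):
    """Find the closest reference value (reference_values must be non-empty).

    Hypothetical thresholds: distances up to auto_accept are accepted
    automatically, distances up to max_distance go to human review.
    """
    key = fingerprint(value)
    best = min(reference_values, key=lambda ref: levenshtein(key, fingerprint(ref)))
    distance = levenshtein(key, fingerprint(best))
    if distance <= auto_accept:
        return best, "auto"
    if distance <= max_distance:
        return best, "review"
    return None, "unmatched"


cities = ["Mississauga", "Toronto"]
for spelling in ("Mississagua", "mississauga", "Mississuaga", "Misssisauga"):
    print(spelling, "->", match_reference(spelling, cities))
```

The fingerprint alone only merges the case variants ("mississauga" and "Mississauga"); the transposed and misplaced letters need an edit-distance method, which is why testing several algorithms on a sample before automating is worthwhile.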
  • 29. Cleaning Typos [screenshots]
    1. OpenRefine cluster interface to test different algorithms
    2. tFuzzyMatch in Talend to automate the transformation
  • 30. Conclusion
    ● Think Agile!
    ● Iterate as often as you can
      – Start small and build on it
      – Confirm your assumptions
      – Focus on value creation
    ● Build a data-friendly environment
      – Choose your tools carefully
      – Leave room for learning and growing
  • 31. Contact
    Ask me questions!
    Martin Magdinier
    ● LinkedIn: www.linkedin.com/in/magdinier/en
    ● Twitter: @magdmartin
    ● Email
      – martin.magdinier@gmail.com
      – mmagdinier@alleyneinc.net