Building a successful agile data transformation stack

Data Transformation made easy
Martin Magdinier
March 2014

Agile Data Transformation Stack is the Key for Success

If Data is the new oil, where are the gas stations?
● Data is not (yet?) a standardized good:
– Environment with evolving technology and formats
– Unique needs:
● Industry,
● Department,
● Business case

The Data Transformation Process
Your data transformation stack should help you to:
– Explore and search new data
– Identify and Extract relevant data
– Refine/Turn data into usable information
– Store & distribute to business users

The Agile Data Transformation Stack
● Is a combination of complementary tools, technology and processes,
● Supporting rapid iteration of ideas, processes and products
● Focused on value creation for the customer (internal or external)

The Data Transformation Stack
[Diagram: stack layers (Platform, Data Processing, Solutions, Storage). Labels: Free; Open Source; Suit your needs; All software is cross-platform.]

Agile Data Transformation Iteration
[Diagram: iteration cycle. Progress in small incremental steps.]
● Data Discovery & Profiling: mine existing data, add new data
● Data Transformation (Process & Code): prototype (MVP), semi automated, automation
● Track / Measure: collect feedback, learn from your experience
● Data Consumption: create value, generate new need


Data Discovery
● Seek:
– New data sources
– New usage for existing data
● Validate
– Does the data match my quality criteria?
– Can I create value out of it?

Data Profiling
● Understand your data and make sense of it
– Mine
– Explore
– Interact
– Transform
● Combine with visualization and reporting tools
● Iterate and explore various vantage points

Data Transformation
[Diagram: iteration cycle repeated, with "Refine requirements" added after "Add new data"]

Role of a Working Prototype
● Minimize project cost and development time
● Focus on core functions of the transformation process (packaging will come later)
● Define your transformation strategy in a sandbox mode
– Validate your assumptions
– Identify roadblocks on the path to automation

Iterate - Iterate - Iterate
● Improve and grow by incremental steps
● Start feeding your business with data
– Validate if there is value in this data
– Collect feedback from the users
● Iterate as much as necessary

Discovery, Profiling & Prototyping
● Designed for technical and business users
● Supports a variety of input formats
● Allows easy and safe interaction with the data:
– Somewhere between Excel
● Point and click user friendly interface
● Changes preview
● Undo / Redo functions
– and SQL
● Query oriented language
● Handles large amounts of data

OpenRefine Interface
[Screenshot callouts: facets for fast filtering; expression builder; instant preview of the transformation]

Prototyping & Automation
● Extract – Transform – Load solution
● Process focus with
– Drag and drop component graphical interface
– Java based
● Compile your job to run it on your server
– Java (Talend Open Studio)
– MapReduce (Talend for Big Data)
● Connect to anything
● Open Source: easy to add or customize your own components / libraries

Talend Open Studio Interface
[Screenshot callouts: drag, drop, connect and configure components; process oriented interface; list of components available]

Semi Automated Cleaning
● Intelligent Meta Crowd-sourcing Platform
● Build your job for data:
– clean up
– analysis
– categorization
– collection ...
● Ensure quality output
– Check consistency of results
– Select best worker
● Web Interface to
– Build Prototype
– Test job
● API for automation
– OpenRefine extension
– Talend Internet component

Lessons Learned

Don't repeat yourself
● 1 process = 1 independent component / job
● Reuse your existing components
● Maintain your code in one place
● Add a few new items at each iteration

Name Splitting
● Split FullName into FirstName and LastName
– John Doe / John Van de Doe / John Della Doe
1. Define logic and exception list in OpenRefine
2. Translate the logic into a Talend component (tJavaRow)
3. Move the Talend component to a routine
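
The splitting logic and exception list could be prototyped along these lines before being translated into a tJavaRow component. This is a minimal Python sketch, not the implementation used in the talk: the `PARTICLES` exception list and the `split_full_name` helper are hypothetical, and the rule that surname particles attach to the last name is an assumption drawn from the three sample names.

```python
# Hypothetical exception list: lowercase particles assumed to belong to the last name.
PARTICLES = {"van", "de", "della", "der", "von", "di"}


def split_full_name(full_name):
    """Split a full name into (first_name, last_name).

    The last token is always part of the last name; particles that
    immediately precede it are pulled into the last name as well.
    The first token always stays in the first name.
    """
    tokens = full_name.split()
    if not tokens:
        return "", ""
    if len(tokens) == 1:
        return tokens[0], ""
    start = len(tokens) - 1  # index where the last name begins
    while start > 1 and tokens[start - 1].lower() in PARTICLES:
        start -= 1
    return " ".join(tokens[:start]), " ".join(tokens[start:])


for name in ("John Doe", "John Van de Doe", "John Della Doe"):
    print(name, "->", split_full_name(name))
```

Keeping the exception list in one place mirrors step 3: once the logic lives in a shared routine, every job that splits names benefits from a new exception.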

Garbage in - Garbage out
● Catch errors early
– The sooner, the easier
– Do not build the next step on erroneous data
● Independent process
– Make it easier to track and debug.
– When the bug is fixed, every process / job benefits from it

Know where the value is
● A poorly planned data cleaning process is a never-ending job (and a depressing experience)
● Prototyping helps to
– Anticipate how dirty the data is
● Plan appropriate strategy
● Discard the source early on if too dirty
– Set quality level of acceptance
● Level of granularity
● Data format
● ...
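
One way to make a quality level of acceptance concrete while prototyping is to measure the share of rows that pass a validity check and to discard the source when that share is too low. A minimal Python sketch under stated assumptions: the `acceptance_rate` and `accept_source` helpers, the `has_city` check and the 0.8 default threshold are hypothetical illustrations, not values from the talk.

```python
def acceptance_rate(rows, is_valid):
    """Return the fraction of rows for which is_valid(row) is true."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(1 for row in rows if is_valid(row)) / len(rows)


def accept_source(rows, is_valid, min_rate=0.8):
    """Decide whether a source is clean enough to keep working on.

    min_rate is an assumed acceptance threshold; set it per business case.
    """
    return acceptance_rate(rows, is_valid) >= min_rate


def has_city(row):
    # Hypothetical validity rule: the "city" field must not be blank.
    return bool(row.get("city", "").strip())


sample = [
    {"city": "Mississauga"},
    {"city": ""},
    {"city": "Toronto"},
    {"city": "Ottawa"},
]
print(acceptance_rate(sample, has_city))
print(accept_source(sample, has_city))
```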

Example: Address parsing
91 King Street East
305 – 1055, 20 TH ST SW
● Option A:
– Address Line 1
– Address Line 2
● Option B:
– Street Number
– Street Name
– Unit / PO Box
– Unit / PO Box Number
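
The finer granularity of Option B costs more parsing effort than Option A. A rough Python sketch of what that effort looks like for the two sample addresses follows. It assumes the "unit – street number, street name" reading of the second example (305 as the unit, 1055 as the street number), which the slide does not state; `ADDRESS_PATTERN` and `parse_address` are hypothetical names, and the unit type (Unit vs PO Box) is not recoverable from these samples, so it is left out.

```python
import re

# Optional "<unit> - " prefix (hyphen or en dash), then street number, then street name.
ADDRESS_PATTERN = re.compile(
    r"^\s*(?:(?P<unit>\w+)\s*[-\u2013]\s*)?"
    r"(?P<number>\d+)\s*,?\s+"
    r"(?P<street>.+?)\s*$"
)


def parse_address(line):
    """Return the Option B fields found in line, or None if it does not match."""
    match = ADDRESS_PATTERN.match(line)
    if match is None:
        return None  # candidate for manual or semi-automated cleaning
    return {
        "unit_number": match.group("unit"),
        "street_number": match.group("number"),
        "street_name": match.group("street"),
    }


for line in ("91 King Street East", "305 \u2013 1055, 20 TH ST SW"):
    print(parse_address(line))
```

Addresses that return None are the ones to count when deciding whether Option B is worth the extra effort for a given source.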

Know when to stop
● Plan your process keeping in mind the effort to
– Build
– Operate
– Maintain
● Balance fully automated vs semi-automated process
– Manual Cleaning - Crowdflower API
– OpenRefine Redo / Apply function
– Talend job

Undo / redo in OpenRefine
[Screenshot callouts: history to undo previous steps; extract and re-apply transformation steps on a different project; JSON code to copy / paste in a different project]

Know when to stop
Build your job in Crowdflower

Cleaning Typo
● How do you spell:
– Mississagua
– mississauga
– Mississauga
– Mississuaga
– Misssisauga
● Algorithms
– Levenshtein
– Fingerprint
– n-gram
– Metaphone
– PPM
● Process followed
– Test and explore various algorithms in OpenRefine
– Automate in Talend with tFuzzyMatch
– Add human validation over a certain threshold
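
Two of the listed techniques can be sketched in a few lines of Python to show how they complement each other. This is an illustration only: `fingerprint` is a simplified approximation of key-collision fingerprinting (not OpenRefine's exact code), `levenshtein` is the standard edit distance, and `match_reference` with its `review_above` threshold is a hypothetical stand-in for the "human validation over a certain threshold" step. It does not reproduce tFuzzyMatch.

```python
import string
import unicodedata


def fingerprint(value):
    """Simplified fingerprint key: trim, lowercase, strip accents and
    punctuation, then sort and de-duplicate the whitespace-separated tokens."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution
            ))
        previous = current
    return previous[-1]


def match_reference(value, reference, review_above=1):
    """Return (closest reference value, distance, needs_human_review)."""
    key = fingerprint(value)
    best = min(reference, key=lambda ref: levenshtein(key, fingerprint(ref)))
    distance = levenshtein(key, fingerprint(best))
    return best, distance, distance > review_above


cities = ["Mississauga", "Toronto"]
for raw in ("mississauga", "Mississagua", "Misssisauga"):
    print(raw, "->", match_reference(raw, cities))
```

The fingerprint alone only merges case and punctuation variants; the transposed-letter spellings need the edit distance, and the ones above the threshold are the rows to route to human validation.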

Cleaning Typo
1. OpenRefine cluster interface to test different algorithms
2. tFuzzyMatch in Talend to automate transformation

Conclusion
● Think Agile!
● Iterate as often as you can
– Start small and build on it
– Confirm your assumptions
– Focus on value creation
● Build a data friendly environment
– Choose your tools carefully
– Leave room for learning and growing

Contact
Ask me questions!
Martin Magdinier
● LinkedIn: www.linkedin.com/in/magdinier/en
● Twitter: @magdmartin
● Email
– martin.magdinier@gmail.com
– mmagdinier@alleyneinc.net
