With today's abundance of data (big or small), an organization's ability to capture, understand and process new content is key to its success. Martin Magdinier has developed a custom data transformation stack to integrate over a hundred eclectic data feeds into a single repository. His process goes through three stages:
- Data discovery and exploration,
- Rapid data transformation prototyping and
- Automation of the data cleaning and transformation process.
This presentation reviews the challenges specific to each step of the integration process and describes the tools used (OpenRefine, Talend, Crowdflower) and the processes developed to address them, while keeping the agility and flexibility of the overall stack in mind.
Building a successful agile data transformation stack
1. Data Transformation made easy
Building a successful agile data transformation stack
Martin Magdinier
March 2014
2. Building an Agile Data Transformation Stack
Agile Data Transformation Stack is the Key for Success
3. Building an Agile Data Transformation Stack
If data is the new oil, where are the gas stations!?
● Data is not (yet?) a standardized good:
– Environment with evolving technology and formats
– Unique needs:
● Industry,
● Department,
● Business case
4. Building an Agile Data Transformation Stack
The Data Transformation Process
Your data transformation stack should help you to:
– Explore and search new data
– Identify and Extract relevant data
– Refine/Turn data into usable information
– Store & distribute to business users
5. Building an Agile Data Transformation Stack
The Agile Data Transformation Stack
● Is a combination of complementary tools,
technology and processes,
● Supporting rapid iteration of ideas,
processes and products
● Focused on value creation for the customer
(internal or external)
6. Building an Agile Data Transformation Stack
The Data Transformation Stack
Platform
Data Processing
Solutions
Storage
Free
Open Source
Suit your needs
All software is cross-platform
7. Building an Agile Data Transformation Stack
Agile Data Transformation Iteration
● Data Discovery & Profiling
– Mine existing data
– Add new data
● Data Transformation
– Process & Code
– Prototype (MVP)
– Semi automated
– Automation
● Track / Measure
– Collect feedback
– Learn from your experience
● Data Consumption
– Create value
– Generate new need
● Progress in small incremental steps
8. Building an Agile Data Transformation Stack
Data Discovery & Profiling
[Figure: the agile data transformation iteration cycle]
9. Building an Agile Data Transformation Stack
Data Discovery
● Seek:
– New data sources
– New usage for existing data
● Validate
– Does the data match my quality criteria?
– Can I create value out of it?
10. Building an Agile Data Transformation Stack
Data Profiling
● Understand your data and make sense of it
– Mine
– Explore
– Interact
– Transform
● Combine with visualization and reporting tools
● Iterate and explore various vantage points
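The profiling step above can be approximated in a few lines of code. This is only a sketch of the idea, not the tooling used in the talk: the `profile` function and the sample feed are hypothetical, invented for illustration. It reports, per column, how many values are blank, how many distinct values exist, and which values are most frequent, which is roughly what a text facet shows interactively.

```python
import csv
import io
from collections import Counter


def profile(rows):
    """Summarize each column: row count, blanks, distinct values, top values."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            stats = columns.setdefault(
                name, {"rows": 0, "blank": 0, "values": Counter()}
            )
            stats["rows"] += 1
            cleaned = (value or "").strip()
            if cleaned:
                stats["values"][cleaned] += 1
            else:
                stats["blank"] += 1
    return {
        name: {
            "rows": s["rows"],
            "blank": s["blank"],
            "distinct": len(s["values"]),
            "top": s["values"].most_common(3),
        }
        for name, s in columns.items()
    }


# Hypothetical sample feed, invented for illustration.
SAMPLE = """city,province
Mississauga,ON
mississauga,ON
Toronto,
"""

report = profile(csv.DictReader(io.StringIO(SAMPLE)))
for column, stats in report.items():
    print(column, stats)
```

Running it on a new feed before writing any transformation gives a quick answer to "does the data match my quality criteria?".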
11. Building an Agile Data Transformation Stack
Data Transformation
[Figure: the agile data transformation iteration cycle, with the step "Refine requirements" added]
12. Building an Agile Data Transformation Stack
Role of a Working Prototype
● Minimize project cost and development time
● Focus on core functions of the transformation process (packaging will come later)
● Define your transformation strategy in a sandbox mode
– Validate your assumptions
– Identify roadblocks on the path to automation
13. Building an Agile Data Transformation Stack
Iterate - Iterate - Iterate
● Improve and grow by incremental steps
● Start feeding your business with data
– Validate if there is value in this data
– Collect feedback from the users
● Iterate as much as necessary
14. Building an Agile Data Transformation Stack
Discovery, Profiling & Prototyping
● Designed for technical and business users
● Support a variety of input formats
● Allow easy and safe interaction with the data:
– Somewhere between Excel
● Point and click user friendly interface
● Changes Preview
● Undo / Redo functions
– and SQL
● Query oriented language
● Handling large amounts of data
15. Building an Agile Data Transformation Stack
OpenRefine Interface
● Facet for fast filtering
● Expression builder
● Instant preview of the transformation
16. Building an Agile Data Transformation Stack
Prototyping & Automation
● Extract – Transform – Load solution
● Process focused, with
– Drag and drop component graphical interface
– Java based
● Compile your job to run it on your server
– Java (Talend Open Studio)
– MapReduce (Talend for Big Data)
● Connect to anything
● Open Source: easy to add or customize your own components / libraries
17. Building an Agile Data Transformation Stack
Talend Open Studio Interface
● Drag, drop, connect and configure components
● Process oriented interface
● List of components available
18. Building an Agile Data Transformation Stack
Semi Automated Cleaning
● Intelligent meta crowd-sourcing platform
● Build your job for data:
– clean up
– analysis
– categorization
– collection ...
● Ensure quality output
– Check consistency of results
– Select best worker
● Web interface to
– Build prototype
– Test job
● API for automation
– OpenRefine extension
– Talend Internet component
19. Building an Agile Data Transformation Stack
Lessons Learned
[Figure: the agile data transformation iteration cycle, including the "Refine requirements" step]
20. Building an Agile Data Transformation Stack
Don't repeat yourself
● 1 process = 1 independent component / job
● Reuse your existing components
● Maintain your code in one place
● Add a few new items at each iteration
21. Building an Agile Data Transformation Stack
Name Splitting
● Split FullName into FirstName and LastName
– John Doe / John Van de Doe / John Della Doe
1. Define logic and exception list in OpenRefine
2. Translate the logic into a Talend component (tJavaRow)
3. Move the Talend component to a routine
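The deck names the approach (splitting logic plus an exception list) but does not show the rules themselves. The following is a minimal Python sketch of that kind of logic, not the presenter's OpenRefine expression or tJavaRow code. The `PARTICLES` exception list and the `split_full_name` name are assumptions made for illustration.

```python
# Hypothetical exception list: surname particles that belong to the last name.
PARTICLES = {"van", "de", "della", "der", "von", "di", "la", "le"}


def split_full_name(full_name):
    """Split a full name into (first_name, last_name).

    The last token is always part of the last name; preceding tokens are
    pulled into the last name while they appear in the exception list.
    The first token is always kept as (part of) the first name.
    """
    tokens = full_name.split()
    if not tokens:
        return "", ""
    if len(tokens) == 1:
        return tokens[0], ""
    start = len(tokens) - 1
    while start > 1 and tokens[start - 1].lower() in PARTICLES:
        start -= 1
    return " ".join(tokens[:start]), " ".join(tokens[start:])


for name in ["John Doe", "John Van de Doe", "John Della Doe"]:
    print(name, "->", split_full_name(name))
```

Keeping the rule in one function mirrors step 3 above: once the logic lives in a single reusable routine, every job that splits names picks up a fix to the exception list.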
22. Building an Agile Data Transformation Stack
Garbage in - Garbage out
● Catch errors early
– The sooner, the easier
– Do not build the next step on erroneous data
● Independent processes
– Make it easier to track and debug
– When the bug is fixed, every process / job benefits from it
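One way to read "catch errors early" is to validate each row in its own small step and route bad rows aside before the next step runs. The sketch below illustrates that idea only; the field names (`full_name`, `email`), the checks, and the sample rows are hypothetical and not taken from the talk.

```python
def validate(row):
    """Return the list of problems found in one row (empty means clean)."""
    problems = []
    if not row.get("full_name", "").strip():
        problems.append("missing full_name")
    if "@" not in row.get("email", ""):
        problems.append("invalid email")
    return problems


def split_clean_and_rejected(rows):
    """Route rows so that later steps only ever see clean data."""
    clean, rejected = [], []
    for row in rows:
        problems = validate(row)
        if problems:
            rejected.append({"row": row, "problems": problems})
        else:
            clean.append(row)
    return clean, rejected


# Hypothetical rows, invented for illustration.
rows = [
    {"full_name": "John Doe", "email": "john@example.com"},
    {"full_name": "", "email": "nobody@example.com"},
    {"full_name": "Jane Doe", "email": "not-an-email"},
]
clean, rejected = split_clean_and_rejected(rows)
print(len(clean), "clean,", len(rejected), "rejected")
for item in rejected:
    print(item["problems"], item["row"])
```

Because the checks sit in one independent function, a fix to `validate` reaches every job that calls it.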
23. Building an Agile Data Transformation Stack
Know where the value is
● A poorly planned data cleaning process is a never-ending job (and a depressing experience)
● Prototyping helps to
– Anticipate how dirty the data is
● Plan appropriate strategy
● Discard the source early on if too dirty
– Set quality level of acceptance
● Level of granularity
● Data format
● ...
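A quality level of acceptance can be made explicit as a number measured during prototyping. The sketch below assumes a hypothetical rule (a simplified Canadian postal code pattern) and a hypothetical threshold; both are illustrations, and the talk does not specify how its acceptance levels were defined.

```python
import re

# Simplified postal code pattern, used here only as an example quality rule.
POSTAL_CODE = re.compile(r"[A-Z]\d[A-Z] ?\d[A-Z]\d")


def is_clean(row):
    """Example rule: the postal_code field must match the expected format."""
    return POSTAL_CODE.fullmatch(row.get("postal_code", "")) is not None


def dirty_ratio(rows, rule):
    """Share of rows (0.0 to 1.0) that fail the rule."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(1 for row in rows if not rule(row)) / len(rows)


def decide(rows, rule, max_dirty_ratio):
    """Accept the source if it is clean enough, otherwise discard it early."""
    ratio = dirty_ratio(rows, rule)
    return ("accept" if ratio <= max_dirty_ratio else "discard"), ratio


# Hypothetical sample rows, invented for illustration.
sample = [
    {"postal_code": "M5V 2T6"},
    {"postal_code": "L5B3C1"},
    {"postal_code": "unknown"},
    {"postal_code": ""},
]
print(decide(sample, is_clean, max_dirty_ratio=0.2))
```

Measuring this on a sample before building the full job is what lets a source be discarded early instead of after weeks of cleaning.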
24. Building an Agile Data Transformation Stack
Example: Address parsing
Example:
91 King Street East
305 – 1055, 20 TH ST SW
● Option A:
– Address Line 1
– Address Line 2
● Option B:
– Street Number
– Street Name
– Unit / PO Box
– Unit / PO Box Number
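Option B costs more to build than Option A, which the following sketch makes concrete. It is a minimal regular expression that handles only the two sample addresses above, under the assumption that in "305 – 1055, 20 TH ST SW" the leading 305 is a unit number and 1055 the street number. The pattern, the field names and `parse_address` are hypothetical; real address data needs many more rules.

```python
import re

ADDRESS = re.compile(
    r"^(?:(?P<unit>\d+)\s*[–-]\s*)?"  # optional unit number followed by a dash
    r"(?P<number>\d+),?\s+"           # street number, optional trailing comma
    r"(?P<street>.+)$"                # everything else is the street name
)


def parse_address(line):
    """Split one address line into unit, street number and street name."""
    match = ADDRESS.match(line.strip())
    if match is None:
        return None  # leave unparsed lines for manual or semi-automated cleaning
    return {
        "unit_number": match.group("unit"),
        "street_number": match.group("number"),
        "street_name": match.group("street").strip(),
    }


for line in ["91 King Street East", "305 – 1055, 20 TH ST SW"]:
    print(parse_address(line))
```

Lines the pattern cannot parse return `None` rather than a guess, so they can be counted and weighed against the chosen quality level.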
25. Building an Agile Data Transformation Stack
Know when to stop
● Plan your process keeping in mind the effort to
– Build
– Operate
– Maintain
● Balance fully automated vs semi-automated process
– Manual Cleaning - Crowdflower API
– OpenRefine Redo / Apply function
– Talend job
26. Building an Agile Data Transformation Stack
Undo / redo in OpenRefine
● History to undo previous steps
● Extract and re-apply transformation steps on a different project
● JSON code to copy / paste in a different project
27. Building an Agile Data Transformation Stack
Know when to stop
Build your job in Crowdflower
28. Building an Agile Data Transformation Stack
Cleaning Typos
● How do you spell:
– Mississagua
– mississauga
– Mississauga
– Mississuaga
– Misssisauga
● Algorithms
– Levenshtein
– Fingerprint
– n-gram
– Metaphone
– PPM
● Process followed
– Test and explore various algorithms in OpenRefine
– Automate in Talend with tFuzzyMatch
– Add human validation over a certain threshold
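The process above (fuzzy matching plus human validation over a threshold) can be sketched with a plain Levenshtein distance. This is an illustration of the technique, not the tFuzzyMatch component or the presenter's job: the function names, the reference list and both thresholds are assumptions.

```python
def levenshtein(a, b):
    """Edit distance: insertions, deletions and substitutions each cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def route(value, reference_values, review_above=1, reject_above=3):
    """Match a value to the closest reference and decide who validates it."""
    normalized = value.strip().lower()
    best = min(reference_values, key=lambda ref: levenshtein(normalized, ref.lower()))
    distance = levenshtein(normalized, best.lower())
    if distance <= review_above:
        return best, distance, "auto"
    if distance <= reject_above:
        return best, distance, "human review"
    return None, distance, "no match"


# Hypothetical reference list; the spellings to clean come from the slide.
REFERENCE = ["Mississauga", "Toronto"]
for spelling in ["Mississagua", "mississauga", "Mississuaga", "Misssisauga"]:
    print(spelling, "->", route(spelling, REFERENCE))
```

With these thresholds, exact matches after case folding are accepted automatically, while the transposed spellings (distance 2) go to a human, for example through a crowd-sourcing job.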
29. Building an Agile Data Transformation Stack
Cleaning Typos
1. OpenRefine cluster interface to test different algorithms
2. tFuzzyMatch in Talend to automate the transformation
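To give a feel for what testing different algorithms means, here is a simplified take on the fingerprint (key collision) idea: values that reduce to the same normalized key are grouped as candidates for merging. It is a sketch only, and OpenRefine's own implementation does more normalization. The sample values with ", ON" are invented for illustration.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Normalize a value: lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values that share a fingerprint; keep groups of two or more."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


values = [
    "Mississauga",
    "mississauga",
    "Mississauga, ON",
    "ON Mississauga",
    "Mississagua",
]
for group in cluster(values):
    print(group)
```

Key collision catches differences in case, punctuation and word order but not character-level typos such as "Mississagua", which is why distance-based methods like Levenshtein are worth testing alongside it.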
30. Building an Agile Data Transformation Stack
Conclusion
● Think Agile!
● Iterate as often as you can
– Start small and build on it
– Confirm your assumptions
– Focus on value creation
● Build a data friendly environment
– Choose your tools carefully
– Leave room for learning and growing
31. Building an Agile Data Transformation Stack
Contact
Ask me questions!
Martin Magdinier
● Linkedin: www.linkedin.com/in/magdinier/en
● Twitter: @magdmartin
● Email
– martin.magdinier@gmail.com
– mmagdinier@alleyneinc.net