Data Transformation made easy
Building a successful agile data
transformation stack
Martin Magdinier
March 2014

Building a successful agile data transformation stack


With today's abundance of data (big or small), an organization's ability to capture, understand and process new content is key to its success. Martin Magdinier has developed a custom data transformation stack to integrate over a hundred eclectic data feeds into a single repository. His process goes through three stages:
- Data discovery and exploration,
- Rapid data transformation prototyping and
- Automation of data cleaning and transformation process.

This presentation reviews the challenges specific to each step of the integration process and describes the tools used (OpenRefine, Talend, Crowdflower) and the processes developed to address them, while keeping the agility and flexibility of the overall stack in mind.

Published in: Technology

Building a successful agile data transformation stack

  1. Data Transformation made easy: Building a successful agile data transformation stack. Martin Magdinier, March 2014
  2. Building an Agile Data Transformation Stack, Martin Magdinier: Agile Data Transformation Stack is the Key for Success
  3. If Data is the new oil, where are the gas stations!?
     ● Data is not (yet?) a standardized good:
       – Environment with evolving technology and formats
       – Unique needs:
         ● Industry
         ● Department
         ● Business case
  4. The Data Transformation Process
     Your data transformation stack should help you to:
       – Explore and search new data
       – Identify and extract relevant data
       – Refine / turn data into usable information
       – Store & distribute to business users
  5. The Agile Data Transformation Stack
     ● Is a combination of complementary tools, technologies and processes
     ● Supporting rapid iteration of ideas, processes and products
     ● Focused on value creation for the customer (internal or external)
  6. The Data Transformation Stack (diagram labels): Platform; Data Processing Solutions; Storage. Free; Open Source; Suit your needs. All software is cross-platform.
  7. Agile Data Transformation Iteration (cycle diagram labels): Data Discovery & Profiling (mine existing data, add new data); Data Transformation Process & Code (prototype (MVP), semi-automated, automation); Track / Measure; Collect feedback; Learn from your experience; Progress in small incremental steps; Data Consumption (create value, generate new need)
  8. Data Discovery & Profiling (cycle diagram labels): Data Discovery & Profiling (mine existing data, add new data); Data Transformation Process & Code (prototype (MVP), semi-automated, automation); Track / Measure; Collect feedback; Learn from your experience; Progress in small incremental steps; Data Consumption (create value, generate new need)
  9. Data Discovery
     ● Seek:
       – New data sources
       – New usage for existing data
     ● Validate:
       – Does the data match my quality criteria?
       – Can I create value out of it?
  10. Data Profiling
     ● Understand your data and make sense of it
       – Mine
       – Explore
       – Interact
       – Transform
     ● Combine with visualization and reporting tools
     ● Iterate and explore various vantage points
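A minimal profiling sketch in Python, to illustrate what "understand your data" can mean in practice before any tool is chosen. The function name `profile`, the sample columns and the choice of statistics (filled, empty, distinct, most frequent values) are assumptions made for this illustration; they are not taken from the slides.

```python
import csv
import io
from collections import Counter


def profile(rows):
    """Return simple per-column statistics for an iterable of dict rows."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            stats = columns.setdefault(
                name, {"filled": 0, "empty": 0, "values": Counter()}
            )
            text = (value or "").strip()
            if text:
                stats["filled"] += 1
                stats["values"][text] += 1
            else:
                stats["empty"] += 1
    return {
        name: {
            "filled": stats["filled"],
            "empty": stats["empty"],
            "distinct": len(stats["values"]),
            "top": stats["values"].most_common(3),
        }
        for name, stats in columns.items()
    }


# Hypothetical sample feed, inlined so the sketch runs on its own.
sample = io.StringIO("city,phone\nMississauga,\nmississauga,555\nMississauga,555\n")
report = profile(csv.DictReader(sample))
for column, stats in report.items():
    print(column, stats)
```

Even this small report surfaces two typical findings: a column with missing values (`phone`) and a column whose distinct count is inflated by inconsistent casing (`city`).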
  11. Data Transformation (cycle diagram labels): Data Discovery & Profiling (mine existing data, add new data, refine requirements); Data Transformation Process & Code (prototype (MVP), semi-automated, automation); Track / Measure; Collect feedback; Learn from your experience; Progress in small incremental steps; Data Consumption (create value, generate new need)
  12. Role of a Working Prototype
     ● Minimize project cost and development time
     ● Focus on core functions of the transformation process (packaging will come later)
     ● Define your transformation strategy in a sandbox mode
       – Validate your assumptions
       – Identify roadblocks on the path to automation
  13. Iterate - Iterate - Iterate
     ● Improve and grow by incremental steps
     ● Start feeding your business with data
       – Validate whether there is value in this data
       – Collect feedback from the users
     ● Iterate as much as necessary
  14. Discovery, Profiling & Prototyping
     ● Designed for technical and business users
     ● Supports a variety of input formats
     ● Allows easy and safe interaction with the data:
       – Somewhere between Excel
         ● Point-and-click, user-friendly interface
         ● Preview of changes
         ● Undo / Redo functions
       – and SQL
         ● Query-oriented language
         ● Handles large amounts of data
  15. OpenRefine Interface (screenshot callouts): Facet for fast filtering; Expression builder; Instant preview of the transformation
  16. Prototyping & Automation
     ● Extract – Transform – Load solution
     ● Process focus with
       – Drag-and-drop component graphical interface
       – Java based
     ● Compile your job to run it on your server
       – Java (Talend Open Studio)
       – MapReduce (Talend for Big Data)
     ● Connect to anything
     ● Open Source: easy to add / customize your own components / libraries
  17. Talend Open Studio Interface (screenshot callouts): Drag, drop, connect and configure components; Process-oriented interface; List of components available
  18. Semi Automated Cleaning
     ● Intelligent Meta Crowd-sourcing Platform
     ● Build your job for data:
       – clean up
       – analysis
       – categorization
       – collection ...
     ● Ensure quality output
       – Check consistency of results
       – Select best worker
     ● Web Interface to
       – Build Prototype
       – Test job
     ● API for automation
       – OpenRefine extension
       – Talend Internet component
  19. Lessons Learned (cycle diagram labels): Data Discovery & Profiling (mine existing data, add new data, refine requirements); Data Transformation Process & Code (prototype (MVP), semi-automated, automation); Track / Measure; Collect feedback; Learn from your experience; Progress in small incremental steps; Data Consumption (create value, generate new need)
  20. Don't repeat yourself
     ● 1 process = 1 independent component / job
     ● Reuse your existing components
     ● Maintain your code in one place
     ● Add a few new items at each iteration
  21. Name Splitting
     ● Split FullName into FirstName and LastName
       – John Doe / John Van de Doe / John Della Doe
     1. Define logic and exception list in OpenRefine
     2. Translate the logic into a Talend component (tJavaRow)
     3. Move the Talend component to a routine
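The kind of logic and exception list mentioned on this slide can be sketched in a few lines. The slides do not show the actual rules, so everything here is an assumption for illustration: the function name `split_full_name`, the `PARTICLES` exception list, and the rule that the last name starts at the first surname particle or, failing that, at the last token. It is written in Python for brevity; a tJavaRow component would hold equivalent Java.

```python
# Hypothetical exception list: lowercase particles that start a last name.
PARTICLES = {"van", "von", "de", "der", "den", "del", "della", "di", "da", "la", "le"}


def split_full_name(full_name):
    """Split a full name into (first_name, last_name)."""
    tokens = full_name.split()
    if not tokens:
        return "", ""
    if len(tokens) == 1:
        return tokens[0], ""
    # Default rule: the last token is the last name.
    start = len(tokens) - 1
    # Exception rule: a particle after the first token starts the last name.
    for index in range(1, len(tokens) - 1):
        if tokens[index].lower() in PARTICLES:
            start = index
            break
    return " ".join(tokens[:start]), " ".join(tokens[start:])


for name in ["John Doe", "John Van de Doe", "John Della Doe"]:
    print(name, "->", split_full_name(name))
```

Keeping the exception list as data rather than as branches in the code makes it cheap to extend at each iteration, which fits the "maintain your code in one place" advice.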
  22. Garbage in - Garbage out
     ● Catch errors early
       – The sooner, the easier
       – Do not build the next step on erroneous data
     ● Independent processes
       – Make errors easier to track and debug
       – When the bug is fixed, every process / job benefits from it
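One way to catch errors early is to put a small, independent validation step in front of each transformation. The sketch below is an assumption about how that could look, not the author's implementation: the function `validate`, the `rules` mapping and the sample fields are all hypothetical.

```python
def validate(rows, rules):
    """Split rows into (clean, rejects) using per-field check functions."""
    clean, rejects = [], []
    for row in rows:
        failed = [field for field, check in rules.items() if not check(row.get(field, ""))]
        if failed:
            # Keep the failing field names so rejects are easy to debug.
            rejects.append((row, failed))
        else:
            clean.append(row)
    return clean, rejects


# Hypothetical rules: a non-empty city and a six-character postal code.
rules = {
    "city": lambda value: bool(value.strip()),
    "postal_code": lambda value: len(value.replace(" ", "")) == 6,
}
rows = [
    {"city": "Mississauga", "postal_code": "L5B 3C1"},
    {"city": "", "postal_code": "L5B 3C1"},
    {"city": "Toronto", "postal_code": "M5C"},
]
clean, rejects = validate(rows, rules)
print(len(clean), "clean,", len(rejects), "rejected")
```

Only `clean` moves on to the next step; `rejects` carries the reason for each failure, so a fix made in one rule benefits every job that reuses it.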
  23. Know where the value is
     ● A poorly planned data cleaning process is a never-ending job (and a depressing experience)
     ● Prototyping helps to
       – Anticipate how dirty the data is
         ● Plan appropriate strategy
         ● Discard the source early on if too dirty
       – Set quality level of acceptance
         ● Level of granularity
         ● Data format
         ● ...
  24. Example: Address parsing
     Example:
       91 King Street East
       305 – 1055, 20 TH ST SW
     ● Option A:
       – Address Line 1
       – Address Line 2
     ● Option B:
       – Street Number
       – Street Name
       – Unit / PO Box
       – Unit / PO Box Number
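To show why Option B costs more than Option A, here is a rough sketch that parses the two sample addresses into Option B style fields. It rests on assumptions not stated in the slides: that a leading number followed by a dash is a unit number (so `305` is the unit and `1055` the street number), and that a single regular expression is enough. The names `ADDRESS_PATTERN` and `parse_address` are hypothetical, and real feeds would need many more cases.

```python
import re

# Optional "unit - " prefix (hyphen or en dash), then street number, then street name.
ADDRESS_PATTERN = re.compile(
    r"(?:(?P<unit>\d+)\s*[\u2013-]\s*)?(?P<number>\d+),?\s+(?P<street>.+)"
)


def parse_address(line):
    """Return Option B style fields, or None when the line does not fit the pattern."""
    match = ADDRESS_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    return {
        "unit_number": match.group("unit"),
        "street_number": match.group("number"),
        "street_name": match.group("street"),
    }


for line in ["91 King Street East", "305 \u2013 1055, 20 TH ST SW"]:
    print(parse_address(line))
```

Lines that return `None` are the ones to inspect during prototyping: their share of the feed is a quick indicator of whether the finer granularity of Option B is worth the effort.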
  25. Know when to stop
     ● Plan your process keeping in mind the effort to
       – Build
       – Operate
       – Maintain
     ● Balance fully automated vs semi-automated process
       – Manual Cleaning - Crowdflower API
       – OpenRefine Redo / Apply function
       – Talend job
  26. Undo / redo in OpenRefine (screenshot callouts): History to undo previous steps; Extract and re-apply transformation steps on a different project; JSON code to copy / paste in a different project
  27. Know when to stop: Build your job in Crowdflower
  28. Cleaning Typos
     ● How do you spell:
       – Mississagua
       – mississauga
       – Mississauga
       – Mississuaga
       – Misssisauga
     ● Algorithms
       – Levenshtein
       – Fingerprint
       – n-gram
       – Metaphone
       – PPM
     ● Process followed
       – Test and explore various algorithms in OpenRefine
       – Automate in Talend with tFuzzyMatch
       – Add human validation over a certain threshold
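Of the algorithms listed on this slide, fingerprint keying is the simplest to sketch. The version below follows the general idea (trim, lowercase, strip accents and punctuation, then sort and de-duplicate tokens) and is an approximation written for this illustration, not OpenRefine's exact implementation. The function names `fingerprint` and `cluster` are assumptions.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key so that trivially different spellings collide."""
    text = value.strip().lower()
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Sort and de-duplicate tokens so word order does not matter.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values sharing a fingerprint; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


spellings = ["Mississagua", "mississauga", "Mississauga", "Mississuaga", "Misssisauga"]
print(cluster(spellings))
```

On the five spellings from the slide, fingerprinting only merges the two that differ by case. The transposed and doubled letters need a distance-based method such as Levenshtein, which is why it pays to test several algorithms before automating.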
  29. Cleaning Typos
     1. OpenRefine cluster interface to test different algorithms
     2. tFuzzyMatch in Talend to automate transformation
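The idea of automating fuzzy matching while keeping a human in the loop past a threshold can be sketched with a plain Levenshtein distance. This is a standalone illustration and does not reproduce tFuzzyMatch or its configuration; the function `route`, its labels and the default threshold of 2 are assumptions.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def route(value, reference, max_auto_distance=2):
    """Decide how to handle a value compared with a reference spelling."""
    distance = levenshtein(value.lower(), reference.lower())
    if distance == 0:
        return "match"
    if distance <= max_auto_distance:
        return "auto-correct"
    # Too far from the reference: leave the decision to a person.
    return "human-review"


for candidate in ["Mississagua", "Misssisauga", "Missouri"]:
    print(candidate, "->", route(candidate, "Mississauga"))
```

The threshold is the tuning knob: raising it automates more corrections at the risk of wrong merges, lowering it sends more records to manual or crowd-sourced validation.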
  30. Conclusion
     ● Think Agile!
     ● Iterate as often as you can
       – Start small and build on it
       – Confirm your assumptions
       – Focus on value creation
     ● Build a data-friendly environment
       – Choose your tools carefully
       – Leave room for learning and growing
  31. Contact
     Ask me questions!
     Martin Magdinier
     ● LinkedIn: www.linkedin.com/in/magdinier/en
     ● Twitter: @magdmartin
     ● Email
       – martin.magdinier@gmail.com
       – mmagdinier@alleyneinc.net
