Building a successful agile data transformation stack
 

With today's abundance of data (big or small), an organization's ability to capture, understand and process new content is key to its success. Martin Magdinier has developed a custom data transformation stack to integrate over a hundred eclectic data feeds into a single repository. His process goes through three stages:
- Data discovery and exploration,
- Rapid data transformation prototyping, and
- Automation of the data cleaning and transformation process.

This presentation reviews the challenges specific to each step of the integration process, and describes the tools used (OpenRefine, Talend, Crowdflower) and the processes developed to address them while keeping the agility and flexibility of the overall stack in mind.

    Presentation Transcript

    • Data Transformation made easy: Building a successful agile data transformation stack. Martin Magdinier, March 2014
    • Building an Agile Data Transformation Stack (Martin Magdinier): Agile Data Transformation Stack is the Key for Success
    • If data is the new oil, where are the gas stations!?
      ● Data is not (yet?) a standardized good:
        – Environment with evolving technology and formats
        – Unique needs: industry, department, business case
    • The Data Transformation Process
      ● Your data transformation stack should help you to:
        – Explore and search new data
        – Identify and extract relevant data
        – Refine / turn data into usable information
        – Store and distribute to business users
    • The Agile Data Transformation Stack
      ● Is a combination of complementary tools, technologies and processes
      ● Supporting rapid iteration of ideas, processes and products
      ● Focused on value creation for the customer (internal or external)
    • The Data Transformation Stack (diagram). Labels: Platform; Data Processing Solutions; Storage; Free Open Source; Suit your needs. All software is cross-platform.
    • Agile Data Transformation Iteration (cycle diagram)
      ● Data Discovery & Profiling
        – Mine existing data
        – Add new data
      ● Data Transformation Process & Code
        – Prototype (MVP)
        – Semi automated
        – Automation
      ● Track / Measure
        – Collect feedback
        – Learn from your experience
        – Progress in small incremental steps
      ● Data Consumption
        – Create value
        – Generate new need
    • Data Discovery & Profiling (section title; the iteration diagram is repeated)
    • Data Discovery
      ● Seek:
        – New data sources
        – New usage for existing data
      ● Validate:
        – Does the data match my quality criteria?
        – Can I create value out of it?
    • Data Profiling
      ● Understand your data and make sense of it
        – Mine
        – Explore
        – Interact
        – Transform
      ● Combine with visualization and reporting tools
      ● Iterate and explore various vantage points
    • Data Transformation (section title; the iteration diagram is repeated, with "Refine requirements" added under Data Discovery & Profiling)
    • Role of a Working Prototype
      ● Minimize project cost and development time
      ● Focus on core functions of the transformation process (packaging will come later)
      ● Define your transformation strategy in a sandbox mode
        – Validate your assumptions
        – Identify roadblocks on the path to automation
    • Iterate - Iterate - Iterate
      ● Improve and grow by incremental steps
      ● Start feeding your business with data
        – Validate whether there is value in this data
        – Collect feedback from the users
      ● Iterate as much as necessary
    • Discovery, Profiling & Prototyping
      ● Designed for technical and business users
      ● Supports a variety of input formats
      ● Allows easy and safe interaction with the data, somewhere between:
        – Excel: point-and-click user-friendly interface, changes preview, undo / redo functions
        – and SQL: query-oriented language, handling large amounts of data
    • OpenRefine Interface (screenshot callouts)
      ● Facet for fast filtering
      ● Expression builder
      ● Instant preview of the transformation
    • Prototyping & Automation
      ● Extract, Transform, Load (ETL) solution
      ● Process focused, with
        – Drag and drop component graphical interface
        – Java based
      ● Compile your job to run it on your server
        – Java (Talend Open Studio)
        – MapReduce (Talend for Big Data)
      ● Connect to anything
      ● Open Source: easy to add or customize your own components / libraries
    • Talend Open Studio Interface (screenshot callouts)
      ● Drag, drop, connect and configure components
      ● Process oriented interface
      ● List of components available
    • Semi Automated Cleaning
      ● Intelligent meta crowd-sourcing platform
      ● Build your job for data:
        – Clean up
        – Analysis
        – Categorization
        – Collection ...
      ● Ensure quality output
        – Check consistency of results
        – Select best worker
      ● Web interface to
        – Build prototype
        – Test job
      ● API for automation
        – OpenRefine extension
        – Talend Internet component
    • Lessons Learned (section title; the iteration diagram is repeated, including "Refine requirements")
    • Don't repeat yourself
      ● 1 process = 1 independent component / job
      ● Reuse your existing components
      ● Maintain your code in one place
      ● Add a few new items at each iteration
    • Name Splitting
      ● Split FullName into FirstName and LastName
        – John Doe / John Van de Doe / John Della Doe
      ● Steps:
        – 1. Define the logic and exception list in OpenRefine
        – 2. Translate the logic into a Talend component (tJavaRow)
        – 3. Move the Talend component to a routine
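The splitting logic on this slide can be prototyped in a few lines before it is ported to a Talend tJavaRow component. The sketch below is only an illustration in Python: the function name `split_full_name` and the particle list are assumptions (the slide names only "Van de" and "Della" as exceptions), not the author's actual rules.

```python
from __future__ import annotations

# Hypothetical exception list: surname particles that belong to the last name.
# Only "van", "de" and "della" come from the slide's examples.
SURNAME_PARTICLES = {"van", "de", "della"}


def split_full_name(full_name: str) -> tuple[str, str]:
    """Split a full name into (first_name, last_name).

    The last token is the surname; any particles directly before it
    (for example "Van de") are attached to the surname as well.
    """
    tokens = full_name.split()
    if not tokens:
        return ("", "")
    if len(tokens) == 1:
        return (tokens[0], "")

    # Index where the last name starts; begin with the final token.
    start = len(tokens) - 1
    # Walk backwards while the preceding token is a known particle,
    # but always leave at least one token for the first name.
    while start > 1 and tokens[start - 1].lower() in SURNAME_PARTICLES:
        start -= 1

    return (" ".join(tokens[:start]), " ".join(tokens[start:]))


for name in ("John Doe", "John Van de Doe", "John Della Doe"):
    print(name, "->", split_full_name(name))
```

Keeping the exception list as data rather than code makes it easy to grow the list at each iteration without touching the logic.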
    • Garbage in - Garbage out
      ● Catch errors early
        – The sooner, the easier
        – Do not build the next step on erroneous data
      ● Independent process
        – Makes it easier to track and debug
        – When the bug is fixed, every process / job benefits from it
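One way to read "catch errors early" is as a small validation gate that runs before any later step sees the data. The sketch below is a minimal Python illustration of that idea, not the author's process: the field names `FullName` and `City` and both function names are hypothetical.

```python
# Hypothetical required fields; a real feed would define its own.
REQUIRED_FIELDS = ("FullName", "City")


def find_problems(record):
    """Return a list of problems for one record (an empty list means clean)."""
    return [
        f"{field} is missing"
        for field in REQUIRED_FIELDS
        if not str(record.get(field) or "").strip()
    ]


def split_valid_and_rejected(records):
    """Separate clean records from rejected ones, keeping the reasons."""
    valid, rejected = [], []
    for record in records:
        problems = find_problems(record)
        if problems:
            rejected.append((record, problems))
        else:
            valid.append(record)
    return valid, rejected


records = [
    {"FullName": "John Doe", "City": "Mississauga"},
    {"FullName": " ", "City": "Toronto"},
    {"FullName": "Jane Doe"},
]
valid, rejected = split_valid_and_rejected(records)
print(len(valid), "valid;", len(rejected), "rejected")
for record, problems in rejected:
    print(record, problems)
```

Because the check lives in one independent function, a fix to it applies to every job that calls it.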
    • Know where the value is
      ● A poorly planned data cleaning process is a never-ending job (and a depressing experience)
      ● Prototyping helps to
        – Anticipate how dirty the data is: plan an appropriate strategy, or discard the source early on if it is too dirty
        – Set the quality level of acceptance: level of granularity, data format, ...
    • Example: Address parsing
      ● Example addresses: "91 King Street East" and "305 – 1055, 20 TH ST SW"
      ● Option A:
        – Address Line 1
        – Address Line 2
      ● Option B:
        – Street Number
        – Street Name
        – Unit / PO Box
        – Unit / PO Box Number
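The difference in granularity between the two options can be made concrete with a small parser for Option B. This is a hedged sketch in Python, not the author's implementation: the regular expression, the function name `parse_address`, and the reading of "305 – 1055" as "unit 305, street number 1055" are all assumptions. Under Option A the raw string would simply be stored as Address Line 1.

```python
import re

# Assumed layout: "[unit - ]street_number[,] street name".
ADDRESS_PATTERN = re.compile(
    r"""^\s*
        (?:(?P<unit>\w+)\s*[-–]\s*)?     # optional unit, then hyphen or en dash
        (?P<street_number>\d+)\s*,?\s+   # street number, optional comma
        (?P<street_name>.+?)\s*$         # everything else is the street name
    """,
    re.VERBOSE,
)


def parse_address(raw):
    """Return a dict with unit, street_number and street_name, or None."""
    match = ADDRESS_PATTERN.match(raw)
    return match.groupdict() if match else None


for raw in ("91 King Street East", "305 – 1055, 20 TH ST SW"):
    print(raw, "->", parse_address(raw))
```

Addresses that return `None` are candidates for manual or crowd-sourced cleaning rather than more parsing rules.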
    • Know when to stop
      ● Plan your process keeping in mind the effort to
        – Build
        – Operate
        – Maintain
      ● Balance fully automated vs semi-automated process
        – Manual cleaning: Crowdflower API
        – OpenRefine Redo / Apply function
        – Talend job
    • Undo / redo in OpenRefine (screenshot callouts)
      ● History to undo previous steps
      ● Extract and re-apply transformation steps on a different project
      ● JSON code to copy / paste in a different project
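The JSON that OpenRefine extracts from a project's history is a list of operations, so it can be reviewed or version-controlled before being applied to another project. The Python sketch below only reads such a list and summarizes it. The sample JSON is illustrative: the column names are hypothetical and the exact fields may differ between OpenRefine versions.

```python
import json

# Illustrative operation history, shaped like what OpenRefine's "Extract"
# dialog produces; column names and expressions are made up.
SAMPLE_HISTORY = """
[
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column City using expression value.trim()",
    "engineConfig": {"facets": [], "mode": "row-based"},
    "columnName": "City",
    "expression": "value.trim()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/column-rename",
    "description": "Rename column City to CityName",
    "oldColumnName": "City",
    "newColumnName": "CityName"
  }
]
"""


def summarize_operations(history_json):
    """Return (operation id, description) pairs for each recorded step."""
    operations = json.loads(history_json)
    return [(step.get("op", "?"), step.get("description", "")) for step in operations]


summary = summarize_operations(SAMPLE_HISTORY)
for op, description in summary:
    print(f"{op}: {description}")
```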
    • Know when to stop: build your job in Crowdflower (screenshot)
    • Cleaning Typos
      ● How do you spell:
        – Mississagua
        – mississauga
        – Mississauga
        – Mississuaga
        – Misssisauga
      ● Algorithms
        – Levenshtein
        – Fingerprint
        – n-gram
        – Metaphone
        – PPM
      ● Process followed
        – Test and explore various algorithms in OpenRefine
        – Automate in Talend with tFuzzyMatch
        – Add human validation over a certain threshold
    • Cleaning Typos (screenshots)
      ● 1. OpenRefine cluster interface to test different algorithms
      ● 2. tFuzzyMatch in Talend to automate the transformation
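Two of the algorithms listed above are short enough to sketch directly. The Python below is an illustration under stated assumptions, not OpenRefine's or Talend's code: `fingerprint` is a simplified version of key-collision fingerprinting, `levenshtein` is the standard edit distance, and the thresholds in `classify` (auto-accept up to 1 edit, human review up to 3) are hypothetical values chosen only for the example.

```python
import re
import unicodedata


def fingerprint(value):
    """Simplified fingerprint key: trim, lowercase, strip accents and
    punctuation, then sort and de-duplicate the tokens."""
    text = value.strip().lower()
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(sorted(set(text.split())))


def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def classify(value, reference, auto_accept_max=1, review_max=3):
    """Return "match", "review" (send to a human) or "no_match".

    Both thresholds are hypothetical and should be tuned on real data.
    """
    distance = levenshtein(value.strip().lower(), reference.strip().lower())
    if distance <= auto_accept_max:
        return "match"
    if distance <= review_max:
        return "review"
    return "no_match"


for candidate in ("Mississagua", "mississauga", "Mississuaga", "Misssisauga", "Toronto"):
    print(candidate, "->", classify(candidate, "Mississauga"))
```

Fingerprinting alone only merges the case variants here; the transposed-letter variants need an edit-distance method, which is why the slide suggests testing several algorithms before automating.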
    • Conclusion
      ● Think Agile!
      ● Iterate as often as you can
        – Start small and build on it
        – Confirm your assumptions
        – Focus on value creation
      ● Build a data friendly environment
        – Choose your tools carefully
        – Leave room for learning and growing
    • Contact: ask me questions! Martin Magdinier
      ● LinkedIn: www.linkedin.com/in/magdinier/en
      ● Twitter: @magdmartin
      ● Email
        – martin.magdinier@gmail.com
        – mmagdinier@alleyneinc.net