With Excel or custom tooling (Python, R, etc.) there's flexibility to build data processing and preparation pipelines. Getting these to production level is often a different story, as traditional or formal IT organisations are not well equipped for this kind of development.
In this talk, I'll show how we have combined SQL and NoSQL storage engines to create flexible, production-ready data pipelines that handle unstructured data flows efficiently.
2. @JOTB19
Crossing the bridge
How we link end-user computing and
formal tech for data-savvy teams
Mark de Brauw
https://xkcd.com/2116/
3. Since I was a kid I’ve been fascinated by
computers and how people use them…
4. ... I thought financial services were the
pinnacle of tech and innovation...
5. ... In reality it is mostly about people and
processes dealing with messy data
6. My definition of “small data”: ad-hoc, manually exchanged, non-streaming, low volume,
high value data. Data quality levels may vary wildly.
Effects of bad data quality:
• Wasted time
• Low morale
• Mistakes
• Misinterpretation
• Needing overqualified people
• Key man risk / local heroes
• Low levels of automation
What you want:
• Short preparation time
• A controlled and compliant data
management process
• Data clarity
Alongside Big Data, “Small Data” is a big
problem and needs attention
7. Research shows that companies spend over
60% of their time on data preparation
Source: Crowdflower Data Science Report 2016 (in 2017 similar numbers)
My definition of Data preparation:
• Collection
• Validation
• Enrichment
• Reconciliation
• Reformatting / reshaping
• Delivery
8. Data preparation is a prerequisite for any
kind of data process
Data entry:
• Repetitive
• Similar data and process
• Transactional
• Low skilled
• Low volumes, high value
• Generic knowledge required

Data analysis:
• One-off
• New problems
• Project based
• Highly skilled
• High volumes
• Domain knowledge required
9. Low data quality leads to risk and long
preparation time
OPPORTUNITIES (high data quality):
• Scale
• Automation
• Outsourcing
• Value from data

RISKS (low data quality):
Entry:
• Wasted time
• Overqualified, expensive staff
• Key man risk / local heroes
• Mistakes
• Low levels of automation
Analysis:
• Wasted time
• Demoralized staff
• Wrong hypothesis
• Misinterpretation
10. Excel is the primary tool used to build
solutions.
And this works really well… until:
• The intern leaves.
• The lady in charge of all the macros
wins the lottery.
• IT upgrades MS Office.
• Or…
Part of the EuSpRIG site is dedicated to
this: http://www.eusprig.org/horror-stories.htm
Excel is often used to solve this problem; this
can work in the short run…
11. … but breaks down in the long run, given
contradictory requirements
• Data clarity
• Short preparation time
• Control & compliance
12. IT support in enterprises is not aligned with
“Small Data” problems…
Source: Data Quadrant model by Ronald Damhof
(Axes: push (supply/source driven) vs. pull (demand/product driven); systematic vs. opportunistic)

I. Facts (under architecture)
• Structured data
• Cookie-cutter systems
• One size fits all
• Long planning horizon

II. Context

III. Shadow IT, ad-hoc, one-off
• Messy data
• Excel
• Macros

IV. R&D, innovation, prototyping
• Messy data
• Prototyping
• Trial and error
• Need to fix this now
13. … due to a different need for speed.
(Data Quadrant diagram repeated: quadrants I–IV on the push (supply/source driven) and pull (demand/product driven) axes.)
15. Data preparation consists of several
distinct steps to make data ready for use:
Input → Collect → Validate → Transform → Deliver → Ready for use
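Below is a minimal Python sketch of that flow, purely to make the steps concrete; the function names, the CSV input, and the dict-based record format are assumptions for illustration, not the implementation discussed in this talk.

import csv

def collect(container_path):
    # Read raw records out of a data container (a CSV is assumed here; PDFs,
    # Excel forms and emails would each need their own collector).
    with open(container_path, newline="") as f:
        return list(csv.DictReader(f))

def validate(records):
    # Flag incomplete records so they can be repaired instead of silently dropped.
    for record in records:
        record["_issues"] = [k for k, v in record.items() if v in ("", None)]
    return records

def transform(records):
    # Reshape records into the target structure (here: just normalise the keys).
    return [{k.strip().lower(): v for k, v in r.items()} for r in records]

def deliver(records, target):
    # Hand the prepared records to a downstream consumer (DWH, report, app).
    target.extend(records)

warehouse = []
deliver(transform(validate(collect("positions.csv"))), warehouse)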
16. The variety of data containers and quality
levels makes this process complex:
• Stacks of PDFs
• Emails
• Forms
• CSV
• Excel
Input → Collect → Validate → Transform → Deliver → Ready for use
17. Most data is provided in anything but clear-cut,
machine-readable formats
• Human readable: PDF, scanned documents/fax, Word, web
• Machine readable: CSV, Excel, JSON, XML, etc.
• Shape ranges from flat/tabular to hierarchical/structural
18. Forms in Excel are machine readable, but their
structure complicates automated processing
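As a concrete illustration, here is a hypothetical sketch of pulling fields out of such an Excel form, where labels and values are scattered over the sheet rather than laid out as a table. The cell addresses, field names, and the choice of openpyxl are assumptions for this example only.

from openpyxl import load_workbook

def read_form(path):
    # A form puts each value in a known cell next to its label instead of in a
    # tidy header/row layout, so we address cells directly.
    ws = load_workbook(path, data_only=True).active
    return {
        "counterparty": ws["C4"].value,   # label sits in B4, value in C4
        "trade_date": ws["C6"].value,
        "notional": ws["C8"].value,
    }

print(read_form("trade_form.xlsx"))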
23. Shorten time to data clarity and increase
control to get to an optimal data process
Input:
• Variable
• Contains errors
• Incomplete
• Inaccurate
• Delayed

Shorten preparation time & increase control:
Input → Collect → Validate → Transform → Deliver → Ready for use

Ready for use (data clarity):
• Consistent
• Valid
• Complete
• Accurate
• Timely
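A small, hypothetical sketch of validation rules that map onto a few of the target qualities above (complete, valid, timely); the field names and the five-day staleness threshold are invented for illustration.

from datetime import date, timedelta

REQUIRED = ["isin", "quantity", "price", "report_date"]

def check(record, as_of=None):
    # Returns a list of issues; an empty list means the record passed these checks.
    as_of = as_of or date.today()
    issues = [f"missing {field}" for field in REQUIRED if record.get(field) in (None, "")]  # complete
    if record.get("quantity") is not None and record["quantity"] < 0:
        issues.append("negative quantity")                                                  # valid
    report_date = record.get("report_date")
    if report_date and report_date < as_of - timedelta(days=5):
        issues.append("stale report_date")                                                  # timely
    return issues

print(check({"isin": "XS0000000000", "quantity": -5, "price": 99.7, "report_date": date(2019, 1, 2)}))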
24. Supporting contradictory requirements
• Move data out of containers as quickly as possible into a manageable
environment.
But…
• Provide easy access, make data shareable and allow review (4-eyes).
• Ability to modify data, shape it, and ‘work’ with it.
• Track changes, create an audit trail and support data lineage.
• Link data to business context.
25. An architecture in support of shorter
preparation processes and increased control
Pipeline: Collect → Validate → Transform → Deliver
• Input containers: PDFs and scans, Excel, CSV
• Data abstraction layer:
  - Documents: storage, audit, lineage
  - Metadata: container metadata, workflow, process
• UI: task management, data repair, issue resolution, configuration
• Delivery targets: DWH, RDBMS, apps
26. Data Abstraction Layer: capturing raw data from
containers needs to be straightforward
• Machine-readable containers are serialized; scans and PDFs go through OCR first.
• The serialized content is imported into a document store.
• Container and process metadata are recorded in a metadata store (SQL).
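A hypothetical sketch of that layer, assuming MongoDB as the document store and SQLite as the SQL metadata store; the talk does not prescribe specific engines, and the schema and field names here are illustrative.

import csv
import hashlib
import sqlite3
from datetime import datetime, timezone

from pymongo import MongoClient

documents = MongoClient()["smalldata"]["documents"]   # NoSQL: serialized raw data
metadata = sqlite3.connect("metadata.db")              # SQL: container metadata, workflow state
metadata.execute(
    "CREATE TABLE IF NOT EXISTS containers "
    "(sha256 TEXT PRIMARY KEY, filename TEXT, received_at TEXT, status TEXT)"
)

def ingest(path):
    # Fingerprint the container so it can be identified and deduplicated later.
    digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
    # Serialize the container content (a CSV is assumed; scans would go through OCR first).
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    documents.insert_one({"container": digest, "rows": rows})
    metadata.execute(
        "INSERT OR REPLACE INTO containers VALUES (?, ?, ?, ?)",
        (digest, path, datetime.now(timezone.utc).isoformat(), "imported"),
    )
    metadata.commit()

ingest("counterparty_limits.csv")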
28. A data audit trail is mandatory to support compliant
data management processes
• An audit trail provides insight into what changes were made to the data, by whom,
and why.
• Track and store both user changes and automated operations performed on the data.
29. Add a ‘changes’ attribute to keep track of all
changes to a data item
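For illustration, a minimal sketch of what such a ‘changes’ attribute could look like; the document shape and helper function are assumptions, and only the idea of an appended change log per data item comes from the slide.

from datetime import datetime, timezone

def set_value(item, field, new_value, user, reason):
    # Record old value, new value, user, reason and timestamp before applying the change.
    item.setdefault("changes", []).append({
        "field": field,
        "old": item.get(field),
        "new": new_value,
        "user": user,
        "reason": reason,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    item[field] = new_value
    return item

item = {"isin": "XS0000000000", "price": 101.2}
set_value(item, "price", 101.4, user="mdb", reason="corrected broker quote")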
30. Data lineage is key to a controlled data
management process
• Data lineage gives visibility and makes it possible to trace errors back to the
root cause in any data process.
• Track many-to-one lineage at the level of data sets as well as individual data items.
Example: Data Set A and Data Set B feed Data Set 1; data item 1 is derived from data
items A and B, so its recorded source is: Data Set A, Data Set B.
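A sketch of how that many-to-one lineage could be recorded on the derived item, mirroring the Data Set A/B example above; the structure is illustrative, not the actual data model from the talk.

item_a = {"id": "A", "dataset": "Data Set A", "value": 40}
item_b = {"id": "B", "dataset": "Data Set B", "value": 2}

# The derived item carries its own lineage: which data sets and which individual
# items it was produced from, so errors can be traced back to the root cause.
item_1 = {
    "id": "1",
    "dataset": "Data Set 1",
    "value": item_a["value"] * item_b["value"],
    "lineage": {
        "source_datasets": ["Data Set A", "Data Set B"],
        "source_items": [item_a["id"], item_b["id"]],
    },
}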
32. Solving “Small Data” problems can drastically reduce
data preparation time in financial services
• A non-traditional approach provides an edge when building data solutions for
business teams.
• Data clarity is delivered in a short time span, and data quality is now part of a
normal control & compliance cycle.
• Positive effects for the launching customers:
  • Reporting cycle time (data clarity) reduced by 80%.
  • Data preparation time (FTE) reduced by 85%.
  • Large improvements in control and data quality.