With Excel or custom tooling (Python, R, etc.) there's flexibility to build data processing and preparation pipelines. Getting these to production level is often a different story, as traditional or formal IT organisations are not well equipped for this kind of development.
In this talk, I'll show how we have combined SQL and NoSQL storage engines to create flexible, production-ready data pipelines that handle unstructured data flows efficiently.
2. @JOTB19
Crossing the bridge
How we link end-user computing and
formal tech for data-savvy teams
Mark de Brauw
https://xkcd.com/2116/
3. Since I was a kid I’ve been fascinated by
computers and how people use them…
4. ... I thought financial services were the
pinnacle of tech and innovation...
5. ... In reality it is mostly about people and
processes dealing with messy data
6. My definition of “small data”: ad-hoc, manually exchanged, non-streaming, low volume,
high value data. Data quality levels may vary wildly.
Effects of bad data quality:
• Wasted time
• Low morale
• Mistakes
• Misinterpretation
• Needing overqualified people
• Key man risk / local heroes
• Low levels of automation
What you want:
• Short preparation time
• A controlled and compliant data
management process
• Data clarity
Alongside Big Data, “Small Data” is a big
problem and needs attention
7. Research shows that companies spend over
60% of their time on data preparation
Source: Crowdflower Data Science Report 2016 (in 2017 similar numbers)
My definition of Data preparation:
• Collection
• Validation
• Enrichment
• Reconciliation
• Reformatting / reshaping
• Delivery
8. Data preparation is a prerequisite for any
kind of data process
Data entry:
• Repetitive
• Similar data and process
• Transactional
• Low skilled
• Low volumes, high value
• Generic knowledge required

Data analysis:
• One-off
• New problems
• Project based
• Highly skilled
• High volumes
• Domain knowledge required
9. Low data quality leads to risk and long
preparation time
OPPORTUNITIES (high data quality):
• Scale
• Automation
• Outsourcing
• Value from data

RISKS (low data quality):
Entry:
• Wasted time
• Overqualified, expensive staff
• Key man risk / local heroes
• Mistakes
• Low levels of automation
Analysis:
• Wasted time
• Demoralized staff
• Wrong hypothesis
• Misinterpretation
10. Excel is the primary tool used to build
solutions.
And this works really well… until:
• The intern leaves.
• The lady in charge of all the macros
wins the lottery.
• IT upgrades MS Office.
• Or…
Part of the EuSpRIG site is dedicated to
this: http://www.eusprig.org/horror-stories.htm
Excel is often used to solve this problem; this
can work in the short run…
11. … but breaks down in the long run, given
contradictory requirements
• Data clarity
• Short preparation time
• Control & compliance
12. IT support in enterprises is not aligned with
“Small Data” problems…
Source: Data Quadrant model by Ronald Damhof
(Axes: push (supply/source driven) vs. pull (demand/product driven); systematic vs. opportunistic)

I. Facts (under architecture)
• Structured data
• Cookie-cutter systems
• One size fits all
• Long planning horizon

II. Context

III. Shadow IT, ad-hoc, one-off
• Messy data
• Excel
• Macros

IV. R&D, innovation, prototyping
• Messy data
• Prototyping
• Trial and error
• Need to fix this now
13. … due to a different need for speed.
(Data Quadrant diagram repeated: quadrants I–IV on the push (supply/source driven) and pull (demand/product driven) axes.)
15. Data preparation consists of several
distinct steps to make data ready for use:
Input → Collect → Validate → Transform → Deliver → Ready for use
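Below is a minimal Python sketch of that flow, purely to make the steps concrete; the function names, the CSV input, and the dict-based record format are assumptions for illustration, not the implementation discussed in this talk.

import csv

def collect(container_path):
    # Read raw records out of a data container (a CSV is assumed here; PDFs,
    # Excel forms and emails would each need their own collector).
    with open(container_path, newline="") as f:
        return list(csv.DictReader(f))

def validate(records):
    # Flag incomplete records so they can be repaired instead of silently dropped.
    for record in records:
        record["_issues"] = [k for k, v in record.items() if v in ("", None)]
    return records

def transform(records):
    # Reshape records into the target structure (here: just normalise the keys).
    return [{k.strip().lower(): v for k, v in r.items()} for r in records]

def deliver(records, target):
    # Hand the prepared records to a downstream consumer (DWH, report, app).
    target.extend(records)

warehouse = []
deliver(transform(validate(collect("positions.csv"))), warehouse)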
16. The variety of data containers and quality
levels makes this process complex:
• Stacks of PDFs
• Emails
• Forms
• CSV
• Excel
Input → Collect → Validate → Transform → Deliver → Ready for use
17. Most data is provided in anything but clear-cut,
machine-readable formats
• Human readable: PDF, scanned documents/fax, Word, web
• Machine readable: CSV, Excel, JSON, XML, etc.
• Shape ranges from flat/tabular to hierarchical/structural
18. Forms in Excel are machine readable, but their
structure complicates automated processing
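As a concrete illustration, here is a hypothetical sketch of pulling fields out of such an Excel form, where labels and values are scattered over the sheet rather than laid out as a table. The cell addresses, field names, and the choice of openpyxl are assumptions for this example only.

from openpyxl import load_workbook

def read_form(path):
    # A form puts each value in a known cell next to its label instead of in a
    # tidy header/row layout, so we address cells directly.
    ws = load_workbook(path, data_only=True).active
    return {
        "counterparty": ws["C4"].value,   # label sits in B4, value in C4
        "trade_date": ws["C6"].value,
        "notional": ws["C8"].value,
    }

print(read_form("trade_form.xlsx"))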
23. Shorten time to data clarity and increase
control to get to an optimal data process
Input:
• Variable
• Contains errors
• Incomplete
• Inaccurate
• Delayed

Shorten preparation time & increase control:
Input → Collect → Validate → Transform → Deliver → Ready for use

Ready for use (data clarity):
• Consistent
• Valid
• Complete
• Accurate
• Timely
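A small, hypothetical sketch of validation rules that map onto a few of the target qualities above (complete, valid, timely); the field names and the five-day staleness threshold are invented for illustration.

from datetime import date, timedelta

REQUIRED = ["isin", "quantity", "price", "report_date"]

def check(record, as_of=None):
    # Returns a list of issues; an empty list means the record passed these checks.
    as_of = as_of or date.today()
    issues = [f"missing {field}" for field in REQUIRED if record.get(field) in (None, "")]  # complete
    if record.get("quantity") is not None and record["quantity"] < 0:
        issues.append("negative quantity")                                                  # valid
    report_date = record.get("report_date")
    if report_date and report_date < as_of - timedelta(days=5):
        issues.append("stale report_date")                                                  # timely
    return issues

print(check({"isin": "XS0000000000", "quantity": -5, "price": 99.7, "report_date": date(2019, 1, 2)}))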
24. Supporting contradictory requirements
• Move data out of containers as quickly as possible into a manageable
environment.
But…
• Provide easy access, make data shareable and allow review (4-eyes).
• Ability to modify data, shape it, and ‘work’ with it.
• Track changes, create an audit trail and support data lineage.
• Link data to business context.
25. An architecture in support of shorter
preparation processes and increased control
Pipeline: Collect → Validate → Transform → Deliver
• Input containers: PDFs and scans, Excel, CSV
• Data abstraction layer:
  - Documents: storage, audit, lineage
  - Metadata: container metadata, workflow, process
• UI: task management, data repair, issue resolution, configuration
• Delivery targets: DWH, RDBMS, apps
26. Data Abstraction Layer: capturing raw data from
containers needs to be straightforward
• Machine-readable containers are serialized; scans and PDFs go through OCR first.
• The serialized content is imported into a document store.
• Container and process metadata are recorded in a metadata store (SQL).
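A hypothetical sketch of that layer, assuming MongoDB as the document store and SQLite as the SQL metadata store; the talk does not prescribe specific engines, and the schema and field names here are illustrative.

import csv
import hashlib
import sqlite3
from datetime import datetime, timezone

from pymongo import MongoClient

documents = MongoClient()["smalldata"]["documents"]   # NoSQL: serialized raw data
metadata = sqlite3.connect("metadata.db")              # SQL: container metadata, workflow state
metadata.execute(
    "CREATE TABLE IF NOT EXISTS containers "
    "(sha256 TEXT PRIMARY KEY, filename TEXT, received_at TEXT, status TEXT)"
)

def ingest(path):
    # Fingerprint the container so it can be identified and deduplicated later.
    digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
    # Serialize the container content (a CSV is assumed; scans would go through OCR first).
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    documents.insert_one({"container": digest, "rows": rows})
    metadata.execute(
        "INSERT OR REPLACE INTO containers VALUES (?, ?, ?, ?)",
        (digest, path, datetime.now(timezone.utc).isoformat(), "imported"),
    )
    metadata.commit()

ingest("counterparty_limits.csv")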
28. A data audit trail is mandatory to support compliant
data management processes
• An audit trail provides insight into what changes were made to the data, by whom,
and why.
• Track and store both user changes and automated operations performed on the data.
29. Add a ‘changes’ attribute to keep track of all
changes to a data item
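For illustration, a minimal sketch of what such a ‘changes’ attribute could look like; the document shape and helper function are assumptions, and only the idea of an appended change log per data item comes from the slide.

from datetime import datetime, timezone

def set_value(item, field, new_value, user, reason):
    # Record old value, new value, user, reason and timestamp before applying the change.
    item.setdefault("changes", []).append({
        "field": field,
        "old": item.get(field),
        "new": new_value,
        "user": user,
        "reason": reason,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    item[field] = new_value
    return item

item = {"isin": "XS0000000000", "price": 101.2}
set_value(item, "price", 101.4, user="mdb", reason="corrected broker quote")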
30. Data lineage is key to a controlled data
management process
• Data lineage gives visibility and makes it possible to trace errors back to the
root cause in any data process.
• Track many-to-one lineage at the level of data sets as well as individual data items.
Example: Data Set A and Data Set B feed Data Set 1; data item 1 is derived from data
items A and B, so its recorded source is: Data Set A, Data Set B.
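A sketch of how that many-to-one lineage could be recorded on the derived item, mirroring the Data Set A/B example above; the structure is illustrative, not the actual data model from the talk.

item_a = {"id": "A", "dataset": "Data Set A", "value": 40}
item_b = {"id": "B", "dataset": "Data Set B", "value": 2}

# The derived item carries its own lineage: which data sets and which individual
# items it was produced from, so errors can be traced back to the root cause.
item_1 = {
    "id": "1",
    "dataset": "Data Set 1",
    "value": item_a["value"] * item_b["value"],
    "lineage": {
        "source_datasets": ["Data Set A", "Data Set B"],
        "source_items": [item_a["id"], item_b["id"]],
    },
}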
32. Solving “Small Data” problems can drastically reduce
data preparation time in financial services
• A non-traditional approach provides an edge when building data solutions for
business teams.
• Data clarity is delivered in a short time span, and data quality is now part of a
normal control & compliance cycle.
• Positive effects for the launching customers:
  • Reporting cycle time (data clarity) reduced by 80%.
  • Data preparation time (FTE) reduced by 85%.
  • Large improvements in control and data quality.