SlideShare a Scribd company logo
1 of 126
Download to read offline
CopyrightThird Nature,Inc.
Architecting a Data
Platform For
Enterprise Use
March 2019
Mark Madsen
Todd Walter
CopyrightThird Nature,Inc.
Why are we talking about architecture?
4 © 2017 Gartner, Inc.and/or its affiliates. All rights reserved.
Gartner statement in 2018:
only 15% are reported
to be successful
only
17%of Hadoop deployments
are in production in 2017
Survey Analysis: BI and Analytics
Spending Intentions, 2017
A McKinsey survey this
year asked executives if
their company had
achieved a positive ROI
with their big data projects:
7% answered “yes”
Gartner Finding:
in 2017
CopyrightThird Nature,Inc.
DW: Centralize, that solves all problems!
Creates bottlenecks
Causes scale problems
Availability?
CopyrightThird Nature,Inc.
The data lake solution: no central authority
wtf, it was fully
operational!
CopyrightThird Nature,Inc.
The data lake solution?
There’s a problem: as
the lake is envisioned,
it is still a centralized
data architecture, but
this time there is no
single global model.
Instead it’s files and
not modeled. It can be
operational while
under construction.
It’s still a death star.
CopyrightThird Nature,Inc.
Eventually we run into the same problems
Seriously, wtf?
It was agile
and operational
You’re building a centralized model without realizing it.
Maybe we can avoid recreating the same problems.
Copyright Third Nature, Inc.
The solution to our problems isn’t
technology, it’s architecture.
CopyrightThird Nature,Inc.
Architecture?
What does the architect do?
What is the architect’s
responsibility and role?
What is the core problem of
data architecture?
CopyrightThird Nature,Inc.
Use
CopyrightThird Nature,Inc.
Use
Over time
CopyrightThird Nature,Inc.
Use
Over time
By people
Copyright Third Nature, Inc.
Bricks are not buildings
We don’t think this is equivalent to this
Architecture is not technology. It’s not a product you can buy.
Copyright Third Nature, Inc.
Blueprints are not architectures
Copyright Third Nature, Inc.
Technical wiring diagrams
Product lists
Pretty pictures
Hand-waving abstract statements and rules
A thing you can purchase
…so what is it?
Architecture is not…
Copyright Third Nature, Inc.
What is this?
Architecture is an abstraction – a pattern that supports a purpose
You need purpose, therefore focus business goals, outcomes, use cases
19
No architecture = accidental architecture
The complexity of most organizations exceeds the capability of one system
Just like the DW, keep piling things on…
Accidental architecture at enterprise scale:
Kowloon full scope (above), and the detail of one
building section(left) – this is a true monolith.
Functional, served a purpose, impossible to add
services to, support, maintain.
Result: avoid, or bulldoze.
20
This focus on monolith is why the
data warehouse is in legacy mode
Everything is tightly
interconnected, such that
nobody wants to change
anything because they might
break something else.
It’s easier to start over
somewhere else.
Then they blame the data
warehouse, or the database,
or the data management
team for what is essentially
an architectural failure.
CopyrightThird Nature,Inc.
“The problem is the tools. Standardize on one tech!”
It’s called stack think. Pick your vendor. Cede all architecture.
CopyrightThird Nature,Inc.
The problem is methods and process: agile your way
into the Analytics Shantytown
CopyrightThird Nature,Inc.
“Start with the platform. The rest will follow.”
CopyrightThird Nature,Inc.CopyrightThird Nature,Inc.
Big data reality: What users seeBig data promise: What you see
CopyrightThird Nature,Inc.
HISTORY: HOW DID WE GET HERE?
CopyrightThird Nature,Inc. 4/23/2018
The market shifted from not enough data
in the 90’s to too much data: the problem
has become managing not just size, but
scope and variety
CopyrightThird Nature,Inc.
More complex needs drive more complex technology
CopyrightThird Nature,Inc.
Market hype and IT workforce skill gaps lead to FOMO
The pressure on IT to “just
buy something” is high. The
usual IT response is based on
procurement, not innovation
or integration. The data
infrastructure solution tends
to be: buy tools, collect data.
CopyrightThird Nature,Inc. 4/23/2018
Developmentaccumulates in a disorderly fashion
The end result in the IT landscape is complexity
31
Sales
SQL
BI: dashboards,
reports, queries
Customers
Inventory
Products
tables
Where we started: the data warehouse and BI
The data warehouse solved a key problem:
access to data from multiple OLTP systems
to provide a unified view of key
information.
It is built on some assumptions:you must
model the data before use, the data must
be cleaned first,the data must be in tables.
The DW is built for repeatable datause, not
for one-time uses of informationor
unknown-value datasets.
© 2018 Teradata
32
Sales
SQL
BIDiscovery
Customers
Inventory
Products
Financials
tabular
?
Analysts Anyone
tables
After a while, user self-service was required
© 2018 Teradata
Eventually we reached maturitywith
the DW, driving an increasingrate of
new data requests by departments
and individuals.
A backlog of smaller data requests
built up around .it We got self-service
tools that actually worked.
33
Sales
SQL
BIDiscovery
Customers
Inventory
Products
Financials
tabular
?
Analysts Anyone
tables
Why Isn’t the Data in the DW? Rush Jobs and Unknown Data
© 2018 Teradata
The real problem is time: business
analyses are often needed in a week
or less. If the data is not in the DW
then the analyst has to wait – the
industry average for making data
available in a DW is 10 weeks.
They need quick access to new data
rather than reusable, cleaned,
common data.This means they need
another way – and self-service tools
offer an answer other than “wait.”
34
Sales
SQL
BIDiscovery
Customers
Inventory
Products
Financials
tabular
?
Analysts Anyone
tables
But data is rarely used in isolation, DW data is often needed
© 2018 Teradata
New data usually needs to be linked
to existing data.
Users don’t justneed accessto data,
they need a place to work with and
store datatoo.
Ignore this requirement and you have
runaway copies, extracts, andfiles
tied to specific tools,with have no
visibility into what is happening.
35
Sales
SQL
BIDiscovery
Customers
Inventory
Products
Financials
? R
Analysts Anyone
tables Array /
matrix
Warehouse
events
There are limits to what you can do with queries
© 2018 Teradata
tabular
Some questionsare not answerable
with queries. Deeper analysis is
required to answer the question.
The truth is, and always has been,
that tables and a database are not the
only technology in the analytic
ecosystem.
36
Sales
SQL
BIDiscovery Data science
Customers
Inventory
Products
Financials clicks
? R Python
Analysts
Spark
Anyone Data scientists
tables Array /
matrix
Time
series
Warehouse
events
Data Science accelerates, more data, more engines
© 2018 Teradata
tabular
As BI matured,informationneeds
grew more complex. New analytics,
new data,higher volumes drove
creationof new techniquesand new
processingengines.
New techniques,new engines, means
new structuringand positioning of
data is required.
37
Sales
SQL
BIDiscovery Data science
Customers
Inventory
Products
Financials clicks
? R Python Tensor
Flow
Analysts
Spark
Anyone Data scientists
tables Array /
matrix
Time
series
Graph
Warehouse
events
More unique data, more approaches, technologies arrive monthly
© 2018 Teradata
tabular
?
e.g. Emails, images,
more events
38
Sales
SQL
BIDiscovery Data science
Customers
Inventory
Products
Financials clicks
? R Python Tensor
Flow
Analysts
Spark
Anyone Data scientists
tables Array /
matrix
Time
series
Warehouse
events
The end result of years of addition is accidental architecture
© 2018 Teradata
tabular Graph
?
e.g. Emails, images,
more events
39
Sales
SQL
BIDiscovery Data science
Customers
Inventory
Products
Financials clicks
? R Python Tensor
Flow
Analysts
Spark
Anyone Data scientists
tables Array /
matrix
Time
series
Warehouse
events
Today’s environment has (and still needs) different engines
Engines
© 2018 Teradata
tabular Graph
?
e.g. Emails, images,
more events
40
Sales
SQL
BIDiscovery Data science
Customers
Inventory
Products
Financials clicks
? R Python Tensor
Flow
Analysts
Spark
Anyone Data scientists
tables Array /
matrix
Time
series
Warehouse
events
Some engines require specific structures and positioning of data
Data storedto
engine needs
Engines
© 2018 Teradata
tabular Graph
?
e.g. Emails, images,
more events
41
Sales
SQL
BIDiscovery Data science
Customers
Inventory
Products
Financials clicks
? R Python Tensor
Flow
Analysts
Spark
Anyone Data scientists
S3HDFSRDBMSRDBMS
Raw data
tables Array /
matrix
Time
series
Warehouse
events
Distributed data is the norm, stored in multiple types of repositories
© 2018 Teradata
tabular Graph
?
e.g. Emails, images,
more events
42
Sales
SQL
BIDiscovery Data science
Customers
Inventory
Products
Financials clicks
? R Python Tensor
Flow
Analysts
Spark
Anyone Data scientists
tables Array /
matrix
Time
series
Warehouse
events
How much of the data science problem is tools vs engines?
Data storedto
engine needs
Engines
© 2018 Teradata
tabular Graph
?
e.g. Emails, images,
more events
The dirty secret of datascience: more
than 80% of the enterprise market
does it on laptops.Mostdoes not
need specialized infrastructure.
43
Sales
SQL
BIDiscovery Data science
?
Customers
Inventory
Products
Financials clicks e.g. Emails, images,
more events
? R Python Tensor
Flow
Analysts
Spark
Anyone Data scientists
tables Array /
matrix
Time
series
Warehouse
events
Managing this complexity is a growing challenge
© 2018 Teradata
tabular Graph
The challenge with use of data
science shifts to operations:how to
deploy and manage a production
environment with so many
technologies and copies of data?
44
Sales
SQL
BIDiscovery Data science
Customers
Inventory
Products
Financials clicks
? R Python Tensor
Flow
Analysts
Spark
Anyone Data scientists
tables Array /
matrix
Time
series
Warehouse
events
Accidental architecture does not lead to agility
© 2018 Teradata
tabular Graph
?
e.g. Emails, images,
more events
Accidentalarchitectureis
unsustainablein the long run.
45
What’s missing:
loosely integrated
data stored for
provisioning.
Sales
SQL
BIDiscovery Data science
Customers
Inventory
Products
Financials clicks
? R Python Tensor
Flow
Analysts
Spark
Anyone Data scientists
tables Array /
matrix
Time
series
Warehouse
events
© 2018 Teradata
tabular Graph
?
e.g. Emails, images,
more events
The arrows in architecturediagrams
hide the hard work and make things
seemsimplerthan they are.
The arrows are wherethe integration
costs and labor are buried.
46
Sales
SQL
BIDiscovery Data science
Customers
Inventory
Products
Financials clicks
? R Python Tensor
Flow
Analysts
Spark
Anyone
Curated data stored
for integration and
provisioning
tables Array /
matrix
Time
series
Warehouse
events
Data architecture is needed to fill the gap© 2018 Teradata
You need a layerto separatethe
messof raw data below from the
distributed,context-specificusesof
data above.
This minimizesthe cost of change,
providesuser control of data via
separationof data managementfrom
data use.
tabular Graph
?
e.g. Emails, images,
more events
Data scientists
47
Sales
Renovating the accidental architecture requires a transition
path to separate data infrastructure from the uses of data
© 2018 Teradata
SQL
BIDiscovery Data science
Customers
Inventory
Products
Financials clicks
? R Python Tensor
Flow
Analysts
Spark
Anyone
Data storedto
engine needs
Curateddata stored
for integrationand
provisioning
S3HDFSRDBMSRDBMS
Raw data
tables Array /
matrix
Time
series
Warehouse
events
Engines
tabular Graph
?
e.g. Emails, images,
more events
Data scientists
CopyrightThird Nature,Inc.
We continue to act as if there is one single system
“So much complexity in software
comes from trying to make one
thing do two things.”
— Ryan Singer
CopyrightThird Nature,Inc.
Break down the monolithic architecture – not one
building, multiple buildings in a block
CopyrightThird Nature,Inc.
Buildingsabove: flexibility,repurposing,
quicker change above, funded separately
Applications
Utilitiesbelow:stability, reuse, slow
predictable change below, funded centrally
Infrastructure
We built things as a single entity that
combined the building and the
infrastructure in one complex monolith
Complexity requires a shift: separate the
application from the infrastructure – we
are focused on the block, no the buildings
© Third Nature Inc.
Speed and agility
You want speed. Speed comes
from burying work in the
infrastructure, which trades
flexibility for repeatability.
Therefore you must be careful
when you draw the line
between what is above ground
and below, the uses of data
and the infrastructure.
CopyrightThird Nature,Inc.
"Always design a thing by consideringit in its next larger
context - a chair in a room, a room in a house,a house in an
environment,an environmentin a city plan." – Eliel Saarinen
CopyrightThird Nature,Inc.
Think of IT as a city
Where does a building fit?
CopyrightThird Nature,Inc.
Streaming analytics
Data science
Data collection
Self-service
/ Discovery
BI (QRD)
Self-service Data
56
The bigger picture of what we create
The ecosystem is like the city and all of
its extended neighborhoods.
The different applications or types of
projects are like the neighborhoods
with specific types of buildings.
Each individual building (application)
has its own unique blueprint.
Underlying all the applications is
shared data infrastructure, like city
services. The shared infrastructure is
the foundation of the analytics
architecture for a company.
City
Neighborhoods
Buildings and
shared infra
CopyrightThird Nature,Inc.
“Begin with the end in mind”
The starting point can’t be with
technology. That’s like starting
with bricks when designing a
house. You may get lucky but…
The goals and specific uses are
the place to start
▪ Use dictates need
▪ Need dictates capabilities
▪ Capabilities are solved with
technology
This is how you avoid spending $2M on a Hadoop andspark cluster in order to serve
data to analysts whose primary requirements aremet with laptops.
CopyrightThird Nature,Inc.
Persisting data
is not the end
of the line.
If you stop
here you win
the battle and
lose the war
CopyrightThird Nature,Inc.
We don’t have a data science problem, just like we
didn’t have a BI problem
The origin of analytics as “business
intelligence” was stated well in 1958:
…the ability to apprehend the
interrelationships of presented facts in
such a way as to guide action towards a
desired goal. ~ H. P. Luhn
“A Business Intelligence System”, http://altaplana.com/ibmrd0204H.pdf
”
“
Our goal is analytics as a capability, not a technology
© Third Nature Inc.
B I
© Third Nature Inc.
The old problem was access, the new problem is analysis
CopyrightThird Nature,Inc.
Three constituencies
Stakeholder Analyst Builder
aka the recipient aka the data scientist aka the engineer
CopyrightThird Nature,Inc.
Starting points for analytics strategy
Many organizations choose to start with
the analysts. Create a data science team.
Turn them loose to find a problem.
Many more start with builders: technology
solutions looking for problems, e.g. 65% of
the IT driven Hadoop and Spark projects
over the last five years.
The right place to start? Stakeholders.The
goal to achieve,the problem to solve.
CopyrightThird Nature,Inc.
WHAT IS THE CONTEXT OF USE?
There is an extensive list of requirements to support
Primary requirementsneededby constituents S D E
Data catalogand ability to search it for datasets X X
Self-service accessto curateddata X
Self-service accessto uncurated(unknown, new) data X X
Temporary storagefor working with data X
Data integration,cleaning,transformation, preparation toolsandenvironment X X
Persistentstoragefor source dataused by productionmodels X X
Persistentstoragefor training,testing,production data used by models X X
Storageand management of models X X
Deployment, monitoring, decommissioning models X
Lineage, traceabilityof changes made for data used by models X X
Lineage, traceabilityfor model changes X X X
Managingbaseline data / metrics for comparing model performance X X X
Managingongoing data / metrics for trackingongoing model performance X X X
S = stakeholder, user, D = data scientist, analyst, E = engineer, developer
© Third Nature Inc.
How did we get to this state
with BI & analytics?
There’s a difference
between having no past
and actively rejecting it.
67
100
1,000
10,000
1 2 3 4 5 6 7 8 9 10
Customer Data Plateaus
Add users,
accumulate
history, add
attributes, add
dimensions
Entirely new data
set enabled by
new technology at
new price point –
value has to
exceed cost (ROI).
Earliest adopters
have vision ahead
of ROI
68
1
10
100
1,000
10,000
100,000
1,000,000
10,000,000
1 2 3 4 5 6 7 8 9 10 11 12 13
User Adoption of New Data
Customer Data
Users - Data Set 1
Users - Data Set 2
Users of Integrated Data
User adoption of new data sets
starts over. Very small number of
experts growing to wider
audience, sophisticated users
moving to business analysts, then
business users, B2B customers and
even to consumers
Greater value is derived when
data sets are linked – see bigger
picture (eg who buys what,
when). Comes after initial
extractionof easy value from
standalonedata
69
• <Store, Item, Week>
• <Store, Item, Day>
– Simple aggregations
• Market Basket
– Affinity
– Link to person, demographics, HR
• Inventoryby SKU by store
– Temporal, time series, forecasting
– Link to product, marketing, market basket
Retail Plateaus
2B records
total for 9
quarters
2B records
per day,
keep 9
quarters
70
• Web Logs and traffic
– Behavioral patterns– eg path linked to person,offers,other channels
– Operationsof the web site
• Supply chain sensors – sampled at major event
– Activity Based Costing
– Link to customer, product,HR, planning
• Social Media
– Text analysis, Filtering, languages
– Link to customer, sales,other channelinteractions
• Supply chain sensors – sampled at minutes or seconds
– Telematics
– Real time, Eventdetection,trending,static and dynamicrules
– Link to HR, thresholds, forecasts,routing,planning
Retail Plateaus
30B records per day
71
1
10
100
1,000
10,000
100,000
1,000,000
10,000,000
100,000,000
1,000,000,000
10,000,000,000
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
Data Size Plateaus over Time
Lather, Rinse,
Repeat. Trend
line is roughly
Moore’s Law.
Delay or skip a
generation if new
data set is two
orders of
magnitude
instead of one.
72
1
10
100
1,000
10,000
100,000
1,000,000
10,000,000
100,000,000
1,000,000,000
10,000,000,000
100,000,000,000
1,000,000,000,000
10,000,000,000,000
100,000,000,000,000
1,000,000,000,000,000
10,000,000,000,000,000
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43
Customer Data Size vs. 2x per 12mo
Customer Data
Moore's Law - 2x/18mo
2x/12 mo
But all studies show
demand/creation of
data increasing
significantly faster
than Moore’s law
If we started only 1
OOM behind, we are
now 5 OOM away from
user demand for data
Data
Exhaust
73
FINANCE
Revenue
Expenses
Customers
CUSTOMER
CARE
Customer
Products
Orders
Case History
SALES
Orders
Customers
Products
MARKETING
Customers
Orders
Campaign
History
OPERATIONS
Inventory
Returns
Manufacturing
Supply Chain
How many
batteries
are in
inventory
by plant?
What is the
trend of
warranty
costs?
How many
people
made a
warranty
claim last
week?
How many
sales have
been
made
quarter to
date?
Which
customers
should get a
communica
tion on
extended
warranties?
54 32 29 49 66
74
2954 32 49 41
75
Given the rise in warranty costs, isolate the problem to be a
plant, then to a battery lot.
Communicate with affected customers, who have not made
a warranty claim on batteries, through Marketing and
Customer Service channels to recall cars with affected
batteries.
2855
Inventory
Returns
Manufacturing
Supply Chain
Customer Service
Orders
Revenue
Expenses
Case History
Customers
Products
Pipeline
Customers
Campaign
History
FINANCE
SALESMARKETING
OPERATIONS
CUSTOMER
EXPERIENCE
2855
76
Manufacturing: Data Overlap Analysis
New Business Improvement Opportunities through Data Leverage
Sales Force Profitability Analysis 100% 80% 66% 24% 41% 66% 0% 24% 0% 24%
Transportation Planning 11% 100% 87% 28% 64% 56% 22% 34% 13% 45%
Production Planning 6% 57% 100% 19% 83% 35% 17% 28% 9% 40%
Vendor Managed Inventory 12% 100% 100% 100% 78% 100% 61% 100% 39% 100%
Global Pricing Rationalization 4% 50% 100% 17% 100% 27% 15% 29% 11% 41%
Fulfillment (Perfect Order) 16% 94% 90% 48% 58% 100% 35% 53% 19% 56%
Manufacturing Quality
Optimization
0% 76% 88% 60% 66% 71% 100% 79% 43% 73%
Preventative Maintenance Analysis 7% 75% 93% 61% 80% 68% 50% 100% 27% 82%
Warranty Claims Analysis 0% 94% 100% 83% 100% 83% 94% 93% 100% 98%
Quality Life Cycle Improvement 4% 57% 75% 35% 64% 41% 26% 47% 16% 100%
SalesForce
ProfitabilityAnalysis
ProductionPlanning
VendorManaged
Inventory
GlobalPricing
Rationalization
TransportationPlanning
Fulfillment
(PerfectOrder)
ManufacturingQuality
Optimization
Preventative
Maintenance
Analysis
WarrantyClaims
Analysis
QualityLifeCycle
Improvement
If
Then
78
PRODUCT
SENSOR
SOCIAL
MEDIA
CUSTOMER
CARE AUDIO
RECORDINGS
DIGITAL
ADVERTISING
CLICKSTREAM
65 41 32 19 28
How many
visitors did
we have to
our hybrid
cars
microsite
yesterday?
What are
the
temperature
readings for
batteries by
Manufactur
er?
What is the
sentiment
towards line
of hybrid
vehicles?
Which
customers
likely
expressed
anger with
customer
care?
Which ad
creative
generated
the most
clicks?
79
2855
SENSOR
DIGITAL
ADVERTISING
CLICKSTREAM
INTERACTIONS
RATINGS &
REVIEWS
CUSTOMER PORTAL
INTERACTIONS
EXTERNAL
INTERACTIONS
SOCIAL MEDIA
IVR Routing
RFID
ELECTRONIC
COMMERCE
FINANCE
SALESMARKETING
Inventory
Returns
Manufacturing
Supply Chain
Customer
Service
Orders
Revenue
Expenses
Case History
Customers
Products
PipelineCustomers
Campaign History
OPERATIONS
CUSTOMER
CARE – AUDIO
RECORDINGS
Maps
Telemetry
SERVER LOGS
CUSTOMER
EXPERIENCE
Enterprise Data
80
2855
Schema on
Write
Evolving
Schema
Schema on
Read
Enterprise
Data
Evolving
Data
New Data
Sources
Schema?
81
2855
High, Well
Known Quality
Directional
Quality
Unknown/Low
quality
Well Defined
Data Model
Curated JSON,
XML, DB
Extracted
Attributes
Curation Required
82
2855
Out of Deviation
Sensor Readings
Fraud
Events
CUSTOMER EXPERIENCE
SALES
MARKETING
OPERATIONS
Abandoned
Carts
Online Price
Quotes
Social Media
Influencers
FINANCE
11234
Minimum Viable
Curation
Minimum Viable
Data Quality
83
2855
Out of Deviation
Sensor Readings
Fraud
Events
CUSTOMER
EXPERIENCE
SALES
MARKETING
OPERATIONS
Abandoned
Carts
Online Price
Quotes
SocialMedia
Influencers
FINANCE
11234
Sensor data
formatting
Unit
Normalization
Vehicle/version
normalization
Make/model/year
selection
Data Comprehension,
Pipelines
84
2855
Enterprise
Wide Use
10s to 100s
of Users
10s Users
Analysts /
engineers
Business
Analysts
Action list, Report,
Dashboard Users
Share across
Enterprise
Share in
department
Share over
cube wall
User Base and Sharing
85
Evolving
Consumption
2855
Enterprise
production
analytics
Departmental
analytics
Exploratory
analytics
Repeatable,
auditable
results
Ad-Hoc
Query, Self
Service
Data Labs
86
Evolving Consumption
Requirements
2855
Production tools,
curated data,
integrated across
business areas
Wide variety
of tools and
data forms
Targeted data access,
many applications,
response time SLAs
High CPU,
moderate to
low IO
Bulk scans, large
computation,
transformation, data access,
specific integration
Moderate CPU,
High IO, Resource
management
87
MARKETING
FINANCE
SALES
OPERATIONS
CUSTOMER
EXPERIENCE
Given the rise in warranty
costs, isolate the
problem to be a plant
and the specific lot.
Exclude 2/3rd of the
batteries fromthe lot
that are fine.
Communicate with
affected customers, who
havenot made a
warranty claim, through
Marketing and Customer
Service channels to
recall cars with affected
batteries.
2855
MANUFACTURING
CAMPAIGN
HISTORY
COSTS
PRODUCTS
CUSTOMERS
CASE HISTORY
SENSOR
Access Wide Variety of Data
to Answer a Question
Copyright Third Nature, Inc.
DATA ARCHITECTURE
We’re so focused on the light switch that we’re not
talking about the light
CopyrightThird Nature,Inc.
History: This is how BI was done through the 80s
First there were files and reporting programs.
Application files feed through a data processing pipeline to
generate an output file. The file is used by a report formatter
for print/screen. Files are largely single-purpose use.
Every report is a program written by a developer.
Data pipeline
code
CopyrightThird Nature,Inc.
History: This is how BI ended the 80s
The inevitable situation was...
Data pipeline
code
CopyrightThird Nature,Inc.
History: This is how we started the 90s
Collect data in a database. Queries replaced a LOT of
application code because much was just joins. We
learned about “dead code”
SQL
SQL
SQL
SQL
SQL
CopyrightThird Nature,Inc.
Pragmatism and Data
Lessons learned during
the ad-hoc SQL era of the
DW market:
When the technology is
awkward for the users, the
users will stop trying to use
it.
Even “simple” schemas
weren’t enough for anyone
other than analysts and
their Brio…
Led to the evolution of
metadata-driven SQL-
generating BI tools, ETL
tools.
CopyrightThird Nature,Inc.
BI evolved to hiding query generation for end users
With more regular schema models, in particular
dimensional models that didn’t contain cyclic join paths, it
was possible to automate SQL generation via semantic
mapping layers.
We developed data pipeline building tools (ETL).
Query via business terms made BI usable by non-technical
people.
ETL
SQL
Life got much easier…for a while
CopyrightThird Nature,Inc.
Today’s model: Lake + data engineers, looks familiar…
The Lake with data pipelines to files or Hive tables
is exactly the same pattern as the COBOL batch..
Dataflow
code
We already know that people don’t scale…
Copyright Third Nature, Inc.
The architecture from 1988 we SHOULD HAVE BEEN USING
The general concept of a
separate architecture for BI
has been around longer, but
this paper by Devlin and
Murphy is the first formal
data warehouse architecture
and definition published.
95
“An architecture for a business and
information system”, B. A. Devlin,
P. T. Murphy, IBM Systems Journal,
Vol.27, No. 1, (1988)
Slide 95Copyright Third Nature, Inc.
But 30 years ago we did not expect so many different
models of deployment,execution and use. Needs change
DeployETL
Data Data Storage
Alerts / Reports/
Decisioning
Deploy
f
Data Streams Intelligent Filter
/ Transform Model Execution
Analytics for eyeballs and analytics for machines are different
Copyright Third Nature, Inc.
Decouple the Architecture
The core of a data warehouse isn’t the database, it’s the
data architecture that the database and tools implement.
We need a new data architecture that is not limiting:
▪ Deals with change more easily and at scale
▪ Does not enforce requirements and models up front
▪ Does not limit the format or structure of data
▪ Assumes the range of data latencies in and out, from streaming
to one-time bulk
▪ Allows both reading and writing of data
▪ Makes data linkable, and provide governance where required
▪ Does not give up the gains of the last 25 years
Copyright Third Nature, Inc.
The goal is to decouple: solve the application and
infrastructure problems separately, independently
Data Acquisition
Collect & Store
Incremental
Batch
One-time copy
Real time
Platform Services
Data storage
Data arrives in many latencies, from real-time to one-time. Acquisition
can’t be limited by the management or consumption layers.
Data acquisition should not be
directly tied to the needs of
consumption. It must operate
independently of data use.
Copyright Third Nature, Inc.
Data hoarding is not a data management strategy
Copyright Third Nature, Inc.
The goal is to decouple: solve the application and
infrastructure problems separately, independently
Platform Services
Data Management
Process & Integrate
Data storage
Data management should not be subject to the constraints of a single use
Data management
has historically been
blended with both
data acquisition and
structuring data for
client tools. It should
be an independent
function.
Copyright Third Nature, Inc.
The goal is to decouple: solve the application and
infrastructure problems separately, independently
Platform Services
Data Access
Deliver & Use
Data storage
This separates uses of data from each other, allowingeach type of
use to structure the data specific to its own requirements.
Data access is already
somewhat separate
today. Make the
separation of different
access methods a formal
part of the architecture.
Don’t force one model.
Copyright Third Nature, Inc.
The full analytic environment subsumes all the functions
of a data lake and a data warehouse, and extends them
Data Acquisition
Collect & Store
Incremental
Batch
One-time copy
Real time
Platform Services
Data Management
Process & Integrate
Data Access
Deliver & Use
Data storage
The platform has to do more than serve queries; it has to be read-write.
Copyright Third Nature, Inc.
The data architecture must align with system components
because each of them addresses different data needs
Incremental
Collect
Batch
One-time copy
Real time
Manage & Integrate
Data Acquisition
Collect & Store
Data Management
Process & Integrate
Data Access
Deliver & Use
Separating concerns is part of the mechanism for change isolation
Copyright Third Nature, Inc.
Divide the data architecture to address three different goals
Collection
Creation, collection,
storage of new data
Distribution
Organization and
provisioning of data
to multiple points
of use
Consumption
Direct support of
data use
Separation of concerns, coordination of process
Collect Curate Consume
CopyrightThird Nature,Inc.
The design focus is different in each area
Ingredients
Goal: available
User needs a recipe
in order to make
use of the data.
Pre-mixed
Goal: discoverable
and integrateable
User needs a menu
to choose from the
data available
Meals
Goal: usable
User needs utensils
but is given a
finished meal
CopyrightThird Nature,Inc.
Food supply chain: an analogy for analytic data
Multiple contexts of use, differing quality levels
You need to keep the original because just like baking,
you can’t unmake dough once it’s mixed.
CopyrightThird Nature,Inc.
Data has to be moved, standardized, tracked
There is a lot of data policy and governance to think about
Collect Manage & Integrate
Data Acquisition
Collect & Store
Data Management
Process & Integrate
Data Access
Deliver & Use
CRM
User reg
Orders
Leads
Cust
Cust
Cust
Cust
Master DW
Mktg
Campaign
analysis
Rec
engine
CopyrightThird Nature,Inc.
Data curation
The problem with so many
sources, types, formats and
latencies of data is that it is
now impossible to create
one model for all of it in
advance.
Data modeling is about the
inside of a dataset. Curation
is about the set.
Data curation, rather than
data modeling, is becoming
the more important data
management practice.
Copyright Third Nature, Inc.
The missing ingredient from most data projects
Specifically,
metadata kept
separate from
the data.
CopyrightThird Nature,Inc.
The data is in zones of management, not isolating layers
Raw data in an immutable
storage area
Standardized or
enhanced data
Common or
usage-
specific
data
Transient data
Relax control to enable self-service while avoiding a mess.
Do not constrain access to one zone or to a single tool.
Focus on visibility of data use, not control of data.
CopyrightThird Nature,Inc.
This data architecture resolves rate of change problems
Raw data in an immutable
storage area
Standardized or
enhanced data
Common or
usage-
specific data
Transient data
New data of
unknown value,
simple requests for
new data can land
here first, with little
work by IT.
More effort applied to
management, slower.
Optimized for
specific uses /
workloads. Generally
the slowest change.
Not fast vs slow:
fast vs right
Not flexibility vs
control: flexibility
vs repeatability
Agile for
structure change
vs agile for
questions / use
Copyright Third Nature, Inc.
The concept of a zone is not a physical system. It’s data architecture
Apps
DW
The biggest decision is to separate all data collection from
the data integration from consumption.
Physical system/technology overlays are separate, depend
on the specific use cases and needs of the organization.
Dist FS
RDBMSs
Decompress &
process device logs
RDBMS
RDBMS
multiple
Discovery
platform
Copyright Third Nature, Inc.
Data has to be moved, standardized, tracked
There is a lot of data policy and governance to think about
Collect Manage & Integrate
Data Acquisition
Collect & Store
Data Management
Process & Integrate
Data Access
Deliver & Use
CRM
User reg
Orders
Leads
Cust
Cust
Cust
Cust
Master DW
Mktg
Campaign
analysis
Rec
engine
Copyright Third Nature, Inc.
HOW DO YOU EVOLVE
WHAT YOU HAVE
NOW?
Copyright Third Nature, Inc.
Agile methods without agile architectures fail
119
Process and governance are part of architecture
Data policy and organizational model go together. Policy
choices are like land use policies. Want free access to
water? Ribbon farms in a capitalist world, domains in a
feudal manor.
Copyright Third Nature, Inc.
Reinforcing relationships resist change, despite
radical technology and practice shifts
Note how only one third is tech
Architectural
Regime
MethodologyTechnology
Organization
Organization
defines wherethe
work is done and
the roles.
Technology
defines what
work can be done
in a given area. Methodology
defines how
work is done
and what that
work is.
Slide 120
Copyright Third Nature, Inc.
What about the technology? Do I need an <X>?
122
“Technology first” thinking creates implementation problems
If you start with technology,it will constrain the problems you can solve
123
Blended Architectures Are a Requirement,
Not an Option
Data Warehouse + Data Lake
On Premise + Cloud
RDBMS + S3 + HDFS
Commercial + Open Source
You can’t just buy one thing platform from one vendor. We aren’t building a
death star.
Each of the zones is likely to have products specific to that zone’s usage. The
uses differ, the people using them differ, shouldn’t the tools should differ too?
Copyright Third Nature, Inc.
In other words…
Software is like puppies.
Getting a puppy is easy, raising one is hard.
“The short term benefits of using a new [type of]
database exceed the long term cost of operating it.”
Dan Mckinley
Copyright Third Nature, Inc.
Choose Boring Technology
You only get so many chances
to make big changes. Don’t
waste them.
You can spend time focusing
on the goal and worry less
about the known tech, or you
can spend time learning the
new tech but less time
focusing on the goal.
The important thing is not the
choice of tech, it’s knowing
when the time is right to make
a new tech choice.
CopyrightThird Nature,Inc.
TANSTAAFL
When replacing the old
with the new (or ignoring
the new over the old) you
always make tradeoffs,
and usually you won’t see
them for a long time.
Technologiesare not
perfect replacementsfor
one another. Often not
better, only different.
© Third Nature Inc.
Manage your data
(or it will manage you)
Data management is where
developers are weakest.
Modern engineering
practices are where data
management is weakest.
You need to bridge these
groups and practices in the
organization if you want to
do meaningful work with
data. Remember Conway’s
Law when you build.
© Third Nature Inc.
“Begin with the end in mind”
The starting point can’t be with
technology. That’s like starting
with bricks when designing a
house. You may get lucky but…
The goals and specific uses are
the place to start
▪ Use dictates need
▪ Need dictates capabilities
▪ Capabilities are solved with
technology
This is how you avoid spending $20M on a new cluster in order to serve data to
analysts whose primary requirements aremet with laptops.
© Third Nature Inc.
A good design is not the one that
correctly predicts the future, it’s
one that makes adapting to the
future affordable.
— Venkat Subramaniam
Mark Madsen is the global head of
architectureat Think Big Analytics,
Prior to that he was president of Third
Nature, a research and consulting firm
focused on analytics, data integration
and data management.Mark is an
award-winning author, architectand
CTO whose work has been featured in
numerous industry publications.Over
the past ten years Mark received
awards for his work from the American
Productivity& Quality Center, TDWI,
and the Smithsonian Institute.He is an
international speaker, chairs several
conferences, and is on the O’Reilly
Strataprogram committee. For more
information or to contactMark, follow
@markmadsen on Twitter.
About the Presenter
Todd Walter
Chief Technologist - Teradata
• Chief Technologist for Teradata
• A pragmatic visionary, Walter helps business leaders,
analysts and technologists better understand all of the
astonishing possibilities of big data and analytics
• Works with organizations of all sizes and levels of
experience at the leading edge of adopting big data, data
warehouse and analytics technologies
• With Teradata for more than 30 years and served for more
than 10 years as CTO of Teradata Labs, contributing
significantly to Teradata’s unique design features and
functionality
• Holds more than a dozen Teradata patents and is a
Teradata Fellow in recognition of his long record of
technical innovation and contribution to the company

More Related Content

What's hot

Assumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slidesAssumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slidesmark madsen
 
Operationalizing Machine Learning in the Enterprise
Operationalizing Machine Learning in the EnterpriseOperationalizing Machine Learning in the Enterprise
Operationalizing Machine Learning in the Enterprisemark madsen
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Caserta
 
Building Data Science Teams
Building Data Science TeamsBuilding Data Science Teams
Building Data Science TeamsEMC
 
Analytics 3.0 Measurable business impact from analytics & big data
Analytics 3.0 Measurable business impact from analytics & big dataAnalytics 3.0 Measurable business impact from analytics & big data
Analytics 3.0 Measurable business impact from analytics & big dataMicrosoft
 
Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Domino Data Lab
 
How to Build Data Science Teams
How to Build Data Science TeamsHow to Build Data Science Teams
How to Build Data Science TeamsGanes Kesari
 
Idiots guide to setting up a data science team
Idiots guide to setting up a data science teamIdiots guide to setting up a data science team
Idiots guide to setting up a data science teamAshish Bansal
 
Data science team, a practice to setup
Data science team, a practice to setupData science team, a practice to setup
Data science team, a practice to setupOmid Mogharian
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) Dataiku
 
Everything Has Changed Except Us: Modernizing the Data Warehouse
Everything Has Changed Except Us: Modernizing the Data WarehouseEverything Has Changed Except Us: Modernizing the Data Warehouse
Everything Has Changed Except Us: Modernizing the Data Warehousemark madsen
 
Domino and AWS: collaborative analytics and model governance at financial ser...
Domino and AWS: collaborative analytics and model governance at financial ser...Domino and AWS: collaborative analytics and model governance at financial ser...
Domino and AWS: collaborative analytics and model governance at financial ser...Domino Data Lab
 
Building Data Science Teams: A Moneyball Approach
Building Data Science Teams: A Moneyball ApproachBuilding Data Science Teams: A Moneyball Approach
Building Data Science Teams: A Moneyball Approachjoshwills
 
How to Build Successful Data Team - Dataiku ?
How to Build Successful Data Team -  Dataiku ? How to Build Successful Data Team -  Dataiku ?
How to Build Successful Data Team - Dataiku ? Dataiku
 
Big data and Predictive Analytics By : Professor Lili Saghafi
Big data and Predictive Analytics By : Professor Lili SaghafiBig data and Predictive Analytics By : Professor Lili Saghafi
Big data and Predictive Analytics By : Professor Lili SaghafiProfessor Lili Saghafi
 
Back to Square One: Building a Data Science Team from Scratch
Back to Square One: Building a Data Science Team from ScratchBack to Square One: Building a Data Science Team from Scratch
Back to Square One: Building a Data Science Team from ScratchKlaas Bosteels
 
Moving Data Science from an Event to A Program: Considerations in Creating Su...
Moving Data Science from an Event to A Program: Considerations in Creating Su...Moving Data Science from an Event to A Program: Considerations in Creating Su...
Moving Data Science from an Event to A Program: Considerations in Creating Su...Domino Data Lab
 
Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...
Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...
Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...Formulatedby
 

What's hot (20)

Assumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slidesAssumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slides
 
Operationalizing Machine Learning in the Enterprise
Operationalizing Machine Learning in the EnterpriseOperationalizing Machine Learning in the Enterprise
Operationalizing Machine Learning in the Enterprise
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)
 
Machine Learning in Big Data
Machine Learning in Big DataMachine Learning in Big Data
Machine Learning in Big Data
 
Building Data Science Teams
Building Data Science TeamsBuilding Data Science Teams
Building Data Science Teams
 
Analytics 3.0 Measurable business impact from analytics & big data
Analytics 3.0 Measurable business impact from analytics & big dataAnalytics 3.0 Measurable business impact from analytics & big data
Analytics 3.0 Measurable business impact from analytics & big data
 
Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field
 
How to Build Data Science Teams
How to Build Data Science TeamsHow to Build Data Science Teams
How to Build Data Science Teams
 
Idiots guide to setting up a data science team
Idiots guide to setting up a data science teamIdiots guide to setting up a data science team
Idiots guide to setting up a data science team
 
Data science team, a practice to setup
Data science team, a practice to setupData science team, a practice to setup
Data science team, a practice to setup
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Everything Has Changed Except Us: Modernizing the Data Warehouse
Everything Has Changed Except Us: Modernizing the Data WarehouseEverything Has Changed Except Us: Modernizing the Data Warehouse
Everything Has Changed Except Us: Modernizing the Data Warehouse
 
Domino and AWS: collaborative analytics and model governance at financial ser...
Domino and AWS: collaborative analytics and model governance at financial ser...Domino and AWS: collaborative analytics and model governance at financial ser...
Domino and AWS: collaborative analytics and model governance at financial ser...
 
Building Data Science Teams: A Moneyball Approach
Building Data Science Teams: A Moneyball ApproachBuilding Data Science Teams: A Moneyball Approach
Building Data Science Teams: A Moneyball Approach
 
How to Build Successful Data Team - Dataiku ?
How to Build Successful Data Team -  Dataiku ? How to Build Successful Data Team -  Dataiku ?
How to Build Successful Data Team - Dataiku ?
 
Big data and Predictive Analytics By : Professor Lili Saghafi
Big data and Predictive Analytics By : Professor Lili SaghafiBig data and Predictive Analytics By : Professor Lili Saghafi
Big data and Predictive Analytics By : Professor Lili Saghafi
 
Back to Square One: Building a Data Science Team from Scratch
Back to Square One: Building a Data Science Team from ScratchBack to Square One: Building a Data Science Team from Scratch
Back to Square One: Building a Data Science Team from Scratch
 
Moving Data Science from an Event to A Program: Considerations in Creating Su...
Moving Data Science from an Event to A Program: Considerations in Creating Su...Moving Data Science from an Event to A Program: Considerations in Creating Su...
Moving Data Science from an Event to A Program: Considerations in Creating Su...
 
Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...
Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...
Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Le...
 

Similar to Building a Data Platform Strata SF 2019

Four Key Considerations for your Big Data Analytics Strategy
Four Key Considerations for your Big Data Analytics StrategyFour Key Considerations for your Big Data Analytics Strategy
Four Key Considerations for your Big Data Analytics StrategyArcadia Data
 
Modern Data Challenges require Modern Graph Technology
Modern Data Challenges require Modern Graph TechnologyModern Data Challenges require Modern Graph Technology
Modern Data Challenges require Modern Graph TechnologyNeo4j
 
Washington DC DataOps Meetup -- Nov 2019
Washington DC DataOps Meetup   -- Nov 2019Washington DC DataOps Meetup   -- Nov 2019
Washington DC DataOps Meetup -- Nov 2019DataKitchen
 
Big Data Developer Career Path: Job & Interview Preparation
Big Data Developer Career Path: Job & Interview PreparationBig Data Developer Career Path: Job & Interview Preparation
Big Data Developer Career Path: Job & Interview PreparationIntellipaat
 
Challenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdf
Challenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdfChallenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdf
Challenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdfvenkatakeerthi3
 
DAMA Webinar: Turn Grand Designs into a Reality with Data Virtualization
DAMA Webinar: Turn Grand Designs into a Reality with Data VirtualizationDAMA Webinar: Turn Grand Designs into a Reality with Data Virtualization
DAMA Webinar: Turn Grand Designs into a Reality with Data VirtualizationDenodo
 
CTO Radshow Hamburg17 - Keynote - The CxO responsibilities in Big Data and AI...
CTO Radshow Hamburg17 - Keynote - The CxO responsibilities in Big Data and AI...CTO Radshow Hamburg17 - Keynote - The CxO responsibilities in Big Data and AI...
CTO Radshow Hamburg17 - Keynote - The CxO responsibilities in Big Data and AI...Santiago Cabrera-Naranjo
 
DataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data ArchitectureDataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data ArchitectureDATAVERSITY
 
Modern data integration expert sessions
Modern data integration expert sessionsModern data integration expert sessions
Modern data integration expert sessionsJessicaMurrell3
 
Modern Data Integration Expert Session Webinar
Modern Data Integration Expert Session Webinar Modern Data Integration Expert Session Webinar
Modern Data Integration Expert Session Webinar ibi
 
Why Everything You Know About bigdata Is A Lie
Why Everything You Know About bigdata Is A LieWhy Everything You Know About bigdata Is A Lie
Why Everything You Know About bigdata Is A LieSunil Ranka
 
The Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of HadoopThe Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of HadoopInside Analysis
 
Best Practices for Scaling Data Science Across the Organization
Best Practices for Scaling Data Science Across the OrganizationBest Practices for Scaling Data Science Across the Organization
Best Practices for Scaling Data Science Across the OrganizationChasity Gibson
 
Expert Big Data Tips
Expert Big Data TipsExpert Big Data Tips
Expert Big Data TipsQubole
 
Big Data
Big DataBig Data
Big DataNGDATA
 

Similar to Building a Data Platform Strata SF 2019 (20)

Four Key Considerations for your Big Data Analytics Strategy
Four Key Considerations for your Big Data Analytics StrategyFour Key Considerations for your Big Data Analytics Strategy
Four Key Considerations for your Big Data Analytics Strategy
 
Modern Data Challenges require Modern Graph Technology
Modern Data Challenges require Modern Graph TechnologyModern Data Challenges require Modern Graph Technology
Modern Data Challenges require Modern Graph Technology
 
Washington DC DataOps Meetup -- Nov 2019
Washington DC DataOps Meetup   -- Nov 2019Washington DC DataOps Meetup   -- Nov 2019
Washington DC DataOps Meetup -- Nov 2019
 
Big Data Developer Career Path: Job & Interview Preparation
Big Data Developer Career Path: Job & Interview PreparationBig Data Developer Career Path: Job & Interview Preparation
Big Data Developer Career Path: Job & Interview Preparation
 
Challenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdf
Challenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdfChallenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdf
Challenges Of A Junior Data Scientist_ Best Tips To Help You Along The Way.pdf
 
Big Data at a Glance
Big Data at a GlanceBig Data at a Glance
Big Data at a Glance
 
Big data
Big dataBig data
Big data
 
DAMA Webinar: Turn Grand Designs into a Reality with Data Virtualization
DAMA Webinar: Turn Grand Designs into a Reality with Data VirtualizationDAMA Webinar: Turn Grand Designs into a Reality with Data Virtualization
DAMA Webinar: Turn Grand Designs into a Reality with Data Virtualization
 
CTO Radshow Hamburg17 - Keynote - The CxO responsibilities in Big Data and AI...
CTO Radshow Hamburg17 - Keynote - The CxO responsibilities in Big Data and AI...CTO Radshow Hamburg17 - Keynote - The CxO responsibilities in Big Data and AI...
CTO Radshow Hamburg17 - Keynote - The CxO responsibilities in Big Data and AI...
 
DataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data ArchitectureDataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data Architecture
 
Unlocking big data
Unlocking big dataUnlocking big data
Unlocking big data
 
Modern data integration expert sessions
Modern data integration expert sessionsModern data integration expert sessions
Modern data integration expert sessions
 
Modern Data Integration Expert Session Webinar
Modern Data Integration Expert Session Webinar Modern Data Integration Expert Session Webinar
Modern Data Integration Expert Session Webinar
 
Why Everything You Know About bigdata Is A Lie
Why Everything You Know About bigdata Is A LieWhy Everything You Know About bigdata Is A Lie
Why Everything You Know About bigdata Is A Lie
 
The Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of HadoopThe Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of Hadoop
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
Best Practices for Scaling Data Science Across the Organization
Best Practices for Scaling Data Science Across the OrganizationBest Practices for Scaling Data Science Across the Organization
Best Practices for Scaling Data Science Across the Organization
 
Expert Big Data Tips
Expert Big Data TipsExpert Big Data Tips
Expert Big Data Tips
 
Big Data
Big DataBig Data
Big Data
 
7 trends-for-big-data
7 trends-for-big-data7 trends-for-big-data
7 trends-for-big-data
 

More from mark madsen

A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Range
A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou RangeA Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Range
A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Rangemark madsen
 
A Pragmatic Approach to Analyzing Customers
A Pragmatic Approach to Analyzing CustomersA Pragmatic Approach to Analyzing Customers
A Pragmatic Approach to Analyzing Customersmark madsen
 
Disruptive Innovation: how do you use these theories to manage your IT?
Disruptive Innovation: how do you use these theories to manage your IT?Disruptive Innovation: how do you use these theories to manage your IT?
Disruptive Innovation: how do you use these theories to manage your IT?mark madsen
 
Briefing room: An alternative for streaming data collection
Briefing room: An alternative for streaming data collectionBriefing room: An alternative for streaming data collection
Briefing room: An alternative for streaming data collectionmark madsen
 
Building the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architectureBuilding the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architecturemark madsen
 
Briefing Room analyst comments - streaming analytics
Briefing Room analyst comments - streaming analyticsBriefing Room analyst comments - streaming analytics
Briefing Room analyst comments - streaming analyticsmark madsen
 
Everything has changed except us
Everything has changed except usEverything has changed except us
Everything has changed except usmark madsen
 
Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)mark madsen
 
On the edge: analytics for the modern enterprise (analyst comments)
On the edge: analytics for the modern enterprise (analyst comments)On the edge: analytics for the modern enterprise (analyst comments)
On the edge: analytics for the modern enterprise (analyst comments)mark madsen
 
Crossing the chasm with a high performance dynamically scalable open source p...
Crossing the chasm with a high performance dynamically scalable open source p...Crossing the chasm with a high performance dynamically scalable open source p...
Crossing the chasm with a high performance dynamically scalable open source p...mark madsen
 
Don't let data get in the way of a good story
Don't let data get in the way of a good storyDon't let data get in the way of a good story
Don't let data get in the way of a good storymark madsen
 
Big Data and Bad Analogies
Big Data and Bad AnalogiesBig Data and Bad Analogies
Big Data and Bad Analogiesmark madsen
 
Don't follow the followers
Don't follow the followersDon't follow the followers
Don't follow the followersmark madsen
 
Exploring cloud for data warehousing
Exploring cloud for data warehousingExploring cloud for data warehousing
Exploring cloud for data warehousingmark madsen
 
Open Data: Free Data Isn't the Same as Freeing Data
Open Data: Free Data Isn't the Same as Freeing DataOpen Data: Free Data Isn't the Same as Freeing Data
Open Data: Free Data Isn't the Same as Freeing Datamark madsen
 
Exploring cloud for data warehousing
Exploring cloud for data warehousingExploring cloud for data warehousing
Exploring cloud for data warehousingmark madsen
 
Wake up and smell the data
Wake up and smell the dataWake up and smell the data
Wake up and smell the datamark madsen
 
Big Data Wonderland: Two Views on the Big Data Revolution
Big Data Wonderland: Two Views on the Big Data RevolutionBig Data Wonderland: Two Views on the Big Data Revolution
Big Data Wonderland: Two Views on the Big Data Revolutionmark madsen
 
Using Data Virtualization to Integrate With Big Data
Using Data Virtualization to Integrate With Big DataUsing Data Virtualization to Integrate With Big Data
Using Data Virtualization to Integrate With Big Datamark madsen
 
One Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database RevolutionOne Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database Revolutionmark madsen
 

More from mark madsen (20)

A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Range
A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou RangeA Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Range
A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Range
 
A Pragmatic Approach to Analyzing Customers
A Pragmatic Approach to Analyzing CustomersA Pragmatic Approach to Analyzing Customers
A Pragmatic Approach to Analyzing Customers
 
Disruptive Innovation: how do you use these theories to manage your IT?
Disruptive Innovation: how do you use these theories to manage your IT?Disruptive Innovation: how do you use these theories to manage your IT?
Disruptive Innovation: how do you use these theories to manage your IT?
 
Briefing room: An alternative for streaming data collection
Briefing room: An alternative for streaming data collectionBriefing room: An alternative for streaming data collection
Briefing room: An alternative for streaming data collection
 
Building the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architectureBuilding the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architecture
 
Briefing Room analyst comments - streaming analytics
Briefing Room analyst comments - streaming analyticsBriefing Room analyst comments - streaming analytics
Briefing Room analyst comments - streaming analytics
 
Everything has changed except us
Everything has changed except usEverything has changed except us
Everything has changed except us
 
Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)
 
On the edge: analytics for the modern enterprise (analyst comments)
On the edge: analytics for the modern enterprise (analyst comments)On the edge: analytics for the modern enterprise (analyst comments)
On the edge: analytics for the modern enterprise (analyst comments)
 
Crossing the chasm with a high performance dynamically scalable open source p...
Crossing the chasm with a high performance dynamically scalable open source p...Crossing the chasm with a high performance dynamically scalable open source p...
Crossing the chasm with a high performance dynamically scalable open source p...
 
Don't let data get in the way of a good story
Don't let data get in the way of a good storyDon't let data get in the way of a good story
Don't let data get in the way of a good story
 
Big Data and Bad Analogies
Big Data and Bad AnalogiesBig Data and Bad Analogies
Big Data and Bad Analogies
 
Don't follow the followers
Don't follow the followersDon't follow the followers
Don't follow the followers
 
Exploring cloud for data warehousing
Exploring cloud for data warehousingExploring cloud for data warehousing
Exploring cloud for data warehousing
 
Open Data: Free Data Isn't the Same as Freeing Data
Open Data: Free Data Isn't the Same as Freeing DataOpen Data: Free Data Isn't the Same as Freeing Data
Open Data: Free Data Isn't the Same as Freeing Data
 
Exploring cloud for data warehousing
Exploring cloud for data warehousingExploring cloud for data warehousing
Exploring cloud for data warehousing
 
Wake up and smell the data
Wake up and smell the dataWake up and smell the data
Wake up and smell the data
 
Big Data Wonderland: Two Views on the Big Data Revolution
Big Data Wonderland: Two Views on the Big Data RevolutionBig Data Wonderland: Two Views on the Big Data Revolution
Big Data Wonderland: Two Views on the Big Data Revolution
 
Using Data Virtualization to Integrate With Big Data
Using Data Virtualization to Integrate With Big DataUsing Data Virtualization to Integrate With Big Data
Using Data Virtualization to Integrate With Big Data
 
One Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database RevolutionOne Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database Revolution
 

Recently uploaded

Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 

Recently uploaded (20)

Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 

Building a Data Platform Strata SF 2019

  • 1. CopyrightThird Nature,Inc. Architecting a Data Platform For Enterprise Use March 2019 Mark Madsen Todd Walter
  • 2. CopyrightThird Nature,Inc. Why are we talking about architecture?
  • 3. 4 © 2017 Gartner, Inc.and/or its affiliates. All rights reserved. Gartner statement in 2018: only 15% are reported to be successful only 17%of Hadoop deployments are in production in 2017 Survey Analysis: BI and Analytics Spending Intentions, 2017 A McKinsey survey this year asked executives if their company had achieved a positive ROI with their big data projects: 7% answered “yes” Gartner Finding: in 2017
  • 4. CopyrightThird Nature,Inc. DW: Centralize, that solves all problems! Creates bottlenecks Causes scale problems Availability?
  • 5. CopyrightThird Nature,Inc. The data lake solution: no central authority wtf, it was fully operational!
  • 6. CopyrightThird Nature,Inc. The data lake solution? There’s a problem: as the lake is envisioned, it is still a centralized data architecture, but this time there is no single global model. Instead it’s files and not modeled. It can be operational while under construction. It’s still a death star.
  • 7. CopyrightThird Nature,Inc. Eventually we run into the same problems Seriously, wtf? It was agile and operational You’re building a centralized model without realizing it. Maybe we can avoid recreating the same problems.
  • 8. Copyright Third Nature, Inc. The solution to our problems isn’t technology, it’s architecture.
  • 9. CopyrightThird Nature,Inc. Architecture? What does the architect do? What is the architect’s responsibility and role? What is the core problem of data architecture?
  • 13. Copyright Third Nature, Inc. Bricks are not buildings We don’t think this is equivalent to this Architecture is not technology. It’s not a product you can buy.
  • 14. Copyright Third Nature, Inc. Blueprints are not architectures
  • 15. Copyright Third Nature, Inc. Technical wiring diagrams Product lists Pretty pictures Hand-waving abstract statements and rules A thing you can purchase …so what is it? Architecture is not…
  • 16. Copyright Third Nature, Inc. What is this?
  • 17. Architecture is an abstraction – a pattern that supports a purpose You need purpose, therefore focus business goals, outcomes, use cases
  • 18. 19 No architecture = accidental architecture The complexity of most organizations exceeds the capability of one system Just like the DW, keep piling things on… Accidental architecture at enterprise scale: Kowloon full scope (above), and the detail of one building section(left) – this is a true monolith. Functional, served a purpose, impossible to add services to, support, maintain. Result: avoid, or bulldoze.
  • 19. 20 This focus on monolith is why the data warehouse is in legacy mode Everything is tightly interconnected, such that nobody wants to change anything because they might break something else. It’s easier to start over somewhere else. Then they blame the data warehouse, or the database, or the data management team for what is essentially an architectural failure.
  • 20. CopyrightThird Nature,Inc. “The problem is the tools. Standardize on one tech!” It’s called stack think. Pick your vendor. Cede all architecture.
  • 21. CopyrightThird Nature,Inc. The problem is methods and process: agile your way into the Analytics Shantytown
  • 22. CopyrightThird Nature,Inc. “Start with the platform. The rest will follow.”
  • 23. CopyrightThird Nature,Inc.CopyrightThird Nature,Inc. Big data reality: What users seeBig data promise: What you see
  • 25. CopyrightThird Nature,Inc. 4/23/2018 The market shifted from not enough data in the 90’s to too much data: the problem has become managing not just size, but scope and variety
  • 26. CopyrightThird Nature,Inc. More complex needs drive more complex technology
  • 27. CopyrightThird Nature,Inc. Market hype and IT workforce skill gaps lead to FOMO The pressure on IT to “just buy something” is high. The usual IT response is based on procurement, not innovation or integration. The data infrastructure solution tends to be: buy tools, collect data.
  • 29. The end result in the IT landscape is complexity
  • 30. 31 Sales SQL BI: dashboards, reports, queries Customers Inventory Products tables Where we started: the data warehouse and BI The data warehouse solved a key problem: access to data from multiple OLTP systems to provide a unified view of key information. It is built on some assumptions:you must model the data before use, the data must be cleaned first,the data must be in tables. The DW is built for repeatable datause, not for one-time uses of informationor unknown-value datasets. © 2018 Teradata
  • 31. 32 Sales SQL BIDiscovery Customers Inventory Products Financials tabular ? Analysts Anyone tables After a while, user self-service was required © 2018 Teradata Eventually we reached maturitywith the DW, driving an increasingrate of new data requests by departments and individuals. A backlog of smaller data requests built up around .it We got self-service tools that actually worked.
  • 32. 33 Sales SQL BIDiscovery Customers Inventory Products Financials tabular ? Analysts Anyone tables Why Isn’t the Data in the DW? Rush Jobs and Unknown Data © 2018 Teradata The real problem is time: business analyses are often needed in a week or less. If the data is not in the DW then the analyst has to wait – the industry average for making data available in a DW is 10 weeks. They need quick access to new data rather than reusable, cleaned, common data.This means they need another way – and self-service tools offer an answer other than “wait.”
  • 33. 34 Sales SQL BIDiscovery Customers Inventory Products Financials tabular ? Analysts Anyone tables But data is rarely used in isolation, DW data is often needed © 2018 Teradata New data usually needs to be linked to existing data. Users don’t justneed accessto data, they need a place to work with and store datatoo. Ignore this requirement and you have runaway copies, extracts, andfiles tied to specific tools,with have no visibility into what is happening.
  • 34. 35 Sales SQL BIDiscovery Customers Inventory Products Financials ? R Analysts Anyone tables Array / matrix Warehouse events There are limits to what you can do with queries © 2018 Teradata tabular Some questionsare not answerable with queries. Deeper analysis is required to answer the question. The truth is, and always has been, that tables and a database are not the only technology in the analytic ecosystem.
  • 35. 36 Sales SQL BIDiscovery Data science Customers Inventory Products Financials clicks ? R Python Analysts Spark Anyone Data scientists tables Array / matrix Time series Warehouse events Data Science accelerates, more data, more engines © 2018 Teradata tabular As BI matured,informationneeds grew more complex. New analytics, new data,higher volumes drove creationof new techniquesand new processingengines. New techniques,new engines, means new structuringand positioning of data is required.
  • 36. 37 Sales SQL BIDiscovery Data science Customers Inventory Products Financials clicks ? R Python Tensor Flow Analysts Spark Anyone Data scientists tables Array / matrix Time series Graph Warehouse events More unique data, more approaches, technologies arrive monthly © 2018 Teradata tabular ? e.g. Emails, images, more events
  • 37. 38 Sales SQL BIDiscovery Data science Customers Inventory Products Financials clicks ? R Python Tensor Flow Analysts Spark Anyone Data scientists tables Array / matrix Time series Warehouse events The end result of years of addition is accidental architecture © 2018 Teradata tabular Graph ? e.g. Emails, images, more events
  • 38. 39 Sales SQL BIDiscovery Data science Customers Inventory Products Financials clicks ? R Python Tensor Flow Analysts Spark Anyone Data scientists tables Array / matrix Time series Warehouse events Today’s environment has (and still needs) different engines Engines © 2018 Teradata tabular Graph ? e.g. Emails, images, more events
  • 39. 40 Sales SQL BIDiscovery Data science Customers Inventory Products Financials clicks ? R Python Tensor Flow Analysts Spark Anyone Data scientists tables Array / matrix Time series Warehouse events Some engines require specific structures and positioning of data Data storedto engine needs Engines © 2018 Teradata tabular Graph ? e.g. Emails, images, more events
  • 40. 41 Sales SQL BIDiscovery Data science Customers Inventory Products Financials clicks ? R Python Tensor Flow Analysts Spark Anyone Data scientists S3HDFSRDBMSRDBMS Raw data tables Array / matrix Time series Warehouse events Distributed data is the norm, stored in multiple types of repositories © 2018 Teradata tabular Graph ? e.g. Emails, images, more events
  • 41. 42 Sales SQL BIDiscovery Data science Customers Inventory Products Financials clicks ? R Python Tensor Flow Analysts Spark Anyone Data scientists tables Array / matrix Time series Warehouse events How much of the data science problem is tools vs engines? Data storedto engine needs Engines © 2018 Teradata tabular Graph ? e.g. Emails, images, more events The dirty secret of datascience: more than 80% of the enterprise market does it on laptops.Mostdoes not need specialized infrastructure.
  • 42. 43 Sales SQL BIDiscovery Data science ? Customers Inventory Products Financials clicks e.g. Emails, images, more events ? R Python Tensor Flow Analysts Spark Anyone Data scientists tables Array / matrix Time series Warehouse events Managing this complexity is a growing challenge © 2018 Teradata tabular Graph The challenge with use of data science shifts to operations:how to deploy and manage a production environment with so many technologies and copies of data?
  • 43. 44 Sales SQL BIDiscovery Data science Customers Inventory Products Financials clicks ? R Python Tensor Flow Analysts Spark Anyone Data scientists tables Array / matrix Time series Warehouse events Accidental architecture does not lead to agility © 2018 Teradata tabular Graph ? e.g. Emails, images, more events Accidentalarchitectureis unsustainablein the long run.
  • 44. 45 What’s missing: loosely integrated data stored for provisioning. Sales SQL BIDiscovery Data science Customers Inventory Products Financials clicks ? R Python Tensor Flow Analysts Spark Anyone Data scientists tables Array / matrix Time series Warehouse events © 2018 Teradata tabular Graph ? e.g. Emails, images, more events The arrows in architecturediagrams hide the hard work and make things seemsimplerthan they are. The arrows are wherethe integration costs and labor are buried.
  • 45. 46 Sales SQL BIDiscovery Data science Customers Inventory Products Financials clicks ? R Python Tensor Flow Analysts Spark Anyone Curated data stored for integration and provisioning tables Array / matrix Time series Warehouse events Data architecture is needed to fill the gap© 2018 Teradata You need a layerto separatethe messof raw data below from the distributed,context-specificusesof data above. This minimizesthe cost of change, providesuser control of data via separationof data managementfrom data use. tabular Graph ? e.g. Emails, images, more events Data scientists
  • 46. 47 Sales Renovating the accidental architecture requires a transition path to separate data infrastructure from the uses of data © 2018 Teradata SQL BIDiscovery Data science Customers Inventory Products Financials clicks ? R Python Tensor Flow Analysts Spark Anyone Data storedto engine needs Curateddata stored for integrationand provisioning S3HDFSRDBMSRDBMS Raw data tables Array / matrix Time series Warehouse events Engines tabular Graph ? e.g. Emails, images, more events Data scientists
  • 47. CopyrightThird Nature,Inc. We continue to act as if there is one single system “So much complexity in software comes from trying to make one thing do two things.” — Ryan Singer
  • 48. CopyrightThird Nature,Inc. Break down the monolithic architecture – not one building, multiple buildings in a block
  • 49. CopyrightThird Nature,Inc. Buildingsabove: flexibility,repurposing, quicker change above, funded separately Applications Utilitiesbelow:stability, reuse, slow predictable change below, funded centrally Infrastructure We built things as a single entity that combined the building and the infrastructure in one complex monolith Complexity requires a shift: separate the application from the infrastructure – we are focused on the block, no the buildings
  • 50. © Third Nature Inc. Speed and agility You want speed. Speed comes from burying work in the infrastructure, which trades flexibility for repeatability. Therefore you must be careful when you draw the line between what is above ground and below, the uses of data and the infrastructure.
  • 51. CopyrightThird Nature,Inc. "Always design a thing by consideringit in its next larger context - a chair in a room, a room in a house,a house in an environment,an environmentin a city plan." – Eliel Saarinen
  • 52. CopyrightThird Nature,Inc. Think of IT as a city Where does a building fit?
  • 53. CopyrightThird Nature,Inc. Streaming analytics Data science Data collection Self-service / Discovery BI (QRD) Self-service Data
  • 54. 56 The bigger picture of what we create The ecosystem is like the city and all of its extended neighborhoods. The different applications or types of projects are like the neighborhoods with specific types of buildings. Each individual building (application) has its own unique blueprint. Underlying all the applications is shared data infrastructure, like city services. The shared infrastructure is the foundation of the analytics architecture for a company. City Neighborhoods Buildings and shared infra
  • 55. CopyrightThird Nature,Inc. “Begin with the end in mind” The starting point can’t be with technology. That’s like starting with bricks when designing a house. You may get lucky but… The goals and specific uses are the place to start ▪ Use dictates need ▪ Need dictates capabilities ▪ Capabilities are solved with technology This is how you avoid spending $2M on a Hadoop andspark cluster in order to serve data to analysts whose primary requirements aremet with laptops.
  • 56. CopyrightThird Nature,Inc. Persisting data is not the end of the line. If you stop here you win the battle and lose the war
  • 57. CopyrightThird Nature,Inc. We don’t have a data science problem, just like we didn’t have a BI problem The origin of analytics as “business intelligence” was stated well in 1958: …the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal. ~ H. P. Luhn “A Business Intelligence System”, http://altaplana.com/ibmrd0204H.pdf ” “ Our goal is analytics as a capability, not a technology
  • 58. © Third Nature Inc. B I
  • 59. © Third Nature Inc. The old problem was access, the new problem is analysis
  • 60. CopyrightThird Nature,Inc. Three constituencies Stakeholder Analyst Builder aka the recipient aka the data scientist aka the engineer
  • 61. CopyrightThird Nature,Inc. Starting points for analytics strategy Many organizations choose to start with the analysts. Create a data science team. Turn them loose to find a problem. Many more start with builders: technology solutions looking for problems, e.g. 65% of the IT driven Hadoop and Spark projects over the last five years. The right place to start? Stakeholders.The goal to achieve,the problem to solve.
  • 62. CopyrightThird Nature,Inc. WHAT IS THE CONTEXT OF USE?
  • 63. There is an extensive list of requirements to support Primary requirementsneededby constituents S D E Data catalogand ability to search it for datasets X X Self-service accessto curateddata X Self-service accessto uncurated(unknown, new) data X X Temporary storagefor working with data X Data integration,cleaning,transformation, preparation toolsandenvironment X X Persistentstoragefor source dataused by productionmodels X X Persistentstoragefor training,testing,production data used by models X X Storageand management of models X X Deployment, monitoring, decommissioning models X Lineage, traceabilityof changes made for data used by models X X Lineage, traceabilityfor model changes X X X Managingbaseline data / metrics for comparing model performance X X X Managingongoing data / metrics for trackingongoing model performance X X X S = stakeholder, user, D = data scientist, analyst, E = engineer, developer
  • 64. © Third Nature Inc. How did we get to this state with BI & analytics? There’s a difference between having no past and actively rejecting it.
  • 65. 67 100 1,000 10,000 1 2 3 4 5 6 7 8 9 10 Customer Data Plateaus Add users, accumulate history, add attributes, add dimensions Entirely new data set enabled by new technology at new price point – value has to exceed cost (ROI). Earliest adopters have vision ahead of ROI
  • 66. 68 1 10 100 1,000 10,000 100,000 1,000,000 10,000,000 1 2 3 4 5 6 7 8 9 10 11 12 13 User Adoption of New Data Customer Data Users - Data Set 1 Users - Data Set 2 Users of Integrated Data User adoption of new data sets starts over. Very small number of experts growing to wider audience, sophisticated users moving to business analysts, then business users, B2B customers and even to consumers Greater value is derived when data sets are linked – see bigger picture (eg who buys what, when). Comes after initial extractionof easy value from standalonedata
  • 67. 69 • <Store, Item, Week> • <Store, Item, Day> – Simple aggregations • Market Basket – Affinity – Link to person, demographics, HR • Inventoryby SKU by store – Temporal, time series, forecasting – Link to product, marketing, market basket Retail Plateaus 2B records total for 9 quarters 2B records per day, keep 9 quarters
  • 68. 70 • Web Logs and traffic – Behavioral patterns– eg path linked to person,offers,other channels – Operationsof the web site • Supply chain sensors – sampled at major event – Activity Based Costing – Link to customer, product,HR, planning • Social Media – Text analysis, Filtering, languages – Link to customer, sales,other channelinteractions • Supply chain sensors – sampled at minutes or seconds – Telematics – Real time, Eventdetection,trending,static and dynamicrules – Link to HR, thresholds, forecasts,routing,planning Retail Plateaus 30B records per day
  • 69. 71 1 10 100 1,000 10,000 100,000 1,000,000 10,000,000 100,000,000 1,000,000,000 10,000,000,000 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 Data Size Plateaus over Time Lather, Rinse, Repeat. Trend line is roughly Moore’s Law. Delay or skip a generation if new data set is two orders of magnitude instead of one.
  • 70. 72 1 10 100 1,000 10,000 100,000 1,000,000 10,000,000 100,000,000 1,000,000,000 10,000,000,000 100,000,000,000 1,000,000,000,000 10,000,000,000,000 100,000,000,000,000 1,000,000,000,000,000 10,000,000,000,000,000 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 Customer Data Size vs. 2x per 12mo Customer Data Moore's Law - 2x/18mo 2x/12 mo But all studies show demand/creation of data increasing significantly faster than Moore’s law If we started only 1 OOM behind, we are now 5 OOM away from user demand for data Data Exhaust
  • 71. 73 FINANCE Revenue Expenses Customers CUSTOMER CARE Customer Products Orders Case History SALES Orders Customers Products MARKETING Customers Orders Campaign History OPERATIONS Inventory Returns Manufacturing Supply Chain How many batteries are in inventory by plant? What is the trend of warranty costs? How many people made a warranty claim last week? How many sales have been made quarter to date? Which customers should get a communica tion on extended warranties? 54 32 29 49 66
  • 73. 75 Given the rise in warranty costs, isolate the problem to be a plant, then to a battery lot. Communicate with affected customers, who have not made a warranty claim on batteries, through Marketing and Customer Service channels to recall cars with affected batteries. 2855 Inventory Returns Manufacturing Supply Chain Customer Service Orders Revenue Expenses Case History Customers Products Pipeline Customers Campaign History FINANCE SALESMARKETING OPERATIONS CUSTOMER EXPERIENCE 2855
  • 74. 76 Manufacturing: Data Overlap Analysis New Business Improvement Opportunities through Data Leverage Sales Force Profitability Analysis 100% 80% 66% 24% 41% 66% 0% 24% 0% 24% Transportation Planning 11% 100% 87% 28% 64% 56% 22% 34% 13% 45% Production Planning 6% 57% 100% 19% 83% 35% 17% 28% 9% 40% Vendor Managed Inventory 12% 100% 100% 100% 78% 100% 61% 100% 39% 100% Global Pricing Rationalization 4% 50% 100% 17% 100% 27% 15% 29% 11% 41% Fulfillment (Perfect Order) 16% 94% 90% 48% 58% 100% 35% 53% 19% 56% Manufacturing Quality Optimization 0% 76% 88% 60% 66% 71% 100% 79% 43% 73% Preventative Maintenance Analysis 7% 75% 93% 61% 80% 68% 50% 100% 27% 82% Warranty Claims Analysis 0% 94% 100% 83% 100% 83% 94% 93% 100% 98% Quality Life Cycle Improvement 4% 57% 75% 35% 64% 41% 26% 47% 16% 100% SalesForce ProfitabilityAnalysis ProductionPlanning VendorManaged Inventory GlobalPricing Rationalization TransportationPlanning Fulfillment (PerfectOrder) ManufacturingQuality Optimization Preventative Maintenance Analysis WarrantyClaims Analysis QualityLifeCycle Improvement If Then
  • 75.
  • 76. 78 PRODUCT SENSOR SOCIAL MEDIA CUSTOMER CARE AUDIO RECORDINGS DIGITAL ADVERTISING CLICKSTREAM 65 41 32 19 28 How many visitors did we have to our hybrid cars microsite yesterday? What are the temperature readings for batteries by Manufactur er? What is the sentiment towards line of hybrid vehicles? Which customers likely expressed anger with customer care? Which ad creative generated the most clicks?
  • 77. 79 2855 SENSOR DIGITAL ADVERTISING CLICKSTREAM INTERACTIONS RATINGS & REVIEWS CUSTOMER PORTAL INTERACTIONS EXTERNAL INTERACTIONS SOCIAL MEDIA IVR Routing RFID ELECTRONIC COMMERCE FINANCE SALESMARKETING Inventory Returns Manufacturing Supply Chain Customer Service Orders Revenue Expenses Case History Customers Products PipelineCustomers Campaign History OPERATIONS CUSTOMER CARE – AUDIO RECORDINGS Maps Telemetry SERVER LOGS CUSTOMER EXPERIENCE Enterprise Data
  • 79. 81 2855 High, Well Known Quality Directional Quality Unknown/Low quality Well Defined Data Model Curated JSON, XML, DB Extracted Attributes Curation Required
  • 80. 82 2855 Out of Deviation Sensor Readings Fraud Events CUSTOMER EXPERIENCE SALES MARKETING OPERATIONS Abandoned Carts Online Price Quotes Social Media Influencers FINANCE 11234 Minimum Viable Curation Minimum Viable Data Quality
  • 81. 83 2855 Out of Deviation Sensor Readings Fraud Events CUSTOMER EXPERIENCE SALES MARKETING OPERATIONS Abandoned Carts Online Price Quotes SocialMedia Influencers FINANCE 11234 Sensor data formatting Unit Normalization Vehicle/version normalization Make/model/year selection Data Comprehension, Pipelines
  • 82. 84 2855 Enterprise Wide Use 10s to 100s of Users 10s Users Analysts / engineers Business Analysts Action list, Report, Dashboard Users Share across Enterprise Share in department Share over cube wall User Base and Sharing
  • 84. 86 Evolving Consumption Requirements 2855 Production tools, curated data, integrated across business areas Wide variety of tools and data forms Targeted data access, many applications, response time SLAs High CPU, moderate to low IO Bulk scans, large computation, transformation, data access, specific integration Moderate CPU, High IO, Resource management
  • 85. 87 MARKETING FINANCE SALES OPERATIONS CUSTOMER EXPERIENCE Given the rise in warranty costs, isolate the problem to be a plant and the specific lot. Exclude 2/3rd of the batteries fromthe lot that are fine. Communicate with affected customers, who havenot made a warranty claim, through Marketing and Customer Service channels to recall cars with affected batteries. 2855 MANUFACTURING CAMPAIGN HISTORY COSTS PRODUCTS CUSTOMERS CASE HISTORY SENSOR Access Wide Variety of Data to Answer a Question
  • 86. Copyright Third Nature, Inc. DATA ARCHITECTURE We’re so focused on the light switch that we’re not talking about the light
  • 87. CopyrightThird Nature,Inc. History: This is how BI was done through the 80s First there were files and reporting programs. Application files feed through a data processing pipeline to generate an output file. The file is used by a report formatter for print/screen. Files are largely single-purpose use. Every report is a program written by a developer. Data pipeline code
  • 88. CopyrightThird Nature,Inc. History: This is how BI ended the 80s The inevitable situation was... Data pipeline code
  • 89. CopyrightThird Nature,Inc. History: This is how we started the 90s Collect data in a database. Queries replaced a LOT of application code because much was just joins. We learned about “dead code” SQL SQL SQL SQL SQL
  • 90. CopyrightThird Nature,Inc. Pragmatism and Data Lessons learned during the ad-hoc SQL era of the DW market: When the technology is awkward for the users, the users will stop trying to use it. Even “simple” schemas weren’t enough for anyone other than analysts and their Brio… Led to the evolution of metadata-driven SQL- generating BI tools, ETL tools.
  • 91. CopyrightThird Nature,Inc. BI evolved to hiding query generation for end users With more regular schema models, in particular dimensional models that didn’t contain cyclic join paths, it was possible to automate SQL generation via semantic mapping layers. We developed data pipeline building tools (ETL). Query via business terms made BI usable by non-technical people. ETL SQL Life got much easier…for a while
  • 92. CopyrightThird Nature,Inc. Today’s model: Lake + data engineers, looks familiar… The Lake with data pipelines to files or Hive tables is exactly the same pattern as the COBOL batch.. Dataflow code We already know that people don’t scale…
  • 93. Copyright Third Nature, Inc. The architecture from 1988 we SHOULD HAVE BEEN USING The general concept of a separate architecture for BI has been around longer, but this paper by Devlin and Murphy is the first formal data warehouse architecture and definition published. 95 “An architecture for a business and information system”, B. A. Devlin, P. T. Murphy, IBM Systems Journal, Vol.27, No. 1, (1988) Slide 95Copyright Third Nature, Inc.
  • 94. But 30 years ago we did not expect so many different models of deployment,execution and use. Needs change DeployETL Data Data Storage Alerts / Reports/ Decisioning Deploy f Data Streams Intelligent Filter / Transform Model Execution Analytics for eyeballs and analytics for machines are different
  • 95. Copyright Third Nature, Inc. Decouple the Architecture The core of a data warehouse isn’t the database, it’s the data architecture that the database and tools implement. We need a new data architecture that is not limiting: ▪ Deals with change more easily and at scale ▪ Does not enforce requirements and models up front ▪ Does not limit the format or structure of data ▪ Assumes the range of data latencies in and out, from streaming to one-time bulk ▪ Allows both reading and writing of data ▪ Makes data linkable, and provide governance where required ▪ Does not give up the gains of the last 25 years
  • 96. Copyright Third Nature, Inc. The goal is to decouple: solve the application and infrastructure problems separately, independently Data Acquisition Collect & Store Incremental Batch One-time copy Real time Platform Services Data storage Data arrives in many latencies, from real-time to one-time. Acquisition can’t be limited by the management or consumption layers. Data acquisition should not be directly tied to the needs of consumption. It must operate independently of data use.
  • 97. Copyright Third Nature, Inc. Data hoarding is not a data management strategy
  • 98. Copyright Third Nature, Inc. The goal is to decouple: solve the application and infrastructure problems separately, independently Platform Services Data Management Process & Integrate Data storage Data management should not be subject to the constraints of a single use Data management has historically been blended with both data acquisition and structuring data for client tools. It should be an independent function.
  • 99. Copyright Third Nature, Inc. The goal is to decouple: solve the application and infrastructure problems separately, independently Platform Services Data Access Deliver & Use Data storage This separates uses of data from each other, allowingeach type of use to structure the data specific to its own requirements. Data access is already somewhat separate today. Make the separation of different access methods a formal part of the architecture. Don’t force one model.
  • 100. Copyright Third Nature, Inc. The full analytic environment subsumes all the functions of a data lake and a data warehouse, and extends them Data Acquisition Collect & Store Incremental Batch One-time copy Real time Platform Services Data Management Process & Integrate Data Access Deliver & Use Data storage The platform has to do more than serve queries; it has to be read-write.
  • 101. Copyright Third Nature, Inc. The data architecture must align with system components because each of them addresses different data needs Incremental Collect Batch One-time copy Real time Manage & Integrate Data Acquisition Collect & Store Data Management Process & Integrate Data Access Deliver & Use Separating concerns is part of the mechanism for change isolation
  • 102. Copyright Third Nature, Inc. Divide the data architecture to address three different goals Collection Creation, collection, storage of new data Distribution Organization and provisioning of data to multiple points of use Consumption Direct support of data use Separation of concerns, coordination of process Collect Curate Consume
  • 103. CopyrightThird Nature,Inc. The design focus is different in each area Ingredients Goal: available User needs a recipe in order to make use of the data. Pre-mixed Goal: discoverable and integrateable User needs a menu to choose from the data available Meals Goal: usable User needs utensils but is given a finished meal
  • 104. CopyrightThird Nature,Inc. Food supply chain: an analogy for analytic data Multiple contexts of use, differing quality levels You need to keep the original because just like baking, you can’t unmake dough once it’s mixed.
  • 105. CopyrightThird Nature,Inc. Data has to be moved, standardized, tracked There is a lot of data policy and governance to think about Collect Manage & Integrate Data Acquisition Collect & Store Data Management Process & Integrate Data Access Deliver & Use CRM User reg Orders Leads Cust Cust Cust Cust Master DW Mktg Campaign analysis Rec engine
  • 106. CopyrightThird Nature,Inc. Data curation The problem with so many sources, types, formats and latencies of data is that it is now impossible to create one model for all of it in advance. Data modeling is about the inside of a dataset. Curation is about the set. Data curation, rather than data modeling, is becoming the more important data management practice.
  • 107. Copyright Third Nature, Inc. The missing ingredient from most data projects Specifically, metadata kept separate from the data.
  • 108. CopyrightThird Nature,Inc. The data is in zones of management, not isolating layers Raw data in an immutable storage area Standardized or enhanced data Common or usage- specific data Transient data Relax control to enable self-service while avoiding a mess. Do not constrain access to one zone or to a single tool. Focus on visibility of data use, not control of data.
  • 109. CopyrightThird Nature,Inc. This data architecture resolves rate of change problems Raw data in an immutable storage area Standardized or enhanced data Common or usage- specific data Transient data New data of unknown value, simple requests for new data can land here first, with little work by IT. More effort applied to management, slower. Optimized for specific uses / workloads. Generally the slowest change. Not fast vs slow: fast vs right Not flexibility vs control: flexibility vs repeatability Agile for structure change vs agile for questions / use
  • 110. Copyright Third Nature, Inc. The concept of a zone is not a physical system. It’s data architecture Apps DW The biggest decision is to separate all data collection from the data integration from consumption. Physical system/technology overlays are separate, depend on the specific use cases and needs of the organization. Dist FS RDBMSs Decompress & process device logs RDBMS RDBMS multiple Discovery platform
  • 111. Copyright Third Nature, Inc. Data has to be moved, standardized, tracked There is a lot of data policy and governance to think about Collect Manage & Integrate Data Acquisition Collect & Store Data Management Process & Integrate Data Access Deliver & Use CRM User reg Orders Leads Cust Cust Cust Cust Master DW Mktg Campaign analysis Rec engine
  • 112. Copyright Third Nature, Inc. HOW DO YOU EVOLVE WHAT YOU HAVE NOW?
  • 113. Copyright Third Nature, Inc. Agile methods without agile architectures fail
  • 114. 119 Process and governance are part of architecture Data policy and organizational model go together. Policy choices are like land use policies. Want free access to water? Ribbon farms in a capitalist world, domains in a feudal manor.
  • 115. Copyright Third Nature, Inc. Reinforcing relationships resist change, despite radical technology and practice shifts Note how only one third is tech Architectural Regime MethodologyTechnology Organization Organization defines wherethe work is done and the roles. Technology defines what work can be done in a given area. Methodology defines how work is done and what that work is. Slide 120
  • 116. Copyright Third Nature, Inc. What about the technology? Do I need an <X>?
  • 117. 122 “Technology first” thinking creates implementation problems If you start with technology,it will constrain the problems you can solve
  • 118. 123 Blended Architectures Are a Requirement, Not an Option Data Warehouse + Data Lake On Premise + Cloud RDBMS + S3 + HDFS Commercial + Open Source You can’t just buy one thing platform from one vendor. We aren’t building a death star. Each of the zones is likely to have products specific to that zone’s usage. The uses differ, the people using them differ, shouldn’t the tools should differ too?
  • 119. Copyright Third Nature, Inc. In other words… Software is like puppies. Getting a puppy is easy, raising one is hard. “The short term benefits of using a new [type of] database exceed the long term cost of operating it.” Dan Mckinley
  • 120. Copyright Third Nature, Inc. Choose Boring Technology You only get so many chances to make big changes. Don’t waste them. You can spend time focusing on the goal and worry less about the known tech, or you can spend time learning the new tech but less time focusing on the goal. The important thing is not the choice of tech, it’s knowing when the time is right to make a new tech choice.
  • 121. CopyrightThird Nature,Inc. TANSTAAFL When replacing the old with the new (or ignoring the new over the old) you always make tradeoffs, and usually you won’t see them for a long time. Technologiesare not perfect replacementsfor one another. Often not better, only different.
  • 122. © Third Nature Inc. Manage your data (or it will manage you) Data management is where developers are weakest. Modern engineering practices are where data management is weakest. You need to bridge these groups and practices in the organization if you want to do meaningful work with data. Remember Conway’s Law when you build.
  • 123. © Third Nature Inc. “Begin with the end in mind” The starting point can’t be with technology. That’s like starting with bricks when designing a house. You may get lucky but… The goals and specific uses are the place to start ▪ Use dictates need ▪ Need dictates capabilities ▪ Capabilities are solved with technology This is how you avoid spending $20M on a new cluster in order to serve data to analysts whose primary requirements aremet with laptops.
  • 124. © Third Nature Inc. A good design is not the one that correctly predicts the future, it’s one that makes adapting to the future affordable. — Venkat Subramaniam
  • 125. Mark Madsen is the global head of architectureat Think Big Analytics, Prior to that he was president of Third Nature, a research and consulting firm focused on analytics, data integration and data management.Mark is an award-winning author, architectand CTO whose work has been featured in numerous industry publications.Over the past ten years Mark received awards for his work from the American Productivity& Quality Center, TDWI, and the Smithsonian Institute.He is an international speaker, chairs several conferences, and is on the O’Reilly Strataprogram committee. For more information or to contactMark, follow @markmadsen on Twitter. About the Presenter
  • 126. Todd Walter Chief Technologist - Teradata • Chief Technologist for Teradata • A pragmatic visionary, Walter helps business leaders, analysts and technologists better understand all of the astonishing possibilities of big data and analytics • Works with organizations of all sizes and levels of experience at the leading edge of adopting big data, data warehouse and analytics technologies • With Teradata for more than 30 years and served for more than 10 years as CTO of Teradata Labs, contributing significantly to Teradata’s unique design features and functionality • Holds more than a dozen Teradata patents and is a Teradata Fellow in recognition of his long record of technical innovation and contribution to the company