Within the last 6 months, U.S. agencies have begun defining a “Data Science Occupational Series”.
This means adding the term “(Data Scientist)” at the end of a job title to increase the odds of finding a candidate who understands data.
Watch the full presentation: https://resources.tamr.com/govdataops
DataOps @ Scale: A Modern Framework for Data Management in the Public Sector
1. DataOps @ Scale: A Modern Framework for Data Management in the Public Sector
Mark Marinelli, Head of Product, Tamr
May, 2020
2. Confidential – Tamr, Inc.
Agenda
● How DataOps Began
● A DataOps Framework
● DataOps in the Real World
○ DHS
○ Air Force
● Getting Started
● Q&A
3. About the Speakers
Mark Marinelli
Head of Product at Tamr
Katie Everett
Marketing Manager at Tamr (Moderator)
4. How DataOps Got Started & What Actually is ‘DataOps’?
5. Modern internet companies have data advantages
● Unified Data Portal
● Greenfield Infrastructure & High End Talent Pool
6. Traditional orgs have significant “legacy drag coefficient”
Problem: Thousands of systems generating data every day that were built over decades to support business processes.
Result: “Random Data Salad”
Data debt from constant change/entropy:
● Restructuring
● Leadership Changes
● Politics
● Dynamic Schema DBs - Mongo et al
● “Data Hoarding”
● Legacy Burden
● M&A
Data Silos created due to:
● Security concerns
● Organizational mismatch
Consequences:
1. Too much time spent on data prep vs. analysis / action
2. High failure rate of BI / analytics projects
3. Game changing initiatives deemed ‘impossible’ and never start
8. Today: we have data scientists! (and want to do cool AI stuff)
9. Just not yet in the government… but it’s growing
● Within the last 6 months, U.S. agencies have begun defining a “Data Science Occupational Series”.
● This means adding the term “(Data Scientist)” at the end of a job title to increase the odds of finding a candidate who understands data.
10.
“How do we take the data that we have — which is ubiquitous and it’s incredible across the
federal government — understand it, be able to leverage it at every step in the chain.”
- Deputy Federal CIO, Margie Graves
11. Human/behavioral challenges don’t help
● Afraid to share data - Due to organizational policies and security levels
● Hoarding data - A method of organizational control or job preservation
● Obscuring data complexity - Failure to embrace the complexity, diversity, and idiosyncrasy of data generated in a large organization
● Limiting access to a small number of users - A method of control or a reflection of insecurity about data quality
12. This is a solved problem
Waterfall Approach
● Top-down: Architects drive the spec
● Monolithic: Single application
● A Priori Modeling: Front-loaded view of all components
● Quality Assurance: Manual QA
● Traditional SDLC: Dev/test/prod → major/minor release
Agile Development
● Bottom-Up: Users drive the spec
● Distributed: Loosely coupled, scalable
● Learn from Use: Emergent feature set
● Continuous Integration: Automated testing
● Modern DevOps: Continuous delivery
Just as DevOps drove rapid delivery of high-quality, scalable software applications, DataOps is the path forward for data applications.
13. What is DataOps? = Modern data engineering practice
DataOps is an automated, process-oriented methodology, used by analytic and data teams to improve the quality and reduce the cycle time of data analytics.
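One way to read "automated" in this definition is that data checks run on every pipeline run, much as unit tests run on every software build. The following is a minimal Python sketch of that idea. The record layout, the check names, and the sample rows are all hypothetical and do not come from the presentation.

```python
from typing import Callable

Record = dict[str, object]
Check = Callable[[list[Record]], bool]


def no_missing_ids(rows: list[Record]) -> bool:
    """Every record carries a non-empty 'id' field."""
    return all(row.get("id") not in (None, "") for row in rows)


def ids_are_unique(rows: list[Record]) -> bool:
    """No two records share the same 'id'."""
    ids = [row.get("id") for row in rows]
    return len(ids) == len(set(ids))


def run_checks(rows: list[Record], checks: list[Check]) -> dict[str, bool]:
    """Run each check and report pass/fail keyed by the check's name."""
    return {check.__name__: check(rows) for check in checks}


# Hypothetical sample data: the second and third rows share an id.
rows = [
    {"id": "a1", "name": "Ada"},
    {"id": "b2", "name": "Bo"},
    {"id": "b2", "name": "Bob"},
]
report = run_checks(rows, [no_missing_ids, ids_are_unique])
print(report)
```

A pipeline could refuse to publish a dataset whenever any entry in the report is false, which is one simple way to tie quality to cycle time.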
14. A DataOps Framework: Process, Technology, Organization
15. DataOps framework components
Process
● Agile - Incremental delivery method
Technology
● Architecture - Tools which comprise data supply chain
● Infrastructure - Platform to support architecture
Organization
● Roles - Division of labor across mixed-skill teams
● Structure - Working model for projects across technical and business teams
16. Process - The Wrong Way
● Labor-intensive
● Monolithic
● IT driven
[Diagram: data from Sources (Internal Tabular Data, External Tabular Data) passes through Modeling, Rules, and Testing before Delivery to Consumers (Business Users, Analysts, Data Scientists, Developers). Chart axes: Remaining Work vs. Time.]
17. Process - The Right Way
● Automated
● Incremental
● Collaborative
[Diagram: Sources (Internal Tabular Data, External Tabular Data) feeding Consumers (Business Users, Analysts, Data Scientists, Developers). Chart axes: Remaining Work vs. Time.]
18. Why Use Human-Guided Machine Learning to Master Data
Before: Data Scientists spent months and 100% of energy preparing data.
Today: ML can do 80% of the data mastering lift, enabling Data Scientists to put final touches on the last 20%.
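The split between machine work and human work can be pictured as confidence-based routing: the model decides the pairs it is sure about and queues the rest for an expert. This is a minimal Python sketch under stated assumptions. The thresholds, pair ids, and scores are hypothetical, and nothing here describes how Tamr's product works.

```python
def route_pairs(scored_pairs, accept_at=0.9, reject_at=0.1):
    """Split candidate record pairs by model confidence.

    scored_pairs: iterable of (pair_id, match_probability) tuples.
    Confident pairs are decided automatically; the uncertain
    middle band is queued for a human expert to label.
    """
    auto_match, auto_distinct, needs_review = [], [], []
    for pair_id, prob in scored_pairs:
        if prob >= accept_at:
            auto_match.append(pair_id)
        elif prob <= reject_at:
            auto_distinct.append(pair_id)
        else:
            needs_review.append(pair_id)
    return auto_match, auto_distinct, needs_review


# Hypothetical scores from some upstream matching model.
scores = [("p1", 0.97), ("p2", 0.03), ("p3", 0.55), ("p4", 0.92), ("p5", 0.08)]
matches, distinct, review = route_pairs(scores)
print(matches, distinct, review)
```

Labels collected from the review queue would typically be fed back into model training, so the uncertain band should shrink over successive iterations.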
19. The DataOps Component Architecture
[Diagram: data flows from Sources (Internal Tabular Data, External Tabular Data) to Consumers (Business Users, Analysts, Data Scientists, Developers) through these components: Catalog & Crawling, Movement & Automation, Mastering & Quality, Publishing & Versioning, Storage & Compute, Governance & Policy, and Feedback & Usage.]
20. Technology - Architectural Principles
● Scale Out/Distributed
○ Cloud First
● Collaborative (Humans at the Core)
○ Highly Automated - automate whenever possible
○ Bi-Directional (Feedback)
● Open/Best of Breed (not one platform/vendor)
○ Service Oriented (clear endpoints for data)
○ Loosely Coupled (RESTful interfaces, table(s) in/out)
● Continuous (assume data will change)
○ Both aggregated AND federated storage
○ Both batch AND streaming
● Lineage/Provenance is essential
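Two of these principles, "table(s) in/out" and lineage, can be illustrated together: each step takes a table and returns a new table, and the list of steps applied travels with the data. This is a minimal Python sketch, assuming hypothetical names (`Table`, `apply_step`, the step functions, and the `source:hr_export` label) that are not part of the presentation.

```python
from dataclasses import dataclass, field
from typing import Callable

Row = dict[str, str]


@dataclass
class Table:
    """Rows plus the ordered list of steps that produced them."""
    rows: list[Row]
    lineage: list[str] = field(default_factory=list)


def apply_step(table: Table, name: str, fn: Callable[[list[Row]], list[Row]]) -> Table:
    """Run one table-in/table-out step and append its name to the lineage."""
    return Table(rows=fn(table.rows), lineage=table.lineage + [name])


def trim_names(rows: list[Row]) -> list[Row]:
    """Strip surrounding whitespace from the 'name' field."""
    return [{**r, "name": r["name"].strip()} for r in rows]


def drop_blank_names(rows: list[Row]) -> list[Row]:
    """Remove rows whose 'name' field is empty."""
    return [r for r in rows if r["name"]]


source = Table(rows=[{"name": " Ada "}, {"name": "   "}], lineage=["source:hr_export"])
trimmed = apply_step(source, "trim_names", trim_names)
cleaned = apply_step(trimmed, "drop_blank_names", drop_blank_names)
print(cleaned.rows, cleaned.lineage)
```

Because every step shares the same table-in/table-out shape, any one of them could be swapped for a different tool or vendor without changing its neighbors, which is the point of loose coupling.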
21. Infrastructure - Key Components
Infrastructure (Cloud & On-Prem)
● Management
● Compute
● Search
● Storage
22. Organization - Roles
● Data Suppliers: CIO, Source Owner, DBA, IT Professional
● Data Preparers: CDO, Data Engineer, Curator, Steward
● Data Consumers: Business Owners / CxOs, Business Users, Analysts, Data Scientists, Developers
23. Organization - Roles
Consumers
● Business Users
○ Goals: Use data to make business decisions
○ Tools: Viz, CRM, Excel, PowerPoint, Word, Web Search
● Analyst
○ Goals: Deliver insights to the business, typically through dashboards and reports
○ Tools: Viz, Excel, SSDP, Web Search
● Data Scientist
○ Goals: Deliver insights to the business, typically through models and algorithms
○ Tools: R, Python, SAS, SSDP
● Developer
○ Goals: Build applications which leverage corporate data
○ Tools: Python, Java, JS, SQL, REST
Preparers
● Data Engineer
○ Goals: Deliver and manage data pipelines
○ Tools: ETL, SQL
● Curator
○ Goals: Ensure consumers have the data they need, in the form they need it
○ Tools: MDM, Catalog
● Steward
○ Goals: Create policies and drive governance
○ Tools: MDM, Catalog, Governance
Suppliers
● Source Owner
○ Goals: Define and manage purpose, processes (data creation, consumption) & users (i.e., access) of the data source
○ Tools: EDW, SQL, ERWin, LDAP, SAP
24. Organization - Structure
Appropriate model will fluctuate with scale of DataOps project work
Shared Services Model
Full-service development of data applications, in collaboration with business
Advantages
● Centralized technical knowledge
● Centralized resourcing - one-stop shop
● Accretive experience
Disadvantages
● Bandwidth contention - how to prioritize competing projects?
Advisory Model
Bootstraps projects with best of breed tools and approach, but does not complete them
Advantages
● Centralized technical knowledge
● Minimal resourcing - experts, not implementers
● Flexibility - options to deviate from standard tools
Disadvantages
● Resource burden is on each project / department - both in development and ongoing maintenance
● Limited feedback - does the advice get better after each project?
26. Global Travel Assessment System (GTAS)
U.S. Customs and Border Protection (CBP) developed GTAS - an open source application providing nation-states and border security entities the capacity to screen persons against risk criteria for threat prevention, public health, or a variety of other use cases.
DHS selected Tamr to provide improved entity resolution capability due to its fast, scalable, human-guided machine learning-based approach.
GTAS: CBP’s Passenger Data Screening and Analysis System
● Receive and store air traveler reservation and manifest data
● Perform real time risk assessment
● Manage risk criteria and watch lists
● View high risk travelers with associated flight details and reservation information
● Query flight and travel history
Risk Criteria
● Human trafficker profile match
● Terrorist database match
● Recent travel in pandemic affected area
[Diagram labels: Passenger Manifest, Travel Reservation, Intergovernmental INTERPOL Exchanges]
27. DHS Case Study cont. - 4 Phases to Deployment
Phase 1 - Accuracy: Improve accuracy of entity resolution with biographical and reservation data
Phase 2 - Automation: Build data pipeline and automate data movement and system controls
Phase 3 - Performance: Optimize performance through a new low latency match service and reliable, robust communications
Phase 4 - Interoperability: Ensure stability for deployment into variety of environments
Results
● Leverage all data
● Label examples of matching/distinct
● Measure precision and recall
● Iterate & optimize
● Prepare, ingest, export data
● Triggering, data exchanges and error handling
● Security and authorization
● Optimize models for risk, timing, cultural patterns, etc.
● Create low latency match capability
● Measure and iterate end-to-end latency
● Create documentation and communication channels for international support
● Installation, testing and sustainment
● Advanced feature offerings
Timeline (Phases 1-4): 2017, 2017, 2018, 2019
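The accuracy phase mentions labeling matching/distinct examples and measuring precision and recall. For entity resolution these are often computed over record pairs. This is a minimal Python sketch, with hypothetical record ids, and with the simplifying assumption that any predicted pair absent from the human-labeled matches counts as a false positive.

```python
def precision_recall(
    predicted: set[frozenset[str]], labeled: set[frozenset[str]]
) -> tuple[float, float]:
    """Pairwise precision and recall for entity resolution.

    predicted: record pairs the model says are the same entity.
    labeled:   record pairs a human labeled as the same entity.
    """
    true_positives = len(predicted & labeled)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(labeled) if labeled else 0.0
    return precision, recall


def pair(a: str, b: str) -> frozenset[str]:
    """Order-independent pair of record ids."""
    return frozenset((a, b))


# Hypothetical labels and predictions.
labeled = {pair("r1", "r2"), pair("r3", "r4"), pair("r5", "r6"), pair("r7", "r8")}
predicted = {pair("r2", "r1"), pair("r3", "r4"), pair("r5", "r9")}
p, r = precision_recall(predicted, labeled)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the three predicted pairs are correct and two of the four labeled matches are found. Tracking both numbers across labeling rounds gives the "iterate & optimize" step something concrete to improve.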
28. DHS Case Study
“When we were looking for companies, Tamr fit our bill perfectly. They were interested in the mission, they understood what we were trying to do and why it was important to international security, and they demonstrated the capacity to execute at a commercial level.”
Ari Schuler, Advisor
Office of the Commissioner
U.S. Customs and Border Protection
Recent Tamr recognition by DHS
● Science & Technology-Funded 2019 Performer Award: Crossing the Valley of Death
■ This honor is awarded to an effort that has a technical transition success story.
● DHS Silicon Valley Innovation Program: Snapshot Article
■ The article highlights the transition of Tamr technology to CBP, to be released January 2019 via Meltwater and GovDelivery
29. Case Study: US Air Force
Semi-Automated Aircraft Wing Flutter (vibration) Analysis
Technical Challenge
● Use ML to understand a large corpus (30 yrs worth) of past testing, simulations, and analyses
● Automate large portions of the process to predict aircraft “flutter”
Business Challenge
● Deliver on big data initiative, enable end users easy search of historic data
● Present SMEs (PhD engineers) with short list of relevant antecedents and flutter predictions
Technical Outcome
● Tamr tagged 35K files with 645K descriptive labels in 22 tag types (aircraft, stores, author, etc.)
● Automatically create recommendation based on a machine learning model built on the historical data
Business Outcome
● Users quickly interrogate decades worth of technical data via rich metadata
● Reduce engineer process time dramatically by identifying relevant antecedents and technical predictions
30. Accelerating Subject Matter Expert Recommendations
INPUT
● New Config. Request
AUTOMATED ANALYSIS
● Metadata extraction
● Machine learning models for each discipline
AUTOMATED OUTPUTS
● Relevant documents
● Discipline-specific recommendations
PRODUCTIVITY TOOLS
● Document browsing powered by clean, consistent metadata
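Document browsing driven by metadata can be pictured as an index from tags to files. This is a minimal Python sketch of such an index. The file names and tag values are hypothetical, each file is assumed to carry one value per tag type, and this is not a description of the system the Air Force deployed.

```python
from collections import defaultdict


def build_tag_index(files: dict[str, dict[str, str]]) -> dict[tuple[str, str], set[str]]:
    """Map each (tag_type, value) pair to the set of files carrying it."""
    index: dict[tuple[str, str], set[str]] = defaultdict(set)
    for filename, tags in files.items():
        for tag_type, value in tags.items():
            index[(tag_type, value)].add(filename)
    return index


def find_files(index: dict[tuple[str, str], set[str]], **criteria: str) -> set[str]:
    """Return the files that match every tag_type=value criterion."""
    matches = [index.get((tag_type, value), set()) for tag_type, value in criteria.items()]
    return set.intersection(*matches) if matches else set()


# Hypothetical tagged files; tag types echo the slide (aircraft, author).
files = {
    "report_001.pdf": {"aircraft": "aircraft_a", "author": "smith"},
    "report_002.pdf": {"aircraft": "aircraft_a", "author": "lee"},
    "report_003.pdf": {"aircraft": "aircraft_b", "author": "smith"},
}
index = build_tag_index(files)
print(find_files(index, aircraft="aircraft_a", author="smith"))
```

The search is only as good as the tags, which is why consistent metadata extraction comes before browsing in the flow above.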
32. Getting started - Process
Agile is the key
● If not already there, choose a model that works (Scrum, SAFe)
Inventory the set of available projects
● Score on availability of data vs. value of solving a problem
Define a high-value, data-rich project that will demand a complex solution
● Forcing function to ensure end-to-end functionality will be covered
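Scoring projects on data availability versus value can be as simple as a ranked list. This is a minimal Python sketch, assuming a hypothetical 1-to-5 scale for each dimension and a product as the combined score. The presentation does not prescribe a scoring formula, and the project names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    data_availability: int  # 1 (little usable data) to 5 (data ready to use)
    business_value: int     # 1 (low value) to 5 (high value)


def rank_projects(projects: list[Project]) -> list[Project]:
    """Order projects so that data-rich, high-value ones come first."""
    return sorted(
        projects,
        key=lambda p: p.data_availability * p.business_value,
        reverse=True,
    )


# Hypothetical candidate projects.
candidates = [
    Project("Office supply reporting", data_availability=5, business_value=1),
    Project("Vendor deduplication", data_availability=4, business_value=5),
    Project("Fleet maintenance forecasting", data_availability=2, business_value=5),
]
ranked = rank_projects(candidates)
print([p.name for p in ranked])
```

Multiplying the two scores penalizes projects that are weak on either dimension, so a high-value project with almost no data does not outrank one that can actually be delivered.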
33. Getting started - Technology
Identify path to a modern, modular service architecture
● Create blueprint for next generation data management platform
● Revisit cloud migration strategy
Inventory current tool set
● TCO / skill requirements / etc.
● Determine which should be replaced, and when this is viable
Decouple monolithic processes
● Wrap components in APIs, expose as services
Start building with new tech
● Choose subset of tools for proofs of concept to replace old tech
34. Getting started - Organization
Inventory current team
● Identify existing key roles - data engineers and their consumers
● Find best candidates for new roles - data curators and data stewards
Create cross-functional team
● Data consumers - will depend upon project
● Data Engineer(s)
● Curator
● Steward
Choose your operating model
● Start with Shared Services for first project
Ensure executive alignment
● CDO or equivalent
35. What NOT to do
● Avoid boil-the-ocean “waterfall” efforts (projects measured in years/quarters)
○ Build rational long-term infra while delivering real analytic value along the way
● Single “Platform”: Don’t overestimate what a single piece of software can do
○ Focus on a thoughtfully designed ecosystem of loosely coupled best of breed tools
● Single Vendor: Don’t overestimate what a single vendor can do
○ Align vendors with APIs and expectations that they MUST work together
● Don’t underestimate the effort required to make FOSS work
○ Just because Google does it doesn’t mean you can do it
● Don’t underestimate human/behavioral challenges with data
○ Most often, the reasons that projects fail/stall are human/behavioral
36. Key DataOps Principles
Process
● Agile - Quick wins + incremental value delivery
Technology
● Architecture - Loosely-coupled best of breed components which incorporate automation + human feedback
● Infrastructure - Cloud-native, scalable and elastic tooling
Organization
● Roles - Specialization and separation of duties
● Structure - Centralized expertise + knowledge capture across projects
37. New Podcast Series - DataMasters
New podcast series!
Featured guests include:
● Nick Sinai - Deputy CTO in the Obama Administration
● Eric Iverson - Former CIO at Sony
● And more data leaders...
Listen today on Spotify, Apple Podcasts and Google Podcasts!
https://www.tamr.com/datamasters/
38. Questions?
Contact Michael Gormey with additional questions after the webinar:
michael.gormey@tamr.com
Editor's Notes
Thank you all for joining us today. We are thrilled to put on today’s webinar, DataOps at Scale, a modern framework for data management in the public sector.
My name is Katie Everett, and I’m the public sector marketing manager at Tamr and will serve as today’s moderator.
In the webinar, we’ll review how DataOps began, what DataOps is and the components that go into a successful framework.
Additionally, we’ll give you real life examples of DataOps successfully deployed in the public sector as well as actionable steps on how you can get started.
We’ll leave some time at the end for Q&A.
I’d like to introduce today’s speaker who will take us through the webinar, Mark Marinelli.
Mark is the Head of Product at Tamr and is a 20-year veteran of Enterprise Data Management and Analytics software. He is well versed in coaching companies through deploying DataOps at scale across multiple industries. Mark has held engineering, product management, and technology strategy roles at Lucent Technologies, Macrovision, and most recently at Lavastorm, where he was Chief Technology Officer.
So, over to you Mark.
Traditional organizations manage data from their business systems more as “exhaust” than as an “asset”, which leads to “significant data debt”
Heavy shortage of data scientists
Rush to fill the gap
Companies started filling the gaps… rapidly scooping up data talent
As a Chief Data Officer begins to tackle the human/behavioral challenges, they also need to begin establishing their next generation technical infrastructure. Having worked with dozens of Global 2000 customers on their data/analytics initiatives at Tamr, we’ve seen some key principles that work well as companies begin to establish their next generation data infrastructure.
Just as DevOps drove rapid delivery of high-quality, scalable software applications, DataOps is the path forward for data applications.
Components: What
There are many ways to think about the potential components of a next gen data ecosystem for the enterprise. Our friends at DataKitchen have done a good job with this post which refers to some solid work by the Eckerson group. In the interest of trying to simplify the context of what you might consider buying vs. building and which vendors you might consider, I’ve tried to lay out the primary components of a next gen enterprise data ecosystem based on the environments I’ve seen people configuring over the past 8-10 years and the tools (new and old) that are available. These components can be summarized as follows:
Q: Who are the people within the agencies that you work with that tend to champion DataOps initiatives from a leadership perspective?
Q: Are you advocating to eliminate the need for data scientists? Can you talk a little bit more about that? ---- Taking the human out of the loop, is ML as good as the humans? - ML runs 24/7, doesn’t need coffee or sleep
Q: Can subject matter experts really train this ML model? How is that possible?
Q: Do you see processes/tools in DataOps becoming open-sourced? If so, which processes do you think should be open-sourced to enable better integration of multiple vendors?
Thanks Mark. We have some very exciting news to share with you all today. Tamr launched a new podcast series called DataMasters. The podcast features data leaders that share stories about their journeys and offer insights on how they made their organizations more data driven.
Featured guests on the inaugural launch include:
Nick Sinai - Deputy CTO in the Obama Administration
and
Eric Iverson - Former CIO at Sony
We invite you to listen and subscribe today. The podcast will host many more data leaders across government organizations and enterprises.