Who changed my data?
Need for data governance and
provenance in a streaming world
Digital capability requires granular control of all data assets.
Dinesh Chandrasekhar
Director, Product Marketing
Paige Bartley
Senior Analyst, Data and
Enterprise Intelligence
Ovum | TMT intelligence | informa2 Copyright © Informa PLC
Business challenges in achieving digital capability include:
• Reproducibility of analytics results
• Debugging of models and algorithms
• Ensuring correct access rights to data
• Consistent application of data policies
• Meeting regulatory compliance requirements
• Unifying data across repositories and silos
• Finding the right data at the right time
Digital Capability Depends on Full Control of Data
Addressing these challenges requires understanding how data changes over time.
Governance and Transparency of Data Assets is More Important than Ever
More Data:
• Economics of storage have made keeping data cheap.
• New data types (sensor data, etc.) need to be combined with historical data.
More Users:
• Self-service era means more data consumers and more frequent data access.
• Varying users have varying access rights and privileges.
• More users means more proliferation of data versions.
More Complexity:
• Data repositories have become more distributed, and data sources more varied.
• Data resides in more locations than ever before, in the cloud and on-prem.
Factors Within the Enterprise
More Regulatory Pressure
• Regulations such as GDPR have indirect requirements for tracking lineage.
• Article 30 requirements for record keeping necessitate knowledge of provenance.
More Competitive Pressure
• Leverage of data is increasingly a competitive differentiator.
• Pace of change is accelerating, and comprehensive understanding of data is critical.
• Disruptors are emerging from unlikely industries, using data to their advantage.
Factors External to the Enterprise
• Article 4: Definition of Personal Data
- A person can be identified indirectly or directly
- Data sources can be combined to make personal data
• Article 9: Processing of Special Categories of Personal Data
- Processing of biometric data is highly restricted
- Many types of sensors produce biometric data
• Article 30: Records of Processing Activities
- “Who, what, when, where, and why” of processing
- Need deep understanding of metadata and data lineage.
GDPR doesn’t differentiate between data-in-motion and data-at-rest! “Who changed what” is critical.
Lineage and provenance, while not directly required by GDPR, are critical to meeting requirements.
GDPR’s Specific Requirements for Data
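The “who, what, when, where, and why” of processing can be pictured as a simple record structure. The following is a minimal sketch only: the class name, field names, and sample values are hypothetical and are not taken from GDPR or from any product. It shows data-in-motion and data-at-rest events being logged in the same shape so that a history can be assembled per dataset.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProcessingRecord:
    """One illustrative record-of-processing entry (all field names are assumptions)."""
    who: str                 # actor or system that processed the data
    what: str                # dataset or data category touched
    when: datetime           # timestamp of the processing event
    where: str               # system or location where processing ran
    why: str                 # documented purpose of processing
    state: str = "at-rest"   # "in-motion" or "at-rest"; both use the same record


def records_for(log, what):
    """Return every record touching a dataset, oldest first."""
    return sorted((r for r in log if r.what == what), key=lambda r: r.when)


log = [
    ProcessingRecord("stream-enricher", "wearable_heart_rate",
                     datetime(2018, 5, 25, 9, 30, tzinfo=timezone.utc),
                     "edge-gateway", "device health monitoring", "in-motion"),
    ProcessingRecord("analyst_a", "wearable_heart_rate",
                     datetime(2018, 5, 25, 9, 0, tzinfo=timezone.utc),
                     "data-lake", "quarterly reporting"),
]

history = records_for(log, "wearable_heart_rate")
print([(r.who, r.state) for r in history])
```

Keeping one record shape for both states is the design point here: the question “who changed what” can then be answered with a single query, regardless of where the data was when it was touched.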
“From an analytics standpoint, reaping the benefits of big data means investing in data management and governance. Without the correct people, processes, and infrastructure, more casual business users will likely struggle to see the benefits of big data technologies.”
Laurent-Olivier Lioté
Analyst, Data and Enterprise Intelligence, Ovum
A Holistic View of Data Requires Both Data-in-Motion and Data-at-Rest
[Diagram labels: Data at Rest; Data in Motion; Contextual Understanding of Data]
Having a common enterprise metadata framework allows data of different types and from different sources to be managed consistently.
A common metadata framework allows for:
• Common search and lineage for datasets
• Lifecycle management from ingestion to disposition
• Metadata exchange with other metadata tools
• Analysis of data usage and access trends
• Consistent application of access rights
• Analysis of behavior and anomalies
How Do We Do This? Metadata Management is Necessary for Governance
[Diagram labels: Metadata Creation; Metadata Enrichment; Metadata Analysis]
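As a rough illustration of “common search and lineage for datasets”, the sketch below keeps metadata for every dataset in one in-memory catalog, whatever its home repository. The class, method names, tags, and dataset names are all hypothetical and do not reflect any particular metadata tool.

```python
from collections import defaultdict


class MetadataCatalog:
    """Toy in-memory catalog; an assumption-laden sketch, not a product API."""

    def __init__(self):
        self.entries = {}                  # dataset name -> metadata dict
        self.parents = defaultdict(set)    # dataset name -> direct upstream datasets

    def register(self, name, source, tags=(), derived_from=()):
        """Metadata creation: record where a dataset lives and what it derives from."""
        self.entries[name] = {"source": source, "tags": set(tags)}
        self.parents[name].update(derived_from)

    def search(self, tag):
        """Common search: find datasets by tag, regardless of source system."""
        return sorted(n for n, m in self.entries.items() if tag in m["tags"])

    def lineage(self, name):
        """All upstream datasets, found by walking parent links."""
        seen, stack = set(), list(self.parents[name])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents[current])
        return sorted(seen)


catalog = MetadataCatalog()
catalog.register("sensor_stream", source="message-bus", tags=["in-motion", "pii"])
catalog.register("crm_table", source="warehouse", tags=["at-rest", "pii"])
catalog.register("customer_360", source="data-lake", tags=["at-rest"],
                 derived_from=["sensor_stream", "crm_table"])

print(catalog.search("pii"))
print(catalog.lineage("customer_360"))
```

Because streaming and stored datasets share one catalog, a single tag search or lineage walk covers both.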
The data lake, if properly managed, can support a common metadata framework which underpins enterprise data.
• Data-in-motion
• Data-at-rest
• Structured data
• Unstructured data
Common management of metadata allows for streamlined control and visibility into data. Better control of data results in better business outcomes.
The Managed Data Lake Can Support a Common Metadata Framework
All metadata, managed together.
The enterprise increasingly wants to analyze all data, both in-motion and at-rest, in context with each other.
Governance and lineage for data-in-motion allows for:
• Audit and regulatory compliance
• Insight into data history and provenance
• Comprehensive lifecycle management
• Security and access controls
• Better quality data = better analytics
Governance standards for data-in-motion need to match those for data-at-rest.
Governance Standards Need to be Equal
[Diagram labels: Common Metadata Framework; Data-in-Motion; Data-at-Rest; Data Management Platform]
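One way to picture equal governance standards is a single policy function that both the batch path and the streaming path must call. This is a minimal sketch under stated assumptions: the masking rule, the field names `email` and `ssn`, and the function names are all invented for illustration.

```python
def mask_pii(record, pii_fields=("email", "ssn")):
    """Return a copy of the record with hypothetical PII fields masked."""
    return {k: ("***" if k in pii_fields else v) for k, v in record.items()}


def govern_batch(rows):
    """Data-at-rest path: apply the policy to a stored table in one pass."""
    return [mask_pii(row) for row in rows]


def govern_stream(events):
    """Data-in-motion path: apply the same policy lazily, event by event."""
    for event in events:
        yield mask_pii(event)


table = [
    {"id": 1, "email": "a@example.com", "reading": 72},
    {"id": 2, "email": "b@example.com", "reading": 65},
]

at_rest = govern_batch(table)
in_motion = list(govern_stream(iter(table)))
print(at_rest == in_motion)
```

Defining the rule once and reusing it in both paths avoids the drift that appears when streaming and stored data are governed by separate code.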
13 © Hortonworks Inc. 2011–2018. All rights reserved
Changing face of data
Challenges and Solutions
The New Way of Business Is Fueled By Connected Data
DEVELOPMENT
• Connected Customers, Vehicles, Devices
• Socially crowd-sourced requirements
• Digital design and analysis
• Digital prototypes and tests (simulations)
MANUFACTURING
• Connected Factories, Sensors, Devices
• Human-robotic interaction
• 3D-printing on demand
DISTRIBUTION
• Connected Trucks, Inventory
• Location, traffic, weather-aware distribution
• Real-time inventory visibility
• Dynamic rerouting
MARKETING/SALES
• Connected Customers, Devices
• Omni-channel demand sensing
• Real-Time Recommendations
SERVICE
• Connected Assets
• Remote service monitoring & delivery
• Predictive maintenance
• OTA Updates
Today’s Digital Enterprises
• RFID TRACKERS AND NANO-DEVICES to give you visibility into movement of your goods
• MOBILE NOTIFICATIONS to inform you of shipment delay from a supplier
• BLOCKCHAINS to give complete trust and provenance in your supply chain
• VIRTUAL ASSISTANTS to enhance your customer experience
• AI-POWERED CHATBOTS to improve your customer support functions
• ELECTRONIC B2B EXCHANGES to streamline order processing with partners
© Hortonworks, Inc. 2011-2018. All rights reserved. | Hortonworks confidential and proprietary information.
Modern Data Architecture
[Diagram labels: DATA CENTER; CLOUD; IoT; Machine Learning/Artificial Intelligence; Telemetry – Connected Devices; Time Series Databases; Stream Analytics; Deep Historical Analysis; Exception Monitoring; Legacy/Operational Data; Sensors, Control Systems; Cyber Security; Edge Analytics; Social; Mobile; Geo Location]
Data Challenges
• Cannot get a 360 VIEW of your customer?
• DROWNING in data lakes?
• TOO MUCH DATA coming in from TOO MANY SOURCES and devices?
• New business initiatives leading to EXCESSIVE IT COSTS?
MOST IMPORTANTLY…
Don’t have the right data at the right time to make the right decision?
GLOBAL DATA MANAGEMENT
[Diagram labels: DATA SOURCES; DATA CENTER; CLOUD; EDGE; Exception Monitoring; 360 View of Operations; Cyber Security; Telemetry – Connected Devices; Time Series; Sensors, Control Systems; Legacy/Operational Data]
Global Data Management Enables Modern Data Architecture
Data Management Challenges
• Dealing with multi-clouds
• Avoiding cloud/vendor lock-in
• Future-proofing your architecture
• Common view of security, governance
• Manage all data, regardless of type or location
• Maximize data re-use for multiple workloads
Global Data Management Platform
DATA-IN-MOTION DATA-AT-REST
MANAGE, SECURE, GOVERN, CONSUME
Global Data Management - Powering Innovation
MODERN DATA USE CASES
• EDW OPTIMIZATION
• CYBERSECURITY
• DATA SCIENCE
• ADVANCED ANALYTICS
• IOT/STREAMING ANALYTICS
DATA-IN-MOTION DATA-AT-REST
MANAGE, SECURE, GOVERN, CONSUME
Apache NiFi Overview
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Supports push and pull models
• Recovery/recording a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
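The ideas of data buffering with backpressure and prioritized queuing can be sketched generically as a bounded priority queue. This is not NiFi code and does not describe how NiFi implements its connections; the class name, the `max_items` threshold, and the priority numbers are assumptions made for illustration.

```python
import heapq
import itertools


class BoundedPriorityQueue:
    """Toy buffer: refuses new items when full, releases items by priority."""

    def __init__(self, max_items):
        self.max_items = max_items          # hypothetical backpressure threshold
        self._heap = []
        self._order = itertools.count()     # tie-breaker keeps arrival order

    def offer(self, item, priority):
        """Return False when the threshold is reached, signalling backpressure."""
        if len(self._heap) >= self.max_items:
            return False
        heapq.heappush(self._heap, (priority, next(self._order), item))
        return True

    def poll(self):
        """Remove and return the item with the lowest priority number, or None."""
        return heapq.heappop(self._heap)[2] if self._heap else None


queue = BoundedPriorityQueue(max_items=2)
accepted_routine = queue.offer("routine reading", priority=5)
accepted_alarm = queue.offer("alarm", priority=1)
accepted_late = queue.offer("late reading", priority=3)   # buffer is full here

first_out = queue.poll()                                   # frees one slot
accepted_retry = queue.offer("late reading", priority=3)
second_out = queue.poll()
print(accepted_late, first_out, accepted_retry, second_out)
```

The producer learns about a full buffer from the return value of `offer` and can slow down or retry, which is the essence of backpressure; the consumer always sees the most urgent item first.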
Watch real time flow of data: Data Provenance in Apache NiFi
Select Data Provenance
Easily access and trace changes to dataflow in Apache NiFi
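Tracing changes to a piece of data amounts to replaying its provenance events in order. The sketch below is a generic illustration, not the NiFi provenance API: the event class, the event-type labels, and the component names are assumptions chosen only to show how “who changed my data?” could be answered from an event log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceEvent:
    """Illustrative provenance entry; every field name here is hypothetical."""
    item_id: str       # identifier of the data item being tracked
    event_type: str    # illustrative labels such as "RECEIVE" or "CONTENT_MODIFIED"
    component: str     # processing step that emitted the event
    sequence: int      # ordering of events for the item


def who_changed(events, item_id):
    """List the components that modified an item's content, in event order."""
    ordered = sorted((e for e in events if e.item_id == item_id),
                     key=lambda e: e.sequence)
    return [e.component for e in ordered if e.event_type == "CONTENT_MODIFIED"]


events = [
    ProvenanceEvent("item-1", "CONTENT_MODIFIED", "MaskFields", 3),
    ProvenanceEvent("item-1", "RECEIVE", "ListenSensor", 1),
    ProvenanceEvent("item-1", "CONTENT_MODIFIED", "ConvertFormat", 2),
    ProvenanceEvent("item-2", "RECEIVE", "ListenSensor", 1),
    ProvenanceEvent("item-1", "SEND", "PublishToLake", 4),
]

print(who_changed(events, "item-1"))
```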
Apache Atlas
• Enterprise data governance
• Integration with Apache NiFi
• Integration with Apache Ranger
[Diagram labels, Apache Atlas: Knowledge Store; Audit Store; Type-System; Models; Policy Rules; Taxonomies; Tag Based Policies; Data Lifecycle Management; Real Time Tag Based Access Control; REST API; Services (Search, Lineage, Exchange)]
[Diagram labels, models: Healthcare (HIPAA, HL7); Financial (SOX, Dodd-Frank); Energy (PPDM); Retail (PCI, PII); Other (CWM)]
[Diagram labels, SERVICE: DATA STEWARD STUDIO (DSS): Discover & Fingerprint Data; Smart Enterprise Search; Data & Metadata Security; Data Lineage & Impact Analysis; Enterprise Data Catalog; Organize & Curate Data]
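Tag-based access control can be illustrated with a small check that decides access from the tags on an asset rather than from the asset itself. This is a generic sketch, not the Atlas or Ranger API: the function name, role names, and policy table are hypothetical, and the rule that tags without a policy impose no restriction is an assumption made here.

```python
def allowed(user_roles, asset_tags, tag_policies):
    """Grant access only if every restricted tag on the asset permits one of the user's roles.

    Assumption: tags with no entry in tag_policies impose no restriction.
    """
    roles = set(user_roles)
    for tag in asset_tags:
        permitted = tag_policies.get(tag)
        if permitted is not None and not (permitted & roles):
            return False
    return True


# Hypothetical policy table: tag -> roles allowed to read assets carrying that tag.
tag_policies = {
    "PII": {"data_steward"},
    "PCI": {"finance", "data_steward"},
}

analyst_on_pii = allowed({"analyst"}, {"PII"}, tag_policies)
steward_on_both = allowed({"data_steward"}, {"PII", "PCI"}, tag_policies)
finance_on_both = allowed({"finance"}, {"PII", "PCI"}, tag_policies)
analyst_on_untagged = allowed({"analyst"}, {"weather"}, tag_policies)
print(analyst_on_pii, steward_on_both, finance_on_both, analyst_on_untagged)
```

Because the policy follows the tag, a dataset newly classified as PII picks up the restriction without any per-dataset rule being written.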
Thank you


Editor's Notes

  • #3 Let’s step away from compliance, regulation, and requirements, and look at the major trends and drivers within the enterprise. Governance and provenance are often discussed as “checkbox” requirements, rather than as enablers. ICT Enterprise Insights survey identified “create digital capability” and “manage security, identity, and privacy” as the top two IT trends in the enterprise. What do these trends have in common?
  • #4 There are three pillars to creating digital capability. The first pillar is the creation of the digital platform and infrastructure itself. The second pillar is the creation of the ability to effectively exploit and utilize data. The third pillar is the development of the enterprise's innovation process and methodology for the digital age. All three are underpinned by a clearly articulated digital strategy.
  • #8 Article 4: Personal data is any information relating to an identified or identifiable natural person. A natural person can be identified indirectly or directly, so the enterprise needs to be cautious when combining data sources to ensure that innocuous information doesn’t become personal information. Article 9: Processing of biometric data for the purpose of uniquely identifying a person is prohibited unless certain conditions are met, and this applies to several types of data in motion: sensor data from wearables, medical devices, and fitness devices. Article 30: The enterprise must document the purposes of processing, transfers of data to non-EU countries, and the envisaged time limits for erasure of the data.
  • #11 Data policies are applied and encoded at the metadata level. Metadata, or data about data, is critical to providing a common foundation for understanding the qualities of data residing in different systems and to provide lineage and cataloging capabilities. A shared or common metadata framework, where all metadata is managed together, allows data to be centrally searched, tracked, and monitored regardless of its "home" repository.
  • #13 To make this a reality, the same governance standards need to be applied to all enterprise data equally. There needs to be a single platform environment where data-in-motion and data-at-rest can be managed together, with a common metadata framework. All data-in-motion sources need a way to be ingested into this platform, with provenance and lineage tracked as they flow in.
  • #14 TALK TRACK Hortonworks Powers the Future of Data: data-in-motion, data-at-rest, and Modern Data Applications. [NEXT SLIDE]
  • #15 Data is often referred to as the fuel of today’s businesses. In reality, every business has data and can perhaps access the same types of data as most of its competitors. The real differentiator is not the data but who uses it more smartly and to greater effect. That usage often relies on connecting the data dots across your organization. By connecting customers to products to the channels through which they interact, or prefer to interact, we can drive better customer experiences, resulting in better loyalty and hopefully better revenues. Every industry is being transformed through these connected use cases.
  • #19 1) Data is in multiple places (data centers that the company owns, the cloud, locations owned by a third party). 2) Different data is in different places (data in your databases, such as numbers, versus data from sensors in a connected product that is not arranged in a database). 3) Data flows back and forth between data center and cloud. Talking points: There is an entire new world being created by combining lots of data with breakthrough tools. Data could be on-premises and in the cloud. Data is moving from sensors in real time across our data fabric, giving us precise instrumentation of what happened just before an event as well as after the event. This is true for customers buying on the web as well as for products that might fail. We can run our machine learning and deep learning on these vast repositories of data, and we can push these models down to the edges to automate decisions. Note: For us as a community and as a company, we need to continue to innovate around the core technology, while thinking about how we enable 3 personas to be successful. This is the logical evolution and transformation that’s happening now.
  • #20 You need to holistically manage all the data in all places, then begin to move our platform into place
  • #21 You need to holistically manage all the data in all places, then begin to move our platform into place
  • #22 You need to holistically manage all the data in all places, then begin to move our platform into place
  • #24 HDF provides very fine-grained, high fidelity reporting about the origins of data, how it was used, who used it etc.