Mastering the Dark Data Challenge: Harnessing AI for Enhanced Data Governance and Quality
Unlocking the Potential of Unstructured Data
Maryam Nozari and Urmi Majumder
DGIQ 2024
1
ENTERPRISE KNOWLEDGE
Maryam Nozari
Sr. Data Scientist
⬢ Senior Data Scientist and consultant specializing in cutting-edge algorithm design and predictive model deployment
⬢ Expert in deploying Large Language Models and knowledge graphs, enhancing business intelligence
⬢ Ph.D. in Learning Sciences, The University of Texas at Austin
Urmi Majumder
Principal Data Architecture Consultant
⬢ 15+ years of experience in enterprise system architecture, design, implementation, and operations
⬢ Principal architect in knowledge graphs, enterprise AI, and scalable data management systems
⬢ Ph.D. in Computer Science, Duke University
2
Topics Covered
❖ Understanding Dark Data
❖ Using AI to Manage Dark Data
❖ Beyond Dark Data
3
Understanding Dark Data
4
Understanding Dark Data
● Dark data, in the context of
sensitive information, refers to any
personal or confidential data that
is excessively disseminated or
accessible and that the
organization may not know exists
● Daily, approximately 328.77 million
terabytes of data are created
worldwide, with the average
organization holding about 17.5
petabytes of unstructured, often
unused data.
Data Breaches Hit Lots More People in 2022 - CNET
5
Data Security in Numbers
Data breaches are due to human error
Data breaches in financial and insurance sectors
Cybersecurity leaders believe attacks powered by AI in 2023
84% 74% 74%
CISOs believe security equals regulatory compliance
85%
Source: The CISO Report
6
Size of the Opportunity in the Age of Gen AI
An estimated 660 prompts are sent to ChatGPT for every 10,000 users. Source code is
the most frequently exposed type of sensitive data,
posted by 22 out of 10K users and generating 158 data breach incidents monthly.
Other types of sensitive data accidentally shared with Gen AI apps include
regulated data, resulting in 18 security incidents;
intellectual property, resulting in 4 incidents; and posts
containing passwords and keys, resulting in 4 incidents monthly.
55% of the 17.5 PB of unstructured data in an organization can be
considered dark data.
Source: Navigating the Rising Tide of Data Breaches and AI Security Risks
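The two figures quoted on this slide imply a rough dark data volume per organization. The following is a minimal arithmetic sketch; the constant and function names are illustrative and not part of the original deck.

```python
# Figures quoted on the slide; the names below are illustrative assumptions.
UNSTRUCTURED_PB_PER_ORG = 17.5  # average unstructured data per organization, in PB
DARK_DATA_SHARE = 0.55          # share of that data considered dark data


def dark_data_petabytes(unstructured_pb: float, dark_share: float) -> float:
    """Return the estimated dark data volume in petabytes."""
    return unstructured_pb * dark_share


estimate = dark_data_petabytes(UNSTRUCTURED_PB_PER_ORG, DARK_DATA_SHARE)
print(f"Estimated dark data per organization: {estimate:.3f} PB")
```

Under these figures, roughly 9.6 PB of an average organization's unstructured data would count as dark data.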
7
The High Cost of Ignoring Dark Data
● Financial Costs of Compliance Violations
○ The financial penalties companies have
faced due to non-compliance with data
protection regulations (e.g., GDPR, CCPA)
● Impact on Customer Trust
● Brand Reputation and Market Value
The average cost of a data breach was
$4.45 million in 2023, the highest average
on record. (IBM)
Security breaches increased in 2021
by 68 percent. (CNET)
8
https://www.ibm.com/reports/data-breach
How Data Becomes Dark and How to Recover from It
1. Data Creation (automatic system logs, user-generated content, business transactions, etc.)
2. Active Use and Storage (data is often actively used and stored in an organized manner)
3. Data Silos and Lack of Accessibility (data is stored in isolated systems or formats, leading to a lack of accessibility and visibility)
Intervention: Encourage inter-departmental data sharing and adopt integrated data management systems
4. Neglect or Improper Management (failure to delete obsolete data, improper categorization, or simply forgetting about its existence)
Intervention: Regular data hygiene practices and clear data governance policies can prevent neglect
5. Becoming Dark Data (data becomes dark: unused, unanalyzed, and potentially risky due to outdated or sensitive information)
Intervention: Adopt advanced data analytics, AI for data classification, and proactive data management strategies
Risks:
● Increased risk of data breaches, compliance violations, and missed opportunities for insights
● Dark data, if left unchecked, poses risks to privacy, security, and compliance, and represents a significant loss of potential insights and opportunities
9
Using AI to Manage Dark Data: A Case Study
10
The Cost to the Enterprise (a Case Study)
● A leading federal research organization
identified a significant challenge in
managing its vast amounts of unstructured
data.
○ The organization's data landscape was
cluttered with dark data, including
project documents, proposals, and
research papers, some of which contained
“classified” government information.
○ This data was scattered across various
platforms such as shared drives, email
servers, and cloud storage solutions.
○ This led to inefficiencies in data
access, increased risk of sensitive data
exposure, and difficulties in complying
with stringent federal regulations.
11
The Solution
AI-Powered Dark Data Identifier
Original State
● The organization was aware of the existence of dark data but had no scalable automation in place to identify such content beyond rule-based scripts that classify content with PII
The Need
● A more flexible and sophisticated approach to automatically identifying classified information, and evolving categories of organization-specific sensitive content, buried in improperly labeled enterprise data assets, before leveraging Generative AI across the enterprise to boost employee productivity
Solution
● Implemented data pipelines to connect and extract data siloed in different systems.
● Enabled hybrid content classification based on predefined sensitivity rules to give the organization the ability to identify overshared sensitive content and remediate access.
● Built a BI dashboard to provide system administrators a clear view into dark data in the organization.
12
Overall Solution Approach to Dark Data Discovery
● Integration with Enterprise Systems
● Proof of Concept
● Access Control Mechanism
● Solution Architecture & Design
● Compliance & Ethical Guidelines
● Data Classification & Tagging
13
Dark Data Discovery using AI: Demo Time!
14
From Dark Data to Insights: The AI Connection
Stages: Metadata Ingestion, Dark Data Identification, Dark Data Discovery
● Crawl through enterprise data and employ a hybrid approach of deterministic and probabilistic methods, including pattern matching and AI/ML models
● Flag sensitive content with overly permissive access, enabling administrators to adjust access levels and safeguard confidential information
● Based on a sensitivity rules database with three categories of rules
● Clear view into uncovered dark data and rationale
● Ensures a robust, adaptable framework for securing sensitive data across various organizational contexts
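The hybrid approach of deterministic and probabilistic methods described on this slide can be sketched roughly as follows. This is a minimal illustration, not the presenters' implementation: the SSN regex, the `TERM_WEIGHTS` table (a stand-in for a real AI/ML model score), and the 0.6 threshold are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Deterministic layer: an illustrative pattern for US Social Security numbers.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Probabilistic layer: hypothetical term weights standing in for a model score.
TERM_WEIGHTS = {"confidential": 0.5, "proprietary": 0.4, "salary": 0.3, "patent": 0.3}


@dataclass
class Finding:
    sensitive: bool
    rationale: str  # kept so administrators can see why content was flagged


def model_score(text: str) -> float:
    """Stand-in for a model: sum weights of sensitive terms present, capped at 1.0."""
    lowered = text.lower()
    return min(1.0, sum(w for term, w in TERM_WEIGHTS.items() if term in lowered))


def classify(text: str, threshold: float = 0.6) -> Finding:
    """Apply the deterministic rule first, then fall back to the probabilistic score."""
    if SSN_PATTERN.search(text):
        return Finding(True, "pattern match: SSN")
    score = model_score(text)
    if score >= threshold:
        return Finding(True, f"model score {score:.2f} >= {threshold:.2f}")
    return Finding(False, f"model score {score:.2f} below threshold")
```

Running the pattern check first keeps clear-cut cases cheap and explainable; the score is only consulted when no rule fires. Each finding carries a rationale, in the spirit of the "clear view into uncovered dark data and rationale" bullet.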
15
Conceptual Architecture for
Dark Data Discovery
16
Data Classification
● Categorize organizational data assets
based on sensitivity, criticality and
usage
● Typical classes: public, internal,
confidential, restricted
Dark Data Discovery
● Define organizational data categories
● Define data class for each category
● Classify data asset using AI/ML and rules
● Flag sensitive content shared broadly for
security review
● Remediate access to overshared
sensitive content
Data Classification -> Dark Data Discovery
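The step from data classification to flagging overshared content can be sketched as a comparison between an asset's data class and the audience it is shared with. The four classes come from the slide; the audience names and the `MAX_CLASS_FOR_AUDIENCE` mapping are hypothetical and would be defined by each organization.

```python
from dataclasses import dataclass
from enum import IntEnum


class DataClass(IntEnum):
    """Typical classes from the slide, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical sharing scopes and the most sensitive class each may hold.
MAX_CLASS_FOR_AUDIENCE = {
    "anyone_with_link": DataClass.PUBLIC,
    "all_employees": DataClass.INTERNAL,
    "department": DataClass.CONFIDENTIAL,
    "named_individuals": DataClass.RESTRICTED,
}


@dataclass
class Asset:
    path: str
    data_class: DataClass
    audience: str


def is_overshared(asset: Asset) -> bool:
    """True when the asset is more sensitive than its audience is allowed to see."""
    return asset.data_class > MAX_CLASS_FOR_AUDIENCE[asset.audience]


def flag_for_review(assets: list) -> list:
    """Return the paths of overshared assets for security review."""
    return [a.path for a in assets if is_overshared(a)]


assets = [
    Asset("hr/salaries.xlsx", DataClass.CONFIDENTIAL, "all_employees"),
    Asset("marketing/brochure.pdf", DataClass.PUBLIC, "anyone_with_link"),
    Asset("legal/contract.docx", DataClass.RESTRICTED, "named_individuals"),
]
flagged = flag_for_review(assets)
```

Remediation would then narrow the audience of each flagged asset until `is_overshared` returns false.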
17
Dark Data Discovery Rules: Data Class to Role Mapping
1. Unique
Definition: Rules that cover types of sensitive information specific to an organization.
Examples: Pending Patents, Business Plans, Private Source Code, Engineering Processes
2. Common
Definition: Rules that cover types of sensitive information that are not universal but fairly common across geographical regions and enterprise domains.
Examples: Country/State-specific government-issued IDs, system-specific logs collecting sensitive information, salary information, health records
3. Core
Definition: Rules that cover types of sensitive information that are usually present in any enterprise data set.
Examples: SSN, Location, DOB, Gender, Ethnicity, Residency History
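A sensitivity rules database with the three categories (core, common, unique) might be modeled along these lines. The rule names and regular expressions are illustrative assumptions only; a real deployment would likely pair such patterns with AI/ML classifiers, especially for the organization-specific "unique" category.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    category: str  # "core", "common", or "unique"
    pattern: re.Pattern


# Hypothetical rules, one or two per category, for illustration only.
RULES = [
    Rule("us_ssn", "core", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    Rule("date_of_birth", "core",
         re.compile(r"\bDOB[:\s]+\d{2}/\d{2}/\d{4}\b", re.IGNORECASE)),
    Rule("salary", "common", re.compile(r"\bsalary\b", re.IGNORECASE)),
    Rule("pending_patent", "unique", re.compile(r"\bpatent pending\b", re.IGNORECASE)),
]


def match_rules(text: str, rules=RULES) -> list:
    """Return (category, rule name) pairs for every matching rule, in rule order."""
    return [(r.category, r.name) for r in rules if r.pattern.search(text)]
```

Keeping the category on each rule lets an organization add or retire "unique" rules without touching the shared "core" and "common" sets.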
18
~30K documents were
scanned and analyzed
by AI as the first PoC
Identified documents
with sensitivity label
and access mismatch
● Success rate can be improved
through targeted discovery
rules specific to the
organization’s content
● Structured/semi-structured
sources such as glossaries, data
collection spreadsheets,
invoices, receipts not easily
classifiable
Results Summary: Findings
~30K ~30K Impact
19
Powering Secure Data Management using AI
Organizations in highly regulated sectors such as Banking, Insurance, Finance, Healthcare,
Pharmaceuticals, Automotive, and Construction can readily leverage EK’s Dark Data Identification Service
to manage regulatory compliance
● Hybrid Solution: Leverage SME-defined rules + best-in-class LLMs
● Extensible Framework: Adjust rules to match the evolving landscape of data privacy
● Comprehensive View into Dark Data: Generate actionable insights to support security SMEs in upholding regulatory compliance
Outcomes: Data Security, Regulatory Compliance, Safe AI
20
The Bigger Impact: Beyond Dark Data
21
Domain Specific Application
Steps: 1. Identification, 2. Assessment, 3. Action, 4. Monitoring, 5. Reporting
Finance
1. Identification of transaction records and communications.
2. Risk assessment for data breaches and non-compliance with regulations like SOX, GDPR.
3. Encryption of sensitive data, implementation of access controls.
4. Continuous monitoring for unusual access patterns.
5. Regular reporting to regulatory bodies and internal audits.
Insurance
1. Identification of sensitive information through scanning claim forms and customer interactions
2. Compliance Assessment with HIPAA (in health insurance) and other sector-specific laws.
3. Anonymization of personal identifiers, secure data storage solutions.
4. Real-time monitoring for compliance adherence.
5. Compliance status reports to state insurance boards.
Healthcare
1. Identification of patient data in clinical trial records.
2. Assessing compliance with FDA regulations and data protection laws.
3. Data segregation, rigorous data access controls.
4. Monitoring access to clinical data.
5. FDA audit reports and internal compliance reviews.
Automotive
1. Identification of unstructured data within vehicle testing reports and manufacturing data.
2. Assessing compliance with safety standards and environmental regulations.
3. Implementing data retention policies and safety data protocols.
4. Continuous monitoring of compliance with emissions and safety regulations.
5. Reporting to automotive safety and environmental agencies.
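The five-step structure shared by all four domains (Identification, Assessment, Action, Monitoring, Reporting) lends itself to a small data structure. The sketch below is one possible representation; the `DomainWorkflow` class is hypothetical, and the Finance activities are taken from the slide.

```python
from dataclasses import dataclass

# The five steps every domain on the slide follows, in order.
STEPS = ("Identification", "Assessment", "Action", "Monitoring", "Reporting")


@dataclass(frozen=True)
class DomainWorkflow:
    domain: str
    activities: tuple  # one activity per step, in STEPS order

    def plan(self) -> list:
        """Return numbered 'step: activity' lines; reject incomplete workflows."""
        if len(self.activities) != len(STEPS):
            raise ValueError(f"{self.domain}: expected {len(STEPS)} activities")
        return [
            f"{number}. {step}: {activity}"
            for number, (step, activity) in enumerate(zip(STEPS, self.activities), start=1)
        ]


finance = DomainWorkflow("Finance", (
    "Identification of transaction records and communications",
    "Risk assessment for data breaches and non-compliance with SOX, GDPR",
    "Encryption of sensitive data, implementation of access controls",
    "Continuous monitoring for unusual access patterns",
    "Regular reporting to regulatory bodies and internal audits",
))
finance_plan = finance.plan()
```

Other domains (Insurance, Healthcare, Automotive) would reuse the same steps with their own activities.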
22
Broader Content Classification Use Cases
01 Records Management
● For every content item without a content type, auto-assign a content type
● Assign record codes based on the organization’s record schedule for that
content type
02 Search and Discovery
● Tag content using predefined categories
● Index tags and other metadata about the content into a search engine to
power search and discovery
03 Survey Analysis
● Train an AI/ML model to categorize qualitative survey responses into
predefined categories such as usability and technical complexity (if the survey
is for an application, as an example)
● Use the model to categorize survey responses at scale and direct each group of
responses to the right team for further analysis
04 Content Moderation
● Content generated using Gen AI tools must be effectively moderated
● Manual content moderation can be both time-consuming and flawed
● AI-powered content moderation can reduce this cognitive load
05 Customer Support
● Use a multi-class classification algorithm to classify customer tickets by
predefined categories such as complaint, feedback, question, and so on
● Integrate classifier output with ticket workflow so that customer support
agents can focus on urgent tickets
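As one illustration of these use cases, the Search and Discovery flow (tag content with predefined categories, then index the tags) can be sketched with an in-memory inverted index. The keyword-based tagger and the `CATEGORY_KEYWORDS` table are hypothetical stand-ins for a trained classifier, and the dictionary stands in for a real search engine.

```python
from collections import defaultdict

# Hypothetical predefined categories and the keywords that trigger each tag.
CATEGORY_KEYWORDS = {
    "finance": {"invoice", "budget"},
    "hr": {"salary", "onboarding"},
    "research": {"experiment", "proposal"},
}


def tag_content(text: str) -> set:
    """Return every category whose keywords appear as words in the text."""
    words = set(text.lower().split())
    return {category for category, keywords in CATEGORY_KEYWORDS.items() if words & keywords}


def build_index(documents: dict) -> dict:
    """Build an inverted index mapping each tag to the set of document IDs carrying it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for tag in tag_content(text):
            index[tag].add(doc_id)
    return index


def search(index: dict, tag: str) -> list:
    """Return the sorted document IDs for a tag, or an empty list if none match."""
    return sorted(index.get(tag, set()))


documents = {
    "d1": "Q3 budget and invoice summary",
    "d2": "New hire onboarding checklist",
    "d3": "Research proposal and budget",
}
index = build_index(documents)
```

Here `search(index, "finance")` returns both `d1` and `d3`, since each mentions a finance keyword.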
23
Questions?
Thank you for listening.
We are happy to take any
questions at this time.
Urmi Majumder
umajumder@enterprise-knowledge.com
www.linkedin.com/in/urmim/
Maryam Nozari
mnozari@enterprise-knowledge.com
www.linkedin.com/in/maryamnozaiphd/
24
