Driving Behavioral Change for Information Management through Data-Driven Green Strategy (EDW 2024)

Driving Behavioral Change for
Information Management through
Data-Driven Green Strategy
A Case Study
Urmi Majumder and Fernando Aguilar Islas
EDW 2024

Topics Covered
⬢ What is a Green Information Management (IM) Strategy, and
why should you have one?
⬢ How can Artificial Intelligence (AI) and Machine Learning (ML)
support your Green IM Strategy through content deduplication?
⬢ How can an organization use insights into their data to
influence employee behavior for IM?
⬢ How can you reap additional benefits from content reduction
that go beyond Green IM?

Urmi Majumder
Principal Data Architecture Consultant
⬢ 15+ years of experience in enterprise system architecture,
design, implementation, and operations
⬢ Leads the development of technical solutions in support of
a wide variety of knowledge and data management solutions
⬢ Principal architect in knowledge graphs, enterprise AI, and
scalable data management systems
⬢ Ph.D., Computer Science, Duke University
Fernando Aguilar Islas
Data Science Consultant
⬢ 9+ years of experience serving as a data scientist for graph-powered
machine learning and AI-based solutions
⬢ Implemented several knowledge graph-based enterprise data
catalogs
⬢ Experience leading and supporting the integration and
implementation of 20+ data projects
⬢ MS, Applied Statistics, Penn State University
ENTERPRISE KNOWLEDGE

Green Information
Management (IM)
Strategy

Green Information Management:
Putting the “Green” in Information Management (IM)
What is it?
Green Information Management (gIM) is a strategic approach focused on optimizing information-related processes within an organization and minimizing their environmental impact.
Why should enterprises have a green IM strategy?
● Sustainability: Reduce resource consumption and waste associated with IM practices.
● Cost Efficiency: Reduce energy consumption through streamlined processes and optimized infrastructure.
● Compliance: Address regulatory requirements to demonstrate adherence to green standards.
● Corporate Responsibility: Commit to environmental stewardship.

Case Study: The Challenge
A supply-chain giant is committed to an organizational goal of becoming a Net-Zero Emissions Business.
The organization realized that it had a huge digital carbon footprint due to the proliferation of duplicate content: over 226K documents occupying ~1 PB of space. The duplication arose through the use of content management systems and collaboration software, such as SharePoint and Microsoft Teams, that unintentionally build silos because of a lack of visibility and awareness.

The Solution
AI-Powered Digital Carbon Footprint Calculator
ORIGINAL STATE
● A rules-based non-record deletion application periodically deleted forgotten, non-sensitive documents.
● The algorithm could only delete documents that had not been modified for at least 3 years and were marked as non-records. But a lack of sharing culture meant most documents were unnecessarily marked as sensitive.
THE NEED
● Need to aggregate tens of primary sources that hold duplicate or near-duplicate content with slightly different metadata and access levels, in order to build a content similarity and resultant carbon footprint dashboard.
● Need to augment the rule-based approach, which relies solely on metadata, with AI that relies on content similarity to identify duplicate and near-duplicate content.
SOLUTION
● Implemented the data pipelines and matching algorithms to connect data siloed in different systems.
● Automated duplicate content identification to give the organization the ability to drill down into duplicate content-related findings across data sources and improve QA.
● Built a BI dashboard to provide a clear view into content duplication and its connection to CO2 emissions.
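The rules-based deletion criteria from the original state can be sketched as a simple predicate. This is an illustrative sketch only: the `Document` fields, the `is_deletable` name, and the approximation of three years as 3 × 365 days are assumptions, not details of the application described in the case study.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical document shape; the field names are assumptions for illustration.
@dataclass
class Document:
    path: str
    last_modified: datetime
    is_record: bool
    is_sensitive: bool

# "At least 3 years", approximated here as 3 * 365 days (assumption).
RETENTION = timedelta(days=3 * 365)

def is_deletable(doc: Document, now: datetime) -> bool:
    """True for a non-sensitive non-record that has not been modified for 3+ years."""
    stale = now - doc.last_modified >= RETENTION
    return stale and not doc.is_record and not doc.is_sensitive

now = datetime(2024, 3, 1)
docs = [
    Document("a.docx", datetime(2019, 1, 1), is_record=False, is_sensitive=False),
    Document("b.docx", datetime(2019, 1, 1), is_record=False, is_sensitive=True),
    Document("c.docx", datetime(2023, 6, 1), is_record=False, is_sensitive=False),
]
deletable = [d.path for d in docs if is_deletable(d, now)]
print(deletable)
```

The second document is as old as the first but is skipped because of its sensitivity flag, which is how over-labeling limits what a metadata-only rule can remove.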

The AI Connection:
How to Use AI for
Green IM

Overall Solution Phased Approach
Phase I: Proof of Concept
● Refine use case, prioritize requirements and
define KPIs
● Conduct Exploratory Data Analysis
● Develop Matching Algorithms
● Implement Content Deduplication Data
Pipeline
● Implement CO2 Emissions BI Dashboard
● Track KPIs
Phase II: Productionalization
● Scale Data Pipeline
● Enhance BI Dashboard to make it
actionable
● Integrate Pipeline with Content
Management System and
Collaboration Software
● Develop broader Green IM strategy

The AI Connection
End-to-End Process Overview
Metadata Ingestion
● Data Source Integration
● Metadata Extraction
● Content Extraction
● Content Vectorization using AI
Content Deduplication
● Rule-Based Metadata Similarity Analysis
● Stochastic Content Similarity Analysis
● Combining Metadata and Content Findings to identify duplicates and near-duplicates
CO2 Emissions Viewer
● Duplicate Storage Impact Analysis
● Resultant CO2 Emissions Calculation
● BI Dashboard for summary statistics and drill-down by key metadata fields
The AI Connection
Content Deduplication Pipeline
Content Deduplication Process
Data Ingestion
● Source system identification
● Establishment of data crawlers that meet system-specific access requirements
Metadata & Content Extraction
● Metadata extraction from either the source system or a supporting metadata store
● Content extraction based on file type
Metadata Enrichment
● Use of a reference data/taxonomy management system
● Content vectorization via use of Generative AI
Matching Algorithm Execution
● Rules-based duplicate inferencing on content metadata
● Stochastic duplicate inferencing on content vectors
Duplicate Analysis Output
● Combination of rule-based and stochastic analysis to identify duplicates
● Resultant storage impact
● Resultant CO2 emissions
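One way to picture the matching step is a metadata rule combined with a similarity score over content vectors. The sketch below is a hedged illustration, not the pipeline's actual algorithm: the `Item` fields, the name-and-size metadata rule, the cosine measure, the 0.9 threshold, and the way the two signals are combined are all assumptions, and the short vectors stand in for embeddings produced by a real model.

```python
import math
from dataclasses import dataclass
from typing import List

# Hypothetical content item; the vector stands in for a model-generated embedding.
@dataclass
class Item:
    doc_id: str
    name: str
    size_bytes: int
    vector: List[float]

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def metadata_match(a: Item, b: Item) -> bool:
    """Assumed metadata rule: same file name (case-insensitive) and same size."""
    return a.name.lower() == b.name.lower() and a.size_bytes == b.size_bytes

def classify(a: Item, b: Item, threshold: float = 0.9) -> str:
    """Combine both signals; this combination logic is an assumption."""
    similar = cosine(a.vector, b.vector) >= threshold
    if similar and metadata_match(a, b):
        return "duplicate"
    if similar:
        return "near-duplicate"
    return "distinct"

x = Item("1", "Plan.docx", 1000, [1.0, 0.0, 1.0])
y = Item("2", "plan.docx", 1000, [1.0, 0.0, 1.0])
z = Item("3", "plan_v2.docx", 1200, [0.9, 0.1, 1.0])
w = Item("4", "invoice.pdf", 500, [0.0, 1.0, 0.0])

for other in (y, z, w):
    print(other.doc_id, classify(x, other))
```

Comparing every pair is quadratic in the number of items, so a production pipeline would likely need an approximate nearest-neighbor index or blocking on metadata first.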

Conceptual Architecture

Green Application Development Considerations
Minimize Data Movement
⬢ Reduce energy consumption from duplicate content storage and data transfer
⬢ No copies: extract content from the original file for in-memory processing
Use Transformer Models for Vectorization
⬢ < 100M parameters (e.g., DistilBERT)
⬢ Use less memory and storage space due to smaller model size
Run Analysis Pipelines in Cloud Infrastructure
⬢ Take advantage of resource efficiency at scale
⬢ Use under-utilized regions (e.g., Azure Norway East region) or regions powered by 100% renewable energy (e.g., AWS US East 2)
Training OpenAI’s GPT-3.5 requires 1K GPU processors running in parallel for weeks at a time
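As a rough illustration of why parameter count matters for memory and storage, the weights of a model can be sized by multiplying the parameter count by the bytes per parameter. The sketch assumes 32-bit floating-point weights (4 bytes each) and ignores activations, tokenizer files, and runtime overhead. The 66M figure is the approximate published size of the base DistilBERT model, and the function name is hypothetical.

```python
BYTES_PER_PARAM_FP32 = 4  # assumption: weights stored as 32-bit floats

def weight_memory_mb(num_params: int, bytes_per_param: int = BYTES_PER_PARAM_FP32) -> float:
    """Approximate size of the model weights alone, in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1_000_000

ceiling_mb = weight_memory_mb(100_000_000)    # the "< 100M parameters" guideline
distilbert_mb = weight_memory_mb(66_000_000)  # approx. DistilBERT base parameter count

print(f"100M-parameter ceiling: {ceiling_mb:.0f} MB")
print(f"DistilBERT (approx.):   {distilbert_mb:.0f} MB")
```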

Benefits of the AI-Powered Digital Carbon Footprint Calculator
● Reconcile a high volume of distinct content items to a significantly lower number of unique content items across siloed systems
● Give users a clear view into content duplication and its connection to CO2 emissions through a meaningful dashboard
● Establish a plug-and-play architecture for extracting content from many file types and vectorizing it using multiple Generative AI models, to best align the content similarity pipeline to the organizational needs
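A plug-and-play extraction layer is often built as a registry keyed by file type, so supporting a new format means registering one more function. The sketch below is a minimal illustration under assumptions: the `register` decorator, the `EXTRACTORS` registry, and the two toy extractors are hypothetical and are not taken from the case study. Extraction works on bytes already in memory, in line with the no-copies consideration.

```python
import csv
import io
import os
from typing import Callable, Dict

# Maps a lowercase file extension to a function that turns raw bytes into text.
Extractor = Callable[[bytes], str]
EXTRACTORS: Dict[str, Extractor] = {}

def register(extension: str) -> Callable[[Extractor], Extractor]:
    """Decorator that plugs an extractor into the registry for one file type."""
    def decorator(func: Extractor) -> Extractor:
        EXTRACTORS[extension.lower()] = func
        return func
    return decorator

@register(".txt")
def extract_txt(data: bytes) -> str:
    return data.decode("utf-8", errors="replace")

@register(".csv")
def extract_csv(data: bytes) -> str:
    # Flatten all cells into one space-separated string for later vectorization.
    rows = csv.reader(io.StringIO(data.decode("utf-8", errors="replace")))
    return " ".join(cell for row in rows for cell in row)

def extract(filename: str, data: bytes) -> str:
    """Dispatch to the extractor registered for the file's extension."""
    extension = os.path.splitext(filename)[1].lower()
    extractor = EXTRACTORS.get(extension)
    if extractor is None:
        raise ValueError(f"No extractor registered for {extension!r}")
    return extractor(data)

print(extract("notes.TXT", b"quarterly plan"))
print(extract("table.csv", b"a,b\nc,d\n"))
```

The same registry pattern could be applied to vectorizers, so that different embedding models can be swapped in and compared without changing the rest of the pipeline.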

The Power of Data:
Drive Social Changes
Through Data

Size of the Opportunity
An estimated* 50% of corporate data is duplicated across the organization.
Real World Example
⬢ An email server contains 100 instances of the same 1 MB file attachment sent to 100 people
⬢ Without content deduplication, if all 100 people back up their mailboxes, it would consume 100 MB of storage
In the supply-chain organization, ~226K documents occupied ~1 PB of storage, resulting in 228 tonnes of CO2 emissions.
15% content reduction through duplicate identification: 34 tonnes of CO2, equivalent to 20 flights from JFK to LHR.
* https://www.xillio.com/blog/recognize-duplicate-folder-structures-with-xillio-insights
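The figures on this slide can be checked with simple arithmetic. The sketch below derives an emissions-per-terabyte factor implied by the slide's own numbers (228 tonnes for ~1 PB) and applies it to the 15% reduction. It assumes 1 PB = 1,000 TB and that emissions scale linearly with storage; the implied factor is not an official emission factor, and the function name is hypothetical.

```python
# Figures taken from the slide.
TOTAL_STORAGE_TB = 1_000          # ~1 PB, assuming 1 PB = 1,000 TB
TOTAL_EMISSIONS_TONNES = 228
DUPLICATE_SHARE = 0.15            # 15% content reduction

# Implied factor (assumption: emissions scale linearly with stored volume).
tonnes_per_tb = TOTAL_EMISSIONS_TONNES / TOTAL_STORAGE_TB

def emissions_avoided(storage_tb: float, share_removed: float, factor: float) -> float:
    """Tonnes of CO2 attributed to the storage that deduplication would remove."""
    return storage_tb * share_removed * factor

avoided = emissions_avoided(TOTAL_STORAGE_TB, DUPLICATE_SHARE, tonnes_per_tb)
print(f"Implied factor: {tonnes_per_tb:.3f} tonnes CO2 per TB")
print(f"Avoided by 15% reduction: {avoided:.1f} tonnes CO2")

# Email example: 100 mailbox backups of one 1 MB attachment.
copies, attachment_mb = 100, 1
without_dedup_mb = copies * attachment_mb
saved_mb = without_dedup_mb - attachment_mb  # only one copy needs to be kept
print(f"Email example: {without_dedup_mb} MB stored, {saved_mb} MB avoidable")
```

The result of roughly 34.2 tonnes is consistent with the 34 tonnes quoted on the slide.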

Why is content duplication so
prevalent in the enterprise?
NON-DELIBERATE action on the part of the user
● Users forget a document exists and
recreate it
● Users cannot find what they are looking
for and create it from scratch
● Users save email attachments,
sometimes the same file multiple times
● Users download files from the intranet,
sometimes the same file multiple times
DELIBERATE action on the part of the user
● Maintain a backup copy
● Copy a file for easier transfer/distribution
● Use separate files for different document
versions

Automated Content Renewal
Defensible Deletion
● Redundant, obsolete, and trivial data is held on to by users just in case
● A non-record deletion policy can save storage space by deleting documents that are not marked as records and have not been modified for a predefined period
Barriers to Automated Content Removal
● Content incorrectly marked as records due to a lack of proper compliance training
● Content marked with higher sensitivity labels because of a knowledge-hoarding culture
● Content duplicated to associate different access permissions due to limited cross-departmental collaboration

DATA-DRIVEN USER BEHAVIOR CHANGE: Goals
“Educate and empower to influence positive behavior change.”
Educate
● Facilitate self-directed and social learning opportunities for green information management
Empower
● Facilitate evidence-based decisions by offering easy access to a personal CO2 emissions viewer
● Propel users into action by equipping them with the right interactive tool to act on the findings in the flow of work
● Provide the data needed to identify personal emissions trends and a way to track progress over time

Pilot CO2 Emissions Viewer: Demo Time!

DATA-DRIVEN USER BEHAVIOR CHANGE: Recommended Actions
“Educate and empower to influence positive behavior change.”
Educate
● Educate users to use links instead of attachments for file sharing
● Educate users on componentized content management
Empower
● Provide accurate data
○ Establish KPIs measuring accuracy of duplicate detection pipeline
● Frame up the data in the context of the bigger picture
○ Enable visualization of immediate CO2 emission reduction as a result of deduplication
○ Enable visualization of impact of content reduction over time
○ Enable visualization of a personal digital footprint counter for unique content over time
● Create a “don’t make me think” experience with push-of-a-button actions available in the end-user application
○ Enable system triggers to remove content through the application interface
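The slide does not say which KPIs measure the accuracy of the duplicate detection pipeline. Precision and recall against a manually labeled sample are a common choice, and the sketch below assumes them. The pair representation and the sample data are hypothetical.

```python
from typing import FrozenSet, Set, Tuple

Pair = FrozenSet[str]  # an unordered pair of document IDs

def pair(a: str, b: str) -> Pair:
    return frozenset((a, b))

def precision_recall(predicted: Set[Pair], labeled: Set[Pair]) -> Tuple[float, float]:
    """Precision: share of flagged pairs that are true duplicates.
    Recall: share of true duplicate pairs that the pipeline flagged."""
    true_positives = len(predicted & labeled)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(labeled) if labeled else 0.0
    return precision, recall

# Hypothetical sample: pairs flagged by the pipeline vs. pairs confirmed by reviewers.
flagged = {pair("a", "b"), pair("c", "d"), pair("e", "f")}
confirmed = {pair("b", "a"), pair("c", "d"), pair("g", "h"), pair("i", "j")}

precision, recall = precision_recall(flagged, confirmed)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision risks prompting users to delete content that is not actually duplicated, so it is likely the more important figure to track before enabling push-of-a-button removal.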

The Bigger Impact:
Beyond Green IM

Content Deduplication Use Cases
01 Mergers and Acquisitions
● Streamline the integration of data from merged entities, ensuring a more efficient and sustainable data consolidation process
● Lower infrastructure costs and minimize environmental impact through efficient content management practices
02 Cloud Data Migration
● Minimize the volume of data to be transferred, optimizing network bandwidth and reducing associated costs and energy consumption
● Lower operational costs and environmental impact
03 Content Auditing and Analysis
● Identify redundant or obsolete data
● Surface similar content with different associated metadata
04 Legal and Regulatory Compliance
● Reduce exposure to copyright, trademark, or other intellectual property rights violations
● Decrease the risk of privacy breaches, since duplicate content may contain sensitive information that can be accessed by unauthorized parties
05 Generative AI (LLMs)
● Increase efficiency in RAG applications by removing noise and bias
● Decrease costs associated with vectorizing content

Questions?
Thank you for listening. We are happy to take any questions at this time.
Urmi Majumder
umajumder@enterprise-knowledge.com
www.linkedin.com/in/urmim/
Fernando Aguilar Islas
fislas@enterprise-knowledge.com
www.linkedin.com/in/feraguilaris/
