Mastering the Dark Data Challenge - Harnessing AI for Enhanced Data Governance and Quality

Enterprise-Knowledge 742 views 24 slides Jul 30, 2024

About This Presentation

Enterprise Knowledge’s Maryam Nozari, Senior Data Scientist, and Urmi Majumder, Principal Data Architecture Consultant, presented a talk on “Mastering the Dark Data Challenge: Harnessing AI for Enhanced Data Governance and Quality” at the Data Governance & Information Quality Conference (D...


Slide Content

Mastering the Dark Data Challenge:
Harnessing AI for Enhanced Data
Governance and Quality

Unlocking the Potential of Unstructured Data
Maryam Nozari and Urmi Majumder
DGIQ 2024
1

ENTERPRISE KNOWLEDGE
Urmi Majumder
Principal Data Architecture Consultant
⬢15+ years of experience in enterprise system
architecture, design, implementation, and
operations
⬢Principal architect in knowledge graphs,
enterprise AI, and scalable data management
systems
⬢Ph.D. in Computer Science, Duke University
Maryam Nozari
Sr. Data Scientist
⬢Senior Data Scientist and consultant
specializing in cutting-edge algorithm design
and predictive model deployment
⬢Expert in deploying Large Language Models
and knowledge graphs, enhancing business
intelligence
⬢Ph.D. in Learning Sciences, The University of
Texas at Austin
2

Topics Covered
❖Understanding Dark Data

❖Using AI to Manage Dark Data

❖Beyond Dark Data

3

Understanding Dark
Data
4

Understanding Dark Data
●Dark data, in the context of
sensitive information, refers to any
personal or confidential data that
is excessively disseminated or
accessible and that the
organization may not know exists

●Daily, approximately 328.77 million
terabytes of data are created
worldwide, with the average
organization holding about 17.5
petabytes of unstructured, often
unused data.


Data Breaches Hit Lots More People in 2022 - CNET
5

Data Security in Numbers
Data breaches are due to human error
Data breaches in financial and insurance sectors
Cybersecurity leaders believe attacks powered by AI in 2023
84% 74% 74%
CISOs believe security equals regulatory compliance
85%
Source: The CISO Report
6

Size of the Opportunity in the Age of Gen AI
An estimated 660 prompts to ChatGPT for every 10,000 users, with
source code being the most frequently exposed type of sensitive data,
posted by 22 out of 10K users, generating 158 data breach incidents monthly.
Other types of sensitive data accidentally shared with Gen AI apps include
regulated data resulting in 18 security incidents,
intellectual property resulting in 4 incidents, and posts
containing passwords and keys resulting in 4 incidents monthly.
55% of 17.5 PB unstructured data in an organization can be
considered as Dark Data.
Source: Navigating the Rising Tide of Data Breaches and AI Security Risks
7

The High Cost of Ignoring Dark Data
●Financial Costs of Compliance Violations
○The financial penalties companies have
faced due to non-compliance with data
protection regulations (e.g., GDPR, CCPA)
●Impact on Customer Trust
●Brand Reputation and Market Value
The average cost of a data breach was
$4.45 million in 2023, the highest average
on record. (IBM)
Security breaches increased in 2021
by 68 percent. (CNET)
8
https://www.ibm.com/reports/data-breach

How Data Becomes Dark and How to
Recover from It
●Data Creation
(automatic system logs, user-generated content, business transactions, etc.)
●Active Use and Storage
(data is often actively used and stored in an organized manner)
●Data Silos and Lack of Accessibility
(data is stored in isolated systems or formats, leading to a lack of accessibility and visibility)
○Intervention: Encourage inter-departmental data sharing and adopt integrated data management systems
●Neglect or Improper Management
(failure to delete obsolete data, improper categorization, or simply forgetting about its existence)
○Intervention: Regular data hygiene practices and clear data governance policies can prevent neglect
●Becoming Dark Data
(data becomes dark: unused, unanalyzed, and potentially risky due to outdated or sensitive information)
○Intervention: Adopt advanced data analytics, AI for data classification, and proactive data management strategies
Increased risk of data breaches, compliance violations, and missed opportunities for insights
Dark data, if left unchecked, poses risks to privacy, security, and compliance, and represents a significant loss of potential insights and opportunities
9

Using AI to Manage
Dark Data: A Case Study
10

The Cost to the Enterprise (a Case Study)
●A leading federal research organization
identified a significant challenge in
managing its vast amounts of unstructured
data.
○The organization's data landscape was
cluttered with dark data, including
project documents, proposals, and
research papers, some of which contained
“classified” government information.
○This data was scattered across various
platforms such as shared drives, email
servers, and cloud storage solutions.
○This led to inefficiencies in data
access, increased risk of sensitive data
exposure, and difficulties in complying
with stringent federal regulations.


11

The Solution
AI-Powered Dark Data Identifier
Original State
●The organization is aware of the
existence of dark data but has
no scalable automation in place
to identify such content beyond
rule-based scripts that classify
content with PII

The Need
●A more flexible and
sophisticated approach to
automatically identifying
classified information and
evolving categories of sensitive
content specific to the
organization, buried in
enterprise data assets that are
not properly labeled, before
leveraging Generative AI across
the enterprise to boost
employee productivity

Solution
●Implemented data pipelines
to connect and extract data
siloed in different systems.

●Enabled hybrid content
classification based on
predefined sensitivity rules to
give the organization the ability
to identify overshared sensitive
content and remediate access.

●Built a BI dashboard to provide
system administrators a clear
view into dark data in the
organization.

12

Overall Solution Approach
to Dark Data Discovery
●Integration with Enterprise Systems
●Proof of Concept
●Access Control Mechanism
●Solution Architecture & Design
●Compliance & Ethical Guidelines
●Data Classification & Tagging
13

Dark Data Discovery using AI: Demo Time!
14

From Dark Data to Insights: The AI Connection
Metadata Ingestion
Dark Data Identification
Dark Data Discovery
●Crawl through enterprise
data and employ a hybrid
approach of deterministic
and probabilistic methods,
including pattern matching
and AI/ML models
●Flag sensitive content with
overly permissive access,
enabling administrators to
adjust access levels and
safeguard confidential
information
●Based on a sensitivity rules
database with three
categories of rules
●Clear view into uncovered
dark data and rationale
●Ensures a robust, adaptable
framework for securing
sensitive data across various
organizational contexts
15
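
The hybrid approach described on this slide (deterministic pattern matching combined with a probabilistic model) could be sketched roughly as follows. This is an illustrative sketch only, not the presenters' implementation: the regex patterns, the keyword weights, the `Finding` type, and the 0.5 threshold are all assumptions, and the keyword scorer is a simple stand-in for a real AI/ML or LLM classifier.

```python
import re
from dataclasses import dataclass

# Deterministic rules: hypothetical regex patterns, for illustration only.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

# Stand-in for a probabilistic model: weighted keywords, not a trained classifier.
KEYWORD_WEIGHTS = {"confidential": 0.5, "proprietary": 0.4, "salary": 0.3, "patent": 0.3}


@dataclass
class Finding:
    doc_id: str
    sensitive: bool
    rationale: list[str]  # why the document was flagged (empty if it was not)


def model_score(text: str) -> float:
    """Toy score in [0, 1]; a real system would call an ML or LLM classifier here."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return min(1.0, sum(w for k, w in KEYWORD_WEIGHTS.items() if k in words))


def classify(doc_id: str, text: str, threshold: float = 0.5) -> Finding:
    """Flag a document if any pattern matches or the model score reaches the threshold."""
    rationale = [f"pattern:{name}" for name, rx in PATTERNS.items() if rx.search(text)]
    score = model_score(text)
    if score >= threshold:
        rationale.append(f"model_score:{score:.2f}")
    return Finding(doc_id, bool(rationale), rationale)
```

Keeping the rationale alongside each flag mirrors the slide's point about giving administrators a clear view into both the uncovered dark data and the reason it was flagged.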

Conceptual Architecture for
Dark Data Discovery
16

Data Classification -> Dark Data Discovery
Data Classification
●Categorize organizational data assets
based on sensitivity, criticality and
usage
●Typical classes: public, internal,
confidential, restricted
Dark Data Discovery
●Define organizational data categories
●Define data class for each category
●Classify data asset using AI/ML and rules
●Flag sensitive content shared broadly for
security review
●Remediate access to overshared
sensitive content
17
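
One way to picture the "flag sensitive content shared broadly" step is a comparison between an asset's data class and the breadth of its current audience. The sketch below assumes the four typical classes named on the slide. The audience scopes, the `Asset` type, and the mapping from audience to maximum allowed class are hypothetical and would come from an organization's own policy.

```python
from dataclasses import dataclass

# Ordered from least to most sensitive, using the typical classes from the slide.
CLASS_RANK = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}

# Hypothetical audience scopes and the most sensitive class each one may see.
MAX_CLASS_FOR_AUDIENCE = {
    "anyone_with_link": "public",
    "all_employees": "internal",
    "department": "confidential",
    "named_individuals": "restricted",
}


@dataclass
class Asset:
    path: str
    data_class: str  # one of CLASS_RANK
    audience: str    # one of MAX_CLASS_FOR_AUDIENCE


def is_overshared(asset: Asset) -> bool:
    """True when the asset is more sensitive than its audience is allowed to see."""
    allowed = MAX_CLASS_FOR_AUDIENCE[asset.audience]
    return CLASS_RANK[asset.data_class] > CLASS_RANK[allowed]


def review_queue(assets):
    """Overshared assets, most sensitive first, for security review."""
    flagged = [a for a in assets if is_overshared(a)]
    return sorted(flagged, key=lambda a: CLASS_RANK[a.data_class], reverse=True)
```

Remediation would then amount to narrowing the audience of each queued asset until `is_overshared` returns false.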

Dark Data Discovery Rules: Data Class to Role Mapping
1. Unique
Definition: Rules that cover types of sensitive information specific to an organization.
Examples: Pending Patents, Business Plans, Private Source Code, Engineering Processes
2. Common
Definition: Rules that cover types of sensitive information that are not universal but fairly common across geographical regions and enterprise domains.
Examples: Country/State specific govt issued IDs, system specific logs collecting sensitive information, salary information, health records
3. Core
Definition: Rules that cover types of sensitive information that are usually present in any enterprise data set.
Examples: SSN, Location, DOB, Gender, Ethnicity, Residency History
18
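
The three rule categories on this slide suggest a rules database in which each rule carries a tier and, where relevant, a scope. A minimal sketch of such a structure follows. The `Tier` and `Rule` types, the `scope` field, the rule names, and the example region "US" and organization "acme" are assumptions made for illustration, not details from the presentation.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CORE = "core"      # usually present in any enterprise data set
    COMMON = "common"  # fairly common across regions and enterprise domains
    UNIQUE = "unique"  # specific to one organization


@dataclass(frozen=True)
class Rule:
    name: str
    tier: Tier
    scope: str = "*"  # hypothetical: region (COMMON) or organization (UNIQUE)


# Hypothetical rule entries loosely based on the examples listed on the slide.
RULES = [
    Rule("ssn", Tier.CORE),
    Rule("date_of_birth", Tier.CORE),
    Rule("state_issued_id", Tier.COMMON, scope="US"),
    Rule("health_record", Tier.COMMON, scope="US"),
    Rule("pending_patent", Tier.UNIQUE, scope="acme"),
    Rule("private_source_code", Tier.UNIQUE, scope="acme"),
]


def active_rules(region: str, organization: str) -> list[str]:
    """Names of the rules that apply to one organization operating in one region."""
    selected = []
    for rule in RULES:
        if rule.tier is Tier.CORE:
            selected.append(rule.name)
        elif rule.tier is Tier.COMMON and rule.scope in ("*", region):
            selected.append(rule.name)
        elif rule.tier is Tier.UNIQUE and rule.scope == organization:
            selected.append(rule.name)
    return selected
```

With this layout, adapting to a new organization means adding UNIQUE rules for it while the CORE and COMMON tiers stay shared.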

Results Summary: Findings
~30K documents were scanned and analyzed by AI as the first PoC
Identified documents with sensitivity label and access mismatch
●Success rate can be improved
through targeted discovery
rules specific to the
organization’s content

●Structured/semi-structured
sources such as glossaries, data
collection spreadsheets,
invoices, receipts not easily
classifiable
~30K ~30K Impact
19

Powering Secure Data Management using AI
Hybrid Solution
Leverage SME-defined Rules + best-in-class LLMs

Extensible Framework
Adjust rules to match evolving landscape of data privacy

Comprehensive View into Dark Data
Generate actionable insights to support security SMEs in upholding regulatory compliance

Data Security, Regulatory Compliance, Safe AI

Organizations belonging to highly regulated sectors such as Banking, Insurance, Finance, Healthcare, Pharmaceuticals, Automotive and Construction can readily leverage EK’s Dark Data Identification Service for managing regulatory compliance
20

The Bigger Impact:
Beyond Dark Data

21

Domain Specific Application
1. Identification  2. Assessment  3. Action  4. Monitoring  5. Reporting

Finance
1. Identification of transaction records and communications.
2. Risk assessment for data breaches and non-compliance with regulations like SOX, GDPR.
3. Encryption of sensitive data, implementation of access controls.
4. Continuous monitoring for unusual access patterns.
5. Regular reporting to regulatory bodies and internal audits.

Insurance
1. Identification of sensitive information through scanning claim forms and customer interactions.
2. Compliance assessment with HIPAA (in health insurance) and other sector-specific laws.
3. Anonymization of personal identifiers, secure data storage solutions.
4. Real-time monitoring for compliance adherence.
5. Compliance status reports to state insurance boards.

Healthcare
1. Identification of patient data in clinical trial records.
2. Assessing compliance with FDA regulations and data protection laws.
3. Data segregation, rigorous data access controls.
4. Monitoring access to clinical data.
5. FDA audit reports and internal compliance reviews.

Automotive
1. Identification of unstructured data within vehicle testing reports and manufacturing data.
2. Assessing compliance with safety standards and environmental regulations.
3. Implementing data retention policies and safety data protocols.
4. Continuous monitoring of compliance with emissions and safety regulations.
5. Reporting to automotive safety and environmental agencies.
22

Broader Content Classification Use Cases
01 Records Management
●For each content item without a content type, auto-assign a content type
●Assign record codes based on the organization’s record schedule for that
content type
02 Search and Discovery
●Tag content using predefined categories
●Index tags and other metadata about the content into a search engine to
power search and discovery
03 Survey Analysis
●Train an AI/ML model to categorize qualitative survey responses into
predefined categories such as usability and technical complexity (if the survey
is for an application, as an example)
●Use the model to categorize survey responses at scale and direct each group of
responses to the right team for further analysis
04 Content Moderation
●Content generated using Gen AI tools must be effectively moderated
●Manual content moderation can be both time-consuming and flawed
●AI-powered content moderation can reduce this cognitive load
05 Customer Support
●Use a multi-class classification algorithm to classify customer tickets by
predefined categories such as complaint, feedback, question, and so on
●Integrate classifier output with ticket workflow so that customer support
agents can focus on urgent tickets
23
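
The customer support use case (multi-class classification of tickets, with the output feeding the ticket workflow) can be illustrated with a small multinomial Naive Bayes classifier written in plain Python. This is a toy sketch under stated assumptions: the training tickets are invented, the choice of "complaint" as the urgent label in `triage` is hypothetical, and a production system would more likely use an established ML library or an LLM.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesTicketClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def triage(classifier, tickets, urgent_label="complaint"):
    """Move tickets predicted as urgent to the front, keeping relative order otherwise."""
    return sorted(tickets, key=lambda t: classifier.predict(t) != urgent_label)


# Invented training tickets, for illustration only.
TRAIN = [
    ("the app crashed and I want a refund", "complaint"),
    ("terrible service, my order arrived broken", "complaint"),
    ("love the new dashboard, great work", "feedback"),
    ("the redesign looks great, nice job", "feedback"),
    ("how do I reset my password", "question"),
    ("where can I download my invoice", "question"),
]
clf = NaiveBayesTicketClassifier().fit(*zip(*TRAIN))
```

Calling `triage(clf, incoming_tickets)` would then return the tickets with predicted complaints first, which is one simple way to connect classifier output to an agent queue.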

Questions?
Thank you for listening.

We are happy to take any
questions at this time.
Urmi Majumder
[email protected]
www.linkedin.com/in/urmim/
Maryam Nozari
[email protected]
www.linkedin.com/in/maryamnozaiphd/
24