Content classification - where is my stuff?

mpeter22 15 views 47 slides Jun 07, 2024

About This Presentation

This slideshow explains why it is important to classify the documents in an enterprise.


Slide Content

Content Classification – Where’s My Stuff?
IBM Confidential

Agenda
Why Classify?
How Does Classification Work?
Content Classification Features
Taxonomy Basics
Starting a Classification Project
Best Practices for Classification
Real World Example
Closing thoughts

Why Classify?
Content that is not properly classified is not accessible
–One in two business leaders don’t have access to the information they need to do their jobs
Quality of decision-making suffers when content is not accurate
–One in three business leaders frequently make business decisions based on information they lack or don’t trust
Companies face difficulty in deriving full visibility and insight into the breadth and depth of unstructured content
–77% of CEOs don’t have immediate information to make key business decisions
Sources:
IBM 2010 CEO & CFO Studies,
IBM 2010 Break Away With Business Analytics and Optimization Study

Why Classify?
What if you walked into a library and there was no Dewey Decimal System?
What about the hardware store, the grocery store, the clothing store?
Do you park your car in the living room and place your sofa in the garage?
You have:
Millions of pieces of content
Hundreds of repositories
Thousands of workers
You need to:
Find relevant content, quickly
Accurately, consistently categorize content
Gather meaning and understanding from the content
Everything in our life is categorized and classified in some way

Why Classify?
You have been storing content for many years, but…
can you find it when you need it?
can you produce it for audits and litigation?
can you gain insight from it?
How does your organization go from this… to this?


Why Classify?
Can you find relevant content, quickly?
–“Search, Refine, Repeat” is no longer acceptable
–Image Capture, Content Collection, Enterprise Search
Are you uncovering business insight from your content?
–Organized content produces better insight
–Content Analytics
Is the right content available at the right time?
–Business processes require timely access to content
–Business Process Management, Case Management
Are you complying with Legal and Business mandates?
–Content has a compliance lifecycle that must be enforced
–Content Collection, Enterprise Records, eDiscovery
Accessibility, Usability, Compliance, Analytics

Why Classify?
Automated Classification makes information accessible, leaving your workers to focus on important business tasks rather than searching, over and over, for relevant content
Classification provides enhanced content usability by automating routing decisions based on the meaning of the text in your content
Advanced Classification, combined with collection and records, enables your company to comply with business and legal mandates
Classification augments Content Analytics by providing extended facet navigation and content clustering, delivering added analysis and insight


How does Classification work?
CLASSIFICATION AS A FACTORY WORKER
Think of a worker at the end of an assembly line
The task is to sort items coming down the line into the correct containers
Four possible item types on the line:
–Can
–Box
–Bottle
–Jar
How do you tell the factory worker which is which?
Start with the item to the right as a ‘can’ reference model
–6.5” high
–Red with blue & white lettering
–3.5” diameter
–Opened with a tab
–Contains liquid

How does Classification work?
Based on initial assumptions, which of these are “cans”?
What are our identification parameters?
─Shape?
─Capacity/size?
─Contents (liquid vs. solid)?
─Method of opening?
─Construction material?
Based on the original reference model, which of these is a can?
─6.5” high
─Red with blue & white lettering
─3.5” diameter
─Opened with a tab
─Contains liquid

How does Classification work?
The analogy applies directly to category definition and corpus selection
Document classification involves the same problems
–What is an “Accounting and Finance” document?
•How can we differentiate it from a “Legal” document?
•How about “Regulatory?”
–How do humans tell which is which?
•Keywords
•Phrases
•Intent
Some distinctions are clear…
–Legal vs. Engineering
–Personnel vs. Operations
–Manufacturing vs. Advertising
Others are not…
–Legal vs. Regulatory
Classification effort depends on your environment

How does Classification work?
Context-Based Classification
Business Information
Category ‘A’ Marketing
–Intellectual Property is essential
–Legal is changing the timeframe for contract approval
–Legal is currently requiring full approval
Category ‘B’ Engineering
–Engineering drafts require approval
–Engineering requires skilled software staff
–Engineering requires clear requirements
Category ‘C’ Strategy
–Strategy should look out over 36 months
–Strategy is important to the marketing team
Unclassified statement (?): “The core market for this new product has been defined as such by IBM”
–Classified by context as Category ‘A’

How does Classification work?
Content Classification combines multiple categorization technologies to deliver automatic classification
–Uses natural language processing and semantic analysis
–Uses rules based on metadata or confidence score
–The methods can be used in tandem or separately, depending on requirements
To: Bob Smith <[email protected]>
From: Bill Roker <[email protected]>
Subject: Contract?
Bob,
Hope you’re doing well.
A quick note to see if the payment came through, as prescribed by
the contract? It would be terrible to have the firm sued over such a
simple financial matter. No one wants this project to be derailed.
Regards,
Bill
Bill Roker
212-555-1234
Financial Advisors, Inc.
Does the email contain the word “contract”?
Does the sender belong to the broker email group?
Does the email have anything that matches the pattern “XXX-YY-ZZZZ”?
Natural Language Processing + Semantic Analysis + Targeted Rules =
Comprehensive Content Classification
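
The three rule checks on this slide can be sketched in Python. This is only an illustration, not the product's rule syntax: the function name `check_email`, the `BROKER_GROUP` set, and the example address are hypothetical, and the pattern “XXX-YY-ZZZZ” is assumed to mean three digits, two digits, and four digits.

```python
import re

# Hypothetical membership list for the "broker" email group.
BROKER_GROUP = {"bill.roker@example.com"}

# Assumed reading of "XXX-YY-ZZZZ": 3 digits, 2 digits, 4 digits.
ID_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def check_email(sender: str, body: str) -> dict:
    """Evaluate the three example rules against one email."""
    return {
        "mentions_contract": re.search(r"\bcontract\b", body, re.IGNORECASE) is not None,
        "sender_in_broker_group": sender.lower() in BROKER_GROUP,
        "has_id_pattern": ID_PATTERN.search(body) is not None,
    }


body = (
    "A quick note to see if the payment came through, as prescribed by "
    "the contract? Bill Roker 212-555-1234"
)
result = check_email("bill.roker@example.com", body)
```

Under this reading of the pattern, the phone number in the signature (3-3-4 digits) does not trigger the third rule, which shows why targeted rules are combined with language analysis rather than used alone.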


Content Classification Features
1.Automatic Categorization of documents and emails
–Analyzes the content of documents and emails in order to categorize them
–Uses natural language processing and semantic analysis
–Handles imperfect language (misspellings, abbreviations, poor grammar)
–Assigns a confidence score to each category suggestion (0–100)
–Learns from examples or keywords
•Creates a profile for each category by analyzing sample texts
•Categories can also be defined by keywords
2.Combines classification methods using text analysis and rules processing
–Rules based on metadata can be defined in combination with classification based on confidence score
–Language identification capability can be used in tandem with rules

3.Learns in real-time
–Can adapt based on feedback from end users or administrators
–Feedback is incorporated into analysis on-the-fly for immediate adaptation
4.Classification Workbench configuration tool
–Enables the creation and maintenance of Knowledge Bases and Decision Plans
–Facilitates classification tune-up and reporting
5.Integrated with IBM ECM offerings
–Application for bulk classification of content upon ingestion to a repository, and for bulk classification and reclassification of content already under management
–Integrated with Datacap, Content Collector, Enterprise Records, Analytics, etc.
6.Taxonomy Creation Assistance
–Suggests new taxonomies for organizations that do not have them
–Suggests new elements for existing taxonomies
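
As a rough illustration of “learns from examples”, “assigns a confidence score (0–100)” and “adapts based on feedback”, here is a toy Python classifier. It is a stand-in technique (word-count profiles compared by cosine similarity), not the product's algorithm; the class name `ProfileClassifier` and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


class ProfileClassifier:
    """Toy classifier: one word-count profile per category."""

    def __init__(self) -> None:
        self.profiles: dict[str, Counter] = defaultdict(Counter)

    def learn(self, category: str, text: str) -> None:
        """Add a sample text (or user feedback) to a category profile."""
        self.profiles[category].update(tokenize(text))

    def score(self, text: str) -> dict[str, int]:
        """Return a 0-100 score per category (cosine similarity x 100)."""
        doc = Counter(tokenize(text))
        doc_norm = math.sqrt(sum(v * v for v in doc.values()))
        scores = {}
        for category, profile in self.profiles.items():
            dot = sum(doc[w] * profile[w] for w in doc)
            norm = doc_norm * math.sqrt(sum(v * v for v in profile.values()))
            scores[category] = round(100 * dot / norm) if norm else 0
        return scores


clf = ProfileClassifier()
clf.learn("Legal", "contract clause liability agreement")
clf.learn("Finance", "invoice payment budget ledger")

before = clf.score("payment due on invoice")
clf.learn("Finance", "payment due on invoice")  # feedback from a reviewer
after = clf.score("payment due on invoice")
```

Feeding a confirmed document back through `learn` raises the Finance score for similar text on the next call, which is the feedback behavior the slide describes, in miniature.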

Content Classification Features – Knowledge Base
A knowledge base contains the learned information that Classification needs to perform matching, training, and online learning
It is filled with relevant statistical and semantic information derived from sample texts
Statistical entities consist of words, number of occurrences, hints about the text, and distance between words
A knowledge base is created and maintained through the Workbench application
1.Collect and organize sample content
2.Create, analyze, and learn
3.Assess performance, review reports
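
To make “statistical entities” concrete, the sketch below collects two of the items named on this slide from a sample text: word occurrence counts and the distance between words. The function name `text_statistics`, the `window` parameter, and the choice of recording the minimum distance per word pair are assumptions for illustration; the actual knowledge base format is not described in the slides.

```python
import re
from collections import Counter
from itertools import combinations


def text_statistics(text: str, window: int = 5):
    """Collect word counts and the minimum distance between nearby word pairs."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    distances: dict[tuple, int] = {}
    for i, j in combinations(range(len(words)), 2):
        gap = j - i  # distance in word positions
        if gap > window or words[i] == words[j]:
            continue
        pair = tuple(sorted((words[i], words[j])))
        distances[pair] = min(gap, distances.get(pair, gap))
    return counts, distances


counts, distances = text_statistics("contract approval requires legal approval")
```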

Content Classification Features – Decision Plan
A Decision Plan is a collection of rules that you configure to determine how content is classified
A Decision Plan is developed by configuring one or more rules based on content or metadata
Each rule consists of one trigger and one or more actions
–Example: Trigger: “If Title contains ‘Contract’”, then Actions: “Assign to Contracts Category” and “Move to Contracts folder”
Rules can use strings, word distance, regular expressions, pattern extraction, and Boolean expressions
Actions include set properties, invoke analysis, move to folder, declare record, custom actions, and more
Decision Plans can be used with or without a Knowledge Base
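
The “one trigger, one or more actions” structure can be sketched as follows, using the Contracts example from this slide. The `Rule` class, `run_decision_plan`, and the dictionary shape used for a document are hypothetical; real Decision Plans are configured in the Workbench, not written as Python.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """One trigger plus one or more actions."""
    trigger: Callable[[dict], bool]
    actions: list[Callable[[dict], None]]


def run_decision_plan(rules: list[Rule], doc: dict) -> dict:
    """Apply, in order, every rule whose trigger fires."""
    for rule in rules:
        if rule.trigger(doc):
            for action in rule.actions:
                action(doc)
    return doc


def assign_contracts_category(doc: dict) -> None:
    doc["category"] = "Contracts"


def move_to_contracts_folder(doc: dict) -> None:
    doc["folder"] = "Contracts"


# Trigger: "If Title contains 'Contract'"; two actions follow.
contracts_rule = Rule(
    trigger=lambda doc: "contract" in doc.get("title", "").lower(),
    actions=[assign_contracts_category, move_to_contracts_folder],
)

doc = run_decision_plan([contracts_rule], {"title": "Service Contract 2024"})
```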


Content Classification – Taxonomy Basics
Taxonomy
1.The science or technique of classification.
2.A classification into ordered categories.
3.The science dealing with the description, identification, naming, and classification of organisms.
Business Taxonomy
1.Usually follows a line-of-business hierarchy
2.Logical grouping of content for business, repository, or compliance purposes
3.Generally “flattened” for better control and management
Figure labels: 7 levels, 3-4 levels

Content Classification – Taxonomy Basics
The Goldilocks Zone
“Too Many Categories”
1000 categories is probably too many
Figure: a deep example hierarchy with the nodes Company, Claims, Vehicle, Boat, Yacht (< 20 Ft., < 32 Ft., < 46 Ft.), Dinghy, Cruise, Motorcycle, Auto, Make, Model, RV, Home, Single, Brick, Wood, Mobile, Health, Policies, Finance

Content Classification – Taxonomy Basics
The Goldilocks Zone
“Too Few Categories”
10 categories is probably too few
Figure: Company with a single level of categories: Claims, Policies, HR, Legal, Finance

Content Classification – Taxonomy Basics
The Goldilocks Zone
“Just Right”
Somewhere around 100 categories is probably just right
Figure: a moderately deep example hierarchy with the nodes Company, Claims, Auto, Home, Policies, HR, Employee Policies, Purchasing, Legal, Finance, Contracts, Reporting, Budget

Content Classification – Taxonomy Basics
Taxonomies are important, but…
They do not have to be complex or unwieldy
They need to be acceptable to different areas of the organization
─Finance, Legal, HR, IT
Your organization may have a formal, internal taxonomy
─If so, start there, but it may have to be flattened
Your organization may have a de facto taxonomy
─ECM document classes, folders, file system structures, and departmental structures may be enough to start
Publicly available or 3rd-party taxonomies may be used
─Again, these may have to be flattened
How are humans classifying today?
─Are workers filing paper in folders, drawers, cabinets?
─Are workers putting content in ECM, file systems, folders?
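
“Flattening” a taxonomy can be pictured as cutting a category tree off below a chosen number of levels. The sketch below assumes a taxonomy stored as nested dictionaries; the functions `flatten` and `count_categories` are hypothetical helpers, and the nesting of the sample categories is illustrative only.

```python
def flatten(taxonomy: dict, max_depth: int, depth: int = 1) -> dict:
    """Return a copy of the tree with everything below max_depth levels removed."""
    if depth >= max_depth:
        return {name: {} for name in taxonomy}
    return {
        name: flatten(children, max_depth, depth + 1)
        for name, children in taxonomy.items()
    }


def count_categories(taxonomy: dict) -> int:
    """Count every node in the tree."""
    return sum(1 + count_categories(children) for children in taxonomy.values())


# Illustrative nesting; content filed under removed nodes would move to the parent.
deep = {
    "Claims": {
        "Vehicle": {"Boat": {"Yacht": {}}, "Auto": {}},
        "Home": {},
    },
    "Finance": {},
}
flat = flatten(deep, max_depth=2)
```

Counting categories before and after gives a quick check against the “Goldilocks” guidance: here the tree shrinks from 7 categories to 4.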


Starting a Classification Project
Approaches
–Taxonomy Proposal through Content Clustering
–Taxonomy Creation through “Seeded” Keywords
–Taxonomy Creation through Manual Content Gathering
–Knowledge Base Creation through Content Extraction

Starting a Classification Project
Taxonomy Proposal through Content Clustering
─We don’t know what we don’t know
─Starting from a blank sheet
Figure labels (process flow): gather, crawl, cluster (A, B, C, D), evaluate, categorize, create, Trained Knowledge Base (A, B, C, D)
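
One way to picture the clustering step is a greedy grouping of documents by shared vocabulary, where each resulting group is a candidate category for a human to review and name. This is a deliberately simple stand-in (Jaccard similarity on word sets), not the clustering method used by the product; `cluster`, `jaccard`, the threshold, and the sample documents are all assumptions.

```python
import re


def words(text: str) -> set[str]:
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of words the two sets have in common."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(documents: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Greedy clustering: join the first cluster whose seed document is similar enough."""
    clusters: list[list[str]] = []
    for doc in documents:
        for members in clusters:
            if jaccard(words(doc), words(members[0])) >= threshold:
                members.append(doc)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([doc])
    return clusters


proposed = cluster([
    "invoice payment overdue",
    "contract clause liability",
    "payment invoice received",
    "liability clause amended",
])
```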

Starting a Classification Project
Taxonomy Creation through “Seeded” Keywords
─We know what we don’t know
─Starting from a blank sheet
Figure labels (process flow): Keyword Seeded taxonomy (keyword, keyword, keyword), gather, crawl, Keyword-based content set, Workbench review, Knowledge Base creation, evaluate & tune, Trained Knowledge Base (A, B, C, D)

Starting a Classification Project
Taxonomy Creation through Manual Content Gathering
─We know what we don’t know
─Starting with known content
Figure labels (process flow): Strawman Taxonomy (A, B, C), Manual content gathering, Manually gathered content set (A, B, C, D), Knowledge Base creation, evaluate & tune, Trained Knowledge Base (A, B, C, D)

Starting a Classification Project
Knowledge Base Creation through Content Extraction
─We know what we know
─Starting with known content and taxonomy
Figure labels (process flow): Established ECM Repository, Content extraction, Extracted content set (A, B, C, D), Knowledge Base creation, evaluate & tune, Trained Knowledge Base (A, B, C, D)


Best Practices for Classification
(or All I Really Need to Know about Classification, I Learned in Kindergarten)
Look
Listen
Learn

Best Practices for Classification
Look
─In order to properly classify, you need to know your content
─Understand how your content is created and by whom
─Understand how content is used in your business
─Understand the meaning and purpose of content
─Set realistic expectations
─100% automation with 100% accuracy is rare
─Balance automation expectations with accuracy requirements
Example document:
─This is a resume
─It is used by Human Resources and hiring managers
─It is a text document
─The purpose is to aid the hiring process
─The document may have compliance value

Best Practices for Classification
Listen
─All content owners and users have a stake in proper classification
─Gather input and consider all aspects of content, users and organizations
─Define categories based on business use
•Categories should represent organizational content, not organizational structure
•Taxonomies are less hierarchical and flatter than “standard” taxonomies
Figure comparing a hierarchical and a flat taxonomy; labels: Marketing, Advertising, Public Relations, Store Operations, Store Management, Sales, Catalog, Human Resources, Employee Management, Benefits, Training, Legal, Contracts, Audit, Records & Retention, Finance, Corporate Reporting, Pricing, AP/AR

Best Practices for Classification
Learn
─Training is iterative; the system improves and learns over time
─Training sets must contain “high value” examples
─Number of training documents varies by organization (~20 to ~50, rule of thumb)
─Hundreds of documents are less useful than 20 well-selected documents
─More is not better, it’s just more
─Addition of new categories affects existing categories
─Some categories may perform well immediately, others may require additional effort
─Categories may “drift” over time (content intent, phrases, business changes, etc.)
─Learning requires the active use of feedback capabilities
Remember what Grover taught us…
“Three of these things belong together...”
Classification systems have to learn…
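
Because some categories perform well immediately while others need more effort, it helps to measure results per category on each training iteration. The sketch below computes precision and recall per category from a reviewed sample; the function name `per_category_report` and the sample labels are hypothetical, and the slides do not prescribe these particular metrics.

```python
from collections import Counter


def per_category_report(expected: list[str], predicted: list[str]) -> dict:
    """Precision and recall per category from two parallel label lists."""
    hits = Counter(e for e, p in zip(expected, predicted) if e == p)
    expected_totals = Counter(expected)
    predicted_totals = Counter(predicted)
    report = {}
    for category in sorted(set(expected) | set(predicted)):
        n_pred = predicted_totals[category]
        n_exp = expected_totals[category]
        report[category] = {
            # Of the documents assigned to this category, how many were right?
            "precision": hits[category] / n_pred if n_pred else 0.0,
            # Of the documents that belong here, how many were found?
            "recall": hits[category] / n_exp if n_exp else 0.0,
        }
    return report


expected = ["Legal", "Legal", "Finance", "Finance", "HR"]
predicted = ["Legal", "Finance", "Finance", "Finance", "HR"]
report = per_category_report(expected, predicted)
```

In this sample, Legal has low recall (one Legal document was filed under Finance), so Legal is the category that would get additional training examples in the next iteration.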

Best Practices for Classification – Summary
Categories
─Should be content driven and represent organizational content, not the organization chart
Taxonomies
─Less hierarchical, generally flatter and less formal than “standard” taxonomies
Training Sets
─Training sets should be consistent with actual content and represent “high-value” content
─Clear delineation of content between the various categories
Ongoing monitoring and training
─Training is iterative; similar to business process optimization, it improves over time
Set realistic expectations with business users
─Balance automation expectations with accuracy requirements
Engage competent and experienced service providers to assist with the initial classification project


Real World Example
Image Capture and Classification
Integration between Datacap Taskmaster and Content Classification brings the power of image capture and automated classification together
Content Classification uses text analytics and statistical probability to add another recognition approach to Taskmaster’s vast array of methods

Real World Example
Image Capture and Classification
Classification Challenges
What type of document is this?
–to vary processing by type
What pages contain the data I need?
–to extract or key in the proper fields
Do the documents contain the correct pages?
–to ensure that the documents are “in good order” and not missing information
What is the business meaning of this document?
–to get the document to the right person or process with the right priority

Real World Example
Image Capture and Classification
The Separation Challenge
Where does one document end and the next begin?
Figure labels: Here? Here? Here? Here?
Traditional Methods
–Patch & Barcoded Separator Sheets
–Barcode Labels and Documents
–Manual Identification
–Paper Sorting
Shortcomings
–Labor-intensive
–Relies on worker knowledge to correctly identify and sort the documents
–Externally generated documents cannot be barcoded

Real World Example
Image Capture and Classification
Datacap Taskmaster & Classification for Separation & Page Identification
Taskmaster examines each page using multiple methods
–The fastest methods are done first: barcode, pattern match, and fingerprint
–The slower methods that require OCR follow: text analytics and keywords
–Rules examine the context to determine if any remaining pages can be identified based on the surrounding pages
–Taskmaster calls Content Classification to help identify pages
–Taskmaster separates and assembles the pages into documents
Content Classification analyzes the text content
–Statistical analysis compares the text on a page to a knowledge base to find the closest match
–Assigns a confidence score to each category suggestion (0–100)
–Returns the Classification results to Taskmaster
–Classification feedback loop improves future results by providing feedback to the classification engine
Exceptions and low-confidence results are reviewed and classified by users
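
The “fastest methods first, review the low-confidence results” flow can be sketched as a simple cascade. Everything here is illustrative: the method functions, the page dictionary shape, the sample page types, and the threshold of 70 are assumptions, and `by_text_classifier` merely stands in for a call to a classification engine after OCR.

```python
from typing import Callable, Optional

# Each method returns (page_type, confidence 0-100), or None if it cannot decide.
Method = Callable[[dict], Optional[tuple[str, int]]]


def by_barcode(page: dict):
    """Fast method: a barcode identifies the page with full confidence."""
    return (page["barcode"], 100) if page.get("barcode") else None


def by_text_classifier(page: dict):
    """Slow method: stand-in for text classification on OCR output."""
    text = page.get("ocr_text", "").lower()
    if "promissory note" in text:
        return ("Note", 90)
    if "note" in text:
        return ("Note", 40)
    return None


def identify_page(page: dict, methods: list[Method], threshold: int = 70) -> dict:
    """Try methods in order; flag the page for review if none is confident enough."""
    best = None
    for method in methods:
        result = method(page)
        if result and result[1] >= threshold:
            return {"type": result[0], "confidence": result[1], "review": False}
        if result and (best is None or result[1] > best[1]):
            best = result
    page_type, confidence = best if best else (None, 0)
    return {"type": page_type, "confidence": confidence, "review": True}


methods = [by_barcode, by_text_classifier]  # fastest first
barcoded = identify_page({"barcode": "Appraisal"}, methods)
confident = identify_page({"ocr_text": "Promissory Note dated May 1"}, methods)
uncertain = identify_page({"ocr_text": "see attached note"}, methods)
```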

Bank specializing in mortgage loan servicing
Slashing costs with IBM Production Imaging Edition and IBM Content Classification
The solution is targeted to reduce costs by automating the classifying, keying and filing of millions of pages of loan documentation per day.
The need
•Reduce paper document scanning and processing costs
•Reduce loan servicing customer service costs
•Processing volumes can exceed 100 million scanned pages per month
The solution
The company contracted with IBM partner Imagine Solutions to implement IBM Production Imaging Edition (PIE) and IBM Classification Module software
•PIE - Datacap Taskmaster scans and imports paper documents
•PIE - Datacap Taskmaster rules classify documents to the page level using barcodes, image fingerprint pattern matching, regular expressions, and text analytic classification
•IBM Classification Module classifies pages using text analytics
•Taskmaster extracts text and data fields using optical character recognition (OCR)
•Data collection, statistical reporting, and feedback loops improve accuracy and configuration tuning
•PIE - FileNet Content Manager securely stores the documents
•Acquisition and servicing processes are automated through web-based document access and PIE business process capabilities
Projected benefits
•Save millions of dollars of staff time by automating document classification, reducing data entry, and providing direct access to the loan documents with improved speed, accuracy, and granularity
•Save millions of dollars in per-page licensing fees associated with the competitively replaced Kofax KTM system
•Provide a platform that can be rapidly ramped up to handle high loads associated with portfolio acquisitions


Closing Thoughts
How can classification help my business?
Improve teaching programs and student learning
─Classify educational content through analysis of lesson plan text
Automatically code medical bills
─Interpret doctors’ notes and apply industry standard codes (ICD-9, ICD-10)
Reduce manual, human intervention
─Automatically evaluate email service requests and establish responses
Shorten process cycle time
─Distinguish mortgage, auto, personal, credit card loan applications
─Route content to the appropriate worker or process step
Automatically identify Personally Identifiable Information (PII) and Protected Health Information (PHI) in unstructured content
─Take actions such as file, record, route, redact
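
As a small illustration of the “redact” action, the sketch below replaces two kinds of PII with markers and reports what it found, so that a later step could file or route the document. The patterns are simplified assumptions (a US Social Security number shape and a basic email shape); real PII and PHI detection needs far broader coverage, and the names `PII_PATTERNS` and `redact` are hypothetical.

```python
import re

# Simplified, assumed patterns for illustration only.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact(text: str) -> tuple[str, list[str]]:
    """Replace each match with a [LABEL] marker and list the labels found."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"[{label}]", text)
        if count:
            found.append(label)
    return text, found


clean, labels = redact("SSN 123-45-6789, mail jo@example.com")
```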

Closing Thoughts
Classification is a powerful solution to automate the categorization of text-based content
Properly categorized content provides better accessibility, usability, compliance and analytics
Many factors lead to high-quality classification; consider and understand all of them
The keys to success are planning, preparation and persistence
─Is there any project that does not require these?
Automated classification allows you to cut costs associated with content capture, collection, archiving, retention, analysis and more
“Anything worth doing, is worth doing right.” – Hunter S. Thompson