Web mining and data mining seminar topic

dewashishpradhan010 12 views 45 slides Oct 19, 2024


Decision Support and Business Intelligence Systems
(9th Ed., Prentice Hall)
Chapter 7: Text and Web Mining

Copyright © 2011 Pearson Education, Inc. Publishing as Prentice Hall
Learning Objectives

Describe text mining and understand the
need for text mining

Differentiate between text mining, Web
mining and data mining

Understand the different application areas
for text mining

Know the process of carrying out a text
mining project

Understand the different methods to
introduce structure to text-based data

Learning Objectives

Describe Web mining, its objectives, and its
benefits

Understand the three different branches of
Web mining

Web content mining

Web structure mining

Web usage mining

Understand the applications of these three
mining paradigms

Opening Vignette:
“Mining Text for Security and
Counterterrorism”

What is MITRE?

Problem description

Proposed solution

Results

Answer and discuss the case
questions

Opening Vignette: Mining Text For Security…
Cluster 1: (L) Kampala, (L) Uganda, (P) Yoweri Museveni, (L) Sudan, (L) Khartoum, (L) Southern Sudan
Cluster 2: (P) Timothy McVeigh, (P) Oklahoma City, (P) Terry Nichols
Cluster 3: (E) election, (P) Norodom Ranariddh, (P) Norodom Sihanouk, (L) Bangkok, (L) Cambodia, (L) Phnom Penh, (L) Thailand, (P) Hun Sen, (O) Khmer Rouge, (P) Pol Pot

Text Mining Concepts

85-90 percent of all corporate data is in some
kind of unstructured form (e.g., text)

Unstructured corporate data is doubling in
size every 18 months

Tapping into these information sources is not
an option, but a need to stay competitive

Answer: text mining

A semi-automated process of extracting
knowledge from unstructured data sources

a.k.a. text data mining or knowledge discovery in
textual databases

Data Mining versus Text Mining

Both seek novel and useful patterns

Both are semi-automated processes

Difference is the nature of the data:

Structured versus unstructured data

Structured data: in databases

Unstructured data: Word documents, PDF
files, text excerpts, XML files, and so on

Text mining: first, impose structure on
the data, then mine the structured data

Text Mining Concepts

Benefits of text mining are obvious especially
in text-rich data environments

e.g., law (court orders), academic research
(research articles), finance (quarterly reports),
medicine (discharge summaries), biology
(molecular interactions), technology (patent files),
marketing (customer comments), etc.

Electronic communication records (e.g., email)

Spam filtering

Email prioritization and categorization

Automatic response generation

Text Mining Application Areas

Information extraction

Topic tracking

Summarization

Categorization

Clustering

Concept linking

Question answering

Text Mining Terminology

Unstructured or semistructured data

Corpus (and corpora)

Terms

Concepts

Stemming

Stop words (and include words)

Synonyms (and polysemes)

Tokenizing
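
The terms tokenizing, stop words, and stemming can be illustrated with a minimal sketch. The stop-word list and the suffix-stripping rule below are deliberately tiny assumptions made for illustration; they are not from the chapter and are far cruder than a real stemmer.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to", "for"}


def tokenize(text):
    """Tokenizing: split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(token):
    """Stemming: strip a few common English suffixes (very rough)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


print(preprocess("Mining the documents for patterns"))
```

Note that the crude rule reduces "mining" to "min", which shows why practical systems rely on a proper stemming algorithm and a curated stop-word list.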

Text Mining Terminology

Term dictionary

Word frequency

Part-of-speech tagging

Morphology

Term-by-document matrix

Occurrence matrix

Singular value decomposition

Latent semantic indexing

Text Mining for Patent Analysis
(see Application Case 7.2)

What is a patent?

“exclusive rights granted by a country to
an inventor for a limited period of time in
exchange for a disclosure of an invention”

How do we do patent analysis (PA)?

Why do we need to do PA?

What are the benefits?

What are the challenges?

How does text mining help in PA?

Natural Language Processing (NLP)

Structuring a collection of text

Old approach: bag-of-words

New approach: natural language processing

NLP is …

a very important concept in text mining

a subfield of artificial intelligence and
computational linguistics

the study of "understanding" natural
human language

Syntax versus semantics based text mining

Natural Language Processing (NLP)

What is “Understanding” ?

Humans understand; what about computers?

Natural language is vague, context driven

True understanding requires extensive
knowledge of a topic

Can (or will) computers ever understand natural
language as accurately as we do?

Natural Language Processing (NLP)

Challenges in NLP

Part-of-speech tagging

Text segmentation

Word sense disambiguation

Syntax ambiguity

Imperfect or irregular input

Speech acts

Dream of AI community

to have algorithms that are capable of
automatically reading and obtaining knowledge
from text

Natural Language Processing (NLP)

WordNet

A laboriously hand-coded database of English
words, their definitions, sets of synonyms, and
various semantic relations between synonym sets

A major resource for NLP

Needs automation to be completed

Sentiment Analysis

A technique used to detect favorable and
unfavorable opinions toward specific products
and services

See Application Case 7.3 for a CRM application
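
As a rough illustration of sentiment analysis, the sketch below counts words from two small opinion lexicons. The word lists and the labels are assumptions made for this example only; production systems use much larger lexicons or trained classifiers.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"good", "great", "excellent", "love", "reliable"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "broken"}


def sentiment_score(text):
    """Return (#positive words) - (#negative words) found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return pos - neg


def label(text):
    """Map the score to a favorable / unfavorable / neutral label."""
    score = sentiment_score(text)
    if score > 0:
        return "favorable"
    if score < 0:
        return "unfavorable"
    return "neutral"


print(label("Great phone, I love it"))
print(label("Terrible battery and poor support"))
```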

NLP Task Categories

Information retrieval

Information extraction

Named-entity recognition

Question answering

Automatic summarization

Natural language generation and understanding

Machine translation

Foreign language reading and writing

Speech recognition

Text proofing

Optical character recognition

Text Mining Applications

Marketing applications

Enables better CRM

Security applications

ECHELON, OASIS

Deception detection (…)

Medicine and biology

Literature-based gene identification (…)

Academic applications

Research stream analysis

Text Mining Applications

Application Case 7.4: Mining for Lies

Deception detection

A difficult problem

If detection is limited to only text, then
the problem is even more difficult

The study

analyzed text-based testimonies of
persons of interest at military bases

used only text-based features (cues)

Text Mining Applications

Application Case 7.4: Mining for Lies

Process diagram (box labels):
Statements Transcribed for Processing
Text Processing Software Identified Cues in Statements
Statements Labeled as Truthful or Deceptive By Law Enforcement
Text Processing Software Generated Quantified Cues
Classification Models Trained and Tested on Quantified Cues
Cues Extracted & Selected

Text Mining Applications

Application Case 7.4: Mining for Lies
Category        Example Cues
Quantity        Verb count, noun-phrase count, ...
Complexity      Avg. no. of clauses, sentence length, ...
Uncertainty     Modifiers, modal verbs, ...
Nonimmediacy    Passive voice, objectification, ...
Expressivity    Emotiveness
Diversity       Lexical diversity, redundancy, ...
Informality     Typographical error ratio
Specificity     Spatiotemporal, perceptual information, ...
Affect          Positive affect, negative affect, etc.
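
A few of the cue categories above can be quantified with simple counting. The sketch below is an assumption-laden illustration, not the study's feature extractor: it uses total word count as a stand-in for Quantity, average sentence length for Complexity, a type/token ratio for Diversity, and a small hand-picked modal-verb list for Uncertainty.

```python
import re

# Hypothetical modal-verb list used as an "uncertainty" cue.
MODAL_VERBS = {"may", "might", "could", "would", "should", "can", "must"}


def quantify_cues(statement):
    """Turn one free-text statement into a dictionary of numeric cues."""
    sentences = [s for s in re.split(r"[.!?]+", statement) if s.strip()]
    words = re.findall(r"[A-Za-z']+", statement.lower())
    return {
        "word_count": len(words),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "lexical_diversity": len(set(words)) / len(words) if words else 0.0,
        "modal_verb_count": sum(w in MODAL_VERBS for w in words),
    }


print(quantify_cues("I might have seen him. I could be wrong."))
```

Each statement becomes one row of numeric features, which is the form a classification model needs.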

Text Mining Applications

Application Case 7.4: Mining for Lies

371 usable statements are generated

31 features are used

Different feature selection methods used

10-fold cross validation is used

Results (overall % accuracy)

Logistic regression: 67.28

Decision trees: 71.60

Neural networks: 73.46
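
The 10-fold cross validation mentioned above can be sketched as an index-splitting routine. The function name, the shuffling seed, and the interleaved fold assignment are assumptions for illustration; only the sample size (371 statements) and the number of folds come from the slide.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(371, k=10))
print([len(test) for _, test in splits])
```

Each statement is used for testing exactly once; a model would be trained on the other nine folds each time and the ten accuracies averaged.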

Text Mining Applications
(gene/protein interaction identification)
[Figure: a sample sentence annotated at several levels: Gene/Protein, Ontology, Word, POS, Shallow Parse]
...expression of Bcl-2 is correlated with insufficient white blood cell death and activation of p53.

Text Mining Process
Context diagram for the text mining process
Activity (A0): Extract knowledge from available data sources
Diagram labels: Unstructured data (text); Structured data (databases); Context-specific knowledge; Software/hardware limitations; Privacy issues; Tools and techniques; Domain expertise; Linguistic limitations

Text Mining Process

The three-step text mining process

The inputs to the process include a variety of relevant unstructured (and semi-structured) data sources such as text, XML, HTML, etc.

Task 1: Establish the Corpus: Collect & Organize the Domain Specific Unstructured Data
The output of Task 1 is a collection of documents in some digitized format for computer processing.

Task 2: Create the Term-Document Matrix: Introduce Structure to the Corpus
The output of Task 2 is a flat file called the term-document matrix, where the cells are populated with the term frequencies.

Task 3: Extract Knowledge: Discover Novel Patterns from the T-D Matrix
The output of Task 3 is a number of problem-specific classification, association, and clustering models and visualizations.

Feedback loops connect the tasks.

Text Mining Process

Step 1: Establish the corpus

Collect all relevant unstructured data
(e.g., textual documents, XML files,
emails, Web pages, short notes, voice
recordings…)

Digitize, standardize the collection
(e.g., all in ASCII text files)

Place the collection in a common place
(e.g., in a flat file, or in a directory as
separate files)
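
Establishing the corpus from a directory of separate text files can be sketched as follows. The function name, the `*.txt` pattern, and the UTF-8 encoding are assumptions for this example; the slide itself only asks for a standardized collection in a common place.

```python
from pathlib import Path


def load_corpus(directory):
    """Read every .txt file in a directory into a {filename: text} mapping."""
    corpus = {}
    for path in sorted(Path(directory).glob("*.txt")):
        # errors="replace" keeps going when a file has stray bytes.
        corpus[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return corpus
```

Calling `load_corpus("corpus_dir")` (a hypothetical directory name) would return one entry per document, ready for tokenizing and matrix construction.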

Text Mining Process

Step 2: Create the Term–by–Document Matrix
[Figure: example term-by-document matrix. Rows (documents): Document 1 to Document 6, ...; columns (terms): investment risk, project management, software engineering, development, SAP, ...; cells: term frequencies]
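
A term-by-document matrix of the kind created in this step can be built with a few lines of counting. This is a minimal sketch under stated assumptions: the three sample documents are invented, terms are single whitespace-separated words (the slide's example uses multi-word terms), and no stemming or stop-word removal is applied.

```python
from collections import Counter


def build_tdm(documents, terms):
    """Rows = documents, columns = terms, cells = raw term frequencies."""
    matrix = []
    for text in documents:
        counts = Counter(text.lower().split())
        matrix.append([counts[t] for t in terms])
    return matrix


# Invented sample documents, for illustration only.
docs = [
    "investment risk and project risk",
    "software development project",
    "sap software sap",
]
terms = ["investment", "risk", "project", "software", "development", "sap"]

tdm = build_tdm(docs, terms)
for name, row in zip(["Document 1", "Document 2", "Document 3"], tdm):
    print(name, row)
```

Most cells are zero even in this toy example, which is why the matrix is usually sparse.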

Text Mining Process

Step 2: Create the Term–by–Document
Matrix (TDM), cont.

Should all terms be included?

Stop words, include words

Synonyms, homonyms

Stemming

What is the best representation of the
indices (values in cells)?

Raw counts; binary frequencies; log frequencies;

Inverse document frequency
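
The cell representations listed above can be compared side by side. The formulas below are one common variant of each (1 + log tf for log frequencies, log(N / df) for inverse document frequency); other textbooks and tools use slightly different definitions, so treat these as assumptions.

```python
import math


def binary(tf):
    """Binary frequency: 1 if the term occurs at all, else 0."""
    return 1 if tf > 0 else 0


def log_frequency(tf):
    """Log frequency: dampens the effect of very frequent terms."""
    return 1 + math.log(tf) if tf > 0 else 0.0


def inverse_document_frequency(tdm, col):
    """log(N / df), where df = number of documents containing the term."""
    n_docs = len(tdm)
    df = sum(1 for row in tdm if row[col] > 0)
    return math.log(n_docs / df) if df else 0.0


def tf_idf(tdm):
    """Weight every cell by log frequency times inverse document frequency."""
    idf = [inverse_document_frequency(tdm, j) for j in range(len(tdm[0]))]
    return [[log_frequency(tf) * idf[j] for j, tf in enumerate(row)] for row in tdm]
```

With this weighting, a term that appears in every document gets weight 0, because it does not help tell documents apart.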

Text Mining Process

Step 2: Create the Term–by–Document
Matrix (TDM), cont.

TDM is a sparse matrix. How can we
reduce the dimensionality of the TDM?

Manual - a domain expert goes through it

Eliminate terms with very few occurrences in
very few documents (?)

Transform the matrix using singular value
decomposition (SVD)

SVD is similar to principal component analysis
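
Of the reduction options above, eliminating rarely occurring terms is the simplest to sketch (this example does not cover SVD). The `min_docs` threshold and the function name are assumptions; the slide leaves the exact cutoff open.

```python
def prune_rare_terms(tdm, terms, min_docs=2):
    """Drop term columns that appear in fewer than min_docs documents."""
    keep = [
        j
        for j in range(len(terms))
        if sum(1 for row in tdm if row[j] > 0) >= min_docs
    ]
    reduced = [[row[j] for j in keep] for row in tdm]
    return reduced, [terms[j] for j in keep]


# Invented example: "sap" occurs in only one document and is removed.
matrix = [[1, 0, 2], [1, 0, 0], [0, 3, 1]]
reduced, kept = prune_rare_terms(matrix, ["risk", "sap", "project"])
print(kept, reduced)
```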

Text Mining Process

Step 3: Extract patterns/knowledge

Classification (text categorization)

Clustering (natural groupings of text)

Improve search recall

Improve search precision

Scatter/gather

Query-specific clustering

Association

Trend Analysis (…)
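
Clustering and categorization both need a way to compare documents once they are rows of a term-by-document matrix. Cosine similarity is one widely used measure; choosing it here is an assumption, since the slide does not name a specific measure. A minimal sketch:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(tdm, query_index):
    """Index of the document whose row is closest to the query row."""
    scores = [
        (cosine_similarity(tdm[query_index], row), i)
        for i, row in enumerate(tdm)
        if i != query_index
    ]
    return max(scores)[1]


# Invented rows: documents 0 and 1 use the same terms in the same proportions.
rows = [[1, 2, 1, 0], [2, 4, 2, 0], [0, 0, 1, 3]]
print(most_similar(rows, 0))
```

Documents whose vectors point in the same direction are grouped together regardless of length, which suits texts of very different sizes.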

Text Mining Application
(research trend identification in literature)

Mining the published IS literature

MIS Quarterly (MISQ)

Journal of MIS (JMIS)

Information Systems Research (ISR)

Covers 12-year period (1994-2005)

901 papers are included in the study

Only the paper abstracts are used

9 clusters are generated for further analysis

Text Mining Application
(research trend identification in literature)

Sample records (abstracts are cut off on the slide):

Journal: MISQ
Year: 2005
Author(s): A. Malhotra, S. Gosain and O. A. El Sawy
Title: Absorptive capacity configurations in supply chains: Gearing for partner-enabled market knowledge creation
Vol/No: 29/1
Pages: 145-187
Keywords: knowledge management; supply chain; absorptive capacity; interorganizational information systems; configuration approaches
Abstract: The need for continual value innovation is driving supply chains to evolve from a pure transactional focus to leveraging interorganizational partnerships for sharing …

Journal: ISR
Year: 1999
Author(s): D. Robey and M. C. Boudreau
Title: Accounting for the contradictory organizational consequences of information technology: Theoretical directions and methodological implications
Vol/No: 10/2
Pages: 167-185
Keywords: organizational transformation; impacts of technology; organization theory; research methodology; intraorganizational power; electronic communication; mis implementation; culture; systems
Abstract: Although much contemporary thought considers advanced information technologies as either determinants or enablers of radical organizational change, empirical studies have revealed inconsistent findings to support the deterministic logic implicit in such arguments. This paper reviews the contradictory …

Journal: JMIS
Year: 2001
Author(s): R. Aron and E. K. Clemons
Title: Achieving the optimal balance between investment in quality and investment in self-promotion for information products
Vol/No: 18/2
Pages: 65-88
Keywords: information products; internet advertising; product positioning; signaling; signaling games
Abstract: When producers of goods (or services) are confronted by a situation in which their offerings no longer perfectly match consumer preferences, they must determine the extent to which the advertised features of …

…

Text Mining Application
(research trend identification in literature)
[Figure: number of articles per year for each of the nine clusters. Panels: Cluster 1 to Cluster 9; x-axis: Year (1994 to 2005); y-axis: No of Articles (0 to 35)]

Text Mining Application
(research trend identification in literature)
[Figure: number of articles per journal for each of the nine clusters. Panels: Cluster 1 to Cluster 9; x-axis: Journal (ISR, JMIS, MISQ); y-axis: No of Articles (0 to 100)]

Text Mining Tools

Commercial Software Tools

SPSS PASW Text Miner

SAS Enterprise Miner

Statistica Data Miner

ClearForest, …

Free Software Tools

RapidMiner

GATE

Spy-EM, …

Web Mining Overview

Web is the largest repository of data

Data is in HTML, XML, text format

Challenges (of processing Web data)

The Web is too big for effective data mining

The Web is too complex

The Web is too dynamic

The Web is not specific to a domain

The Web has everything

Opportunities and challenges are great!

Web Mining

Web mining (or Web data mining) is the
process of discovering intrinsic relationships
from Web data (textual, linkage, or usage)

Web mining has three branches:

Web Content Mining
Source: unstructured textual content of the Web pages (usually in HTML format)

Web Structure Mining
Source: the uniform resource locator (URL) links contained in the Web pages

Web Usage Mining
Source: the detailed description of a Web site’s visits (sequence of clicks by sessions)
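
The different sources for content mining and structure mining can be seen by pulling both out of the same page. The sketch below uses Python's standard `html.parser`; the class name and the sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collect hyperlinks (structure) and visible text (content) from HTML."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Structure mining source: href targets of anchor tags.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Content mining source: text between the tags.
        if data.strip():
            self.text_parts.append(data.strip())


# Invented sample page.
page = (
    "<html><body><h1>Text mining</h1>"
    '<p>See <a href="http://example.com/web">Web mining</a>.</p>'
    "</body></html>"
)
parser = PageParser()
parser.feed(page)
links = parser.links
text = " ".join(parser.text_parts)
print(links)
print(text)
```

Usage mining draws on a third source, the server logs, which is not part of the page itself.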

Web Content/Structure Mining

Mining of the textual content on the
Web

Data collection via Web crawlers

Web pages include hyperlinks

Authoritative pages

Hubs

Hyperlink-induced topic search (HITS) algorithm
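
The idea behind hubs and authorities can be sketched as the usual iterative update: a page's authority score sums the hub scores of pages linking to it, and its hub score sums the authority scores of the pages it links to. The graph, the fixed iteration count, and the normalization below are assumptions for illustration, not a full implementation of HITS.

```python
import math


def hits(graph, iterations=50):
    """graph maps each page to the list of pages it links to."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum hub scores of pages that link to the page.
        auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub update: sum authority scores of the pages a page links to.
        hub = {n: sum(auth[t] for t in graph.get(n, ())) for n in nodes}
        # Normalize so the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return hub, auth


# Invented link graph: A and B both point to C and D; C points to D.
graph = {"A": ["C", "D"], "B": ["C", "D"], "C": ["D"], "D": []}
hub, auth = hits(graph)
print(max(auth, key=auth.get), max(hub.values()))
```

In this toy graph D collects the most inbound links from good hubs, so it ends up as the strongest authority, while A and B act as the hubs.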

Web Usage Mining

Extraction of information from data
generated through Web page visits and
transactions…

data stored in server access logs, referrer logs,
agent logs, and client-side cookies

user characteristics and usage profiles

metadata, such as page attributes, content
attributes, and usage data

Clickstream data

Clickstream analysis

Web Usage Mining

Web usage mining applications

Determine the lifetime value of clients

Design cross-marketing strategies across
products.

Evaluate promotional campaigns

Target electronic ads and coupons at user
groups based on user access patterns

Predict user behavior based on previously
learned rules and users' profiles

Present dynamic information to users based on
their interests and profiles…

Web Usage Mining
(clickstream analysis)

Sources: Website; Weblogs; User / Customer

Pre-Process Data: Collecting, Merging, Cleaning, Structuring
- Identify users
- Identify sessions
- Identify page views
- Identify visits

Extract Knowledge: Usage patterns, User profiles, Page profiles, Visit profiles, Customer value

Feedback: How to better the data; How to improve the Web site; How to increase the customer value
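
The "identify sessions" part of pre-processing can be sketched by splitting each user's clicks wherever the gap between them is too long. The 30-minute timeout is a common convention assumed here, and the record layout `(user_id, timestamp_seconds, page)` is hypothetical; real logs need parsing and user identification first.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed, commonly used cutoff


def sessionize(log_records, timeout=SESSION_TIMEOUT):
    """Group (user_id, timestamp_seconds, page) records into sessions."""
    by_user = defaultdict(list)
    for user, ts, page in log_records:
        by_user[user].append((ts, page))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # order each user's clicks by time
        current = [hits[0][1]]
        last_ts = hits[0][0]
        for ts, page in hits[1:]:
            if ts - last_ts > timeout:
                # Gap too long: close the session and start a new one.
                sessions.append((user, current))
                current = []
            current.append(page)
            last_ts = ts
        sessions.append((user, current))
    return sessions


# Invented log records.
log = [
    ("u1", 0, "/home"),
    ("u1", 60, "/products"),
    ("u2", 100, "/home"),
    ("u1", 4000, "/home"),
    ("u1", 4100, "/cart"),
]
sessions = sessionize(log)
for user, pages in sessions:
    print(user, pages)
```

Each resulting session is a sequence of clicks, which is the unit that usage patterns and visit profiles are later extracted from.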

Web Mining Success Stories

Amazon.com, Ask.com, Scholastic.com, …

Website Optimization Ecosystem
Components: Web Analytics; Voice of Customer; Customer Experience Management
Stages: Customer Interaction on the Web; Analysis of Interactions; Knowledge about the Holistic View of the Customer

Web Mining Tools
Product Name                          URL
Angoss Knowledge WebMiner             angoss.com
ClickTracks                           clicktracks.com
LiveStats from DeepMetrix             deepmetrix.com
Megaputer WebAnalyst                  megaputer.com
MicroStrategy Web Traffic Analysis    microstrategy.com
SAS Web Analytics                     sas.com
SPSS Web Mining for Clementine        spss.com
WebTrends                             webtrends.com
XML Miner                             scientio.com

End of the Chapter

Questions / comments…

All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system, or transmitted, in any form or by any means, electronic,
mechanical, photocopying, recording, or otherwise, without the prior written
permission of the publisher. Printed in the United States of America.