Web Mining (Module 6, DWM, Third Year CE)

NiramayKolalle, Aug 12, 2024


Module 6:
Web Mining
- Pradnya Bhangale

Content
•Introduction
•Web Content Mining
•Crawlers
•Harvest System
•Virtual Web View
•Personalization
•Web Structure Mining: PageRank, Clever
•Web Usage Mining

Introduction: Web Mining
• Application of data mining techniques to find information patterns in web data such as web documents, web content, hyperlinks and server logs
• Web data can include:
• Content of actual web pages
• Intra-page structure, which includes the HTML or XML nodes of the page
• Inter-page structure, which is the actual linkage structure between web pages
• Usage data that describe how web pages are accessed by visitors
• User profile data, including user profiles, registration information, and cookies

Introduction (continued)
• Contents of mined web data may consist of text, structured data such as lists and tables, and even images, video and audio
• Goal of web mining: look for patterns in web data by collecting and analyzing information in order to gain insight into trends, the industry and users in general

Web Mining: Applications
• Helps to improve the power of search engines such as Google and Yahoo by classifying web documents and identifying relevant web pages
• Used to predict user behavior
• Landing page optimization
• Useful for e-commerce websites and e-services

Web Mining: Techniques
• Web Content Mining: used for mining useful data, information and knowledge from web page content
• Web Structure Mining: helps to find useful knowledge or information patterns in the structure of hyperlinks
• Web Usage Mining: used for mining web log records (access information of web pages) and helps to discover user access patterns of web pages

Web Content Mining
• Process of mining useful information from the contents of web pages / web documents: text, image, audio, video etc.
• Based on the content of the input query, web content mining scans and mines the text and images in the group of web pages displayed by search engines
• Ex: if a user searches for a particular book in a search engine, the search engine provides a list of suggestions
• There are many techniques to extract the data, such as web scraping
• Scrapy and Octoparse are well-known tools that perform the web content mining process

1. Crawlers
• Traditional search engines use crawlers to search the Web and gather information, indexing techniques to store information, and query processing to provide fast and accurate information to users
• A web crawler is a program that acts as an automated script which browses the internet in a systematic way
• Primarily programmed for repetitive actions so that browsing is automated
• Search engines use crawlers most frequently to browse the internet and build an index

1. Crawlers: Workflow
• Web crawlers are keyword-based: they look at the keywords in the pages, the kind of content each page has, and the links, before returning the information to the search engine. This process is known as web crawling.

1. Crawlers
• Crawler Frontier: stores the list of URLs to visit
• Page Downloader: downloads pages from the World Wide Web
• Web Repository: receives web pages from the crawler and stores them in a database
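The frontier / downloader / repository loop can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: the "web" is an in-memory stub (the page names, contents and links below are invented for the example) instead of real HTTP fetching.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real page downloader would fetch pages over HTTP instead.
PAGES = {
    "a.html": ("home page", ["b.html", "c.html"]),
    "b.html": ("about page", ["a.html"]),
    "c.html": ("contact page", []),
}

def crawl(seed):
    frontier = deque([seed])   # crawler frontier: URLs waiting to be visited
    repository = {}            # web repository: stores downloaded pages
    while frontier:
        url = frontier.popleft()
        if url in repository:          # skip already-downloaded pages
            continue
        text, links = PAGES[url]       # page downloader (stubbed)
        repository[url] = text         # store the page in the repository
        frontier.extend(links)         # enqueue newly discovered links
    return repository

repo = crawl("a.html")
print(sorted(repo))  # ['a.html', 'b.html', 'c.html']
```

Marking a URL as seen before re-enqueueing is what keeps the crawl from looping forever on cyclic link structures.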

• These web crawlers go by different names, like bots, automatic indexers and robots.
• For example, Google's search engine uses crawlers to fetch pages to Google's servers.
• Some popular web crawlers are:
• Googlebot
• Scrapy (the Python scraper)
• StormCrawler
• Elasticsearch River Web, etc.

Web Crawlers: Working
• The spider begins its crawl by going through the website or list of websites that it has visited previously
• When crawlers visit a website, they search for other pages that are worth visiting
• Web crawlers can link to new sites, note changes to existing sites and mark dead links

Google Search: How It Works
• From trillions of pages on the World Wide Web, web crawlers crawl through pages to bring back the results demanded by users
• Site owners can decide which of their pages they want the web crawlers to index, and they can block the pages that need not be indexed.

• The indexing is done by sorting the pages and looking at the quality of the content and other factors.
• Google then applies algorithms to get a better view of what you are searching for, and provides a number of features that make your search more effective, such as:
• Spelling: in case there is an error in the word you typed, Google comes up with several alternatives to help you get on track
• Google Instant: instant results as you type
• Search methods: different options for searching other than just typing out the words; this includes image and voice search
• Synonyms: tackles similarly worded meanings and produces results
• Autocomplete: anticipates what you need from what you type
• Query understanding: an in-depth understanding of what you type

Types of Crawlers
• Periodic crawler: a traditional crawler that, in order to refresh its collection, periodically replaces the old documents with newly downloaded documents. As it is activated periodically, every time it is activated it replaces the existing index
• Incremental crawler: incrementally refreshes the existing collection of pages by visiting them frequently, and updates the index incrementally instead of replacing it
• Focused crawler: tries to download web pages that are related to each other, i.e. it visits pages related to a topic of interest. Also known as a topic crawler

Web Crawler: Applications
• Price comparison portals use crawlers to search for product and price information
• A crawler may collect publicly available e-mail or postal addresses of companies for targeted advertising
• Web analysis tools use crawlers to collect data for page views, or incoming and outbound links
• Crawlers serve to provide information hubs with data, for example news sites

2. Harvest Systems
• Data harvesting uses a process that extracts and analyzes data collected from online sources
• Based on the use of caching, indexing, and crawling
• Harvest is actually a set of tools that facilitate gathering of information from diverse sources
• For data harvesting, a website is targeted, and the data from that site is extracted
• Might be simple text found on the page or within the page's code
• Could be directory information from a retail site
• Might even be a series of images and videos

• The Harvest design is centered around the use of gatherers and brokers
• A gatherer obtains information for indexing from an Internet service provider
• A broker provides the index and query interface
• Brokers may interface directly with gatherers or may go through other brokers to get to the gatherers

3. Virtual Web View
• Provides a virtual, summarized view of the Web organized as a multiple layered database (MLDB)
• A Web data mining query language provides data mining operations on the MLDB

4. Personalization
• Web personalization is the process of customizing a web site to the needs of each specific user or set of users, for example:
• Provision of recommendations to the users
• Highlighting/adding links
• Creation of index pages, etc.
• Web personalization systems are mainly based on the exploitation of the navigational patterns of the website's visitors.
• The process of providing information related to the user's current page is known as web personalization.

• For example: e-commerce
• The key information required for suggesting similar web pages comes from:
• Knowledge of other users who have also visited the current page
• Web page content, the structure of the web page, and the user's profile information
• All these help in creating a focused and personalized web browsing experience for the user.

4. Web Personalization Phases
• The web personalization process can be divided into four phases, namely:
1. Data collection
2. Pre-processing of web data
3. Analysis of web data
4. Decision making or recommendation

Phase 1: Data Collection
• Data collection is the process of gathering information, either explicitly or implicitly, specific to each visitor, recording their interests and behavior while they browse a web site
• Implicit data: a collection of activities completed in the past and recorded in web server logs
• Explicit data: information submitted by the user at the time of registration or in response to rating questionnaires
• Web data in the form of content, structure, semantics, usage and user profiles may be collected

Phase 2: Preprocessing of Data
• Log data collected from web servers are text files with a row for each HTTP transaction
• These data need to be cleansed before being put to analysis
• Preprocessing filters out irrelevant information according to the goal of the analysis

Phase 3: Data Analysis / Mining
• Specific data mining techniques used for mining web data are applied to the pre-processed data to discover interesting usage patterns
• It classifies the content of a web site into semantic categories in order to make information retrieval and presentation easier for the user

Phase 4: Recommendation Phase
• This last phase usually makes recommendations to users by highlighting existing hyperlinks, dynamically inserting new hyperlinks that seem to be of interest to the current user into the last web page requested, or even creating new index pages

Types of Personalization
• There are three approaches for generating a personalized web experience for a user:
• Content based Filtering
• Collaborative Filtering
  • Model based Techniques
  • Memory based Techniques

1. Content Based Filtering
• The approach to recommendation generation is based around the analysis of items previously rated by a user, generating a profile for the user from the content descriptions of those items.
• Several early recommender systems were based on content-based filtering, including Personal WebWatcher, InfoFinder, news readers, Letizia, and Syskill and Webert.
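A minimal Python sketch of the idea, assuming a bag-of-words representation of page content (the page names and descriptions are invented for illustration): a profile is built from the user's liked items, and unseen items are ranked by cosine similarity to that profile.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical item descriptions; real systems use richer content features.
items = {
    "page1": Counter("python web crawler tutorial".split()),
    "page2": Counter("football match highlights".split()),
    "page3": Counter("web mining with python".split()),
}

# Build the user profile from previously liked items.
liked = ["page1"]
profile = Counter()
for i in liked:
    profile.update(items[i])

# Rank unseen items by similarity to the profile.
scores = {i: cosine(profile, v) for i, v in items.items() if i not in liked}
best = max(scores, key=scores.get)
print(best)  # page3: it shares "python" and "web" with the profile
```

Note that only the user's own history is consulted here; no other users are needed, which is the key contrast with collaborative filtering.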

2. Collaborative Filtering
• The basic idea, as presented by Goldberg et al., was that people collaborate to help each other perform filtering by recording their reactions to e-mails in the form of annotations.
• Users provide feedback on the items that they consume, in the form of ratings.
• To recommend items to the active user, previous feedback is used to find other like-minded users.
• Items that have been consumed by compatible users but not by the current user are candidates for recommendation.
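A minimal memory-based sketch in Python (the users, items and ratings are invented): the most similar neighbour is found from co-rated items, and the neighbour's unseen items become recommendation candidates.

```python
# Hypothetical rating matrix: user -> {item: rating}.
ratings = {
    "alice": {"a": 5, "b": 3, "c": 4},
    "bob":   {"a": 5, "b": 3, "d": 5},
    "carol": {"a": 1, "b": 5, "e": 4},
}

def similarity(u, v):
    """Cosine similarity over the items both users rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    nu = sum(ratings[u][i] ** 2 for i in common) ** 0.5
    nv = sum(ratings[v][i] ** 2 for i in common) ** 0.5
    return dot / (nu * nv)

def recommend(user):
    # Most like-minded other user.
    neighbour = max((v for v in ratings if v != user),
                    key=lambda v: similarity(user, v))
    # Items the neighbour consumed that the active user has not.
    return sorted(set(ratings[neighbour]) - set(ratings[user]))

print(recommend("alice"))  # ['d']: bob rates like alice and also rated 'd'
```

Real systems aggregate over many neighbours and weight by similarity; a single best neighbour is used here only to keep the sketch short.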

3. Model based Techniques
• Model based collaborative techniques use a two-stage process for recommendation
• The first stage is carried out offline, where user behavioral data collected during previous interactions is mined and an explicit model generated for use in future online interactions.
• The second stage is carried out in real time as a new visitor begins an interaction with the web site.
• Data from the current user session is scored using the models generated offline, and recommendations are generated based on this scoring.

Model based vs. Memory based
Techniques


Web Structure Mining
•Web structure mining is used for creating a model of
web organization
•Process of analyzing the nodes and connection structure
of a website using graph theory

Web Structure Mining: Why?
• Used to classify web pages
• Helps to derive information such as relationships and similarity between different websites
• Useful for discovering website types:
• Authority sites: provide information about the subject
• Hub sites: point to many authority sites

Algorithms for Web Structure Mining
PageRank algorithm (from the Google founders)
• Looks at the number of links to a website and the importance of the referring links
• Computed before the user enters the query
HITS algorithm (Hyperlink-Induced Topic Search)
• The user receives two lists of pages for a query (authority pages and hub pages)
• Computations are done after the user enters the query

PageRank Algorithm
• The idea of the algorithm came from the academic citation literature.
• It was developed in 1998 as part of the Google search engine prototype
• Studies the citation relationships of documents within the web
• The Google search engine ranks documents as a function of both the query terms and the hyperlink structure of the web

Definition of PageRank
• PageRank produces a ranking independent of a user's query.
• The importance of a web page is determined by the number of other important web pages pointing to that page and the number of out-links from those web pages

Examples of Backlinks
• Page A is an inlink of pages B and C, while pages B and C are inlinks of page D.

Page Ranking

Damping Factor d

Computing PageRank

PageRank Algorithm

PageRank Numerical
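The computation behind the slides above (iterative PageRank with a damping factor d) can be illustrated with a short Python function. This is a simplified sketch: it assumes every page has at least one out-link (no dangling pages) and runs a fixed number of iterations instead of testing for convergence; the three-page graph is invented for the example.

```python
def pagerank(links, d=0.85, iters=50):
    """Iterative PageRank with damping factor d.
    links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iters):
        new = {}
        for p in pages:
            # Rank flowing into p: each in-linking page q shares its rank
            # equally among its out-links.
            incoming = sum(rank[q] / len(links[q])
                           for q in pages if p in links[q])
            new[p] = (1 - d) / n + d * incoming
        rank = new
    return rank

# Small example graph: A links to B and C; B and C link back to A.
graph = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # A: it collects links from both B and C
```

The `(1 - d) / n` term models a surfer who, with probability 1 - d, jumps to a random page instead of following a link; it also keeps every page's rank strictly positive.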

HITS Algorithm (Hyperlink
Induced Topic Search)

Authorities and Hubs

HITS Algorithm

Difference from PageRank

HITS Algorithm

HITS Numerical

Authority and Hubness

Numerical Example
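A minimal Python sketch of the HITS iteration described above: in each round, a page's authority score is the sum of the hub scores of pages linking to it, its hub score is the sum of the authority scores of pages it links to, and both vectors are normalized after each step. The three-page graph is invented for illustration.

```python
def hits(links, iters=50):
    """Iterative HITS over a link graph: page -> list of pages it links to."""
    pages = list(links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iters):
        # Authority update: sum of hub scores of in-linking pages.
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5
        auth = {p: v / norm for p, v in auth.items()}
        # Hub update: sum of authority scores of linked-to pages.
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5
        hub = {p: v / norm for p, v in hub.items()}
    return auth, hub

# H1 and H2 both point at A, so A should emerge as the authority
# and H1, H2 as (equally good) hubs.
graph = {"H1": ["A"], "H2": ["A"], "A": []}
auth, hub = hits(graph)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Unlike PageRank, this runs on a small query-specific subgraph at query time, which is exactly the "on the fly" behavior noted in the comparison below.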

Comparison: PageRank vs. HITS
• Mining technique used: PageRank uses web structure mining; HITS uses web structure and web content mining
• Working: PageRank computes scores at indexing time and sorts results according to the importance of pages; HITS computes hub and authority scores of the n most relevant pages on the fly
• Applied on: PageRank is applied to the entire Web; HITS to a local neighborhood of pages surrounding the results of a query
• Input parameters: PageRank uses back links; HITS uses back links, forward links and content
• Complexity: O(log N) for both
• Limitations: PageRank is query independent; HITS has efficiency problems
• Search engine: PageRank is used by Google; HITS by CLEVER

CLEVER Algorithm


Web Usage Mining
• Web usage mining is the process of extracting patterns and information from server logs to gain insights on user activity, like:
• where the users are from
• how many users clicked what item on the site, and
• the types of activities being done on the site
• Web server logs are considered raw data; from them, meaningful data are extracted and patterns are identified.
• For instance, when an e-commerce business wants to increase its scope, users' web activity is monitored through the application logs and data mining is applied to it.

• Some of the techniques to discover and analyze web usage patterns are:
Session and visitor analysis
• The analysis of pre-processed data can be performed in session analysis, which includes the records of visitors, days, sessions etc.
• This information can be used to analyze the behavior of visitors
• The report generated after analysis contains details of frequently visited web pages and common entry and exit points
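The session and visitor analysis described above can be illustrated with a few lines of Python over simplified, invented server log lines: hits are grouped into a click-stream per visitor IP, and page-view frequencies are counted.

```python
from collections import Counter, defaultdict

# Hypothetical web-server log lines (simplified common log format).
log_lines = [
    '10.0.0.1 - - [12/Aug/2024:10:00:01] "GET /index.html HTTP/1.1" 200',
    '10.0.0.1 - - [12/Aug/2024:10:00:09] "GET /products.html HTTP/1.1" 200',
    '10.0.0.2 - - [12/Aug/2024:10:00:15] "GET /index.html HTTP/1.1" 200',
    '10.0.0.2 - - [12/Aug/2024:10:00:40] "GET /contact.html HTTP/1.1" 200',
]

visits = defaultdict(list)   # one click-stream per visitor IP
page_hits = Counter()        # overall page-view frequencies
for line in log_lines:
    ip = line.split()[0]                  # first field is the client IP
    page = line.split('"')[1].split()[1]  # URL inside the request string
    visits[ip].append(page)
    page_hits[page] += 1

print(page_hits.most_common(1))  # [('/index.html', 2)]
print(visits["10.0.0.1"])        # ['/index.html', '/products.html']
```

Real session analysis would additionally split one IP's hits into separate sessions using timestamps (e.g. a 30-minute inactivity timeout), which is omitted here.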

OLAP (Online Analytical Processing)
• OLAP performs multidimensional analysis of complex data
• OLAP can be performed on different parts of log-related data over intervals of time
• The OLAP tool can be used to derive important business intelligence metrics

Web Usage Mining Process

1. Preprocessing
• Preprocessing consists of converting the usage, content, and structure information contained in the various available data sources into the data abstractions necessary for pattern discovery
• Usage preprocessing:
• Usage preprocessing is the most difficult task in the web usage mining process due to the incompleteness of the available data
• Unless a client-side tracking mechanism is used, only the IP address, agent, and server-side click-stream are available to identify users and server sessions

• Content preprocessing:
• Consists of converting the text, images, scripts, and other multimedia into forms that are useful for web usage mining
• It includes performing content mining such as classification or clustering
• Structure preprocessing:
• The structure of a website is created by the hypertext links between page views
• The structure can be preprocessed in the same manner as content

2. Pattern Discovery
• Pattern discovery uses methods and algorithms developed in several domains like statistics, data mining, machine learning and pattern recognition
• Statistical analysis: extract knowledge about visitors by performing descriptive statistical analysis of the frequency of page views, viewing time and length of navigational paths
• Association rules: find sets of pages that are accessed together with a minimum support count
• Clustering: two kinds of interesting clusters to mine: usage clusters and page clusters

• Classification: classify user profiles into different classes/categories based on their browsing activity
• Sequential patterns: web marketers can predict future visit patterns, which helps in placing advertisements aimed at certain groups of users
• Dependency modeling: develop a model capable of representing significant dependencies among the various variables in the web domain
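As a small illustration of association rule discovery over usage data, the following Python counts page pairs that co-occur in the same session and keeps those meeting a minimum support count. The sessions are invented, and a real system would use a dedicated algorithm such as Apriori rather than brute-force pair counting.

```python
from itertools import combinations
from collections import Counter

# Hypothetical user sessions: the set of pages accessed together in each.
sessions = [
    {"/home", "/cart", "/checkout"},
    {"/home", "/cart"},
    {"/home", "/blog"},
    {"/cart", "/checkout"},
]

min_support = 2  # a page set is "frequent" if it occurs in >= 2 sessions

# Count every unordered page pair across all sessions.
pair_counts = Counter()
for s in sessions:
    for pair in combinations(sorted(s), 2):
        pair_counts[pair] += 1

frequent = sorted(p for p, c in pair_counts.items() if c >= min_support)
print(frequent)  # [('/cart', '/checkout'), ('/cart', '/home')]
```

Frequent pairs like ('/cart', '/checkout') are the raw material for rules such as "visitors who view /cart also tend to view /checkout".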

3. Pattern Analysis
• Filter out uninteresting rules and patterns from the set found in the pattern discovery phase
• Load usage data into a data cube to perform OLAP operations
• Visualization techniques, such as graphing patterns or assigning colors to different values, can highlight overall patterns

Thank You!!