Introduction to Web Crawling and Regular Expression
CSC4170 Web Intelligence and Social Computing
Tutorial 1
Tutor: Tom Chao Zhou
Email: [email protected]
Outline
Course & Tutors Information
Introduction to Web Crawling
Utilities of a crawler
Features of a crawler
Architecture of a crawler
Introduction to Regular Expression
Appendix
Utilities of a crawler
Web crawler, also known as a spider.
Definition:
A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. (Wikipedia)
Utilities:
Gather pages from the Web.
Support a search engine, perform data mining and so on.
Objects:
Text, video, images, and so on.
Link structure.
Features of a crawler
Must provide:
Robustness: resilience to spider traps
Infinitely deep directory structures: http://foo.com/bar/foo/bar/foo/...
Pages filled with a very large number of characters.
Politeness: respect which pages can be crawled and which cannot (see the sketch below)
robots exclusion protocol: robots.txt
http://blog.sohu.com/robots.txt
User-agent: *
Disallow: /manage/
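A minimal sketch (not part of the original slides) of checking these rules before fetching, using Python's standard urllib.robotparser; it reuses the blog.sohu.com example above, and the /manage/settings and /somepost.html paths are made up for illustration:

import robotparser's standard-library module
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://blog.sohu.com/robots.txt")   # the robots.txt from the example above
rp.read()                                       # download and parse the rules

# "User-agent: *" with "Disallow: /manage/" means every crawler must skip /manage/
print(rp.can_fetch("*", "http://blog.sohu.com/manage/settings"))   # expected: False
print(rp.can_fetch("*", "http://blog.sohu.com/somepost.html"))     # expected: True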
Features of a crawler (Cont’d)
Should provide:
Distributed
Scalable
Performance and efficiency
Quality
Freshness
Extensible
Architecture of a crawler
[Figure: basic crawler architecture. URLs from the URL Frontier pass through DNS resolution, Fetch, and Parse; parsed content goes through the Content Seen? test (backed by Doc Fingerprints), and extracted URLs go through the URL Filter (backed by Robots templates) and Dup URL Elim (backed by the URL set) before being added back to the URL Frontier.]
URL Frontier: contains the URLs yet to be fetched in the current crawl. At first, a seed set is stored in the URL Frontier, and the crawler begins by taking a URL from the seed set.
DNS: domain name service resolution. Look up the IP address for a domain name.
Fetch: generally uses the HTTP protocol to fetch the URL.
Parse: the fetched page is parsed; text (images, videos, etc.) and links are extracted.
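To make the Fetch and Parse steps concrete, here is a minimal sketch (not from the slides) that downloads one page with Python's standard library and extracts href links with a simple regular expression; the Wikipedia URL is taken from the example later in this tutorial, and a real crawler would use a proper HTML parser and respect robots.txt and politeness rules:

import re
import urllib.request

def fetch_and_parse(url):
    """Fetch one page and return its text and the links it contains."""
    with urllib.request.urlopen(url) as resp:            # Fetch (HTTP)
        html = resp.read().decode("utf-8", errors="replace")
    links = re.findall(r'href="([^"]+)"', html)          # Parse: crude link extraction
    return html, links

text, links = fetch_and_parse("https://en.wikipedia.org/wiki/Main_Page")
print(len(links), "links extracted")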
Architecture of a crawler (Cont’d)
Content Seen?: test whether a web page with the same content has already been seen at another URL. Requires a way to compute a fingerprint of a web page.
URL Filter:
Decide whether an extracted URL should be excluded from the frontier (robots.txt).
URLs should be normalized (relative URLs resolved), e.g. this relative link extracted from en.wikipedia.org/wiki/Main_Page:
<a href="/wiki/Wikipedia:General_disclaimer" title="Wikipedia:General disclaimer">Disclaimers</a>
Dup URL Elim: each URL is checked against the URL set so that duplicates are eliminated.
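A minimal sketch (not from the slides) of these checks: an exact content fingerprint computed with a hash, duplicate URL elimination against a URL set, and URL normalization that resolves the relative link from the example above against its base page. Real crawlers typically use near-duplicate fingerprints (e.g. shingles or simhash) rather than an exact hash; the hash choice here is an assumption for illustration:

import hashlib
from urllib.parse import urljoin

seen_fingerprints = set()   # Doc Fingerprints
seen_urls = set()           # URL set for Dup URL Elim

def content_seen(page_text):
    """Content Seen?: True if a page with identical content was already crawled."""
    fp = hashlib.md5(page_text.encode("utf-8")).hexdigest()
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False

def url_seen(url):
    """Dup URL Elim: True if this normalized URL is already in the URL set."""
    if url in seen_urls:
        return True
    seen_urls.add(url)
    return False

# URL normalization: resolve the relative link from the example above.
base = "https://en.wikipedia.org/wiki/Main_Page"
print(urljoin(base, "/wiki/Wikipedia:General_disclaimer"))
# -> https://en.wikipedia.org/wiki/Wikipedia:General_disclaimer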
Architecture of a crawler (Cont’d)
Other issues:
Housekeeping tasks:
Log crawl progress statistics: URLs crawled, frontier size, etc. (every few seconds)
Checkpointing: a snapshot of the crawler's state (the URL frontier) is committed to disk. (every few hours)
Priority of URLs in URL frontier:
Change rate.
Quality.
Politeness:
Avoid repeated fetch requests to a host within a short time span.
Otherwise, the crawler may be blocked by the host (see the sketch below).
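A minimal sketch (not from the slides) of two of these issues, checkpointing the frontier to disk and enforcing a per-host delay; the 2-second gap and the checkpoint file name are arbitrary assumptions:

import pickle
import time
from urllib.parse import urlparse

MIN_DELAY = 2.0                 # assumed politeness gap per host, in seconds
last_fetch_time = {}            # host -> time of the last request to it

def wait_politely(url):
    """Sleep if the same host was contacted less than MIN_DELAY seconds ago."""
    host = urlparse(url).netloc
    elapsed = time.time() - last_fetch_time.get(host, 0.0)
    if elapsed < MIN_DELAY:
        time.sleep(MIN_DELAY - elapsed)
    last_fetch_time[host] = time.time()

def checkpoint(frontier, path="frontier.ckpt"):
    """Commit a snapshot of the URL frontier to disk (called every few hours)."""
    with open(path, "wb") as f:
        pickle.dump(frontier, f)

# Usage: call wait_politely(url) right before each fetch,
# and checkpoint(frontier) from the housekeeping loop.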
Regular Expression
Usage:
Regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters.
Today’s target:
Introduce the basic principle.
A tool to verify the regular expression: Regex Tester
http://www.dotnet2themax.com/blogs/fbalena/PermaLink,guid,13bce26d-7755-441e-92b3-1eb5f9e859f9.aspx
Regular Expression
Metacharacter
Similar to the wildcard in Windows, e.g.: *.doc
Target: Detect the email address
Regular Expression
\b: matches at the beginning or end of a word.
E.g.: \bhi\b matches "hi" only as a whole word.
\w: matches a letter, digit, or underscore.
.: matches any character except the newline.
*: the preceding element may be repeated any number of times (including zero).
E.g.: \bhi\b.*\bLucy\b
+: the preceding element may be repeated one or more times.
[]: matches any one of the characters inside the brackets.
E.g.: \b[aeiou]+[a-zA-Z]*\b
{n}: the preceding element repeated exactly n times.
{n,}: repeated n or more times.
{n,m}: repeated n to m times.
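A small sketch (not in the original slides) that exercises these metacharacters with Python's re module; the sample sentence is made up for illustration:

import re

text = "hi Lucy, an owl in the attic says hi"

# \bhi\b: "hi" only as a whole word.
print(re.findall(r"\bhi\b", text))                   # ['hi', 'hi']

# \bhi\b.*\bLucy\b: "hi", then anything, then "Lucy".
print(bool(re.search(r"\bhi\b.*\bLucy\b", text)))    # True

# \b[aeiou]+[a-zA-Z]*\b: words that start with one or more vowels.
print(re.findall(r"\b[aeiou]+[a-zA-Z]*\b", text))    # ['an', 'owl', 'in', 'attic']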
Regular Expression
Target: Detect the email address
Specifications:
A@B
A: combinations of the English characters a to z, digits, or the characters . _ % + -
B: cse.cuhk.edu.hk or cuhk.edu.hk (English characters)
Answer:
\b[a-z0-9._%+-]+@[a-z.]+\.[a-z]{2}\b
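A quick sketch (not in the slides) that checks this answer with Python's re module; the sample addresses are invented for illustration:

import re

EMAIL_RE = re.compile(r"\b[a-z0-9._%+-]+@[a-z.]+\.[a-z]{2}\b")

samples = [
    "someone@cse.cuhk.edu.hk",       # should match: A@B with B = cse.cuhk.edu.hk
    "first.last+tag@cuhk.edu.hk",    # should match: . and + are allowed in A
    "no-at-sign.cuhk.edu.hk",        # should not match: no @
]

for s in samples:
    print(s, "->", bool(EMAIL_RE.search(s)))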