Common Crawl: An Open Repository of Web Data

huguk, 19 slides, Oct 17, 2012

About This Presentation

Talk given by Lisa Green of the Common Crawl Foundation at the Hadoop User Group UK meetup in London on 10 October 2012.


Slide Content

What Does The Data World Mean to Society? Lisa Green, 1 October 2012
Common Crawl: An Open Repository of Web Data
Lisa Green
London HUG, 10 October 2012

Photo license: Public Domain Origin: http://en.wikipedia.org/wiki/File:Floppy_disk_2009_G1.jpg

Photo license: CC-BY-SA Origin: http://en.wikipedia.org/wiki/File:Wikimedia_Foundation_Servers-8055_08.jpg

Image license: CC-BY Origin: http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg

Still Nascent
Even cheaper storage
Even cheaper compute
Education
Open Data
Image license: CC-BY Credit: NASA, ESA, and the Hubble Heritage Team (STScI/AURA)

Proprietary Commercial Gratis Libre

Gil Elbaz

Common Crawl Data
~8 billion web pages
~120 TB
2008-2012
ARC files, JSON metadata, text files
Available to anyone

ARC Files - Raw Content
Metadata
  Status information
  HTTP response code
  File names & offsets of ARC files
  HTML title
  HTML meta tags
  RSS/Atom information
  All anchors/hyperlinks
Text Files - Text Only
http://commoncrawl.org/get-started
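The metadata records file names and offsets of ARC files, which suggests a page can be fetched without reading a whole archive. The sketch below is a minimal illustration of that idea, not Common Crawl's own reader. It assumes each record is stored as its own gzip member and begins with an ARC v1 style header line (URL, IP address, date, content type, length); the helper names `read_gzip_member`, `parse_arc_record` and `make_member`, and the sample URLs, are hypothetical. The real file layout should be checked against the get-started page.

```python
import gzip
import zlib


def read_gzip_member(blob, offset):
    """Decompress the single gzip member that starts at `offset`."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # Decompression stops at the end of this member; any bytes that
    # follow are left in decompressor.unused_data.
    return decompressor.decompress(blob[offset:])


def parse_arc_record(record):
    """Split an ARC v1 style record into header fields and body."""
    header, _, body = record.partition(b"\n")
    url, ip, date, content_type, length = header.decode("ascii").split(" ")
    return {
        "url": url,
        "ip": ip,
        "date": date,
        "content_type": content_type,
        "body": body[: int(length)],
    }


def make_member(url, payload):
    """Build one made-up record so the example runs without a download."""
    header = f"{url} 0.0.0.0 20120101000000 text/html {len(payload)}\n"
    return gzip.compress(header.encode("ascii") + payload)


# Two records concatenated, standing in for an ARC file (assumed layout).
first = make_member("http://example.com/a", b"<html>A</html>")
second = make_member("http://example.com/b", b"<html>B</html>")
blob = first + second

# In practice the offset would come from the metadata; here it is computed.
offset = len(first)
record = parse_arc_record(read_gzip_member(blob, offset))
print(record["url"], record["body"])
```

Seeking straight to an offset means only the bytes of one record need to be read, which matters when the full corpus is on the order of 120 TB.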

http://webdatacommons.org
Change between 2010 and 2012
URLs with embedded data: +6%
Microdata: +14%
RDFa: +26%

22% of Web pages contain Facebook URLs
8% of Web pages implement Open Graph tags
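The slide does not say how these shares were measured. As a rough sketch of one way such figures could be computed over a set of pages, the code below scans HTML for Open Graph meta tags and for links mentioning facebook.com. The class `SocialMarkupScanner`, the helpers `scan_page` and `share`, the matching rules and the four sample pages are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class SocialMarkupScanner(HTMLParser):
    """Flags Open Graph meta tags and URLs that point at facebook.com."""

    def __init__(self):
        super().__init__()
        self.has_open_graph = False
        self.has_facebook_url = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Open Graph tags look like <meta property="og:title" content="...">.
        if tag == "meta" and (attributes.get("property") or "").startswith("og:"):
            self.has_open_graph = True
        # Assumption: a "Facebook URL" is any href/src mentioning facebook.com.
        for name in ("href", "src"):
            if "facebook.com" in (attributes.get(name) or "").lower():
                self.has_facebook_url = True


def scan_page(html):
    """Return (has_open_graph, has_facebook_url) for one HTML document."""
    scanner = SocialMarkupScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.has_open_graph, scanner.has_facebook_url


def share(pages, index):
    """Fraction of pages for which the flag at `index` is set."""
    hits = sum(1 for page in pages if scan_page(page)[index])
    return hits / len(pages)


# Made-up pages standing in for crawled documents.
sample_pages = [
    '<html><head><meta property="og:title" content="Demo"></head></html>',
    '<html><body><a href="http://www.facebook.com/example">Like us</a></body></html>',
    '<html><body><p>No social markup here.</p></body></html>',
    '<html><body><img src="http://example.org/logo.png"></body></html>',
]

open_graph_share = share(sample_pages, 0)
facebook_share = share(sample_pages, 1)
print(f"Open Graph: {open_graph_share:.0%}, Facebook URLs: {facebook_share:.0%}")
```

At crawl scale the same per-page check would typically run as the map step of a Hadoop job, with the counts summed in the reducer.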

A corpus of anchortext-WikipediaConcept-Count triples from the Common Crawl dataset, to benefit research on WSD, NLP and IR.
Explicit topic modeling: given a concept (represented as a Wikipedia page), it can tell which terms people most commonly use to describe the concept.
Given a sentence, it can help identify entities (person, location, organization) in the sentence and map them onto Wikipedia concepts.
http://wikientities.appspot.com
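To make the two uses of the anchor text corpus concrete, here is a small sketch that indexes (anchor text, Wikipedia concept, count) rows in both directions. The rows, the counts and the function name `build_indexes` are invented for illustration; the actual corpus format and the method used by the project may differ.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (anchor text, Wikipedia concept, number of links seen).
rows = [
    ("Big Apple", "New_York_City", 40),
    ("NYC", "New_York_City", 120),
    ("New York", "New_York_City", 300),
    ("New York", "New_York_(state)", 90),
    ("apple", "Apple_Inc.", 75),
]


def build_indexes(rows):
    """Index the rows by concept and by lowercased anchor text."""
    terms_for_concept = defaultdict(Counter)
    concepts_for_anchor = defaultdict(Counter)
    for anchor, concept, count in rows:
        key = anchor.lower()
        terms_for_concept[concept][key] += count
        concepts_for_anchor[key][concept] += count
    return terms_for_concept, concepts_for_anchor


terms_for_concept, concepts_for_anchor = build_indexes(rows)

# Topic modeling direction: which terms do people use for a concept?
top_terms = terms_for_concept["New_York_City"].most_common(2)

# Entity linking direction: which concept does a phrase most often point to?
best_concept = concepts_for_anchor["new york"].most_common(1)[0][0]

print(top_terms, best_concept)
```

Picking the most frequent concept for a phrase is only a baseline; a real entity linker would also use the surrounding sentence to disambiguate.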

Mapping French websites related to Open Data

Other Use Examples
Apache Giraph Testing
Maplight
Tineye
Factual
Sentiment Analysis Projects

In Development
N-gram and Link Graph Extracts
Pig Reader
More Frequent Full Crawls
Focused Subset Crawls at High Frequency
Open Educational Resources

Thank You London HUG
Lisa Green
[email protected]
www.commoncrawl.org
@commoncrawl
@boudicca