Web of Data Usage Mining

Markus Luczak-Roesch · 51 slides · Apr 21, 2016

About This Presentation

Overview of how data on the Web of Data can be consumed (first and foremost Linked Data) and implications for the development of usage mining approaches.



Slide Content

Web of Data Usage Mining
Markus Luczak-Roesch
@mluczak | http://markus-luczak.de

What you should learn:
• describe the architectural differences between content
negotiation and Linked Data queries;
• develop applications that use different strategies to
consume Linked Data;
• develop usage mining methods that exploit the atomic parts
of the SPARQL query language.

Linked Data principles
1. Use URIs as names for "Things" (resources).
2. Use HTTP URIs to allow access to resources on the Web.
3. On resource access, deliver meaningful information conforming to Web standards (RDF, SPARQL).
4. Set RDF links to resources published by other parties to allow the discovery of more resources.
[Diagram: the resource URI http://dbpedia.org/resource/Berlin is resolved via content negotiation to http://dbpedia.org/page/Berlin (HTML) or http://dbpedia.org/data/Berlin (RDF); example triple: yago-res:Berlin (S) owl:sameAs (P) dbpedia:Berlin (O).]
Source: http://www.w3.org/DesignIssues/LinkedData.html
Content Negotiation


Linked Data exploits RDF
<http://markus-luczak.de#me> foaf:name "Markus Luczak-Roesch" .
<http://markus-luczak.de#me> foaf:knows <http://hannes.muehleisen.org#me> .

Linked Data vocabularies
• Vocabulary reuse:
– Geo
– FOAF
– GoodRelations
– SIOC
– DOAP
– …
• Vocabulary development:
– Thing
• Person
– OfficeHolder
– …
• …
http://dbpedia.org/ontology/Person
http://dbpedia.org/ontology/OfficeHolder
http://xmlns.com/foaf/0.1/knows

Linked Data vocabularies
• Mixing:
– Geo
– FOAF
– Dublin Core
– DBpedia Ontology
– ...
http://xmlns.com/foaf/0.1/Person
http://www.w3.org/2003/01/geo/wgs84_pos#lat
http://dbpedia.org/ontology/leader
http://dbpedia.org/ontology/City

Linked Data is self-descriptive
[Diagram, instance level vs. schema level: on the instance level, int:resA is owl:sameAs ext:resA and carries a foaf:name "ABC"; int:resA and int:resB link via rdf:type to classes. On the schema level, ont:ClassA is owl:equivalentClass of foaf:Person, which is rdfs:subClassOf foaf:Agent.]

[Diagram: the relational row (u_id: 45, firstname: Markus, surname: Luczak-Roesch) maps to the triples <http://markus-luczak.de#me> foaf:name "Markus Luczak-Roesch" and <http://markus-luczak.de#me> rdf:type foaf:Person.]

[Diagram: the relational row (c_id: 67, city: Berlin, country: Germany, inhabitants: 3.375.222) maps to the triple dbpedia:Berlin dbp:population "3.375.222".]

[Diagram, both sources combined: <http://markus-luczak.de#me> foaf:name "Markus Luczak-Roesch"; rdf:type foaf:Person; dbp:birthPlace dbpedia:Berlin; and dbpedia:Berlin dbp:population "3.375.222".]

[Diagram: <http://markus-luczak.de#me> has foaf:basedNear <http://markus-luczak.de/res/Soton> and dbp:birthPlace dbpedia:Berlin; both dbpedia:Berlin and dbpedia:Southampton have skos:subject dbpedia:Cities_in_Europe.]

[Diagram: as before, with an added rdfs:seeAlso link from <http://markus-luczak.de/res/Soton> to dbpedia:Southampton.]

[Diagram: <http://markus-luczak.de#me> has foaf:basedNear <http://markus-luczak.de/res/Soton>, which rdfs:seeAlso links to dbpedia:Southampton; <http://markus-luczak.de#me> is rdf:type foaf:Person, with foaf:Person owl:equivalentClass dbp:Person; dbpedia:Benny_Hill has dbp:birthPlace dbpedia:Southampton.]

Linked Data Infrastructure
Image source: Tom Heath and Christian Bizer (2011). Linked Data: Evolving the Web into a Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool.

Consuming Linked Data
• stateless
• request-response
[Diagram: over time t, the client opens a TCP connection to the server, sends a request, receives the response, and closes the connection (TCP life cycle; derived from R. Tolksdorf).]

Consuming Linked Data
GET / HTTP/1.1
User-Agent: Mozilla/5.0 … Firefox/10.0.3
Host: markus-luczak.de:80
Accept: */*
HTTP/1.1 200 OK
Server: Apache/2.0.49
Content-Language: en
Content-Type: text/html
Content-length: 2990

<!DOCTYPE html>
<html xml:lang="en"

Client
Server
derived from R. Tolksdorf

Consuming Linked Data
[Diagram: the client issues an HTTP GET for the information resource http://example.com/content/index; the server returns one of two representations, index.html (Representation 1) or index.rdf (Representation 2).]
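The selection step in this diagram can be mirrored in a few lines of server-side logic. This is a minimal sketch, not DBpedia's or any particular server's implementation; the `pick_representation` helper and its fallback rules are assumptions, and q-values in the Accept header are ignored for brevity.

```python
# Minimal server-side content negotiation sketch (hypothetical helper):
# pick one representation of an information resource based on the
# client's Accept header.

AVAILABLE = {
    "text/html": "index.html",           # human-readable representation
    "application/rdf+xml": "index.rdf",  # machine-readable RDF representation
}

def pick_representation(accept_header: str) -> str:
    """Return the file serving the best-matching media type.

    Walks the Accept header in order of appearance (q-values ignored)
    and falls back to HTML, in the spirit of Linked Data servers that
    redirect clients to a concrete representation.
    """
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip()
        if media_type in AVAILABLE:
            return AVAILABLE[media_type]
        if media_type == "*/*":
            return AVAILABLE["text/html"]
    return AVAILABLE["text/html"]

print(pick_representation("application/rdf+xml"))               # index.rdf
print(pick_representation("text/html,application/xhtml+xml"))   # index.html
```

A real deployment would additionally honour q-values and answer with a 303 redirect to the chosen representation's URI.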

Consuming Linked Data
• Discover URIs
– Lookup services
• http://rkbexplorer.com
– Web of Data search engines
• http://sindice.com
• http://ws.nju.edu.cn/falcons/objectsearch/index.jsp

Consuming Linked Data
• Discover additional data for the resource at hand
• follow links ("follow your nose")
– rdfs:seeAlso
– owl:sameAs
• Co-Reference services
– http://sameas.org
• Web of Data search engines

Linked Data
Source: http://wifo5-03.informatik.uni-mannheim.de/bizer/pub/LinkedDataTutorial/
The server can trace this usage.

Linked Data is queryable
[Diagram: the patterns ?s foaf:name "Markus Luczak-Roesch" and <http://markus-luczak.de#me> foaf:knows ?o leave one position variable each.]

SPARQL recap
• Basic principle: pattern matching
– describe a pattern
– query an RDF triple set ("RDF graph")
– the matching subset enters the result set
[Example binding: ?s is bound to http://dbpedia.org/resource/Berlin]

SPARQL recap
[Diagram: the pattern ?s dbp:birthPlace dbp:Berlin matches dbp:Klaus_Wowereit and dbp:Reinhard_Mey; dbp:Axel_Springer, whose birth place is given as the literal "Berlino", does not match.]

SPARQL queries on the Web
• RESTful service endpoint
GET /sparql?query=PREFIX+rdf… HTTP/1.1
Host: dbpedia.org
http://www.w3.org/TR/rdf-sparql-XMLres/ | http://www.w3.org/TR/rdf-sparql-json-res/
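Building such a RESTful GET request is a small exercise in URL encoding. A sketch, assuming DBpedia's public endpoint URL and an illustrative query; the `format` parameter name is a common endpoint convention, not part of the SPARQL protocol itself:

```python
from urllib.parse import urlencode

def sparql_get_url(endpoint: str, query: str,
                   fmt: str = "application/sparql-results+json") -> str:
    """Encode a SPARQL query into the query string of a GET request."""
    return endpoint + "?" + urlencode({"query": query, "format": fmt})

query = """PREFIX dbp: <http://dbpedia.org/property/>
SELECT ?s WHERE { ?s dbp:birthPlace <http://dbpedia.org/resource/Berlin> }"""

url = sparql_get_url("http://dbpedia.org/sparql", query)
print(url[:60])
```

The server answers with a result set serialized, e.g., as SPARQL Query Results XML or JSON, per the two W3C documents above.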

Querying Linked Data
[Diagram: dbp:Klaus_Wowereit and dbp:Reinhard_Mey have dbp:birthPlace dbp:Berlin in one source, while http://www.markus-luczak.de/me asserts its dbp:birthPlace in another; the data about the resources is distributed.]

Querying Linked Data
• the distribution of the data creates challenges for querying it
• Query approaches
– follow-up queries ← application-dependent, proprietary
– query a central data repository (e.g. the LOD cache) ← trivial
– federated queries ← more interesting
• idea: query a mediator that distributes the sub-queries and returns an aggregated result (as of SPARQL 1.1)
– link traversal ← very interesting
• idea: follow links in the results retrieved from a source to expand the data dynamically
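The link-traversal idea can be sketched with an in-memory stand-in for dereferencing. The URIs, triples, and the choice of rdfs:seeAlso / owl:sameAs as link predicates are invented for illustration:

```python
# Toy "follow your nose" traversal: starting from a seed URI, look up
# documents (simulated by a dict of triple lists instead of HTTP GETs)
# and follow rdfs:seeAlso / owl:sameAs links to expand the data.

DOCS = {
    "ex:me": [("ex:me", "foaf:basedNear", "ex:Soton"),
              ("ex:me", "rdfs:seeAlso", "ex:Soton")],
    "ex:Soton": [("ex:Soton", "owl:sameAs", "dbpedia:Southampton")],
    "dbpedia:Southampton": [("dbpedia:Southampton", "rdf:type", "dbp:City")],
}

LINK_PREDICATES = {"rdfs:seeAlso", "owl:sameAs"}

def traverse(seed: str) -> list:
    """Breadth-first traversal: collect triples, queueing the objects
    of link predicates for (simulated) dereferencing."""
    triples, queue, seen = [], [seed], set()
    while queue:
        uri = queue.pop(0)
        if uri in seen or uri not in DOCS:
            continue
        seen.add(uri)
        for s, p, o in DOCS[uri]:
            triples.append((s, p, o))
            if p in LINK_PREDICATES:
                queue.append(o)
    return triples

print(len(traverse("ex:me")))  # 4
```

A real client would issue HTTP GETs with RDF-accepting headers instead of dictionary lookups, and would need politeness limits on the frontier.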

Data Publisher / Data Consumer
[Diagram: two access paths between the data consumer (user, client/application) and the data publisher (dataset), both over HTTP. Resource-centred access: the client sends GET /resource/resA with an Accept header such as application/rdf+xml; the server performs graph creation and content negotiation. Query pattern access: the client sends GET /sparql?query=SELECT…; the server evaluates and performs the query and creates the result set (e.g. text/xml); the client processes and selects the result.]

http://www.flickr.com/photos/therichbrooks/4040197666/, CC-BY 2.0, https://creativecommons.org/licenses/by/2.0/
A game of pairs with SPARQL

SPARQL queries are self-descriptive data themselves
{
?s1 foaf:name "Markus Luczak-Roesch" .   (TP)
?s1 rdf:type dbp:Person                  (TP)
}
Each line is a triple pattern (TP); the braces enclose a basic graph pattern (BGP).
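These atomic parts are exactly what a usage mining method can extract from logged queries. A toy decomposition of a BGP into its TPs might look as follows; it splits on top-level '.' separators and is deliberately not a full SPARQL parser:

```python
# Toy BGP decomposition: split the body of a basic graph pattern into
# its triple patterns. Illustrative only; a real implementation should
# use a proper SPARQL parser (literals may contain dots, etc.).

def decompose_bgp(bgp: str) -> list:
    """Return the triple patterns of a BGP given as '{ tp . tp }'."""
    body = bgp.strip().lstrip("{").rstrip("}")
    return [tp.strip() for tp in body.split(".") if tp.strip()]

bgp = """{
  ?s1 foaf:name "Markus Luczak-Roesch".
  ?s1 rdf:type dbp:Person
}"""

for tp in decompose_bgp(bgp):
    print(tp)
```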

SPARQL queries are self-descriptive data
themselves
{
?s1 foaf:name “Markus Luczak-Roesch”.
?s1 rdf:type dbp:Person
}
h"p://markus-luczak.de#me
“Markus Luczak-Roesch“
rdf:type
foaf:name
foaf:Person


SPARQL queries are self-descriptive data
themselves
{
dbpedia:Benny_Hill dbp:birthPlace ?o1 .
?s dbp:basedNear ?o1 .
?s foaf:name ?o2
}



SPARQL queries are self-descriptive data
themselves
{
dbpedia:Benny_Hill dbp:birthPlace ?o1
}

SPARQL queries are self-descriptive data
themselves
{
?s dbp:basedNear ?o1
}

SPARQL queries are self-descriptive data
themselves
{
?s foaf:name ?o2
}

• all TP
• all TP in successful BGP
• all TP in successful queries
• all TP in failing queries
• all TP in failing BGP

Statistical analysis
[Diagram labels: missing facts, inconsistent data]
• ns:Band ns:knownFor ?x
• ns:Band ns:nationality ?y
• ns:Band ns:instrument ?x
• ns:Band ns:genre ?y
• ns:Band ns:associatedBand ?z
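A minimal sketch of the counting step behind such a statistical analysis, with invented triple patterns: frequently queried predicates that fail against the data hint at missing facts or inconsistent data.

```python
from collections import Counter

# Count how often each predicate occurs in mined triple patterns.
# The patterns below are invented for illustration.
patterns = [
    ("ns:Band", "ns:knownFor", "?x"),
    ("ns:Band", "ns:nationality", "?y"),
    ("ns:Band", "ns:instrument", "?x"),
    ("ns:Band", "ns:genre", "?y"),
    ("ns:Band", "ns:genre", "?z"),
]

predicate_counts = Counter(p for _, p, _ in patterns)
print(predicate_counts.most_common(1))  # [('ns:genre', 2)]
```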

Statistical analysis
(a) SWC (b) DBpedia (c) LGD
Figure 20: Usage of the concepts of the multiple ontologies (edges hidden); source: own illustration.

[Translated from German:] …of these data sets still has large potential for improvement. For example, it would be possible to use a higher number of specialised concepts; likewise, in theory more concepts from domains other than persons, places, and documents could be used.

Experiment 3: This experiment evaluated which results the developed ubKCE algorithm produces depending on the weighting of the aspects density and usage. The goal was to answer the question of whether key concepts can be identified on the basis of usage data at all. This can clearly be answered with yes. Already the results of Experiment 2 showed that many of the KCE key concepts are also those used most heavily by users of the data set. Depending on the chosen weighting of the aspects, however, the agreement of the ubKCE results with the key concepts identified by KCE varies. For example, a weighting of 100% usage to 0% density generally yields different key concepts than a 30:70 ratio. Regarding the agreement with KCE, it generally increases the higher the weighting of density in ubKCE is. This is not surprising, since the density aspect was adopted from KCE.

[27] stated, for the evaluation of the KCE algorithm, that from a 70% agreement between the KCE results and the key concepts identified by experts onwards it is no longer possible to tell whether a particular result was produced by a human or by the algorithm. To find a weighting that agrees as well as possible with the KCE result while relying as strongly as possible on the usage aspect, this 70% agreement is also targeted in the present work. According to [27], this agreement means that one could not distinguish whether a result stems from an expert, from KCE, or from ubKCE.

Through the different weightings of the aspects, the summary of the …

Source: master's thesis of Markus Bischoff

Estimating the effects of change
Usage-dependent maintenance of structured Web data sets

…to be added to the DBpedia 3.4 data set conforming to our approach [footnote 16].

Table 7.14: Recommended predicates to be added to the data set and the estimated effects of change.

Primitive to add     | Effect of change | Exists in data set
dbp:manufacturer     | 0.004505372      | x
dbp:firstFlight      | 0.004505372      | x
dbp:introduced       | 0.004505372      | x
dbp:nationalOrigin   | 0.004505372      |
dbo:thumbnail        | 0.021986718      | x
dbo:director         | 0.025047524      |
dbp:director         | 0.02503915       | x
dbp:abstract         | 0.025797024      | x
dbo:starring         | 0.034066643      |
dbp:starring         | 0.034066643      | x
dbp:stars            | 0.034066643      | x
skos:Concept         | 0.040946128      | x
skos:broader         | 0.04116386       | x
dbp:redirect         | 0.066441677      | x

Since this change recommendation is only additive, it is clear that no negative effects are estimated. However, it is possible to estimate the positive potential of a change and consequently to prioritize the changes to be performed in case of conflicting or contradicting recommendations.

More complex and also subtractive change recommendations may emerge from additive ones. This is typified by the recommendation to add, for example, both dbo:director and dbp:director to the data set, which appear to be contradicting. Hence, they should either be matched to each other by an owl:equivalentProperty relation or one of the two should be eliminated.

7.3.3 Further data set analysis
Our case study has shown how the usage-dependent data set maintenance approach performs in the context of a cross-domain data set like DBpedia. We will now present results from our studies with SWDF and LGD as two different domain-specific data sets.

Footnote 16: To save space we apply the following namespace prefixes in addition to the ones defined before: dbo: http://dbpedia.org/ontology/, dbp: http://dbpedia.org/property/.

Log files → Selected log files → Preprocessed queries → Decomposed queries and transaction tables → Patterns → Change recommendations [0,1]

What's in your SPARQL shopping bag?
[Slide shows the following two queries repeated as transaction items, labelled T1, T2, T1:]
T1: {
?s1 foaf:name "Markus Luczak-Roesch" .
?s1 rdf:type dbp:Person
}
T2: {
dbpedia:Benny_Hill dbp:birthPlace ?o1 .
?s dbp:basedNear ?o1 .
?s foaf:name ?o2
}
Transaction boundary heuristic: 30 mins., same IP, same user agent
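The session heuristic above can be sketched as follows. The log entries are invented tuples; the grouping rule (same IP and user agent, consecutive queries at most 30 minutes apart) is the one stated on the slide:

```python
SESSION_GAP = 30 * 60  # 30 minutes in seconds

def sessionize(log):
    """Group (timestamp, ip, user_agent, query) entries into
    transactions: queries from the same (ip, user_agent) pair join the
    open transaction while gaps stay within SESSION_GAP."""
    sessions, open_sessions = [], {}
    for ts, ip, ua, query in sorted(log):
        key = (ip, ua)
        if key in open_sessions and ts - open_sessions[key]["last"] <= SESSION_GAP:
            open_sessions[key]["queries"].append(query)
        else:
            entry = {"last": ts, "queries": [query]}
            sessions.append(entry["queries"])
            open_sessions[key] = entry
        open_sessions[key]["last"] = ts
    return sessions

log = [
    (0,    "1.2.3.4", "curl", "Q1"),
    (600,  "1.2.3.4", "curl", "Q2"),   # 10 min later: same transaction
    (5000, "1.2.3.4", "curl", "Q3"),   # > 30 min gap: new transaction
    (700,  "5.6.7.8", "moz",  "Q4"),   # different client: own transaction
]
print(sessionize(log))  # [['Q1', 'Q2'], ['Q4'], ['Q3']]
```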


Usage-dependent maintenance of structured Web data sets
Figure 7.20: Visualization of association rules computed by application of the unknown predicates restriction in the context of the LGD log file (size: support, color: lift).

Linked Data
Source: http://wifo5-03.informatik.uni-mannheim.de/bizer/pub/LinkedDataTutorial/
The server can trace this usage.

SPARQL

The visualization shows how primitives on the left-hand side (LHS) of a rule imply particular ones on the right-hand side (RHS) and which likelihood such an association has. In our specific case this allows us to analyze which primitives are queried together frequently in failing queries. We spot two characteristic usage patterns: (1) the properties and classes queried in the context of http://dbpedia.org/ontology/Aircraft; (2) the properties and classes queried in the context of an object variable. These can be further analyzed by exporting the association rules to GraphML and visualizing the network by use of a network visualization and analysis tool like Gephi [footnote 15], for example. Figure 7.13 depicts one filtered network representation for our example case. Nodes with a degree lower than 5 are filtered out (k-core network with k = 5) to derive a well-arranged visualization of the most important primitives in failing queries. Nodes represent LHS and RHS of the computed rules. Edges point from the LHS to the RHS of the particular rules.
[Figure content, mis-encoded in extraction; recoverable node labels include: {?pred_variable, dbp:name}, {?pred_variable, foaf:depiction}, {dbo:Aircraft}, {dbp:abstract, dbp:name}, {dbp:abstract, foaf:depiction}, {dbp:firstFlight}, {dbp:introduced}, {dbp:manufacturer}, {dbp:name, dbp:type}, {dbp:name, foaf:depiction}, {dbp:nationalOrigin}, {dbp:type, foaf:depiction}]
Figure 7.13: Filtered visualization of the association rule network (k-core 5 filter
applied to reduce nodes with degree lower than 5).
Table 7.14 lists an exemplary set of primitives which would be recommended …
Footnote 15: http://gephi.org/
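The metrics behind these rule visualizations (support, confidence, lift) can be sketched over toy transactions of query primitives. The transactions below are invented; real input would be the transaction tables produced by the log mining pipeline:

```python
# Association-rule metrics for a rule LHS -> RHS over transactions of
# query primitives (predicates/classes). Transactions are invented.

transactions = [
    {"dbo:Aircraft", "dbp:manufacturer", "dbp:firstFlight"},
    {"dbo:Aircraft", "dbp:manufacturer", "dbp:introduced"},
    {"dbo:Aircraft", "dbp:manufacturer"},
    {"dbp:abstract", "foaf:depiction"},
]

def support(itemset):
    """Fraction of transactions containing the whole itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """How often RHS holds when LHS holds."""
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    """Confidence normalized by the RHS base rate; > 1 means positive association."""
    return confidence(lhs, rhs) / support(rhs)

rule = ({"dbo:Aircraft"}, {"dbp:manufacturer"})
print(support(rule[0] | rule[1]))   # 0.75
print(confidence(*rule))            # 1.0
print(round(lift(*rule), 2))        # 1.33
```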
{
?s1 foaf:name "Markus Luczak-Roesch" .
?s1 rdf:type dbp:Person
}
[Diagram: the pattern matched against <http://markus-luczak.de#me>, which carries foaf:name "Markus Luczak-Roesch" and rdf:type foaf:Person.]

query applied to dataset
The server can trace detailed usage.

Linked Data Fragments
[Figure 1 from "Querying Datasets on the Web with High Availability": all HTTP triple interfaces offer Linked Data Fragments of a dataset; they differ in the specificity of the data they contain, and thus the effort needed to create them. The spectrum runs from generic requests (high client effort, high server availability: data dump, Linked Data document) over triple pattern fragments to specific requests (high server effort, low server availability: SPARQL result).]
3.2 Formal definitions
As a basis for our formalization, we use the following concepts of the RDF data model [16] and the SPARQL query language [12]. We write U, B, L, and V to denote the sets of all URIs, blank nodes, literals, and variables, respectively. Then, T = (U ∪ B) × U × (U ∪ B ∪ L) is the (infinite) set of all RDF triples. Any tuple tp ∈ (U ∪ V) × (U ∪ V) × (U ∪ L ∪ V) is a triple pattern. Any finite set of such triple patterns is a basic graph pattern (BGP). Any more complex SPARQL graph pattern, typically denoted by P, combines triple patterns (or BGPs) using specific operators [12, 20]. The standard (set-based) query semantics for SPARQL defines the query result of such a graph pattern P over a set of RDF triples G ⊆ T as a set that we denote by [[P]]_G and that consists of partial mappings μ : V → (U ∪ B ∪ L), which are called solution mappings. An RDF triple t is a matching triple for a triple pattern tp if there exists a solution mapping μ such that t = μ[tp], where μ[tp] denotes the triple (pattern) that we obtain by replacing the variables in tp according to μ.

For the sake of a more straightforward formalization, in this paper, we assume without loss of generality that every dataset G published via some kind of fragments on the Web is a finite set of blank-node-free RDF triples; i.e., G ⊆ T′ where T′ = U × U × (U ∪ L). Each fragment of such a dataset contains triples that somehow belong together; they have been selected based on some condition, which we abstract through the notion of a selector:

Definition 1 (selector). A selector is a partial function s : 2^T → {true, false}.

A more concrete type of this abstract notion are triple pattern selectors, which select triples that match a certain triple pattern:

Definition 2 (triple pattern selector). Given a triple pattern tp, the triple pattern selector for tp is the selector s_tp that, for any singleton set {t} ∈ 2^T, is defined by s_tp({t}) = true if t is a matching triple for tp, and false otherwise.

When publishing data on the Web, we should equip its representations with hypermedia controls [1, 8, 9]. We encounter them on a daily basis when browsing HTML pages; they are usually present as hyperlinks or forms. What all these controls have in common is that, given some (possibly empty) input, they result in our browser performing a request for a specific URL.

Definition 3 (control). A control is a function that maps from some set to U.

Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014). The final publication is available at link.springer.com.
xxx.xxx.xxx.xxx - - [17/Oct/2014:07:43:02 +0000] "GET /2014/en?subject=&predicate=&object=dbpedia%3AAustin HTTP/1.1" 200 1309 "http://fragments.dbpedia.org/2014/en" …
Data: predecessor I_p, BGP B = {tp_1, …, tp_n} with n ≥ 2, start page φ_0
1   I ← nil; c ← the triple pattern control in the control set C_0 of φ_0
2   function BasicGraphPatternIterator.next()
3     μ ← nil
4     while μ = nil do
5       while I = nil do
6         μ_p ← I_p.next()
7         return nil if μ_p = nil
8         Φ ← { φ_1^i | φ_1^i = HTTP GET first fragment page using URL c(μ_p[tp_i]) }
9         ε ← i such that cnt(φ_1^ε) = min({ cnt(φ_1^1), …, cnt(φ_1^n) })
10        I_ε ← TriplePatternIterator(StartIterator(), μ_p[tp_ε], φ_1^ε)
11        I ← BasicGraphPatternIterator(I_ε, { μ_p[tp] | tp ∈ B \ {tp_ε} }, φ_1^ε)
12      μ ← I.next()
13    return μ ∪ μ_p
Algorithm 1: For all mappings μ_p of a predecessor I_p, a BGP iterator for a pattern B = {tp_1, …, tp_n} creates a triple pattern iterator I_ε for the least frequent pattern tp_ε, passed to a BGP iterator for the remainder of the pattern.
fetches the first page of the corresponding LDF. This page contains the cnt metadata, which tells us how many matches the dataset has for each triple pattern. The pattern is then decomposed by evaluating it using a) a triple pattern iterator for the triple pattern with the smallest number of matches, and b) a new BGP iterator for the remainder of the pattern. This results in a dynamic pipeline for each of the mappings of its predecessor, as visualized in Fig. 2. Each pipeline is optimized locally for a specific mapping, reducing the number of requests.

To evaluate a SPARQL query over a triple pattern fragment collection, we proceed as follows. For each BGP of the query, a BGP iterator is created. Dedicated iterators are necessary for other SPARQL constructs such as UNION and OPTIONAL, but their implementation need not be LDF-specific; they can reuse the triple pattern fragment BGP iterators. The predecessor of the first iterator is a start iterator. We continuously pull solution mappings from the last iterator in the pipeline and output them as solutions of the query, until the last iterator responds with nil. This pull-based process is able to deliver results incrementally.
[Figure 2: a BGP iterator decomposes a BGP B = {tp_1, …, tp_n} into a triple pattern iterator for an optimal tp_i and, for each resulting solution mapping μ of tp_i, creates a BGP iterator for the remaining pattern B′ = { tp | tp = μ[tp_j] ∧ tp_j ∈ B } \ {μ[tp_i]}. In the example, B = { ?person a Architect. ?person birthPlace ?city. ?city subject Capitals_in_Europe. }; binding ?city (Zagreb, Budapest, Rome, …) yields B′ = { ?person a Architect. ?person birthPlace Zagreb. }; binding ?person (Alen_Peternac, Drago_Ibler, Juraj_Neidhardt, …) yields B″ = { Drago_Ibler a Architect. }.]
4.2 Dynamic iterator pipelines

A common approach to implement query execution in database systems is through iterators that are typically arranged in a tree or a pipeline, based on which query results are computed recursively [10]. Such a pipelined approach has also been studied for Linked Data query processing [13, 15]. In order to enable incremental results and allow the straightforward addition of SPARQL operators, we implement a triple pattern fragments client using iterators.

The previous algorithm, however, cannot be implemented by a static iterator pipeline. For instance, consider a query for architects born in European capitals:

SELECT ?person ?city WHERE {
  ?person a dbpedia-owl:Architect.               # tp1
  ?person dbpprop:birthPlace ?city.              # tp2
  ?city dc:subject dbpedia:Capitals_in_Europe.   # tp3
} LIMIT 100

Suppose the pipeline begins by finding ?city mappings for tp3. It then needs to choose whether it will next consider tp1 or tp2. The optimal choice, however, differs depending on the value of ?city:
– For dbpedia:Paris, there are ±1,900 matches for tp2 and ±1,200 matches for tp1, so there will be fewer HTTP requests if we continue with tp1.
– For dbpedia:Vilnius, there are 164 matches for tp2 and ±1,200 matches for tp1, so there will be fewer HTTP requests if we continue with tp2.
With a static pipeline, we would have to choose the pipeline structure in advance and subsequently reuse it.

In order to generate an optimized pipeline for each (sub-)query, we propose a divide-and-conquer strategy in which a query is decomposed dynamically into subqueries depending on partial solution mappings. The main function of an iterator is next(), which either returns a mapping or nil if no mappings are left.

We first introduce a trivial start iterator, which outputs the empty mapping μ_0 on the first call to next(), and nil on all subsequent calls.

Next, we implement a previously defined triple pattern iterator [15] for triple pattern fragments. This iterator I_tp is initialized with a predecessor iterator I_p, a triple pattern tp, and a page φ_0 of an arbitrary triple pattern fragment of a collection F. The iterator then extends mappings from its predecessor by reading triples from the LDF corresponding to triple pattern tp. The URL of this LDF is retrieved through the collection control in the start page φ_0. Each call to I_tp.next() results in mappings for tp in F, depending on the predecessor's mappings.

To solve BGPs of SPARQL queries, we introduce a triple pattern fragment BGP iterator. Such a BGP iterator is initialized with a predecessor I_p, a BGP B = {tp_1, …, tp_n}, and an arbitrary triple pattern fragment page φ_0 of a collection F. For an empty pattern (n = 0), a BGP iterator is equal to a start iterator. For a pattern length n = 1, it is constructed by creating a triple pattern iterator for (I_p, tp_1, φ_0). For n ≥ 2, a BGP iterator uses Algorithm 1.

BGP iterators evaluate a BGP by recursively decomposing it into smaller iterators. For each triple pattern in the BGP mapped by each result of I_p, the iterator …

Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014). The final publication is available at link.springer.com.
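The dynamic decomposition idea of Algorithm 1 can be condensed into a toy evaluator over an in-memory triple set: always expand the triple pattern with the fewest matches first, then recurse on the partially instantiated remainder. This is a sketch under simplifying assumptions (invented data, no HTTP requests, no paging, no blank nodes); the real client obtains the match counts from the cnt metadata of fragment pages.

```python
# Toy BGP evaluation in the spirit of Algorithm 1. Triples and patterns
# are plain 3-tuples; variables are strings starting with '?'.

TRIPLES = {
    ("Drago_Ibler", "a", "Architect"),
    ("Drago_Ibler", "birthPlace", "Zagreb"),
    ("Juraj_Neidhardt", "a", "Architect"),
    ("Juraj_Neidhardt", "birthPlace", "Zagreb"),
    ("Zagreb", "subject", "Capitals_in_Europe"),
}

def is_var(term):
    return term.startswith("?")

def matches(tp):
    """All solution mappings for one triple pattern."""
    out = []
    for t in TRIPLES:
        if all(is_var(p) or p == v for p, v in zip(tp, t)):
            mu, ok = {}, True
            for p, v in zip(tp, t):
                if is_var(p):
                    if p in mu and mu[p] != v:  # repeated variable mismatch
                        ok = False
                        break
                    mu[p] = v
            if ok:
                out.append(mu)
    return out

def substitute(tp, mu):
    """Replace variables in tp according to mapping mu."""
    return tuple(mu.get(term, term) for term in tp)

def evaluate(bgp):
    """Recursively evaluate a BGP, expanding the pattern with the
    smallest match count first (the role of cnt in Algorithm 1)."""
    if not bgp:
        return [{}]
    i = min(range(len(bgp)), key=lambda i: len(matches(bgp[i])))
    solutions = []
    for mu in matches(bgp[i]):
        rest = [substitute(q, mu) for j, q in enumerate(bgp) if j != i]
        solutions += [{**mu, **nu} for nu in evaluate(rest)]
    return solutions

bgp = [("?person", "a", "Architect"),
       ("?person", "birthPlace", "?city"),
       ("?city", "subject", "Capitals_in_Europe")]
print(sorted(m["?person"] for m in evaluate(bgp)))
# ['Drago_Ibler', 'Juraj_Neidhardt']
```

As in the paper's example, the evaluator starts from the Capitals_in_Europe pattern (one match) rather than the much larger Architect pattern, which keeps the number of lookups small.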

Wikidata
• API access to
• items
• edit history
• items’ discussions
• items’ access statistics
• and more
• Linked Data interface
• MediaWiki API
• Wikidata Query
• SPARQL
• Linked Data Fragments
Access to more than
“just” usage.

Thank you very much!
@mluczak | http://markus-luczak.de
http://www.flickr.com/photos/therichbrooks/4040197666/, CC-BY 2.0, https://creativecommons.org/licenses/by/2.0/

References
• Luczak-Rösch, M., & Bischoff, M. (2011). Statistical analysis of web of data usage. In Joint Workshop on Knowledge Evolution and
Ontology Dynamics (EvoDyn2011), CEUR WS.
• Luczak-Rösch, M. (2014). Usage-dependent maintenance of structured Web data sets (Doctoral dissertation, Freie Universität Berlin,
Germany), http://edocs.fu-berlin.de/diss/receive/FUDISS_thesis_000000096138.
• Elbedweihy, K., Mazumdar, S., Cano, A. E., Wrigley, S. N., & Ciravegna, F. (2011). Identifying Information Needs by Modelling Collective
Query Patterns. COLD, 782.
• Elbedweihy, K., Wrigley, S. N., & Ciravegna, F. (2012). Improving Semantic Search Using Query Log Analysis. Interacting with Linked Data
(ILD 2012), 61.
• Raghuveer, A. (2012). Characterizing machine agent behavior through SPARQL query mining. In Proceedings of the International
Workshop on Usage Analysis and the Web of Data, Lyon, France.
• Arias, M., Fernández, J. D., Martínez-Prieto, M. A., & de la Fuente, P. (2011). An empirical study of real-world SPARQL queries. arXiv
preprint arXiv:1103.5043.
• Hartig, O., Bizer, C., & Freytag, J. C. (2009). Executing SPARQL queries over the web of linked data (pp. 293-309). Springer Berlin
Heidelberg.
• Verborgh, R., Hartig, O., De Meester, B., Haesendonck, G., De Vocht, L., Vander Sande, M., ... & Van de Walle, R. (2014). Querying
datasets on the web with high availability. In The Semantic Web–ISWC 2014 (pp. 180-196). Springer International Publishing.
• Verborgh, R., Vander Sande, M., Colpaert, P., Coppens, S., Mannens, E., & Van de Walle, R. (2014, April). Web-Scale Querying through
Linked Data Fragments. In LDOW.