Term Paper on merging web tables for relation extraction

Sumir Vats, Jul 22, 2024

MERGING WEB TABLES FOR RELATION
EXTRACTION WITH KNOWLEDGE GRAPHS
Presented By : Sumir Vats
COE4380 | 21COB133
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 2, FEBRUARY 2023
Jhomara Luzuriaga, Emir Munoz, Henry Rosales-Mendez, and Aidan Hogan

ABSTRACT
We propose methods for extracting triples from Wikipedia’s HTML tables using a reference knowledge
graph. Our methods use a distant-supervision approach to find existing triples in the knowledge graph
for pairs of entities on the same row of a table, postulating the corresponding relation for pairs of
entities from other rows in the corresponding columns, thus extracting novel candidate triples. Binary
classifiers are applied on these candidates to detect correct triples and thus increase the precision of
the output triples. We extend this approach with a preliminary step where we first group and merge
similar tables, thereafter applying extraction on the larger merged tables. More specifically, we
propose an observed schema for individual tables, which is used to group and merge tables. We
compare the precision and number of triples extracted with and without table merging, where we show
that with merging, we can extract a larger number of triples at a similar precision. Ultimately, from the
tables of English Wikipedia, we extract 5.9 million novel and unique triples for Wikidata at an estimated
precision of 0.718.
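The distant-supervision step described in the abstract can be illustrated with a minimal sketch. It assumes table cells have already been linked to entity IDs and that the knowledge graph is a plain set of (subject, property, object) tuples. The function name `extract_candidates` and the sample IDs are hypothetical, and the binary classification step that filters the candidates is not shown.

```python
from collections import Counter
from itertools import permutations


def extract_candidates(table, kg):
    """Postulate novel triples for a table of entity IDs.

    table: list of rows, each a list of entity IDs (None for unlinked cells).
    kg:    set of (subject, property, object) tuples (assumed representation).
    """
    # Index the knowledge graph by (subject, object) pair.
    by_pair = {}
    for s, p, o in kg:
        by_pair.setdefault((s, o), set()).add(p)

    n_cols = len(table[0]) if table else 0
    candidates = set()
    for i, j in permutations(range(n_cols), 2):
        # Properties that already hold between columns i and j on some row.
        seen = Counter()
        for row in table:
            seen.update(by_pair.get((row[i], row[j]), ()))
        # Postulate each such property for every other row of the same columns.
        for p in seen:
            for row in table:
                s, o = row[i], row[j]
                if s is None or o is None:
                    continue
                if (s, p, o) not in kg:
                    candidates.add((s, p, o))
    return candidates


# Hypothetical data: one known triple, one row where the relation is missing.
kg = {("Q1", "capital", "Q2")}
table = [["Q1", "Q2"], ["Q3", "Q4"]]
print(extract_candidates(table, kg))
```

In this sketch the counts in `seen` are unused beyond membership. They could serve as a feature for the classifiers that decide which candidates to keep.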

OVERVIEW
• Introduction
• Web Tables
• Data Extraction
• Setting
• Results
• Conclusion
• Recommendations

INTRODUCTION
The prevailing format for web content is semi-
structured HTML, primarily designed for human
comprehension rather than machine interpretation,
presenting a significant hurdle for automated
information integration across multiple sources.
Advent of Knowledge Graphs

INTRODUCTION
Notably, knowledge graphs such as Wikidata now underpin a
range of applications: they organize and enrich the data
behind platforms like Wikipedia, and they also fuel
data-intensive ventures such as Apple’s Siri and the
WikiGenomes project.

WEB TABLES

DATA EXTRACTION
As introduced earlier, several prior studies have tackled various Information Extraction tasks
related to web tables. Works presenting extraction techniques designed for individual tables are
explored first, followed by proposals that consolidate information from multiple tables.
• First Problem: Information Extraction From Individual Tables
• Second Problem: Merging Tables
• Third Problem: Novelty

DATA EXTRACTION
1. Information Extraction from Individual Tables
• Table Extraction and Normalisation
• Table Classification
• Entity Linking
• Column Typing
• Attribute Extraction
• Relation Extraction

DATA EXTRACTION
2. Merging Tables
• Joined Tables
• Union Tables
• Hybrid Merges = Joins + Union
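As a rough illustration of a union merge, the sketch below groups tables by a schema key and concatenates the rows of each group. The key used here (normalised column headers) is only a simplified stand-in for the observed schema proposed in the paper. The names `schema_key` and `union_merge` are hypothetical, and joins and hybrid merges are not covered.

```python
from collections import defaultdict


def schema_key(header):
    # Assumption: two tables are "similar" when their normalised headers match.
    return tuple(h.strip().lower() for h in header)


def union_merge(tables):
    """Group (header, rows) pairs by schema key and concatenate their rows."""
    merged = defaultdict(list)
    for header, rows in tables:
        merged[schema_key(header)].extend(rows)
    return dict(merged)


# Hypothetical tables: the first two share a schema, the third does not.
tables = [
    (["Country", "Capital"], [["Q1", "Q2"]]),
    ([" country", "capital "], [["Q3", "Q4"]]),
    (["Player", "Team"], [["Q5", "Q6"]]),
]
for key, rows in union_merge(tables).items():
    print(key, rows)
```

A larger merged table gives the extraction step more rows on which an existing relation may be observed, which is the motivation for merging before extracting.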

DATA EXTRACTION
3. Novelty
• Aim to extract new triples for Wikidata from Wikipedia tables, enhancing previous methods by
refining triple extraction techniques and considering the table's main entity.
• Adapt existing approaches to better suit the specific characteristics of Wikipedia tables.
• Previous approaches often focus on merging only pairs of columns, potentially combining
unrelated information.
• In contrast, this method merges entire tables, ensuring coherence by leveraging all available
columns.
• Previous studies investigated table stitching methods, primarily targeting predefined
schema-matching tasks, which are less effective for the diverse tables found in Wikipedia.
• The new approach extracts millions of new triples with a precision of 0.718, showcasing its
effectiveness on Wikipedia tables.

SETTING
Directed, edge-labelled graph G := (V, E, P)
• V: vertices
• E: edges
• P: edge labels (properties)
A triple (s, p, o) ∈ E denotes an edge from s to o labelled p.
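The graph definition can be made concrete with a small sketch. The class and method names below are hypothetical. For simplicity, V and P are derived from the triples in E, so isolated vertices and unused properties are not represented.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    s: str  # subject vertex
    p: str  # property (edge label)
    o: str  # object vertex


class Graph:
    """Directed, edge-labelled graph G := (V, E, P) built from triples."""

    def __init__(self, triples):
        self.E = {Triple(*t) for t in triples}
        # Assumption: vertices and properties are exactly those used in E.
        self.V = {t.s for t in self.E} | {t.o for t in self.E}
        self.P = {t.p for t in self.E}

    def properties_between(self, s, o):
        """Return every label p such that (s, p, o) is an edge."""
        return {t.p for t in self.E if t.s == s and t.o == o}


# Illustrative triple only; any (s, p, o) strings would work.
g = Graph([("Chile", "capital", "Santiago")])
print(g.V, g.P, g.properties_between("Chile", "Santiago"))
```

A lookup like `properties_between` is the kind of query distant supervision relies on: given two entities on the same table row, find the relations the knowledge graph already records between them.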

RESULTS
BINARY CLASSIFICATION
I: From Individual Tables
M: From Merged Tables

CONCLUSION
So far, we discussed relation extraction on small tables.
Generating merged tables of thousands or hundreds of
thousands of rows also suggests that semi-supervised
methods may become practical, where input from an
expert could help to extract large batches of triples with
relatively little cost. It would also be of interest to adapt
and evaluate the proposed methods for HTML tables
taken from the broader Web.

RECOMMENDATIONS (My Suggestions)
Recommendation 1: For large tables, use the caption attached to the table to get an
idea of its content.
Recommendation 2: For social media applications on the web, use post hashtags to
extract information through knowledge graphs (no tables needed).

THANK YOU