Semantic Data Enrichment: a Human-in-the-Loop Perspective


About This Presentation

Presentation to the INRIA Wimmics group, July 2024. The presentation covers the following key points (summarized by ChatGPT):

1. Introduction to the Topic: Overview of semantic data integration and enrichment.
2. Semantic Enrichment of Tabular Data: Detailed method...


Slide Content

Semantic Data Enrichment: a Human-in-the-Loop Perspective
Matteo Palmonari [email protected]
INSID&S Lab
Department of Informatics, Systems and Communication
Università degli Studi di Milano-Bicocca
Seminar at INRIA – Sophia Antipolis, July 20th, 2023

About me / this seminar…
• Associate Prof. at University of Milano-Bicocca
  ◦ INSID&S Lab: 4 faculty / 2 assistant prof. / 4 PhD students (now)
• Covered quite a broad spectrum of topics
  ◦ AI / Data Integration >> Knowledge Graphs (KGs)
  ◦ Representation Learning & NLP to track the evolution and to compare distributional representations >> Computational Social Science (CSS)
• Which topic for this talk?
  ◦ Human-in-the-loop (HITL) semantic data enrichment >> broad topic driving specific work; should match WIMMICS (NLP and KG)
  ◦ More in-depth presentation of recent work and CSS-related work >> Manuel Vimercati

Overview
• Semantic Data Integration, Annotations and Data Enrichment
• Semantic Enrichment of Tabular Data
• HITL Tabular Data Enrichment
• Towards HITL Textual Data Enrichment
• Conclusions

**Slides contain excerpts of content created by former/current PhD students Vincenzo Cutrona and Riccardo Pozzi

1) SEMANTIC DATA INTEGRATION, ANNOTATIONS, AND DATA ENRICHMENT

Semantic Data Integration (slides 6–10, progressive builds)

Running example:
• Company data from data.gouv.fr (https://annuaire-entreprises.data.gouv.fr/entreprise/sienna-real-estate-holding-france-492220553)
• Entities in OffshoreLeaks linked to France (https://offshoreleaks.icij.org/search?c=FRA&cat=0)
• A background KG; example text: "Foundations firms 'offshore' customers through banks" (entities in Wikipedia)

Annotation layers added across the builds:
• Named Entity Recognition (NER) – annotations: named entities (Person, Country, Org.)
• Named Entity Linking (NEL) – annotations: data linking to the background KG
• NIL Prediction and Clustering – grouping mentions of entities absent from the KG (e.g., "… Sienna …")

Inspiration for this example: [Knoblock&Szekely 2015], ICIJ + Neo4j work for the Panama Papers
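To make the NER step concrete, here is a minimal sketch (the choice of spaCy and its small English model is my assumption for illustration; the deck does not prescribe a specific tool):

```python
# Minimal NER sketch (assumes: pip install spacy && python -m spacy download en_core_web_sm)
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with a NER component
text = "Sienna Real Estate Holding France appears among OffshoreLeaks entities linked to France."
doc = nlp(text)

# Each entity span gets a surface form and a coarse type (ORG, GPE, PERSON, ...);
# Named Entity Linking (NEL) would then map these spans to background-KG entities,
# and NIL prediction/clustering would handle mentions with no KG counterpart.
for ent in doc.ents:
    print(ent.text, ent.label_)
```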

Semantic Data Enrichment and KG Construction (slides 12–13)

A shift in perspective:
• Users are interested in their content
• Background KGs are useful to
  ◦ support integration
  ◦ extend their content with additional data
• The construction of a KG can be a byproduct

Downstream Applications of Data Enrichment
• Query Answering / Semantic Search & Data Exploration (documents)
  ◦ Criminal investigations [SDSM20]
  ◦ Exploring data-contexts to contextualize news articles [ISWCdemo15, ESWC17]
  ◦ Enrichment and analysis of social media [EACLdemo17]
• "Traditional" ML & Data Analytics (tabular data)
  ◦ Weather-based optimization in digital marketing [ISWC19, Tech. and Appl. for BDV22]
• Analyses with Representation Learning (documents)
  ◦ Text-based entity embeddings and time-aware entity similarity [ISWC18]
  ◦ Entity evolution (+ with CADE alignment [AAAI19])




[Embedded document cover: Horizon Europe project "Enabling Data Enrichment Pipelines for AI-driven Business Products and Services" (HORIZON-CL4-2021-DATA-01-03, Grant Agreement No 101070284), deliverable D4.1 "Business Cases Requirements Analysis & Specifications", Work Package 4. Report, dissemination level: SEN - Sensitive; lead beneficiary: JOT; authors: Fernando Perales and Cynthia Parrondo (JOT), Cuong Xuan Chu and Evgeny Kharlamov (BOS), Qi Gao (PHI), Alex Young and Ian Makgill (SN), Luis Rei and Besher Massri (JSI), Tao Song (BGRIMM); version 1.0, due/delivered 30/06/2023.]

Contributions: applications and novel analytical methods
Main projects / Main data

[Embedded document cover: "Elicitazione dei bisogni informativi dei magistrati nell'ambito del sistema di ricerca semantica e serialità" (Elicitation of the information needs of magistrates within the semantic-search and case-seriality system), Progetto Datalake Giustizia and Progetto Next Generation UPP (PON Governance e capacità istituzionale 2014-2020); informational document and interview outline.]

Applications and analytical methods
This talk

Several Examples from Past/Ongoing Projects

| Domain | Value Enrichment | Data Sources | Data |
| eCommerce | Predict impact of events on customer searches | Events, weather | Tabular |
| Retail | Workforce/budget optimization | Events, weather | Tabular |
| CRM | Workforce optimization | Events, weather | Tabular |
| IOT | Customer flow analysis | Events, weather | Tabular |
| Digital Marketing | Ad impression prediction for campaign optimization | Weather | Tabular |
| Digital Marketing | Ad impression prediction for campaign optimization | Events | Tabular |
| Manufacturing | AI-based analytics on welding robot data (tables and user manuals) | Proprietary ~KG | Tabular, Texts |
| Manufacturing | Troubleshooting and repair based on service manuals, records, log data | Proprietary ~KG | Tabular, Texts |
| Open data | Construction and maintenance of a European dataset of organizations in procurement from tenders | Proprietary ~KG, Wikidata, Crunchbase | Tabular, Texts |
| Observatory on AI | Construction and maintenance of a KG to track AI-related innovations from different data sources | Crunchbase, Wikidata | Tabular, Texts |
| Business analysis | Cost-effective enrichment of client datasets with proprietary company KG | Proprietary KG | Tabular |





2) SEMANTIC ENRICHMENT OF TABULAR DATA

Recap from the applications overview: "Traditional" ML & Data Analytics – weather-based optimization in digital marketing [ISWC19, Tech. and Appl. for BDV22] (tabular data)





Semantically-Enabled Optimization of Digital Marketing Campaigns
Vincenzo Cutrona¹, Flavio De Paoli¹, Aljaž Košmerlj², Nikolay Nikolov³, Matteo Palmonari¹, Fernando Perales⁴, and Dumitru Roman³
¹ University of Milano-Bicocca  ² Jožef Stefan Institute  ³ SINTEF Digital  ⁴ JOT Internet Media

Weather-based Campaign Scheduler

New services for campaign optimization:
● Main service: weather-based campaign scheduler
  ○ Predict the best dates to launch the campaign with weather-sensitive keywords
  ○ in the upcoming week
  ○ for each region
● + additional services
● Why do we focus on data enrichment?
  ○ 80% of the time in a data analysis project is spent on cleaning and enriching the data*

Input data:

| KEYWORD, #im | REGION | Date |
| 19490664 | Thuringia | 2017-03-11 |
| 51782750 | Bavaria | 2017-03-12 |
| 45914342 | Berlin | 2017-03-12 |

Additional data: geoId. (gn:2822542, gn:2951839, gn:2950157); weather forecasts (°C at +0/+1 days: 18/20, 17/19, 17/20)
Target data → ML model → Business service

*Worldwide Semiannual Big Data and Analytics Spending Guide from International Data Corporation (IDC)

Data Enrichment: Digital Marketing Example (slides 19–22)

The source table and the Weather Service use DIFFERENT systems of identifiers.

Weather Service (keyed by a GeoNames city/region ID and an ISO 8601 date):
city: 2950157
- date: 2017-03-12, 2t: 17
- date: 2017-03-13, 2t: 20

Source table:

| KEYWORD, #im | REGION | Date |
| 19490664 | Thuringia | 11/03/2017 |
| 51782750 | Bavaria | 12/03/2017 |
| 45914342 | Berlin | 12/03/2017 |

STEP 1 – VALUE MANIPULATION: normalize the dates to ISO 8601 (11/03/2017 → 2017-03-11).
STEP 2 – LINKING: link each REGION to its GeoNames identifier (the region, not the city): Thuringia → gn:2822542, Bavaria → gn:2951839, Berlin → gn:2950157.
STEP 3 – EXTENSION: now that the table and the Weather Service share EQUAL systems of identifiers, join the weather forecasts (°C at +0/+1 days) onto the table via (geoId, date).
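A minimal, self-contained sketch of the three steps on this toy table (the weather values are the ones shown on the slide; the GeoNames lookup and the weather service are stubbed in memory, since the deck does not fix concrete APIs):

```python
from datetime import datetime

# Toy source rows: (keyword/#im run as printed on the slide, region, date dd/mm/yyyy)
rows = [("19490664", "Thuringia", "11/03/2017"),
        ("51782750", "Bavaria", "12/03/2017"),
        ("45914342", "Berlin", "12/03/2017")]

# STEP 1 - value manipulation: normalize dates to ISO 8601
def to_iso(d):
    return datetime.strptime(d, "%d/%m/%Y").date().isoformat()

# STEP 2 - linking: region label -> GeoNames id (stubbed reconciliation result)
GEONAMES = {"Thuringia": "gn:2822542", "Bavaria": "gn:2951839", "Berlin": "gn:2950157"}

# STEP 3 - extension: (geoId, ISO date) -> forecast, stubbing the weather service
WEATHER = {("gn:2950157", "2017-03-12"): 17}  # e.g., 2-metre temperature in Celsius

enriched = []
for kw_im, region, date in rows:
    iso = to_iso(date)                      # STEP 1
    geo = GEONAMES[region]                  # STEP 2
    temp = WEATHER.get((geo, iso))          # STEP 3 (None where no forecast is stubbed)
    enriched.append((kw_im, region, iso, geo, temp))

print(enriched)
```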

Semantic Data Enrichment: Problem Statement (slides 23–24)
● Inputs:
  ○ a source dataset
  ○ a pool of reference data sources (external data sources, reference KGs)
● Output:
  ○ the source dataset extended with modified/additional columns
Data Enrichment: a path on the data transformations graph GT, whose nodes are value manipulation, linking, and extension steps from the source to the output.
Semantic Data Enrichment: at least one node is linking.
Challenges:
● Large data volumes
● Unknown or little-known, large and complex data sources
● Intrinsic uncertainty >> annotations from algorithms

2.A) TABULAR DATA ANNOTATION ALGORITHMS: SEMANTIC TABLE INTERPRETATION

Semantic Table Interpretation (slides 26–32, progressive refinements)

Given:
● a relational table T
● a Knowledge Graph (entities + statements) and an ontology (types + predicates)
T is annotated when:
● each column is associated with one or more KG types (CTA)
● each cell in "entity columns" is annotated with a KG entity, or with NIL if the entity is not in the KG (CEA); also referred to as "entity linking" (for tables)
● some pairs of columns are annotated with a binary KG predicate (CPA)

Running example:

| Name | Coordinates | Height | Range |
| Mont Blanc | 45°49′57″N 06°51′52″E | 4808 | Mont Blanc massif |
| Hohtälli | 45°98′96″N 07°80′25″E | 3275 | Pennine Alps |
| Monte Cervino | 45°58′35″N 07°39′31″E | 4478 | Pennine Alps |

[Diagram: background KG with a schema level (Natural Place; Mountain, Mountain Range, xsd:string, xsd:integer; predicates georss:point, dbo:elevation, dbo:mountainRange, …) and an entity level (Mont_Blanc, Mont_Blanc_Massif, 4808, connected by dbo:elevation and dbo:mountainRange).]

Column roles: subject column (Name), named-entity columns, literal columns.
The annotations also enable KG completion (e.g., adding the missing coordinates 45°49′57″N 06°51′52″E of Mont_Blanc via georss:point) and surface novel entities (e.g., [NIL: Hohtälli], with its Pennine Alps range, to be added to the KG).
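To make the CTA/CEA/CPA annotations concrete, a minimal sketch of what an STI output for the mountains table could look like (the dictionary layout and the DBpedia-style identifiers are illustrative assumptions, not a format defined in the deck):

```python
# Illustrative STI output for the mountains table (0-based row/column indices).
annotations = {
    # CTA: each column -> one or more KG types
    "CTA": {0: ["dbo:Mountain"], 3: ["dbo:MountainRange"]},
    # CEA: cells in entity columns -> a KG entity, or NIL if absent from the KG
    "CEA": {(0, 0): "dbr:Mont_Blanc",
            (1, 0): "NIL",              # Hohtälli: novel entity, not in the KG
            (2, 0): "dbr:Matterhorn",   # "Monte Cervino" is the Matterhorn's Italian name
            (0, 3): "dbr:Mont_Blanc_massif",
            (1, 3): "dbr:Pennine_Alps",
            (2, 3): "dbr:Pennine_Alps"},
    # CPA: some column pairs -> a binary KG predicate (subject column is 0)
    "CPA": {(0, 1): "georss:point", (0, 2): "dbo:elevation",
            (0, 3): "dbo:mountainRange"},
}
```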

INSID&S Contributions
• Entity linking in tables
  ◦ Soft filters to filter candidate entities based on type embedding similarity [SEMANTICS'21]
  ◦ LamAPI: supporting indexing and matching [OM@ISWC'22]
• End-to-end STI
  ◦ s-elBat: dealing with messy tables [SemTab@ISWC'22]*
  ◦ MantisTable [Fut.Gen.Internet'20, SemTab@ISWC'19-21]*
• Evaluation & datasets
  ◦ Tough Tables: misspellings and noisy labels [ISWC'20]
  ◦ MammoTab: large dataset of annotated tables, to learn neural linking algorithms and evaluate them [SemTab@ISWC'22]
• Participation in STI Challenges (2019-2022)
http://www.cs.ox.ac.uk/isg/challenges/sem-tab/

Recap: Annotations, Enrichment and KG Construction
Enrichment:
• Table annotation: schema mapping, entity linking
• Table augmentation: with links and data extension services
Exploitation:
• Export as graph (table-to-graph transformations) >> KG generation, KG completion
• Export as tabular data >> downstream analysis
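As an illustration of the "export as graph" step, a minimal sketch that turns one annotated row of the mountains table into triples (the rdflib choice and the naming are mine, not the deck's; tools like s-elBat/SemTUI expose their own exporters):

```python
# Minimal table-to-graph export using the CEA/CPA annotations (assumes rdflib is installed).
from rdflib import Graph, Literal, Namespace

DBO = Namespace("http://dbpedia.org/ontology/")
DBR = Namespace("http://dbpedia.org/resource/")

g = Graph()
# One annotated row: subject entity, elevation literal, mountain-range entity.
subject = DBR["Mont_Blanc"]
g.add((subject, DBO["elevation"], Literal(4808)))
g.add((subject, DBO["mountainRange"], DBR["Mont_Blanc_massif"]))

print(g.serialize(format="turtle"))
```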

2.B) HITL TABULAR DATA ENRICHMENT
1 – User interfaces for interactive data annotation and enrichment

ASIA: Assisted Semantic Interpretation and Annotation of tabular data
• Interactive annotation
  ◦ Execute linking services
  ◦ Exploit vocabulary suggestions from ABSTAT […, VLDBJ21]
  ◦ Edit / revise annotations
• Interactive extension
  ◦ Execute data extension services, specifying parameters from the interface
(UI: table view; vocabulary suggestions and search)
Cutrona, V., Ciavotta, M., De Paoli, F., & Palmonari, M. (2019). ASIA: A tool for assisted semantic interpretation and annotation of tabular data. In Proceedings of ISWC Demo Papers [ISWCdemo19]

SemTUI – Interactive Semantic Enrichment of Tabular Data
Support to Linking–Revision–Extension of tabular data
• UI accessing external services
  ◦ STI (full): s-elBat
  ◦ Reconciliation/linking services (OpenRefine interface): GeoNames, Wikidata, DBpedia, Atoka-linking (SpazioDati)
  ◦ Extension services: Wikidata / DBpedia (SPARQL), weather extension (ECMWF), HERE (georeferencing), shortest-route, Atoka-extension (SpazioDati), …
• Graphical view & revision of annotations
  ◦ Global and specific annotation rendering
  ◦ Single cell editing / annotation revision
  ◦ Column annotation revision
Ripamonti, M., De Paoli, F., & Palmonari, M. (2022). SemTUI: a Framework for the Interactive Semantic Enrichment of Tabular Data. arXiv preprint arXiv:2203.09521.
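The reconciliation services SemTUI talks to follow the OpenRefine-style Reconciliation API; a minimal sketch of such a call against a public Wikidata reconciliation endpoint (the URL is one known public deployment and may change; it is not mandated by the deck):

```python
# Minimal OpenRefine-style reconciliation call (assumes the requests package).
import json
import requests

ENDPOINT = "https://wikidata.reconci.link/en/api"  # a public Wikidata reconciliation service

# One query batch: reconcile the label "Bavaria" and ask for up to 3 candidates.
queries = {"q0": {"query": "Bavaria", "limit": 3}}
resp = requests.get(ENDPOINT, params={"queries": json.dumps(queries)}, timeout=30)
resp.raise_for_status()

for cand in resp.json()["q0"]["result"]:
    # Each candidate carries an id, a label, a score, and a best-match flag.
    print(cand["id"], cand["name"], cand["score"], cand.get("match"))
```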

2.B) HITL TABULAR DATA ENRICHMENT
2 – Make data enrichment pipelines scalable

Annotation for Tabular Data Enrichment at Scale
● Remember: enrichment ~ a sequence of transformations that can be executed in batch mode
● A two-step paradigm [ISWC19, ISWC19demo, Tech.andAppl.for BDV22]
  ● Small-scale design: algorithms + UI to specify annotations and data extensions on a data sample
  ● Large-scale execution: big data technologies (Docker, parallelization, …) to speed up the execution of the transformations on the full data

[Diagram: SAMPLE → small-size processing (quality insights, enrichment design, quality assessment, stack configuration) → ENRICHED SAMPLE + TRANSFORMATION MODEL; DATASET → batch processing with the transformation model → ENRICHED DATASET.]

Ciavotta, M., Cutrona, V., De Paoli, F., Nikolov, N., Palmonari, M., & Roman, D. (2022). Supporting semantic data enrichment at scale. In Technologies and Applications for Big Data Value (pp. 19-39). Cham: Springer International Publishing. [Tech.andAppl.for BDV22]
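A minimal sketch of the two-step paradigm: the interactive session produces a declarative transformation model, which a batch worker then replays over the full dataset (the op-list format, the stubbed services, and the multiprocessing choice are illustrative assumptions, not the paper's actual transformation-model format):

```python
# Design once on a sample, replay everywhere: a toy transformation model.
from datetime import datetime
from multiprocessing import Pool

# Declarative model produced by the small-scale design step (UI on a sample).
MODEL = [
    {"op": "value_manipulation", "column": "Date"},          # dd/mm/yyyy -> ISO 8601
    {"op": "linking", "column": "REGION", "service": "geonames"},
    {"op": "extension", "service": "weather"},
]
GEONAMES = {"Bavaria": "gn:2951839"}                          # stubbed linking service
WEATHER = {("gn:2951839", "2017-03-12"): 19}                  # stubbed extension service

def apply_model(row):
    # Replays each declared transformation on one row.
    for step in MODEL:
        if step["op"] == "value_manipulation":
            row["Date"] = datetime.strptime(row["Date"], "%d/%m/%Y").date().isoformat()
        elif step["op"] == "linking":
            row["geoId"] = GEONAMES.get(row[step["column"]])
        elif step["op"] == "extension":
            row["2t"] = WEATHER.get((row["geoId"], row["Date"]))
    return row

if __name__ == "__main__":
    dataset = [{"REGION": "Bavaria", "Date": "12/03/2017"} for _ in range(1000)]
    with Pool() as pool:                  # large-scale execution: parallel batch replay
        enriched = pool.map(apply_model, dataset)
    print(enriched[0])
```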

2.B) HITL TABULAR DATA ENRICHMENT
3 – Deeper integration of UI and algorithms (ongoing)

Challenges: Entity Disambiguation and Ranking in Tables

| title | director | release year | domestic distributor | length in min | worldwide gross |
| jurassic world | colin trevorrow | 2015 | universal pictures | 124 | 1670400637 |

[Diagram: candidate Wikidata entities for the cell "jurassic world", each with its properties: Q3512046 (Jurassic World; publication date (P577) 12 June 2015, duration (P2047) 124, box office (P2142) 1670400637, production company (P272) Q13377 Universal Pictures, director (P57) Q5145625 Colin Trevorrow); Q20647533 (Jurassic World; publication date 2015, performer (P175) Q937857 Michael Giacchino, screenwriter (P58) Colin Trevorrow, part of the series (P179) Q17862144 Jurassic Park); Q21877685 (Jurassic World, the 2018 sequel; publication date 22 June 2018, duration 128, box office 1309500000, distributed by (P750) Q13377 Universal Pictures, director (P57) Q932019 J. A. Bayona, part of the series Q17862144 Jurassic Park). Matching the row values (2015, universal pictures, 124, 1670400637) disambiguates to the first candidate ✔]
Challenges: Novel Entities
• Linking with NIL prediction
  ◦ Detection of novel entities
  ◦ Underrepresented task in benchmark data: greedy algorithms are often rewarded
  ◦ Important problem in real-world data enrichment settings: e.g., in tables that were not extracted/constructed from Wikidata, only a fraction of the organizations have links to Wikidata





HITL in Linking Tasks
• Personal background on HITL approaches
  ◦ Ontology matching with multi-user feedback [SWJ'16, KEOD'17]
  ◦ Active learning to rank for semantic association relevance [ESWC'17]
• Objective: maximize quality while minimizing user effort
• Two levels
  ◦ Fast revision: revise first the links that are more likely to be incorrect
  ◦ Learning from the user feedback: feedback propagation, learning from limited data

s-elBat '22 >> '23
• [SemTab22]: ad-hoc transformation of features into an unbounded ranking score
• New:
  ◦ NN-based transformation into a bounded confidence score ρ ∈ [0,1]
  ◦ NIL prediction with a threshold
Features: mention vs labels, row vs properties, row vs description, predicate and type hits

Entity Linking with NIL Prediction
• Confidence-based revision: use the confidence score to order the links to revise
  ◦ E.g., mentions with lower confidence first, i.e., order all mentions m by increasing ω_m
  ◦ E.g., more uncertain mentions first, i.e., order all mentions m by the distance of ω_m from the threshold
• The optimal k for ranking is learned on the train set (maximize F1 / minimize revisions)

Decision rule (reconstructed from the paper excerpt reported below): ω_i = (1 − k)·ρ_i + k·δ_i, and the i-th mention is linked iff ω_i ≥ σ, where ρ is the refined matching score, δ is the distance between the scores of the top two candidates, and σ is the decision threshold.

[Diagram: for the candidates c_{i,j,1..k} of the value in the i-th row and j-th column, the Entity Retriever returns the top-k candidates with features F; the prediction network PN-Θ produces normalized scores p_{i,j,h} ∈ [0,1]; a Feature Refiner adds column-wise type-consistency features from the other rows of the j-th column; the refinement network RN-Θ produces refined matching scores ρ_{i,j,h} ∈ [0,1]; the decision module ω(δ, ρ, k), σ outputs a Link | Not-Link decision with confidence score ω_{i,j}; learning from human feedback updates Θ and ⟨δ, ρ, σ⟩, enabling smart revision.]
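A minimal sketch of the decision-plus-smart-revision logic just described (toy scores; an Oracle stands in for the human, as in the paper's experiments; variable names are mine):

```python
# Toy smart revision: decide Link/Not-Link and order mentions for human review.
def omega(rho_top2, k=0.4):
    # rho_top2: refined scores of the top two candidates for one mention.
    rho1, rho2 = rho_top2
    delta = rho1 - rho2                  # distance between the top-2 candidates
    return (1 - k) * rho1 + k * delta    # omega = (1 - k) * rho + k * delta

SIGMA = 0.5  # decision threshold (0.5 by default in the paper)

mentions = {"m1": (0.92, 0.10), "m2": (0.55, 0.50), "m3": (0.48, 0.45)}
scored = {m: omega(s) for m, s in mentions.items()}
decisions = {m: ("LINK" if w >= SIGMA else "NIL") for m, w in scored.items()}

# Review the most uncertain mentions first (closest to the threshold).
review_order = sorted(scored, key=lambda m: abs(scored[m] - SIGMA))
print(decisions, review_order)
```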

Experimental Settings
• Evaluate the quality of the links with NIL prediction in ~out-of-domain training settings
  ◦ Main: F1 compared with top SemTab scorers (greedy algorithms)
  ◦ k-fold validation with out-of-domain testing (5 datasets for train, 1 for test)
  ◦ Ablation: impact of the different components (ranking + PN + RN)
  ◦ Ablation: impact of the parameter k (final matching score vs distance between top candidates)
• Evaluate the effectiveness of the uncertainty measure to support smart revision
  ◦ Main: increase in link quality at incremental revision iterations
  ◦ User revision simulated with an Oracle
  ◦ Area Under the Curve of F1 at an increasing number of revised mentions
  ◦ Fair experimental simplification: global ranking (all tables) vs local ranking (one table)
  ◦ Ablation: impact of the parameter k for ordering the mentions to be revised
TABLE I – STATISTICS OF THE DATASETS USED IN THE EXPERIMENTS

| dataset | # tables | # columns | # rows | # entities (CEA) | # classes (CTA) | # predicates (CPA) |
| Round1 T2D | 64 | 323 | 9089 | 8078 | 119 | 115 |
| Round3 | 2161 | 9736 | 152753 | 390456 | 5761 | 7574 |
| Round4 | 22207 | 78750 | 475897 | 994920 | 31921 | 56475 |
| 2T-2020 | 180 | 802 | 194438 | 667243 | 539 | 0 |
| HardTableR2 | 1750 | 5589 | 29280 | 47439 | 2190 | 3835 |
| HardTableR3 | 7207 | 17902 | 58949 | 58948 | 7206 | 10694 |

…of the problem, including textual, semantic, and contextual information, to enhance the entity resolution process.
Table II provides details about the architecture of the neural network employed in this study. It is a plain feed-forward neural network, whose hyper-parameters were determined through preliminary experiments. In recent times, deep networks have demonstrated remarkable potential in handling increasingly difficult and complex tasks, often rivaling or even surpassing human capabilities. These networks are typically built using highly intricate architectures. However, for this particular work, we opted for an approach that prioritizes simplicity and speed, while still maintaining excellent learning capability and generalization. Although we acknowledge that the network's classification capability could be enhanced, devising an architecture optimized for the candidate ranking task is beyond the scope and objectives of this paper.

TABLE II – MODEL ARCHITECTURE

| Layer (type) | Output Shape | Param # | Connected to |
| dense | (64,) | 1344 | (20,) |
| batchnorm | (64,) | 256 | (64,) |
| dense1 | (128,) | 8320 | (64,) |
| batchnorm1 | (128,) | 512 | (128,) |
| dense2 | (256,) | 33024 | (128,) |
| batchnorm2 | (256,) | 1024 | (256,) |
| dense3 | (128,) | 32896 | (256,) |
| batchnorm3 | (128,) | 512 | (128,) |
| dense4 | (64,) | 8256 | (128,) |
| batchnorm4 | (64,) | 256 | (64,) |
| dense5 | (2,) | 130 | (64,) |

C. Uncertainty Estimation and Decision
Uncertainty estimation and decision is a two-step task: computing for each candidate an uncertainty measure ω to rank all mentions, and setting a threshold σ (0.5 by default) to classify a mention as linked or unlinked. The parameter ω is computed as a linear combination of the confidence score ρ and the parameter δ, which is defined as the difference between the scores ρ of the top two candidates.
The formula to compute the set Ω, composed of the ω for each i-th mention, is reported in (1), where M is the total number of mentions, and k is a learnable weight that can be, possibly, fine-tuned with the HITL support:

Ω = {ω_i}_{i=1..M},  ω_i = (1 − k)·ρ_i + k·δ_i   (1)

Sorting Ω produces a global ranking of the candidates associated with mentions that can be used to split the set of mentions into linked and unlinked subsets. The intuition is that candidates with the highest ω can be considered correct, and the others of uncertain classification. Human review can help disambiguate uncertain cases.

D. Human Revision
The process described so far automatically classifies each mention as either linked or unlinked. Subsequently, the annotations are presented to the user, who reviews the results and verifies or corrects the annotations generated by the matching algorithm.
The ordered set Ω facilitates the assessment of the level of uncertainty associated with each link. This empowers the user to determine which cases should be prioritized for review based on the estimated degree of uncertainty involved. Considering that manual link review is a time-consuming task, a straightforward criterion is to commence from the most uncertain links (i.e., those with the lowest values of ω) and proceed incrementally, stopping when the estimated uncertainty is deemed sufficiently low.
User choices can be collected and propagated backward to the ranking, uncertainty estimation and decision modules to review alternative results, thereby enhancing the overall performance of the approach across the entire dataset. Examples of potential fine-tuning actions include: i) updating the Θ parameters of the neural network, ii) adjusting the weight k used to estimate link uncertainty, and iii) modifying the threshold σ.
The next section discusses the experiments to show that the proposed approach is effective.

IV. EVALUATION
In this section, we outline the implementation of the proposed approach in the current tools, provide a description of the conducted experiments, and discuss the obtained results.

A. Tools
In our experiments, we exploited two tools: LamAPI, which aggregates the preliminary step of Candidate retrieval, and s-elBat for the annotation steps of Candidate ranking and Uncertainty estimation and decision. The Human revision step has been simulated by an Oracle that has been used to perform experiments ranging from reviewing all mentions in a dataset, to establish the upper limit of the review process, to minimal…

• Benchmark data
  ◦ Links to DBpedia | Wikidata
  ◦ Tables may introduce specific/different challenges

Experimental Results
• Entity linking (main)
  ◦ All components are relevant
  ◦ Competitive results despite NIL prediction (benchmark data rewards greedy decisions)
  ◦ Gaps on test sets with specific data distributions (also due to the retrieval module)
• Smart revision (main)
  ◦ Confidence-based revision >> faster than >> random revision
TABLE III – F1 FOR EACH STEP IN THE LINKING WORKFLOW

| Test Dataset | Retrieval with indexing (F1) | PN ranking (F1) | PN + RN ranking with types (F1) | SemTab Top Scorer (F1) |
| RoundT2D | 0.82 | 0.83 | 0.86 | 0.90 |
| Round3 | 0.72 | 0.73 | 0.76 | 0.97 |
| Round4 | 0.83 | 0.90 | 0.91 | 0.99 |
| 2T-2020 | 0.62 | 0.86 | 0.89 | 0.90 |
| HardTableR2 | 0.90 | 0.91 | 0.93 | 0.98 |
| HardTableR3 | 0.52 | 0.54 | 0.62 | 0.97 |

TABLE IV – F1 WITH HITL INCREMENTAL PERCENTAGE OF REVIEWS

| Test Dataset | k | 10% | 20% | 30% | 40% | 50% |
| RoundT2D | 0.4 | 0.91 | 0.95 | 0.97 | 0.98 | 0.98 |
| Round3 | 0.5 | 0.82 | 0.87 | 0.94 | 0.97 | 0.98 |
| Round4 | 0.1 | 0.95 | 0.97 | 0.98 | 0.99 | 0.99 |
| 2T-2020 | 0.9 | 0.93 | 0.94 | 0.95 | 0.96 | 0.98 |
| HardTableR2 | 0.4 | 0.98 | 0.99 | 1.0 | 1.0 | 1.0 |
| HardTableR3 | 0.4 | 0.68 | 0.75 | 0.81 | 0.86 | 0.90 |

…the model's predictive quality, irrespective of the chosen classification threshold. Fig. 2 shows the values of F1 calculated for different percentages of links to be reviewed and different values of k. The embedded table reports the AUC performance measures. The figure refers to the experiment with the fold that excludes the HardTable-R2 dataset.
The evidence is that we need to review at most 30% of the mentions of the training set to reach 0.98 for F1, and that almost any value of k produces similar results. The best value for k is 0.4, with AUC = 0.9725.
[Fig. 2. F1 and AUC computed for the training dataset.]
The learned value of k is finally applied to the test dataset to compute the F1 score and confirm the effectiveness of the method. The result for the HardTable-R2 test dataset is reported in Fig. 3, where the results obtained with k = 0.0 (i.e., considering only the ρ scores given by the model), k = 1.0 (i.e., considering only the δ values), and a random selection of candidates for review are also displayed.
[Fig. 3. F1 and AUC computed for the test dataset.]
The results provide evidence that the learned value of k demonstrates even better performance on the test dataset, achieving a remarkable AUC value of 0.9929 and an F1 score above 0.98 after examining only 10% of the mentions.
Table IV presents the results obtained from the experiments conducted on all datasets. The outcomes are consistent with the aforementioned discussion. Specifically, it is evident that in the case of outlier datasets, such as Round3, even with less than 30% of reviews the F1 score surpasses 0.90, whereas the performance of the highest-scoring participant in the Challenge (refer to Table III) is achieved with 40% of reviews. Moreover, for datasets with fewer typos, the threshold of F1 > 0.90 is attained much earlier. As an illustration, the maximum F1 score of 0.98 is accomplished after reviewing only 10% of the uncertain cases for the HardTableR2 dataset.
The lessons learned from the experiments are that i) choosing sample sets to review is not a valid alternative, since F1 increases linearly and almost all candidates need to be reviewed to reach high values of F1; ii) the most relevant indicator for uncertainty is δ, since high results can be obtained with k = 1.0, which implies not considering ρ; but iii) considering also ρ may correct specific situations where candidates with high probability ρ could be ranked too low.

V. CONCLUSIONS AND FUTURE WORK
In this paper, we proposed a HITL approach to entity linking on tabular data, aiming to improve quality and control through user interactions. Our approach uses a neural network as a re-ranker and score normalizer for candidate entities, on top of off-the-shelf entity retrievers. It supports unlinked-mention prediction and incorporates a parameterized decision function based on matching scores and confidence. The score used in the decision function is also used as a signal of uncertainty to prioritize mentions that require human revision. The proposed approach can be easily integrated into existing applications for interactive tabular data annotation and enrichment [4], [5]. In future work, we plan to explore mechanisms to learn from the user feedback by updating the network parameters wisely.

REFERENCES
[1] Y. Qian, E. Santus, Z. Jin, J. Guo, and R. Barzilay, "GraphIE: A graph-based framework for information extraction," arXiv preprint arXiv:1810.13083, 2018.
AUC on HardTable-R2
Also: more interpretable scores for human interaction

3) HITL FOR TEXTUAL DATA ENRICHMENT (LEGAL DOMAIN)

Entity Extraction from Legal Documents

[Diagram: end-to-end entity extraction with a background KB (beyond NER): mentions "[...] A. Donati [...]" and "[...] Dott.sa Donati [...]" in different documents are resolved to the entity Anna Donati; NER → enriched text → enrichment + KG construction.]

Excerpt from: Evaluation of Incremental Entity Extraction with Background Knowledge and Entity Linking, IJCKG'22, October 27-29, 2022, Hangzhou, China (hybrid):

[Figure 2: Schema of the pipeline.]

Table 2 – Recall results of the linking-from-scratch experiment (on AIDA [38]) with retrieval time and disk requirements:

| | R@1 | R@3 | R@10 | R@30 | #vectors | time (s) | disk (MB) |
| first | 73.7 | 85.6 | 91.9 | 95.8 | 4022 | 2.4 | 16 |
| medoid | 79.5 | 91.1 | 95.9 | 98.7 | 4022 | 2.4 | 16 |
| all | 93.9 | 96.7 | 98.5 | 99.3 | 18319 | 44.1 | 72 |

Tests have been conducted on AIDA for the same reasons as for the NIL predictor.
Although the latter strategy obtains the highest recall values, we decided to use the medoid vector to save resources. In fact, it takes 5% of the time and requires 22% of the disk space. This strategy indeed represents a good trade-off between efficiency and effectiveness.
Despite the human intervention, the problem of finding the best representation for new entities still persists: a validator may annotate an unrepresentative set of mentions for a given entity, since embeddings are not interpretable, introducing a bias in the next batches.

4 TRANSFORMING A NEL BENCHMARK INTO AN INCREMENTAL ENTITY EXTRACTION BENCHMARK
In order to test our pipeline in a realistic scenario, the chosen test dataset should be representative of characteristics that we expect to find in the real world. For example, the entity frequency distribution should have a long right tail: there are a few popular (highly mentioned) entities, while most are not well known; entities in train and test data should belong to the same domains (domain adaptation is an interesting problem, but is left out of this evaluation); the dataset should be big enough to easily train data-hungry models and to have a test set that can be split into several batches.
The most valuable candidate datasets for being adapted to the incremental scenario are: AIDA [38], the Zero-shot EL dataset [21], KORE50 [10], TACKBP-2010 [13], and WikilinksNED Unseen-Mentions (WNUM) [27]. In this paper we applied the following methodology to WNUM, because it is the only one freely available with the above-mentioned features (e.g., AIDA is smaller than WNUM and has no ground truth on NILs) and it uses links to Wikipedia entities, as do most state-of-the-art NEL algorithms, which use Wikipedia as the reference KB [3, 6, 35]. Observe that even if Wikipedia may not be considered a proper KB, links between Wikipedia and Wikidata or DBpedia exist, in such a way that information about most Wikipedia entities can be collected from proper KBs. In addition, it is designed so that each of the sets (train, dev, test) never contains a mention-entity pair that is present in another set, which is a highly appreciated property for our task.

Table 3 – Statistics about the dataset before and after the transplant:

| | mentions (NIL) | entities (new) |
| train | 2.2M (25744) | 86184 (17957) |
| dev | 10k (316) | 2397 (61) |
| test | 10k (307) | 2514 (63) |
| train (after) | 2.008M (25365) | 81858 (17619) |
| dev (after) | 100k (501) | 7105 (214) |
| test (after) | 100k (501) | 6473 (248) |

New entities. In order to simulate the presence of new entities, we randomly flag some entities as NILs, preserving the ground truth. This process is made proportional to the number of mentions of the entity itself in the train set, so that a certain number of new entities can be chosen arbitrarily. For each entity we calculate the score:

p_NIL(x) = p · M / #x ∈ (0, +1]   (1)

where p is the desired percentage of NIL entities (we set it to p = 0.1 [1]); M is the median of the entity frequencies in the train set (the mean provides inaccurate results due to the presence of a long tail of low-frequency entities); and #x is the number of mentions referring to the entity x in the train set. This function is monotonically decreasing, providing a higher number of New Entities among those mentioned only once in the train set, since, conceptually, there are a lot of new entities which are not included due to data quality matters (low mentioned) and a few entities which may become popular (in the meaning of "highly mentioned") in a small span of time.
Then, we flag each entity as new (NIL) according to a Bernoulli test with probability p = p_NIL; entities flagged as NILs are removed from the BG-KB, since they become unknown. At this point, some mentions of NIL entities are transplanted from the train set to the dev and test sets only, to increase the number of new entities in these latter sets (see Table 3), obtaining more robust metrics: this step is made randomly so that both the dev and test sets have 500 new mentions each. Finally, we divide the test set into 10 batches with a stratified sampling on entity frequencies and NILs. Table 4 shows several per-batch statistics about the dataset.

5 EVALUATION
To evaluate I-NEL we need to consider that, given a mention m that refers to an entity E, the correct behavior may depend also on previous batches: in case E is not present in the BG-KB at time t_i, m should be classified as NIL; but, if one of the previous batches…
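A minimal sketch of the NIL-flagging scheme from formula (1) above (function and variable names are mine; the paper's actual resource is at https://github.com/rpo19/Incremental-Entity-Extraction):

```python
# Toy NIL flagging: entities with few train mentions are more likely flagged as new.
import random
from statistics import median

def nil_flags(mention_counts, p=0.1, seed=0):
    """mention_counts: {entity: #mentions in the train set}. Returns the NIL entities."""
    rng = random.Random(seed)
    M = median(mention_counts.values())       # median of entity frequencies
    flagged = set()
    for entity, n in mention_counts.items():
        p_nil = min(1.0, p * M / n)           # formula (1), capped at 1 (assumption)
        if rng.random() < p_nil:              # Bernoulli test with probability p_nil
            flagged.add(entity)               # to be removed from the BG-KB downstream
    return flagged

print(nil_flags({"Anna_Donati": 1, "Mont_Blanc": 40, "Berlin": 500}))
```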

Target Applications
• Court decisions (texts)
  ◦ Semantic search, e.g., find all decisions in controversies with [Money Bank] in [2008]
  ◦ Anonymization, e.g., replace all occurrences of persons with *****
  ◦ Advanced statistics, e.g., count all decisions in controversies with banks in [2008]-[2018]
• Criminal investigations (texts + tabular data + …)
  ◦ Search on investigation files and report writing, e.g., all paragraphs mentioning [J. Smith]
  ◦ Analyze files that are hard to analyze in a timely manner today (chats, audio, files, …), e.g., all messages/chats where [J. Smith] wrote to [A. Black] about [L. Red]









ARTIFICIAL INTELLIGENCE @UNIMIB
Incremental Entity Extraction and Linking: Evaluation
52
IJCKG’22, October 27-29, 2022, Hangzhou, China (hybrid) Riccardo Pozzi, Federico Moiraghi, Fausto Lodi, and Ma ￿eo Palmonari
on the NER task or on the combination of NEL, NIL prediction and
NIL clustering [10, 13, 21, 27, 38].
However, the datasets proposed to evaluate NEL, NIL prediction,
and NIL clustering assume that the entity extraction task is executed
once on a given input corpus. Only recently it has been stressed that
in many application scenarios entity extraction must be applied to
a collection of documents that are ingested over time [4,20], in a
dynamic way. We refer to these scenarios asincremental entity ex-
traction. In these scenarios, entity extraction must be performed on
documents that are ingested incrementally in such a way that also
the KB is extended incrementally exploitingbatches of documents
as they are ingested. In [20], authors propose a similar task, where
entity coreference is applied to streams of documents, then they
propose a benchmark for the evaluation, and discuss challenges
that emerge in this continuous scenario. They explicitly contrast
this approach to entity extraction to approaches that use NEL and
BG-KBs. While we share the same arguments for motivating the
need of entity extraction solutions that operate incrementally, we
maintain that similar solutions can and should also be developed
by exploiting BG-KBs, which provide valuable support, and thus
consider the NEL, NIL prediction, and NIL clustering tasks in an
incremental scenario. Human in the loop (HITL) in entity extrac-
tion is particularly relevant for ethical concerns when information
from automatic systems is used to support decisions in sensitive
application domains, such as in the juridical context [4]. In incre-
mental entity extraction, we therefore assume that documents are
ingested and processed in batches, in such a way that HITL may
be incorporated at intermediate processing steps, may improve the
quality of the extended KB.
Incremental entity extraction with a BG-KB can be considered, as a whole, an incremental version of NEL (I-NEL), where NIL prediction and NIL clustering are follow-up tasks. As the process unfolds, we can consider the KB as consisting of two (virtual) parts: the BG-KB, which may contain millions of entities from the beginning, and its extension, named NEW-KB. NEW-KB is empty at the beginning and is updated incrementally as each new batch is processed; an example of an incremental update is shown in Figure 1. A key feature of I-NEL is that, when processing the i-th batch, NEL is expected to link not only to entities stored in the BG-KB, but also to entities stored in the NEW-KB after processing the previous batches. Thus, the NEL algorithm must use limited information when trying to link to entities stored in the NEW-KB after previous batches. The information about new entities may gradually increase, but errors can also propagate across batches. An example of the task is shown in Figure 1, where the entity John Smith is not present in the BG-KB but is recognized as a new entity and added to the NEW-KB at the end of the batch processing; in the following batches, we expect the mention of John Smith to be correctly linked to the (now known) entity in the NEW-KB.
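To make the batch mechanics concrete, the following minimal Python sketch mirrors the I-NEL loop described above. It is illustrative only, not the authors' released pipeline: link, is_nil, and cluster_nils are hypothetical stand-ins for the NEL, NIL-prediction, and NIL-clustering components.

def incremental_nel(batches, bg_kb, link, is_nil, cluster_nils):
    """Process document batches, growing NEW-KB alongside the fixed BG-KB."""
    new_kb = {}  # NEW-KB starts empty and is extended after every batch
    for batch in batches:
        links = []
        for mention in batch:
            # Candidates come from BOTH the background KB and the entities
            # added to NEW-KB while processing *previous* batches.
            candidates = {**bg_kb, **new_kb}
            entity, score = link(mention, candidates)
            links.append((mention, None if is_nil(mention, entity, score) else entity))
        # Mentions predicted as NIL are clustered; each cluster becomes a new
        # entity, so later batches can link to it (and errors can propagate).
        nil_mentions = [m for m, e in links if e is None]
        for cluster in cluster_nils(nil_mentions):
            new_kb[f"NEW:{len(new_kb)}"] = cluster
        yield links  # a human validator (HITL) could inspect/correct here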
In addition to the definition of the incremental entity extraction task with a BG-KB, the main contributions of this paper are the following resources: 1) a methodology to evaluate the task by adapting benchmark datasets for entity extraction to the incremental (batch-based) scenario; 2) an incremental version of the WikilinksNED Unseen-Mentions (WNUM) dataset [27]; 3) a pipeline that combines strong baselines for each subtask, i.e., NEL, NIL prediction, and NIL clustering; and 4) an evaluation of the pipeline and a discussion of the challenges introduced by the incremental scenario. The incremental version of WNUM, the solution to generate it, and the baseline pipeline are resources that are documented and made publicly available [1: https://github.com/rpo19/Incremental-Entity-Extraction].
Figure 1: Documents are processed in batches through time; at each iteration, novel entities are added to the NEW-KB and can be linked in the following steps. Between each step, a human validator can correct pipeline mistakes, splitting/merging clusters and fixing links.
The paper is organized as follows: we first discuss related work in Section 2; then, in Section 3, we introduce a baseline pipeline to help the reader understand the task that is evaluated with the proposed methodology. Afterwards, we discuss the methodology to create the dataset (Section 4) and the result of its application to WNUM. Finally, we discuss the evaluation on the incremental datasets (Section 5) and how errors propagate through the pipeline.
2 STATE OF THE ART
The conceptualization of KB population with a BG-KB as a task composed of four sub-tasks is not novel and can be traced back to TAC [2: https://tac.nist.gov/] (the Text Analysis Conference), with its knowledge base population track (TAC-KBP), and to several other works [12]; our contribution, however, focuses on an incremental version of this task, shortened as I-NEL. The importance of applying entity extraction solutions in an incremental setting has recently been stressed in prior work [20], which proposes methods without NEL. In this paper, we focus on presenting an evaluation methodology for the I-NEL task, as well as an end-to-end baseline that accounts for error propagation. In the present section, we first review recent work related to each subtask, to identify the best candidate components for the proposed pipeline; then we discuss the limitations of current evaluation methodologies.
2.1 Named Entity Linking
The task of NEL consists in linking the correct entity e, taken from an arbitrary KB, to a given mention m. Recent works rely on neural networks of various kinds, adopting a wide range of strategies [29]. This work relies on the bi-encoder architecture [11], which uses BERT self-attention to map the mention vector into the significance […]
Pozzi, R., Moiraghi, F., Lodi, F., & Palmonari, M. (2022, October). Evaluation of Incremental Entity Extraction with Background Knowledge and
Entity Linking. In Proceedings of the 11th International Joint Conference on Knowledge Graphs (pp. 30-38). [IJCKG22]
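To make the bi-encoder idea concrete, here is a hedged Python sketch of candidate scoring in the style of [11]: the mention in context and each entity description are encoded independently, and candidates are ranked by dot product. The model name, [CLS] pooling, and the toy mention/candidates are assumptions for illustration, not the exact configuration used in the paper.

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    # Encode the text and pool the [CLS] token into a single dense vector.
    batch = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = enc(**batch)
    return out.last_hidden_state[:, 0]  # shape (1, hidden_size)

mention = "[John Smith] was heard as a witness"   # mention marked in context
candidates = {                                    # toy entity descriptions
    "NEW:0": "John Smith, person mentioned in earlier batches ...",
    "Q42":   "John Smith (politician), member of parliament ...",
}
scores = {eid: (embed(mention) @ embed(desc).T).item()
          for eid, desc in candidates.items()}
best = max(scores, key=scores.get)  # highest dot product wins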
Batches of documents acquired at different time points
Background KB (e.g., Wikipedia) | KB with NEW entities
*Use Case*: build a KB from criminal investigation documents/data
Dataset (construction sketched below)
• Split of WikilinksNED Unseen-Mentions in 10 batches [Onoe&Durrett AAAI20]
• Injection/transplant of NIL entities (~same overall %)
Main challenges
• Error propagation
• NIL Prediction
• Clustering
Similar conclusions as in [Kassner&al. ACL22]
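The dataset construction summarized in the notes above can be sketched in a few lines. This is not the released generation code (see the GitHub link in the excerpt); it only shows the two operations in spirit, with illustrative names and an assumed nil_ratio: some gold entities are hidden from the reference KB so that their mentions become NIL, and the examples are split into 10 batches.

import random

def make_incremental_dataset(examples, kb_entities, n_batches=10, nil_ratio=0.1, seed=0):
    rng = random.Random(seed)
    # Entities "transplanted" out of the KB: their mentions must be treated as NIL.
    nil_entities = set(rng.sample(sorted(kb_entities), int(nil_ratio * len(kb_entities))))
    for ex in examples:
        ex["gold"] = None if ex["entity_id"] in nil_entities else ex["entity_id"]
    rng.shuffle(examples)                      # spread NIL mentions across batches
    size = len(examples) // n_batches
    return [examples[i * size:(i + 1) * size] for i in range(n_batches)]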

ARTIFICIAL INTELLIGENCE @UNIMIB
Incremental Entity Extraction and Linking: Evaluation
53
Certain application domains require HITL end-to-end entity extraction to achieve production-level quality
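As a hint of what the human validation step between batches could look like in code, here is a minimal, entirely hypothetical sketch: NEW-KB is modeled as a dict mapping entity ids to sets of mentions, and validator feedback arrives as merge/split operations, as in the Figure 1 caption. This is an assumption for illustration, not an API from the paper or from the tool on the next slide.

def apply_validator_feedback(new_kb, feedback):
    """new_kb: {entity_id: set_of_mentions}; feedback: list of op tuples."""
    for op in feedback:
        if op[0] == "merge":            # two clusters actually denote one entity
            _, keep, drop = op
            new_kb[keep] |= new_kb.pop(drop)
        elif op[0] == "split":          # one cluster mixes two distinct entities
            _, eid, subset = op
            new_kb[eid] -= set(subset)
            new_kb[eid + "-b"] = set(subset)  # hypothetical id for the split-off part
    return new_kb

For example, a split operation would move the "Smith & Co." mentions out of a person cluster before the next batch is linked, preventing the mistake from propagating.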

ARTIFICIAL INTELLIGENCE @UNIMIB
Dave: Semantic Search + HITL Annotation
54
All visible names in this text are made up, as is other personally identifiable information. None of the facts mentioned in this decision refer to the names mentioned therein.

ARTIFICIAL INTELLIGENCE @UNIMIB
4)
CONCLUSIONS
AND FUTURE WORK
55

ARTIFICIAL INTELLIGENCE @UNIMIB
Conclusions & Future Work
• Conclusions
  – Data linking + data extension: core semantic data enrichment tasks
  – Tabular data and textual data
• Similar tasks: annotations >> KG construction | enriched data
• Still several challenges
  – NIL prediction and entity clustering
  – Incremental KB construction from tables and text
  – HITL approach
• Interactive data enrichment to overcome intrinsic limitations
• Enrichment at scale while controlling the quality
• Future work
  – Full-fledged HITL: learning from user feedback
  – Combining Generative AI and data enrichment algorithms for dialogical data enrichment
56

ARTIFICIAL INTELLIGENCE @UNIMIB
THANKS! QUESTIONS?
57
Funding acknowledgements
The work presented in this presentation has received funding from the European Union's Horizon 2020 research and innovation program under grant agreements No 732590 (EW-Shopp) and No 732003 (euBusinessGraph), and from the European Union's Horizon Europe research and innovation program under grant agreement No 101070284 (enRichMyData).







