A BERT model for Humanitarian Document Geolocation

kkalimeri · 17 slides · Feb 26, 2025

About This Presentation

Best paper award at GoodIT 2024. This presentation describes geolocation biases and proposes a new BERT model that improves geolocation for humanitarian documents.


Slide Content

Slide 1: Leave no Place Behind: Improved Geolocation in Humanitarian Documents. Enrico M. Belliardo, Kyriaki Kalimeri, Yelena Mejova, ISI Foundation, Turin, Italy. GoodIT, September 6, 2023.

Slide 2: Zaatari refugee camp, the world’s largest camp for Syrian refugees, in Jordan. Opened in July 2012; now a permanent settlement.

Slide 3: If I search for Zaatari on Google Maps, I find a car wash in Italy.

Slide 4: Information overload in the humanitarian sector. DEEP – a collaborative analysis platform for effective aid responses.

Slide 5: Geolocation extraction from text. Geographic locations can be ambiguous and written in many ways and languages, and location databases (gazetteers) are Western-biased. See https://unsdg.un.org/2030-agenda/universal-values/leave-no-one-behind and https://www.theguardian.com/news/datablog/2015/apr/28/the-hidden-biases-of-geodata

Slide 6: Geolocation extraction from text involves two steps: geotagging, the extraction of text fragments that may be a location (“toponyms”), and geocoding, the disambiguation of a toponym to a specific geographic location.
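
A toy, data-only illustration of the two steps may help; the span offsets match the sentence shown, the GeoNames ID is intended as the record for the Syrian Arab Republic, and the coordinates are an approximate centroid, all purely for illustration.

```python
# Geotagging finds the toponym span; geocoding resolves it to one gazetteer record.
text = "Latest events in central Syria"

# Step 1: geotagging -> a text fragment that may be a location ("toponym")
toponym = {"text": "Syria", "start": 25, "end": 30}
assert text[toponym["start"]:toponym["end"]] == "Syria"

# Step 2: geocoding -> disambiguation to a unique GeoNames entry
geocoded = {
    "toponym": "Syria",
    "geonames_id": 163843,     # intended as the GeoNames record for Syria
    "lat": 35.0, "lon": 38.0,  # approximate centroid, illustrative only
}
```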

Slide 7: Data. We download the humanitarian documents and reports listed in HumSet and convert HTML & PDF to text, yielding 15,661 documents from 45 projects covering 33 countries. We annotate a sample for geotagging and geocoding.
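
The slide does not name the conversion libraries; the sketch below assumes BeautifulSoup for HTML and pypdf for PDF, purely as stand-ins for whatever the pipeline actually used.

```python
# Hedged sketch of the HTML/PDF-to-text step; library choices are assumptions.
from bs4 import BeautifulSoup
from pypdf import PdfReader

def html_to_text(path):
    """Strip tags from an HTML file and return plain text."""
    with open(path, encoding="utf-8") as f:
        return BeautifulSoup(f.read(), "html.parser").get_text(" ", strip=True)

def pdf_to_text(path):
    """Concatenate the extractable text of every page in a PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```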

Slide 8: Geotagging (finding toponyms in text). 469 English-language documents were coded by DEEP annotators using the Label Studio app; the sample was stratified by country, filtered to have enough text, and pre-annotated with the union of spaCy en_core_web_md, roBERTa, and xlm-roberta-base-wikiann-ner. We distinguish “literal” vs. “associative” toponyms (as defined by Gritta et al.): literal, “latest events in central Syria”; associative, “Syria Red Cross aided border regions”. Total of 11,025 toponyms. Gritta, Milan, Mohammad Taher Pilehvar, and Nigel Collier. “A pragmatic guide to geoparsing evaluation: Toponyms, Named Entity Recognition and pragmatics.” Language Resources and Evaluation 54 (2020): 683–712.
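
A minimal sketch of the pre-annotation step, taking the union of place entities from spaCy's en_core_web_md and the xlm-roberta-base-wikiann-ner tagger; loading the latter from the Davlan Hugging Face repository, dropping the separate roBERTa model, and the simple span-union rule are all assumptions made to keep the example short.

```python
# Union of location spans from two NER models, used here as pre-annotations.
import spacy
from transformers import pipeline

spacy_nlp = spacy.load("en_core_web_md")
wikiann_ner = pipeline("ner",
                       model="Davlan/xlm-roberta-base-wikiann-ner",  # assumed checkpoint path
                       aggregation_strategy="simple")

def pre_annotate(text):
    """Return sorted character spans that either model tags as a place."""
    spans = set()
    for ent in spacy_nlp(text).ents:          # spaCy place entities
        if ent.label_ in {"GPE", "LOC"}:
            spans.add((ent.start_char, ent.end_char))
    for ent in wikiann_ner(text):             # transformer location entities
        if ent["entity_group"] == "LOC":
            spans.add((ent["start"], ent["end"]))
    return sorted(spans)

print(pre_annotate("Syria Red Cross aided border regions near Zaatari."))
```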

Slide 9: Geotagging (finding toponyms in text).

Slide 10: Geocoding (identifying geolocations/GPS). Toponyms are related to a unique GeoNames ID with a custom-built tool, pre-matched using a search engine built on GeoNames location names and selecting only administrative divisions (AD), populated places (PPL), mountains (MT), seas (SEA), lakes (LK), islands (ISL), and airports (AIR). This yields 561 unique document/toponym match pairs from 39 documents, with 474 having non-empty matches, spanning 78 countries.
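
The geocoder on the slide is a custom search engine over GeoNames; as a rough stand-in, the sketch below queries the public GeoNames search web service, restricted to feature classes that only approximately cover the slide's categories, and GEONAMES_USER is a placeholder account name.

```python
# Retrieve GeoNames candidates for a toponym, filtered by broad feature classes.
import requests

GEONAMES_USER = "demo"  # placeholder; a registered GeoNames username is required

def geocode_candidates(toponym, max_rows=5):
    params = {
        "q": toponym,
        "maxRows": max_rows,
        "username": GEONAMES_USER,
        # A = admin divisions, P = populated places, T = mountains/islands,
        # H = seas/lakes, S = spots (includes airports); only a rough mapping
        # of the slide's AD/PPL/MT/SEA/LK/ISL/AIR categories.
        "featureClass": ["A", "P", "T", "H", "S"],
    }
    resp = requests.get("http://api.geonames.org/searchJSON", params=params, timeout=30)
    resp.raise_for_status()
    return [(g["geonameId"], g["name"], g.get("countryCode", ""))
            for g in resp.json().get("geonames", [])]

print(geocode_candidates("Zaatari"))
```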

Slide 11: Geocoding (identifying geolocations/GPS).

Slide 12: Annotations available at https://github.com/embelliardo/HumSet_geolocation_annotations (see paper).

Slide 13: Improving geotagging. We tune spaCy and roBERTa models on the new data, evaluating on exact matches and also partial matches, plus a strict setting that tests on an unseen country.
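
The slide does not spell out the overlap rule behind "partial matches"; the sketch below counts a predicted span as exact when its offsets coincide with a gold span and as partial when it merely overlaps one, which is one plausible reading.

```python
# Exact vs. partial span matching for geotagging evaluation (assumed overlap rule).
def spans_overlap(a, b):
    """True if the half-open character spans a and b share any characters."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0])) > 0

def match_counts(gold_spans, pred_spans):
    gold_set = set(gold_spans)
    exact = sum(1 for p in pred_spans if p in gold_set)
    partial = sum(1 for p in pred_spans
                  if any(spans_overlap(p, g) for g in gold_spans))
    return exact, partial

gold = [(25, 30), (60, 72)]
pred = [(25, 30), (58, 72), (80, 85)]
print(match_counts(gold, pred))  # (1, 2): one exact, two at-least-overlapping
```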

Slide 14: Improving geocoding: introducing FeatureRank. Search for candidate locations (using exact match or Okapi BM25F), compute the country distribution of all guesses in the document, and rank candidates by features including whether the candidate is a capital or country, its administrative level, its population, and the document country distribution.
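
A rough sketch of a FeatureRank-style scorer over geocoding candidates; the feature list follows the slide, but the weights, the linear scoring formula, and the example values are assumptions, not the paper's actual ranking.

```python
# Rank GeoNames candidates by capital/country status, admin level, population,
# and how often the candidate's country appears among the document's guesses.
from collections import Counter

def feature_rank(candidates, doc_country_counts):
    """candidates: dicts with geonames_id, country, is_capital_or_country,
    admin_level, population. Returns candidates sorted best-first."""
    total = sum(doc_country_counts.values()) or 1

    def score(c):
        country_share = doc_country_counts.get(c["country"], 0) / total
        return (2.0 * c["is_capital_or_country"]   # capitals/countries preferred
                - 0.5 * c["admin_level"]           # higher-level divisions preferred
                + 1.0 * (c["population"] > 0)      # populated entries preferred
                + 3.0 * country_share)             # countries dominant in the document

    return sorted(candidates, key=score, reverse=True)

doc_countries = Counter({"SY": 12, "JO": 3})  # illustrative document distribution
cands = [
    {"geonames_id": 163843, "country": "SY", "is_capital_or_country": True,
     "admin_level": 0, "population": 17_500_000},   # country-level entry
    {"geonames_id": 0, "country": "IT", "is_capital_or_country": False,
     "admin_level": 3, "population": 0},            # hypothetical minor entry
]
print([c["geonames_id"] for c in feature_rank(cands, doc_countries)])  # [163843, 0]
```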

Slide 15: Extracting locations in HumSet. We annotate 6,733 documents, extracting 13,967 distinct locations.

Slide 16: Next steps: expand to event detection (quantitative extraction, time extraction, entity grouping into events, summarization, analysis).

Slide 17: Enrico M. Belliardo, Yelena Mejova ([email protected]), Kyriaki Kalimeri.
Tags: Dataset