A BERT model for Humanitarian Document Geolocation
kkalimeri
17 slides
Feb 26, 2025
About This Presentation
Best paper award at GoodIT 2024. This paper presents geolocation biases and proposes a new BERT model that improves geolocation for humanitarian documents.
Size: 5.69 MB
Language: en
Added: Feb 26, 2025
Slides: 17 pages
Slide Content
Leave no Place Behind: Improved Geolocation in Humanitarian Documents. Enrico M. Belliardo, Kyriaki Kalimeri, Yelena Mejova. ISI Foundation, Turin, Italy. GoodIT, September 6, 2023
Zaatari refugee camp: the world’s largest camp for Syrian refugees, in Jordan. Opened in July 2012; now a permanent settlement.
If I search for Zaatari on Google Maps, I find a car wash in Italy.
Information overload in the humanitarian sector. DEEP: a collaborative analysis platform for effective aid responses.
Geolocation extraction from text
- Geographic locations can be ambiguous and written in many ways and languages.
- Location databases (gazetteers) are Western-biased.
https://unsdg.un.org/2030-agenda/universal-values/leave-no-one-behind
https://www.theguardian.com/news/datablog/2015/apr/28/the-hidden-biases-of-geodata
Geolocation extraction from text
- Geotagging: the extraction of text fragments that may be a location (“toponyms”).
- Geocoding: the disambiguation of the toponym to a specific geographic location.
Data
- Download humanitarian documents and reports listed in HumSet.
- Convert HTML and PDF to text.
- 15,661 documents from 45 projects across 33 countries.
- We annotate a sample for geotagging and geocoding.
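The HTML-to-text step can be illustrated with a minimal sketch using only the Python standard library. The slides do not say which converter was actually used, so the class name `TextExtractor`, the function `html_to_text`, and the choice of block-level tags are assumptions made for illustration. PDF conversion needs a third-party library and is not covered here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores <script> and <style> contents.

    Hypothetical helper for illustration; not the tool used by the authors.
    """

    SKIP = {"script", "style"}
    # Assumed set of block-level tags that should act as word separators.
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


sample = (
    "<html><head><style>p{}</style></head><body>"
    "<h1>Situation report</h1>"
    "<p>Latest events in central <b>Syria</b>.</p>"
    "<script>x=1</script></body></html>"
)
print(html_to_text(sample))
```

Treating block-level tags as separators keeps headings and paragraphs from running together, which matters later because toponym offsets are computed on the extracted text.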
Geotagging (finding toponyms in text)
- 469 English-language documents coded by DEEP annotators using the Label Studio app.
- Sample stratified by country and filtered to have enough text.
- Pre-annotated with a union of spaCy en_core_web_md and RoBERTa xlm-roberta-base-wikiann-ner.
- “Literal” vs. “associative” toponyms (as defined by Gritta et al.):
  - Literal: “latest events in central Syria”
  - Associative: “Syria Red Cross aided border regions”
- Total of 11,025 toponyms.
Gritta, Milan, Mohammad Taher Pilehvar, and Nigel Collier. "A pragmatic guide to geoparsing evaluation: Toponyms, Named Entity Recognition and pragmatics." Language Resources and Evaluation 54 (2020): 683-712.
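The pre-annotation step takes a union of the toponym spans proposed by two taggers. A minimal sketch of such a union over character offsets follows. The slides do not state how overlapping spans from the two models were reconciled, so fusing them into one longer span is an assumption, and the two span lists are hand-written stand-ins for real model output.

```python
def union_spans(*span_lists):
    """Merge (start, end) character spans from several taggers.

    Spans use an exclusive end offset. Overlapping spans are fused into one;
    this merge policy is an assumption, not taken from the slides.
    """
    spans = sorted(span for spans in span_lists for span in spans)
    merged = []
    for start, end in spans:
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


text = "Latest events in central Syria near Zaatari camp"
syria = text.index("Syria")
central = text.index("central")
zaatari = text.index("Zaatari")

# Hypothetical outputs of two taggers on the same text.
tagger_a = [(syria, syria + len("Syria"))]
tagger_b = [(central, syria + len("Syria")), (zaatari, zaatari + len("Zaatari"))]

for start, end in union_spans(tagger_a, tagger_b):
    print(text[start:end])
```

A union favors recall: annotators then only need to delete or adjust wrong suggestions, which is usually faster than marking every toponym from scratch.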
Geotagging (finding toponyms in text)
Geocoding (identifying geolocations/GPS)
- Relating toponyms to the unique GeoNames ID.
- Custom-built tool.
- Pre-matched using a search engine built on GeoNames location names, selecting only administrative division (AD), populated place (PPL), mountain (MT), sea (SEA), lake (LK), island (ISL) and airport (AIR).
- 561 unique document/toponym match pairs from 39 documents, with 474 having non-empty matches, spanning 78 countries.
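The pre-matching step restricts gazetteer candidates to a few feature types. The sketch below shows one way such a filter could look. The record layout (`id`, `name`, `feature_code`, `country`), the prefix-matching rule, and the toy gazetteer entries with placeholder IDs are all assumptions for illustration; the abbreviations are the ones listed on the slide, which may not coincide exactly with the codes in a GeoNames dump.

```python
# Feature-type abbreviations as listed on the slide, treated here as prefixes
# of a record's feature code (an assumption).
ALLOWED_PREFIXES = ("AD", "PPL", "MT", "SEA", "LK", "ISL", "AIR")


def keep_candidate(record):
    """True if the record's feature code starts with an allowed prefix."""
    return record["feature_code"].startswith(ALLOWED_PREFIXES)


def prematch(toponym, gazetteer):
    """Case-insensitive exact-name lookup, restricted to allowed feature types."""
    wanted = toponym.casefold()
    return [
        record
        for record in gazetteer
        if record["name"].casefold() == wanted and keep_candidate(record)
    ]


# Toy gazetteer with placeholder IDs; not real GeoNames records.
toy_gazetteer = [
    {"id": 1, "name": "Zaatari", "feature_code": "PPL", "country": "JO"},
    {"id": 2, "name": "Zaatari", "feature_code": "BLDG", "country": "IT"},
    {"id": 3, "name": "Mafraq", "feature_code": "ADM1", "country": "JO"},
]

print([record["id"] for record in prematch("zaatari", toy_gazetteer)])
```

Filtering by feature type before annotation removes candidates such as individual buildings, which shortens the list an annotator has to inspect for each toponym.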
Geocoding (identifying geolocations/GPS)
Annotations available at: https://github.com/embelliardo/HumSet_geolocation_annotations (see paper)
Improving geotagging
- Tuning spaCy and RoBERTa models on the new data.
- Evaluated on exact matches and also on partial matches.
- Strict setting: test on an unseen country.
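The difference between exact and partial matching, and the strict unseen-country setting, can be made concrete with a small sketch. The slides do not define partial matching precisely, so counting any character overlap as a partial match is an assumption, as are the function names and the document layout with a `country` field.

```python
def overlaps(a, b):
    """True if two (start, end) spans with exclusive ends share a character."""
    return a[0] < b[1] and b[0] < a[1]


def span_scores(gold, pred, partial=False):
    """Precision, recall and F1 for predicted toponym spans.

    Exact mode requires identical offsets; partial mode accepts any overlap
    (an assumed definition of a partial match).
    """
    match = overlaps if partial else (lambda a, b: a == b)
    hit_pred = sum(any(match(p, g) for g in gold) for p in pred)
    hit_gold = sum(any(match(g, p) for p in pred) for g in gold)
    precision = hit_pred / len(pred) if pred else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def split_by_country(docs, held_out):
    """Strict split: every document from the held-out country goes to test."""
    train = [d for d in docs if d["country"] != held_out]
    test = [d for d in docs if d["country"] == held_out]
    return train, test


# Gold: "central Syria" and "Zaatari"; prediction clips the first span to "Syria".
gold = [(17, 30), (36, 43)]
pred = [(25, 30), (36, 43)]
print(span_scores(gold, pred))                # exact matching
print(span_scores(gold, pred, partial=True))  # partial matching
```

Holding out a whole country tests whether a tuned model generalizes to place names it has never seen, which is a harder condition than a random document split.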
Improving geocoding
- Introducing FeatureRank.
- Search for candidate locations (using exact match or Okapi BM25F).
- Compute the country distribution of all guesses in the document.
- Rank candidates by features, including whether the candidate is a capital or a country, its administrative level, its population, and the document's country distribution.
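The ranking idea can be sketched as follows: pool all candidate guesses in a document, estimate which countries dominate, and use that as one feature when ordering the candidates for each toponym. This is only an illustration of the steps listed on the slide. The weighted-sum scoring, the weight values, the field names, and the toy candidates with made-up populations are assumptions and should not be read as the authors' FeatureRank implementation.

```python
import math
from collections import Counter

# Hypothetical weights; the slides do not say how features are combined.
WEIGHTS = {
    "capital": 1.0,
    "country": 1.0,
    "admin": 0.5,
    "population": 0.2,
    "doc_country": 3.0,
}


def country_distribution(candidates_by_toponym):
    """Share of each country among all candidate guesses in one document."""
    counts = Counter(
        cand["country"]
        for cands in candidates_by_toponym.values()
        for cand in cands
    )
    total = sum(counts.values())
    return {country: n / total for country, n in counts.items()} if total else {}


def feature_score(cand, dist):
    """Weighted sum of the features named on the slide (assumed form)."""
    return (
        WEIGHTS["capital"] * cand["is_capital"]
        + WEIGHTS["country"] * cand["is_country"]
        + WEIGHTS["admin"] / (1 + cand["admin_level"])
        + WEIGHTS["population"] * math.log10(1 + cand["population"])
        + WEIGHTS["doc_country"] * dist.get(cand["country"], 0.0)
    )


def rank(toponym, candidates_by_toponym):
    """Order the candidates of one toponym from most to least plausible."""
    dist = country_distribution(candidates_by_toponym)
    return sorted(
        candidates_by_toponym[toponym],
        key=lambda cand: feature_score(cand, dist),
        reverse=True,
    )


def cand(country, capital=False, admin_level=3, population=0):
    """Build a toy candidate record (values are illustrative only)."""
    return {"country": country, "is_capital": capital, "is_country": False,
            "admin_level": admin_level, "population": population}


document_candidates = {
    "Amman": [cand("JO", capital=True, admin_level=0, population=1_000_000)],
    "Mafraq": [cand("JO", admin_level=1, population=50_000)],
    "Zaatari": [cand("IT"), cand("JO")],
}
print(rank("Zaatari", document_candidates)[0]["country"])
```

In this toy document most guesses point to Jordan, so the Jordanian candidate for Zaatari outranks the Italian one even though the two are otherwise identical. That is the effect the document country distribution is meant to have.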