Mattingly "AI & Prompt Design: Named Entity Recognition"

BaltimoreNISO, May 14, 2024

About This Presentation

This presentation was provided by William Mattingly of the Smithsonian Institution during the fifth segment of the NISO training series "AI & Prompt Design." Session Five, "Named Entity Recognition with LLMs," was held on May 2, 2024.


Slide Content

Prompt Design 05: Named Entity Recognition

Goals
  Named Entity Recognition (NER) as a Concept
  Rules-Based Approaches to NER
  Supervised Learning NER
  Unsupervised Learning NER
  Transformer-Based NER
  GliNER
  Large Language Models NER

What is NER?

John went to Paris on 1 August 2023.

Named Entity Recognition
John went to Paris on 1 August 2023.
  John => PERSON
  Paris => LOCATION
  1 August 2023 => DATE

Non-LLM Approaches to NER

Traditional Approaches
  Rules-Based
  Task-Specific Machine Learning Model
  Unsupervised Learning
  GliNER (Brand new!)

Rules-Based NER

Traditional NER: Rules-Based
  Gazetteer
  Linguistic Rules
  Nested Conditions
  RegEx

Rules-Based: Gazetteer
List of Entities
Concentration Camps:
  Auschwitz
  Bergen-Belsen
  Buchenwald
  …
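A gazetteer lookup can be sketched in a few lines of Python. This is only an illustration: the function name `gazetteer_ner`, the `CONC_CAMP` label string, and the sample sentence are assumptions, and the list holds only the three names shown on the slide.

```python
import re

# Hypothetical gazetteer: only the three names listed on the slide.
CONC_CAMPS = ["Auschwitz", "Bergen-Belsen", "Buchenwald"]

def gazetteer_ner(text, names, label):
    """Return (start, end, matched_text, label) for every gazetteer hit."""
    # Longest names first, so a longer entry wins over a shorter one
    # that begins the same way.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in ordered) + r")\b"
    )
    return [(m.start(), m.end(), m.group(), label) for m in pattern.finditer(text)]

# Sample sentence invented for illustration.
sample = "She was deported to Auschwitz and later moved to Bergen-Belsen."
hits = gazetteer_ner(sample, CONC_CAMPS, "CONC_CAMP")
for hit in hits:
    print(hit)
```

A lookup like this finds only exact spellings, which is one reason gazetteers are usually combined with the other rule types.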

Rules-Based: Linguistic Rules
Leverages the linguistic data of a text to assign an entity.
Use an NLP framework, like spaCy or NLTK.
Example: Nearly two hundred of them were taken to Berlin.
Rule: A verb of movement followed by a preposition(s) [to, towards, away to] and a location.
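The rule above can be approximated without an NLP framework. The sketch below is a simplified stand-in, not spaCy or NLTK: it replaces part-of-speech tags with small hand-written word lists (`MOVEMENT_VERBS` and `PREPOSITIONS` are hypothetical) and treats capitalized words after the preposition as the location.

```python
import re

# Hypothetical word lists standing in for real part-of-speech tags.
MOVEMENT_VERBS = {"taken", "sent", "moved", "transported", "marched", "went"}
PREPOSITIONS = {"to", "towards"}

def find_locations(text):
    """Return capitalized spans that follow a movement verb plus a preposition."""
    tokens = re.findall(r"[A-Za-z]+", text)  # crude tokenizer: letters only
    locations = []
    for i, token in enumerate(tokens):
        if token.lower() not in MOVEMENT_VERBS:
            continue
        j = i + 1
        # Allow the two-word form "away to".
        if j < len(tokens) and tokens[j].lower() == "away":
            j += 1
        if j < len(tokens) and tokens[j].lower() in PREPOSITIONS:
            j += 1
            name = []
            # Collect consecutive capitalized tokens as the location name.
            while j < len(tokens) and tokens[j][0].isupper():
                name.append(tokens[j])
                j += 1
            if name:
                locations.append(" ".join(name))
    return locations

print(find_locations("Nearly two hundred of them were taken to Berlin."))
```

With a real framework, the word lists would give way to the parser's own verb and preposition tags, which generalize far better than a fixed vocabulary.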

Rules-Based: Nested Conditions
Find conditions in which things occur to then assign a label.
Example: We were taken to the Warsaw Ghetto.
Rule: If an entity is a LOCATION and the word “ghetto” appears within a context of 5 tokens, change entity to GHETTO.
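The nested condition above can be written as a small post-processing step over entities that an earlier pass has already found. The data layout here (token-index `start`/`end` with an exclusive end, and the function name `relabel_ghettos`) is an assumption made for illustration.

```python
def relabel_ghettos(tokens, entities, window=5):
    """Change LOCATION to GHETTO when "ghetto" occurs within `window` tokens.

    `entities` is a list of dicts with token indices 'start' and 'end'
    (end exclusive) and a 'label'. A new list is returned; the input is
    left unchanged.
    """
    updated = []
    for ent in entities:
        label = ent["label"]
        if label == "LOCATION":
            lo = max(0, ent["start"] - window)
            hi = min(len(tokens), ent["end"] + window)
            context = [tok.lower() for tok in tokens[lo:hi]]
            if "ghetto" in context:
                label = "GHETTO"
        updated.append({**ent, "label": label})
    return updated

tokens = ["We", "were", "taken", "to", "the", "Warsaw", "Ghetto", "."]
entities = [{"start": 5, "end": 6, "label": "LOCATION"}]  # "Warsaw"
result = relabel_ghettos(tokens, entities)
print(result)
```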

Rules-Based: RegEx
Regular expressions are a complex way of doing fuzzy string matching.
Text: Hic pagus unus, cum domo exisset, patrum nostrorum memoria L. Cassium consulem interfecerat et eius exercitum sub iugum miserat.
Entity: Lucius Cassius
Pattern: (?:[A-Z]\.\s)?Cassi(?:us|um|i|o|orum|is)
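As a quick check, the pattern from the slide can be run against the Latin sentence with Python's `re` module. The variable names are illustrative.

```python
import re

# Pattern copied from the slide: an optional abbreviated praenomen
# (e.g. "L. ") followed by an inflected form of "Cassius".
pattern = re.compile(r"(?:[A-Z]\.\s)?Cassi(?:us|um|i|o|orum|is)")

text = (
    "Hic pagus unus, cum domo exisset, patrum nostrorum memoria "
    "L. Cassium consulem interfecerat et eius exercitum sub iugum miserat."
)

matches = [m.group() for m in pattern.finditer(text)]
print(matches)  # ['L. Cassium']
```

One caveat: Python tries alternatives left to right, so with `o` listed before `orum` the form "Cassiorum" would match only as "Cassio". Listing longer endings first avoids that.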

Machine Learning NER

Machine Learning: Supervised Learning
{
  "text": "John Doe was a prisoner at Auschwitz during World War II.",
  "entities": [
    { "type": "PERSON", "value": "John Doe", "start_pos": 0, "end_pos": 8 },
    { "type": "CONC_CAMP", "value": "Auschwitz", "start_pos": 27, "end_pos": 36 }
  ]
}
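Character offsets in training data are easy to get wrong by hand, and a misaligned span teaches the model the wrong thing. The sketch below computes offsets from the text rather than typing them, then checks that every span slices back to its value. The helper names `make_entity` and `offsets_are_valid` are hypothetical; the record layout follows the slide.

```python
def make_entity(text, value, ent_type):
    """Build an entity dict, computing offsets from the first occurrence of value."""
    start = text.find(value)
    if start == -1:
        raise ValueError(f"{value!r} not found in text")
    return {
        "type": ent_type,
        "value": value,
        "start_pos": start,
        "end_pos": start + len(value),  # end is exclusive
    }

def offsets_are_valid(record):
    """True if every entity's offsets slice the text to exactly its value."""
    return all(
        record["text"][ent["start_pos"]:ent["end_pos"]] == ent["value"]
        for ent in record["entities"]
    )

text = "John Doe was a prisoner at Auschwitz during World War II."
record = {
    "text": text,
    "entities": [
        make_entity(text, "John Doe", "PERSON"),
        make_entity(text, "Auschwitz", "CONC_CAMP"),
    ],
}
print(record["entities"])
print(offsets_are_valid(record))
```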

Machine Learning: Unsupervised Learning
  Vectorize all multi-word tokens
  Plot them to identify patterns
Exercise: https://wjbmattingly.com/unsupervised-ner/

Machine Learning: Zero-Shot NER
GliNER => A transformer architecture that allows you to pass a text and your own labels to a model without any training.
Example: https://huggingface.co/spaces/tomaarsen/gliner_medium-v2.1

Large Language Models

LLMs: Benefits
  Contextual Understanding
  Less Manual Effort
  Adaptability
  Improved Accuracy
  Multilingual Capability

LLMs: Limitations
  Resource Intensity (and Cost)
  Data Privacy Concerns
  Black Box Models
  Training Data Bias
  Generalization Challenges
  Latency Issues
  Hallucinations
  Consistency

LLMs: How to use LLMs
  Thinking through your methodology for NER
  Assisting in certain steps of NER (RegEx)
  Zero-Shot NER
  Few-Shot NER
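The difference between zero-shot and few-shot NER comes down to whether the prompt includes worked examples. The sketch below only builds the prompt text and does not call any model. The function name `build_ner_prompt`, the wording of the instructions, and the requested JSON output format are all assumptions, not a format prescribed in the presentation.

```python
import json

def build_ner_prompt(text, labels, examples=()):
    """Build an NER prompt; with no examples it is zero-shot, otherwise few-shot.

    `examples` is a sequence of (example_text, list_of_entity_dicts) pairs.
    """
    lines = [
        "Extract the named entities from the text.",
        "Allowed labels: " + ", ".join(labels) + ".",
        'Respond with a JSON list of objects with the keys "text" and "label".',
        "",
    ]
    for example_text, example_entities in examples:
        lines.append("Text: " + example_text)
        lines.append("Entities: " + json.dumps(example_entities))
        lines.append("")
    lines.append("Text: " + text)
    lines.append("Entities:")
    return "\n".join(lines)

labels = ["PERSON", "LOCATION", "DATE"]

zero_shot = build_ner_prompt("We were taken to the Warsaw Ghetto.", labels)

few_shot = build_ner_prompt(
    "We were taken to the Warsaw Ghetto.",
    labels,
    examples=[
        (
            "John went to Paris on 1 August 2023.",
            [
                {"text": "John", "label": "PERSON"},
                {"text": "Paris", "label": "LOCATION"},
                {"text": "1 August 2023", "label": "DATE"},
            ],
        )
    ],
)
print(few_shot)
```

Asking for structured output makes the response easier to parse, but the model's answer should still be validated against the source text, given the hallucination and consistency limitations of LLMs.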

Exercise 1: Use an LLM to help develop a solution(s) to identify gender-specific people in a text. Discuss the options as a group and judge their merits. Consider the ethical implications of the proposed solutions.

Mrs. Jessica Monica Kapitan works at the office. Mrs. Kapitan is a lawyer. She is also friends with Mrs. Thompson and Miss. Smith. Sometimes Miss. Smith will miss her train.

Exercise 2: Capture all examples of Miss. and Mrs. in the text with their corresponding names, using an LLM to generate RegEx: https://regex101.com/r/TLfbGE/1

Exercise 2: One Solution
\b(Mrs\.|Miss\.)\s+([A-Z][a-z]*(?:\s+[A-Z][a-z]*)*)
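The pattern can be tested outside regex101 with Python's `re` module, using the Mrs./Miss. sample paragraph from the exercise. The variable names are illustrative.

```python
import re

# Pattern copied from the slide: group 1 is the honorific, group 2 the name.
pattern = re.compile(r"\b(Mrs\.|Miss\.)\s+([A-Z][a-z]*(?:\s+[A-Z][a-z]*)*)")

text = (
    "Mrs. Jessica Monica Kapitan works at the office. Mrs. Kapitan is a lawyer. "
    "She is also friends with Mrs. Thompson and Miss. Smith. "
    "Sometimes Miss. Smith will miss her train."
)

pairs = pattern.findall(text)  # list of (honorific, name) tuples
for honorific, name in pairs:
    print(honorific, name)
```

The lowercase verb "miss" in the last sentence is not captured, because the pattern requires a capital M and a trailing period.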

Mr. Thomas and Dr. Jessica Davis went to the store. They met Mrs. Stevens who works at a nearby office. They are all friends with Colonel Jackson. Col. Jackson is known to her friends by her first name, Terry. They all know Mr. and Mrs. Kapitan.

Exercise 3: Capture all examples of [Honorific Entity] in the text with their corresponding names, using an LLM to generate RegEx: https://regex101.com/r/FYcO8C/1

Exercise 3: One Solution
\b(Mr\.|Mrs\.|Miss\.|Dr\.|Colonel|Col\.)\s+([A-Z][a-z]*(?:\s+[A-Z][a-z]*)*)
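Running this pattern against the Exercise 3 paragraph shows both what it captures and where it falls short. The snippet below is a minimal check using Python's `re` module; the variable names are illustrative.

```python
import re

# Pattern copied from the slide: group 1 is the honorific, group 2 the name.
pattern = re.compile(
    r"\b(Mr\.|Mrs\.|Miss\.|Dr\.|Colonel|Col\.)\s+([A-Z][a-z]*(?:\s+[A-Z][a-z]*)*)"
)

text = (
    "Mr. Thomas and Dr. Jessica Davis went to the store. "
    "They met Mrs. Stevens who works at a nearby office. "
    "They are all friends with Colonel Jackson. "
    "Col. Jackson is known to her friends by her first name, Terry. "
    "They all know Mr. and Mrs. Kapitan."
)

pairs = pattern.findall(text)
for honorific, name in pairs:
    print(honorific, name)

# Honorifics that appear in the text but were not paired with a name.
found = [honorific for honorific, _ in pairs]
print("Mr. occurrences in text:", text.count("Mr. "), "captured:", found.count("Mr."))
```

In "Mr. and Mrs. Kapitan" only "Mrs. Kapitan" is captured, because the pattern expects a capitalized name directly after the honorific. The pattern also has no way to link "Colonel Jackson", "Col. Jackson", and "Terry" as the same person, which is the kind of gap the LLM-based exercises explore.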

Exercise 4: Use an LLM to identify the people in the following text. Think through an ethical way to use an LLM to assign potential gender in these contexts.

Dr. Tracey Jordan works at the Smithsonian where he develops methods to identify named entities. Mrs. Alex Jackson leads the team. She was trained in machine learning at Stanford. While Tracey functions as the domain expert, Alex Jackson designs the experiments. They have another colleague, Leslie Peters.