Phonetic Matching with Apache Solr

256 views 35 slides Jan 04, 2023
Slide 1
Slide 1 of 35
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35

About This Presentation

In this session, I'll talk about phonetic matching with Apache Solr. We start off with an explanation of various Soundex algorithms and learn their shortcomings, then move on to the characteristics of the Beider-Morse algorithm for phonetic matching. I'll demonstrate how to integrate Beider-...


Slide Content

with Apache Solr
Markus Günther
Freelance Softw are Engineer / Architect
| |
Phonetic Matching
[email protected]@markus_guenther

Phonetic matching is concerned with searching for spelling variations in large databases.
Age-old problem
Algorithmic solutions da te back to the pre-computer era
Soundex was invented by Russell and Odell in 1912
Compute a phonetic value for a given name
Names that sound the same share the same phonetic value
Variation: American Soundex
Variation: Daitch-Mokotoff (DM) Soundex
Soundex tends to generate many false hits which lowers precision
© 2022 Markus Günther IT-Beratung

American Soundex is a reasonably simple algorithm.
Rules
1. Replace the fir st letter of the name and drop all occurences of a, e, i, o, u ,y, h, w.
2. Replace consonants with digits as suggested by mapping table.
3. Retain only the fir st letter for two or more adjacent letters mapped to the same number.
4. Retain only the fir st letter for two letters mapped to the same number that are separated
by h, w, or y.
5. Trim the encoded numbers to a total of three. Pad with 0 if there are less than three.
© 2022 Markus Günther IT-Beratung

American Soundex is a reasonably simple algorithm.
public class AmericanSoundex {
private static final String MAPPING = "01230120022455012623010202";
public static String encode(final String term) {
char code[] = { term.charAt(0), '0', '0', '0'};
char previousDigit = encode(code[0]);
int count = 1;

for (int i = 1; i < term.length() && count < code.length; i++) {
final char ch = term.charAt(i);
if (ch == 'H' || ch == 'W' || ch == 'Y') continue;
final char digit = encode(ch);
if (digit != '0' && digit != previousDigit) {
code[count++] = digit;
}
previousDigit = digit;
}
return String.valueOf(code);
}
}
© 2022 Markus Günther IT-Beratung

Let's take a look at a couple of examples.
Name Phonetic value
RobertR163
RupertR163
RubinR150
AshcraftA261
AshcroftA261
© 2022 Markus Günther IT-Beratung

American Soundex is not optimized for Eastern European names.
Name Phonetic value
SchwarzeneggerS625
ShwarzeneggerS625
SchwartseneggerS632
A search application would not be able to find a match with that misspelling.
© 2022 Markus Günther IT-Beratung

Daitch-Mokotoff Soundex has a solution f or this.
Name Phonetic values
Schwarzenegger474659, 479465
Shwarzenegger474659, 479465
Schwartsenegger479465
Given a pair of names, we have a phonetic match if at least one of their codes match.
© 2022 Markus Günther IT-Beratung

Soundex suffers from a focus on the anlaut for short names leading to false-positiv es.
Phonetic valueNames
S300 Scott, Seth, Sadie, Satoya, ...
C500 Connie, Cheyenne, Conway, ...
T200 Tasha, Tessa, Tekia, ...
© 2022 Markus Günther IT-Beratung

This isn't always the case, though.
Phonetic valueNames
M622 Marcus, Marcos, Marques, Markus, Marquice, Marquisa, ...
F652 Frank, Francisco, Francis, Franklin, Francois, ...
C150 Chevonne, Chavon, Chavonne, Chivon, Cobin, ...
© 2022 Markus Günther IT-Beratung

Beider-Morse Phonetic Matching

Instead of focusing on spelling, Beider-Morse factors in linguistic properties of a language.
Of limited interest for common nouns, adjectiv es, adverbs and verbs
Good strategy for proper nouns (i.e., names)
History: Started off primarily for matching surnames of Ashkenazic Jews
Example: Consider variations of Schwarz (standard German spelling)
Schwartz (alternate German spelling)
Shwartz, Shvartz, Shvarts (Anglicized spelling)
Szwarc (Polish), Szwartz (blended German-Polish)
Svarc (Hungarian), Chvartz (blended French-German)
© 2022 Markus Günther IT-Beratung

Step 1: Identifying the language
BMPM includes about 200 rules for determining the language
Some are general, some need context
Examples Inferred Language(s)
tsch, final mann or witzGerman
final and initial cs or zsHungarian
cz, cy, initial rz or wl, ...Polish
ö and ü German, Hungarian
Allows to specify a language explicitly
© 2022 Markus Günther IT-Beratung

Step 2: Calculating the exact phonetic value
Forms of surnames used by women diff er in some languages
Affects Slavic languages, Polish, Russian, Lithuanian, Latvian
Masculine endingsFeminine endings
Suchy Sucha
Novikov Novikova
BMPM replaces feminine endings with masculine ones
© 2022 Markus Günther IT-Beratung

Step 2: Calculating the exact phonetic value
1. Replace feminine endings with masculine ones.
2. Identify the exact phonetic value of all letters.
1. Transcribe letters into a phonetic alphabet.
Applies language-specific rule se t in case of one possible language.
Applies generic rule set in case of multiple possible languag es.
2. Apply phonetic rules that are common to many languages.
e.g. final devoicing, regressive assimilation
3. At the end, the algorithm yields the exact phonetic value.
© 2022 Markus Günther IT-Beratung

Step 2: What do language-specific rules look like?
BMPM applies roughly 80 mapping rules for German
sch maps to S
s at the start and s between two vowels maps to z
w maps to v
© 2022 Markus Günther IT-Beratung

Step 2: What do language-agnostic rules look like?
BMPM uses more than 300 generic rules
a final tz maps to ts
Some generic rules might be applicable to specific languag es only
step 1 rules out certain languages
rule is applied if it complies with the remaining possible languages
© 2022 Markus Günther IT-Beratung

Step 3: Calculating the approximate phonetic value
Some sounds can be interchangeable in specific c ontexts
beginning / end of word
previous next / letter
LanguageExample Sounds alike
Russianunstressed o is pronounced as aMostov, Mastov
Germann before b is close to m Grinberg, Grimberg
Spanishphonetic equivalence of n and mGrinberg, Grimberg
Rules can be language-agnostic or -specific
© 2022 Markus Günther IT-Beratung

Step 4: Searching for matches
1. BMPM generates the exact and approximate phonetic value for a given name.
2. We have an exact match if two names match on their exact phonetic value.
This might be too aggressive for your use-case.
3. We have an approximate match if two names match on their approximate phonetic value.
Matches done by BMPM are not necessarily commutative.
© 2022 Markus Günther IT-Beratung

Integration with Apache Solr

Apache Solr supports a variety of phonetic matching algorithms.
Beider-Morse Phonetic Matching
Daitch-Mokotoff Soundex
Double Metaphone
Metaphone
Soundex
© 2022 Markus Günther IT-Beratung

Refined Soundex
Caverphone
Cologne Phonetic
NYSISS

Add a field type tha t works with the phonetic matching algorithm.
Admissible values for ruleType are: APPROX and EXACT
They map to the semantics of approximate matches resp. exact matches
<fieldType name="phonetic_names" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"></tokenizer>
<filter class="solr.BeiderMorseFilterFactory"
nameType="GENERIC"
ruleType="APPROX"
concat="true"
languageSet="auto"></filter>
</analyzer>
</fieldType>
© 2022 Markus Günther IT-Beratung

Add an index field using the resp. field type.
You probably already have a name field of sorts for basic name searches.
Use a copyField-directive to source name_phonetic from that field.
<field name="name_phonetic"
type="phonetic_names"
indexed="true"
stored="false"
multiValued="false"></field>
<copyField source="name" dest="name_phonetic"></copyField>
© 2022 Markus Günther IT-Beratung

Execute queries against that field.
Query for mustermann
(name_phonetic:mustermann)
© 2022 Markus Günther IT-Beratung

Evaluation

Let's do a couple of experiments with diff erent parameters for BMPN.
Dataset: Large enterprise naming directory, approx. 340k individual persons
Naive implementation using phonetic matching incl. wildcards and N-Gram backed
fields yields approx. 3k results for a popular surname
Queries:
Large result set: q=(name_phonetic:meier)
Small result set: q=(name_phonetic:<some-unique-name>)
© 2022 Markus Günther IT-Beratung

Experiment 1: Querying for a popular name
VariantruleTypelanguageSet q=(phonetic_name:meier)
Naive- - 2997
1 APPROX auto 1279
2 EXACT auto 1228
3 APPROX german,english1261
4 EXACT german,english1216
Restricting languages to pre-dominantly ones of the corpus removes non-intuitive matches
Almost no noticeable diff be tween APPROX and EXACT wrt. result quality
© 2022 Markus Günther IT-Beratung

Few ordering issues, meier almost everytime before phonetic variations

Experiment 2: Querying for a unique name with spelling variations
VariantruleTypelanguageSet Correct Var. 1Var. 2
Naive- - 7 0 30 (non-intuitive)
1 APPROX auto 1 5 14 (no match, not
intuitive)
2 EXACT auto 1 0 1 (no match)
3 APPROX german,english7 (top
match)
5
(match)
25 (no match,
intuitive)
4 EXACT german,english1 0 1 (no match)
Variant 3: Precision is good, recall could be better (i.e. one-off -corrections)
© 2022 Markus Günther IT-Beratung

Adding one-off -corrections using Damerau-Levensthein distance complements BMPM.
Prerequisites
name index field that stores <first name> <middle-initial> <surname>
name index field uses n-grams
Refine the query
Can be applied within phrases as well to allow for displacements
"Mustermann Max" should yield the same results as "Max Mustermann"
(name_phonetic:mustermann) OR (name:mustermann~1)
© 2022 Markus Günther IT-Beratung

Adding a boost on fir st name and surname for direct matches.
Influence ordering a bit to always prefer direct matches before phonetic variations.
Prerequisites:
firstname index field that stores <first name> (non-analyzed, lowercased)
surname index field that stores <surname> (non-analyzed, lowercased)
Refine the query
bq=firstname:("mustermann")surname:("mustermann")
© 2022 Markus Günther IT-Beratung

Tuning BMPM using additional mechanisms yields w ell-grounded phonetic matches.
What have we done?
Test the effect of BMPM parameterizations on your dataset
Add one-off -corrections to mitig ate spelling mistakes that phonetics won't catch
Allow for displacement of max. two terms within a phrase
Boost on fir st and surname separately to influence relevance sorting
© 2022 Markus Günther IT-Beratung

Tuning BMPM using additional mechanisms yields w ell-grounded phonetic matches.
Achievements
Good trade-off be tween precision and recall
usually top match on search for unique names
Result sets are explainable
Relevance ordering feels natural
direct matches, phonetic variations, one-off c orrections
© 2022 Markus Günther IT-Beratung

Questions?