This talk covers the challenges of supporting autocomplete (instant) search across languages, including Solr search configuration, scoring, tokenization, custom components, and testing.
Added: Oct 14, 2016
Slides: 24 pages
Slide Content
OCTOBER 11-14, 2016 • BOSTON, MA
Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries
Ivan Provalov
Sr Software Engineer, Netflix
Going Global at Netflix
•Netflix launched globally in January 2016
•190 countries
•Currently support 23 languages
Use Case
•Video titles, person's names, genre names
•Shorter documents should be ranked higher
•Autocomplete
•Recall over precision for lexical matches (click signal corrects this)
Configuration
•Solr 4.6.1
•Edismax: boosting, simple syntax, max field score
•Phrase: prevents cross-field matches
•Ngram: character ngram search
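
The configuration bullets above can be sketched in a few lines of Python. This is only an illustration, not the configuration used in the talk: the field names (`title_ngram`, `person_ngram`) and boosts are hypothetical, and `char_ngrams` is a plain-Python stand-in for what an n-gram analysis filter produces at index time. The request parameters `defType`, `q`, `qf`, and `tie` are standard Solr edismax parameters; with `tie` at 0 the document score is the score of the best-matching field.

```python
from urllib.parse import urlencode


def char_ngrams(text, min_n=1, max_n=3):
    """Return all character n-grams of text with min_n <= n <= max_n."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)]


def edismax_phrase_params(user_input, fields):
    """Build Solr request parameters for an edismax phrase query.

    fields maps (hypothetical) field names to boosts.
    """
    # Escape backslashes and quotes so the input stays one quoted phrase.
    escaped = user_input.replace("\\", "\\\\").replace('"', '\\"')
    return {
        "defType": "edismax",
        "q": f'"{escaped}"',  # phrase query: all terms must match in one field
        "qf": " ".join(f"{name}^{boost}" for name, boost in fields.items()),
        "tie": "0.0",         # score comes from the best-matching field only
    }


if __name__ == "__main__":
    print(char_ngrams("Star", 2, 3))
    params = edismax_phrase_params("star w", {"title_ngram": 2.0, "person_ngram": 1.0})
    print(urlencode(params))
```

Quoting the whole input as a phrase is what keeps a two-word query from matching one word in a title field and the other in a person field.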
Scoring
•Skewed data distribution (e.g. one field sparsely populated)
•Doc length normalization
•Unigram language model
•Term Frequency / Terms in Doc
•Log to avoid underflow errors
•Negative scores (these break the Dismax scorer in 5.5.2)
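
A minimal sketch of the scoring idea above, assuming a query-likelihood reading of the bullets: each query term contributes log(term frequency / terms in doc), and the contributions are summed. The `epsilon` floor for unseen terms is an assumption added here so the logarithm is defined; the slides do not say how missing terms are handled.

```python
import math
from collections import Counter


def unigram_lm_score(query_terms, doc_terms, epsilon=1e-6):
    """Sum of log P(term | doc), with P = term frequency / terms in doc."""
    counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p = counts[term] / doc_len if doc_len else 0.0
        # Summing logs avoids the underflow of multiplying small probabilities.
        # epsilon is a hypothetical floor for terms absent from the document.
        score += math.log(max(p, epsilon))
    return score


if __name__ == "__main__":
    short_doc = ["star", "wars"]
    long_doc = ["star", "wars", "the", "clone", "wars", "movie"]
    print(unigram_lm_score(["star"], short_doc))
    print(unigram_lm_score(["star"], long_doc))
```

Because every probability is at most 1, every score is zero or negative, which is the situation the last bullet refers to. The shorter document scores higher for the same match, giving the document length normalization the use case asks for.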
Language Challenges
•Multiple Scripts
–Japanese: Kanji, Hiragana, Katakana, Romaji
•No token delimiters: Japanese, Chinese
•Korean character composition
•Stopwords and autocomplete
•Stemming
Character Mapping Filter Cases
•Prefix removal
–Arabic ال (alef lam)
•Suffix folding
–Japanese ァ (katakana small a) => ア (a)
•Character decomposition
–Korean ᅟᅰ (jungseong we) => ㅜ (u) and ㅔ (e)
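
The three mapping cases can be illustrated with a small Python sketch. It is not the Lucene character mapping filter or the patch from the talk; the names `CHAR_MAP`, `ARABIC_ARTICLE`, and `normalize` are hypothetical, whitespace tokenization is assumed, and code points are written as escapes so the mapping is unambiguous.

```python
# Hypothetical mapping table for the folding and decomposition cases.
CHAR_MAP = str.maketrans({
    "\u30a1": "\u30a2",        # katakana small a -> katakana a
    "\u1170": "\u315c\u3154",  # jungseong we -> compatibility jamo u + e
})

ARABIC_ARTICLE = "\u0627\u0644"  # alef followed by lam


def normalize(text):
    """Strip the Arabic article prefix and apply the character map per token."""
    tokens = []
    for token in text.split():
        # Remove the prefix only when something remains after it.
        if token.startswith(ARABIC_ARTICLE) and len(token) > len(ARABIC_ARTICLE):
            token = token[len(ARABIC_ARTICLE):]
        tokens.append(token.translate(CHAR_MAP))
    return " ".join(tokens)


if __name__ == "__main__":
    print(normalize("\u30d5\u30a1"))                          # small a folded
    print(normalize("\u0627\u0644\u0643\u062a\u0627\u0628"))  # article removed
```

Applying the same normalization at index time and at query time keeps both sides of the match consistent.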
Character Mapping Filter Cases
•Stemmer implementation, or extension
–Character mapper reference implementation of the Russian stemmer
•Patch to Lucene
–LUCENE-7321
Query Testing Framework
•Open source project
•Google Spreadsheets based UI
•Unit tests for per-language queries
•Regression testing after changes, upgrades
•20K queries
•7K titles
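
The regression-testing idea can be sketched as follows. This is not the open source framework from the talk, which is driven from Google Spreadsheets; `toy_search` and `run_query_tests` are hypothetical stand-ins, with test cases held in a plain list of (query, expected title) pairs.

```python
def toy_search(query, titles):
    """Stand-in search: substring match, shorter titles ranked first."""
    q = query.lower()
    hits = [t for t in titles if q in t.lower()]
    return sorted(hits, key=len)


def run_query_tests(cases, search, top_n=3):
    """Return (query, expected, top results) for every failing case."""
    failures = []
    for query, expected in cases:
        top = search(query)[:top_n]
        if expected not in top:
            failures.append((query, expected, top))
    return failures


if __name__ == "__main__":
    titles = ["Narcos", "Stranger Things", "Star Trek"]
    cases = [("str", "Stranger Things"), ("nar", "Narcos"), ("xyz", "Narcos")]
    for failure in run_query_tests(cases, lambda q: toy_search(q, titles)):
        print("FAIL", failure)
```

Running the same cases before and after a configuration change or an upgrade shows which queries regressed.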
References
•Query testing framework
•Chris Manning IR Book, LM Chapter
•Trey Grainger’s presentation on Semantic & Multilingual Strategies in Lucene/Solr
•Character Mapping Patch and Documentation
•Java Internationalization, March 25, 2001, by David Czarnecki, Andy Deitsch