My Paris Tech Talk #7 slides, April 2014. Architecture of a search engine, full-text search from my technical point of view.
Size: 1.33 MB
Language: en
Added: Apr 28, 2014
Slides: 19 pages
Slide Content
Architecture of a Search Engine
Paris Tech Talks #7 - April ’14
@sylvainutard - @algolia
•Today Search means Google
•Search is a daily activity
•Search is complex
•DB are (probably) not handling text queries
•Speed and relevance are keys
•Fuzzy matching: typos!
2
Search
•Databases
•Optimized for INSERT/UPDATE/DELETE/
SELECT (that's a lot)
•Strong query syntax (mostly SQL)
•Some operations scan all your documents
(missing index?)
3
Why Search engines?
•Search engines
•HIGHLY optimized for “SELECT” (only)
•Full-text queries: understand what is a word
•Query execution time driven by the number of
matching documents
•And obviously, “LIKE '%foo bar%’" is not full-
text search
4
Why Search engines?
5
Why Search engines?
Search
Push data
periodically or
in realtime
Full-text search
Primary storage
(DB, files, ...)
Search engine
Application
•Input = documents
•Composed by multiple attributes (textual,
numerical, geo)
•Output = documents
•Full-text query and/or numerical !lters
•Understandable results: match score (ranking) +
highlighting
6
How it works
•2 distinct processes
•Indexing: storing documents in a highly
optimized way to answer queries
•Query
•Matching documents
•Ranking matched documents
7
Implementation
•Indexing means building an “index“ or “inverted
lists“
•A dedicated data structure optimized for search
•Input = a set of documents containing words
•Output = a set of words associated to
documents
8
Implementation: Indexing process
9
Implementation: Indexing process
foo bar baz
Doc 1
bar foo
Doc 2
baz baz qux
Doc 3
foo
bar
baz
qux
Doc 1, Doc 2
Doc 1, Doc 3
Doc 1, Doc 2
Doc 3
Indexing
Inverted lists
Documents Index
•Queries
•Goal = Retrieve all documents matching a user
query
•Order results from the highest ranked to the
lowest
10
Implementation: Query process
11
Implementation: Query process
foo
bar
baz
qux
Doc 1, Doc 2
Doc 1, Doc 3
Doc 1, Doc 2
Doc 3
Inverted lists
Index
User query "baz"
Sort matching
documents
Pagination
•1-word query = inverted lists intersection
•But how do you handle typing mistakes?
•Edit-distance algorithms (ex: Levenshtein)
!
•levenshtein(bar, baz) = 1 (substitution)!
•levenshtein(bar, br) = 1 (deletion)!
•levenshtein(bar, foobar) = 3 (addition)!
•Comparing a word with all known words
would be too costly
13
Implementation: Query process
14
Implementation: Query process
•The words dictionary is stored in a TRIE to enable
Levenshtein-based lookups (recursive-based traversal)
Doc 1 (pos=1, 3)
Doc 2 (pos=3)
Doc 1 (pos=2)
Doc 3 (pos=1)
Index
Doc 1 (pos=4)
Doc 3 (pos=2)
b c
a o
r z o
f
15
Implementation: Query process
Example: faz
Doc 1 (pos=1, 3)
Doc 2 (pos=3)
Doc 1 (pos=2)
Doc 3 (pos=1)
Index
Doc 1 (pos=4)
Doc 3 (pos=2)
b c
a o
r z o
f
faz (distance=1)
faz (distance=0)
faz (distance=1)
faz (distance=1)
faz (distance=2) faz (distance=1)
faz (distance=2)
faz (distance=3)
•How are the matching documents ranked?
•Number of match occurrences? TF-IDF ?
•Numerical value re"ecting popularity?
•Number of typing mistakes?
•Proximity between matched words?
•…
16
Implementation: Query process
17
Several implementations
•What I didn’t speak about:
•Numerical/Geo queries (Including operators)
•Advanced query syntax (boolean operators, proximity
operators)
•Faceting & Aggregations (Categorization)
•Sharding (Horizontal scalability)
•Incremental indexing (Generational data structures)
•… (see u next time)
18
Missing subjects