Architecture of a search engine

SylvainUtard 9,150 views 19 slides Apr 28, 2014
Slide 1
Slide 1 of 19
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19

About This Presentation

My Paris Tech Talk #7 slides, April 2014. Architecture of a search engine, full-text search from my technical point of view.


Slide Content

Architecture of a Search Engine
Paris Tech Talks #7 - April ’14
@sylvainutard - @algolia

•Today Search means Google
•Search is a daily activity
•Search is complex
•DB are (probably) not handling text queries
•Speed and relevance are keys
•Fuzzy matching: typos!
2
Search

•Databases
•Optimized for INSERT/UPDATE/DELETE/
SELECT (that's a lot)
•Strong query syntax (mostly SQL)
•Some operations scan all your documents
(missing index?)
3
Why Search engines?

•Search engines
•HIGHLY optimized for “SELECT” (only)
•Full-text queries: understand what is a word
•Query execution time driven by the number of
matching documents
•And obviously, “LIKE '%foo bar%’" is not full-
text search
4
Why Search engines?

5
Why Search engines?
Search
Push data
periodically or
in realtime
Full-text search
Primary storage
(DB, files, ...)
Search engine
Application

•Input = documents
•Composed by multiple attributes (textual,
numerical, geo)
•Output = documents
•Full-text query and/or numerical !lters
•Understandable results: match score (ranking) +
highlighting
6
How it works

•2 distinct processes
•Indexing: storing documents in a highly
optimized way to answer queries
•Query
•Matching documents
•Ranking matched documents
7
Implementation

•Indexing means building an “index“ or “inverted
lists“
•A dedicated data structure optimized for search
•Input = a set of documents containing words
•Output = a set of words associated to
documents
8
Implementation: Indexing process

9
Implementation: Indexing process
foo bar baz
Doc 1
bar foo
Doc 2
baz baz qux
Doc 3
foo
bar
baz
qux
Doc 1, Doc 2
Doc 1, Doc 3
Doc 1, Doc 2
Doc 3
Indexing
Inverted lists
Documents Index

•Queries
•Goal = Retrieve all documents matching a user
query
•Order results from the highest ranked to the
lowest
10
Implementation: Query process

11
Implementation: Query process
foo
bar
baz
qux
Doc 1, Doc 2
Doc 1, Doc 3
Doc 1, Doc 2
Doc 3
Inverted lists
Index
User query "baz"
Sort matching
documents
Pagination
•1-word query = inverted lists intersection

12
Implementation: Query process
•N-words query = inverted lists intersection
foo
bar
baz
qux
Doc 1, Doc 2
Doc 1, Doc 3
Doc 1, Doc 2
Doc 3
Inverted lists
Index
User query "baz qux"
Sort matching
documents
Intersect inverted
lists
Pagination

•But how do you handle typing mistakes?
•Edit-distance algorithms (ex: Levenshtein)
!
•levenshtein(bar, baz) = 1 (substitution)!
•levenshtein(bar, br) = 1 (deletion)!
•levenshtein(bar, foobar) = 3 (addition)!
•Comparing a word with all known words
would be too costly
13
Implementation: Query process

14
Implementation: Query process
•The words dictionary is stored in a TRIE to enable
Levenshtein-based lookups (recursive-based traversal)
Doc 1 (pos=1, 3)
Doc 2 (pos=3)
Doc 1 (pos=2)
Doc 3 (pos=1)
Index
Doc 1 (pos=4)
Doc 3 (pos=2)
b c
a o
r z o
f

15
Implementation: Query process
Example: faz
Doc 1 (pos=1, 3)
Doc 2 (pos=3)
Doc 1 (pos=2)
Doc 3 (pos=1)
Index
Doc 1 (pos=4)
Doc 3 (pos=2)
b c
a o
r z o
f
faz (distance=1)
faz (distance=0)
faz (distance=1)
faz (distance=1)
faz (distance=2) faz (distance=1)
faz (distance=2)
faz (distance=3)

•How are the matching documents ranked?
•Number of match occurrences? TF-IDF ?
•Numerical value re"ecting popularity?
•Number of typing mistakes?
•Proximity between matched words?
•…
16
Implementation: Query process

17
Several implementations

•What I didn’t speak about:
•Numerical/Geo queries (Including operators)
•Advanced query syntax (boolean operators, proximity
operators)
•Faceting & Aggregations (Categorization)
•Sharding (Horizontal scalability)
•Incremental indexing (Generational data structures)
•… (see u next time)
18
Missing subjects

Q/A
Now or later [email protected]