Architecture of a search engine

SylvainUtard 9,150 views 19 slides Apr 28, 2014

Slide 1 of 19

About This Presentation

My Paris Tech Talk #7 slides, April 2014. Architecture of a search engine, full-text search from my technical point of view.

Size: 1.33 MB

Language: en

Added: Apr 28, 2014

Slides: 19 pages

Slide Content

Architecture of a Search Engine
Paris Tech Talks #7 - April ’14
@sylvainutard - @algolia

•Today Search means Google
•Search is a daily activity
•Search is complex
•DB are (probably) not handling text queries
•Speed and relevance are keys
•Fuzzy matching: typos!
2
Search

•Databases
•Optimized for INSERT/UPDATE/DELETE/
SELECT (that's a lot)
•Strong query syntax (mostly SQL)
•Some operations scan all your documents
(missing index?)
3
Why Search engines?

•Search engines
•HIGHLY optimized for “SELECT” (only)
•Full-text queries: understand what is a word
•Query execution time driven by the number of
matching documents
•And obviously, “LIKE '%foo bar%’" is not full-
text search
4
Why Search engines?

5
Why Search engines?
Search
Push data
periodically or
in realtime
Full-text search
Primary storage
(DB, ﬁles, ...)
Search engine
Application

•Input = documents
•Composed by multiple attributes (textual,
numerical, geo)
•Output = documents
•Full-text query and/or numerical !lters
•Understandable results: match score (ranking) +
highlighting
6
How it works

•2 distinct processes
•Indexing: storing documents in a highly
optimized way to answer queries
•Query
•Matching documents
•Ranking matched documents
7
Implementation

•Indexing means building an “index“ or “inverted
lists“
•A dedicated data structure optimized for search
•Input = a set of documents containing words
•Output = a set of words associated to
documents
8
Implementation: Indexing process

9
Implementation: Indexing process
foo bar baz
Doc 1
bar foo
Doc 2
baz baz qux
Doc 3
foo
bar
baz
qux
Doc 1, Doc 2
Doc 1, Doc 3
Doc 1, Doc 2
Doc 3
Indexing
Inverted lists
Documents Index

•Queries
•Goal = Retrieve all documents matching a user
query
•Order results from the highest ranked to the
lowest
10
Implementation: Query process

11
Implementation: Query process
foo
bar
baz
qux
Doc 1, Doc 2
Doc 1, Doc 3
Doc 1, Doc 2
Doc 3
Inverted lists
Index
User query "baz"
Sort matching
documents
Pagination
•1-word query = inverted lists intersection

12
Implementation: Query process
•N-words query = inverted lists intersection
foo
bar
baz
qux
Doc 1, Doc 2
Doc 1, Doc 3
Doc 1, Doc 2
Doc 3
Inverted lists
Index
User query "baz qux"
Sort matching
documents
Intersect inverted
lists
Pagination

•But how do you handle typing mistakes?
•Edit-distance algorithms (ex: Levenshtein)
!
•levenshtein(bar, baz) = 1 (substitution)!
•levenshtein(bar, br) = 1 (deletion)!
•levenshtein(bar, foobar) = 3 (addition)!
•Comparing a word with all known words
would be too costly
13
Implementation: Query process

14
Implementation: Query process
•The words dictionary is stored in a TRIE to enable
Levenshtein-based lookups (recursive-based traversal)
Doc 1 (pos=1, 3)
Doc 2 (pos=3)
Doc 1 (pos=2)
Doc 3 (pos=1)
Index
Doc 1 (pos=4)
Doc 3 (pos=2)
b c
a o
r z o
f

15
Implementation: Query process
Example: faz
Doc 1 (pos=1, 3)
Doc 2 (pos=3)
Doc 1 (pos=2)
Doc 3 (pos=1)
Index
Doc 1 (pos=4)
Doc 3 (pos=2)
b c
a o
r z o
f
faz (distance=1)
faz (distance=0)
faz (distance=1)
faz (distance=1)
faz (distance=2) faz (distance=1)
faz (distance=2)
faz (distance=3)

•How are the matching documents ranked?
•Number of match occurrences? TF-IDF ?
•Numerical value re"ecting popularity?
•Number of typing mistakes?
•Proximity between matched words?
•…
16
Implementation: Query process

17
Several implementations

•What I didn’t speak about:
•Numerical/Geo queries (Including operators)
•Advanced query syntax (boolean operators, proximity
operators)
•Faceting & Aggregations (Categorization)
•Sharding (Horizontal scalability)
•Incremental indexing (Generational data structures)
•… (see u next time)
18
Missing subjects

Architecture of a search engine

About This Presentation

Slide Content

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

Architecture of a search engine

About This Presentation

Slide Content

Slide 1

Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

Pray For The Peace Of Jerusalem and You Will Prosper

Don_t_Waste_Your_Life_God.....powerpoint

VILLASUR_FACTORS_TO_CONSIDER_IN_PLATING_SALAD_10-13.pdf

Fertility awareness methods for women in the society

Chapter 5 Arithmetic Functions Computer Organisation and Architecture

syakira bhasa inggris (1) (1).pptx.......