Introduction to Apache Solr

skillupevent 98 views 18 slides Apr 20, 2018

About This Presentation

"Introduction to Apache Solr", presented by Hassan Nasr at the eighth SkillUp event.


Slide Content

Introduction to Apache Solr
Hassan Nasr Esfahani

Topics
–What we need from a text search engine
–What is Solr?
–Why Solr?
–Concepts And Architecture
–Usage
–Special Features
–Competitors

Text Retrieval vs Database Retrieval
–Information and Query
–Unstructured vs Structured
–Ambiguous vs Well defined
–Answers
–Relevant documents (ambiguous) vs matched documents

What we want from a text search engine
Basic Search Features:
–Store some documents with some fields
–Query for documents
Text Search Features:
–Find the most relevant docs
–Handle natural language complications (stop words, stemming, tokenization, …)
–Highlight text
–…

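Problems with Text Search
Problem: Sample
–Tokenization: کتابش، محمدصادق، می‌روم
–Different letter representation: ی and ي (Persian yeh vs Arabic yeh)
–Similar words: می‌روم، می‌روی، می‌رود ("I go", "you go", "he/she goes")
–Synonymous words: معلم and آموزگار (both "teacher")
–Word ambiguity: شیر ("lion", "milk", or "tap")
–Stop words: با، است، به، رفت، ...
–Spelling errors: گذارش (misspelling of گزارش, "report")
–Spoken language: نون (colloquial form of نان, "bread")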
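A minimal Python sketch of how an analysis step might handle three of the problems above (letter representation, tokenization, stop words). It is only an illustration, not Solr's analyzer chain; the character map and the four-word stop list are assumptions chosen to mirror the samples on this slide.

```python
import re

# Assumption: a tiny illustrative stop-word list taken from the slide's samples,
# not a complete Persian stop-word set.
STOP_WORDS = {
    "\u0628\u0627",        # "ba"   (with)
    "\u0627\u0633\u062A",  # "ast"  (is)
    "\u0628\u0647",        # "be"   (to)
    "\u0631\u0641\u062A",  # "raft" (went)
}

# Map Arabic code points to their Persian counterparts so that both
# spellings of the same word end up as the same indexed term.
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Persian keheh
})


def normalize(text: str) -> str:
    """Unify different code points that represent the same letter."""
    return text.translate(CHAR_MAP)


def tokenize(text: str) -> list[str]:
    """Split on whitespace and common punctuation (including the Persian comma)."""
    return [tok for tok in re.split(r"[\s\u060C,.]+", text) if tok]


def analyze(text: str) -> list[str]:
    """Normalize, tokenize, then drop stop words."""
    return [tok for tok in tokenize(normalize(text)) if tok not in STOP_WORDS]
```

Stemming, synonyms, and spelling correction need language resources and are left out of this sketch.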

What is Solr?
–An open-source search engine
–Written in Java
–Wrapping Apache Lucene
–With REST API
–Fault tolerant
–Scalable
–Distributable

Solr Simple Architecture
[Diagram: Query and Documents each pass through an Analyze queue (steps 1, 2, 3 on one path and 1, 2, 3' on the other), configured by schema.xml, on top of Apache Lucene]

Solr Features
–Advanced Search Method
–Language knowledge
–Scoring/Boosting
–Grouping
–Highlighting
–Nested Documents
–Realtime index update

How It Works
–A Solr server contains one or more cores (a core is similar to a database in a DBMS)
–Each core is specified by schema.xml + …
–Fields
–Data Types
–Analyzers

Field List
Field Attributes:
Type
Indexed
Stored
Multivalued
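
As an illustration of these attributes, here is a small Python sketch that builds an "add-field" command in the JSON shape accepted by Solr's Schema API. It assumes a core that uses a managed schema; with a hand-edited schema.xml the same attributes appear on a field element instead. The field name "tags" and the helper function are hypothetical.

```python
import json


def add_field_command(name, field_type, indexed=True, stored=True, multi_valued=False):
    """Build a Schema API 'add-field' command as a plain dict.

    indexed:      the field can be searched
    stored:       the original value can be returned in results
    multi_valued: a document may hold several values for this field
    """
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": indexed,
            "stored": stored,
            "multiValued": multi_valued,
        }
    }


# Hypothetical field: a list of tags per document, searchable and retrievable.
payload = add_field_command("tags", "string", multi_valued=True)
print(json.dumps(payload, indent=2))
```

The payload would be sent as the body of a POST to the core's schema endpoint; sending it is omitted here so the sketch runs without a Solr server.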

Data Types
–Int, float, long, double
–Date
–String
–Text (configurable)
–Location

Communicating with Solr
–REST API
–Client Libraries
–Java
–Ruby
–PHP
–C#
–Python
–…
–Data Import Handlers
–Direct SQL query
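
A hedged sketch of the REST route using only the Python standard library: it builds the URL for a core's select handler. The host, port, and the core name "articles" are assumptions (8983 is Solr's usual default port); the request itself is not sent, so the sketch runs without a server.

```python
from urllib.parse import urlencode


def select_url(base_url, core, query, rows=10):
    """Build a search URL for the /select handler of one core."""
    params = {
        "q": query,    # the query string, e.g. field:value
        "rows": rows,  # maximum number of documents to return
        "wt": "json",  # response format
    }
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Assumed local server and hypothetical core name.
url = select_url("http://localhost:8983", "articles", "title:solr")
print(url)
```

Fetching the URL with any HTTP client returns the matching documents; the client libraries listed above wrap the same HTTP calls.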

Query Format
–Different Query Parsers:
–Standard (Lucene)
–DisMax
–eDisMax
–Block Join Query Parser
–…
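
The parser is chosen per request with the defType parameter. The sketch below builds request parameters for eDisMax, where free-form user text is searched across several weighted fields through qf. The field names "title" and "body" and their weights are assumptions for illustration.

```python
def edismax_params(user_query, field_boosts):
    """Build request parameters that select the eDisMax parser.

    field_boosts maps field name -> weight and becomes the qf parameter,
    e.g. {"title": 2, "body": 1} -> "title^2 body^1".
    """
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return {"defType": "edismax", "q": user_query, "qf": qf}


# Hypothetical fields: matches in the title count twice as much as in the body.
params = edismax_params("apache solr", {"title": 2, "body": 1})
print(params)
```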

Standard Query Format
–field:value
–Phrase search: field:"word list"
–Wildcard search: wor?d, word*
–Fuzzy searches: roam~ matches similar terms such as foam or foams (maximum edit distance 2)
–Proximity searches (words within a maximum distance): "jakarta apache"~10
–Range searches: [52 TO 10000] or {Aida TO Carmen}
–Boosting: jakarta^4 apache
–Boolean operators: AND (&&), OR (||), NOT (!), +, -, ( )
–Filter Query
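
The syntax above can be sketched as a few Python string helpers. The helper names are hypothetical and the sketch does not escape special characters in values, which a real application would have to do.

```python
def term(field, value):
    """field:value"""
    return f"{field}:{value}"


def phrase(field, words):
    """Exact phrase: field:"word list" """
    return f'{field}:"{words}"'


def fuzzy(word, max_edits=2):
    """Fuzzy match within a maximum edit distance, e.g. roam~2"""
    return f"{word}~{max_edits}"


def proximity(words, distance):
    """Words within a maximum distance of each other, e.g. "jakarta apache"~10"""
    return f'"{words}"~{distance}'


def range_query(field, low, high, inclusive=True):
    """[low TO high] includes the bounds, {low TO high} excludes them."""
    left, right = ("[", "]") if inclusive else ("{", "}")
    return f"{field}:{left}{low} TO {high}{right}"


def boost(clause, factor):
    """Raise the weight of one clause, e.g. jakarta^4"""
    return f"{clause}^{factor}"


# Combine clauses with a Boolean operator (hypothetical field names).
query = " AND ".join([term("title", "solr"), range_query("price", 52, 10000)])
print(query)
```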

Relevancy
–∝ Term Frequency
–∝ Inverse Document Frequency
–Query Expansion
–…
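
A toy Python sketch of the two proportionalities above: a document scores higher when a query term is frequent in it (term frequency) and when the term is rare across the collection (inverse document frequency). This is a simplified textbook TF-IDF under assumed whitespace tokenization, not the scoring formula Lucene actually uses.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, documents):
    """Return one relevance score per document for the given query terms."""
    n_docs = len(documents)
    tokenized = [doc.lower().split() for doc in documents]

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)
            continue
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if doc_freq[term] == 0:
                continue  # term occurs nowhere in the collection
            tf = counts[term] / len(tokens)                # frequent in this doc
            idf = math.log(n_docs / doc_freq[term]) + 1.0  # rare in the collection
            score += tf * idf
        scores.append(score)
    return scores


docs = [
    "solr is a search engine",
    "lucene is a library",
    "solr wraps lucene",
]
scores = tf_idf_scores(["solr"], docs)
```

Here the third document outranks the first because "solr" makes up a larger share of its words, and the second scores zero because it does not contain the term.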

Weaknesses
–No transactions
–No join queries
–Best used as a secondary database
–No partial record modification

Alternative
–Elasticsearch, also based on Lucene
–Geared mostly toward analytics use
–More popular
–Easier to start with
–Less documented