Apache Solr

By minhkiller · 46 slides · Jul 13, 2011

About This Presentation

This presentation introduces what Apache Solr can do and how to apply it to your project.


Slide Content

Enterprise Search with Solr, by Minh Tran

Why does search matter? Then: most of the data encountered was created for the web, and heavy use of a site's search function was considered a failure in navigation. Now: navigation is not always relevant, users have less patience to browse, and users are accustomed to "navigation by search box".

What is Solr? An open source enterprise search platform based on the Apache Lucene project. REST-like HTTP/XML and JSON APIs. Powerful full-text search, hit highlighting, and faceted search. Database integration and rich document (e.g., Word, PDF) handling. Dynamic clustering, distributed search, and index replication. Loose schema to define types and fields. Written in Java 5, deployable as a WAR.

Public Websites using Solr. A mature product powering search for public sites like Digg, CNet, Zappos, and Netflix. See http://wiki.apache.org/solr/PublicServers for more information.

Architecture (diagram). A Solr core wraps Lucene and is reached through an HTTP request servlet and an XML update servlet. Components shown: admin interface; standard, disjunction-max, and custom request handlers; update handler; caching; XML update interface; config, schema, and analysis; concurrency; replication; and the XML response writer.

Starting Solr. We need to set these settings for Solr: solr.solr.home (the Solr home folder, which contains conf/solrconfig.xml) and solr.data.dir (the folder that contains the index folder). Alternatively, configure a JNDI lookup of java:comp/env/solr/home to point to the solr directory. E.g. with Jetty: java -Dsolr.solr.home=./solr -Dsolr.data.dir=./solr/data -jar start.jar. For other web servers, set these values as Java system properties.

Web Admin Interface (screenshot).

Web admin menu (name: description):
Statistics: information about when the index was loaded and how many documents are in it; usage information on the SolrRequestHandlers used to service queries; data covering the indexing process (additions, deletions, commits, etc.); cache implementation and hit/miss/eviction information.
Info: details the version of the running Solr and the classes used in the current implementation for queries, updates, and caching.
Distribution: displays information about index distribution and replication.
Ping: issues a ping request to the server, consisting of the query specified in the admin section of the solrconfig.xml file.
Logging: allows you to change the logging level of the current application.
Java properties: displays all of the Java system properties in use by the current system.
Thread dump: displays stack trace information for all the threads running in the JVM.

How Solr Sees the World. An index is built of one or more documents. A document consists of one or more fields. A field consists of a name, content, and metadata telling Solr how to handle the content. You can tell Solr what kind of data a field contains by specifying its field type.

Field Analysis. Field analyzers are used both during ingestion, when a document is indexed, and at query time. An analyzer examines the text of fields and generates a token stream. Analyzers may be a single class, or they may be composed of a series of tokenizer and filter classes. Tokenizers break field data into lexical units, or tokens; filters then transform the tokens, for example setting all letters to lowercase, eliminating punctuation and accents, mapping words to their stems, and so on. As a result, "ram", "Ram" and "RAM" would all match a query for "ram".
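The tokenizer-plus-filters idea can be sketched outside Solr; a minimal Python illustration (not Solr code) of an analysis chain that makes "ram", "Ram" and "RAM" index to the same token:

```python
# Illustrative sketch of an analyzer: a tokenizer produces tokens,
# then filter functions transform them in order.
def whitespace_tokenizer(text):
    return text.split()

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def analyze(text, tokenizer, filters):
    tokens = tokenizer(text)
    for f in filters:
        tokens = f(tokens)
    return tokens

chain = [lowercase_filter]
# "ram", "Ram" and "RAM" all normalize to the same indexed token:
assert analyze("ram", whitespace_tokenizer, chain) == ["ram"]
assert analyze("RAM", whitespace_tokenizer, chain) == ["ram"]
```

The same chain runs at index time and at query time, which is why the query "ram" matches all three spellings.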

Schema.xml. The schema.xml file is located in ../solr/conf. The schema file starts with a <schema> tag. Solr supports one schema per deployment. The schema can be organized into three sections: types, fields, and other declarations.

Example for TextField type (screenshot).

Filter explanation. StopFilterFactory: after tokenizing on whitespace, removes any common (stop) words. WordDelimiterFilterFactory: handles special cases with dashes, case transitions, etc. LowerCaseFilterFactory: lowercases all terms. EnglishPorterFilterFactory: stems using the Porter stemming algorithm, e.g. "runs", "running", "ran" reduce to their elemental root "run". RemoveDuplicatesTokenFilterFactory: removes any duplicates.

Field Attributes. Indexed: indexed fields are searchable and sortable. You can also run Solr's analysis process on indexed fields, which can alter the content to improve or change results. Stored: the contents of a stored field are saved in the index. This is useful for retrieving and highlighting the contents for display, but is not necessary for the actual search. For example, many applications store pointers to the location of contents rather than the actual contents of a file.

Field Definitions. Field attributes: name, type, indexed, stored, multiValued, omitNorms. Dynamic fields, in the spirit of Lucene!
<dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>
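The pattern-to-type mapping behind dynamic fields can be sketched as glob matching; a hedged Python illustration (the matching logic is the idea, not Solr's implementation):

```python
import fnmatch

# Patterns mirror the slide's dynamicField examples: any field name
# ending in _i is an sint, _s a string, _t a text field.
DYNAMIC_FIELDS = [("*_i", "sint"), ("*_s", "string"), ("*_t", "text")]

def resolve_type(field_name):
    """Return the field type for a name matching a dynamic pattern."""
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatch.fnmatch(field_name, pattern):
            return field_type
    return None  # no dynamic pattern matched

assert resolve_type("price_i") == "sint"
assert resolve_type("title_t") == "text"
```

This is why a document can carry fields like price_i or title_t without each one being declared explicitly in schema.xml.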

Other declarations. <uniqueKey>url</uniqueKey>: the url field is the unique identifier, used to determine whether a document is being added or updated. defaultSearchField: the field Solr uses in queries when no field is prefixed to a query term. E.g. q=title:Solr searches the title field; if you entered q=Solr instead, the default search field would apply.
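The defaultSearchField behaviour amounts to a small dispatch rule; a Python sketch (the default field name "title" is an assumption for illustration):

```python
# Sketch: a term with a "field:" prefix searches that field; a bare
# term falls back to the configured defaultSearchField.
DEFAULT_SEARCH_FIELD = "title"  # assumed value, not from the slide

def resolve_query(q):
    field, sep, term = q.partition(":")
    if sep:                      # explicit field prefix, e.g. title:Solr
        return field, term
    return DEFAULT_SEARCH_FIELD, q  # bare term, e.g. Solr

assert resolve_query("title:Solr") == ("title", "Solr")
assert resolve_query("Solr") == ("title", "Solr")
```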

Indexing data. Use curl to interact with Solr: http://curl.haxx.se/download.html. Solr accepts several data formats: Solr's native XML; CSV (character separated values); rich documents through Solr Cell; JSON; and direct database and XML import through Solr's DataImportHandler.

Add / Update documents. HTTP POST to add / update:
<add>
<doc boost="2">
<field name="article">05991</field>
<field name="title">Apache Solr</field>
<field name="subject">An intro...</field>
<field name="category">search</field>
<field name="category">lucene</field>
<field name="body">Solr is a full...</field>
</doc>
</add>
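An add payload like the one above can also be built programmatically; a minimal Python sketch using the standard library (field names taken from the slide's example):

```python
import xml.etree.ElementTree as ET

def build_add_xml(fields, boost=None):
    """Build Solr's native <add><doc>...</doc></add> update payload."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    if boost is not None:
        doc.set("boost", str(boost))
    for name, value in fields:          # (field name, field value) pairs
        f = ET.SubElement(doc, "field", name=name)
        f.text = value
    return ET.tostring(add, encoding="unicode")

xml = build_add_xml([("article", "05991"), ("title", "Apache Solr")], boost=2)
assert '<field name="title">Apache Solr</field>' in xml
```

The resulting string is what gets POSTed to /update.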

Delete documents. Delete by id: <delete><id>05591</id></delete>. Delete by query (multiple documents): <delete><query>manufacturer:microsoft</query></delete>

Commit / Optimize. <commit/> tells Solr that all changes made since the last commit should be made available for searching. <optimize/> does the same as commit, and additionally merges all index segments, restructuring Lucene's files to improve performance for searching. Optimization is generally good to do when indexing has completed. If there are frequent updates, you should schedule optimization for low-usage times. An index does not need to be optimized to work properly, and optimization can be a time-consuming process.

Index XML documents. Use the command line tool for POSTing raw XML to Solr. Options: -Ddata=[files|args|stdin] (default: files), -Durl=http://localhost:8983/solr/update (default), -Dcommit=yes (default). Examples:
java -jar post.jar *.xml
java -Ddata=args -jar post.jar "<delete><id>42</id></delete>"
java -Ddata=stdin -jar post.jar
java -Dcommit=no -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"

Index a file using HTTP POST. The curl command does this with --data-binary and an appropriate Content-type header, here reflecting that the data is XML. Example, using HTTP POST to send the data over the network to the Solr server:
curl http://localhost:9090/solr/update -H "Content-type:text/xml;charset=utf-8" --data-binary @ipod_other.xml

Index CSV using remote streaming. Letting Solr stream a local file directly from disk can be more efficient than uploading it over the network via HTTP. Remote streaming must be enabled for this method to work: set enableRemoteStreaming="true" in solrconfig.xml:
<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048"/>
Examples:
java -Ddata=args -Durl=http://localhost:9090/solr/update -jar post.jar "<commit/>"
curl http://localhost:9090/solr/update/csv -F "stream.file=d:/Study/Solr/apache-solr-1.4.1/example/exampledocs/books.csv" -F "commit=true" -F "optimize=true" -F "stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:9090/solr/update/csv?overwrite=false&stream.file=d:/Study/Solr/apache-solr-1.4.1/example/exampledocs/books.csv&commit=true&optimize=true"
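Conceptually, the CSV handler turns each row into a document whose fields are named by the header line; a Python sketch of that mapping (the sample row is invented for illustration):

```python
import csv
import io

# Made-up sample in the shape of exampledocs/books.csv: a header line,
# then one document per row.
sample = "id,name,price\n0553573403,A Game of Thrones,7.99\n"

def csv_to_docs(text):
    """Each CSV row becomes a dict keyed by the header fields."""
    return list(csv.DictReader(io.StringIO(text)))

docs = csv_to_docs(sample)
assert docs[0]["name"] == "A Game of Thrones"
```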

Index rich documents with Solr Cell. Solr uses Apache Tika, a framework wrapping many different format parsers such as PDFBox, POI, and others. Examples:
curl "http://localhost:9090/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@tutorial.html"
curl "http://localhost:9090/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@tutorial.html" (index HTML)
Capture <div> tags separately, and then map that field to a dynamic field named foo_t:
curl "http://localhost:9090/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div" -F "tutorial=@tutorial.pdf" (index PDF)

Updating a Solr Index with JSON. The JSON request handler needs to be configured in solrconfig.xml:
<requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler"/>
Example:
curl "http://localhost:8983/solr/update/json?commit=true" --data-binary @books.json -H "Content-type:application/json"
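A books.json-style payload is just the JSON equivalent of the XML add document; a small Python sketch building one (the field names are illustrative, not from a real books.json):

```python
import json

# Illustrative documents; multi-valued fields become JSON arrays,
# mirroring the repeated <field name="category"> in the XML example.
docs = [
    {"id": "05991", "title": "Apache Solr", "category": ["search", "lucene"]},
]

payload = json.dumps(docs)               # body for POST to /update/json
assert json.loads(payload)[0]["title"] == "Apache Solr"
```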

Searching. Spellcheck, editorial results replacement, and scaling index size with distributed search.

Default Query Syntax. Lucene query syntax [; sort specification]:
mission impossible; releaseDate desc
+mission +impossible -actor:cruise
"mission impossible" -actor:cruise
title:spiderman^10 description:spiderman
description:"spiderman movie"~10
+HDTV +weight:[0 TO 100]
Wildcard queries: te?t, te*t, test*

Default Parameters. Query arguments for HTTP GET/POST to /select (param, default, description):
q: the query
start: 0; offset into the list of matches
rows: 10; number of documents to return
fl: *; stored fields to return
qt: standard; query type, maps to a query handler
df: (from schema); default field to search
hl: false; highlight terms matched in doc
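Assembling a /select request from these parameters is plain URL building; a Python sketch using the standard library (defaults mirror the table):

```python
from urllib.parse import urlencode

def build_select_url(base, q, start=0, rows=10, fl="*"):
    """Compose a Solr /select URL; defaults follow the parameter table."""
    params = {"q": q, "start": start, "rows": rows, "fl": fl}
    return base + "/select?" + urlencode(params)

url = build_select_url("http://localhost:8983/solr", "video",
                       rows=2, fl="name,price")
assert "q=video" in url and "rows=2" in url
```

Note that urlencode percent-escapes the comma in fl=name,price; Solr accepts either form.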

Search Results. http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
<response>
<responseHeader><status>0</status><QTime>1</QTime></responseHeader>
<result numFound="16173" start="0">
<doc>
<str name="name">Apple 60 GB iPod with Video</str>
<float name="price">399.0</float>
</doc>
<doc>
<str name="name">ASUS Extreme N7800GTX/2DHTV</str>
<float name="price">479.95</float>
</doc>
</result>
</response>
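A client can pull the name/price pairs out of such a response with standard-library XML parsing; a sketch with the slide's response body embedded as a string:

```python
import xml.etree.ElementTree as ET

# Response body reproduced from the slide's example.
response = """<response><responseHeader><status>0</status><QTime>1</QTime></responseHeader>
<result numFound="16173" start="0">
<doc><str name="name">Apple 60 GB iPod with Video</str><float name="price">399.0</float></doc>
<doc><str name="name">ASUS Extreme N7800GTX/2DHTV</str><float name="price">479.95</float></doc>
</result></response>"""

def extract(doc_xml):
    """Return (name, price) for each <doc> in a Solr XML response."""
    root = ET.fromstring(doc_xml)
    out = []
    for doc in root.iter("doc"):
        name = doc.find("str[@name='name']").text
        price = float(doc.find("float[@name='price']").text)
        out.append((name, price))
    return out

assert extract(response)[0] == ("Apple 60 GB iPod with Video", 399.0)
```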

Query response writers. Query responses are written using the registered writer whose name matches the 'wt' request parameter. The default writer is used if 'wt' is not specified in the request. E.g.: http://localhost:8983/solr/select?q=title:monsters&wt=json&indent=true

Caching. An IndexSearcher's view of an index is fixed, so aggressive caching is possible, with consistency across multi-query requests. filterCache: unordered set of document ids matching a query. resultCache: ordered subset of document ids matching a query. documentCache: the stored fields of documents. userCaches: application specific, for custom query handlers.
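The filterCache idea reduces to memoizing a document-id set per filter query; a Python sketch (the search function is a stand-in, not a Solr API):

```python
# Sketch: a filterCache stores the unordered DocSet for each filter
# query, so repeated filters never touch the index again.
filter_cache = {}

def doc_set(fq, search_fn):
    """Return the cached DocSet for fq, computing it once via search_fn."""
    if fq not in filter_cache:
        filter_cache[fq] = frozenset(search_fn(fq))
    return filter_cache[fq]

hits = doc_set("manu:Dell", lambda fq: [1, 5, 9])
# Second lookup is served from the cache; the search function is ignored:
assert doc_set("manu:Dell", lambda fq: []) == frozenset({1, 5, 9})
```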

Configuring Relevancy.
<fieldtype name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
</analyzer>
</fieldtype>

Faceted Browsing Example (screenshot).

Faceted Browsing (diagram). A search(query, filters[], sort, offset, n) call produces a DocList, an ordered section of results, for a base query such as computer_type:PC with filter memory:[1GB TO *], sorted by price asc, under filters like proc_manu:Intel or proc_manu:AMD. Alongside it, the unordered DocSet of all results is intersected with a DocSet per facet value (price:[0 TO 500], price:[500 TO 1000], manu:Dell, manu:HP, manu:Lenovo), and the intersection sizes (e.g. 594, 382, 247, 689, 104, 92, 75) become the facet counts in the query response.
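The diagram's intersection step can be sketched directly with Python sets (the document ids and facet values here are invented for illustration):

```python
# Sketch: facet counts are |base DocSet ∩ facet DocSet| per facet value.
base_docset = {1, 2, 3, 4, 5, 6}          # docs matching the base query
facet_docsets = {                          # docs matching each facet value
    "manu:Dell": {1, 2, 7},
    "manu:HP": {3, 8},
    "manu:Lenovo": {9},
}

facet_counts = {fq: len(base_docset & ds) for fq, ds in facet_docsets.items()}
assert facet_counts == {"manu:Dell": 2, "manu:HP": 1, "manu:Lenovo": 0}
```

Because DocSets are unordered, these intersections are cheap and cacheable, which is what makes faceting over many values practical.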

Index optimization (10,000 documents; MergeFactor, MinMergeDocs, MaxMergeDocs, time consumed):
10, 10, Integer.MAX_VALUE: 423 seconds
100, 10, Integer.MAX_VALUE: 270 seconds
100, 100, Integer.MAX_VALUE: 213 seconds
100, 100, 100: 220 seconds
1000, 1000, Integer.MAX_VALUE: 194 seconds

High Availability (diagram). App servers generate dynamic HTML and send HTTP search requests through a load balancer to a pool of Solr searchers. A DB updater sends updates to a single Solr master, which replicates the index to the searchers; admin access is via a terminal.

Distributed and replicated Solr architecture (diagram).

Index by using SolrJ (code screenshot).

Query with SolrJ (code screenshot).

Distributed and replicated Solr architecture (cont.). At this time, applications must still handle the process of sending documents to the individual shards for indexing. The size of index a machine can hold depends on the machine's configuration (RAM, CPU, disk, and so on), the query and indexing volume, document size, and the search patterns. Typically the number of documents a single machine can hold ranges from several million up to around 100 million.
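The application-side routing this slide describes can be sketched as a deterministic document-to-shard assignment; hashing the unique key is one common choice (an assumption here, the slide does not prescribe a scheme):

```python
import zlib

SHARDS = ["shard1", "shard2", "shard3"]  # illustrative shard names

def shard_for(doc_id):
    """Route a document to a shard by hashing its unique key."""
    return SHARDS[zlib.crc32(doc_id.encode()) % len(SHARDS)]

# The same id always routes to the same shard, so updates and deletes
# reach the copy that holds the document:
assert shard_for("05991") == shard_for("05991")
```

At query time the same shard list is passed to Solr's distributed search (the shards parameter), which merges per-shard results.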

Advanced Functionality. Structured data store: import data with the Data Import Handler (JDBC, HTTP, File, URL). Support for other programming languages (.NET, PHP, Ruby, Perl, Python, ...). Support for NoSQL databases like MongoDB and Cassandra?

Other open source search servers: Sphinx, ElasticSearch.

Resources
http://wiki.apache.org/solr/UpdateCSV
http://wiki.apache.org/solr/ExtractingRequestHandler
http://lucene.apache.org/tika/
http://wiki.apache.org/solr/
Solr 1.4 Enterprise Search Server.

Resources (cont.)
http://www.ibm.com/developerworks/java/library/j-solr2/
http://www.ibm.com/developerworks/java/library/j-solr1/
http://en.wikipedia.org/wiki/Solr
Apache Conf Europe 2006, Yonik Seeley
LucidWorks Solr Reference Guide
