A.M.T COLLEGE DEPARTMENT OF INFORMATION TECHNOLOGY Information Storage and Retrieval
CHAPTER TWO TEXT/DOCUMENT OPERATION AND AUTOMATIC INDEXING
The main contents of this chapter are the following: index term selection (Zipf’s law and Luhn’s selection); document pre-processing (lexical analysis, stop word elimination, stemming); and term extraction (term weighting and similarity measures).
2.1 Index term selection
An index language is the language used to describe documents and requests. The elements of an index language are index terms.
Some words are not good for representing documents. Using all words has a computational cost: it increases search time and storage requirements. Using the set of all words in a collection to index documents also generates too much noise for the retrieval task. Therefore, term selection is very important.
The main objectives of term selection are: to represent textual documents by a set of keywords called index terms, or simply terms; and to increase efficiency by extracting from the resulting document a selected set of terms to be used for indexing the document.
If full-text representation is adopted, then all words are used for indexing. An index term, also called a keyword, is either a single word or a multiword phrase.
Indexing
Indexing is: the art of organizing information; an association of descriptors (keywords, concepts) to documents in view of ...; the act of assigning index terms to a document; the process of storing data in a particular way in order to locate and retrieve the data; and a way of identifying important information and representing it in a useful way.
Why indexing? Some representation of the content is needed, because the full document cannot be used for search.
Indexing is used to: find documents by topic; define topic areas and relate documents to each other; predict relevance between documents and an information need; and allow easy identification of documents.
There are two ways of indexing.
1. Manual indexing
Human indexers decide which keywords (index terms) to assign to documents, based on a controlled vocabulary.
The indexers analyse and represent the content of a document through keywords, based on their intellectual judgment and semantic interpretation of its concepts and themes.
The following are important in manual indexing: the terms that will be used by the user; the indexing vocabulary; and the collection characteristics.
Indexers are normally provided with guidelines (input sheets, manuals and a printed thesaurus) to determine the contents of a given document. Manual indexing is usually done in a library environment.
Advantages of manual indexing: the ability to perform abstraction (conclude what the subject is) and determine additional related terms; and the ability to judge the value of concepts.
Disadvantages of manual indexing: it is slow and expensive, since professional indexers are costly; there is a high probability of inconsistency among indexers, because maintaining consistency is difficult; and it is labor intensive.
2. Automatic indexing
Automatic indexing is the assignment of content identifiers with the help of modern computing technology.
A computer system is used to record the descriptors generated by the human, and the system itself extracts “typical” or “significant” terms. The original texts of the information items are used as the basis of indexing.
Automatic indexing is necessary for the following reasons: information overload (an enormous amount of information is generated by day-to-day activity); the explosion of machine-readable text (massive amounts of information are available in electronic format and on the Internet); and cost effectiveness (human indexing is expensive and labor intensive).
Procedures for automatic indexing
Generating document representatives through automatic indexing involves: lexical analysis; use of a stop list; noun identification (optional); phrase formation (optional); use of conflation procedures (stemming, optional); selection of index terms; and weighting the resulting terms (optional).
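The procedure above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the stop list, the crude suffix stripper, the `min_count` parameter and the use of raw term frequency as the weight are all hypothetical choices made for the example, not part of the procedure as given in the chapter. The optional noun identification and phrase formation steps are skipped.

```python
import re
from collections import Counter

# Hypothetical stop list for illustration; real stop lists are much longer.
STOP_WORDS = {"the", "of", "to", "and", "in", "is", "for", "that", "a", "an"}

def strip_suffix(word):
    """Very crude conflation: remove one of a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_document(text, min_count=1):
    """Return {index term: weight} for a single document."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())       # lexical analysis
    tokens = [t for t in tokens if t not in STOP_WORDS]   # use of a stop list
    stems = [strip_suffix(t) for t in tokens]             # conflation
    counts = Counter(stems)
    # Selection of index terms (keep terms seen at least min_count times),
    # weighted here by raw term frequency (an assumption of this sketch).
    return {term: n for term, n in counts.items() if n >= min_count}

terms = index_document("The system is connecting users to connected documents.")
print(terms)
```

In this toy run "connecting" and "connected" are conflated to the same term, so that term receives a weight of 2.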
Advantages of automatic indexing: reduced processing time (fast); reduced cost (inexpensive); easy maintenance; improved consistency; and better retrieval.
Disadvantage of automatic indexing: the algorithm is executed mechanically, with no intelligent interpretation of aboutness or relevance.
2.1.1 Zipf’s law in IR and Luhn’s selection
2.1.1.1 Zipf’s law
Zipf’s law states that, given a corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.
The rank-frequency distribution is an inverse relation. The two most frequent words (e.g. “the”, “of”) can account for about 10% of the word occurrences in documents. For example, the word “the” is the most frequently occurring word.
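The inverse relation can be written compactly as follows. The symbols are assumptions introduced here for illustration: r is the rank of a word in the frequency table, f(r) is its frequency, and C is a constant that depends on the corpus.

```latex
f(r) \approx \frac{C}{r}
\qquad\Longleftrightarrow\qquad
r \cdot f(r) \approx C
```

Read this way, the word at rank 2 is expected to occur roughly half as often as the word at rank 1, the word at rank 3 roughly a third as often, and so on. Real collections follow this only approximately.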
Zipf’s law example
The table shows the most frequently occurring words in a collection of 336,310 documents containing 125,720,891 total words, of which 508,209 are unique.

Frequent word    Number of occurrences
the              7,398,934
of               3,893,790
to               3,364,653
and              3,320,687
in               2,311,785
is               1,559,147
for              1,313,561
The              1,144,860
that             1,066,503
Said             1,027,713
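As a quick arithmetic check of the claim that a handful of words accounts for a large share of a collection, the counts in the table above can be summed and divided by the total number of words. The variable names are illustrative only; the numbers are copied from the table.

```python
# Counts copied from the table of most frequent words.
total_words = 125_720_891
counts = {
    "the": 7_398_934, "of": 3_893_790, "to": 3_364_653, "and": 3_320_687,
    "in": 2_311_785, "is": 1_559_147, "for": 1_313_561, "The": 1_144_860,
    "that": 1_066_503, "Said": 1_027_713,
}

ranked = sorted(counts.values(), reverse=True)

# Share of all word occurrences covered by the two most frequent words.
top_two_share = sum(ranked[:2]) / total_words
# Share covered by all ten words in the table.
top_ten_share = sum(ranked) / total_words

print(f"top 2 words:  {top_two_share:.1%}")   # about 9%
print(f"top 10 words: {top_ten_share:.1%}")   # about 21%
```

For this collection the two most frequent words cover roughly 9% of all word occurrences, which is close to the 10% rule of thumb, and the ten words in the table cover roughly a fifth.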
2.1.1.2 Luhn’s analysis
Luhn’s idea (1958): the frequency of word occurrence in a text provides a useful measurement of word significance.
He suggested that both extremely common and extremely uncommon words were not very useful for document representation and indexing.
Therefore, the most important words for indexing are those that occur with intermediate frequencies. Thus, according to Luhn, medium-frequency terms are better candidates for indexing.
He proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance.
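Luhn's selection can be illustrated with two frequency cutoffs: terms above the upper cutoff are treated as too common, and terms below the lower cutoff as too rare. The sketch below is only an illustration; the function name `luhn_select` and the cutoff values are assumptions chosen for the toy example, and in practice the cutoffs have to be tuned for each collection.

```python
from collections import Counter

def luhn_select(tokens, lower_cutoff, upper_cutoff):
    """Keep terms whose frequency lies between the two cutoffs (inclusive)."""
    counts = Counter(tokens)
    return {t for t, n in counts.items() if lower_cutoff <= n <= upper_cutoff}

tokens = ("the index the term the index retrieval the term index "
          "the zipf the").split()

# Hypothetical cutoffs: drop words seen more than 3 times or fewer than 2 times.
selected = luhn_select(tokens, lower_cutoff=2, upper_cutoff=3)
print(sorted(selected))
```

Here "the" (6 occurrences) is rejected as too common, "retrieval" and "zipf" (1 occurrence each) as too rare, and the medium-frequency terms "index" and "term" are kept.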
2.2 Document Pre-processing
Preprocessing is the process of controlling the size of the vocabulary, that is, the number of distinct words used as index terms.
Text operation is the process of transforming text into a logical representation. There are five main operations (transformations) for selecting index terms:
A. Lexical analysis of the text: generate a set of words from the text collection, with the objective of treating digits, hyphens, punctuation marks, and the case of letters. Examples: digits (1999); case (Republican vs. republican); hyphens (MS-DOS, B-49); punctuation (WWW.WSU.EDU.ET).
B. Elimination of stop words: filter out words that are not useful in the retrieval process.
C. Stemming of the remaining words: remove affixes (i.e. suffixes and prefixes) to allow the retrieval of documents containing syntactic variations of query terms (e.g. connect, connected, connecting).
D. Selection of index terms: determine which words, stems, or groups of words will be used as indexing elements.
E. Construction of term categorization structures, such as a thesaurus, to capture relationships that allow the expansion of the original query with related terms.
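Operation E can be illustrated with a very small query-expansion sketch. The thesaurus below is a hypothetical toy dictionary invented for the example, and `expand_query` is an illustrative helper, not a function from any library.

```python
# Hypothetical toy thesaurus: each term maps to a list of related terms.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "retrieval": ["search"],
}

def expand_query(query_terms, thesaurus):
    """Return the original terms followed by related terms, without duplicates."""
    expanded = list(query_terms)
    for term in query_terms:
        for related in thesaurus.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded

expanded = expand_query(["car", "retrieval"], THESAURUS)
print(expanded)
```

The original query terms stay first, so a system could still give them more weight than the terms added from the thesaurus.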
Text processing system
Tokenization is one of the steps used to convert the text of the documents into a sequence of words.
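A minimal tokenizer might look like the sketch below. The policy it encodes is an assumption made for illustration: digits are kept as tokens, hyphenated forms such as MS-DOS and B-49 are kept whole, and all other punctuation is treated as a separator. Other policies are equally possible.

```python
import re

# Assumed policy: a token is a run of letters or digits, optionally
# followed by more such runs joined by single internal hyphens.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

def tokenize(text):
    """Convert a text string into a list of word tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("In 1999, MS-DOS was state-of-the-art; B-49 was not.")
print(tokens)
```

Case is left untouched at this stage so that case folding can be applied, or skipped, as a separate step.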
Elimination of stop words
Stop words are extremely common words across document collections that have no discriminatory power, e.g. articles, pronouns, prepositions, and conjunctions (connectors).
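Stop word elimination is usually a simple set lookup. The stop list below is a small hypothetical sample covering the word classes just mentioned; real stop lists typically contain a few hundred entries.

```python
# Hypothetical sample stop list: articles, pronouns, prepositions, conjunctions.
STOP_WORDS = frozenset({
    "a", "an", "the",              # articles
    "he", "she", "it", "its",      # pronouns
    "in", "on", "of", "to",        # prepositions
    "and", "or", "but",            # conjunctions
})

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token whose lower-cased form is in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

kept = remove_stop_words(
    ["The", "index", "of", "a", "document", "and", "its", "terms"]
)
print(kept)
```

Using a set (here a `frozenset`) keeps each lookup fast even when the stop list is long.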
Normalization
Normalization is, in a way, the standardization of text, e.g. U.S.A vs. USA.
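One possible normalization rule for the example above is to delete periods inside a token so that both spellings map to the same form. This is only a sketch of a single rule, and an assumption of this example: applied blindly it would also alter tokens such as host names or decimal numbers, so a real system would need additional rules.

```python
def normalize(token):
    """Remove periods so that abbreviations written with and without dots match."""
    return token.replace(".", "")

same_term = normalize("U.S.A") == normalize("USA")
print(normalize("U.S.A"), same_term)
```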
Case folding
It is often best to lower-case everything, e.g. Fasil vs. fasil vs. FASIL.
Stemming
The process involves the removal of affixes, e.g. boy-boys, cut-cutting, creation-create.
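Case folding and a deliberately naive suffix stripper are sketched below. The stemmer is an assumption made for illustration, not an established algorithm such as Porter's: it knows only three suffixes and one rule for undoing consonant doubling, so it handles boys and cutting but not pairs like creation and create, which need further rules.

```python
def case_fold(token):
    """Lower-case a token so that Fasil, fasil and FASIL become the same term."""
    return token.lower()

def naive_stem(word):
    """Strip one of a few common suffixes; illustrative only."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "cutt" -> "cut".
            if (suffix != "s" and stem[-1] == stem[-2]
                    and stem[-1] not in "aeioulsz"):
                stem = stem[:-1]
            return stem
    return word

folded = {case_fold(t) for t in ["Fasil", "fasil", "FASIL"]}
stems = [naive_stem(w) for w in ["boys", "cutting", "connected", "connecting"]]
print(folded, stems)
```

After folding, the three spellings collapse into a single term, and the variants of "connect" all reduce to the same stem, which is what allows a query term to match its syntactic variations.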