NAME : DEEPALI RAIKAR ROLL NO : 11150157 MSC.IT(PART – I )
SIGNATURE FILES
Typically, a “SIGNATURE FILE” is just a “BAG OF WORDS”. Signature files are a technique applied to “Document Retrieval”. The main idea behind signature files is to create a quick link to the documents that match the queries passed by the user. This is done by creating a signature for each document.
A signature is created as an “abstraction” of a document: a compressed representation of the document it stands for. All the signatures that represent the documents are kept in a file called the “SIGNATURE FILE”. The signatures are stored in the form of “HASH TABLES” to make retrieving the documents easy.
Characteristics of signature files
- Word-oriented index structure
- Low overhead
- Suitable for text collections that are not very large
- Suitable for conventional databases
- For most applications, inverted files outperform signature files
There are various types of signatures, namely:
- Word signatures: a word signature is a fixed-length bit-string representation of a word
- Document signatures
- Query signatures
How word signatures are generated
Using “TRIPLETS” of the word:
1. Each word is divided into overlapping triplets of characters.
2. Each triplet is given a numeric value.
3. The number is used as the input to the hash function.
4. The hash function produces a number that represents the bit position of the triplet in the word signature.
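The steps above can be sketched in Python as follows. The slides do not say how a triplet is turned into a number or which hash function is used, so `triplet_value` (character codes read as a base-256 number) and `bit_position` (a plain modulo hash) are assumptions made only for illustration; the `*` padding character follows the triplets shown for the word “SIGNATURE”.

```python
def triplets(word, pad="*"):
    """Split a word into overlapping character triplets, padded at both ends."""
    padded = pad + word.upper() + pad
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def triplet_value(triplet):
    """Assumed numeric value: the character codes read as a base-256 number."""
    value = 0
    for ch in triplet:
        value = value * 256 + ord(ch)
    return value


def bit_position(triplet, width=12):
    """Assumed hash function: map the numeric value to a position 1..width."""
    return triplet_value(triplet) % width + 1


def word_signature(word, width=12):
    """Set one bit per triplet; positions are counted from the left, starting at 1."""
    bits = ["0"] * width
    for t in triplets(word):
        bits[bit_position(t, width) - 1] = "1"
    return "".join(bits)


print(triplets("SIGNATURE"))
print(word_signature("SIGNATURE"))
```

Because the hash here is an assumption, the bit string printed for “SIGNATURE” will generally differ from the one on the slides; only the procedure is the same.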
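Example of a word signature
111000111001 is the signature created for the word “SIGNATURE”.

Triplet:  *SI  SIG  IGN  GNA  NAT  ATU  TUR  URE  RE*
Number:   12   3    7    3    2    9    1    12   8

Each number is the bit position that the hash function gives for the triplet, counted from the left starting at 1. Setting bits 1, 2, 3, 7, 8, 9 and 12 of a 12-bit string gives 111000111001, the final word signature.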
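As a quick check of this example, the small sketch below takes the nine numbers listed for the triplets of “SIGNATURE” as bit positions and sets those bits in a 12-bit string. The function name `signature_from_positions` is hypothetical; the numbers themselves come from the example.

```python
def signature_from_positions(positions, width):
    """Build a bit string with the given 1-based positions (from the left) set."""
    bits = ["0"] * width
    for p in positions:
        bits[p - 1] = "1"  # repeated positions simply set the same bit again
    return "".join(bits)


# Numbers listed for *SI SIG IGN GNA NAT ATU TUR URE RE*
positions = [12, 3, 7, 3, 2, 9, 1, 12, 8]
print(signature_from_positions(positions, 12))  # 111000111001
```

Two triplets hash to position 3 and two to position 12, which is why nine triplets set only seven bits.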
Document signature
Can be created using two methods:
- Concatenation of word signatures
- Superimposed coding
Characteristics of document signatures
- The length can vary
- A fixed number of bits may precede
- Fixing the length of the document signature is possible
- The length can be set to that of the longest document in the collection
- For shorter documents, extra “0” bits can be added
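A minimal sketch of the two methods, assuming word signatures are given as strings of "0" and "1". In concatenation the word signatures are placed one after another and, if a fixed length is wanted, padded with extra "0" bits; in superimposed coding the word signatures are combined with a bitwise OR, so the result has the same width as a single word signature. The function names and the optional `total_length` parameter are hypothetical.

```python
def concatenate(word_signatures, total_length=None):
    """Document signature by concatenation, optionally zero-padded to a fixed length."""
    signature = "".join(word_signatures)
    if total_length is not None:
        if len(signature) > total_length:
            raise ValueError("document is longer than the fixed signature length")
        signature = signature.ljust(total_length, "0")  # pad shorter documents with 0s
    return signature


def superimpose(word_signatures):
    """Document signature by superimposed coding (bitwise OR of equal-width signatures)."""
    width = len(word_signatures[0])
    combined = 0
    for s in word_signatures:
        if len(s) != width:
            raise ValueError("all word signatures must have the same width")
        combined |= int(s, 2)
    return format(combined, "0{}b".format(width))


print(concatenate(["1010", "0110"]))     # 10100110
print(concatenate(["1010", "0110"], 12)) # 101001100000
print(superimpose(["1010", "0110"]))     # 1110
```

With concatenation the signature grows with the number of words, which is why its length can vary; with superimposed coding the length stays fixed regardless of document size.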
Example of a signature file

Term   Hash string
cold   1000000000100100
days   0010010000001000
hot    0000101000000000
in     0000100100100000
it     0000100010000010
like   0100001000000001
nine   0010100000000100
old    1000100001000000
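The sketch below shows one way such a table could be used for retrieval, assuming document signatures are built by superimposed coding. The document made of the terms cold, days and hot is hypothetical and chosen only for illustration; the hash strings are the ones in the table. A query term can match only if every bit of its signature is also set in the document signature.

```python
SIGNATURE_FILE = {
    "cold": "1000000000100100",
    "days": "0010010000001000",
    "hot":  "0000101000000000",
    "in":   "0000100100100000",
    "it":   "0000100010000010",
    "like": "0100001000000001",
    "nine": "0010100000000100",
    "old":  "1000100001000000",
}


def document_signature(terms):
    """Superimpose (bitwise OR) the hash strings of all terms in a document."""
    signature = 0
    for term in terms:
        signature |= int(SIGNATURE_FILE[term], 2)
    return signature


def may_contain(doc_signature, term):
    """True if all bits of the query term's signature are set in the document signature."""
    query = int(SIGNATURE_FILE[term], 2)
    return (doc_signature & query) == query


doc = document_signature(["cold", "days", "hot"])  # hypothetical document
print(format(doc, "016b"))       # 1010111000101100
print(may_contain(doc, "hot"))   # True
print(may_contain(doc, "old"))   # False
print(may_contain(doc, "nine"))  # True, although "nine" is not in the document
```

The last line is a false match: the bits of “nine” (positions 3, 5 and 14) happen to be set by “days”, “hot” and “cold”. A signature match therefore only means the document may contain the term, and candidate documents still have to be checked against the text.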
Which is better: inverted file or signature file?
Inverted files
- Accurate
- Easy to maintain
- Slow retrieval
Inverted files are the most popular storage structure for “INFORMATION RETRIEVAL”.