Similarity Metrics for Textual Data.pptx

SrishtiSharma740264, 11 slides, Oct 12, 2024

About This Presentation

Natural Language Processing- Similarity metrics for text


Slide Content

Similarity Metric

Contents
- What is a similarity metric?
- String similarity metrics
- Types of similarity metrics
- Levenshtein edit distance
- Cosine similarity
- Jaccard coefficient
- Similarity on the basis of information content in the context of a corpus: Resnik similarity, Jiang-Conrath similarity, Lin similarity
- Path similarity: path similarity, Leacock-Chodorow similarity, Wu-Palmer similarity
- Zipf's law

Levenshtein Edit Distance

    from nltk.metrics.distance import edit_distance

    reference = 'meet me at the airport tomorrow'.split()
    test = 'meat me at the aeroport 2morrw'.split()
    print("length of reference sentence =", len(reference))
    print("/*****calculating edit distance using Levenshtein****/")
    summ = 0
    for i in range(len(reference)):
        dist = edit_distance(reference[i], test[i])
        print("edit distance for word", i, "=", dist)
        summ = summ + dist
    print("total corrections =", summ)
    print("average corrections per word =", summ / len(reference))

Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the minimum number of deletions, insertions, or substitutions required to transform s into t.
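The definition above (minimum deletions, insertions, and substitutions) can also be sketched directly as a dynamic-programming table, with no NLTK dependency. This is a minimal illustration of the standard algorithm, not the library's implementation:

```python
def levenshtein(s, t):
    """Minimum number of deletions, insertions, or substitutions
    needed to transform string s into string t."""
    # prev[j] holds the distance between s[:i-1] and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # transforming s[:i] into the empty string costs i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (free on match)
        prev = curr
    return prev[-1]

print(levenshtein('meet', 'meat'))  # one substitution
print(levenshtein('tomorrow', '2morrw'))
```

Running it on the word pairs from the slide reproduces the per-word corrections that `edit_distance` reports.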

Cosine Similarity

    import math
    import re
    from collections import Counter

    WORD = re.compile(r'\w+')

    def get_cosine(vec1, vec2):
        intersection = set(vec1.keys()) & set(vec2.keys())
        numerator = sum(vec1[x] * vec2[x] for x in intersection)
        sum1 = sum(vec1[x] ** 2 for x in vec1.keys())
        sum2 = sum(vec2[x] ** 2 for x in vec2.keys())
        denominator = math.sqrt(sum1) * math.sqrt(sum2)
        if not denominator:
            return 0.0
        return float(numerator) / denominator

    def text_to_vector(text):
        words = WORD.findall(text)
        return Counter(words)

    text1 = 'meet me at the airport tomorrow'
    text2 = 'meat me at the aeroport 2morrw'
    vector1 = text_to_vector(text1)
    vector2 = text_to_vector(text2)
    cosine = get_cosine(vector1, vector2)
    print('Cosine:', cosine)

Jaccard Coefficient

    def compute_jaccard_index(set_1, set_2):
        n = len(set_1.intersection(set_2))
        return n / float(len(set_1) + len(set_2) - n)

    refer_set = set('meet me at the airport tomorrow'.split())
    test_set = set('meat me at the aeroport 2morrw'.split())
    print("Jaccard Coefficient:", compute_jaccard_index(refer_set, test_set))

The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for comparing the similarity and diversity of sample sets.

Similarity on the Basis of Information Content (in reference to a corpus)

Resnik similarity: returns a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node): sim_res(c1, c2) = IC(LCS(c1, c2)).

Jiang-Conrath similarity: returns a score denoting how similar two word senses are, based on the IC of the Least Common Subsumer (most specific ancestor node) and that of the two input synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).

Lin similarity: returns a score denoting how similar two word senses are, based on the IC of the Least Common Subsumer (most specific ancestor node) and that of the two input synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).
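The three measures differ only in how they combine the same IC terms. A minimal sketch using hypothetical IC values (illustrative numbers only, not drawn from any real corpus):

```python
# Hypothetical information-content values for two senses and their
# least common subsumer (illustrative only, not from a real corpus).
ic_s1, ic_s2 = 7.0, 7.5
ic_lcs = 5.0

resnik = ic_lcs                         # IC(LCS(c1, c2))
jcn = 1 / (ic_s1 + ic_s2 - 2 * ic_lcs)  # 1 / (IC(s1) + IC(s2) - 2*IC(lcs))
lin = 2 * ic_lcs / (ic_s1 + ic_s2)      # 2*IC(lcs) / (IC(s1) + IC(s2))

print(resnik, round(jcn, 3), round(lin, 3))
```

Note that Jiang-Conrath grows without bound as the two senses approach their subsumer, while Lin is normalized to the range 0 to 1.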

Example for Information content


Code:

    from nltk.corpus import wordnet as wn
    from nltk.corpus import wordnet_ic

    print("Load an Information Content file: wordnet_ic")
    brown_ic = wordnet_ic.ic('ic-brown.dat')
    semcor_ic = wordnet_ic.ic('ic-semcor.dat')
    dog = wn.synset('dog.n.01')
    cat = wn.synset('cat.n.01')
    print("Resnik similarity in reference to Brown corpus for dog, cat:", dog.res_similarity(cat, brown_ic))
    print("Resnik similarity in reference to SemCor corpus for dog, cat:", dog.res_similarity(cat, semcor_ic))
    print("Jiang-Conrath similarity for dog and cat in reference to Brown corpus:", dog.jcn_similarity(cat, brown_ic))
    print("Jiang-Conrath similarity for dog and cat in reference to SemCor corpus:", dog.jcn_similarity(cat, semcor_ic))
    print("Lin similarity for dog and cat in reference to Brown corpus:", dog.lin_similarity(cat, brown_ic))
    print("Lin similarity for dog and cat in reference to SemCor corpus:", dog.lin_similarity(cat, semcor_ic))

Path Similarity

Path similarity: returns a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1.

Leacock-Chodorow similarity: the relationship is given as -log(p/2d), where p is the shortest path length and d the taxonomy depth.

Wu-Palmer similarity: returns a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).
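These path-based scores can likewise be sketched directly from the stated formulas. The values of p, d, and the depths below are hypothetical, and the 1/(p+1) form for path similarity is the common convention; NLTK's exact handling of roots and depths may differ slightly:

```python
import math

# Hypothetical values: shortest is-a path length p and taxonomy depth d.
p, d = 4, 19

path_sim = 1 / (p + 1)        # common convention: shorter path -> score nearer 1
lch = -math.log(p / (2 * d))  # Leacock-Chodorow: -log(p / 2d)

# Wu-Palmer uses taxonomy depths rather than raw path length
# (hypothetical depths for two senses and their least common subsumer):
depth_s1, depth_s2, depth_lcs = 13, 13, 11
wup = 2 * depth_lcs / (depth_s1 + depth_s2)

print(round(path_sim, 3), round(lch, 3), round(wup, 3))
```

Unlike path similarity and Wu-Palmer, Leacock-Chodorow is not bounded by 1; deeper taxonomies yield larger scores for the same path length.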

Code:

    from nltk.corpus import wordnet as wn

    dog_synset = wn.synsets('dog')
    dog = wn.synset('dog.n.01')
    cat = wn.synset('cat.n.01')
    print(dog_synset)
    print('path similarity of dog with cat =', dog.path_similarity(cat))
    print("Leacock-Chodorow similarity for (dog, cat) =", dog.lch_similarity(cat))
    print("Wu-Palmer similarity for (dog, cat) =", dog.wup_similarity(cat))
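The contents also list Zipf's law (when words are ranked by frequency, rank times frequency stays roughly constant across a corpus); no slide text for it survives here, but a minimal standard-library illustration on a toy text:

```python
from collections import Counter

# A tiny toy text (illustrative only; Zipf's law emerges clearly
# only on large corpora).
text = ("the quick brown fox jumps over the lazy dog the fox and the dog "
        "meet at the airport the fox is quick and the dog is lazy")
counts = Counter(text.split())

# Rank words by frequency and inspect rank * frequency.
for rank, (word, freq) in enumerate(counts.most_common(5), start=1):
    print(rank, word, freq, rank * freq)
```

On a realistic corpus the rank-frequency product is far more stable, which is why `Counter.most_common` plots of large text collections approximate a straight line on log-log axes.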