SrishtiSharma740264
Oct 12, 2024
About This Presentation
Natural Language Processing- Similarity metrics for text
Slides: 11 pages
Slide Content
Similarity Metric
Contents
- What are similarity metrics?
- String similarity metrics
- Types of similarity metrics
- Levenshtein edit distance
- Cosine similarity
- Jaccard coefficient
- Similarity on the basis of information content in the context of a corpus
  - Resnik similarity
  - Jiang-Conrath similarity
  - Lin similarity
- Path similarity
- Leacock-Chodorow similarity
- Wu-Palmer similarity
- Zipf's law
Levenshtein Edit Distance

Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the minimum number of deletions, insertions, or substitutions required to transform s into t.

Code (comparing the two sentences word by word with NLTK's edit_distance):

from nltk.metrics import edit_distance

reference = 'meet me at the airport tomorrow'.split()
test = 'meat me at the aeroport 2morrw'.split()
print("length of reference sentence =", len(reference))
print("/***** calculating edit distance using Levenshtein *****/")
summ = 0
for i in range(len(reference)):
    dist = edit_distance(reference[i], test[i])
    print("edit distance for word", i, "=", dist)
    summ = summ + dist
print("total corrections =", summ)
print("average corrections per word =", summ / len(reference))
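The deletion/insertion/substitution definition above can also be implemented directly with dynamic programming, without NLTK; a minimal sketch:

```python
def levenshtein(s, t):
    # prev[j] holds the edit distance between the prefix of s seen so far and t[:j]
    prev = list(range(len(t) + 1))
    for i, sc in enumerate(s, start=1):
        curr = [i]
        for j, tc in enumerate(t, start=1):
            cost = 0 if sc == tc else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein('meet', 'meat'))        # 1 (one substitution)
print(levenshtein('tomorrow', '2morrw'))  # 3
```

Each cell considers the three allowed operations, so the final cell is the minimum total number of corrections.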
Cosine Similarity

Code:

import math
import re
from collections import Counter

# WORD was left undefined on the slide; a simple word-token pattern is assumed here
WORD = re.compile(r'\w+')

def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)
    sum1 = sum(vec1[x] ** 2 for x in vec1.keys())
    sum2 = sum(vec2[x] ** 2 for x in vec2.keys())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    return float(numerator) / denominator

def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)

text1 = 'meet me at the airport tomorrow'
text2 = 'meat me at the aeroport 2morrw'
vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)
cosine = get_cosine(vector1, vector2)
print('Cosine:', cosine)
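For intuition, the cosine of two small count vectors can be checked by hand; the tiny vectors below are illustrative, not taken from the slides:

```python
import math
from collections import Counter

v1 = Counter({'me': 1, 'at': 1, 'the': 1, 'meet': 1})
v2 = Counter({'me': 1, 'at': 1, 'the': 1, 'meat': 1})

# shared words: me, at, the -> dot product = 3
dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
norm1 = math.sqrt(sum(c * c for c in v1.values()))  # sqrt(4) = 2
norm2 = math.sqrt(sum(c * c for c in v2.values()))  # sqrt(4) = 2
print(dot / (norm1 * norm2))  # 3 / (2 * 2) = 0.75
```

Three of the four words overlap, so the two sentences score 0.75 rather than 1.0.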
Jaccard Coefficient

The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for comparing the similarity and diversity of sample sets.

Code:

def compute_jaccard_index(set_1, set_2):
    n = len(set_1.intersection(set_2))
    return n / float(len(set_1) + len(set_2) - n)

# the slide gives the sentences as strings; they must be split into word sets
refer_set = set('meet me at the airport tomorrow'.split())
test_set = set('meat me at the aeroport 2morrw'.split())
print("Jaccard Coefficient:", compute_jaccard_index(refer_set, test_set))
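The Jaccard index can also be applied to character bigrams rather than whole words, which makes it tolerant of spelling variants like those in the test sentence; this bigram variant is an added illustration, not from the slides:

```python
def bigrams(word):
    # set of adjacent character pairs, e.g. 'meet' -> {'me', 'ee', 'et'}
    return {word[i:i + 2] for i in range(len(word) - 1)}

def jaccard(a, b):
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

# 'meet' -> {'me','ee','et'}, 'meat' -> {'me','ea','at'}: 1 shared of 5 total
print(jaccard(bigrams('meet'), bigrams('meat')))  # 0.2
```

Word-level Jaccard scores 'meet' vs 'meat' as 0 (different tokens), while the bigram version still credits the shared 'me' prefix.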
Similarity on the Basis of Information Content (in reference to a corpus)

Resnik similarity: returns a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node):
    sim_res(c1, c2) = IC(LCS(c1, c2))

Jiang-Conrath similarity: returns a score denoting how similar two word senses are, based on the IC of the Least Common Subsumer and that of the two input synsets. The relationship is given by the equation:
    1 / (IC(s1) + IC(s2) - 2 * IC(lcs))

Lin similarity: returns a score denoting how similar two word senses are, based on the IC of the Least Common Subsumer and that of the two input synsets. The relationship is given by the equation:
    2 * IC(lcs) / (IC(s1) + IC(s2))
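Assuming the IC values are already known, the three formulas reduce to a few lines each; the numeric IC values below are made up purely for illustration:

```python
def resnik(ic_lcs, ic1, ic2):
    # Resnik: the similarity is the IC of the least common subsumer alone
    return ic_lcs

def jiang_conrath(ic_lcs, ic1, ic2):
    # Jiang-Conrath: inverse of the IC "distance" between the two senses
    return 1.0 / (ic1 + ic2 - 2.0 * ic_lcs)

def lin(ic_lcs, ic1, ic2):
    # Lin: ratio of shared information to the senses' total information
    return 2.0 * ic_lcs / (ic1 + ic2)

# hypothetical values: IC(lcs) = 6.0, IC(s1) = 7.0, IC(s2) = 8.0
print(resnik(6.0, 7.0, 8.0))         # 6.0
print(jiang_conrath(6.0, 7.0, 8.0))  # 1 / (7 + 8 - 12) = 0.333...
print(lin(6.0, 7.0, 8.0))            # 12 / 15 = 0.8
```

All three rise as the least common subsumer becomes more specific (higher IC); they differ only in how they normalize against the two senses' own IC.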
Example for Information Content
(figure slides; images not preserved in the text export)
Code:

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# the dog/cat synsets are defined on the Path Similarity slide; repeated here so the snippet runs on its own
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')

print("Load an Information Content file: wordnet_ic")
brown_ic = wordnet_ic.ic('ic-brown.dat')
semcor_ic = wordnet_ic.ic('ic-semcor.dat')
print("Resnik similarity in reference to Brown corpus for dog, cat:", dog.res_similarity(cat, brown_ic))
print("Resnik similarity in reference to SemCor corpus for dog, cat:", dog.res_similarity(cat, semcor_ic))
print("Jiang-Conrath similarity for dog and cat in reference to Brown corpus:", dog.jcn_similarity(cat, brown_ic))
print("Jiang-Conrath similarity for dog and cat in reference to SemCor corpus:", dog.jcn_similarity(cat, semcor_ic))
print("Lin similarity for dog and cat in reference to Brown corpus:", dog.lin_similarity(cat, brown_ic))
print("Lin similarity for dog and cat in reference to SemCor corpus:", dog.lin_similarity(cat, semcor_ic))
Path Similarity

Path similarity: returns a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1.

Leacock-Chodorow similarity: the relationship is given as -log(p/2d), where p is the shortest path length and d the taxonomy depth.

Wu-Palmer similarity: returns a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).
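The two formulas above can be checked numerically; the path length and depth values below are illustrative stand-ins, not figures from WordNet:

```python
import math

def lch(path_len, depth):
    # Leacock-Chodorow: -log(p / 2d); larger score for shorter paths
    return -math.log(path_len / (2.0 * depth))

def wup(depth_lcs, depth1, depth2):
    # Wu-Palmer: 2 * depth(LCS) / (depth(s1) + depth(s2)); lies in (0, 1]
    return 2.0 * depth_lcs / (depth1 + depth2)

# illustrative values: shortest path 5, taxonomy depth 20
print(lch(5, 20))       # -log(5/40) = log(8), about 2.079
# illustrative depths: LCS at depth 10, both senses at depth 13
print(wup(10, 13, 13))  # 20 / 26, about 0.769
```

Note the opposite scales: LCH grows without bound as the path shrinks, while Wu-Palmer is normalized to at most 1.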
Code:

from nltk.corpus import wordnet as wn

dog_synset = wn.synsets('dog')
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
print(dog_synset)
print('path similarity of dog with cat =', dog.path_similarity(cat))
print("Leacock-Chodorow similarity for (dog, cat):", dog.lch_similarity(cat))
print("Wu-Palmer similarity for (dog, cat) =", dog.wup_similarity(cat))