Similarity Metrics for Textual Data.pptx

SrishtiSharma740264, 11 slides, Oct 12, 2024

About This Presentation

Natural Language Processing- Similarity metrics for text


Slide Content

Similarity Metric

Contents
- What is a similarity metric?
- String similarity metrics
- Types of similarity metrics
- Levenshtein edit distance
- Cosine similarity
- Jaccard coefficient
- Similarity on the basis of information content in the context of a corpus: Resnik similarity, Jiang-Conrath similarity, Lin similarity
- Path similarity: path similarity, Leacock-Chodorow similarity, Wu-Palmer similarity
- Zipf's law

Levenshtein Edit Distance

    from nltk.metrics.distance import edit_distance

    reference = 'meet me at the airport tomorrow'.split()
    test = 'meat me at the aeroport 2morrw'.split()
    print("length of reference sentence =", len(reference))
    print("/*****calculating edit distance using Levenshtein****/")
    summ = 0
    for i in range(len(reference)):
        dist = edit_distance(reference[i], test[i])
        print("edit distance for word", i, "=", dist)
        summ = summ + dist
    print("total corrections =", summ)
    print("average corrections per word =", summ / len(reference))

Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the minimum number of deletions, insertions, or substitutions required to transform s into t.
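The definition above (minimum deletions, insertions, and substitutions) can also be sketched directly as a dynamic-programming table, with no NLTK dependency. This is a minimal illustration of the standard algorithm, not the library's implementation:

```python
def levenshtein(s, t):
    """Minimum number of deletions, insertions, or substitutions
    needed to transform string s into string t."""
    # prev[j] holds the distance between s[:i-1] and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # transforming s[:i] into the empty string costs i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (free on match)
        prev = curr
    return prev[-1]

print(levenshtein('meet', 'meat'))  # one substitution
print(levenshtein('tomorrow', '2morrw'))
```

Running it on the word pairs from the slide reproduces the per-word corrections that `edit_distance` reports.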

Cosine Similarity

    import math
    import re
    from collections import Counter

    WORD = re.compile(r'\w+')

    def get_cosine(vec1, vec2):
        intersection = set(vec1.keys()) & set(vec2.keys())
        numerator = sum(vec1[x] * vec2[x] for x in intersection)
        sum1 = sum(vec1[x] ** 2 for x in vec1.keys())
        sum2 = sum(vec2[x] ** 2 for x in vec2.keys())
        denominator = math.sqrt(sum1) * math.sqrt(sum2)
        if not denominator:
            return 0.0
        return float(numerator) / denominator

    def text_to_vector(text):
        words = WORD.findall(text)
        return Counter(words)

    text1 = 'meet me at the airport tomorrow'
    text2 = 'meat me at the aeroport 2morrw'
    vector1 = text_to_vector(text1)
    vector2 = text_to_vector(text2)
    cosine = get_cosine(vector1, vector2)
    print('Cosine:', cosine)

Jaccard Coefficient

    def compute_jaccard_index(set_1, set_2):
        n = len(set_1.intersection(set_2))
        return n / float(len(set_1) + len(set_2) - n)

    refer_set = set('meet me at the airport tomorrow'.split())
    test_set = set('meat me at the aeroport 2morrw'.split())
    print("Jaccard Coefficient:", compute_jaccard_index(refer_set, test_set))

The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for comparing the similarity and diversity of sample sets.

Similarity on the Basis of Information Content (in reference to a corpus)

Resnik similarity: returns a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node): sim_res(c1, c2) = IC(LCS(c1, c2)).

Jiang-Conrath similarity: returns a score denoting how similar two word senses are, based on the IC of the Least Common Subsumer (most specific ancestor node) and that of the two input synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).

Lin similarity: returns a score denoting how similar two word senses are, based on the IC of the Least Common Subsumer (most specific ancestor node) and that of the two input synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).
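The three measures differ only in how they combine the same IC terms. A minimal sketch using hypothetical IC values (illustrative numbers only, not drawn from any real corpus):

```python
# Hypothetical information-content values for two senses and their
# least common subsumer (illustrative only, not from a real corpus).
ic_s1, ic_s2 = 7.0, 7.5
ic_lcs = 5.0

resnik = ic_lcs                         # IC(LCS(c1, c2))
jcn = 1 / (ic_s1 + ic_s2 - 2 * ic_lcs)  # 1 / (IC(s1) + IC(s2) - 2*IC(lcs))
lin = 2 * ic_lcs / (ic_s1 + ic_s2)      # 2*IC(lcs) / (IC(s1) + IC(s2))

print(resnik, round(jcn, 3), round(lin, 3))
```

Note that Jiang-Conrath grows without bound as the two senses approach their subsumer, while Lin is normalized to the range 0 to 1.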

Example for Information content


Code:

    from nltk.corpus import wordnet as wn
    from nltk.corpus import wordnet_ic

    print("Load an Information Content file: wordnet_ic")
    brown_ic = wordnet_ic.ic('ic-brown.dat')
    semcor_ic = wordnet_ic.ic('ic-semcor.dat')
    dog = wn.synset('dog.n.01')
    cat = wn.synset('cat.n.01')
    print("Resnik similarity in reference to Brown corpus for dog, cat:", dog.res_similarity(cat, brown_ic))
    print("Resnik similarity in reference to SemCor corpus for dog, cat:", dog.res_similarity(cat, semcor_ic))
    print("Jiang-Conrath similarity for dog and cat in reference to Brown corpus:", dog.jcn_similarity(cat, brown_ic))
    print("Jiang-Conrath similarity for dog and cat in reference to SemCor corpus:", dog.jcn_similarity(cat, semcor_ic))
    print("Lin similarity for dog and cat in reference to Brown corpus:", dog.lin_similarity(cat, brown_ic))
    print("Lin similarity for dog and cat in reference to SemCor corpus:", dog.lin_similarity(cat, semcor_ic))

Path Similarity

Path similarity: returns a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1.

Leacock-Chodorow similarity: the relationship is given as -log(p/2d), where p is the shortest path length and d the taxonomy depth.

Wu-Palmer similarity: returns a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).
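These path-based scores can likewise be sketched directly from the stated formulas. The values of p, d, and the depths below are hypothetical, and the 1/(p+1) form for path similarity is the common convention; NLTK's exact handling of roots and depths may differ slightly:

```python
import math

# Hypothetical values: shortest is-a path length p and taxonomy depth d.
p, d = 4, 19

path_sim = 1 / (p + 1)        # common convention: shorter path -> score nearer 1
lch = -math.log(p / (2 * d))  # Leacock-Chodorow: -log(p / 2d)

# Wu-Palmer uses taxonomy depths rather than raw path length
# (hypothetical depths for two senses and their least common subsumer):
depth_s1, depth_s2, depth_lcs = 13, 13, 11
wup = 2 * depth_lcs / (depth_s1 + depth_s2)

print(round(path_sim, 3), round(lch, 3), round(wup, 3))
```

Unlike path similarity and Wu-Palmer, Leacock-Chodorow is not bounded by 1; deeper taxonomies yield larger scores for the same path length.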

Code:

    from nltk.corpus import wordnet as wn

    dog_synset = wn.synsets('dog')
    dog = wn.synset('dog.n.01')
    cat = wn.synset('cat.n.01')
    print(dog_synset)
    print('path similarity of dog with cat =', dog.path_similarity(cat))
    print("Leacock-Chodorow similarity for (dog, cat) =", dog.lch_similarity(cat))
    print("Wu-Palmer similarity for (dog, cat) =", dog.wup_similarity(cat))
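The contents also list Zipf's law (when words are ranked by frequency, rank times frequency stays roughly constant across a corpus); no slide text for it survives here, but a minimal standard-library illustration on a toy text:

```python
from collections import Counter

# A tiny toy text (illustrative only; Zipf's law emerges clearly
# only on large corpora).
text = ("the quick brown fox jumps over the lazy dog the fox and the dog "
        "meet at the airport the fox is quick and the dog is lazy")
counts = Counter(text.split())

# Rank words by frequency and inspect rank * frequency.
for rank, (word, freq) in enumerate(counts.most_common(5), start=1):
    print(rank, word, freq, rank * freq)
```

On a realistic corpus the rank-frequency product is far more stable, which is why `Counter.most_common` plots of large text collections approximate a straight line on log-log axes.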