Near Duplicate Image Detection: min-Hash and tf-idf weighting



Near Duplicate Image Detection:
min-Hash and tf-idf weighting
Ondřej Chum
Center for Machine Perception
Czech Technical University in Prague
co-authors: James Philbin and Andrew Zisserman

Outline
• Near duplicate detection and large databases
(find all groups of near duplicate images in a database)
• min-Hash review
• Novel similarity measures
• Results on TrecVid 2006
• Results on the University of Kentucky database (Nister & Stewenius)
• Beyond near duplicates

Scalable Near Duplicate Image Detection
• Images perceptually (almost) identical but not identical
(noise, compression level, small motion, small occlusion)
• Similar images of the same object / scene
• Large databases
• Fast – linear in the number of duplicates
• Store a small constant amount of data per image

Image Representation
Feature detector → SIFT descriptor [Lowe’04] → vector quantization against a visual vocabulary
Bag of words: term-frequency vector, e.g. (0, 4, 0, 2, …)
Set of words: binary presence vector, e.g. (0, 1, 0, 1, …)
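The two representations can be sketched in a few lines; the vocabulary size and word ids below are made up for illustration:

```python
# Toy illustration: each detected feature has been quantized to a visual
# word id (made-up values).
vocab_size = 6
word_ids = [1, 3, 3, 3, 1, 5]  # visual word id of each detected feature

# Bag of words: term frequency per visual word.
bag = [0] * vocab_size
for w in word_ids:
    bag[w] += 1

# Set of words: binary presence, the representation min-Hash operates on.
set_of_words = [1 if c > 0 else 0 for c in bag]

print(bag)           # [0, 2, 0, 3, 0, 1]
print(set_of_words)  # [0, 1, 0, 1, 0, 1]
```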

min-Hash
Image similarity measured as a set overlap (using the min-Hash algorithm):
sim(A1, A2) = |A1 ∩ A2| / |A1 ∪ A2|
Spatially related images share visual words.
min-Hash is a locality sensitive hashing (LSH) function m that selects elements m(A1) from set A1 and m(A2) from set A2 so that
P{m(A1) == m(A2)} = sim(A1, A2)

min-Hash
Vocabulary: A B C D E F
Set A = {A, B, C}, Set B = {B, C, D}, Set C = {A, E, F}
Each hash function assigns every vocabulary word an independent random value ~ Un(0, 1); the min-Hash of a set is its smallest-valued word. E.g. values (0.63, 0.88, 0.55, 0.94, 0.31, 0.19) for A–F give min-Hashes C, C, F for sets A, B, C.
f1: C C F
f2: A B A
f3: C C A
f4: B B E
Estimated overlap (fraction of agreeing min-Hashes; true similarity in parentheses):
overlap(A, B) = 3/4 (1/2), overlap(A, C) = 1/4 (1/5), overlap(B, C) = 0 (0)
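The estimate above can be reproduced with a tiny simulation. This is a generic sketch of plain min-Hash on the slide's three sets; the function name and the number of hash functions are our choices, not from the slides:

```python
import random

def minhash_overlap(A, B, n_hashes=400, seed=0):
    """Estimate sim(A, B) = |A ∩ B| / |A ∪ B| from n independent min-Hashes."""
    rng = random.Random(seed)
    vocab = sorted(A | B)  # words outside both sets can never be a min-Hash
    hits = 0
    for _ in range(n_hashes):
        # one random ordering of the vocabulary = one hash function
        rank = {w: rng.random() for w in vocab}
        if min(A, key=rank.get) == min(B, key=rank.get):
            hits += 1
    return hits / n_hashes

# Sets from the slide: sim(A, B) = 1/2, sim(A, C) = 1/5, sim(B, C) = 0
A = {"A", "B", "C"}
B = {"B", "C", "D"}
C = {"A", "E", "F"}
print(minhash_overlap(A, B))  # close to 0.5
print(minhash_overlap(B, C))  # exactly 0.0: disjoint sets never agree
```

More hash functions shrink the estimation error, which is exactly the trade-off the sketches on the next slide manage.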

min-Hash Retrieval
Sketch: an s-tuple of min-Hashes; each image is stored in k hash tables, one per sketch.
A pair of images (A, B) is retrieved when at least one of their sketches collides.
Probability of sketch collision: sim(A, B)^s
Probability of retrieval (at least one sketch collision): 1 – (1 – sim(A, B)^s)^k
s – size of the sketch
k – number of hash tables
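The two formulas translate directly into code; a minimal sketch, plugging in the s and k used on the following slide:

```python
def p_sketch_collision(sim, s):
    """Probability that one s-tuple of min-Hashes agrees for both images."""
    return sim ** s

def p_retrieval(sim, s, k):
    """Probability of at least one collision among k independent sketches."""
    return 1 - (1 - p_sketch_collision(sim, s)) ** k

# s = 3, k = 512: highly similar pairs are retrieved almost surely,
# weakly similar pairs almost never.
print(p_retrieval(0.8, 3, 512))   # ≈ 1.0
print(p_retrieval(0.02, 3, 512))  # ≈ 0.004
```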

Probability of Retrieving an Image Pair
[Plot: probability of retrieval vs. similarity (set overlap), for s = 3, k = 512. Near duplicate images fall in the high-similarity, high-probability region; images of the same object in the middle; unrelated images are almost never retrieved.]

More Complex Similarity Measures

Document / Image / Object Retrieval
Term Frequency – Inverse Document Frequency (tf-idf) weighting scheme:
idf_w = log( # documents / # docs containing X_w )
Words common to many documents are less informative.
Frequency of the words is recorded (good for repeated structures, textures, etc…)
[1] Baeza-Yates, Ribeiro-Neto. Modern Information Retrieval. ACM Press, 1999.
[2] Sivic, Zisserman. Video Google: A text retrieval approach to object matching in videos. ICCV’03.
[3] Nister, Stewenius. Scalable recognition with a vocabulary tree. CVPR’06.
[4] Philbin, Chum, Isard, Sivic, Zisserman. Object retrieval with large vocabularies and fast spatial matching. CVPR’07.
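As a quick sanity check of the idf term (a sketch; the corpus counts are invented):

```python
import math

def idf(num_docs, docs_containing_w):
    """idf_w = log(# documents / # docs containing X_w)."""
    return math.log(num_docs / docs_containing_w)

# A word present in every document is uninformative...
print(idf(1000, 1000))  # 0.0
# ...while a rare word receives a high weight.
print(idf(1000, 10))    # ≈ 4.6
```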

More Complex Similarity Measures
• Set of words representation
– different importance of visual words: importance d_w of word X_w
• Bag of words representation (frequency is recorded)
– histogram intersection similarity measure
– different importance of visual words: importance d_w of word X_w

Word Weighting for min-Hash
For a uniform hash function (set overlap similarity), all words X_w have the same chance to be a min-Hash.
For a weighted hash function, the probability of X_w being a min-Hash is proportional to its importance d_w.
[Illustration: the unit interval divided into segments of length proportional to d_A, d_C, d_E, d_V, d_J, d_Q, d_Y, d_Z, d_R.]
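One way to realize a hash function with this property is the exponential-clock construction (our choice of realization; the slides do not fix one): word X_w draws the value −ln(u)/d_w with u ~ Un(0, 1), and the smallest value wins with probability proportional to d_w:

```python
import math, random
from collections import Counter

def weighted_minhash(S, d, hash_index):
    """min-Hash of set S where word w is selected with probability
    proportional to its importance d[w] (exponential-clock construction)."""
    best, best_val = None, float("inf")
    for w in S:
        # the same (hash function, word) pair must get the same random
        # value in every set, so key the RNG on both
        u = random.Random(f"{hash_index}:{w}").random()
        val = -math.log(u) / d[w]  # smaller for more important words
        if val < best_val:
            best, best_val = w, val
    return best

d = {"A": 1.0, "B": 1.0, "C": 4.0}  # made-up importances d_w
S = {"A", "B", "C"}
counts = Counter(weighted_minhash(S, d, i) for i in range(3000))
print(counts["C"] / 3000)  # close to d_C / (d_A + d_B + d_C) = 4/6
```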

Histogram Intersection Using min-Hash
Idea: represent a histogram as a set, use the min-Hash set machinery.
Visual words: A B C D
t_A = (2, 1, 3, 0), t_B = (0, 2, 3, 1)
min-Hash vocabulary: each visual word replicated once per occurrence: A1, A2, B1, B2, C1, C2, C3, D1
Bag of words A → set A’ = {A1, A2, B1, C1, C2, C3}
Bag of words B → set B’ = {B1, B2, C1, C2, C3, D1}
A’ ∪ B’ = {A1, A2, B1, B2, C1, C2, C3, D1}
Set overlap of A’ and B’ is a histogram intersection of A and B.
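The unary replication is mechanical; a short sketch (names and the (word, occurrence) encoding are ours) verifying the claim on the slide's own histograms:

```python
def histogram_to_set(t):
    """Replicate visual word w once per occurrence: word w with count c
    becomes the elements (w, 1), ..., (w, c)."""
    return {(w, i) for w, c in enumerate(t) for i in range(1, c + 1)}

tA = (2, 1, 3, 0)  # counts over visual words A, B, C, D
tB = (0, 2, 3, 1)

A_, B_ = histogram_to_set(tA), histogram_to_set(tB)
inter, union = len(A_ & B_), len(A_ | B_)
hist_inter = sum(min(a, b) for a, b in zip(tA, tB))

print(inter, hist_inter)  # 4 4  (set intersection = histogram intersection)
print(inter / union)      # 0.5  (the overlap plain min-Hash then estimates)
```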

Results
• Quality of the retrieval
• Speed – the number of documents considered as near-duplicates

TRECVid Challenge
• 165 hours of news footage, different channels, different countries
• 146,588 key-frames, 352×240 pixels
• No ground truth on near duplicates

Min-Hash on TrecVid
• DoG features
• vocabulary of 64,635 visual words
• 192 min-Hashes, 3 min-Hashes per sketch, 64 sketches
• similarity threshold 35%
• Examples of images with 24 – 45 near duplicates
• # common results / set overlap only / weighted set overlap only
• Quality of the retrieval appears to be similar

Comparison of Similarity Measures
Images only sharing uninformative visual words do not generate sketch collisions for the proposed similarity measures.
[Plot: number of sketch collisions vs. image pair similarity, for set overlap, weighted set overlap, and weighted histogram intersection.]

University of Kentucky Dataset
• 10,200 images in groups of four
• Querying by each image in turn
• Average number of correct retrievals in top 4 is measured

Evaluation
Vocabulary sizes 30k and 100k
Number of min-Hashes 512, 640, 768, and 896
2 min-Hashes per sketch
Number of sketches 0.5, 1, 2, and 3 times the number of min-Hashes
Score on average:
• weighted histogram intersection 4.6 % better than weighted set overlap
• weighted set overlap 1.5 % better than set overlap
Number of considered documents on average:
• weighted histogram intersection 1.7 times fewer than weighted set overlap
• weighted set overlap 1.5 times fewer than set overlap
Absolute numbers for weighted histogram intersection:

         min-Hashes  sketches  score 30k  score 100k  docs 30k  docs 100k
Usable      640         640      2.928      2.889       488.2     117.6
Best        896        2688      3.090      3.166      1790.8     452.8

Retrieval with tf-idf flat scoring [Nister & Stewenius]: score 3.16
Number of considered documents (non-zero tf-idf): 10,089.9 (30k) and 9,659.4 (100k)

Query Examples
Query image:
Results
Set overlap, weighted set overlap, weighted histogram intersection

Beyond Near Duplicate Detection

Discovery of Spatially Related Images
Find and match ALL groups (clusters) of spatially related images in a large database, using only visual information, i.e. not using (Flickr) tags, EXIF info, GPS, ….
Chum, Matas: Large Scale Discovery of Spatially Related Images, TR May 2008
available at http://cmp.felk.cvut.cz/~chum/Publ.htm

Probability of Retrieving an Image Pair
[Recap plot: probability of retrieval vs. similarity (set overlap); near duplicate images vs. images of the same object.]

Image Clusters as Connected Components
Randomized clustering method:
1. Seed Generation – hashing (fast, low recall)
Characterize images by pseudo-random numbers stored in a hash table; time complexity equal to the sum of second moments of a Poisson random variable – linear for database sizes up to D ≈ 2^40.
2. Seed Growing – retrieval (thorough – high recall)
Complete the clusters only for cluster members c << D; complexity O(cD).
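The growing step ends with connected components over verified image pairs. A minimal sketch of that final stage; union-find is our choice of data structure, and the pair list is invented:

```python
class UnionFind:
    """Disjoint-set forest with path halving."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

def cluster(num_images, verified_pairs):
    """Connected components over pairs confirmed as spatially related;
    singletons (unclustered images) are dropped."""
    uf = UnionFind(num_images)
    for a, b in verified_pairs:
        uf.union(a, b)
    components = {}
    for i in range(num_images):
        components.setdefault(uf.find(i), []).append(i)
    return [c for c in components.values() if len(c) > 1]

# pairs (0,1), (1,2) and (3,4) verified -> clusters {0,1,2} and {3,4}
print(cluster(6, [(0, 1), (1, 2), (3, 4)]))
```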

Clustering of 100k Images
Images downloaded from Flickr.
Includes 11 Oxford Landmarks with manually labelled ground truth: All Souls, Ashmolean, Balliol, Bodleian, Christ Church, Cornmarket, Hertford, Keble, Magdalen, Pitt Rivers, Radcliffe Camera.

Results on 100k Images
Number of images: 104,844
Timing: 17 min + 16 min = 0.019 sec / image
Component Recall (CR) – Chum, Matas, TR, May 2008:

Landmark          Good   OK  Unrelated     CR
All Souls           24   54          0  97.44
Ashmolean           12   13          0  68.00
Balliol              5    7          0  33.33
Bodleian            13   11          1  95.83
Christ Church       51   27          0  89.74
Cornmarket           5    4          0  66.67
Hertford            35   19          1  96.30
Keble                6    1          0  85.71
Magdalen            13   41          0   5.56
Pitt Rivers          3    3          0 100.00
Radcliffe Camera   105  116          0  98.64

Results on 100k Images
Number of images: 104,844
Timing: 17 min + 16 min = 0.019 sec / image
Component Recall (CR):

Landmark          Good   OK  Unrelated  CR (Chum, Matas)  CR (Philbin et al.)
All Souls           24   54          0             97.44                   96
Ashmolean           12   13          0             68.00                   60
Balliol              5    7          0             33.33                   33
Bodleian            13   11          1             95.83                   96
Christ Church       51   27          0             89.74                   71
Cornmarket           5    4          0             66.67                   67
Hertford            35   19          1             96.30                   65
Keble                6    1          0             85.71                   57
Magdalen            13   41          0              5.56                   20
Pitt Rivers          3    3          0            100.00                  100
Radcliffe Camera   105  116          0             98.64                   98

Chum, Matas, TR, May 2008
Philbin, Sivic, Zisserman, BMVC 2008 – 5,062 images, ?

Conclusions
• New similarity measures were derived for the min-Hash framework
– Weighted set overlap
– Histogram intersection
– Weighted histogram intersection
• Experiments show that the similarity measures are superior to the state of the art
– in the quality of the retrieval (up to 7% on the University of Kentucky dataset)
– in the speed of the retrieval (up to 2.5 times)
• min-Hash is a very useful tool for randomized image clustering

Thank you!