14 2 – Definition and Terminology
value is the actual character sequence that appears in the code, for instance a variable
name. Examples of token types without a value are keywords and braces.
Clone Fragment
A clone fragment is a contiguous code passage that appears, cloned to some degree of
similarity, in another location of the source code. Depending on the clone type, which
we discuss in Section 2.2, clone fragments may contain gaps that are not part of
the other cloned fragment.
In terms of token-based clone detection, a clone fragment $f(\mathit{file}, s, l)$ is a
contiguous sequence of tokens within one code file, which itself is a stream of tokens.
A fragment starts at index $s$ of the stream and has a length of $l$ tokens. A code clone
consists of at least two code fragments that are similar.
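The fragment definition above can be illustrated with a minimal sketch. The class name `Fragment`, the example file name, and the representation of tokens as plain strings are assumptions made for illustration only; an actual detector would carry a token type and an optional value per token.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fragment:
    """A clone fragment f(file, s, l): l consecutive tokens starting at index s."""
    file: str    # code file whose token stream contains the fragment
    start: int   # index s of the first token in the stream
    length: int  # number of tokens l

    def __post_init__(self) -> None:
        if self.start < 0 or self.length <= 0:
            raise ValueError("start must be >= 0 and length must be > 0")

    def tokens(self, stream: List[str]) -> List[str]:
        """Return the tokens covered by this fragment in the given stream."""
        return stream[self.start:self.start + self.length]


# Hypothetical token stream of one file (tokens simplified to strings).
stream = ["int", "x", "=", "a", "+", "b", ";",
          "int", "y", "=", "a", "+", "b", ";"]

f1 = Fragment("Example.java", 3, 4)   # covers: a + b ;
f2 = Fragment("Example.java", 10, 4)  # covers: a + b ;

# Two distinct fragments with identical token sequences.
same_tokens = f1.tokens(stream) == f2.tokens(stream)
```

Here `f1` and `f2` are different fragments (different start indices) that cover the same token sequence, which is the situation the term code clone describes.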
Clone Pair
When two fragments are clones of each other according to some degree of similarity, they
form a clone pair. The clone pair representation has the disadvantage that it produces
a high volume of data, because every pair is reported as a distinct entity. The number
of clone pairs needed to represent a fragment $f$ that has been copied until it occurs
$n$ times grows quadratically with $n$. This is the case because every fragment
$f_i$ is part of a clone pair with each of the other fragments $f_{i+1}, \ldots, f_n$. The number of clone
pairs can be computed using an adaptation of Gauss’ sum formula for integers:
$$\frac{n^2 - n}{2}.^{1}$$
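As a quick sanity check of the count $(n^2 - n)/2$, the following sketch enumerates all unordered pairs among $n$ fragments and compares the result with the closed form. The function names are illustrative and not part of any clone detector.

```python
from itertools import combinations
from typing import List, Tuple


def clone_pair_count(n: int) -> int:
    """Closed form (n^2 - n) / 2 for the number of clone pairs among n fragments."""
    return (n * n - n) // 2


def enumerate_pairs(n: int) -> List[Tuple[int, int]]:
    """List every unordered pair (i, j) with i < j among fragments 1..n."""
    return list(combinations(range(1, n + 1), 2))


# Four occurrences of a fragment yield six clone pairs.
pairs_for_four = enumerate_pairs(4)
count_for_four = clone_pair_count(4)
```

For four fragments the enumeration yields (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), that is, six pairs, which equals 1 + 2 + 3.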
Although the information represented by clone pairs is precise, it is not suitable
for studies such as the one we conduct in this thesis. Besides the sheer number of reported clone
pairs, another problem is the lack of grouping. A fragment that occurs four times
is reported as six separate clone pairs. This makes it impractical to infer higher-level
relationships from the data.
In terms of our token-based clone detection, a clone pair is a triple $cp(t, f_n, f_m)$,
where $f_n$ and $f_m$ are the two cloned code fragments. The type $t$ of the clone
relation expresses the degree of similarity of $f_n$ and $f_m$ and is further explained
in Section 2.2. If a fragment appears exactly cloned (type 1) in three locations
$f_1$, $f_2$, and $f_3$, there are three clone pairs: $cp_1(1, f_1, f_2)$, $cp_2(1, f_1, f_3)$, and $cp_3(1, f_2, f_3)$.
Among clone pairs of types 1 and 2 (see Section 2.2), the clone pair relation is
transitive ($cp_i(t_x, f_n, f_m) \wedge cp_j(t_x, f_m, f_o) \Rightarrow cp_k(t_x, f_n, f_o)$). It is also symmetric
($cp_i(t_x, f_n, f_m) = cp_j(t_x, f_m, f_n)$). By our definition, a clone detector never reports
both versions of a symmetric relation, but only one. The clone pair relation is
irreflexive, that is, a fragment is never a clone of itself.
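The properties of the clone pair relation can be sketched in code. This is an illustration under stated assumptions, not the representation used by any particular detector: the names `ClonePair`, `make_pair`, and `derive_transitive` are hypothetical, fragments are simplified to `(file, s, l)` tuples, and sorting the two fragments is just one possible way to ensure that only one version of a symmetric pair exists.

```python
from typing import NamedTuple, Optional, Tuple

# Simplified fragment f(file, s, l) as a plain tuple (assumption for brevity).
Fragment = Tuple[str, int, int]


class ClonePair(NamedTuple):
    """A clone pair cp(t, fn, fm) with clone type t."""
    clone_type: int
    first: Fragment
    second: Fragment


def make_pair(t: int, fn: Fragment, fm: Fragment) -> ClonePair:
    """Build a pair in canonical order so cp(t, fn, fm) equals cp(t, fm, fn)."""
    if fn == fm:
        # Irreflexive: a fragment is never a clone of itself.
        raise ValueError("a fragment cannot form a clone pair with itself")
    first, second = sorted((fn, fm))
    return ClonePair(t, first, second)


def derive_transitive(p: ClonePair, q: ClonePair) -> Optional[ClonePair]:
    """Return the pair implied by transitivity, or None if it does not apply.

    Transitivity is only assumed for types 1 and 2 and requires both pairs
    to have the same type and to share exactly one fragment.
    """
    if p.clone_type != q.clone_type or p.clone_type not in (1, 2):
        return None
    shared = {p.first, p.second} & {q.first, q.second}
    if len(shared) != 1:
        return None
    a, b = ({p.first, p.second} | {q.first, q.second}) - shared
    return make_pair(p.clone_type, a, b)


# Hypothetical fragments in two files.
f1: Fragment = ("A.java", 0, 5)
f2: Fragment = ("A.java", 20, 5)
f3: Fragment = ("B.java", 7, 5)

implied = derive_transitive(make_pair(1, f1, f2), make_pair(1, f2, f3))
```

In this sketch `implied` is the type 1 pair of `f1` and `f3`, while the same call with type 3 pairs returns `None`, since transitivity is only stated for types 1 and 2.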
1 Gauss’ original formula $\frac{n^2 + n}{2}$ sums all integers from 1 to $n$. In contrast, we sum only the integers from 1
to $n - 1$, because the first fragment is not a clone of itself.