214 Chapter 7 Web Mining
S1 and S3. Also, since support is measured not by transactions but by users, a sequence
is large if it is contained in at least one customer's sequence. After the sort step, we
have that D = (S1 = {U1, (A, B, C)}, S3 = {U1, (B, C, E)}, S2 = {U2, (A, C)}, S4 =
{U3, (A, C, D, C, E)}). We find L1 = {{A}, {B}, {C}, {D}, {E}} since each page is referenced
by at least one customer. The following table outlines the steps taken by AprioriAll:
There are variations of this algorithm and several techniques used to improve its
performance. The set of customer sequences is reformatted after L1 is found: each transaction
is replaced with one that consists only of pages from L1. Candidates may be pruned
before counting by removing any candidate that has a subsequence that is not large.
Variations on AprioriAll have been proposed to avoid generating so many candidates. In effect,
these improvements serve to avoid generating sequences that are not maximal.
C1 = {(A), (B), (C), (D), (E)}
L1 = {(A), (B), (C), (D), (E)}
C2 = {(A, B), (A, C), (A, D), (A, E), (B, A), (B, C), (B, D),
      (B, E), (C, A), (C, B), (C, D), (C, E), (D, A), (D, B),
      (D, C), (D, E), (E, A), (E, B), (E, C), (E, D)}
L2 = {(A, B), (A, C), (A, D), (A, E), (B, C),
      (B, E), (C, B), (C, D), (C, E), (D, C), (D, E)}
C3 = {(A, B, C), (A, B, D), (A, B, E), (A, C, B), (A, C, D), (A, C, E),
      (A, D, B), (A, D, C), (A, D, E), (A, E, B), (A, E, C), (A, E, D),
      (B, C, E), (B, E, C), (C, B, D), (C, B, E), (C, D, B), (C, D, E),
      (C, E, B), (C, E, D), (D, C, B), (D, C, E), (D, E, C)}
L3 = {(A, B, C), (A, B, E), (A, C, B), (A, C, D), (A, C, E), (A, D, C),
      (A, D, E), (B, C, E), (C, B, E), (C, D, E), (D, C, E)}
C4 = {(A, B, C, E), (A, B, E, C), (A, C, B, D), (A, C, B, E),
      (A, C, D, B), (A, C, D, E), (A, C, E, B), (A, C, E, D),
      (A, D, C, E), (A, D, E, C)}
L4 = {(A, B, C, E), (A, C, B, E), (A, C, D, E), (A, D, C, E)}
C5 = ∅
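The level-wise trace above can be sketched in code. The sketch below is a simplified rendering, not the full AprioriAll algorithm: support is counted per customer (with a threshold of one, as in the example), each event is a single page rather than an itemset, and, following the trace, a candidate never repeats a page. Function names are our own.

```python
def is_subseq(cand, seq):
    """True if cand occurs as a (not necessarily contiguous) subsequence of seq."""
    it = iter(seq)
    return all(page in it for page in cand)

def apriori_all(customer_seqs, min_support=1):
    """Level-wise mining of large sequences; support is counted per customer."""
    pages = sorted({p for s in customer_seqs for p in s})
    candidates = [(p,) for p in pages]
    levels = []
    while candidates:
        large = [c for c in candidates
                 if sum(is_subseq(c, s) for s in customer_seqs) >= min_support]
        if not large:
            break
        levels.append(large)
        prev = set(large)
        # Join step: append one new page to each large k-sequence; prune any
        # candidate that has a k-subsequence that is not large. Following the
        # trace above, a candidate never repeats a page.
        candidates = []
        for c in large:
            for p in pages:
                if p in c:
                    continue
                cand = c + (p,)
                if all(cand[:i] + cand[i + 1:] in prev for i in range(len(cand))):
                    candidates.append(cand)
    return levels

# The three customer sequences after the sort step (U1, U2, U3).
D = [("A", "B", "C", "B", "C", "E"), ("A", "C"), ("A", "C", "D", "C", "E")]
```

Running `apriori_all(D)` reproduces L1 through L4 of the trace; no candidate of length 5 survives the pruning step, consistent with C5 = ∅.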
The WAP-tree (web access pattern tree) has been proposed to facilitate efficient counting.
This tree is used to store the sequences and their counts. Once the tree is built, the
original database of patterns is not needed. Each node in the tree is associated with an
event (a page found at a particular time by a user). The node is labeled with the event and
with a count that is associated with the pattern prefix that ends at that event. Only individually
frequent events are added to the tree.
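The construction can be sketched as a trie insert, where each sequence is filtered to its individually frequent events and counts are incremented along the shared prefix. This is an illustrative sketch (class and function names are our own), not the complete WAP-tree structure, which additionally links nodes carrying the same label to support mining.

```python
class WapNode:
    """A WAP-tree node: an event label plus the count of sequence
    prefixes that end at this node."""
    def __init__(self, label):
        self.label = label
        self.count = 0
        self.children = {}   # event -> WapNode

def build_wap_tree(sequences, frequent_events):
    """Insert each sequence into the tree, keeping only the individually
    frequent events; counts along the shared prefix are incremented."""
    root = WapNode(None)
    for seq in sequences:
        node = root
        for event in seq:
            if event not in frequent_events:
                continue   # infrequent events never enter the tree
            node = node.children.setdefault(event, WapNode(event))
            node.count += 1
    return root
```

Once built, the tree alone carries the counts: a prefix shared by several sequences is stored once, with its count on the node where each prefix ends.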
Frequent Episodes. Episodes, which were originally proposed for telecommunication
alarm analysis, can also be applied to Web logs. All pages (corresponding to
events) are ordered by their access time, and the users usually need not be identified (i.e.,
no sessions). By definition, an episode is a partially ordered set of pages [MTV95]. In
addition, the individual page accesses must occur within a particular time frame. A serial
episode is an episode in which the events are totally ordered. Note that they need not be
contiguous, however. A parallel episode is a set of events where there need not be any
particular ordering. They still do need to satisfy the time constraint, however. Finally, a
general episode is one where the events satisfy some partial order. Note that even though
these seem similar to the ideas of sequential patterns and association rules, the added
constraint of a time window does make an episode different from either of them. The original
definition has no concept of user, but, of course, the idea of an episode could be applied
to events by one user or across users. In addition, episodes need not be maximal.
Example 7.7 illustrates the concept of episodes applied to the data in Example 7.4.
Here we keep the original ordering of the events.
EXAMPLE 7.7
The XYZ Corporation maintains a set of five Web pages {A, B, C, D, E}. Assume that the
data in Example 7.2 have the following sequence (independent of user): (A, B, C, A, C,
B, C, A, C, D, C, E). To find episodes, we must know the time window, so the following
sequence shows each event with an integer timestamp: ((A, 1), (B, 2), (C, 2), (A, 7),
(C, 10), (B, 10), (C, 12), (A, 12), (C, 13), (D, 14), (C, 14), (E, 20)). Suppose that we
wish to find general episodes where the support threshold is 30% and the time window
is 3. This means that only events that occur within a time window of 3 are valid.
We assume that the time window is the difference between the last event and the first
event in the episode. To illustrate episodes, Figure 7.7 illustrates the ordering of events
as shown in a DAG (directed acyclic graph) where arcs are used to represent temporal
ordering. The arcs are labeled with the time between successive events. Starting at the
first event and looking at the maximum window size of 3, we see that we have two serial
episodes: AC and AB. B and C occur as parallel episodes. Starting at the event at
time 12, we have the following serial episodes: ACD, ACC, CCC, CCD, AC, CC,
CC, CD. We have two parallel episodes: A and C, and C and D. There also is a general
episode that can be seen as the subgraph from time 12 to time 14. When taking the
frequency into account, an episode must occur a certain number of times in all windows.
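The window-constrained occurrences in Example 7.7 can be checked mechanically. The sketch below (function names are our own) tests whether a serial or a parallel episode occurs somewhere in a timestamped event list, measuring the window as the time of the last event minus the time of the first, as assumed above.

```python
from collections import Counter

# The timestamped events of Example 7.7.
EVENTS = [("A", 1), ("B", 2), ("C", 2), ("A", 7), ("C", 10), ("B", 10),
          ("C", 12), ("A", 12), ("C", 13), ("D", 14), ("C", 14), ("E", 20)]

def serial_occurs(episode, events, window):
    """True if the pages of `episode` occur in this order somewhere in
    `events` (sorted by time) with last-minus-first time <= window."""
    n = len(events)
    for i in range(n):          # try each event as the start of a window
        t0, k = events[i][1], 0
        j = i
        while j < n and events[j][1] - t0 <= window:
            if events[j][0] == episode[k]:
                k += 1
                if k == len(episode):
                    return True
            j += 1
    return False

def parallel_occurs(episode, events, window):
    """True if all pages of `episode` occur, in any order, within a window."""
    need = Counter(episode)
    n = len(events)
    for i in range(n):
        seen = Counter()
        j = i
        while j < n and events[j][1] - events[i][1] <= window:
            seen[events[j][0]] += 1
            j += 1
        if all(seen[p] >= c for p, c in need.items()):
            return True
    return False
```

With the window of 3 assumed above, the serial episode AC is found starting at time 1 and ACD starting at time 12, while no window contains both A and E.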
Maximal Frequent Forward Sequences. One approach to mining log traversal
patterns is to remove any backward traversals [CPY98]. Each raw session is transformed
into forward references (i.e., backward traversals and reloads/refreshes are removed), from
which the traversal patterns are then mined using improved level-wise algorithms. For
FIGURE 7.7: DAG for episodes.
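The transformation into maximal forward references can be sketched with a simple path stack: a revisited page signals a backward traversal, so the forward path built so far is emitted and the path is popped back to that page. This is an illustrative sketch of the idea, not the exact procedure of [CPY98]; the function name is our own.

```python
def maximal_forward_references(session):
    """Split one raw session into its maximal forward references.
    A page already on the current path signals a backward traversal:
    the forward path built so far is emitted (if it was still growing)
    and the path is popped back to the revisited page."""
    path, refs = [], []
    growing = False
    for page in session:
        if page in path:
            if growing:                       # emit only maximal paths
                refs.append(tuple(path))
            del path[path.index(page) + 1:]   # pop back to the revisited page
            growing = False
        else:
            path.append(page)
            growing = True
    if growing:                               # flush the final forward path
        refs.append(tuple(path))
    return refs
```

For a hypothetical session (A, B, C, B, D), the backward step to B splits the traversal into the two forward references (A, B, C) and (A, B, D).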