ON THE USE OF WAVELET TRANSFORM FOR PRIVACY PRESERVING
DATA MINING


S. Asif Kabir¹, A. M. Youssef² and A. K. Elhakeem¹

¹Department of Electrical and Computer Engineering
²Concordia Institute for Information System Engineering
Concordia University, Montreal, Quebec, Canada
{sm_asif,youssef,ahmed}@ece.concordia.ca



ABSTRACT
Data mining is the process of automatically searching large amounts of data to extract useful information and patterns using tools such as classification and association rule mining. Data mining often involves data that contains private information, such as healthcare or financial records, and there has been growing concern about the potential misuse of the personal information extracted from such data. In particular, the increasing ability to trace and collect large amounts of data with current technology has led to an interest in the development of data mining algorithms which preserve user privacy. Data perturbation is one of the well-known techniques for privacy preserving data mining.
In this paper, we investigate the use of the Discrete
Wavelet Transform (DWT) with truncation for data
perturbation. Our experimental results show that the
proposed method is effective in concealing the sensitive
information while preserving the performance of data
mining techniques after the data distortion.

KEY WORDS
Privacy preserving data mining, DWT


1. Introduction

Data mining or knowledge discovery in databases [1] is
the process of searching for useful and understandable
patterns in large volumes of data using tools such as
classification and association rule mining. Several data
mining applications deal with privacy-sensitive data such
as financial transactions and healthcare records. Because of the increasing ability to trace and collect large amounts of personal data, privacy preservation in data mining applications has become an important concern.
Data can be either collected in a centralized location or
collected and stored at distributed or scattered locations.
Different privacy concerns arise depending on the collection procedure. For example, in a distributed database setting, the main goal is to maintain the independence of the distributed data owners, which relates to the problem of data mining in a distributed environment. On the other hand, for centralized storage of data, the major privacy issue is to protect the exact values of the attributes from the data analysts.
Among the techniques that are used for privacy
preserving data mining are: query restriction, secure
multi-party computation, data swapping, distributed data
mining, and data perturbation [2]-[5]. In this work, we
focus on the latter approach, i.e., data perturbation. The
objective of data perturbation is to distort the individual
data values while preserving the underlying statistical
distribution properties. These data perturbation techniques are usually assessed in terms of both their privacy parameters [6] and their associated utility measures. While the privacy parameters reflect the ability of these techniques to hide the original data values, the data utility measures assess whether the dataset preserves the performance of data mining techniques after the data distortion.
The primary focus of this work is to explore a new data
perturbation approach for privacy preserving data mining.
In particular, we investigate the use of the discrete
wavelet transform (DWT) with truncation for data
perturbation. Our primary experimental results show that
the proposed method is effective in concealing the
sensitive information while preserving the performance of
data mining techniques after the data distortion.
The rest of the paper is organized as follows. In section 2,
we briefly review the Haar wavelet transform. The data
distortion and the utility measures used in this work are
reviewed in section 3 and section 4 respectively. The
proposed data perturbation approach is described in
section 5. Some of the experimental results obtained when
using the proposed algorithm on some real world datasets
are presented in section 6. Finally, conclusions and future work are given in section 7.

2. Haar Wavelet Transform

Wavelet transforms [9]-[11] are mathematical tools for
hierarchically decomposing functions. They allow a
function to be described in terms of a coarse overall level,
plus details that range from broad to narrow. Wavelets
offer an elegant technique for representing the levels of
detail present. A wavelet transformation converts data
from an original domain to a wavelet domain by
expanding the raw data in an orthonormal basis generated
by dilation and translation of a father and mother wavelet.
Wavelet transformation preserves the structure of the data. A contracted, high-frequency version of a wavelet performs temporal analysis, whereas a dilated, low-frequency version of the same wavelet performs frequency analysis.
Unlike the Fourier transform, which uses sines and cosines as basis functions, the wavelet transform employs more complicated basis functions. While individual wavelet functions are localized in space, the Fourier sine and cosine functions are not. Moreover, the wavelet transform does not have a single set of basis functions like the Fourier transform. The localization property of wavelets makes many functions and operators sparse when transformed into the wavelet domain. This sparseness has been exploited in various applications such as data compression and noise removal.
In this work, we use the Haar wavelet transform, which is based on one of the simplest possible wavelets, the Haar wavelet, described by the step function

$$\phi(x) = \begin{cases} 1, & 0 \le x < \tfrac{1}{2} \\ -1, & \tfrac{1}{2} \le x < 1 \\ 0, & \text{otherwise.} \end{cases}$$

In order to avoid unnecessary mathematical notation, we will describe the Haar wavelet transform by providing a simple example. The one-dimensional Haar transform of a function f can be viewed as a series of averaging and differencing operations on a discrete function. We compute the averages and differences between every two adjacent values of f(x). The procedure to find the Haar transform of a discrete function f(x) = [a0 a1 a2 a3] is shown in Table 1. In this example, resolution 4 is the full resolution.

Resolution   Approximation                               Detail coefficients
4            [a0 a1 a2 a3]                               -
2            [b0 b1] = [(a0+a1)/2, (a2+a3)/2]            [(a0-a1)/2, (a2-a3)/2]
1            [(b0+b1)/2]                                 [(b0-b1)/2]

Table 1: Illustration of the one-dimensional Haar wavelet transform.

For example, if f(x) = [7 5 6 2], then we have

Resolution   Approximation   Detail coefficients
4            [7 5 6 2]       -
2            [6 4]           [1 2]
1            [5]             [1]

Table 2: The one-dimensional Haar wavelet transform applied to f(x) = [7 5 6 2].
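As an illustration of the averaging and differencing operations above, the following short Python sketch (ours, not part of the paper) computes one decomposition level and reproduces the values of Table 2:

import numpy as np

def haar_step(signal):
    """One level of the averaging/differencing Haar transform of Table 1."""
    signal = np.asarray(signal, dtype=float)
    approx = (signal[0::2] + signal[1::2]) / 2.0   # pairwise averages
    detail = (signal[0::2] - signal[1::2]) / 2.0   # pairwise differences
    return approx, detail

# Reproduce the example of Table 2: f(x) = [7, 5, 6, 2]
approx, detail = haar_step([7, 5, 6, 2])
print(approx, detail)       # [6. 4.] [1. 2.]
print(haar_step(approx))    # ([5.], [1.])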

Note that another definition of the Haar wavelet transform, which differs from the above definition by a factor of $\sqrt{2}$, also exists. However, this constant factor is irrelevant to our work, since the distorted dataset is obtained by applying the inverse transform again (see section 5). It can also be shown that the above transform is equivalent to multiplying the vector representation of f by an integer (wavelet) matrix, which can be computed more efficiently than the analogous Fourier matrix.
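To make this matrix interpretation concrete, the level-one averaging/differencing step of Table 1 can be written as the matrix-vector product below. This is our own worked illustration (with d0, d1 denoting the detail coefficients), not an equation from the paper; the 1/2 factor is precisely the constant discussed above.

$$\begin{pmatrix} b_0 \\ b_1 \\ d_0 \\ d_1 \end{pmatrix}
= \frac{1}{2}
\begin{pmatrix}
1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 \\
1 & -1 & 0 & 0 \\
0 & 0 & 1 & -1
\end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{pmatrix}$$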

Multi-dimensional wavelets are usually defined via the tensor product. The two-dimensional wavelet basis consists of all possible tensor products of the one-dimensional basis functions. Applying the Haar wavelet transform to a two-dimensional matrix results in four sets of coefficients: the approximation coefficients matrix, cA, and the horizontal, vertical and diagonal detail coefficients matrices (called cH, cV, and cD, respectively).
For example, when applying a single-level Haar wavelet transform to the matrix
$$\begin{pmatrix} 3 & 5 \\ 9 & 8 \end{pmatrix},$$
we obtain
cA (the overall average) = (3 + 5 + 9 + 8)/4 = 6.25,
cH (the average of the difference of the row sums) = ((3 + 5) - (9 + 8))/4 = -2.25,
cV (the average of the difference of the column sums) = ((3 + 9) - (5 + 8))/4 = -0.25, and
cD (the average of the difference of the diagonal sums) = ((3 + 8) - (9 + 5))/4 = -0.75.
We can think of the approximation coefficients as a
coarser resolution version of the original signal, and the
detail (differences) coefficients as the higher resolution
details.
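The coefficients of this 2x2 example can be verified with a few lines of Python; the sketch below is our own illustration of the averaging convention used in the text, not code from the paper.

import numpy as np

def haar2d_block(block):
    """Single-level Haar coefficients of a 2x2 block, using the averaging
    convention of the text (overall average, row/column/diagonal differences)."""
    (a, b), (c, d) = np.asarray(block, dtype=float)
    cA = (a + b + c + d) / 4.0          # approximation (overall average)
    cH = ((a + b) - (c + d)) / 4.0      # horizontal detail (row-sum difference)
    cV = ((a + c) - (b + d)) / 4.0      # vertical detail (column-sum difference)
    cD = ((a + d) - (b + c)) / 4.0      # diagonal detail
    return cA, cH, cV, cD

print(haar2d_block([[3, 5], [9, 8]]))   # (6.25, -2.25, -0.25, -0.75)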

3. Data distortion measure

Throughout this work, we adopt the same set of privacy
parameters proposed in [6]. The value difference (VD)
parameter is used as a measure for value difference after
the data distortion algorithm is applied to the original data
matrix. Let $V$ and $\bar{V}$ denote the original and distorted data matrices, respectively. Then, VD is given by
$$VD = \frac{\|V - \bar{V}\|}{\|V\|},$$
where $\|\cdot\|$ denotes the Frobenius norm of the enclosed argument.
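As a concrete illustration, VD can be computed with NumPy as follows (our own sketch, not code from the paper):

import numpy as np

def value_difference(V, V_bar):
    """VD = ||V - V_bar||_F / ||V||_F, using Frobenius norms."""
    V, V_bar = np.asarray(V, dtype=float), np.asarray(V_bar, dtype=float)
    return np.linalg.norm(V - V_bar, 'fro') / np.linalg.norm(V, 'fro')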
After the data distortion, the order of the values of the data elements also changes. Several metrics are used to measure the position difference of the data elements. For a dataset $V$ with $n$ data objects and $m$ attributes, let $Rank_j^i$ denote the rank (in ascending order) of the $j$-th element in attribute $i$. Similarly, let $\overline{Rank}_j^i$ denote the rank of the corresponding distorted element. The RP parameter is used to measure the position difference. It indicates the average change of rank for all attributes after distortion and is given by
$$RP = \frac{1}{nm}\sum_{i=1}^{m}\sum_{j=1}^{n}\left|Rank_j^i - \overline{Rank}_j^i\right|.$$
RK represents the percentage of elements that keep their rank in each column after distortion and is given by
$$RK = \frac{1}{nm}\sum_{i=1}^{m}\sum_{j=1}^{n} Rk_j^i,$$
where $Rk_j^i = 1$ if the element keeps its position in the order of values, and $Rk_j^i = 0$ otherwise.
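A minimal NumPy sketch of RP and RK is given below (ours, not the authors' code; it assumes V and V_bar are n x m matrices with one attribute per column, and that ties in the ranking are broken arbitrarily by argsort):

import numpy as np

def column_ranks(V):
    """Ascending rank of each element within its attribute (column)."""
    V = np.asarray(V, dtype=float)
    return np.argsort(np.argsort(V, axis=0), axis=0)

def rp_rk(V, V_bar):
    """RP: average absolute rank change; RK: fraction of elements keeping rank."""
    r, r_bar = column_ranks(V), column_ranks(V_bar)
    rp = np.mean(np.abs(r - r_bar))
    rk = np.mean(r == r_bar)
    return rp, rk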

Similarly, the CP parameter is used to measure how the rank of the average value of each attribute varies after the data distortion. In particular, CP defines the change of rank of the average value of the attributes and is given by
$$CP = \frac{1}{m}\sum_{i=1}^{m}\left|RankV^i - Rank\bar{V}^i\right|,$$
where $RankV^i$ and $Rank\bar{V}^i$ denote the rank of the average value of the $i$-th attribute before and after the data distortion, respectively.
Similar to RK, CK is used to measure the percentage of the attributes that keep the rank of their average value after distortion.
From the data privacy perspective, a good data distortion algorithm should result in high values for the RP and CP parameters and low values for the RK and CK parameters.
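CP and CK can be sketched along the same lines (again our own illustrative code, under the same tie-breaking assumption as above):

import numpy as np

def cp_ck(V, V_bar):
    """CP: average rank change of attribute means; CK: fraction keeping rank."""
    V, V_bar = np.asarray(V, dtype=float), np.asarray(V_bar, dtype=float)
    rank = lambda x: np.argsort(np.argsort(x))   # ascending rank of a vector
    r, r_bar = rank(V.mean(axis=0)), rank(V_bar.mean(axis=0))
    cp = np.mean(np.abs(r - r_bar))
    ck = np.mean(r == r_bar)
    return cp, ck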


4. Utility measure

The data utility measures assess whether the dataset preserves the performance of data mining techniques after the data distortion. Throughout this work, we use the accuracy of a simple K-nearest neighbor (KNN) classifier [8] as our data utility measure. We use the Euclidean distance as the similarity measure between vectors.
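For reference, such a KNN-based utility measure could be computed with scikit-learn as in the sketch below; the library choice and the helper name knn_accuracy are ours, not specified in the paper.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def knn_accuracy(X_train, y_train, X_test, y_test, k):
    """Classification accuracy of K-nearest neighbors with Euclidean distance."""
    clf = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))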


5. Proposed Approach

The proposed data distortion approach (see Figure 1) can
be summarized as follows:

1. Perform the 2D Haar wavelet transform on the original
dataset.
2. Truncate the detail coefficients that are below a pre-
specified threshold value.
3. Perform the inverse transform to obtain the perturbed
dataset.
4. Iterate the above steps until satisfactory privacy
parameters and utility measures are obtained.
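A minimal sketch of these steps using the PyWavelets package (pywt) is given below. The threshold value and the choice of which detail subbands to truncate are left as tunable parameters; this is our own illustrative reconstruction of the procedure, not the authors' code.

import numpy as np
import pywt

def perturb_dataset(data, threshold, subbands=('cH', 'cV', 'cD')):
    """One iteration of the proposed perturbation: 2D Haar DWT, truncation of
    small detail coefficients in the chosen subbands, then inverse transform."""
    cA, (cH, cV, cD) = pywt.dwt2(np.asarray(data, dtype=float), 'haar')
    details = {'cH': cH, 'cV': cV, 'cD': cD}
    for name in subbands:
        d = details[name]
        d[np.abs(d) < threshold] = 0.0   # zero out detail coefficients below the threshold
    return pywt.idwt2((cA, (details['cH'], details['cV'], details['cD'])), 'haar')

Setting a very large threshold for a given subband truncates that entire set of detail coefficients to zero, which corresponds to the experiments reported in section 6.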





Figure 1: Proposed data distortion technique (block diagram: Original Dataset -> Haar DWT -> Detail Coefficient Truncation -> Haar IDWT -> Perturbed Dataset).


6. Experimental results

In order to test the performance of our proposed method,
we conducted a series of experiments on some real world
datasets. In this section, we present a sample of the results
obtained when applying our technique to the original
Wisconsin Breast Cancer and Ionosphere databases downloaded from the UCI Machine Learning Repository [7].

For the Breast Cancer database, we used 569 observations
and 30 attributes to perform our experiment. For the
classification task, 80% of the data was used for training
and the other 20% was used for testing. Throughout our
experiments, we set K=30 for the KNN classifier. The
corresponding classification accuracy on the original
dataset is 92.11%.
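This setup could be reproduced along the following lines (a hypothetical sketch using scikit-learn's bundled copy of the Wisconsin Breast Cancer data; the random split, and hence the exact baseline accuracy, will differ from the paper's 92.11%):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 569 observations, 30 attributes; 80/20 train/test split; K = 30, as in the text.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = KNeighborsClassifier(n_neighbors=30, metric='euclidean').fit(X_tr, y_tr)
print('baseline KNN accuracy:', clf.score(X_te, y_te))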

Table 3 shows how the accuracy and privacy parameters
vary when the truncation process is applied to one of the
detail coefficients (cV, cH or cD) in the Haar wavelet
transform and then the dataset is reconstructed by applying the inverse transform using these truncated coefficients.



Truncated coefficients   RP     RK      CP     CK    VD     ACC%
cV                       60.4   0.013   1.73   0.2   0.60   96.49
cH                       91.9   0.009   0      1     0.16   98.25
cD                       60.6   0.014   0      1     0.14   98.25

Table 3: Influence of truncating the detail coefficients on the privacy and accuracy parameters (Breast Cancer database).


It should be noted that the classification accuracy in all three cases is higher than the corresponding accuracy on the original dataset. On the other hand, for this particular dataset, truncating cH and cD resulted in CP = 0, i.e., the rank of the average value of each attribute did not change in the distorted dataset, which is a somewhat undesirable feature.
Table 4 shows the corresponding results when a pair of the detail coefficients (i.e., cH and cV; cH and cD; or cV and cD) in the Haar wavelet transform is truncated to zero and then the dataset is reconstructed by applying the inverse transform to these truncated coefficients.
Again, a high level of classification accuracy is maintained in all three cases. On the other hand, it is clear that truncating both cH and cV (denoted as cHV in the table) results in better privacy parameters. This conclusion should be interpreted with care, since the above results may change when the same technique is applied to a different dataset.



Truncated coefficients   RP      RK      CP    CK     VD     ACC%
cHV                      77.51   0.017   1.8   0.26   0.63   98.25
cHD                      74.7    0.007   0     1      0.23   98.25
cVD                      50.22   0.06    1.7   0.26   0.63   92.11

Table 4: Influence of truncating pairs of detail coefficients on the privacy and accuracy parameters (Breast Cancer database).

Tables 5 and 6 show the corresponding results when we run our algorithm on the Ionosphere database from the UCI repository [7]. This dataset has 351 instances and 35 attributes, including the class attribute. We used 200 instances as training data and the remaining 151 as test data. We also set K=13 (for which the classification accuracy on the original dataset was 93.38%). The results obtained clearly show some tradeoff between the accuracy and the privacy parameters.

Truncated coefficients   RP     RK      CP     CK     VD     ACC%
cV                       50.3   0.019   8.76   0.03   0.56   90.07
cH                       33.8   0.016   0      1      0.33   92.72
cD                       34.7   0.014   0      1      0.35   91.39

Table 5: Influence of truncating the detail coefficients on the privacy and accuracy parameters (Ionosphere database).



Truncated coefficients   RP      RK      CP    CK     VD      ACC%
cHV                      61.33   0.039   8.9   0.05   0.64    87.42
cHD                      47.26   0.039   0     1      0.242   93.38
cVD                      61.5    0.03    9.1   0.03   0.65    91.39

Table 6: Influence of truncating pairs of detail coefficients on the privacy and accuracy parameters (Ionosphere database).


7. Conclusion

Privacy advocates often view privacy versus obtaining knowledge from data as an either-or situation. Privacy preserving data mining techniques that demonstrate how knowledge can be obtained from databases while individuals' privacy is preserved enable a more reasonable debate. In this work, we have presented a new algorithm for privacy preserving data mining based on the DWT with truncated coefficients. Based on the presented experimental results, the proposed method is effective in concealing the sensitive information while preserving the performance of data mining techniques after the data distortion. On the other hand, while the privacy parameters used in this work provide some indication of the ability of these techniques to hide the original data values, it would be interesting to quantitatively relate these parameters to the actual work required to break these data perturbation techniques.



References
[1] M. Chen, J. Han, and P. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Trans. Knowledge and Data Engineering, vol. 8, 1996.
[2] Z. Yang, S. Zhong, and R. N. Wright, "Privacy-preserving classification of customer data without loss of accuracy," in Proceedings of the 5th SIAM International Conference on Data Mining, Newport Beach, CA, April 21-23, 2005.
[3] R. Agrawal and A. Evfimievski, "Information sharing across private databases," in Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, CA, pp. 86-97.
[4] D. Agrawal and C. C. Aggarwal, "On the design and quantification of privacy preserving data mining algorithms," in Proceedings of the 20th ACM SIGMOD Symposium on Principles of Database Systems, pp. 247-255, Santa Barbara, CA, May 2001.
[5] R. Agrawal and R. Srikant, "Privacy-preserving data mining," in Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, Dallas, TX, May 2000. ACM Press.
[6] S. Xu, J. Zhang, D. Han, and J. Wang, "Data distortion for privacy protection in a terrorist analysis system," in P. Kantor et al. (Eds.): ISI 2005, LNCS 3495, pp. 459-464, 2005.
[7] UCI Machine Learning Repository. http://www.ics.uci.edu/mlearn/mlsummary.html
[8] R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley and Sons, 2001.
[9] A. Graps, "An introduction to wavelets," IEEE Computational Science and Engineering, vol. 2, no. 2, IEEE Computer Society Press, 1995.
[10] T. Li, Q. Li, S. Zhu, and M. Ogihara, "A survey on wavelet applications in data mining," ACM SIGKDD Explorations, vol. 4, no. 2, pp. 49-68, 2002.
[11] C. Taswell, Handbook of Wavelet Transform Algorithms. Boston, MA: Birkhäuser, 1996.