ABSTRACT
This engineering diploma project covers the implementation of an application that
classifies text data and the analysis of its performance under parallelization. The thesis
presents selected theoretical and implementation issues concerning data classification
techniques, distribution of computation and validation of the gathered results. Many
implementations of similar problems have been presented to date; this project focuses on one
chosen method, which promises high performance and satisfactory classification
results.
To achieve satisfactory results without a large loss of performance, a k-nearest
neighbors (KNN) algorithm was used to classify the vectors describing articles. The distance
between objects is defined by cosine similarity. Custom modifications were introduced to
the classification process; they are described in detail in subsection 2.1.3, and their results
are analyzed in the Evaluation chapter.
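The basic scheme named above can be illustrated with a minimal sketch. It assumes plain
majority voting among the k most similar training vectors and does not reproduce the custom
modifications from subsection 2.1.3 or the distributed implementation. The function names
`cosine_similarity` and `knn_classify` and the sample labels are hypothetical and chosen only
for this illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an all-zero vector is treated as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query, training, k):
    """Return the majority label among the k training vectors most similar to query.

    training is a list of (vector, label) pairs.
    """
    ranked = sorted(
        training,
        key=lambda item: cosine_similarity(query, item[0]),
        reverse=True,  # highest similarity first
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: three-dimensional article vectors with made-up labels.
training = [
    ([1.0, 0.0, 0.0], "sport"),
    ([0.9, 0.1, 0.0], "sport"),
    ([0.0, 1.0, 0.0], "science"),
    ([0.0, 0.8, 0.2], "science"),
]
predicted = knn_classify([1.0, 0.2, 0.0], training, k=3)
```

In this toy data the two "sport" vectors are the most similar to the query, so with k=3 the
vote is two to one in favor of "sport".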
The main purpose of this thesis is to analyze the influence of distributed computing
on classification performance. The Apache Hadoop software library was used to distribute
tasks and data between cluster nodes. The processes we decided to distribute are the
normalization of texts into data vectors, the calculation of cosine similarity, folding and
cross-validation.
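Folding and cross-validation can be sketched as follows. The sketch assumes that "folding"
means splitting the article set into contiguous folds of nearly equal size, each used once as
the test set; it runs sequentially on one machine and says nothing about how the project
distributes this step with Apache Hadoop. The name `k_fold_splits` is hypothetical.

```python
def k_fold_splits(n_items, n_folds):
    """Yield (train_indices, test_indices) pairs, one pair per fold.

    Folds are contiguous and their sizes differ by at most one item.
    No shuffling is performed (an assumption of this sketch).
    """
    base, extra = divmod(n_items, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


# Example: 10 articles split into 3 folds.
splits = list(k_fold_splits(10, 3))
```

Each article index appears in exactly one test set, so every article is classified once by a
model that did not see it during training.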
As part of the work, a series of thorough tests of classification quality and
efficiency was conducted, from which it was concluded that the optimal number of
nearest neighbors is about 1-5% of all the articles taking part in classification. In
addition, a bottleneck was found: an increase in efficiency can probably be achieved by
improving the folding and cross-validation process.
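As a worked illustration of the 1-5% rule, the sketch below converts a number of articles
into a candidate range for k. The function name `neighbor_count_range`, the rounding and the
lower bound of one neighbor are assumptions made for this example, not part of the tested
application.

```python
def neighbor_count_range(n_articles, low=0.01, high=0.05):
    """Return (k_min, k_max) covering roughly 1-5% of the articles.

    Assumption: values are rounded to integers and k is never below 1.
    """
    k_min = max(1, round(n_articles * low))
    k_max = max(k_min, round(n_articles * high))
    return k_min, k_max


# For 2000 articles the rule suggests k between 20 and 100.
k_min, k_max = neighbor_count_range(2000)
```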
The application is stable and fully scalable. Classification can be run on any cluster, with
any number of nodes and any size of the training dataset. The only usage constraint of the
application is the execution time of some stages of classification for large data sets (>10000
articles). The current state of the application is a solid base for future research on text
data classification, and the implementation is prepared in such a way that future quality or
efficiency enhancements should be easy to introduce.
The engineering diploma project was carried out by four co-authors. Their contributions
to the thesis are as follows:
● Wojciech Stanisławski - contributed to chapters 2.2.1, 3.1, 3.2.1 and 4; sole author of
chapters 3.3.2, 3.4.2 and 3.5
● Marcin Goławski - contributed to chapter 4; sole author of chapter 2.1
● Krzysztof Świeczkowski - contributed to chapters 2.2.1, 3.1, 3.2.1 and 4; sole author of
chapters 2.2.2, 3.2.2 and 3.4.1
● Artur Peplinski - contributed to chapters 2.1.4 and 4; sole author of chapters 2.2.3,
2.2.4 and 3.3.1
Keywords: Big data, Apache Hadoop, KNN classification, Wikipedia