Overlapping community detection in Large-Scale Networks using BigCLAM model build on Apache Spark

thang19xx 1,071 views 52 slides Aug 20, 2017
Slide 1
Slide 1 of 52
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37
Slide 38
38
Slide 39
39
Slide 40
40
Slide 41
41
Slide 42
42
Slide 43
43
Slide 44
44
Slide 45
45
Slide 46
46
Slide 47
47
Slide 48
48
Slide 49
49
Slide 50
50
Slide 51
51
Slide 52
52

About This Presentation

In this undergraduate thesis, I provide a general view of communities and its the real life applications. In recent years, with the rapid growth of network scale, it is a difficult task to detect overlapping communities in large-scale networks for state of the art methods. This method is implemented...


Slide Content

UNDERGRADUATE THESIS DEFENSE
Community detection in large-scale networks
July 7, 2017
Advisor: Ph.D. Tran Vinh Duc
Student name: Nguyen Duc Thang
Student code: A22852
Computer science - Mathematics and Informatics
ThangLongUniversity

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Overview
.
In this thesis, I provide a general view of communities and its the
real life applications. I introduceBigCLAM modelsproposed by
Yang and Leskovec (2013), a popular model is used overlapping
community detection algorithm. In particular, I proposed a few
methods convex optimization and implemented BigCLAM in
Apache Sparkis evaluated as lightning-fast cluster computing to
able detect community in thelarge-scale networks.
2

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Overview
1.Large-scale Network & Community
2.BigCLAM
3.Grid computing
4.Demo & conclusion
3

LARGE-SCALE NETWORK &
COMMUNITY

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Network
Networks are a general language for describing complex systems.
Networks are everywhere!
→Friends & family
→Society
→media & information
→world economy
→roads
→human cell
→brain
5

Socialmedia
networks

World Wide Web
.

Biological
network.
.

Brain
network.
.

Article on Business Insider Australia, 2015
Data generated every 60s.
How are COMPLEX & LARGE-SCALE NETWORKS?

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
What are Community Networks?
.Network clusters or Communities.
→Cluster:A set of appropriately connected nodes.
→Community:Nodes with a shared latent property.
Many reasons and benefits for why communities form.
→InWorld Wide Weblink farms
can be easily detected with
community detection. A link farm is a
group of websites which hyperlink to
every other site in the group.
→Innetwork community of
customerwith similar interest can
be used to make recommender
system to enhance the business
→Communities will help us to
understand the structure ofsocial
network. Communities clarify the
properties and functions of network.
→Inbiological network
communities helps us to understand
basic mechanisms which control
normal cellular processes.
12

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Finding communities
Q:How and why do communities form?
A:Find weak ties and identify communities
Givan-Newman’s betweenness centrality, modularity and graph partitioning
methods are all based on this idea.
Q:
Can It find overlapping communities and
communities in large-scale network?
A:
No!It dont detect overlapping communities and
often has Big O((n+m)mn
2
)
13

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Overlapping communities
[McAuley, Leskovec: Discovering social circles in ego networks, 2012]
14

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Edge Probability
Nodesuandvsharekgroups.
What is edge prob
P(edgejk)as a func ofk?
[Stanford University: http://snap.stanford.edu/agm]
15

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Edge Probability
Nodesuandvsharekgroups.
What is edge prob
P(edgejk)as a func ofk?
Overlaps are
DENSER!
[Stanford University: http://snap.stanford.edu/agm]
16

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Overlaps are DENSER!
[Stanford University: http://snap.stanford.edu/agm]
17

BIGCLAM

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
BigCLAM
Memberships have strengths
[Stanford University: http://snap.stanford.edu/agm]
→FuA: The membership strength of nodeuto communityA
(FuA= 0: no membership)
→Each communityAlinks nodes independently:
PA(u;v) = 1exp(FuA:FvA)
19

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Factor matrixF
Community membership strength matrixF
PA(u;v) = 1exp(FuA:FvA)
P(u;v) = 1exp(Fu:F
T
v)
Probability of connection is
proportional to the product of
strengths.
Notice:If one node doesn’t belong
to the community (FuC=0) then
P(u;v) = 0
20

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
BigCLAM: Flexibility
BigCLAM canexpress variety of network structures:
Non-overlapping, Nested, Overlapping.
[Stanford University: http://snap.stanford.edu/agm]
21

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Maximum Likelihood Estimation of Graph
Task: Given a networkG(V;E), estimateF
WithF, thelikelihoodof the graph is computed by
P(GjF) =

(u;v)2E
p(u;v)

(u;v)̸=E
(1p(u;v)):
(1)
Wherep(u;v) = 1exp(Fu:F
T
v)
Goal: FindFthat maximizes the log-likelihood
^F=arg max
F0
l(F) (2)
Wherel(F) =logP(GjF)
l(F) =

(u;v)2E
log(1exp(FuF
T
v))

(u;v)̸=E
exp(FuF
T
v)(3)
22

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Convex optimization
Maximization as convex optimization problem. Equationl(F)can
be broken down into one sub-problem peru2 V.
^Fu=arg max
Fuc0;c2C
l(Fu) (4)
So,l(Fu)is computed by
l(Fu) =

v2N(u)
log(1exp(FuF
T
v))

v/2N(u)
FuF
T
v (5)
WhereN(u) =fv:v2 V;(u;v)2 Egis neighbours ofu
23

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Convex optimization
Compute gradient of a single rowFuofF:
▽(l(Fu)) =
@l(Fu)
@Fu
=

v2N(u)
Fv
exp(FuF
T
v)
1exp(FuF
T
v)


v/2N(u)
Fv(6)
Stochiate Gradient ascent method:
*Iterate over the rows ofF
1.Compute gradient▽(l(Fu))of rowu
2.Update the rowFu:Fu Fu+:▽(l(Fu))
3.ProjectFuback to a non-negative vector: IfFuC<0 :FuC= 0
This is slow! Computing▽(l(Fu))take linear time withO(jVj).
24

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Gradient descent method
25

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Learning rate
Large learning rate Small learning rate
Need good way to choose and adjust a “learning rate”!
26

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Backtracking line search
27

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
MBSGD method
28

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Complexity reduction
Two equations below, the time complexity is reduced toO(jN(u)j)
with (jN j ≪ jVj).

v/2N(u)
FuF
T
v=

v
FuF
T
vFuF
T
u

v2N(u)
FuF
T
v (7)
WhereN(u) =u[ N(u)is neighbourbood ofu

v/2N(u)
Fv=
(

v
Fv
)
Fu
0
@

v2N(u)
Fv
1
A (8)
29

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Stopping condition
Withl

(Fu)is the log-likelihood ofFforuafter update by gradient
ascent method. The iterative ascent process terminates if:
l

(Fu)l(Fu)
l(Fu)
<0:0001;8u2 V (9)
the log-likelihood increases by less than0:01%8u2 V
30

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Initialization ofF
LetFbe an affiliation weights matrix. The algorithm initialisesF
as follows:
F
(u

)(N(u))=
8
>
<
>
:
1ifu

2N(u)andN(u)is a locally
minimal neighbourhood ofu
0otherwise
(10)
WhereN(u)is said to be locally minimal if it has lower
conductance than allN(v);v2 N(u).
φ(N(u))φ(N(v));8v2 N(u).
φ(S) =

i2S;j/2S
Aij
min(A(S);A(
P
S))
(11)
WhereA(S) =

i2S

j2V
Aij
31

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Determining community affiliations
LetG(V;E),Fbe the most likely affiliation weights matrix under
G.u2 Vas a member ofc2 Cif:
FucK=

log(1ϵ) (12)
Whereϵ=
2jEj
jVj(jVj1)
32

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Determining number of communities
1.Select20%node pairs randomly formGas hold out setH
2.For everyKas a candidate on number of community:
→Run community detection algorithm, withKas the number of
communities.K2[minCom;maxCom]with stepsize:
=exp
(
log
(
maxCom
minCom
)

1
divCom
)
→Evaluate the likelihood of affiliation weights with respect to
H,lk(FH)
3.Obtain the most likely number of communities
^
K=arg maxKlk(FH)
33

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Experiments
[Stanford University: http://snap.stanford.edu/agm]
34

[Stanford University: http://snap.stanford.edu/agm]

GRID COMPUTING

Grid computing
is a type of parallel and distributed system or a computer network
in which each computer’s resources are shared with every other
computer in the system.
.

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Benefits & Challenges
Manybenefitsandchallengesof grid
computing.
Benefits
→Can solve larger, more
complex problems in a
shorter time
→Easier to collaborate with
other organizations
→Make better use of existing
hardware
→etc.
Challenges
→Some messages can be
lost in the network
system
→Security problem due to
sharing
→Schedule and
optimization problem
→etc.
38

Apache Hadoop
Apache Hadoop is a framework that allows for distributed
processing of large data sets across clusters of commodity
computers using a simple programming model.
.

Apache Spark
is an open-source luster-computing framework for real time
processing developed by the Apache Software Foundation.
Spark provides an interface for programming entire clusters with implicit data parallelism
and fault-tolerance.
It was built on top of Hadoop MapReduce and it extends the MapReduce model to efficiently
use more types of computations.
.

Apache Spark Architecture

Resilient Distributed
Datasets
Transformations
map(), mapPartitions(),
filter(), …
Actions
count(), collect(), sortBy(),
reduce(), …

Graphx
GraphX is Apache Spark’s API for graphs and graph-parallel
computation.

Map-Reduce Model
.

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
The information of grid
Specs Number Role
OS: Ubuntu16:04LTS
HDD:95GB
RAM:4GB
CPU: intel Core i5-4590U CPU3:30GHzx4
1 master
OS: Ubuntu16:04LTS
HDD:95GB
RAM:4GB
CPU: intel Core i5-4590U CPU3:30GHzx4
9 workers
45

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Experiment and Performance
Table:The table shows detail network is used to experiment.
Name Nodes Edges
Ego-Facebook 4;039 88 ;234
My ego-Facebook 20;812 27 ;282
com-Amazon 334;863 925;872
com-Youtube 1;134;890 2;987;624
46

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Evaluation
AverageF1score: To compute the F1 score, we need to determine
whichCi2 Ccorresponds to whichC

i
2 C

. F1 score is computed:
1
2
0
@
1
K

Ci2C
F1(Ci;C

g(i)
) +
1
K


C

i
2C

F1(C
g

(i);C

i)
1
A
Where the best matchinggandg

is defined as follows:
g(i) =arg max
j
F1(Ci;C

j);g

(i) =arg max
j
F1(Cj;C

i)
Name F1 score
com-Amazon 0:49
com-Youtube 0:42
47

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
References
1.David F. Gleich and C. Seshadhri. Neighborhoods are good
communities.CoRR abs/1112.0031, 2011.
2.Jaewon Yang and Jure Leskovec. Overlapping community
detection at scale: a nonnegative matrix factorization
approach. In Proceedings of the sixth ACM international
conference on Web search and data mining, pages 587–596.
ACM, 2013
3.John Aldrich et al. Ra fisher and the making of maximum
likelihood 1912-1922. Statistical Science, 12(3):162–176, 1997
48

DEMO & CONCLUSION

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Main contributions
→Overlapping community detection algorithm using BigCLAM
model has successfully implemented.
→This algorithm able to detect community in large-scale on
Apache Spark.
→MBSGD method and backtracking line search method have
proposed working better than the author’s methods on
computing grid.
→This algorithm able to detect communities in the large-scale
network.
50

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Future work
→Research some techniques and redesign code to improve the
performance algorithm on computing grid system and
accuracy of algorithm.
→Design and organise cluster computer to create a
supercomputing grid system.
→Apply this algorithm to solve real-life problems.
→Build recommendation system in social network in education
for ThangLong University project. It will increase the student’s
performance in study and activities.
→Build effective advertising campaign by understanding
structure social.
→etc.
51

Thank you!
Thank you for watching and I hope you enjoyed my
presentation. If you have any questions, I’ll be happy to
answer them.