Overlapping community detection in Large-Scale Networks using BigCLAM model build on Apache Spark

thang19xx 1,071 views 52 slides Aug 20, 2017

Slide 1 of 52

About This Presentation

In this undergraduate thesis, I provide a general view of communities and its the real life applications. In recent years, with the rapid growth of network scale, it is a difficult task to detect overlapping communities in large-scale networks for state of the art methods. This method is implemented...

Size: 11.91 MB

Language: en

Added: Aug 20, 2017

Slides: 52 pages

Slide Content

UNDERGRADUATE THESIS DEFENSE
Community detection in large-scale networks
July 7, 2017
Advisor: Ph.D. Tran Vinh Duc
Student name: Nguyen Duc Thang
Student code: A22852
Computer science - Mathematics and Informatics
ThangLongUniversity

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Overview
.
In this thesis, I provide a general view of communities and its the
real life applications. I introduceBigCLAM modelsproposed by
Yang and Leskovec (2013), a popular model is used overlapping
community detection algorithm. In particular, I proposed a few
methods convex optimization and implemented BigCLAM in
Apache Sparkis evaluated as lightning-fast cluster computing to
able detect community in thelarge-scale networks.
2

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Overview
1.Large-scale Network & Community
2.BigCLAM
3.Grid computing
4.Demo & conclusion
3

LARGE-SCALE NETWORK &
COMMUNITY

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Network
Networks are a general language for describing complex systems.
Networks are everywhere!
→Friends & family
→Society
→media & information
→world economy
→roads
→human cell
→brain
5

Socialmedia
networks

World Wide Web
.

Biological
network.
.

Brain
network.
.

Article on Business Insider Australia, 2015
Data generated every 60s.
How are COMPLEX & LARGE-SCALE NETWORKS?

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
What are Community Networks?
.Network clusters or Communities.
→Cluster:A set of appropriately connected nodes.
→Community:Nodes with a shared latent property.
Many reasons and benefits for why communities form.
→InWorld Wide Weblink farms
can be easily detected with
community detection. A link farm is a
group of websites which hyperlink to
every other site in the group.
→Innetwork community of
customerwith similar interest can
be used to make recommender
system to enhance the business
→Communities will help us to
understand the structure ofsocial
network. Communities clarify the
properties and functions of network.
→Inbiological network
communities helps us to understand
basic mechanisms which control
normal cellular processes.
12

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Finding communities
Q:How and why do communities form?
A:Find weak ties and identify communities
Givan-Newman’s betweenness centrality, modularity and graph partitioning
methods are all based on this idea.
Q:
Can It find overlapping communities and
communities in large-scale network?
A:
No!It dont detect overlapping communities and
often has Big O((n+m)mn
2
)
13

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Overlapping communities
[McAuley, Leskovec: Discovering social circles in ego networks, 2012]
14

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Edge Probability
Nodesuandvsharekgroups.
What is edge prob
P(edgejk)as a func ofk?
[Stanford University: http://snap.stanford.edu/agm]
15

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Edge Probability
Nodesuandvsharekgroups.
What is edge prob
P(edgejk)as a func ofk?
Overlaps are
DENSER!
[Stanford University: http://snap.stanford.edu/agm]
16

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Overlaps are DENSER!
[Stanford University: http://snap.stanford.edu/agm]
17

BIGCLAM

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
BigCLAM
Memberships have strengths
[Stanford University: http://snap.stanford.edu/agm]
→FuA: The membership strength of nodeuto communityA
(FuA= 0: no membership)
→Each communityAlinks nodes independently:
PA(u;v) = 1exp(FuA:FvA)
19

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Factor matrixF
Community membership strength matrixF
PA(u;v) = 1exp(FuA:FvA)
P(u;v) = 1exp(Fu:F
T
v)
Probability of connection is
proportional to the product of
strengths.
Notice:If one node doesn’t belong
to the community (FuC=0) then
P(u;v) = 0
20

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
BigCLAM: Flexibility
BigCLAM canexpress variety of network structures:
Non-overlapping, Nested, Overlapping.
[Stanford University: http://snap.stanford.edu/agm]
21

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Maximum Likelihood Estimation of Graph
Task: Given a networkG(V;E), estimateF
WithF, thelikelihoodof the graph is computed by
P(GjF) =
∏
(u;v)2E
p(u;v)
∏
(u;v)̸=E
(1p(u;v)):
(1)
Wherep(u;v) = 1exp(Fu:F
T
v)
Goal: FindFthat maximizes the log-likelihood
^F=arg max
F0
l(F) (2)
Wherel(F) =logP(GjF)
l(F) =
∑
(u;v)2E
log(1exp(FuF
T
v))
∑
(u;v)̸=E
exp(FuF
T
v)(3)
22

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Convex optimization
Maximization as convex optimization problem. Equationl(F)can
be broken down into one sub-problem peru2 V.
^Fu=arg max
Fuc0;c2C
l(Fu) (4)
So,l(Fu)is computed by
l(Fu) =
∑
v2N(u)
log(1exp(FuF
T
v))
∑
v/2N(u)
FuF
T
v (5)
WhereN(u) =fv:v2 V;(u;v)2 Egis neighbours ofu
23

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Convex optimization
Compute gradient of a single rowFuofF:
▽(l(Fu)) =
@l(Fu)
@Fu
=
∑
v2N(u)
Fv
exp(FuF
T
v)
1exp(FuF
T
v)

∑
v/2N(u)
Fv(6)
Stochiate Gradient ascent method:
*Iterate over the rows ofF
1.Compute gradient▽(l(Fu))of rowu
2.Update the rowFu:Fu Fu+:▽(l(Fu))
3.ProjectFuback to a non-negative vector: IfFuC<0 :FuC= 0
This is slow! Computing▽(l(Fu))take linear time withO(jVj).
24

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Gradient descent method
25

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Learning rate
Large learning rate Small learning rate
Need good way to choose and adjust a “learning rate”!
26

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Backtracking line search
27

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
MBSGD method
28

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Complexity reduction
Two equations below, the time complexity is reduced toO(jN(u)j)
with (jN j ≪ jVj).
∑
v/2N(u)
FuF
T
v=
∑
v
FuF
T
vFuF
T
u
∑
v2N(u)
FuF
T
v (7)
WhereN(u) =u[ N(u)is neighbourbood ofu
∑
v/2N(u)
Fv=
(
∑
v
Fv
)
Fu
0
@
∑
v2N(u)
Fv
1
A (8)
29

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Stopping condition
Withl
′
(Fu)is the log-likelihood ofFforuafter update by gradient
ascent method. The iterative ascent process terminates if:
l
′
(Fu)l(Fu)
l(Fu)
<0:0001;8u2 V (9)
the log-likelihood increases by less than0:01%8u2 V
30

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Initialization ofF
LetFbe an afﬁliation weights matrix. The algorithm initialisesF
as follows:
F
(u
′
)(N(u))=
8
>
<
>
:
1ifu
′
2N(u)andN(u)is a locally
minimal neighbourhood ofu
0otherwise
(10)
WhereN(u)is said to be locally minimal if it has lower
conductance than allN(v);v2 N(u).
φ(N(u))φ(N(v));8v2 N(u).
φ(S) =
∑
i2S;j/2S
Aij
min(A(S);A(
P
S))
(11)
WhereA(S) =
∑
i2S
∑
j2V
Aij
31

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Determining community afﬁliations
LetG(V;E),Fbe the most likely afﬁliation weights matrix under
G.u2 Vas a member ofc2 Cif:
FucK=
√
log(1ϵ) (12)
Whereϵ=
2jEj
jVj(jVj1)
32

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Determining number of communities
1.Select20%node pairs randomly formGas hold out setH
2.For everyKas a candidate on number of community:
→Run community detection algorithm, withKas the number of
communities.K2[minCom;maxCom]with stepsize:
=exp
(
log
(
maxCom
minCom
)

1
divCom
)
→Evaluate the likelihood of afﬁliation weights with respect to
H,lk(FH)
3.Obtain the most likely number of communities
^
K=arg maxKlk(FH)
33

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Experiments
[Stanford University: http://snap.stanford.edu/agm]
34

[Stanford University: http://snap.stanford.edu/agm]

GRID COMPUTING

Grid computing
is a type of parallel and distributed system or a computer network
in which each computer’s resources are shared with every other
computer in the system.
.

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Beneﬁts & Challenges
Manybenefitsandchallengesof grid
computing.
Benefits
→Can solve larger, more
complex problems in a
shorter time
→Easier to collaborate with
other organizations
→Make better use of existing
hardware
→etc.
Challenges
→Some messages can be
lost in the network
system
→Security problem due to
sharing
→Schedule and
optimization problem
→etc.
38

Apache Hadoop
Apache Hadoop is a framework that allows for distributed
processing of large data sets across clusters of commodity
computers using a simple programming model.
.

Apache Spark
is an open-source luster-computing framework for real time
processing developed by the Apache Software Foundation.
Spark provides an interface for programming entire clusters with implicit data parallelism
and fault-tolerance.
It was built on top of Hadoop MapReduce and it extends the MapReduce model to efﬁciently
use more types of computations.
.

Apache Spark Architecture

Resilient Distributed
Datasets
Transformations
map(), mapPartitions(),
ﬁlter(), …
Actions
count(), collect(), sortBy(),
reduce(), …

Graphx
GraphX is Apache Spark’s API for graphs and graph-parallel
computation.

Map-Reduce Model
.

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
The information of grid
Specs Number Role
OS: Ubuntu16:04LTS
HDD:95GB
RAM:4GB
CPU: intel Core i5-4590U CPU3:30GHzx4
1 master
OS: Ubuntu16:04LTS
HDD:95GB
RAM:4GB
CPU: intel Core i5-4590U CPU3:30GHzx4
9 workers
45

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Experiment and Performance
Table:The table shows detail network is used to experiment.
Name Nodes Edges
Ego-Facebook 4;039 88 ;234
My ego-Facebook 20;812 27 ;282
com-Amazon 334;863 925;872
com-Youtube 1;134;890 2;987;624
46

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Evaluation
AverageF1score: To compute the F1 score, we need to determine
whichCi2 Ccorresponds to whichC
′
i
2 C
′
. F1 score is computed:
1
2
0
@
1
K
∑
Ci2C
F1(Ci;C
′
g(i)
) +
1
K
′
∑
C
′
i
2C
′
F1(C
g
′
(i);C
′
i)
1
A
Where the best matchinggandg
′
is deﬁned as follows:
g(i) =arg max
j
F1(Ci;C
′
j);g
′
(i) =arg max
j
F1(Cj;C
′
i)
Name F1 score
com-Amazon 0:49
com-Youtube 0:42
47

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
References
1.David F. Gleich and C. Seshadhri. Neighborhoods are good
communities.CoRR abs/1112.0031, 2011.
2.Jaewon Yang and Jure Leskovec. Overlapping community
detection at scale: a nonnegative matrix factorization
approach. In Proceedings of the sixth ACM international
conference on Web search and data mining, pages 587–596.
ACM, 2013
3.John Aldrich et al. Ra ﬁsher and the making of maximum
likelihood 1912-1922. Statistical Science, 12(3):162–176, 1997
48

DEMO & CONCLUSION

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Main contributions
→Overlapping community detection algorithm using BigCLAM
model has successfully implemented.
→This algorithm able to detect community in large-scale on
Apache Spark.
→MBSGD method and backtracking line search method have
proposed working better than the author’s methods on
computing grid.
→This algorithm able to detect communities in the large-scale
network.
50

Overview Large-scale Network & Community BigCLAM Grid computing Demo & conclusion
Future work
→Research some techniques and redesign code to improve the
performance algorithm on computing grid system and
accuracy of algorithm.
→Design and organise cluster computer to create a
supercomputing grid system.
→Apply this algorithm to solve real-life problems.
→Build recommendation system in social network in education
for ThangLong University project. It will increase the student’s
performance in study and activities.
→Build effective advertising campaign by understanding
structure social.
→etc.
51

Thank you!
Thank you for watching and I hope you enjoyed my
presentation. If you have any questions, I’ll be happy to
answer them.

Overlapping community detection in Large-Scale Networks using BigCLAM model build on Apache Spark

About This Presentation

Slide Content

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

Overlapping community detection in Large-Scale Networks using BigCLAM model build on Apache Spark

About This Presentation

Slide Content

Slide 1

Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

Slide 24

Slide 25

Slide 26

Slide 27

Slide 28

Slide 29

Slide 30

Slide 31

Slide 32

Slide 33

Slide 34

Slide 35

Slide 36

Slide 37

Slide 38

Slide 39

Slide 40

Slide 41

Slide 42

Slide 43

Slide 44

Slide 45

Slide 46

Slide 47

Slide 48

Slide 49

Slide 50

Slide 51

Slide 52

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

Pray For The Peace Of Jerusalem and You Will Prosper

Don_t_Waste_Your_Life_God.....powerpoint

VILLASUR_FACTORS_TO_CONSIDER_IN_PLATING_SALAD_10-13.pdf

Fertility awareness methods for women in the society

Chapter 5 Arithmetic Functions Computer Organisation and Architecture

syakira bhasa inggris (1) (1).pptx.......