and M. E. J. Newman, Proc. Natl. Acad. Sci. 99 (2002) The shortest path between two vertices is a sequence of connected vertices that minimizes the number of its constituent edges. 1 2 3 4 5 6 7 2 7
and M. E. J. Newman, Proc. Natl. Acad. Sci. 99 (2002) The shortest path between two vertices is a sequence of connected vertices that minimizes the number of its constituent edges. For each edge, the edge betweenness is dened as the number of shortest paths passing through it. 1 2 3 4 5 6 7 9 24
and M. E. J. Newman, Proc. Natl. Acad. Sci. 99 (2002) The shortest path between two vertices is a sequence of connected vertices that minimizes the number of its constituent edges. For each edge, the edge betweenness is dened as the number of shortest paths passing through it. Algorithm for identifying the communities At each step, remove the edge with the highest betweenness. 1 2 3 4 5 6 7
size R is the number of inter-group edges (in this case, the number of edges between the two groups of vertices). The index vector h has elements hi = 1 if vertex i belongs to the rst group −1 if vertex i belongs to the second group Example 1 2 3 4 5 6 7 R = 1 h = 1, 1, 1, 1, −1, −1, −1
dene the Laplacian matrix L with elements lij = ai i = j −aij i = j The cut size R can then be written in terms of the index vector h and the Laplacian matrix L as R = 1 4 hTLh which is equivalent to R = 1 4 i α2 i λi where λi is the eigenvalue of L corresponding to the orthonormal eigenvector ui , and we assume 0 = λ1 ≤ λ2 ≤ . . . ≤ λn .
dene the Laplacian matrix L with elements lij = ai i = j −aij i = j The cut size R can then be written in terms of the index vector h and the Laplacian matrix L as R = 1 4 hTLh which is equivalent to R = 1 4 i α2 i λi where λi is the eigenvalue of L corresponding to the orthonormal eigenvector ui , and we assume 0 = λ1 ≤ λ2 ≤ . . . ≤ λn . h = i αiui αi = uT i h constraint i α2 i = n
modularity Q is dened as Q = 1 a i,j (aij − pij) δc (i, j) with pij = aiaj a where aij describes the actual number of edges between vertices vi and vj pij describes the expected number of edges between vertices vi and vj , as a function of their degrees ai and aj , respectively The matrix P = [pij] describes the so-called null model of the graph δc (i, j) = 1 if the two vertices vi and vj belong to the same community, and 0 otherwise
modularity Given a partition {SI, I = 1, . . . , K} of the network into K communities, dene the symmetric matrix E of order K with elements as follows eIJ = 1 a i∈SI j∈SJ aij eIJ is half the fraction of all edges connecting vertices in community SI to vertices in community SJ . Theorem (Community decomposition of the modularity) The modularity of the network can be written as Q = I eII − e2 I eII corresponds to the fraction of all edges within community SI . eI = P J eIJ corresponds to the sum of eII and P J=I eIJ .
E. J. Newman, Phys. Rev. E 69 (2004) Algorithm for identifying the communities At each step, compute the change in Q should any two communities be joined, denoted ∆QIJ . Then, join the pair producing the largest increase. Example (Dendrogram) Q
version) Aaron Clauset, M. E. J. Newman, and Cristopher Moore, Phys. Rev. E 70 (2004) Maintain and update a matrix ∆ with elements δIJ = ∆QIJ , indexing the largest element in each row and in the whole matrix for fast retrieval. Initialization rule for ∆ Initially, for all pairs of communities SI = SJ we have δIJ = 2aij a − 2aiaj a2 Updating rule for ∆ Upon joining communities SI and SJ into SI∪J , for all communities SK = SI, SJ we have δ(I∪J)K = δIK + δJK
modularity Self-loops, which we assume to be absent in the original graph, contribute negatively to the modularity Q, since aii = 0, aii a − aiai a2 δc (i, i) = − ai a 2 Denition (New null model matrix) We can apply our diagonal diusion operator to the matrix P, obtaining a new null model matrix R with null-diagonal elements and o-diagonal elements rij of the form rij = aiaj a + 1 n − 2 a2 i + a2 j a − 1 n − 1 i a2 i a , i = j The transformation from P to R preserves row (and column) sums.
modularity As for the classical modularity denition, we can decompose the new modularity at the single community level. Moreover, we can devise initialization and updating rules for the improved version of the agglomerative hierarchical clustering algorithm. Initialization rule for ∆ Initially, for all pairs of communities SI=i = SJ=j we have δIJ = 2aij a − 2aiaj a2 − 2 a2 (n − 1) (n − 2) (n − 1) a2 i + a2 j − i a2 i Updating rule for ∆ The same as before!
heuristics for the agglomerative hierarchical clustering algorithm than the classical one. Directions for future research Other modularity optimization methods Modularity as an objective function Dynamic complex networks
in Italy Data has been collected from two publicly accessible databases, namely: 1 Cerca Università 2 The DBLP Computer Science Bibliography The graph obtained has: n = 1741 vertices m = 3878 edges 89 connected components, with a giant central component encompassing more than 80% of the whole graph 66 small components of order 2 and 3
0 10 20 30 40 50 0 10 20 30 40 50 Number of dierent universities Number of communities There are 12 communities containing authors from one single university, and there are 44 communities containing authors from two dierent universities.
0 10 20 30 40 50 0 10 20 30 40 50 Number of dierent universities Number of communities There are 12 communities containing authors from one single university, and there are 44 communities containing authors from two dierent universities. Around 45% of the communities contain authors from at most two universities
0 10 20 30 40 50 0 10 20 30 40 50 Number of dierent universities Number of communities There are 12 communities containing authors from one single university, and there are 44 communities containing authors from two dierent universities. Around 45% of the communities contain authors from at most two universities 0 2 4 6 8 10 0 10 20 30 40 50 Number of dierent SSDs
0 10 20 30 40 50 0 10 20 30 40 50 Number of dierent universities Number of communities There are 12 communities containing authors from one single university, and there are 44 communities containing authors from two dierent universities. Around 45% of the communities contain authors from at most two universities 0 2 4 6 8 10 0 10 20 30 40 50 Number of dierent SSDs There are 38 communities containing authors from one single SSD, and there are 39 communities containing authors from two dierent SSDs.
0 10 20 30 40 50 0 10 20 30 40 50 Number of dierent universities Number of communities There are 12 communities containing authors from one single university, and there are 44 communities containing authors from two dierent universities. Around 45% of the communities contain authors from at most two universities 0 2 4 6 8 10 0 10 20 30 40 50 Number of dierent SSDs There are 38 communities containing authors from one single SSD, and there are 39 communities containing authors from two dierent SSDs. More than 60% of the communities contain authors belonging to at most two SSDs
can be computed in unweighted graphs in time O (mn) using the fast algorithm of Newman. Since this calculation has to be repeated once for the removal of each edge, the entire algorithm runs in worst-case time O m2n .
need only consider pairs of connected communities, of which there will be at any time at most m The change in Q upon joining two communities can be computed in constant time Following a join, we will need to update up to n of the matrix elements eIJ by adding together the rows and columns corresponding to the joined communities Thus, each step of the algorithm takes worst-case time O (m + n). There are a maximum of n − 1 join operations necessary to construct the complete dendrogram, and hence the entire algorithm runs in time O ((m + n) n), or O n2 on a sparse graph.
version) This version of the algorithm is extremely ecient, scaling as O (md log n), where d is the depth of the dendrogram. Furthermore, since many real-world networks are sparse and hierarchical, with m ∝ n and d ∝ log n, this reduces essentially to linear time, O n log2 n . This algorithm has been successfully applied in the study of a recommender network of books from a large on-line retailer, with more than 400000 vertices and 2 million edges.