Mach Learn. 2020;109(11):2161-2193.
doi: 10.1007/s10994-020-05882-8. Epub 2020 Jul 27.

Embedding-based Silhouette community detection


Blaž Škrlj et al. Mach Learn. 2020.

Abstract

Mining complex data in the form of networks is of increasing interest in many scientific disciplines. Network communities correspond to densely connected subnetworks, and often represent key functional parts of real-world systems. This paper proposes embedding-based Silhouette community detection (SCD), an approach for detecting communities based on clustering of network node embeddings, i.e. real-valued representations of nodes derived from their neighborhoods. We investigate the performance of the proposed SCD approach on 234 synthetic networks, as well as on a real-life social network. Even though SCD is not based on any form of modularity optimization, it performs comparably to or better than state-of-the-art community detection algorithms, such as InfoMap and Louvain. Further, we demonstrate that SCD's outputs can be used along with domain ontologies in semantic subgroup discovery, yielding human-understandable explanations of communities detected in a real-life protein interaction network. Being embedding-based, SCD is widely applicable and can be tested out-of-the-box as part of many existing network learning and exploration pipelines.

Keywords: Community detection; Network analysis; Node embedding; Unsupervised learning.
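
The idea summarized in the abstract (embed the nodes, cluster the embeddings, and keep the clustering with the best Silhouette score) can be illustrated with a minimal sketch. This is not the authors' implementation. Everything in it is an assumption made for illustration: the function names (embed, kmeans, silhouette, detect) are hypothetical, the embedding is simply each node's adjacency row with a self-loop rather than a learned node embedding, the clustering is plain k-means with deterministic farthest-first seeding, and the toy graph is two 4-cliques joined by a single edge.

```python
import math

def embed(edges, n):
    """Stand-in embedding (assumption): each node's adjacency row plus a self-loop."""
    vecs = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        vecs[u][v] = vecs[v][u] = 1.0
    for i in range(n):
        vecs[i][i] = 1.0
    return vecs

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vecs, k, iters=20):
    """Lloyd's k-means with deterministic farthest-first seeding (assumption)."""
    centers = [vecs[0]]
    while len(centers) < k:
        centers.append(max(vecs, key=lambda v: min(dist(v, c) for c in centers)))
    labels = [0] * len(vecs)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(v, centers[j])) for v in vecs]
        for j in range(k):
            members = [v for v, lab in zip(vecs, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(vecs, labels):
    """Mean of s(i) = (b - a) / max(a, b); points in singleton clusters score 0."""
    clusters = set(labels)
    if len(clusters) < 2:
        return -1.0  # undefined for one cluster; return the worst value instead
    total = 0.0
    for i, v in enumerate(vecs):
        same = [dist(v, w) for j, w in enumerate(vecs)
                if labels[j] == labels[i] and j != i]
        if not same:
            continue  # singleton cluster contributes 0
        a = sum(same) / len(same)  # mean distance within own cluster
        b = min(                   # mean distance to the nearest other cluster
            sum(dist(v, w) for w, lab in zip(vecs, labels) if lab == c) / labels.count(c)
            for c in clusters if c != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(vecs)

def detect(edges, n, k_values):
    """Try each candidate k and keep the labeling with the highest mean Silhouette."""
    vecs = embed(edges, n)
    candidates = [kmeans(vecs, k) for k in k_values]
    return max(candidates, key=lambda labels: silhouette(vecs, labels))

# Toy graph: two 4-cliques (nodes 0-3 and 4-7) joined by the edge (3, 4).
clique_a = [(i, j) for i in range(4) for j in range(i + 1, 4)]
clique_b = [(i, j) for i in range(4, 8) for j in range(i + 1, 8)]
edges = clique_a + clique_b + [(3, 4)]

labels = detect(edges, 8, range(2, 5))
print(labels)
```

On this toy graph the Silhouette score is highest for two clusters, one per clique. In a real pipeline the adjacency-row stand-in would be replaced by whichever node embedding is under study, and the exhaustive loop over k becomes the costly step on larger networks.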


Figures

Fig. 1
Schematic representation of the proposed embedding-based SCD approach. The input network G (comprised of a set of nodes N) first gets mapped to a latent d-dimensional vector space via f. The vectors are clustered (via c) to obtain the final partition P(G)
Fig. 2
In community-based semantic subgroup discovery, individual rules, such as Ri, Rj and Rk, are human-understandable descriptions of individual communities, composed of terms Ti
Fig. 3
Three examples of LFR networks. The largest example LFR network consists of 10,000 nodes and 302,160 edges. The mixing parameter for these networks is set to 0.1, indicating very well-defined communities. The first 100 communities by size are shown in random colors
Fig. 4
E-mail ground truth communities (departments of senders)
Fig. 5
Differences in the number of estimated communities (on LFR networks). The horizontal line represents the optimal outcome with respect to the number of detected communities
Fig. 6
Visualization of E-mail network communities obtained using different algorithms. Communities are colored by size. LabelPropagation and SCD-PPR performed the worst, which is also apparent from the visualizations: LabelPropagation did not detect any communities, whereas SCD-PPR detected too few
Fig. 7
Visualization of solutions found when different intervals of k are considered. The situation where each k is tested corresponds to γ=1. The larger markers denote optima found when different intervals of k are considered. All γ variants yield a similar number of communities (50), which is close to the ground truth of 42 communities
Fig. 8
Communities in the Human Affinome network
Fig. 9
Fitting the closed-form solution for determining the γ parameter to simulated data
Fig. 10
Non-normalized Silhouette across dimensions on the considered social network
Fig. 11
Normalized Silhouette across dimensions on the considered social network


