Embedding-based Silhouette community detection

Mach Learn. 2020;109(11):2161-2193. doi: 10.1007/s10994-020-05882-8. Epub 2020 Jul 27.

Blaž Škrlj^{1,2}, Jan Kralj^{1,3}, Nada Lavrač^{1,4}

Affiliations

¹ Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia.
² Jožef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia.
³ CosyLab, Gerbičeva ulica 64, 1000 Ljubljana, Slovenia.
⁴ University of Nova Gorica, Vipavska 13, 5000 Nova Gorica, Slovenia.

PMID: 33191975
PMCID: PMC7652809
DOI: 10.1007/s10994-020-05882-8

Blaž Škrlj et al. Mach Learn. 2020.


Abstract

Mining complex data in the form of networks is of increasing interest in many scientific disciplines. Network communities correspond to densely connected subnetworks and often represent key functional parts of real-world systems. This paper proposes embedding-based Silhouette community detection (SCD), an approach that detects communities by clustering network node embeddings, i.e. real-valued representations of nodes derived from their neighborhoods. We investigate the performance of SCD on 234 synthetic networks, as well as on a real-life social network. Even though SCD is not based on any form of modularity optimization, it performs comparably to or better than state-of-the-art community detection algorithms such as InfoMap and Louvain. Further, we demonstrate that SCD's outputs can be used along with domain ontologies in semantic subgroup discovery, yielding human-understandable explanations of communities detected in a real-life protein interaction network. Being embedding-based, SCD is widely applicable and can be tested out-of-the-box as part of many existing network learning and exploration pipelines.

Keywords: Community detection; Network analysis; Node embedding; Unsupervised learning.
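
The pipeline outlined in the abstract (embed the nodes, cluster the embeddings, read the clusters as communities) can be illustrated with a minimal pure-Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper's implementation: the embedding is a personalized PageRank vector per node (one plausible reading of the "PPR" in the SCD-PPR variant named in the figure captions), the clustering is Lloyd's k-means with a deterministic farthest-first initialisation, and candidate values of k are compared by the mean Silhouette coefficient, as the method's name suggests. The function names `ppr_embedding`, `kmeans`, `silhouette` and `detect_communities` are hypothetical.

```python
import math


def ppr_embedding(adj, alpha=0.15, iters=50):
    """Embed each node as its personalized PageRank vector (power iteration).

    Assumption: this stands in for the embedding map f; it is not the paper's code.
    """
    n = len(adj)
    embedding = []
    for seed in range(n):
        p = [0.0] * n
        p[seed] = 1.0
        for _ in range(iters):
            nxt = [0.0] * n
            for u, neighbors in enumerate(adj):
                if neighbors:
                    share = (1 - alpha) * p[u] / len(neighbors)
                    for v in neighbors:
                        nxt[v] += share
                else:
                    nxt[seed] += (1 - alpha) * p[u]  # dangling node: restart at seed
            nxt[seed] += alpha  # teleport back to the seed node
            p = nxt
        embedding.append(p)
    return embedding


def kmeans(X, k, iters=100):
    """Lloyd's k-means with deterministic farthest-first initialisation."""
    centers = [list(X[0])]
    while len(centers) < k:
        far = max(X, key=lambda x: min(math.dist(x, c) for c in centers))
        centers.append(list(far))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: math.dist(x, centers[c])) for x in X]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [x for x, lab in zip(X, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(X, labels):
    """Mean Silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(math.dist(X[i], X[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(math.dist(X[i], X[j]) for j in other) / len(other)
                for m, other in clusters.items() if m != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(X)


def detect_communities(adj, k_values):
    """Return (best mean Silhouette, labels) over the candidate cluster counts."""
    X = ppr_embedding(adj)
    best = (-2.0, None)
    for k in k_values:
        labels = kmeans(X, k)
        score = silhouette(X, labels)
        if score > best[0]:
            best = (score, labels)
    return best


# Toy input: two 4-cliques joined by the single edge 3-4 (adjacency lists).
adj = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2, 4],
       [3, 5, 6, 7], [4, 6, 7], [4, 5, 7], [4, 5, 6]]
score, labels = detect_communities(adj, range(2, 6))
print(round(score, 3), labels)
```

The sketch tries every k in the given range, which is only practical for small graphs; any other embedding or clustering routine could be swapped in for `ppr_embedding` and `kmeans` without changing `detect_communities`.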

Figures

**Fig. 1**
Schematic representation of the proposed embedding-based SCD approach. The input network G (comprised of a set of nodes N) first gets mapped to a latent d-dimensional vector space via f. The vectors are clustered (via c) to obtain the final partition P(G)

**Fig. 2**
In community-based semantic subgroup discovery, individual rules, such as $R_i$, $R_j$ and $R_k$, represent human-understandable descriptions, composed of terms $T_i$, of individual communities

**Fig. 3**
Three examples of LFR networks. The largest example LFR network consists of 10,000 nodes and 302,160 edges. The mixing parameter for these networks is set to 0.1, indicating very well defined communities. We colored the first 100 communities by size (random colors)

**Fig. 4**
E-mail ground truth communities (departments of senders)

**Fig. 5**
Differences in the number of estimated communities (on LFR networks). The horizontal line represents the optimal outcome with respect to the number of detected communities

**Fig. 6**
Visualization of E-mail network communities obtained using different algorithms. Communities are colored by size. LabelPropagation and SCD-PPR performed the worst, which is also apparent from the visualizations: LabelPropagation did not detect any communities, whereas SCD-PPR detected too few

**Fig. 7**
Visualization of solutions found when different intervals of k are considered. The situation where each k is tested corresponds to $\gamma = 1$. The larger markers denote optima found when different intervals of k are considered. All $\gamma$ variants yield a similar number of communities ($\approx 50$), which is close to the ground truth of 42 communities

**Fig. 8**
Communities in the Human Affinome network

**Fig. 9**
Fitting the closed-form solution for determining the $\gamma$ parameter to simulated data

**Fig. 10**
Non-normalized Silhouette across dimensions on the considered social network

**Fig. 11**
Normalized Silhouette across dimensions on the considered social network

Cited by

Škrlj B, Kokalj E, Lavrač N. PubMed-Scale Chemical Concept Embeddings Reconstruct Physical Protein Interaction Networks. Front Res Metr Anal. 2021 Apr 13;6:644614. doi: 10.3389/frma.2021.644614. eCollection 2021. PMID: 33928210.
Benis A, Chatsubi A, Levner E, Ashkenazi S. Change in Threads on Twitter Regarding Influenza, Vaccines, and Vaccination During the COVID-19 Pandemic: Artificial Intelligence-Based Infodemiology Study. JMIR Infodemiology. 2021 Oct 14;1(1):e31983. doi: 10.2196/31983. eCollection 2021 Jan-Dec. PMID: 34693212.
Prokop P, Dráždilová P, Platoš J. Overlapping community detection in weighted networks via hierarchical clustering. PLoS One. 2024 Oct 28;19(10):e0312596. doi: 10.1371/journal.pone.0312596. eCollection 2024. PMID: 39466771.
