This repository provides a novel methodology for the hierarchical analysis of musical similarity, specifically designed for symbolic folk music collections. Our approach leverages phylogenetic analysis to uncover relationships between different musical genres and traditions. The tools included allow for feature extraction, phrase clustering, multiple forms of similarity calculation, and visualization of results as phylogenetic trees and heatmaps.
- Phylo-Analysis of Folk Traditions
Our methodology, illustrated below, is a multi-step process:
- Feature Extraction: Melodic and rhythmic features (chromatic intervals, diatonic intervals, and rhythmic ratios) are extracted from musical scores.
- Phrase Clustering: The extracted musical phrases are grouped into clusters based on their similarity using a global alignment and QT Clustering.
- Similarity Calculation: We compute four types of similarity between scores:
  - Global Similarity: A direct comparison of the entire musical score's features.
  - Shared Phrases Similarity: Measures the proportion of shared phrase clusters between scores.
  - Form Similarity: Compares the structural arrangement of phrase clusters within scores.
  - Combined Similarity: A weighted combination of Form and Shared Phrases similarities.
- Phylogenetic Analysis: For each similarity method, a distance matrix is generated and used to construct a phylogenetic tree, visualizing the relationships between scores and genres.
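The last step, turning a distance matrix into a tree, can be illustrated with a small sketch. The tree-building algorithm used by the repository is not stated here, so UPGMA (average-linkage agglomeration) is used purely as an assumed stand-in, and the output is a topology-only Newick string rather than the NEXUS files the scripts produce. The function name `upgma` and the toy score names are hypothetical.

```python
def upgma(dist, names):
    """Build a topology-only Newick string with UPGMA (assumed method)."""
    n = len(names)
    # Each active cluster maps to (newick_text, number_of_leaves).
    clusters = {i: (names[i], 1) for i in range(n)}
    # Pairwise distances keyed by (smaller_id, larger_id).
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Merge the closest pair; ties are broken by cluster ids.
        (a, b), _ = min(d.items(), key=lambda kv: (kv[1], kv[0]))
        (text_a, size_a), (text_b, size_b) = clusters.pop(a), clusters.pop(b)
        merged = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Size-weighted average distance to the new cluster.
            merged[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(merged)
        clusters[next_id] = (f"({text_a},{text_b})", size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0] + ";"


# Toy distance matrix for three hypothetical scores.
toy_names = ["jig_1", "jig_2", "reel_1"]
toy_dist = [[0, 1, 4], [1, 0, 5], [4, 5, 0]]
print(upgma(toy_dist, toy_names))
```

In this toy case the two jigs are merged first because they are the closest pair, and the reel joins them at the root.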
The methodology extracts and analyzes different types of musical features:
- Diatonic (D): Diatonic interval sequences representing melodic patterns within the key.
- Chromatic (C): Chromatic interval sequences capturing all semitone relationships.
- Rhythmic (R): Rhythmic ratio patterns representing temporal relationships between notes.
- Diatonic + Rhythmic (DR): Combined diatonic intervals and rhythmic patterns.
- Chromatic + Rhythmic (CR): Combined chromatic intervals and rhythmic patterns.
These abbreviations (D, C, R, DR, CR) are used throughout the analysis results and visualizations.
- Extracts the annotated phrases from the scores.
- Computes the chromatic interval, diatonic interval and rhythmic ratio from each melody.
- Combines the chromatic and diatonic features with the rhythmic ratio.
- Stores all relevant data in the database.
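As a rough illustration of the three base features and their combination, the sketch below works on a hand-written melody. The note representation (MIDI pitch, diatonic step number, duration in quarter notes) and all function names are assumptions made for this example; the repository's own extraction code reads the score files and may encode things differently.

```python
from fractions import Fraction

# Hypothetical note tuples: (midi_pitch, diatonic_step, duration_in_quarters).
# diatonic_step counts scale steps (C4 = 28, D4 = 29, ...), ignoring accidentals.
melody = [
    (60, 28, Fraction(1)),     # C4, quarter note
    (62, 29, Fraction(1, 2)),  # D4, eighth note
    (64, 30, Fraction(1, 2)),  # E4, eighth note
    (60, 28, Fraction(2)),     # C4, half note
]


def chromatic_intervals(notes):
    """Semitone difference between consecutive notes."""
    return [b[0] - a[0] for a, b in zip(notes, notes[1:])]


def diatonic_intervals(notes):
    """Scale-step difference between consecutive notes."""
    return [b[1] - a[1] for a, b in zip(notes, notes[1:])]


def rhythmic_ratios(notes):
    """Duration of each note divided by the duration of the previous one."""
    return [b[2] / a[2] for a, b in zip(notes, notes[1:])]


def combine_with_rhythm(intervals, ratios):
    """Pair each interval with its rhythmic ratio (the DR and CR features)."""
    return list(zip(intervals, ratios))


chromatic = chromatic_intervals(melody)
diatonic = diatonic_intervals(melody)
rhythmic = rhythmic_ratios(melody)
chromatic_rhythmic = combine_with_rhythm(chromatic, rhythmic)
print(chromatic, diatonic, rhythmic, chromatic_rhythmic)
```

Because all three features are relative (differences and ratios), they are unaffected by transposition or by a uniform change of note values.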
For each feature:
- Computes a global alignment between phrases.
- Runs a QT Clustering with the 10th percentile of the distance distribution as the threshold.
- Saves the cluster assignment to the database and returns a folder with additional information artifacts.
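The clustering steps above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the unit gap and mismatch costs, the nearest-rank percentile, the tie-breaking rules, and all function names are assumptions.

```python
def alignment_distance(a, b, gap=1, mismatch=1):
    """Cost of a global (Needleman-Wunsch style) alignment of two sequences."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            substitute = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            curr.append(min(substitute, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def percentile(values, q):
    """Nearest-rank percentile for an integer q between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def qt_clustering(dist, threshold):
    """Quality Threshold clustering: keep the largest cluster whose
    diameter stays within the threshold, remove it, and repeat."""
    remaining = set(range(len(dist)))
    clusters = []
    while remaining:
        best = []
        for seed in sorted(remaining):
            cluster = [seed]
            candidates = remaining - {seed}
            while candidates:
                # Candidate whose addition yields the smallest diameter.
                nxt = min(sorted(candidates),
                          key=lambda c: max(dist[c][m] for m in cluster))
                if max(dist[nxt][m] for m in cluster) > threshold:
                    break
                cluster.append(nxt)
                candidates.remove(nxt)
            if len(cluster) > len(best):
                best = cluster
        clusters.append(sorted(best))
        remaining -= set(best)
    return clusters


# Toy phrases expressed as chromatic interval sequences.
phrases = [[2, 2, -4], [2, 2, -3], [2, 2, -4], [-1, -1, 5, 0], [-1, -1, 5, 1]]
n = len(phrases)
dist = [[alignment_distance(phrases[i], phrases[j]) for j in range(n)] for i in range(n)]
pairwise = [dist[i][j] for i in range(n) for j in range(i + 1, n)]
threshold = percentile(pairwise, 10)
print(threshold, qt_clustering(dist, threshold))
```

Deriving the threshold from the distance distribution, as the 10th percentile does, lets the cluster tightness adapt to each feature instead of relying on a fixed constant.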
For each extracted feature, the similarity between scores is computed with several methods:
- Global Similarity (`note` in code): raw comparison of the whole-score feature sequences.
- Shared Phrases (`shared_segments` in code): Euclidean distance between scores based on the phrase clusters they share.
- Form Similarity (`structure` in code): alignment between scores represented as sequences of their phrase clusters.
- Combined Similarity: weighted combination of the Form and Shared Phrases similarities, applied after min-max normalisation.
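A minimal sketch of the Shared Phrases and Combined measures is shown below. It assumes that each score is summarised as a vector of phrase-cluster counts and that the combination is a weighted sum of min-max normalised distance matrices; the function names and toy values are hypothetical, and the repository's code may differ in detail.

```python
import math


def cluster_count_vector(score_clusters, n_clusters):
    """Count how many phrases of a score fall into each cluster."""
    counts = [0] * n_clusters
    for cluster_id in score_clusters:
        counts[cluster_id] += 1
    return counts


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def min_max(matrix):
    """Rescale all entries of a matrix to the range [0, 1]."""
    values = [v for row in matrix for v in row]
    lowest, highest = min(values), max(values)
    span = (highest - lowest) or 1.0  # avoid dividing by zero
    return [[(v - lowest) / span for v in row] for row in matrix]


def combined_distance(form, shared, structure_weight):
    """Weighted sum of the normalised Form and Shared Phrases distances."""
    f, s = min_max(form), min_max(shared)
    w = structure_weight
    return [[w * a + (1 - w) * b for a, b in zip(row_f, row_s)]
            for row_f, row_s in zip(f, s)]


# Two toy scores given as sequences of phrase cluster ids.
vec_a = cluster_count_vector([0, 1, 0, 2], n_clusters=5)
vec_b = cluster_count_vector([3, 3, 4], n_clusters=5)
print(euclidean(vec_a, vec_b))

# Toy Form and Shared Phrases distance matrices for three scores.
form = [[0, 2, 4], [2, 0, 4], [4, 4, 0]]
shared = [[0, 10, 5], [10, 0, 5], [5, 5, 0]]
print(combined_distance(form, shared, structure_weight=0.75))
```

Normalising before combining keeps one measure from dominating the other simply because its raw values span a wider range.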
The codebase uses specific naming conventions that differ from the paper terminology:
- `note` = Global Similarity
- `structure` = Form Similarity (abbreviated as `s` in combined analysis)
- `shared_segments` = Shared Phrases (abbreviated as `ss` in combined analysis)
For Combined Similarity, the weighting notation follows the pattern `s{weight}_ss{weight}`:
- `s25_ss75`: 25% Form Similarity + 75% Shared Phrases
- `s50_ss50`: 50% Form Similarity + 50% Shared Phrases
- `s75_ss25`: 75% Form Similarity + 25% Shared Phrases
Note: In the paper, Form Similarity is abbreviated as `f` and Shared Phrases as `s`, but in the code `s` stands for structure (form) and `ss` for shared_segments (shared phrases).
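When post-processing result files, a small helper can translate these labels back into weights and paper terminology. The sketch below is an illustrative assumption, not part of the repository; `parse_combined_label` and `CODE_TO_PAPER` are hypothetical names.

```python
import re

# Code names mapped to the terminology used in the paper.
CODE_TO_PAPER = {
    "note": "Global Similarity",
    "structure": "Form Similarity",
    "shared_segments": "Shared Phrases",
}


def parse_combined_label(label):
    """Turn a label such as 's25_ss75' into fractional weights."""
    match = re.fullmatch(r"s(\d+)_ss(\d+)", label)
    if match is None:
        raise ValueError(f"not a combined-similarity label: {label!r}")
    structure, shared = int(match.group(1)), int(match.group(2))
    if structure + shared != 100:
        raise ValueError(f"weights in {label!r} do not add up to 100")
    return {"structure": structure / 100, "shared_segments": shared / 100}


for label in ("s25_ss75", "s50_ss50", "s75_ss25"):
    weights = parse_combined_label(label)
    print(label, {CODE_TO_PAPER[name]: w for name, w in weights.items()})
```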
The dataset consists of 600 pieces (300 from each tradition) in **kern and MusicXML format. The Irish scores were sourced from The Session (see also the data repository), and the Galician scores from Folkoteca Galega.
These scores have been processed to expand repetition marks and can be found in the `folkroot/data/origin/` folder. In addition, their musical phrases have been annotated by experts and are available in `folkroot/data/origin/gal_irl_dataset_segments.xlsx`.
The number of scores per genre is:
Tradition | Genre | Number of Scores |
---|---|---|
galician | alalas | 30 |
galician | foliadas | 30 |
galician | jotas | 30 |
galician | marchas | 30 |
galician | mazurcas | 15 |
galician | muineiras | 30 |
galician | pasacorredoiras | 30 |
galician | pasodobles | 30 |
galician | polca | 15 |
galician | rumbas | 30 |
galician | valses | 30 |
irish | barndance | 30 |
irish | hornpipe | 30 |
irish | jig | 30 |
irish | march | 30 |
irish | mazurka | 30 |
irish | polka | 30 |
irish | reel | 30 |
irish | slide | 30 |
irish | strathspey | 30 |
irish | waltz | 30 |
- Clone the source repository: `git clone https://github.com/hromerovelo/folkroot.git`
- Generate the Docker image: `cd folkroot`, then `docker build -t folkroot:1.0 .`
- Create and run a Docker container: `docker run -d -p 2222:22 --name folkroot_container folkroot:1.0`
- Connect via SSH to the running container with username `user` and password `user`: `ssh user@localhost -p 2222`
- Access the folkroot folder: `cd folkroot`
Before running a new general execution, it is necessary to execute the following command to delete all previously computed data:
bash cleanup_folk_root_processing.sh
Perform a general execution with all the proposed similarity methods for all the features:
bash folk_root_processing.sh --alignment all --level all --visualize --trees
Results will be available in the `segments_clustering` and `trees` folders. You can also open `folkroot.db` in the `database` folder and run any SQL query by executing:
cd database
sqlite3 folkroot.db
The results of a general execution are available in the `benchmark` folder.
Generate a phylogenetic tree for a specific feature and similarity method:
cd folkroot/trees
python3 generate_phylo_tree.py --feature chromatic --level structure
Available options:
- Features: `diatonic`, `chromatic`, `rhythmic`, `diatonic_rhythmic`, `chromatic_rhythmic`
- Levels: `note` (Global), `structure` (Form), `shared_segments` (Shared Phrases), `combined`
For combined similarity with custom weights:
python3 generate_phylo_tree.py --feature chromatic_rhythmic --level combined --structure-weight 0.75
Generate trees for specific genres only:
python3 generate_phylo_tree.py --feature diatonic --level structure --genres "jig reel hornpipe"
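To run the script over several feature and level combinations, the command lines can be generated programmatically. The sketch below only builds the command strings from the options listed above and does not execute anything. The helper name `build_commands` is hypothetical, and it assumes that `combined` can be requested without `--structure-weight`, falling back to a default weight.

```python
import itertools
import shlex

FEATURES = ["diatonic", "chromatic", "rhythmic",
            "diatonic_rhythmic", "chromatic_rhythmic"]
LEVELS = ["note", "structure", "shared_segments", "combined"]


def build_commands(features=FEATURES, levels=LEVELS,
                   genres=None, structure_weight=None):
    """Return one shell-quoted command line per feature and level pair."""
    commands = []
    for feature, level in itertools.product(features, levels):
        cmd = ["python3", "generate_phylo_tree.py",
               "--feature", feature, "--level", level]
        # The weight only applies to the combined level.
        if level == "combined" and structure_weight is not None:
            cmd += ["--structure-weight", str(structure_weight)]
        if genres:
            cmd += ["--genres", " ".join(genres)]
        commands.append(shlex.join(cmd))
    return commands


for line in build_commands(features=["chromatic"],
                           genres=["jig", "reel", "hornpipe"],
                           structure_weight=0.75):
    print(line)
```

Each printed line can be pasted into the container shell from the `folkroot/trees` directory, or passed to `subprocess.run` with `shell=True`.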
Analyze a single phylogenetic tree to extract genre distances and generate heatmaps. This process also computes the Genre Separation Ratio (GSR), a metric that evaluates how well the tree separates musical genres. The GSR values are included in the final Excel reports.
cd folkroot/trees
python3 analyze_genre_distances.py --tree-file generated_trees/structure_level_chromatic_all_genres.nexus
Analyze multiple trees and generate comparison reports:
python3 analyze_genre_distances.py --directory generated_trees/
This generates:
- Genre distance matrices (Excel format)
- Distance heatmaps (PNG format)
- Genre-level phylogenetic trees
- Genre Separation Ratio (GSR) metrics
- Comparative analysis across all trees
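The exact GSR formula is not given in this README. To convey the general idea of a separation metric, the sketch below assumes one plausible reading: the mean distance between scores of different genres divided by the mean distance between scores of the same genre. This definition, the function name `separation_ratio`, and the toy data are assumptions for illustration only and should not be taken as the formula implemented in `analyze_genre_distances.py`.

```python
from itertools import combinations


def separation_ratio(dist, labels):
    """Assumed separation measure: mean between-genre distance divided
    by mean within-genre distance (higher means better separated)."""
    within, between = [], []
    for i, j in combinations(range(len(labels)), 2):
        (within if labels[i] == labels[j] else between).append(dist[i][j])
    if not within or not between:
        raise ValueError("need both same-genre and different-genre pairs")
    mean_within = sum(within) / len(within)
    if mean_within == 0:
        return float("inf")
    return (sum(between) / len(between)) / mean_within


# Toy tree distances: two jigs and two reels.
labels = ["jig", "jig", "reel", "reel"]
dist = [
    [0, 1, 4, 4],
    [1, 0, 4, 4],
    [4, 4, 0, 1],
    [4, 4, 1, 0],
]
print(separation_ratio(dist, labels))
```

Under this assumed definition, a value near 1 would mean that genres are no better separated than arbitrary groups, while larger values indicate tighter genre clusters.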
The following figure shows the GSR metric results across different similarity methods, features, and genre combinations from the benchmark analysis:
Compare distance matrices between different similarity methods. Example results from these comparisons can be found in the benchmark folder.
cd folkroot/trees
python3 compare_heatmaps.py \
genre_analysis/genre_distances_structure_chromatic.xlsx \
genre_analysis/genre_distances_note_chromatic.xlsx \
--output correlation_structure_vs_global.png
Compare using original (non-normalized) distance matrices:
python3 compare_heatmaps.py \
genre_analysis/genre_distances_structure_chromatic.xlsx \
genre_analysis/genre_distances_shared_segments_chromatic.xlsx \
--original --output correlation_form_vs_phrases.png
This section provides tools to validate the Genre Separation Ratio (GSR) metric itself. The following analyses test its robustness and establish a baseline for comparison, ensuring that the genre separation measured is statistically significant.
Test how GSR changes with increasing noise in genre assignments:
cd folkroot/trees/gsr_study
python3 test_gsr_sensitivity.py
Generate GSR baseline from random phylogenetic trees:
python3 random_trees_baseline.py --iterations 1000
Run complete GSR evaluation including sensitivity testing and statistical significance:
python3 combined_gsr_analysis.py --iterations 500 --output combined_analysis_results
This comprehensive analysis:
- Tests GSR sensitivity to noise (0-50% genre assignment errors).
- Generates random tree baseline (configurable iterations).
- Calculates statistical significance.
- Creates combined visualization with confidence intervals.
- Provides z-score analysis for perfect classification vs. random chance.
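The logic of such a validation can be sketched on toy data. Several simplifications are assumed here: the separation measure is a simple between/within distance ratio rather than the repository's GSR, the random baseline shuffles genre labels instead of generating random trees, and all function names are hypothetical.

```python
import random
import statistics
from itertools import combinations


def separation_ratio(dist, labels):
    """Assumed stand-in for GSR: mean between-genre / mean within-genre distance."""
    within, between = [], []
    for i, j in combinations(range(len(labels)), 2):
        (within if labels[i] == labels[j] else between).append(dist[i][j])
    return (sum(between) / len(between)) / (sum(within) / len(within))


def add_label_noise(labels, fraction, rng):
    """Reassign a fraction of the items to a different, randomly chosen genre."""
    noisy = list(labels)
    genres = sorted(set(labels))
    for idx in rng.sample(range(len(labels)), round(fraction * len(labels))):
        noisy[idx] = rng.choice([g for g in genres if g != labels[idx]])
    return noisy


def shuffled_baseline(dist, labels, iterations, rng):
    """Separation scores obtained under random label permutations."""
    scores = []
    for _ in range(iterations):
        permuted = list(labels)
        rng.shuffle(permuted)
        scores.append(separation_ratio(dist, permuted))
    return scores


def z_score(observed, baseline):
    """Number of baseline standard deviations above the baseline mean."""
    return (observed - statistics.mean(baseline)) / statistics.stdev(baseline)


# Toy data: three genres of three scores, close within and far between genres.
labels = [genre for genre in ("jig", "reel", "waltz") for _ in range(3)]
dist = [[0 if i == j else (1 if labels[i] == labels[j] else 4)
         for j in range(9)] for i in range(9)]

rng = random.Random(0)  # fixed seed for reproducibility
baseline = shuffled_baseline(dist, labels, 200, rng)
print("z-score:", z_score(separation_ratio(dist, labels), baseline))
for fraction in (0.0, 0.25, 0.5):
    noisy = add_label_noise(labels, fraction, rng)
    print(fraction, separation_ratio(dist, noisy))
```

A useful metric should score the correct labelling far above the shuffled baseline and degrade as label noise increases.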
The following figure shows the combined results of the GSR sensitivity analysis and random trees baseline study:
Generate PDF visualizations for all computed trees:
cd folkroot/trees
bash visualize_all_trees.sh
This creates PDF files organized by similarity method in the `tree_visualizations_pdf/` directory.
Generate trees for specific genre combinations using the generate_all_phylo_trees.sh script. Edit the genre groups section in the script:
# Edit the script to define custom genre groups
cd folkroot/trees
# Uncomment and modify genre groups in generate_all_phylo_trees.sh:
# GENRE_GROUPS[0]="polca polka valse waltz"
# GENRE_GROUPS[1]="march marchas mazurka mazurcas"
bash generate_all_phylo_trees.sh
Calculate correlations between different distance matrices:
cd folkroot/trees/analysis_utils
python3 compute_matrix_correlation.py \
../genre_analysis/genre_distances_structure_chromatic.xlsx \
../genre_analysis/genre_distances_note_chromatic.xlsx \
--output correlation_analysis.png
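Conceptually, comparing two symmetric distance matrices comes down to correlating their off-diagonal entries. The sketch below uses the Pearson coefficient on the upper triangles; which coefficient `compute_matrix_correlation.py` actually reports is not stated here, so this choice, the function names, and the toy matrices are assumptions.

```python
import math


def upper_triangle(matrix):
    """Entries above the diagonal of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def matrix_correlation(m1, m2):
    """Correlate two symmetric distance matrices, ignoring the diagonal."""
    return pearson(upper_triangle(m1), upper_triangle(m2))


# Toy genre distance matrices from two similarity methods.
structure = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
note = [[0, 3, 7], [3, 0, 5], [7, 5, 0]]
print(matrix_correlation(structure, note))
```

Only the upper triangle is used because the matrices are symmetric and the zero diagonal carries no information about genre relationships.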
A SQLite database named `folkroot.db` is generated in the `folkroot/database/` directory, containing all clustering and alignment results. You can query it for custom analysis. Please refer to the Entity-Relationship (ER) diagram below for further details on the database schema.
To access the database and run queries:
cd folkroot/database
sqlite3 folkroot.db
# Example: View all scores and their genres
SELECT score_id, genre, tradition FROM Score;
# Example: Count scores by genre
SELECT genre, COUNT(*) as count FROM Score GROUP BY genre ORDER BY count DESC;
# Example: View clustering results for a specific feature
SELECT feature, cluster_id, COUNT(*) as segments FROM SegmentCluster WHERE feature = 'chromatic' GROUP BY feature, cluster_id;
# Exit SQLite
.exit
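The same queries can be run from Python with the standard `sqlite3` module. The sketch below uses an in-memory toy table so that it runs anywhere; the `Score` table and its `score_id`, `genre`, and `tradition` columns come from the example queries above, while the column types and the helper name `scores_per_genre` are assumptions. To query the real data, connect to `folkroot/database/folkroot.db` instead of `:memory:` and skip the table creation.

```python
import sqlite3


def scores_per_genre(conn):
    """Count scores per tradition and genre."""
    query = (
        "SELECT tradition, genre, COUNT(*) "
        "FROM Score GROUP BY tradition, genre "
        "ORDER BY tradition, genre"
    )
    return conn.execute(query).fetchall()


# In-memory stand-in for the real database (column types are assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Score (score_id INTEGER PRIMARY KEY, genre TEXT, tradition TEXT)")
conn.executemany(
    "INSERT INTO Score (score_id, genre, tradition) VALUES (?, ?, ?)",
    [(1, "jig", "irish"), (2, "jig", "irish"), (3, "muineiras", "galician")],
)

for tradition, genre, count in scores_per_genre(conn):
    print(tradition, genre, count)
```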
- Genre: A conventional category that identifies pieces of music as belonging to a shared tradition or set of conventions.
- Level: Refers to the different methods of similarity calculation (`note`, `structure`, `shared_segments`, `combined`).
- Phylogenetic Tree: A branching diagram showing the inferred evolutionary relationships among various biological species or other entities, in this case musical scores or genres.
- GSR (Genre Separation Ratio): A metric to evaluate how well a phylogenetic tree separates predefined groups (genres).
- Docker container not starting: Ensure Docker is running correctly on your system.
- SSH connection refused: Verify the container is running (`docker ps`) and the port mapping is correct.
- Errors during script execution: Check that all dependencies are correctly installed in the Docker image.
- The `analyze_genre_distances.py` script supports parallel processing to speed up the analysis of multiple trees. Use the `--processes` argument to specify the number of cores to use.
- The `random_trees_baseline.py` script in the GSR analysis also supports parallelism via the `--num-workers` argument.
- When generating PDF visualizations, the `clustering_visualization.py` script (used by `visualize_all_trees.sh`) automatically uses multiple cores to speed up PDF generation.
If you use this methodology or dataset in your research, please cite our paper:
@inproceedings{romero2025phylo,
title = {Phylo-Analysis of Folk Traditions: A Methodology for the Hierarchical Musical Similarity Analysis},
author = {Hilda Romero-Velo and Gilberto Bernardes and Susana Ladra and Jos{\'e} R. Param{\'a} and Fernando Silva-Coira},
booktitle = {Proceedings of the 26th International Society for Music Information Retrieval Conference (ISMIR)},
year = {2025},
address = {Daejeon, South Korea},
publisher = {ISMIR}
}