⚗️🐄 Add proteochemometrics case study #1510


Open

cthoyt wants to merge 79 commits into master

Conversation

@cthoyt (Member) commented Feb 20, 2025

This PR adds the proteochemometrics case study, which can be run with `uv run --script notebooks/proteochemometrics.py` from the root folder. It automatically downloads a few resources, then starts training.

It depends on:

@cthoyt cthoyt changed the title Add proteochemometrics case study ⚗️🐄 Add proteochemometrics case study Feb 20, 2025
@cthoyt (Member, Author) commented Feb 20, 2025

@mberr can you elaborate on how the token representation might be useful for the chemical fingerprints (cf #1509 (comment))?

We don't seem to have any examples, and the documentation in those components is pretty specific to usage in NodePiece.

The chemical representations are 2048-bit Morgan Fingerprints, where (more or less) each bit corresponds to the existence of a substructure in the molecule.
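For illustration, a minimal sketch of how such a fingerprint could be computed with RDKit; the SMILES string and parameters here are just examples, not necessarily what the case study uses:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# example molecule (aspirin); the case study's actual inputs may differ
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# 2048-bit Morgan fingerprint with radius 2 (an ECFP4-like setting)
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

# indices of the 1-bits, i.e. the substructures present in the molecule
on_bits = list(fp.GetOnBits())
```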

@mberr (Member) commented Feb 21, 2025

> @mberr can you elaborate on how the token representation might be useful for the chemical fingerprints (cf #1509 (comment))?
>
> We don't seem to have any examples, and the documentation in those components is pretty specific to usage in NodePiece.
>
> The chemical representations are 2048-bit Morgan Fingerprints, where (more or less) each bit corresponds to the existence of a substructure in the molecule.

In `TokenizationRepresentation`, each entity is assigned a fixed number of tokens. Each token is an ID that is used to obtain token representations. There is also a padding token index to allow for fewer than `num_tokens` actual tokens.

In your case, if the number of possible ones in these fingerprints is reasonably small, you could treat the indices of the 1s as the token indices for a given molecule. This would give you an explicit representation for each of the substructures (in the token representation), and the representation of the molecule would be composed of the representations of its substructures. Overall, this requires more computation than a simple lookup, but you may get some additional benefit from the inductive bias of shared representations for shared chemical substructures.
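A minimal sketch of that conversion (the token budget and the padding index here are assumptions, not values from this PR):

```python
import torch

NUM_BITS = 2048
PADDING_IDX = NUM_BITS  # hypothetical: one extra index reserved for padding
NUM_TOKENS = 32  # assumed fixed token budget per molecule


def fingerprint_to_tokens(bits: torch.Tensor) -> torch.Tensor:
    """Convert a 2048-dim 0/1 fingerprint into a fixed-length token ID vector."""
    on = bits.nonzero(as_tuple=True)[0][:NUM_TOKENS]  # indices of the 1-bits
    pad = torch.full((NUM_TOKENS - len(on),), PADDING_IDX, dtype=torch.long)
    return torch.cat([on, pad])
```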

P.S.: By default, you get an extra dimension by using the `TokenizationRepresentation`, i.e., if your substructure token representation is of shape `shape`, the output of the tokenization representation will be of shape `(num_tokens, *shape)`; you will probably want to reduce this again by a transformation. For simple aggregations, PyTorch actually has a built-in `torch.nn.EmbeddingBag` that may be useful to add.
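A rough sketch of what that could look like (the dimensions are made up):

```python
import torch

# one vector per possible substructure bit; "mean" averages a molecule's tokens,
# so the extra num_tokens dimension is collapsed in a single step
bag = torch.nn.EmbeddingBag(num_embeddings=2048, embedding_dim=64, mode="mean")

# each row holds the 1-bit indices of one molecule's fingerprint
token_ids = torch.tensor([[3, 17, 120, 998], [5, 17, 640, 2000]])
molecule_repr = bag(token_ids)  # shape: (2, 64)
```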

Edit: I added the embedding bag here: #1512

Comment on lines +177 to +178
```python
df = pd.read_csv(path, sep="\t", dtype=str)
return df
```
Member

Suggested change:

```diff
-df = pd.read_csv(path, sep="\t", dtype=str)
-return df
+return pd.read_csv(path, sep="\t", dtype=str)
```

@cthoyt (Member, Author) commented Mar 6, 2025

So at this point, I am still thinking I like the static representations rather than trying to learn an embedding for each token, at least for demo purposes. I think I'd have to develop this further to have a proper evaluation where the choice between a feature-enriched static representation and an embedding bag would make a difference.
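For reference, the static alternative could be sketched as a frozen lookup over a precomputed fingerprint matrix (the shapes and random data here are purely illustrative):

```python
import torch

# hypothetical precomputed fingerprint matrix: one 2048-bit row per molecule
fingerprints = torch.randint(0, 2, (100, 2048)).float()

# static, feature-enriched representation: a lookup that is never updated
static = torch.nn.Embedding.from_pretrained(fingerprints, freeze=True)
features = static(torch.tensor([0, 7]))  # shape: (2, 2048)
```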
