⚗️🐄 Add proteochemometrics case study #1510
base: master
This fixes two issues. First, the base indices didn't match the bases passed to `super().__init__()`; this is solved by using `start=1` in the enumeration. Second, the construction of the assignment needed two parts: one for the base index and one for the arange.
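To illustrate the two fixes, here is a minimal sketch in plain Python. It is not the code from this PR: `build_assignment` and `reserved_bases` are hypothetical names, and the sketch assumes that one extra base sits in front of the listed ones in the call to `super().__init__()`, which is why enumeration starts at 1.

```python
def build_assignment(base_sizes, reserved_bases=1):
    """Return one (base_index, local_index) pair per entry.

    Assumption: `reserved_bases` bases come before the listed ones,
    so the listed bases are numbered from `reserved_bases` onward.
    """
    assignment = []
    for base_index, size in enumerate(base_sizes, start=reserved_bases):
        # Part 1: the base index, repeated once per entry of that base.
        # Part 2: the local index, i.e. an arange over the base's size.
        assignment.extend((base_index, local_index) for local_index in range(size))
    return assignment


# Two bases with 2 and 3 entries, respectively.
pairs = build_assignment([2, 3])
```

With `start=0`, the first listed base would be assigned index 0 and collide with the assumed extra base, which is the kind of mismatch described above.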
@mberr can you elaborate on how the token representation might be useful for the chemical fingerprints (cf. #1509 (comment))? We don't seem to have any examples, and the documentation in those components is pretty specific to usage in NodePiece. The chemical representations are 2048-bit Morgan fingerprints, where (more or less) each bit corresponds to the existence of a substructure in the molecule.
In your case, if the number of possible ones in these fingerprints is reasonably small, you could treat the indices of the 1s as the token indices for a given molecule. This would give you an explicit representation for each of the substructures (in the token representation), and the representation of the molecule would be composed of the representations of its substructures. Overall, this requires more computation than a simple lookup, but you may get some additional benefit from the inductive bias of shared representations for shared chemical substructures.
P.S.: By default, you get an extra dimension by using the
Edit: I added the embedding bag here: #1512
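A minimal sketch of that idea, using only the standard library so it runs anywhere. The helper names, the toy 8-bit fingerprint, and the sum pooling are assumptions for illustration and not the implementation in #1512; in practice something like `torch.nn.EmbeddingBag` would do the pooled lookup over 2048 bits.

```python
import random


def fingerprint_to_tokens(bits):
    """Indices of the set bits act as token indices for the molecule."""
    return [index for index, bit in enumerate(bits) if bit]


def embed_molecule(tokens, table):
    """Sum the per-substructure vectors (embedding-bag style pooling)."""
    dim = len(table[0])
    pooled = [0.0] * dim
    for token in tokens:
        for d in range(dim):
            pooled[d] += table[token][d]
    return pooled


rng = random.Random(0)
NUM_BITS, DIM = 8, 4  # toy sizes; the fingerprints discussed here have 2048 bits

# One vector per fingerprint bit (random here, learnable in a real model).
table = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_BITS)]

fingerprint = [0, 1, 0, 0, 1, 0, 0, 1]
tokens = fingerprint_to_tokens(fingerprint)
vector = embed_molecule(tokens, table)
```

Two molecules that share a set bit reuse the same row of `table`, which is where the shared-substructure inductive bias comes from.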
This reverts commit dec6193.
df = pd.read_csv(path, sep="\t", dtype=str)
return df
Suggested change:
- df = pd.read_csv(path, sep="\t", dtype=str)
- return df
+ return pd.read_csv(path, sep="\t", dtype=str)
So at this point, I still prefer the static representations over trying to learn an embedding for each token, at least for demo purposes. I think I'd have to develop this further to have a proper evaluation where the choice between a feature-enriched static representation and an embedding bag would make a difference.
This PR adds the proteochemometrics case study, which can be run with
uv run --script notebooks/proteochemometrics.py
from the root folder. It automatically downloads a few resources, then starts training. It depends on: