⚗️🐄 Add proteochemometrics case study #1510


Open

cthoyt wants to merge 79 commits into master

Conversation

@cthoyt (Member) commented Feb 20, 2025

This PR adds the proteochemometrics case study, which can be run with `uv run --script notebooks/proteochemometrics.py` from the root folder. It automatically downloads a few resources, then starts training.

It depends on:

@cthoyt cthoyt changed the title Add proteochemometrics case study ⚗️🐄 Add proteochemometrics case study Feb 20, 2025
@cthoyt (Member, Author) commented Feb 20, 2025

@mberr can you elaborate on how the token representation might be useful for the chemical fingerprints (cf #1509 (comment))?

We don't seem to have any examples, and the documentation in those components is pretty specific to usage in NodePiece.

The chemical representations are 2048-bit Morgan Fingerprints, where (more or less) each bit corresponds to the existence of a substructure in the molecule.
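For illustration, a minimal sketch of how such a fingerprint could be computed with RDKit; the SMILES string and parameters here are just examples, not necessarily what the case study uses:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# example molecule (aspirin); the case study's actual inputs may differ
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# 2048-bit Morgan fingerprint with radius 2 (an ECFP4-like setting)
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

# indices of the 1-bits, i.e. the substructures present in the molecule
on_bits = list(fp.GetOnBits())
```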

@mberr (Member) commented Feb 21, 2025

> @mberr can you elaborate on how the token representation might be useful for the chemical fingerprints (cf #1509 (comment))?
>
> We don't seem to have any examples, and the documentation in those components is pretty specific to usage in NodePiece.
>
> The chemical representations are 2048-bit Morgan Fingerprints, where (more or less) each bit corresponds to the existence of a substructure in the molecule.

In `TokenizationRepresentation`, each entity is assigned a fixed number of tokens. Each token is an ID that is used to obtain token representations. There is also a padding token index to allow for fewer than `num_tokens` actual tokens.

In your case, if the number of possible ones in these fingerprints is reasonably small, you could treat the indices of the 1s as the token indices for a given molecule. This would give you an explicit representation for each of the substructures (in the token representation), and the representation of the molecule would be composed of the representations of its substructures. Overall, this requires more computation than a simple lookup, but you may get some additional benefit from the inductive bias of shared representations for shared chemical substructures.
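A minimal sketch of that conversion (the token budget and the padding index here are assumptions, not values from this PR):

```python
import torch

NUM_BITS = 2048
PADDING_IDX = NUM_BITS  # hypothetical: one extra index reserved for padding
NUM_TOKENS = 32  # assumed fixed token budget per molecule


def fingerprint_to_tokens(bits: torch.Tensor) -> torch.Tensor:
    """Convert a 2048-dim 0/1 fingerprint into a fixed-length token ID vector."""
    on = bits.nonzero(as_tuple=True)[0][:NUM_TOKENS]  # indices of the 1-bits
    pad = torch.full((NUM_TOKENS - len(on),), PADDING_IDX, dtype=torch.long)
    return torch.cat([on, pad])
```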

P.S.: By default, you get an extra dimension by using the `TokenizationRepresentation`, i.e., if your substructure token representation is of shape `shape`, the output of the tokenization representation will be of shape `(num_tokens, *shape)`; you will probably want to reduce this again by a transformation. For simple aggregations, PyTorch actually has a built-in `torch.nn.EmbeddingBag` that may be useful to add.
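A rough sketch of what that could look like (the dimensions are made up):

```python
import torch

# one vector per possible substructure bit; "mean" averages a molecule's tokens,
# so the extra num_tokens dimension is collapsed in a single step
bag = torch.nn.EmbeddingBag(num_embeddings=2048, embedding_dim=64, mode="mean")

# each row holds the 1-bit indices of one molecule's fingerprint
token_ids = torch.tensor([[3, 17, 120, 998], [5, 17, 640, 2000]])
molecule_repr = bag(token_ids)  # shape: (2, 64)
```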

Edit: I added the embedding bag here: #1512

Comment on lines +177 to +178
```python
df = pd.read_csv(path, sep="\t", dtype=str)
return df
```
Member

Suggested change:

```diff
-df = pd.read_csv(path, sep="\t", dtype=str)
-return df
+return pd.read_csv(path, sep="\t", dtype=str)
```

@cthoyt (Member, Author) commented Mar 6, 2025

So at this point, I am still thinking I like the static representations rather than trying to learn an embedding for each token, at least for demo purposes. I think I'd have to develop this further to have a proper evaluation where the choice between a feature-enriched static representation and an embedding bag would make a difference.
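For reference, the static alternative could be sketched as a frozen lookup over a precomputed fingerprint matrix (the shapes and random data here are purely illustrative):

```python
import torch

# hypothetical precomputed fingerprint matrix: one 2048-bit row per molecule
fingerprints = torch.randint(0, 2, (100, 2048)).float()

# static, feature-enriched representation: a lookup that is never updated
static = torch.nn.Embedding.from_pretrained(fingerprints, freeze=True)
features = static(torch.tensor([0, 7]))  # shape: (2, 2048)
```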
