Load pipeline results #1196

Open · wants to merge 11 commits into base: master

Conversation

Contributor

@nicolafan nicolafan commented Jan 11, 2023

Link to the relevant feature request

#1192

Description of the Change

Add the classmethod load_from_directory to the class pykeen.pipeline.PipelineResult. The classmethod produces an instance of PipelineResult containing the information stored in a directory produced via save_to_directory.

Not all instance attributes can be reloaded, since not all of them are saved to the directory. The attributes training_loop, training, model and random_seed have been made optional, with None default values. The stopper (already optional) is not reloaded because some of its information is missing. Other attributes such as configuration, version and git_hash are not reloaded either, but keep their default values.
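To make the shape of the proposal concrete, here is a minimal, self-contained sketch (plain Python, not the actual pykeen implementation; the class name, file name, and attribute set are illustrative) of a result object whose unrecoverable attributes default to None on reload:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Optional
import json

@dataclass
class ResultSketch:
    """Stand-in for PipelineResult; only the reload mechanics are shown."""
    metric_results: dict
    # Attributes that are not persisted by save_to_directory default to None.
    model: Optional[Any] = None
    training: Optional[Any] = None
    training_loop: Optional[Any] = None
    random_seed: Optional[int] = None

    def save_to_directory(self, directory: str) -> None:
        path = Path(directory)
        path.mkdir(parents=True, exist_ok=True)
        (path / "results.json").write_text(json.dumps(self.metric_results))

    @classmethod
    def load_from_directory(cls, directory: str) -> "ResultSketch":
        metrics = json.loads((Path(directory) / "results.json").read_text())
        return cls(metric_results=metrics)
```

After a round trip, only the persisted information survives; everything else is None and the caller has to be warned about that.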

Alternatives

I considered subclassing PipelineResult with a new class, ReloadedResult, containing default values for the parameters that could not be restored. However, I think that solution adds too much complexity.

Possible Drawbacks

Users should be warned about the missing information. Making random_seed, training, model and training_loop optional should not create any problems.

Verification Process

Performed tests with tox -e py and tox -e docs-test without any errors.
Tried to save and reload the results obtained via:

pipeline_result = pipeline(
    dataset='Nations',
    model='TransE'
)
pipeline_result.save_to_directory('nations_transe')

with different configurations of the boolean flags, with success. I still have to write some unit tests for the new function.

Implement the class method `load_from_directory` to
reload a `PipelineResult` object
starting from the results saved using `save_to_directory`.
Member

@mberr mberr left a comment


First of all, thanks for the PR. It looks already promising!

I quickly went over the changes and left a few comments.

@mberr
Member

mberr commented Jan 11, 2023

Users should be warned about the missing information.

You can do this in the load function via logging.warn (or better, a module-specific logger instance).
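A sketch of that suggestion (the function name and message are illustrative, not pykeen code): a module-level logger, so the warning carries the module's name instead of going through the root logger:

```python
import logging

# Module-specific logger, as suggested, instead of calling logging.warn directly.
logger = logging.getLogger(__name__)

def warn_missing(missing_attributes):
    """Warn the user which attributes could not be restored from the directory."""
    if missing_attributes:
        logger.warning(
            "Could not restore the following attributes; they default to None: %s",
            ", ".join(sorted(missing_attributes)),
        )
```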

Making the random_seed, training, model and training_loop as optional should not create any problem.
[...]
Other attributes such as configuration, version and git_hash are not reloaded, but their default value has been kept.

Maybe we can try to modify the existing save_to_directory to save some of the "easy" ones, i.e., random_seed, configuration, version and git_hash?

Let me know if you need any help.
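A sketch of saving those "easy" values (the file and function names here are illustrative, not the final pykeen API; configuration is left out because, as the discussion later notes, it can contain objects that are not JSON-serializable):

```python
import json
from pathlib import Path

def save_pipeline_metadata(directory, *, random_seed, version, git_hash):
    """Persist the JSON-friendly result attributes next to the other
    saved artifacts (illustrative layout, not pykeen's actual format)."""
    payload = {"random_seed": random_seed, "version": version, "git_hash": git_hash}
    (Path(directory) / "metadata.json").write_text(json.dumps(payload, indent=2))

def load_pipeline_metadata(directory):
    """Read the metadata back as a plain dict."""
    return json.loads((Path(directory) / "metadata.json").read_text())
```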

nicolafan and others added 2 commits January 11, 2023 21:12
Convert exists check to is_file

Co-authored-by: Max Berrendorf <max.berrendorf@gmail.com>
Convert exists check to is_dir

Co-authored-by: Max Berrendorf <max.berrendorf@gmail.com>
@nicolafan
Contributor Author

Hi @mberr! I think it is a good idea: we could add a new keyword argument to save_to_directory, called save_configuration and defaulting to True, which creates a configuration.json file containing the attributes you indicated. Then, using warnings, we could tell users which attributes of PipelineResult have been initialized to default values, based on the flag configuration they chose for save_to_directory. I'll wait for your opinion :)

@mberr
Member

mberr commented Jan 11, 2023

Hi @mberr! I think it is a good idea: we could add a new keyword argument to save_to_directory, called save_configuration and defaulting to True, which creates a configuration.json file containing the attributes you indicated. Then, using warnings, we could tell users which attributes of PipelineResult have been initialized to default values, based on the flag configuration they chose for save_to_directory. I'll wait for your opinion :)

Sounds good.

There is already a pykeen.pipeline.api.pipeline_from_config, so we may need to either make sure that the "configuration" is in this format, or maybe give it another name, e.g., reduced_configuration or pipeline_configuration to avoid misunderstandings.

@nicolafan
Contributor Author

nicolafan commented Jan 12, 2023

Hello @mberr . I've looked into PipelineResult.config and unfortunately it doesn't store all the possible parameters we could give to pipeline (and so to the dict of pipeline_from_config). So I would say that the two "configurations" are quite different and we should distinguish them as you suggested. I think a good name for saving the attributes config, random_seed, version and git_hash of PipelineResult could be pipeline_info, to avoid reusing the word "configuration".

By the way, I tried to store these attributes in a pipeline_info.json file, but it's not possible to store configuration since it contains functions and complex objects such as the initializers. Do you think we should save it as pipeline_info.pkl, to serialize these objects and restore them, or do you have any other idea?

@mberr
Member

mberr commented Jan 13, 2023

Hello @mberr . I've looked into PipelineResult.config and unfortunately it doesn't store all the possible parameters we could give to pipeline (and so to the dict of pipeline_from_config). So I would say that the two "configurations" are quite different and we should distinguish them as you suggested. I think a good name for saving the attributes config, random_seed, version and git_hash of PipelineResult could be pipeline_info, to avoid reusing the word "configuration".

Sounds good 👍

By the way, I tried to store these attributes in a pipeline_info.json file, but it's not possible to store configuration since it contains functions and complex objects such as the initializers. Do you think we should save it as pipeline_info.pkl, to serialize these objects and restore them, or do you have any other idea?

Hm, that is quite nasty. I think the reason is that at that point we have already resolved names to the respective classes. Using pickle comes with some downsides compared to simple strings, e.g., that re-loading may fail if the code structure changed in between. However, if we keep reloading these parts optional, I would be fine with your proposed solution.
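The trade-off can be shown in a few lines (the initializer stand-in is illustrative; `len` plays the role of an already-resolved callable): json fails on resolved callables, while pickle round-trips them by reference, which is exactly why reloading can break if the code structure changes in between:

```python
import json
import pickle

# After name resolution, the configuration holds callables instead of strings;
# len stands in here for a resolved initializer callable.
config = {"model": "TransE", "entity_initializer": len}

# json cannot serialize callables:
try:
    json.dumps(config)
    raise AssertionError("expected a TypeError")
except TypeError:
    pass

# pickle stores a reference (module plus qualified name) and restores it;
# if that name no longer exists at load time, unpickling fails.
restored = pickle.loads(pickle.dumps(config))
assert restored["entity_initializer"] is len
```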

Add the possibility to save the pipeline_info, a set of
attributes of PipelineResult that were not saved in the previous
implementation of `save_to_directory`.

Reload the pipeline_info.
@nicolafan
Contributor Author

nicolafan commented Jan 15, 2023

Hi @mberr! With these changes we would only miss configuration, training_loop and stopper, in the case where the user keeps True for all the parameters of save_to_directory. That way we avoid using pickle for these three difficult attributes. Otherwise, if we decide we also want to store these three, we could save them separately in an inner directory containing only .pkl files, so that if new PipelineResult attributes are added in the future they can be stored as individual pickle files. Maybe add a parameter full_save=False to save_to_directory, which manages the creation of this directory. I think it depends on how much users need to reuse these attributes.
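A sketch of the proposed layout (the idea of one pickle file per attribute comes from the comment above; all names here are illustrative): hard-to-serialize attributes each get their own .pkl file in an inner directory, so future attributes become additional files rather than a format change:

```python
import pickle
from pathlib import Path

def save_extras(directory, extras):
    """Write each hard-to-serialize attribute to its own .pkl file
    inside an inner 'pickled' directory (illustrative layout)."""
    pkl_dir = Path(directory) / "pickled"
    pkl_dir.mkdir(parents=True, exist_ok=True)
    for name, value in extras.items():
        (pkl_dir / f"{name}.pkl").write_bytes(pickle.dumps(value))

def load_extras(directory):
    """Read every .pkl file back into a name -> value mapping."""
    pkl_dir = Path(directory) / "pickled"
    return {p.stem: pickle.loads(p.read_bytes()) for p in pkl_dir.glob("*.pkl")}
```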

@@ -292,6 +280,18 @@ class PipelineResult(Result):
#: How long in seconds did evaluation take?
evaluate_seconds: float

#: The model trained by the pipeline
Member


Sorry I haven't been following the discussion yet, but I'm pretty opposed to weakening the typing in this class. When people get results from the pipeline, they shouldn't have to check whether the results are there - it should be guaranteed.

This is definitely true for the model and seed. If the training triples don't get loaded properly, people get confused since they need to reference the training triples to do anything valuable. I don't want to make changes that will result in getting more github issues from confused users. Wrt the training loop, this could be a reasonable thing to make optional.

Contributor Author


Hi @cthoyt! Yes, that's perfectly understandable. So maybe we should give up the implementation of this feature, or rethink it completely if users need to reload the PipelineResult, since the current implementation of save_to_directory saves only partial information about the object.
