Load pipeline results #1196

Open · wants to merge 11 commits into base: master

Conversation

Contributor

@nicolafan nicolafan commented Jan 11, 2023

Link to the relevant feature request

#1192

Description of the Change

Add the classmethod load_from_directory to the class pykeen.pipeline.PipelineResult. The classmethod produces an instance of PipelineResult containing the information stored in a directory produced via save_to_directory.

Not all instance attributes can be reloaded, since not all of them are saved to the directory. The attributes training_loop, training, model and random_seed have been made optional, with None default values. The stopper (already optional) is not reloaded because some of its information is missing. Other attributes such as configuration, version and git_hash are not reloaded either, but keep their default values.
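To make the shape of the proposal concrete, here is a minimal, self-contained sketch (plain Python, not the actual pykeen implementation; the class name, file name, and attribute set are illustrative) of a result object whose unrecoverable attributes default to None on reload:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Optional
import json

@dataclass
class ResultSketch:
    """Stand-in for PipelineResult; only the reload mechanics are shown."""
    metric_results: dict
    # Attributes that are not persisted by save_to_directory default to None.
    model: Optional[Any] = None
    training: Optional[Any] = None
    training_loop: Optional[Any] = None
    random_seed: Optional[int] = None

    def save_to_directory(self, directory: str) -> None:
        path = Path(directory)
        path.mkdir(parents=True, exist_ok=True)
        (path / "results.json").write_text(json.dumps(self.metric_results))

    @classmethod
    def load_from_directory(cls, directory: str) -> "ResultSketch":
        metrics = json.loads((Path(directory) / "results.json").read_text())
        return cls(metric_results=metrics)
```

After a round trip, only the persisted information survives; everything else is None and the caller has to be warned about that.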

Alternatives

I considered subclassing PipelineResult with a new class, ReloadedResult, containing default values for the parameters that could not be restored. However, I think that solution adds too much complexity.

Possible Drawbacks

Users should be warned about the missing information. Making random_seed, training, model and training_loop optional should not create any problems.

Verification Process

Performed tests with tox -e py and tox -e docs-test without any errors.
Tried to save and reload the results obtained via:

pipeline_result = pipeline(
    dataset='Nations',
    model='TransE'
)
pipeline_result.save_to_directory('nations_transe')

with different configurations of the boolean flags, with success. I still have to write some unit tests for the new function.

Implement the class method `load_from_directory` to
reload a `PipelineResult` object
starting from the results saved using `save_to_directory`.
Member

@mberr mberr left a comment


First of all, thanks for the PR. It looks already promising!

I quickly went over the changes and left a few comments.

@mberr
Member

mberr commented Jan 11, 2023

Users should be warned about the missing information.

You can do this in the load function via logging.warn (or better, a module-specific logger instance).
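A sketch of that suggestion (the function name and message are illustrative, not pykeen code): a module-level logger, so the warning carries the module's name instead of going through the root logger:

```python
import logging

# Module-specific logger, as suggested, instead of calling logging.warn directly.
logger = logging.getLogger(__name__)

def warn_missing(missing_attributes):
    """Warn the user which attributes could not be restored from the directory."""
    if missing_attributes:
        logger.warning(
            "Could not restore the following attributes; they default to None: %s",
            ", ".join(sorted(missing_attributes)),
        )
```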

Making the random_seed, training, model and training_loop as optional should not create any problem.
[...]
Other attributes such as configuration, version and git_hash are not reloaded, but their default value has been kept.

Maybe we can try to modify the existing save_to_directory to save some of the "easy" ones, i.e., random_seed, configuration, version and git_hash?

Let me know if you need any help.
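A sketch of saving those "easy" values (the file and function names here are illustrative, not the final pykeen API; configuration is left out because, as the discussion later notes, it can contain objects that are not JSON-serializable):

```python
import json
from pathlib import Path

def save_pipeline_metadata(directory, *, random_seed, version, git_hash):
    """Persist the JSON-friendly result attributes next to the other
    saved artifacts (illustrative layout, not pykeen's actual format)."""
    payload = {"random_seed": random_seed, "version": version, "git_hash": git_hash}
    (Path(directory) / "metadata.json").write_text(json.dumps(payload, indent=2))

def load_pipeline_metadata(directory):
    """Read the metadata back as a plain dict."""
    return json.loads((Path(directory) / "metadata.json").read_text())
```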

nicolafan and others added 2 commits January 11, 2023 21:12
Convert exists check to is_file

Co-authored-by: Max Berrendorf <max.berrendorf@gmail.com>
Convert exists check to is_dir

Co-authored-by: Max Berrendorf <max.berrendorf@gmail.com>
@nicolafan
Contributor Author

Hi @mberr! I think it is a good idea: we could add a new keyword argument to save_to_directory, called save_configuration and defaulting to True, which creates a configuration.json file containing the attributes you indicated. Then, using warnings, we could tell users which attributes of PipelineResult have been initialized to default values, based on the flag configuration they chose for save_to_directory. I'll wait for your opinion :)

@mberr
Member

mberr commented Jan 11, 2023

Hi @mberr! I think it is a good idea: we could add a new keyword argument to save_to_directory, called save_configuration and defaulting to True, which creates a configuration.json file containing the attributes you indicated. Then, using warnings, we could tell users which attributes of PipelineResult have been initialized to default values, based on the flag configuration they chose for save_to_directory. I'll wait for your opinion :)

Sounds good.

There is already a pykeen.pipeline.api.pipeline_from_config, so we may need to either make sure that the "configuration" is in this format, or maybe give it another name, e.g., reduced_configuration or pipeline_configuration to avoid misunderstandings.

@nicolafan
Contributor Author

nicolafan commented Jan 12, 2023

Hello @mberr . I've looked into PipelineResult.config and unfortunately it doesn't store all the possible parameters we could give to pipeline (and so to the dict of pipeline_from_config). So I would say that the two "configurations" are quite different and we should distinguish them as you suggested. I think a good name for saving the attributes config, random_seed, version and git_hash of PipelineResult could be pipeline_info, to avoid reusing the word "configuration".

By the way, I tried to store these attributes in a pipeline_info.json file, but it's not possible to store configuration since it contains functions and complex objects such as the initializers. Do you think we should save it as pipeline_info.pkl, to serialize these objects and restore them, or do you have any other idea?

@mberr
Member

mberr commented Jan 13, 2023

Hello @mberr . I've looked into PipelineResult.config and unfortunately it doesn't store all the possible parameters we could give to pipeline (and so to the dict of pipeline_from_config). So I would say that the two "configurations" are quite different and we should distinguish them as you suggested. I think a good name for saving the attributes config, random_seed, version and git_hash of PipelineResult could be pipeline_info, to avoid reusing the word "configuration".

Sounds good 👍

By the way, I tried to store these attributes in a pipeline_info.json file, but it's not possible to store configuration since it contains functions and complex objects such as the initializers. Do you think we should save it as pipeline_info.pkl, to serialize these objects and restore them, or do you have any other idea?

Hm, that is quite nasty. I think the reason is that at that point we have already resolved names to the respective classes. Using pickle comes with some downsides compared to simple strings, e.g., that re-loading may fail if the code structure changed in between. However, if we keep reloading these parts optional, I would be fine with your proposed solution.
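The trade-off can be shown in a few lines (the initializer stand-in is illustrative; `len` plays the role of an already-resolved callable): json fails on resolved callables, while pickle round-trips them by reference, which is exactly why reloading can break if the code structure changes in between:

```python
import json
import pickle

# After name resolution, the configuration holds callables instead of strings;
# len stands in here for a resolved initializer callable.
config = {"model": "TransE", "entity_initializer": len}

# json cannot serialize callables:
try:
    json.dumps(config)
    raise AssertionError("expected a TypeError")
except TypeError:
    pass

# pickle stores a reference (module plus qualified name) and restores it;
# if that name no longer exists at load time, unpickling fails.
restored = pickle.loads(pickle.dumps(config))
assert restored["entity_initializer"] is len
```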

Add the possibility to save the pipeline_info, a set of
attributes of PipelineResult that were not saved in the previous
implementation of `save_to_directory`.

Reload the pipeline_info.
@nicolafan
Contributor Author

nicolafan commented Jan 15, 2023

Hi @mberr! With these changes we would only miss configuration, training_loop and stopper, in the case where the user keeps True for all the parameters of save_to_directory. That way we avoid using pickle for these three difficult attributes. Otherwise, if we decide we also want to store these three, we could save them separately in an inner directory containing only .pkl files, so that if new PipelineResult attributes are added in the future they can be stored as individual pickle files. Maybe add a parameter full_save=False to save_to_directory, which manages the creation of this directory. I think it depends on how much users need to reuse these attributes.
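A sketch of the proposed layout (the idea of one pickle file per attribute comes from the comment above; all names here are illustrative): hard-to-serialize attributes each get their own .pkl file in an inner directory, so future attributes become additional files rather than a format change:

```python
import pickle
from pathlib import Path

def save_extras(directory, extras):
    """Write each hard-to-serialize attribute to its own .pkl file
    inside an inner 'pickled' directory (illustrative layout)."""
    pkl_dir = Path(directory) / "pickled"
    pkl_dir.mkdir(parents=True, exist_ok=True)
    for name, value in extras.items():
        (pkl_dir / f"{name}.pkl").write_bytes(pickle.dumps(value))

def load_extras(directory):
    """Read every .pkl file back into a name -> value mapping."""
    pkl_dir = Path(directory) / "pickled"
    return {p.stem: pickle.loads(p.read_bytes()) for p in pkl_dir.glob("*.pkl")}
```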

@@ -292,6 +280,18 @@ class PipelineResult(Result):
#: How long in seconds did evaluation take?
evaluate_seconds: float

#: The model trained by the pipeline
Member


Sorry I haven't been following the discussion yet, but I'm pretty opposed to weakening the typing in this class. When people get results from the pipeline, they shouldn't have to check whether the results are there - it should be guaranteed.

This is definitely true for the model and seed. If the training triples don't get loaded properly, people get confused since they need to reference the training triples to do anything valuable. I don't want to make changes that will result in getting more github issues from confused users. Wrt the training loop, this could be a reasonable thing to make optional.

Contributor Author


Hi @cthoyt! Yes, that's perfectly understandable. So maybe we should give up the implementation of this feature, or rethink it completely if users need to reload the PipelineResult, since the current implementation of save_to_directory saves only partial information about the object.
