-
-
Notifications
You must be signed in to change notification settings - Fork 201
Load pipeline results #1196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Load pipeline results #1196
Conversation
Implement the class method `load_from_directory` to reload a `PipelineResult` object starting from the results saved using `save_to_directory`.
…pykeen into load-pipeline-results
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, thanks for the PR. It looks already promising!
I quickly went over the changes and left a few comments.
You can do this in the load function via logging.warn (or better, a module-specific logger instance).
Maybe we can try to modify the existing Let me know if you need any help. |
Convert exists check to is_file Co-authored-by: Max Berrendorf <max.berrendorf@gmail.com>
Convert exists check to is_dir Co-authored-by: Max Berrendorf <max.berrendorf@gmail.com>
Hi @mberr! I think it is a good idea, we could add a new keyword argument to |
Sounds good. There is already a |
Hello @mberr . I've looked into By the way, I tried to store these attributes in a |
Sounds good 👍
Hm, that is quite nasty. I think the reason is that at that point we have already resolved names to the respective classes. Using pickle comes with some downsides compared to simple strings, e.g., that re-loading may fail if the code structure changed in between. However, if we keep reloading these parts optional, I would be fine with your proposed solution. |
Add the possibility to save the pipeline_info, a set of attributes of PipelineResult that were not saved in the previous implementation of `save_to_directory`. Reload the pipeline_info.
Hi @mberr ! With these changes we would just miss |
@@ -292,6 +280,18 @@ class PipelineResult(Result): | |||
#: How long in seconds did evaluation take? | |||
evaluate_seconds: float | |||
|
|||
#: The model trained by the pipeline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I haven't been following discussion yet, but I'm pretty opposed to weakening the typing in this class. When people get results from the pipeline, they shouldn't have to check for if there are results there - it should be guaranteed.
This is definitely true for the model and seed. If the training triples don't get loaded properly, people get confused since they need to reference the training triples to do anything valuable. I don't want to make changes that will result in getting more github issues from confused users. Wrt the training loop, this could be a reasonable thing to make optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @cthoyt ! Yes, it's perfectly understandable. So, maybe we should give up the implmentation of this feature or rethink it completely if users need to reload the PipelineResult
, since the current implementation of save_to_directory
saves only partial information of the object.
Link to the relevant feature request
#1192
Description of the Change
Add the classmethod
load_from_directory
to the classpykeen.pipeline.PipelineResult
. The class method produces an instance ofPipelineResult
containing the information stored in the directory produced viasave_to_directory
.Not all the instance attributes can be reloaded, since they're not saved to the directory. The attributes
training_loop
,training
,model
andrandom_seed
have been made optional, withNone
default values. Thestopper
(if present) is not reloaded because of some missing information (it was already optional). Other attributes such asconfiguration
,version
andgit_hash
are not reloaded, but their default value has been kept.Alternatives
I considered subclassing
PipelineResult
with a new classReloadedResult
which contained default values for the parameters that could've not been restored. However, I think this solution adds too much complexity.Possible Drawbacks
Users should be warned about the missing information. Making the
random_seed
,training
,model
andtraining_loop
as optional should not create any problem.Verification Process
Performed tests with
tox -e py
andtox -e docs-test
without any error.Tried to save and reload the results obtained via:
with different configurations for the boolean flags, with success. Still have to write some unit tests for the new function.