Bug Report
Description
For certain tables in which the `make()` function can take a long time, it is often desirable to ensure that only a small number of keys are handled before moving on to other tables.
DataJoint provides two ways to implement this, which can behave unexpectedly differently for reasons that can only be found by examining the DataJoint source code in detail:

- `table.populate(limit=x)`
- `table.populate(max_calls=x)`
I believe that the way the `limit` keyword currently operates aligns neither with its implied meaning (based on what the word "limit" means in the English language) nor with how it is described in the docstring (`:param limit: if not None, check at most this many keys`).
Within `AutoPopulate.populate` (https://github.com/datajoint/datajoint-python/blob/master/datajoint/autopopulate.py#L153), the handling goes as follows (sketched in code after this list):

- Determine the set of incomplete keys for `table` (L212)
- Fetch up to `limit` keys from this set
- If `reserve_jobs`, remove any keys from that group that already exist in the `jobs` table (L218)
- If `max_calls`, truncate the list of remaining keys to `max_calls`
- For each remaining key, call `table.make(key)` (L248)
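A simplified sketch of that ordering (illustrative names only, not the actual DataJoint source):

```python
# Illustrative sketch of the ordering inside populate(); `todo_keys`,
# `reserved_keys`, and `make` are stand-ins, not DataJoint internals.
def populate_sketch(todo_keys, reserved_keys, make,
                    limit=None, max_calls=None, reserve_jobs=False):
    keys = todo_keys[:limit]  # `limit` is applied FIRST, before the jobs-table check
    if reserve_jobs:
        keys = [k for k in keys if k not in reserved_keys]  # jobs-table filter
    keys = keys[:max_calls]   # `max_calls` is applied AFTER the jobs-table check
    for key in keys:
        make(key)

# With limit=1 and key 0 already reserved by another worker, nothing runs,
# even though keys 1..9 are available:
populate_sketch(list(range(10)), {0}, print, limit=1, reserve_jobs=True)
```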
When `limit` is small (equal to or smaller than the number of workers that may be executing `table.populate(reserve_jobs=True, limit=x)` simultaneously), this can result in a situation in which there are many potential keys to be processed, yet each additional worker is handed a set of keys that are all already in the `schema.jobs` table, so its set is reduced to empty before it ever reaches the `table.make(key)` part of the `.populate()` call.
The "correct" answer to this case is to use max_calls
instead, as the limit will then apply after checking whether keys are already in progress (or error, or any other form of unavailability due to the jobs table).
However, today marks the 4th time I have spent >1h chasing down weird behaviour where multiple workers were idling and ignoring available work because I misunderstood or misremembered the meaning of this keyword.
I have several suggestions for fixing this behaviour, depending on the degree to which it is intentional/desired:

1. Assuming it is unintentional, make `limit` and `max_calls` equivalent by moving the `limit` handling within `AutoPopulate.populate` to after the jobs-table check instead of before it (see the sketch after this list).
2. Change the logic for determining the set of available keys to take keys already in flight into account (i.e. keys listed in the jobs table with whatever status). This would combine well with issue #873 (Provide a way to list specific jobs), although in practice it has the same outcome as (1).
3. If it is intentional, change the keyword to something that better reflects its actual role. To me, the literal meaning of the word "limit" is much more closely aligned with the logical behaviour of the keyword `max_calls`, and vice versa. It is not intuitive to me that `limit` applies before ignoring keys that are already in progress - I anticipate (every single time!) that keys already errored/reserved are excluded from the set of keys available to be processed, at least when the `reserve_jobs` flag is set.
4. Change the docstring to better reflect the actual behaviour. This involves the smallest compatibility or behaviour change for everyone who already uses this logic, but it still falls victim to the point I make in (3). I would suggest something like `if not None, attempt to process this many keys, which may include keys already reserved`.
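For suggestion (1), a minimal sketch of the reordering, reusing the illustrative names from the sketch above (again, not a patch against the actual source):

```python
# Sketch of suggestion (1): apply `limit` after the jobs-table filter,
# so it caps the keys actually available for processing.
def populate_sketch_reordered(todo_keys, reserved_keys, make,
                              limit=None, max_calls=None, reserve_jobs=False):
    keys = todo_keys
    if reserve_jobs:
        keys = [k for k in keys if k not in reserved_keys]  # jobs-table filter first
    keys = keys[:limit]       # moved: `limit` now behaves like `max_calls`
    keys = keys[:max_calls]
    for key in keys:
        make(key)

# The same scenario as before now processes key 1 instead of idling:
populate_sketch_reordered(list(range(10)), {0}, print, limit=1, reserve_jobs=True)
```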
Reproducibility
Environment:
- Any OS
- Any Python version
- Any MySQL version
- DataJoint <= 0.14.3 (not tested with later versions, but the problematic logic still exists in `master` as of the posting date)
Create two tables: one listing keys, the other representing some long-running job:
```python
# my_test_schema.py
import time

import datajoint as dj

schema = dj.schema("test_schema")


@schema
class TableOne(dj.Manual):
    definition = """
    id: int
    ---
    """
    contents = [(i,) for i in range(10)]  # rows must be tuples, not bare ints


@schema
class TableTwo(dj.Computed):
    definition = """
    -> TableOne
    ---
    some_value: int
    """

    def make(self, key):
        print(f"working on {key}\n")
        time.sleep(30)  # simulate a long-running job
        self.insert1({**key, "some_value": key["id"]})
```
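Running the reproduction assumes a reachable MySQL server; connection settings along these lines are expected (host/user/password values here are placeholders):

```python
import datajoint as dj

# placeholder credentials - adjust to your environment
dj.config["database.host"] = "localhost"
dj.config["database.user"] = "root"
dj.config["database.password"] = "simple"
```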
Create a script to populate `TableTwo`:
```python
# my_test_script.py
import os
import time

from my_test_schema import *


def test():
    pid = os.getpid()
    while True:
        in_flight_keys = (schema.jobs & {"table_name": TableTwo.table_name}).fetch("key")
        available_keys = TableTwo.key_source - TableTwo - in_flight_keys
        print(f"{pid} : available keys: {len(available_keys)}")
        TableTwo.populate(reserve_jobs=True, limit=1)
        time.sleep(1)  # reduce screen spam


if __name__ == "__main__":
    test()
```
Run at least two copies of `my_test_script.py` simultaneously. The output will look something like this:
```
1 : available keys: 10
working on {"id": 0}
2 : available keys: 9
2 : available keys: 9
2 : available keys: 9
2 : available keys: 9
...
```
i.e. the second script will rapidly spin through the `while` loop, ignoring the 9 other available jobs while the first actually does something (in this test example, sleeping). There is no way to predict which process will pick up the second job once the first is complete, but the same pattern will play out - the other script idling.
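As noted above, switching the populate call in the script to `max_calls` avoids the idling, because that cap is applied after the jobs-table filter:

```python
# in my_test_script.py, cap after the jobs check instead of before it
TableTwo.populate(reserve_jobs=True, max_calls=1)
```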
Expected Behavior
I expect the `limit` keyword to behave the way that the `max_calls` keyword actually does.

Alternatively, I expect either a choice of keyword that better reflects the actual logical behaviour, or a more descriptive docstring that better explains it.
It should not be necessary to examine the source code of the `autopopulate.py` component of the DataJoint library to understand the difference between the `limit` and `max_calls` keywords and why one should choose one over the other.