This project extracts key product data from MongoDB into a Postgres database in order to support faster aggregate calculations.
Make sure you use the same version of Python that is mentioned in the Dockerfile. I used pyenv to install it, which also required installing the build dependencies mentioned in the pyenv wiki. If you switch to a different version of Python you may need to run poetry env use <version> and also update the interpreter path in your IDE.
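For example, with pyenv and Poetry (replace <version> with the version from the Dockerfile):

pyenv install <version>
poetry env use <version>
poetry install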
When running locally the project expects a Postgres database to be available on port 5512 and a Mongo database on port 27017, both on localhost. Running docker-compose will create a suitable Postgres database if needed. The database name can be set in the environment, but the schema name is always "query".
To get started, run the following:
docker compose up -d query_postgres
Alternatively, update the POSTGRES_HOST (and any other necessary) environment variables to reference your existing database. Please use the .envrc file to override settings so that your edits are not committed to the repo.
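For example, a minimal .envrc might look like the following (the value shown is just a placeholder):

export POSTGRES_HOST=my-existing-database-host
# add any other settings you need to override here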
When connecting to a PostgreSQL database running on a Windows host from a WSL2 instance, you will need to enable the PostgreSQL port (5432) in Windows Firewall.
Run the following:
make dev
You can then start in watch mode with:
make watch
The service is exposed on port 5510 to avoid clashing with Robotoff. You can check that the service is running by viewing the health check endpoint.
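For example, assuming the health check is exposed at /health (check the route definitions if your path differs):

curl http://localhost:5510/health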
FastAPI is used to support the REST APIs. Non-blocking I/O is achieved using asyncio, and this influences the choice of PostgreSQL, MongoDB and Redis clients used.
Pydantic is used to model externally visible schemas. The generated OpenAPI documentation can be found here. There is also a redoc version.
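When running locally, and assuming the default FastAPI documentation routes have not been changed, these should be available at:

http://localhost:5510/docs
http://localhost:5510/redoc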
Other than that, reliance on external code is kept to a minimum so that it is easy for someone new to the project to understand what's going on.
This is mainly an SQL-based project so the Python application framework is kept to a minimum. The docstring in the __init__.py in each of the folders describes its purpose.
The entrypoint is main.py, which runs database migrations and starts the service and scheduler.
Most of the logic in this project is in the actual SQL used to update and query data. There has therefore been a conscious decision to avoid using any kind of ORM to hide the SQL from the code.
Pydantic models are only used for outward-facing models to ensure we get good API documentation and validation. Internally, data is mainly passed around using dictionaries or the asyncpg Record structure (which behaves like a dictionary).
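As a rough illustration (not actual project code; the connection details, table and column names below are hypothetical), asyncpg Records can be consumed with dict-style access:

import asyncio

import asyncpg

async def demo() -> None:
    # Connection details and table/column names here are purely illustrative
    conn = await asyncpg.connect(host="localhost", port=5512, database="query", user="postgres")
    try:
        rows = await conn.fetch("SELECT code, last_processed FROM product LIMIT 5")
        for row in rows:
            # asyncpg.Record supports dictionary-style access
            print(row["code"], row["last_processed"])
    finally:
        await conn.close()

asyncio.run(demo())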
The unit tests use testcontainers to create a temporary Postgres database and Redis instance in Docker, which last for the duration of the test run. The tests share the same database while running, so ensure that tests are independent of one another by using randomized product codes / tags.
Tests are mingled in with the project structure to make them easier to find. A TDD approach to development is recommended.
Note that all tests are classified as "unit" tests as no external dependencies are needed. However, we use testcontainers to provide transient instances of PostgreSQL and Redis so that these do not need to be mocked all of the time.
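A rough sketch of that pattern, not taken from the actual test code (fixture and helper names are hypothetical):

import uuid

import pytest
from testcontainers.postgres import PostgresContainer

@pytest.fixture(scope="session")
def postgres_url():
    # A single throwaway Postgres instance is shared by the whole test run
    with PostgresContainer("postgres:16") as postgres:
        yield postgres.get_connection_url()

def random_code() -> str:
    # Randomized product codes keep tests independent of each other
    # even though they all share the same database
    return uuid.uuid4().hex

def test_uses_its_own_product(postgres_url):
    code = random_code()
    # ... insert a product with this code via postgres_url and assert on the query results ...
    assert code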
By default, Product Opener is configured to call the "query" host on the COMMON_NET_NAME network. To configure Product Opener to use a locally running instance, update the following line in the Product Opener .env file:
QUERY_URL=http://host.docker.internal:5510
Note that when uploading scans, the off-query PostgreSQL database username and password must be passed in using basic auth. This can be done by including them in the off-query URL, e.g.
QUERY_URL=http://DATABASE_USER:DATABASE_PASSWORD@query:5510
The project joins the Product Opener COMMON_NET_NAME network.
The project still uses its own Postgres database but will connect to the shared-services Mongo database using the "mongodb" host.
The service is exposed to localhost on 5511 to avoid clashing with any locally running instance.
Use docker compose to start:
make up
Support for new tags can be added by simply including a further entity definition in the product_tags.py file. Extra migration logic will be needed to create the table.
The tag won't be picked up for queries until a full import is done (when it will be added to the loaded_tag table).
The main docker-compose.yml creates the openfoodfacts-query service and associated Postgres database and expects MongoDB to already exist.
The docker-compose-run.yml override explicitly sets the MONGO_URI and REDIS_URL variables to access those services within Docker.
The dev.yml override simply adds the Docker build instruction.
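If you want to combine the files manually rather than using the make targets, the usual Docker Compose override syntax applies, e.g. (the exact combination used by the make targets may differ):

docker compose -f docker-compose.yml -f docker-compose-run.yml up -d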
To refresh the Query Postgres database with the current tags from MongoDB, you can invoke the following in a browser:
http://localhost:5510/importfrommongo?from
The "from" option ensures that an incremental import is performed. If no date is supplied then the query service will look at the latest updated time for products it has already processed and only fetch products from MongoDB that have been updated since then. An explicit date can also be specified in the from parameter, e.g. "from=2023-02-23". If no from parameter is applied then all data in the Postgres database will be deleted and a full import will be performed.
The "count", "aggregate" and "find" POST endpoints accept a MongoDB style filter and aggregate pipeline respectively. Syntax support is only basic and is limited to what Product Opener currently uses. See the tests for some examples of what is supported.
You can test with curl with something like:
# selection
curl -d '{"categories_tags": "en:teas"}' -H "Content-Type: application/json" http://localhost:5510/select
# aggregation
curl -d '[{"$match": {"countries_tags": "en:france"}},{"$group":{"_id":"$brands_tags"}}]' -H "Content-Type: application/json" http://localhost:5510/aggregate
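A filter can be sent to the count endpoint in the same way, assuming it is exposed at /count like the others:

# count
curl -d '{"categories_tags": "en:teas"}' -H "Content-Type: application/json" http://localhost:5510/count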