Training speed with dd.train_d2_faster_rcnn() #160

yflam1 · 2023-06-20T01:54:51Z

yflam1
Jun 20, 2023

Hi. I am new to this library. I have a dataset of formal letters and would like to use deepdoctection to extract different kinds of text (such as address, sender, date, main paragraphs). For testing purposes, I used Label Studio to label 20 images in COCO format (I used pdf2image and opencv to convert the PDFs to grayscale and binary images). The labels are simply ["figure", "list", "table", "text", "title"]. I put my labeled images in ~/.cache/deepdoctection/datasets/dataset in the following manner:

dataset/
    |-- train
        |-- 01.png
        |-- 02.png
        |-- ...
    |-- train.json

Then I used the following script to fine-tune the given layout detection model:

import os
import deepdoctection as dd

_NAME = "dataset"
_DESCRIPTION = "Labeled images"
_SPLITS = {"train": "/train"}
_CATEGORIES = ["figure", "list", "table", "text", "title"]
_LOCATION = "dataset"
_ANNOTATION_FILES = {"train": "train.json"}


class CustomDataFlowBuilder(dd.DataFlowBaseBuilder):
    def build(self, **kwargs):
        path = self.get_workdir() / _ANNOTATION_FILES["train"]
        df = dd.SerializerCoco.load(path)
        coco_mapper = dd.coco_to_image(
            self.categories.get_categories(init=True), load_image=True,
            filter_empty_image=True, fake_score=False)
        df = dd.MapData(df, coco_mapper)
        return df


class CustomDataset(dd.DatasetBase):
    @classmethod
    def _info(cls):
        return dd.DatasetInfo(name=_NAME, description=_DESCRIPTION, splits=_SPLITS)

    def _categories(self):
        return dd.DatasetCategories(init_categories=_CATEGORIES)

    def _builder(self):
        return CustomDataFlowBuilder(location=_LOCATION, annotation_files=_ANNOTATION_FILES)


cfg = dd.set_config_by_yaml("path/to/conf_dd_one.yaml")

dataset = CustomDataset()

config_yaml_path = dd.ModelCatalog.get_full_path_configs(cfg.CONFIG.D2LAYOUT)
weights_path = dd.ModelCatalog.get_full_path_weights(cfg.WEIGHTS.D2LAYOUT)
categories = dd.ModelCatalog.get_profile(cfg.WEIGHTS.D2LAYOUT).categories
layout_detector = dd.D2FrcnnDetector(config_yaml_path, weights_path, categories, device=cfg.DEVICE)
layout_service = dd.ImageLayoutService(layout_detector)

coco_metric = dd.get_metric("coco")

config_overwrite=["SOLVER.MAX_ITER=100000",
                  "TEST.EVAL_PERIOD=20000",
                  "SOLVER.CHECKPOINT_PERIOD=20000",
                  "MODEL.BACKBONE.FREEZE_AT=0",
                  "SOLVER.BASE_LR=1e-3"]

build_train_config = ["max_datapoints=86000"]

dd.train_d2_faster_rcnn(
    path_config_yaml=config_yaml_path,
    dataset_train=dataset,
    path_weights=weights_path,
    config_overwrite=config_overwrite,
    log_dir="train_log",
    build_train_config=build_train_config,
    dataset_val=dataset,
    build_val_config=None,
    metric=coco_metric,
    pipeline_component_name="ImageLayoutService")

Here is my conf_dd_one.yaml:

CONFIG:
  D2LAYOUT: dd/d2/layout/CASCADE_RCNN_R_50_FPN_GN.yaml
  D2CELL: dd/d2/cell/CASCADE_RCNN_R_50_FPN_GN.yaml
  D2ITEM: dd/d2/item/CASCADE_RCNN_R_50_FPN_GN.yaml
WEIGHTS:
  D2LAYOUT: layout/d2_model_0829999_layout_inf_only.pt
  D2CELL: cell/d2_model_1849999_cell_inf_only.pt
  D2ITEM: item/d2_model_1639999_item_inf_only.pt
LAYOUT_NMS_PAIRS:
  COMBINATIONS:
    - - text
      - table
    - - title
      - table
    - - text
      - list
    - - title
      - list
    - - text
      - title
    - - list
      - table
  THRESHOLDS:
    - 0.005
    - 0.005
    - 0.542
    - 0.1
    - 0.699
    - 0.01
SEGMENTATION:
  ASSIGNMENT_RULE: ioa
  IOA_THRESHOLD_ROWS: 0.4
  IOA_THRESHOLD_COLS: 0.4
  IOU_THRESHOLD_ROWS: 0.01
  IOU_THRESHOLD_COLS: 0.001
  REMOVE_IOU_THRESHOLD_ROWS: 0.001
  REMOVE_IOU_THRESHOLD_COLS: 0.001
  FULL_TABLE_TILING: True
  STRETCH_RULE: left
  USE_REFINEMENT: False
WORD_MATCHING:
  PARENTAL_CATEGORIES:
    - text
    - title
    - cell
    - list
    - figure
  CHILD_CATEGORIES:
    - word
  RULE: ioa
  IOA_THRESHOLD: 0.6
  IOU_THRESHOLD: 0.001
  MAX_PARENT_ONLY: False
TEXT_ORDERING:
  TEXT_CONTAINER: word
  FLOATING_TEXT_BLOCK:
    - title
    - text
    - list
    - figure
  TEXT_BLOCK:
    - title
    - text
    - list
    - cell
    - figure
    - header
    - body
  TEXT_CONTAINER_TO_TEXT_BLOCK: True
DEVICE: cuda

However, when I run the script, it seems to freeze at "Starting training from iteration 0". I have 2 Nvidia Quadro GPUs, and watch -n 1 nvidia-smi shows that they are not being utilized, while top shows CPU usage at 100%. Is this performance normal?

JaMe76 · 2023-06-20T07:05:14Z

JaMe76
Jun 20, 2023
Maintainer

Thank you for your answers.

The performance is not normal, and it sounds like PyTorch does not recognize that GPUs are available.

Try:

from deepdoctection.extern.pt.ptutils import get_num_gpu

print(get_num_gpu())

If this returns 0, PyTorch cannot connect to your GPU. Most of the time this happens because the installed PyTorch version does not match the CUDA version.
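A minimal sketch of the same check done directly through PyTorch, assuming torch is installed in the same environment as deepdoctection. The helper name count_visible_gpus is made up for illustration; torch.cuda.is_available(), torch.cuda.device_count() and torch.version.cuda are standard PyTorch attributes.

```python
def count_visible_gpus() -> int:
    """Return the number of CUDA devices PyTorch can see (0 if none)."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment at all.
        return 0
    if not torch.cuda.is_available():
        return 0
    return torch.cuda.device_count()


if __name__ == "__main__":
    print("visible GPUs:", count_visible_gpus())
    try:
        import torch
        # CUDA version this PyTorch build was compiled against
        # (None for CPU-only builds).
        print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)
    except ImportError:
        print("torch is not installed")
```

If the CUDA version printed here differs from what the installed driver supports (see the header of nvidia-smi), a mismatch between the PyTorch build and CUDA is a likely suspect.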

Another thing worth checking is whether your dataset actually streams data from its dataflow:

    df = dataset.dataflow.build(split="train")
    df.reset_state()
    for dp in df:
        print(dp.file_name)
        print(dp.image)  # should be some np.array as you set load_image=True

This should stream through all your datapoints.

Finally, the config file is only needed at these points in your training script:

... = dd.ModelCatalog.get_full_path_configs(cfg.CONFIG.D2LAYOUT)
... = dd.ModelCatalog.get_full_path_weights(cfg.WEIGHTS.D2LAYOUT)
... = dd.ModelCatalog.get_profile(cfg.WEIGHTS.D2LAYOUT).categories
... = dd.D2FrcnnDetector(config_yaml_path, weights_path, categories, device=cfg.DEVICE)

and you can replace the cfg... entries with their corresponding values from the .yaml file. This removes some distraction and keeps the focus on the settings that are relevant to your training script. (But this, of course, is only a matter of taste.)

What really matters during training is the config of the layout model specified by dd/d2/layout/CASCADE_RCNN_R_50_FPN_GN.yaml

7 replies

yflam1 Jun 21, 2023
Author

The output of

df = dataset.dataflow.build()
df.reset_state()
dp = next(iter(df))
print(dp)

is:

Image(file_name='something.png', location=PosixPath('/home/xxx/.cache/deepdoctection/datasets/dataset/train/something.png'), 
document_id='4385125b-dd1e-3025-880f-3311517cc8d5', _image_id='4385125b-dd1e-3025-880f-3311517cc8d5', 
embeddings={'4385125b-dd1e-3025-880f-3311517cc8d5': BoundingBox(absolute_coords=True, ulx=0.0, uly=0.0, lrx=3307.0, lry=4678.0, height=4678.0, width=3307.0)}, 
annotations=[
  ImageAnnotation(active=True, _annotation_id='4385125b-dd1e-3025-880f-3311517cc8d5', category_name=<LayoutType.list>, _category_name=<LayoutType.list>, category_id='2', score=None, sub_categories={}, relationships={}, bounding_box=BoundingBox(absolute_coords=True, ulx=285.8552885062406, uly=1011.6478484553409, lrx=2972.496794812193, lry=2204.71896849945, height=1193.0711200441092, width=2686.6415063059526)), 
  ImageAnnotation(active=True, _annotation_id='afd0b036-625a-3aa8-b639-9dc8c8fff0ff', category_name=<LayoutType.list>, _category_name=<LayoutType.list>, category_id='2', score=None, sub_categories={}, relationships={}, bounding_box=BoundingBox(absolute_coords=True, ulx=281.7246795169808, uly=2307.4730976429623, lrx=2972.4967948121916, lry=3265.708384721964, height=958.2352870790019, width=2690.772115295211)), 
  ImageAnnotation(active=True, _annotation_id='9c45c2f1-1761-3daa-ad31-1ff8703ae846', category_name=<LayoutType.figure>, _category_name=<LayoutType.figure>, category_id='1', score=None, sub_categories={}, relationships={}, bounding_box=BoundingBox(absolute_coords=True, ulx=546.926923076923, uly=3366.887074829933, lrx=1392.755769230769, lry=3946.0680272108857, height=579.1809523809525, width=845.8288461538459))])

Note that the first two categories should be <LayoutType.table> while the last category should be <LayoutType.list>.

JaMe76 Jun 21, 2023
Maintainer

Once you pass your list of categories to the CustomDataset, it assigns each category a category_id given by its position in the list, starting at 1:

dataset= dd.CustomDataset(
    "dataset",
    dd.DatasetType.object_detection,
    "dataset",
    [dd.LayoutType.figure,
     dd.LayoutType.list,
     dd.LayoutType.table,
     dd.LayoutType.text,
     dd.LayoutType.title],
    CustomDataFlowBuilder)

categories = dataset.dataflow.categories.get_categories(init=True) # {'1': <LayoutType.figure>, '2': <LayoutType.list>, '3': <LayoutType.table>, '4': <LayoutType.text>,'5':<LayoutType.title>}

Looking at the relevant part of coco_to_image, each annotation from your raw JSON is transformed into an ImageAnnotation.

            annotation = ImageAnnotation(
                category_name=categories[str(ann["category_id"])],
                bounding_box=bbox,
                category_id=ann["category_id"],
                score=maybe_get_fake_score(fake_score),
                external_id=ann["id"],
            )

If an annotation in your raw JSON has category_id 0, the lookup categories[str(ann["category_id"])] raises a KeyError and the image is filtered out, because 0 is not an eligible category_id in deepdoctection.
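To see which annotations would hit that KeyError before training, one can scan the raw COCO JSON for category ids that have no entry in the categories dict. This is only a sketch: the helper name unknown_category_ids and the toy data are made up for illustration, and the categories dict uses plain strings in place of LayoutType members.

```python
from collections import Counter


def unknown_category_ids(coco: dict, categories: dict) -> Counter:
    """Count annotations whose category_id has no entry in `categories`.

    `categories` is keyed by strings ({'1': ..., '2': ...}), so ids are
    compared as str(category_id), mirroring the lookup in coco_to_image.
    """
    counts = Counter()
    for ann in coco.get("annotations", []):
        key = str(ann["category_id"])
        if key not in categories:
            counts[key] += 1
    return counts


# Toy data: two annotations use category_id 0, which has no entry.
coco = {
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 0},
        {"id": 2, "image_id": 1, "category_id": 2},
        {"id": 3, "image_id": 2, "category_id": 0},
    ]
}
categories = {"1": "figure", "2": "list", "3": "table", "4": "text", "5": "title"}
print(unknown_category_ids(coco, categories))  # Counter({'0': 2})
```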

To circumvent this issue, I suggest re-indexing your category ids (you could do this in your preprocessing step overwrite_filename) so that they align with your categories dict.
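One possible way to do the re-indexing, sketched on a plain COCO dict. The function name reindex_categories is hypothetical, and the sketch assumes every category name in the JSON appears in the list you pass to the dataset (otherwise it raises a KeyError).

```python
import copy


def reindex_categories(coco: dict, ordered_names: list) -> dict:
    """Return a copy of `coco` whose category ids are 1..N in `ordered_names` order."""
    new_id = {name: idx for idx, name in enumerate(ordered_names, start=1)}
    # Map old id -> new id, resolved through the category name.
    old_to_new = {cat["id"]: new_id[cat["name"]] for cat in coco["categories"]}

    out = copy.deepcopy(coco)
    for cat in out["categories"]:
        cat["id"] = old_to_new[cat["id"]]
    out["categories"].sort(key=lambda cat: cat["id"])
    for ann in out["annotations"]:
        ann["category_id"] = old_to_new[ann["category_id"]]
    return out


names = ["figure", "list", "table", "text", "title"]
raw = {
    # Toy input with ids starting at 0.
    "categories": [{"id": i, "name": n} for i, n in enumerate(names)],
    "annotations": [{"id": 1, "category_id": 0}, {"id": 2, "category_id": 4}],
}
fixed = reindex_categories(raw, names)
print([ann["category_id"] for ann in fixed["annotations"]])  # [1, 5]
```

The input dict is left untouched, so the result can be written to a new file with json.dump and compared against the original.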

Regarding your VRAM issue, this is not caused by your dataset. Looking at your logs, all images except one seem to have been filtered.

yflam1 Jun 21, 2023
Author

Much thanks! I changed the categories key in my train.json to

"categories": [
  {
    "id": 5,
    "name": "figure"
  },
  {
    "id": 3,
    "name": "list"
  },
  {
    "id": 4,
    "name": "table"
  },
  {
    "id": 1,
    "name": "text"
  },
  {
    "id": 2,
    "name": "title"
  }
]

and changed the corresponding category_id of each annotation. Now I get

[0621 15:03.53 @maputils.py:222]  INF  Ground-Truth category distribution:
 |  category  | #box   |  category  | #box   |  category  | #box   |
|:----------:|:-------|:----------:|:-------|:----------:|:-------|
|    text    | 546    |   title    | 35     |    list    | 49     |
|   table    | 5      |   figure   | 151    |            |        |
|   total    | 786    |            |        |            |        |

But unfortunately, I still get the CUDA out of memory error. Perhaps because I am using torch=1.8.1+cu101?

JaMe76 Jun 21, 2023
Maintainer

Check #162 for CUDA OOM

Answer selected by yflam1

ShivaGurrala94 · 2023-08-09T19:22:43Z

ShivaGurrala94
Aug 9, 2023

When I tried the following code on the PubLayNet dataset

df = dataset.dataflow.build()
df.reset_state()
dp = next(iter(df))

or

    df = dataset.dataflow_builder.build(split="train")
    df.reset_state()
    for dp in df:
        print(dp.file_name)
        print(dp.image)  # should be some np.array as you set load_image=True

it gives the following warning message:

[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image
[0809 17:57.33 @maputils.py:89]  WRN  MappingContextManager error. Will filter image

During training, the DatasetAdapter() call creates the dataset object, and its init method contains a similar loop, "for dp in df:".
In both places I observed that the loop runs only once and somehow seems to trigger all the mapping util calls at the same time. I see that datapoints.append(dp) ends up with 0 datapoints.

I checked whether deepdoctection can find the GPU using the following code, and it prints 1. I am not sure why it is picking the CPU, though. During this init call I see CPU memory reaching 100%.

from deepdoctection.extern.pt.ptutils import get_num_gpu

print(get_num_gpu())

3 replies

JaMe76 Aug 10, 2023
Maintainer

DatasetAdapter buffers datapoints, so that annotations are loaded into memory. The for-loop in DatasetAdapter is not the training loop.
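As a rough illustration of why the buffer can end up empty, here is a simplified stand-in for "map each datapoint, filter the ones that fail, buffer the rest". It is not deepdoctection's code; load_or_none and buffer_datapoints are made-up names, and the set of available files replaces real image loading.

```python
def load_or_none(path: str, available: set):
    """Stand-in for a mapper: return a datapoint, or None if loading fails."""
    try:
        if path not in available:
            raise FileNotFoundError(path)
        return {"file_name": path}
    except FileNotFoundError:
        print(f"WRN  error while mapping {path}. Will filter image")
        return None


def buffer_datapoints(paths, available):
    """Collect every datapoint that survives mapping into a list."""
    datapoints = []
    for path in paths:
        dp = load_or_none(path, available)
        if dp is not None:  # filtered datapoints never reach the buffer
            datapoints.append(dp)
    return datapoints


# If no image can be loaded, every datapoint is filtered and the buffer is empty.
print(len(buffer_datapoints(["a.png", "b.png"], available=set())))      # 0
print(len(buffer_datapoints(["a.png", "b.png"], available={"a.png"})))  # 1
```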

It looks like all images are being filtered because they do not pass the quality check. When this happens for every image, the cause may be that the dataset is not properly installed. Does Publaynet follow this folder structure:

publaynet
├── test
│ ├── PMC_1.png
├── train
│ ├── PMC_2.png
├── val
│ ├── PMC_3.png
├── train.json
├── val.json

Are images downloaded as well?

ShivaGurrala94 Aug 10, 2023

I downloaded a few of the datasets mentioned in the notebooks, copied them to the ~/.cache/deepdoctection/datasets/ folder, and made sure the datasets have this structure:

Custom_dataset
├── test
│ ├── PMC_1.png
├── train
│ ├── PMC_2.png
├── val
│ ├── PMC_3.png
├── train.json
├── val.json

But the error persists. I tried debugging the issue and saw that load_image_from_file() is not able to read the images, hence the mapping error:

> /home/AITESTUbuntu/deepdoctection/deepdoctection/mapper/cocostruct.py(68)coco_to_image()
     66 
     67     with MappingContextManager(dp.get("file_name")) as mapping_context:
---> 68         image = Image(file_name=os.path.split(dp["file_name"])[1], location=dp["file_name"], external_id=dp.get("id"))
     69 
     70         if load_image:

ipdb>  next
> /home/AITESTUbuntu/deepdoctection/deepdoctection/mapper/cocostruct.py(70)coco_to_image()
     68         image = Image(file_name=os.path.split(dp["file_name"])[1], location=dp["file_name"], external_id=dp.get("id"))
     69 
---> 70         if load_image:
     71             image.image = load_image_from_file(dp["file_name"])
     72         image.set_width_height(float(dp.get("width", 0)), float(dp.get("height", 0)))

ipdb>  image
Image(file_name='c6effb847ae7e4a80431696984fa90c98bb08c266481b9a03842422459c43bdd.png', location='c6effb847ae7e4a80431696984fa90c98bb08c266481b9a03842422459c43bdd.png', document_id='4385125b-dd1e-3025-880f-3311517cc8d5', _image_id='4385125b-dd1e-3025-880f-3311517cc8d5', embeddings={}, annotations=[])
ipdb>  next
> /home/AITESTUbuntu/deepdoctection/deepdoctection/mapper/cocostruct.py(71)coco_to_image()
     69 
     70         if load_image:
---> 71             image.image = load_image_from_file(dp["file_name"])
     72         image.set_width_height(float(dp.get("width", 0)), float(dp.get("height", 0)))
     73 

ipdb>  next
TypeError: Cannot load image is of type: <class 'NoneType'>
> /home/AITESTUbuntu/deepdoctection/deepdoctection/mapper/cocostruct.py(71)coco_to_image()
     69 
     70         if load_image:
---> 71             image.image = load_image_from_file(dp["file_name"])
     72         image.set_width_height(float(dp.get("width", 0)), float(dp.get("height", 0)))
     73 

ipdb>  next
[0810 20:52.19 @maputils.py:91]  WRN  MappingContextManager error. Will filter image
> /home/AITESTUbuntu/deepdoctection/deepdoctection/mapper/cocostruct.py(106)coco_to_image()
    104                 annotation.dump_sub_category(coarse_sub_cat_name, sub_cat)
    105 
--> 106     if mapping_context.context_error:
    107         return None
    108 

ipdb>  next
> /home/AITESTUbuntu/deepdoctection/deepdoctection/mapper/cocostruct.py(107)coco_to_image()
    105 
    106     if mapping_context.context_error:
--> 107         return None

JaMe76 Aug 10, 2023
Maintainer

Are you sure the dataset you downloaded was Publaynet?

The image names are hashes that resemble those of Doclaynet, not Publaynet. A dataflow builder is a custom loader tied to the annotation scheme of its dataset, so loading Doclaynet with the Publaynet loader will not work.

Image(file_name='c6effb847ae7e4a80431696984fa90c98bb08c266481b9a03842422459c43bdd.png', location='c6effb847ae7e4a80431696984fa90c98bb08c266481b9a03842422459c43bdd.png', document_id='4385125b-dd1e-3025-880f-3311517cc8d5', _image_id='4385125b-dd1e-3025-880f-3311517cc8d5', embeddings={}, annotations=[])

The location attribute is not a path to the image file, so load_image_from_file(dp["file_name"]) cannot load the image.

In the Publaynet builder, dp["file_name"] is populated by the following rule (line 123 in https://github.com/deepdoctection/deepdoctection/blob/master/deepdoctection/datasets/instances/publaynet.py#L123):

df = MapDataComponent(df, lambda dp: self.get_workdir() / self.get_split(split) / dp, "file_name")
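A quick way to verify that the joined paths actually point at files is to rebuild them outside the dataflow and test each one. The sketch below assumes the same workdir / split / file_name layout as the rule above; missing_image_files is a hypothetical helper and the demo uses a temporary directory with made-up file names.

```python
import tempfile
from pathlib import Path


def missing_image_files(workdir: Path, split: str, coco: dict) -> list:
    """List images named in a COCO dict that do not exist under workdir/split."""
    missing = []
    for img in coco.get("images", []):
        full_path = workdir / split / img["file_name"]
        if not full_path.is_file():
            missing.append(full_path)
    return missing


# Demo on a throwaway directory: one image exists, one does not.
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "train").mkdir()
    (workdir / "train" / "PMC_2.png").touch()
    coco = {
        "images": [
            {"id": 1, "file_name": "PMC_2.png"},
            {"id": 2, "file_name": "not_there.png"},
        ]
    }
    missing = missing_image_files(workdir, "train", coco)
    print([p.name for p in missing])  # ['not_there.png']
```

If every image shows up as missing, the folder layout or the split name is the first thing to check.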
