Refining Tuberculosis Detection in CXR Imaging: Addressing Bias in Deep Neural Networks via Interpretability

Automatic classification of active tuberculosis from chest X-ray images has the potential to save lives, especially in low- and middle-income countries where skilled human experts can be scarce. Given the lack of labeled data available to train such systems and the unbalanced nature of publicly available datasets, we argue that the reliability of deep learning models is limited, even if they can be shown to obtain perfect classification accuracy on the test data. One way of evaluating the reliability of such systems is to ensure that models use the same regions of input images for predictions as medical experts would. In this paper, we show that pre-training a deep neural network on a large-scale proxy task, as well as using mixed objective optimization network (MOON), a technique to balance different classes during pre-training and fine-tuning, can improve the alignment of decision foundations between models and experts, as compared to a model trained directly on the target dataset. At the same time, these approaches maintain perfect classification accuracy according to the area under the receiver operating characteristic curve (AUROC) on the test set, and improve generalization on an independent, unseen dataset. For the purpose of reproducibility, our source code is made available online (in this repository).

[Figure: two rows of chest X-ray saliency maps. Top: "Lowest Prop. Energy M_U Samples"; bottom: "Matching M_B Samples".]
SALIENCY MAPS. Top row: saliency maps produced by the unbalanced model M_U for the five TBX11k test-set cases with the lowest Proportional Energy scores. Bottom row: the corresponding predictions of our best balanced model M_B_B. Human-annotated ground-truth regions containing radiological signs are marked with bright magenta bounding boxes. The heatmaps (ranging from red to blue) indicate how much each region contributes to the models' decision-making; non-colored areas have no significant contribution.

Citation

Please use the following citation if you use these materials in your own work:

@inproceedings{Guler_EUVIP24_2024,
       author = {G{\"{u}}ler, {\"{O}}zg{\"{u}}r and G{\"{u}}nther, Manuel and Anjos, Andr{\'{e}}},
     projects = {Idiap},
        month = {9},
        title = {Refining Tuberculosis Detection in CXR Imaging: Addressing Bias in Deep Neural Networks via Interpretability},
    booktitle = {Proceedings of the 12th European Workshop on Visual Information Processing},
         year = {2024},
          pdf = {https://publications.idiap.ch/attachments/papers/2024/Guler_EUVIP24_2024.pdf}
}

Contents

This package contains instructions and pre-trained models (files with .ckpt extension) to reproduce results published in the paper. The .ckpt (checkpoint) files are named after the entries in Table I of the original paper. Follow this README file for installation and usage instructions to reproduce Table II measurements, generate figures such as those in Figure 1, and re-train the models if necessary.

Installation

  1. Clone (or copy) this package into a work directory:

    git clone https://gitlab.idiap.ch/medai/software/paper/euvip24-refine-cad-tb
    cd euvip24-refine-cad-tb
  2. Install databases: Procure and install both databases used in this work, namely the Shenzhen database and the TBX11k database. Take note of the root directories where these databases were unpacked, as you will need them when setting up a required auxiliary package called mednet. mednet requires that files and directories from the externally downloaded databases are not renamed.

  3. Install software stack: To install the required software, first install pixi, a conda-based software package manager. Once pixi is installed, execute one of the following commands in the current package checkout to initialize the software stack:

    pixi install -e default  # CPU- or MPS-based processing
    pixi install -e cuda     # CUDA-based processing

    Follow mednet's setup instructions to configure the location of your installed databases at $HOME/.config/mednet.toml. Then, check that your installation is operational:

    pixi run -e default mednet info
    # example output should contain these lines, confirming mednet has
    # picked up the location of your database installations
    mednet version: X.Y.Z
    ...
    mednet configuration file: /Users/self/.config/mednet.toml
    databases:
      ...
      - shenzhen (mednet.data.classify, mednet.data.segment): /Users/self/work/dbs/shenzhen
      ...
      - tbx11k (mednet.data.classify): /Users/self/work/dbs/tbx11k

    Note that the first run of the above command after installation may take a few moments (as all installed files are byte-compiled).

Usage

Within the current directory, you will find a script called evaluate-all.sh. This script re-runs inference and evaluation for all available pre-trained model (.ckpt) files, on all datasets. It also extracts saliency maps and runs the interpretability evaluation discussed in the paper, allowing you to reproduce the results in all columns of Table II. Note that running the full script takes a long time, in particular the saliency-map analysis, which covers three distinct saliency-map algorithms (GradCAM, HiResCAM, and ScoreCAM). We advise using the --device=cuda option (or --device=mps if you are on a Mac with Apple Silicon) if you have a GPU available; ScoreCAM in particular is very slow on common hardware. You may hand-edit the script and comment out any portions you want to skip.
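For example, assuming the script is executable and run from the repository root:

# full run on Linux with an Nvidia GPU
./evaluate-all.sh --device=cuda

# full run on macOS with Apple Silicon
./evaluate-all.sh --device=mps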

Once the script has run, you can output the ROC-AUC for all combinations of models and databases by running something like:

cat M_{U,B,U_U,U_B,B_B}-on-tbx11k-*/evaluation.rst | grep test | awk '{ print $11 }'
cat M_{U,B,U_U,U_B,B_B}-on-shenzhen-alltest/evaluation.rst | grep test | awk '{ print $11 }'

These commands output the first (Target) and second (External) columns of Table II, respectively.

To generate the other columns (the medians of the Proportional Energy across the datasets), grep the tables generated by the interpretability analysis. Here is an example:

cat M_{U,B,U_U,U_B,B_B}-on-tbx11k-*/gradcam/interpretability.rst | grep test | awk '{ print $6 }'
cat M_{U,B,U_U,U_B,B_B}-on-tbx11k-*/hirescam/interpretability.rst | grep test | awk '{ print $6 }'
cat M_{U,B,U_U,U_B,B_B}-on-tbx11k-*/scorecam/interpretability.rst | grep test | awk '{ print $6 }'

You can use the following command lines to generate saliency-map visualizations for all samples in the dataset, using a particular saliency-map algorithm (e.g. gradcam):

pixi run -e default mednet classify saliency view -vv densenet tbx11k-v1-healthy-vs-atb --input-folder="M_U-on-tbx11k-v1-healthy-vs-atb/gradcam" --output-folder="M_U-on-tbx11k-v1-healthy-vs-atb/gradcam/view"
pixi run -e default mednet classify saliency view -vv densenet tbx11k-v1-healthy-vs-atb --input-folder="M_B_B-on-tbx11k-v1-healthy-vs-atb/gradcam" --output-folder="M_B_B-on-tbx11k-v1-healthy-vs-atb/gradcam/view"

To reconstruct the images in Figure 1 of the paper, you will need to sort the results in ./M_U-on-tbx11k-v1-healthy-vs-atb/gradcam/interpretability.json and pick the images with the lowest proportional energy scores. This score corresponds to the third entry of each sample record that contains such a measurement (i.e. samples with target = 1).

For your reference, the order of attributes is "filename", "original target", "proportional energy", and "saliency focus". Note that the last two attributes are only available if the original sample target is positive.
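The snippet below is a minimal sketch of how such a ranking could be produced. It assumes interpretability.json holds, per dataset split, a list of per-sample entries ordered as described above; adjust the path and indexing if your layout differs.

python3 - <<'EOF'
import json

with open("M_U-on-tbx11k-v1-healthy-vs-atb/gradcam/interpretability.json") as f:
    data = json.load(f)

# keep only entries carrying interpretability measurements
# (positive samples, which have all four attributes)
entries = [e for split in data.values() for e in split if len(e) == 4]
entries.sort(key=lambda e: float(e[2]))  # third attribute: proportional energy
for entry in entries[:5]:
    print(entry[2], entry[0])  # proportional energy, filename
EOF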

Re-training

You may retrain all models and experiment with the existing hyper-parameters. Ready-made configuration files for mednet define the configurations that yielded the published models.
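For instance, to experiment with different hyper-parameters, you could copy one of the ready-made configuration files (e.g. M_B.py, referenced in the commands below), edit the copy, and retrain from it; the output-folder name here is just an example:

# copy a ready-made configuration and edit its hyper-parameters
cp M_B.py M_B_custom.py
$EDITOR M_B_custom.py

# retrain from the modified configuration (see the device variants below)
pixi run -e cuda mednet experiment M_B_custom.py --device=cuda --cache-samples --parallel=8 --output-folder=M_B_custom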

Re-training M_* models

[Figure: training protocol for the TBX11k (target) dataset]
TARGET DATASET TRAINING. Schematics of the training (fine-tuning) of the base network on the target dataset (TBX11k).

These are the simplest models to retrain, as they only require the base TBX11k installation, and training happens in a single step.

To retrain M_U and M_B, use one of the following variants:

# cpu (on either macOS or Linux, but **slow**)
pixi run -e default mednet experiment M_U.py --cache-samples --parallel=8 --output-folder=M_U
pixi run -e default mednet experiment M_B.py --cache-samples --parallel=8 --output-folder=M_B

# mps (on macOS with Apple Silicon)
pixi run -e default mednet experiment M_U.py --device=mps --cache-samples --parallel=8 --output-folder=M_U
pixi run -e default mednet experiment M_B.py --device=mps --cache-samples --parallel=8 --output-folder=M_B

# cuda (on Linux + Nvidia card)
pixi run -e cuda mednet experiment M_U.py --device=cuda --cache-samples --parallel=8 --output-folder=M_U
pixi run -e cuda mednet experiment M_B.py --device=cuda --cache-samples --parallel=8 --output-folder=M_B

You will need at least 10 GB of GPU (v)RAM and 64 GB of system RAM to execute these experiments. Note that CPU-RAM requirements can be trimmed down if you set --no-cache-samples. The retrained models will be left at <output-folder>/model-at-lowest-validation-loss-epoch=*.ckpt.
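For example, to reduce the CPU-RAM footprint at the cost of slower epochs (samples are no longer cached in memory):

pixi run -e cuda mednet experiment M_U.py --device=cuda --no-cache-samples --parallel=8 --output-folder=M_U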

To re-execute the evaluation and include your newly trained variant, link the .ckpt file to the current directory and re-execute the evaluate-all.sh script as described above.
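For example, assuming the evaluation script picks up checkpoints named after the Table I entries in the current directory (the link name below is hypothetical):

# expose the newly retrained checkpoint under the expected name,
# then re-run the evaluation
ln -sf M_U/model-at-lowest-validation-loss-epoch=*.ckpt M_U.ckpt
./evaluate-all.sh --device=cuda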

Re-training M_*_* models

[Figure: training protocol for TBX11k with pre-training on NIH-CXR14]
NIH-CXR14 PRE-TRAINING. Schematics of the advanced network, first pre-trained on NIH-CXR14 and then fine-tuned on the target dataset (TBX11k).

These models require a two-step re-training procedure: you first train them on the NIH-CXR14 dataset, and then fine-tune them on the TBX11k dataset. To do so, first run one of the following:

# cpu (on either macOS or Linux, but **EXTREMELY slow**)
pixi run -e default mednet experiment M_U_X.py --cache-samples --parallel=8 --output-folder=M_U_X
pixi run -e default mednet experiment M_B_X.py --cache-samples --parallel=8 --output-folder=M_B_X

# mps (on macOS with Apple Silicon)
pixi run -e default mednet experiment M_U_X.py --device=mps --cache-samples --parallel=8 --output-folder=M_U_X
pixi run -e default mednet experiment M_B_X.py --device=mps --cache-samples --parallel=8 --output-folder=M_B_X

# cuda (on Linux + Nvidia card)
pixi run -e cuda mednet experiment M_U_X.py --device=cuda --cache-samples --parallel=8 --output-folder=M_U_X
pixi run -e cuda mednet experiment M_B_X.py --device=cuda --cache-samples --parallel=8 --output-folder=M_B_X

You will need at least 10 GB of GPU (v)RAM and 500 GB of system RAM to execute these experiments. Note that CPU-RAM requirements can be trimmed down if you set --no-cache-samples. The retrained models will be left at <output-folder>/model-at-lowest-validation-loss-epoch=*.ckpt.

Once the base models have been re-trained, you will need to fine-tune them on TBX11k to reproduce the published results. For example, to produce the M_B_B model, start from the weights at M_B_X/model-at-lowest-validation-loss-epoch=*.ckpt and execute one of the following:

# cpu (on either macOS or Linux, but **EXTREMELY slow**)
pixi run -e default mednet experiment M_B.py --cache-samples --parallel=8 --initial-weights=M_B_X --output-folder=M_B_B

# mps (on macOS with Apple Silicon)
pixi run -e default mednet experiment M_B.py --device=mps --cache-samples --parallel=8 --initial-weights=M_B_X --output-folder=M_B_B

# cuda (on Linux + Nvidia card)
pixi run -e cuda mednet experiment M_B.py --device=cuda --cache-samples --parallel=8 --initial-weights=M_B_X --output-folder=M_B_B

In this last step, hardware requirements are the same as for the M_* models. Repeat the evaluation procedure to account for the new model.