Automatic classification of active tuberculosis from chest X-ray images has the potential to save lives, especially in low- and mid-income countries where skilled human experts can be scarce. Given the lack of available labeled data to train such systems and the unbalanced nature of publicly available datasets, we argue that the reliability of deep learning models is limited, even if they can be shown to obtain perfect classification accuracy on the test data. One way of evaluating the reliability of such systems is to ensure that models use the same regions of input images for predictions as medical experts would. In this paper, we show that pre-training a deep neural network on a large-scale proxy task, as well as using mixed objective optimization network (MOON), a technique to balance different classes during pre-training and fine-tuning, can improve the alignment of decision foundations between models and experts, as compared to a model directly trained on the target dataset. At the same time, these approaches keep perfect classification accuracy according to the area under the receiver operating characteristic curve (AUROC) on the test set, and improve generalization on an independent, unseen dataset. For the purpose of reproducibility, our source code is made available online (in this repository).
Please use the following citation if you use these materials in your own work:
@inproceedings{Guler_EUVIP24_2024,
  author    = {G{\"{u}}ler, {\"{O}}zg{\"{u}}r and G{\"{u}}nther, Manuel and Anjos, Andr{\'{e}}},
  projects  = {Idiap},
  month     = {9},
  title     = {Refining Tuberculosis Detection in CXR Imaging: Addressing Bias in Deep Neural Networks via Interpretability},
  booktitle = {Proceedings of the 12th European Workshop on Visual Information Processing},
  year      = {2024},
  pdf       = {https://publications.idiap.ch/attachments/papers/2024/Guler_EUVIP24_2024.pdf}
}
This package contains instructions and pre-trained models (files with the .ckpt extension) to reproduce results published in the paper. The .ckpt (checkpoint) files are named after the entries in Table I of the original paper. Follow this README file for installation and usage instructions to reproduce the Table II measurements, generate figures such as those in Figure 1, and re-train the models if necessary.
Clone (or copy) this package into a working directory:
git clone https://gitlab.idiap.ch/medai/software/paper/euvip24-refine-cad-tb
cd euvip24-refine-cad-tb
Install databases: Procure and install both databases used in this work, namely the Shenzhen database and the TBX11k database. Take note of the root directories where these databases were unpacked, as you will need them when setting up a required auxiliary package called mednet. mednet requires that files and directories from externally downloaded databases are not renamed.
Install software stack: To install the required software, first install pixi, a conda-based software package manager. Once pixi is installed, execute one of the following commands from the current package checkout to initialize the software stack:
pixi install -e default # cpu or MPS-based processing
pixi install -e cuda # CUDA-based processing
Follow mednet's setup instructions to configure the location of your installed databases at $HOME/.config/mednet.toml.
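As an illustration, a configuration file along the following lines would point mednet at the database root directories shown in the example output further below. The datadir key names are an assumption here, not taken from this package; double-check them against mednet's setup documentation before writing the file:
# hypothetical sketch: write a minimal mednet.toml pointing at the database
# root directories (adjust the paths; verify the key names in mednet's docs)
mkdir -p ~/.config
cat > ~/.config/mednet.toml <<'EOF'
datadir.shenzhen = "/Users/self/work/dbs/shenzhen"
datadir.tbx11k = "/Users/self/work/dbs/tbx11k"
EOF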
Then, check that your installation is operational:
pixi run -e default mednet info
# example output should contain these lines, confirming mednet has
# picked up the location of your database installations
mednet version: X.Y.Z
...
mednet configuration file: /Users/self/.config/mednet.toml
databases:
...
- shenzhen (mednet.data.classify, mednet.data.segment): /Users/self/work/dbs/shenzhen
...
- tbx11k (mednet.data.classify): /Users/self/work/dbs/tbx11k
Note that running the above command after installation may take a few moments to execute (as all installed files are byte-compiled).
Within the current directory, you will find a script called evaluate-all.sh. This script re-runs inference and evaluation with the available pre-trained model (.ckpt) files, on all datasets. It also extracts saliency maps and runs the interpretability evaluation discussed in the paper, allowing you to reproduce the results in all columns of Table II. Note that running the full script takes a long time, in particular the saliency-map analysis, which covers 3 distinct saliency-map algorithms (GradCAM, HiResCAM, and ScoreCAM). We advise using the --device=cuda option (or --device=mps if you are on a Mac with Apple Silicon) if you have a GPU available to run the full script. Note that running ScoreCAM is very slow on common hardware. You may hand-edit the script and comment out portions you want to skip.
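For instance, on a machine with an Nvidia GPU, the full evaluation could be launched along these lines. This assumes the script forwards the --device option to the underlying mednet calls; if it does not, hand-edit the script as explained above:
# hypothetical invocation: run all inference, evaluation and saliency analyses
# on a CUDA-capable GPU (drop --device, or use --device=mps, as appropriate)
bash ./evaluate-all.sh --device=cuda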
Once the script has finished, output the ROC-AUC for all combinations of models and databases by running something like:
cat M_{U,B,U_U,U_B,B_B}-on-tbx11k-*/evaluation.rst | grep test | awk '{ print $11 }'
cat M_{U,B,U_U,U_B,B_B}-on-shenzhen-alltest/evaluation.rst | grep test | awk '{ print $11 }'
These commands output, respectively, the first (Target) and second (External) columns of Table II.
To generate the other columns (the medians of the Proportional Energy across the datasets), grep the tables generated by the interpretability analysis. Here is an example:
cat M_{U,B,U_U,U_B,B_B}-on-tbx11k-*/gradcam/interpretability.rst | grep test | awk '{ print $6 }'
cat M_{U,B,U_U,U_B,B_B}-on-tbx11k-*/hirescam/interpretability.rst | grep test | awk '{ print $6 }'
cat M_{U,B,U_U,U_B,B_B}-on-tbx11k-*/scorecam/interpretability.rst | grep test | awk '{ print $6 }'
You can use the following command lines to generate saliency-map visualizations for all samples in the dataset, using a particular saliency-map algorithm (e.g. gradcam):
pixi run -e default mednet classify saliency view -vv densenet tbx11k-v1-healthy-vs-atb --input-folder="M_U-on-tbx11k-v1-healthy-vs-atb/gradcam" --output-folder="M_U-on-tbx11k-v1-healthy-vs-atb/gradcam/view"
pixi run -e default mednet classify saliency view -vv densenet tbx11k-v1-healthy-vs-atb --input-folder="M_B_B-on-tbx11k-v1-healthy-vs-atb/gradcam" --output-folder="M_B_B-on-tbx11k-v1-healthy-vs-atb/gradcam/view"
To reconstruct the images in Figure 1 of the paper, you will need to sort the results in ./M_U-on-tbx11k-v1-healthy-vs-atb/gradcam/interpretability.json and pick the images with the lowest proportional energy score. This should correspond to the third entry of the samples in that file that contain such a measurement (i.e. samples with output target =1). For your reference, the order of attributes is "filename", "original target", "proportional energy" and "saliency focus". Note that the last two attributes are only available if the original sample target is positive.
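For example, a short command along the following lines lists the positive samples with the lowest proportional energy. The exact layout of interpretability.json (per-split lists of [filename, target, proportional energy, saliency focus] entries) is an assumption based on the description above; adapt the indexing if your file differs:
# hypothetical sketch: rank positive samples by increasing proportional energy
python3 - <<'EOF'
import json

with open("M_U-on-tbx11k-v1-healthy-vs-atb/gradcam/interpretability.json") as f:
    data = json.load(f)

# flatten per-split lists if the file is keyed by split, otherwise use it as-is
entries = sum(data.values(), []) if isinstance(data, dict) else data

# positive samples carry the extra measurements; proportional energy is the 3rd field
positives = sorted((e for e in entries if len(e) >= 3), key=lambda e: e[2])
for filename, target, energy, *rest in positives[:10]:
    print(f"{energy:.4f}  {filename}")
EOF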
You may retrain all models and play with the existing hyper-parameters. Ready-made configuration files for mednet define the settings that yielded the published models.
M_U and M_B are the simplest models to retrain, as they only require the base TBX11k installation and training happens in a single step. To retrain them, use one of the following variants:
# cpu (on either macOS or Linux, but **slow**)
pixi run -e default mednet experiment M_U.py --cache-samples --parallel=8 --output-folder=M_U
pixi run -e default mednet experiment M_B.py --cache-samples --parallel=8 --output-folder=M_B
# mps (on macOS with Apple Silicon)
pixi run -e default mednet experiment M_U.py --device=mps --cache-samples --parallel=8 --output-folder=M_U
pixi run -e default mednet experiment M_B.py --device=mps --cache-samples --parallel=8 --output-folder=M_B
# cuda (on Linux + Nvidia card)
pixi run -e cuda mednet experiment M_U.py --device=cuda --cache-samples --parallel=8 --output-folder=M_U
pixi run -e cuda mednet experiment M_B.py --device=cuda --cache-samples --parallel=8 --output-folder=M_B
You will need at least 10 GB of GPU (v)RAM and 64 GB of RAM to execute these experiments. Note that CPU RAM requirements can be reduced if you set --no-cache-samples. The retrained models will be left at <output-folder>/model-at-lowest-validation-loss-epoch=*.ckpt.
To re-execute the evaluation and include your newly trained variant, link the .ckpt file into the current directory and re-execute the evaluate-all.sh script as per above.
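For example, assuming the retrained single-step model should replace the shipped M_U checkpoint (the destination file name expected by evaluate-all.sh is an assumption here; match it to the existing .ckpt names in this directory):
# hypothetical example: expose the newly trained checkpoint under the name
# used by evaluate-all.sh (adjust the source and destination names as needed)
ln -s M_U/model-at-lowest-validation-loss-epoch=*.ckpt ./M_U.ckpt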
The remaining models require a two-step re-training procedure: you first need to train them on the NIH-CXR14 dataset, and then fine-tune them on the TBX11k dataset. To do so, first run one of the following:
# cpu (on either macOS or Linux, but **EXTREMELY slow**)
pixi run -e default mednet experiment M_U_X.py --cache-samples --parallel=8 --output-folder=M_U_X
pixi run -e default mednet experiment M_B_X.py --cache-samples --parallel=8 --output-folder=M_B_X
# mps (on macOS with Apple Silicon)
pixi run -e default mednet experiment M_U_X.py --device=mps --cache-samples --parallel=8 --output-folder=M_U_X
pixi run -e default mednet experiment M_B_X.py --device=mps --cache-samples --parallel=8 --output-folder=M_B_X
# cuda (on Linux + Nvidia card)
pixi run -e cuda mednet experiment M_U_X.py --device=cuda --cache-samples --parallel=8 --output-folder=M_U_X
pixi run -e cuda mednet experiment M_B_X.py --device=cuda --cache-samples --parallel=8 --output-folder=M_B_X
You will need at least 10 GB of GPU (v)RAM and 500 GB of RAM to execute these experiments. Note that CPU RAM requirements can be reduced if you set --no-cache-samples. The retrained models will be left at <output-folder>/model-at-lowest-validation-loss-epoch=*.ckpt.
Once the base models have been re-trained, you will need to fine-tune them on TBX11k to reproduce the published results. For example, to produce the M_B_B model, start from the weights at M_B_X/model-at-lowest-validation-loss-epoch=*.ckpt, and execute one of the following:
# cpu (on either macOS or Linux, but **EXTREMELY slow**)
pixi run -e default mednet experiment M_B.py --cache-samples --parallel=8 --initial-weights=M_B_X --output-folder=M_B_B
# mps (on macOS with Apple Silicon)
pixi run -e default mednet experiment M_B.py --device=mps --cache-samples --parallel=8 --initial-weights=M_B_X --output-folder=M_B_B
# cuda (on Linux + Nvidia card)
pixi run -e cuda mednet experiment M_B.py --device=cuda --cache-samples --parallel=8 --initial-weights=M_B_X --output-folder=M_B_B
In this last step, hardware requirements are the same as for the M_* models. Repeat the evaluation procedure to account for the new model.