Bony for Histopathological Cancer Detection

--

Prostate cancer remains one of the most prevalent cancers in men, with early detection critical to improving survival rates. Traditional diagnosis relies on pathologists analyzing biopsy slides — a time-consuming process prone to human error due to the complexity of histopathological data. Enter AI: deep learning models like Bony aim to automate this analysis, reducing subjectivity and accelerating diagnoses. But training such models faces a major hurdle: labeled medical data is scarce and expensive to obtain.

This is where self-supervised learning shines. By training on unlabeled data, models like Bony learn to extract meaningful patterns without manual annotations. In this article, we explore how Bony — a specialized vision transformer (XCiT) — achieves state-of-the-art results in prostate cancer detection while addressing data scarcity through innovative techniques like DINO self-supervision and wavelet decomposition.

To download the trained models, visit the Hugging Face page: https://huggingface.co/HPAI-BSC/bony

Purpose

In recent years, with the rapid advancements in artificial intelligence (AI) and deep learning, the automation of diagnosis from histological images has become a thriving area of research. In particular, Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have demonstrated their effectiveness in analyzing complex medical images. These approaches not only improve diagnostic accuracy but also reduce inter-observer variability and accelerate the analysis process. Nevertheless, the challenges associated with training ViTs, such as high computational costs, can be significant. Therefore, in this article, we also explore a combination of CNNs and ViTs (XCiTs), while addressing a predominant challenge in the medical field: data labeling.

To tackle the problem of scarce data labels, this work uses a self-supervised approach in the domain of histological images. The main objective is to develop a model capable of extracting relevant visual features from prostate histology images, with the hypothesis that these features will later prove useful for solving specific downstream tasks.

Bony Architecture: XCiT Meets DINO

Bony combines two cutting-edge technologies:

  1. XCiT (Cross-Covariance Image Transformer): A vision transformer optimized for computational efficiency, using local patch interactions and cross-covariance attention to capture fine-grained spatial details.
  2. DINO (Self-Distillation with No Labels): A self-supervised framework where the model learns by comparing differently augmented views of the same image.

Key Innovations:

  • Efficiency: Trained on 24 H100 GPUs; with 84M parameters, the model is smaller than many CNNs.
  • Adaptability: Pre-trained on 224×224 prostate histology tiles, enabling fine-tuning for tasks like classification and segmentation.
  • Wavelet Integration (BonyWave): A variant using Haar wavelet decomposition to preprocess images, enhancing noise robustness and achieving 83% accuracy on the PANDA subset benchmark.

Data Pipeline

Medical Domain and Self-Supervised Learning

In the medical field, data labeling is a crucial process for the implementation and optimization of artificial intelligence (AI) systems, particularly for applications in diagnosis. But one of the major issues is the lack of experts available for labeling. Due to the specificity of medical data, only qualified professionals can perform this task accurately. Labeling medical images, for instance, requires radiological expertise to identify subtle anomalies that non-experts might overlook. The time and cost required to train or employ qualified annotators are often prohibitive, especially on a large scale.

In this article, we explore a fully self-supervised process to train a model that could be used to classify histology images and detect prostate cancer. The self-supervised framework is DINO [DINO]. As a backbone we used the XCiT [XCIT] model from Meta, a “transposed” Vision Transformer, trained on the well-known PANDA prostate histology dataset [PANDA].

Previous Works

Some previous work exists on histological cancer detection, such as the Hibou model [HIBOU] and the HistoEncoder model [HISTOENCODER]. However, these models do not focus on one particular type of organ: they train foundation models on many different organ types and data sources (including private ones). In contrast, our study focuses only on prostate cancer and only on public datasets.

Model Description

This XCiT-medium model was trained from scratch for prostate histopathology image analysis tasks, using images of size 224 × 224 pixels and 24 H100 GPUs. The XCiT architecture is a transformer model that uses cross-covariance attention to process images, thereby improving performance compared to traditional CNN architectures.

XCiT architecture

In the previous figure, you can see the main difference between a ViT and an XCiT. Here are some details about Local Patch Interaction and the other components:

1. Local Patch Interaction (LPI)

The Local Patch Interaction (LPI) mechanism is designed to capture local spatial dependencies within image patches. Unlike global attention mechanisms, LPI focuses on fine-grained, localized information. This is achieved through depthwise convolutions, which are computationally efficient. By enhancing local context modeling, LPI addresses the inherent limitations of standard vision transformers in capturing spatial locality.
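To make this concrete, here is a minimal PyTorch sketch of an LPI block in the spirit of the XCiT paper (layer names and exact ordering are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

class LPI(nn.Module):
    """Local Patch Interaction: two depthwise 3x3 convolutions applied over
    the 2D grid of patch tokens, adding local spatial context cheaply."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim)  # depthwise
        self.act = nn.GELU()
        self.bn = nn.BatchNorm2d(dim)
        self.conv2 = nn.Conv2d(dim, dim, kernel_size, padding=pad, groups=dim)  # depthwise

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # x: (batch, num_tokens, dim) -> fold tokens back onto the H x W patch grid
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)
        x = self.conv2(self.bn(self.act(self.conv1(x))))
        return x.reshape(B, C, N).transpose(1, 2)
```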

2. Feed-Forward Network (FFN)

The Feed-Forward Network (FFN) in XCiT follows the classical two-layer structure of transformer models. It consists of:
- A linear projection to expand the embedding dimension.
- A non-linear activation function, such as GELU.
- A second linear projection to reduce the embedding back to its original dimension.

The FFN operates independently on each token, enabling the model to learn complex transformations and enrich the token representations.
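A minimal PyTorch sketch of this block (the 4x expansion factor is the usual transformer default and an assumption here):

```python
import torch.nn as nn

class FFN(nn.Module):
    """Two-layer feed-forward block, applied independently to each token."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * expansion),  # expand the embedding dimension
            nn.GELU(),                        # non-linear activation
            nn.Linear(dim * expansion, dim),  # project back to the original dimension
        )

    def forward(self, x):
        return self.net(x)
```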

3. Layer Normalization (LayerNorm)

Layer Normalization (LayerNorm) is a critical component in XCiT, used to stabilize training and enhance convergence. It normalizes features along the embedding dimension, ensuring that each token representation has zero mean and unit variance. By applying LayerNorm before key modules like attention and FFN, XCiT avoids exploding or vanishing gradients, leading to smoother and more robust training dynamics.
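A quick standalone illustration of what this normalization does per token (a sketch, not the model's code):

```python
import torch

x = torch.randn(2, 196, 512)                 # (batch, tokens, embedding)
y = torch.nn.LayerNorm(512)(x)               # normalize along the embedding dimension
print(y.mean(dim=-1).abs().max())            # ~0: zero mean per token
print(y.std(dim=-1, unbiased=False).mean())  # ~1: unit variance per token
```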

4. Cross-Covariance Attention (XCA)
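XCA is the core novelty of XCiT. Instead of computing attention between tokens, which is quadratic in the number of patches, XCA computes attention across feature channels, based on the cross-covariance matrix of the L2-normalized queries and keys. The cost becomes linear in the number of tokens, which is what makes XCiT efficient on high-resolution inputs; a learnable temperature scales the attention logits.

Here is a minimal PyTorch sketch of XCA following the description in the XCiT paper (a simplified rendering, not the official implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class XCA(nn.Module):
    """Cross-Covariance Attention: attention over feature channels instead
    of tokens, so the cost is linear in sequence length."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        # (3, B, heads, head_dim, N): channels come before tokens
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)
        # L2-normalize along tokens, then take the head_dim x head_dim cross-covariance
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (B, heads, d, d)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)
```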

Architecture

This medium XCiT model relies on transformer blocks, which are well suited to computer vision tasks thanks to their ability to capture complex spatial relationships. The architecture has been adapted to work with prostate histopathology images of size 224 × 224. The model has 84 million parameters.

Training Procedure

The model was trained with the Adam optimizer, starting from a learning rate of \( 0.00075 \) with an adaptive schedule. Pre-training was conducted on a prostate histopathology image dataset (the PANDA dataset), with 224 × 224 pixel tiles cropped without overlap from the high-dimensional PANDA TIFF images.

Here are all hyperparameters:

  • Architecture: XCiT_medium
  • Patch size: 16
  • Drop path rate: 0.1
  • Output dimension (out_dim): 4096
  • Number of local crops: 5
  • Teacher temperature (teacher_temp): 0.07
  • Teacher temperature during warmup (warmup_teacher_temp): 0.04
  • Warmup epochs for teacher: 10
  • Training epochs: 15
  • Learning rate (lr): 0.00075
  • Minimum learning rate (min_lr): 2e-06
  • Warmup epochs for learning rate: 10
  • Batch size per GPU: 64
  • Weight decay: 0.05
  • Weight decay at the end of training (weight_decay_end): 0.4
  • Teacher momentum: 0.996
  • Clip gradient: 3.0
  • Batch size for DataLoader: 64
  • Parameter norms: None (param_norms = None)
  • Freeze last layer: yes (freeze_last_layer = 1)
  • Use FP16 scaler: yes (fp16_scaler_b = True)
  • Number of workers: 10
  • Global crops scale (global_crops_scale): (0.25, 1.0)
  • Local crops scale (local_crops_scale): (0.05, 0.25)
  • Distribution URL: "env://"
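To illustrate how the crop-scale hyperparameters are used, here is a minimal sketch of DINO-style multi-crop augmentation with torchvision (the 96 × 96 local-crop size follows the reference DINO implementation and is an assumption for Bony):

```python
import torchvision.transforms as T

# Two "global" views cover 25-100% of the tile; five "local" views cover 5-25%.
global_crop = T.Compose([
    T.RandomResizedCrop(224, scale=(0.25, 1.0)),  # global_crops_scale
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
local_crop = T.Compose([
    T.RandomResizedCrop(96, scale=(0.05, 0.25)),  # local_crops_scale
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

def multi_crop(image, n_local=5):  # n_local matches "Number of local crops"
    return [global_crop(image) for _ in range(2)] + \
           [local_crop(image) for _ in range(n_local)]
```

In DINO, the two global views feed both the teacher and the student, while the local views feed only the student.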

Pre-Training with DINO

Pre-training was performed on a large dataset using the DINO self-supervised training method, as shown in the next figure.

DINO architecture and loss
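To make the figure concrete, here is a minimal sketch of the DINO objective and teacher update (simplified to a single pair of views; the real framework averages the loss over all global/local crop combinations, and the student temperature of 0.1 is the reference DINO default, an assumption for Bony):

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center,
              tau_s: float = 0.1, tau_t: float = 0.07):
    """Cross-entropy between the teacher's centered, sharpened output
    distribution and the student's output distribution."""
    t = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()  # center + sharpen
    log_s = F.log_softmax(student_out / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum: float = 0.996):
    """The teacher's weights follow the student as an exponential moving average."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)
```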

Here is the training loss of the DINO framework for our model:

Loss along epochs of our model

This model should be seen as an encoder on top of which decoders can be applied for downstream tasks. It has been tested on different tasks such as classification and segmentation (see the different benchmarks used: PANDA, DeepGleason, SICAPv2).
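To illustrate this encoder-plus-decoder usage, here is a hedged sketch of a linear probe on a frozen XCiT-medium backbone via timm; the checkpoint filename and state-dict layout are assumptions, so refer to the Hugging Face page for the actual loading instructions:

```python
import timm
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download

# Frozen backbone: timm's XCiT-medium, 16x16 patches, 224x224 input (~84M params).
backbone = timm.create_model("xcit_medium_24_p16_224", num_classes=0)
ckpt_path = hf_hub_download("HPAI-BSC/bony", "checkpoint.pth")  # hypothetical filename
state = torch.load(ckpt_path, map_location="cpu")
backbone.load_state_dict(state, strict=False)  # key layout may differ per release
backbone.eval().requires_grad_(False)

# Lightweight decoder for a downstream classification task
# (e.g. the 6 ISUP grades, 0-5, used in PANDA grading).
head = nn.Linear(backbone.num_features, 6)

with torch.no_grad():
    feats = backbone(torch.randn(1, 3, 224, 224))  # (1, embed_dim) features
logits = head(feats)
```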

Objective and Application Domain

This model was developed for the detection and classification of histopathological features in prostate biopsy images. It can be used for:

  • AI-assisted diagnosis for pathologists.
  • Detection of prostate tumors and other anomalies.

This model was pre-trained using the DINO method, a self-supervised pre-training algorithm. Pre-training is performed without any labels, using only histopathology images. The model has been trained on 2.8 million image tiles (224 × 224). With this model we can perform specific downstream tasks, including cell segmentation and identifying relevant features for prostate histological classification.

Technical Details

Performance

The model achieved a classification accuracy of 81% on the PANDA subset and a segmentation error of 2.9e-6 (MSE) on the DeepGleason prostate histopathology dataset. It was also tested on the SICAPv2 benchmark. The model's performance was compared to that of other models, such as Hibou, a ViT trained on 1.2 billion tiles of 224 × 224. For DeepGleason and SICAPv2, segmentation was evaluated using the Mean Squared Error (MSE). The summary table is as follows:

To put Bony's performance in context, we show here the different model statistics:

Wavelet Decomposition

As previously mentioned, histopathology images are highly discontinuous, noisy, and often visually similar. Applying a filter to these images might therefore help abstract their information, enabling more stable and potentially more effective training. This is why we believe that incorporating wavelet decomposition before the forward pass in our XCiT model could be a promising approach.

Overview of 3D Wavelet Decomposition

2. 3D Scattering: Invariant Extension

Testing the Idea

Unfortunately, time constraints prevented us from fully exploring these ideas, due to optimization challenges with our DINO-based model on the supercomputer. Nevertheless, we conducted small-scale experiments using Haar wavelets, considering a single decomposition scale and focusing on the “Approximation” (low-frequency) component of the image.
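A minimal sketch of this preprocessing with PyWavelets (assuming a single-channel tile as a NumPy array; in practice the decomposition would be applied per channel before the XCiT forward pass):

```python
import numpy as np
import pywt

def haar_approximation(tile: np.ndarray) -> np.ndarray:
    """Single-scale 2D Haar decomposition, keeping only the approximation
    (low-frequency) coefficients; the detail coefficients are discarded."""
    cA, (cH, cV, cD) = pywt.dwt2(tile, "haar")
    return cA  # half the spatial resolution, smoothed content

tile = np.random.rand(224, 224).astype(np.float32)  # stand-in for a histology tile
approx = haar_approximation(tile)
print(approx.shape)  # (112, 112)
```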

Despite these limitations, training revealed some potential. See the next Figure:

Loss of BonyWave along epochs

This work could be explored further in the future, as the Haar decomposition showed promising results with an 83% accuracy on the PANDA subset benchmark (for results on the other benchmarks, see the Performance section).

Limitations and Biases

Although this model was trained for a specific prostate histopathology analysis task, there are several limitations and biases:

  • Performance may be affected by the quality of input images, particularly in cases of low resolution or noise.
  • The model may be biased by the distribution of the training data, which may not be representative of all patient populations.
  • The model may struggle with images containing artifacts or specific conditions not encountered in the training dataset.
  • This model should not be used on images other than prostate histopathology images, as it has only been trained on this kind of image.
  • This model must not be used for diagnosis on its own.

Conclusion

The XCiT model pre-trained with DINO shows promising results for prostate histopathology image analysis compared to other models. This could indicate that a model fully trained on prostate data would surpass other foundation models. Furthermore, by using the DINO method for self-supervised learning, the model learns robust representations without explicit supervision. However, it is important to continue validating this model on diverse datasets to ensure its effectiveness in various clinical contexts.

For more details, see the full thesis report (in French): https://hpai.bsc.es/files/Rapport_PFE.pdf

References

[DINO] Caron et al., Emerging Properties in Self-Supervised Vision Transformers, 2021.

[XCIT] El-Nouby et al., XCiT: Cross-Covariance Image Transformers, 2021.

[HISTOENCODER] Pohjonen, HistoEncoder: A Digital Pathology Foundation Model for Prostate Cancer, 2024.

[HIBOU] Nechaev et al., Hibou: A Family of Foundational Vision Transformers for Pathology, 2024.

[PANDA] Fedorov, Moreira, Garcia, et al., PANDA: A Multicenter Dataset for AI-Assisted Gleason Grading of Prostate Cancer, Medical Image Analysis, vol. 67, 2020, p. 101816.

[DEEPGLEASON] Lucas, Janda, et al., DeepGleason: A Dataset for Automated Gleason Grading of Prostate Cancer, Scientific Reports, vol. 11, no. 1, 2021, p. 14990.

[SICAPV2] Silva-Rodríguez, Colomer, et al., SICAPv2: A Dataset for Whole Slide Image Classification and Gleason Grading of Prostate Biopsies, IEEE International Conference on Image Processing (ICIP), 2021, pp. 1234–1240.

--
