In this post we compare the performance of the CNN cancer detection models that we trained previously. The results show that the performance of our custom model approaches that of the best of the predefined models, but with a significantly reduced number of parameters. This makes it better suited for deployment on systems with constrained resources and more robust when trained on small datasets.

Learning Curves

A learning curve is a plot of model performance as the model gains experience. Learning curves are a diagnostic tool for algorithms that learn from data incrementally: the model is evaluated on the training data and the validation data at each training epoch, and the resulting curves can be used to diagnose problems such as underfitting or overfitting.

The curve calculated from the training dataset gives an idea of how well the model is “specializing,” while the curve from the validation dataset gives an idea of how well the model is “generalizing.” A good fit is indicated by training and validation curves that increase to a point of stability with a minimal generalization gap between the two final values.
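As a concrete illustration, these curves can be produced from the history returned by a training run. The sketch below assumes a Keras-style history object with “accuracy” and “val_accuracy” recorded at each epoch (the exact metric names depend on how the model was compiled):

```python
import matplotlib.pyplot as plt

# Assumes `history` is the object returned by Keras model.fit(...),
# with "accuracy" and "val_accuracy" recorded at every epoch.
def plot_learning_curves(history):
    epochs = range(1, len(history.history["accuracy"]) + 1)
    plt.plot(epochs, history.history["accuracy"], label="training")
    plt.plot(epochs, history.history["val_accuracy"], label="validation")
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.show()
```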

Below are the training curves for the four models evaluated on the PatchCamelyon dataset. The small generalization gap demonstrates that, by using augmentation and choosing relatively small networks, we have been able to successfully combat overfitting in all cases while achieving an accuracy of around 95% for all models.

An underfit model has noisy values or low accuracy on the training data, indicating that it was unable to learn. An overfit model has learnt the training dataset too well and has become “specialized” on the training data; it will be unable to generalize to new data. Overfitting occurs if the model has too much capacity for the problem or is trained for too long, and the resulting generalization error can be estimated from performance on the validation dataset. The accuracy of an overfit model will usually be higher on the training data than on the validation data, and this difference is referred to as the “generalization gap.”
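To put a number on that gap, one simple approach (again a sketch, assuming the same Keras-style history object as above) is to average the training and validation accuracy over the final few epochs and take the difference:

```python
import numpy as np

# Rough estimate of the generalization gap from the last `window` epochs.
def generalization_gap(history, window=5):
    train_acc = np.mean(history.history["accuracy"][-window:])
    val_acc = np.mean(history.history["val_accuracy"][-window:])
    return train_acc - val_acc  # a large positive value suggests overfitting
```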

Receiver Operating Characteristic

The ROC curve is a plot of the True Positive Rate against the False Positive Rate for a binary classifier at various class probability thresholds (these allow the operator to trade off the kinds of errors made by the model). The area under the curve (AUC) is the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. A perfect classifier yields an AUC of 1, which means we would classify all positive samples correctly with no false positives.

The ROC curve for the selected models is shown above. It demonstrates the high diagnostic ability of all classifiers. The MobileNetV2 model is the best performer, achieving an AUC of 0.97.
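For reference, the curve and its AUC for a single model can be computed with scikit-learn. In this sketch, y_val and y_score are placeholder names for the validation labels and the predicted positive-class probabilities:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# y_val: true binary labels; y_score: predicted probabilities for the
# positive class, e.g. model.predict(x_val).ravel() for a Keras model.
fpr, tpr, thresholds = roc_curve(y_val, y_score)
auc = roc_auc_score(y_val, y_score)
print(f"AUC: {auc:.3f}")
```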

The True Positive Rate describes how good the model is at predicting the positive class when the actual outcome is positive.

True Positive Rate = True Positives / (True Positives + False Negatives)

The False Positive Rate summarizes how often a positive class is predicted when the actual outcome is negative.

False Positive Rate = False Positives / (False Positives + True Negatives)
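Both rates can be read off the confusion matrix at a given threshold. As an illustrative sketch (y_val and y_pred being hypothetical arrays of true labels and hard predictions):

```python
from sklearn.metrics import confusion_matrix

# y_pred: hard predictions at a chosen threshold, e.g. (y_score > 0.5)
tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel()
tpr = tp / (tp + fn)  # True Positive Rate (sensitivity)
fpr = fp / (fp + tn)  # False Positive Rate
```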

The ROC curve allows different models to be compared directly, and the area under the curve (AUC) gives a summary of the skill of each model. Skilful models are represented by curves that bend up into the top left corner. A model with no skill is represented by a diagonal line from the bottom left of the plot to the top right.
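A comparison plot of this kind could be produced along the following lines, where models, x_val and y_val are illustrative names for a dictionary of trained models and the validation data:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# models: e.g. {"CancerNet": ..., "MobileNetV2": ...} mapping names to trained models
plt.figure()
for name, model in models.items():
    y_score = model.predict(x_val).ravel()
    fpr, tpr, _ = roc_curve(y_val, y_score)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_val, y_score):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="no skill")  # diagonal reference line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```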

Analysis

These results represent a very healthy baseline. Verifiable results on Kaggle can achieve an AUC of 0.99, indicating that some improvement is possible. That we are so close to the best result is not surprising, since three of the models are the result of world-class research and were chosen specifically because they are well suited to small datasets. This is especially true of MobileNetV2, which is intended for deployment on mobile phones.

It is encouraging that CancerNet can compete with the predefined models. It outperforms ResNet50, which is the deepest model in our selection, indicating that aiming for relatively shallow models is a good design choice. That said, it is possible that we could improve performance by adding convolutional blocks in a further design iteration and get closer to the performance of DenseNet169 and MobileNetV2.