In medical imaging, large datasets are often unavailable because many conditions have a low incidence, and the performance of deep-learning-based algorithms suffers as a result. Data augmentation increases the effective size of a dataset by generating additional training examples from the existing ones, which typically improves model performance on unseen data. This post uses TensorFlow/Keras to augment histopathologic cancer data, which will be used to train a CNN for cancer detection in a following post.
The PatchCamelyon dataset consists of 327,680 color images (96 x 96 px). These small image patches are extracted from larger histopathologic scans of lymph node sections used to identify metastatic cancer. Each image is annotated with a binary label indicating the presence of metastatic tissue. Download the dataset from Kaggle or from the original source.
Training a machine learning model really means tuning its parameters such that it maps an input (e.g. an image) to the correct output (a label) in a consistent way.
State-of-the-art neural networks typically have millions of parameters. To perform well, a model needs to see a correspondingly large number of examples. With small datasets this would be a problem, but luckily neural networks aren’t smart to begin with: minor alterations to the existing images, such as flips, translations or rotations, are enough for the network to treat them as distinct images.
Figure 1. Example augmentations of a single image sample.
You can flip images horizontally and vertically and immediately get a data augmentation factor of 2 to 4.
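To see where the factor of 2 to 4 comes from, here is a minimal NumPy sketch. The random array is only a stand-in for one 96 x 96 RGB patch, and the variable names are illustrative rather than part of the pipeline below.

```python
import numpy as np

# Random stand-in for one 96 x 96 RGB patch (illustrative, not real data).
rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(96, 96, 3), dtype=np.uint8)

variants = {
    "original": patch,
    "horizontal": patch[:, ::-1],   # mirror left-right
    "vertical": patch[::-1, :],     # mirror top-bottom
    "both": patch[::-1, ::-1],      # both flips, i.e. a 180 degree rotation
}

# Every variant keeps the shape of the original patch.
shapes = {name: img.shape for name, img in variants.items()}
```

Using one flip direction doubles the data; combining both directions gives four distinct versions of each patch.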
Rotation range is a value in degrees (0-180) which defines a range within which to randomly rotate pictures.
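As a rough illustration of a random rotation, the sketch below draws an angle uniformly from the range and rotates a stand-in patch with nearest-neighbour sampling. `rotate_nearest` is a hypothetical helper written for this post, not the Keras implementation, which uses its own affine transform and fill modes.

```python
import numpy as np

def rotate_nearest(image, angle_deg):
    """Rotate a (H, W, C) image about its centre using nearest-neighbour sampling."""
    h, w = image.shape[:2]
    theta = np.deg2rad(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for each output pixel, find the source pixel it comes from.
    src_x = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    src_y = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    # Coordinates that fall outside the image are clamped to the nearest edge pixel.
    src_x = np.clip(np.rint(src_x).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(src_y).astype(int), 0, h - 1)
    return image[src_y, src_x]

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(96, 96, 3), dtype=np.uint8)  # stand-in patch

rotation_range = 5  # degrees
angle = rng.uniform(-rotation_range, rotation_range)
rotated = rotate_nearest(patch, angle)
```

With a small range such as 5 degrees the tissue structure stays recognisable while pixel positions still change from one epoch to the next.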
Zooming in enlarges the image and then cuts out a section with the same size as the original image.
Width and height shifts are ranges (as a fraction of total width or height) within which to randomly translate pictures horizontally or vertically.
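A zoom towards the centre can be sketched in plain NumPy as a centre crop followed by a resize back to the original size, which gives the same result as enlarging first and cropping afterwards. `zoom_in` is a hypothetical helper that uses nearest-neighbour resizing for brevity; Keras applies an interpolated affine transform instead.

```python
import numpy as np

def zoom_in(image, zoom):
    """Zoom into the centre of a (H, W, C) image by `zoom` (>= 1), keeping the size."""
    h, w = image.shape[:2]
    crop_h, crop_w = int(round(h / zoom)), int(round(w / zoom))
    top, left = (h - crop_h) // 2, (w - crop_w) // 2
    crop = image[top:top + crop_h, left:left + crop_w]
    # Nearest-neighbour resize of the crop back to (h, w).
    rows = np.arange(h) * crop_h // h
    cols = np.arange(w) * crop_w // w
    return crop[rows][:, cols]

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(96, 96, 3), dtype=np.uint8)  # stand-in patch

zoomed = zoom_in(patch, 1.1)  # 10% zoom, same output shape as the input
```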
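The sketch below shows how a fractional shift range translates into pixels. `shift` is a hypothetical helper that fills the uncovered border by repeating edge pixels; it is meant as an illustration, not as the Keras implementation.

```python
import numpy as np

def shift(image, dx, dy):
    """Translate a (H, W, C) image dx pixels right and dy pixels down, repeating edge pixels."""
    h, w = image.shape[:2]
    rows = np.clip(np.arange(h) - dy, 0, h - 1)
    cols = np.clip(np.arange(w) - dx, 0, w - 1)
    return image[rows][:, cols]

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(96, 96, 3), dtype=np.uint8)  # stand-in patch

width_shift_range = height_shift_range = 0.1
h, w = patch.shape[:2]
# A fraction of 0.1 on a 96 px patch corresponds to at most about 10 px.
dx = int(round(rng.uniform(-width_shift_range, width_shift_range) * w))
dy = int(round(rng.uniform(-height_shift_range, height_shift_range) * h))
shifted = shift(patch, dx, dy)
```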
Before doing anything else let’s read the PatchCamelyon data into a pandas dataframe so that we can conveniently prepare it for preprocessing.
import os
import numpy as np
import pandas as pd
import random
from glob import glob
from sklearn.model_selection import train_test_split
path = "/data/patchcamelyon/"
labels = pd.read_csv(path + 'train_labels.csv')
train_path = path + 'train/'
test_path = path + 'test/'
df = pd.DataFrame({'path': glob(os.path.join(train_path,'*.tif'))})
# The image id is the file name without its directory and extension.
df['id'] = df.path.map(lambda x: os.path.splitext(os.path.basename(x))[0])
df = df.merge(labels, on = "id")
df['label'] = df['label'].astype(str)
# Hold out 20% of the data, then split that part evenly into test and validation sets.
train, test = train_test_split(df, test_size=0.2, stratify = df['label'])
test, valid = train_test_split(test, test_size=0.5, stratify = test['label'])
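The two calls to train_test_split above yield an 80/10/10 split into training, test and validation sets. The arithmetic is sketched below for a hypothetical total of 1,000 images; the real dataset is much larger and the exact counts may differ by one because of rounding.

```python
# Hypothetical number of labelled images, chosen so the fractions divide evenly.
n_total = 1000

# First split: 20% is held out, 80% remains for training.
n_holdout = round(n_total * 0.2)
n_train = n_total - n_holdout

# Second split: half of the held-out part becomes the validation set.
n_valid = round(n_holdout * 0.5)
n_test = n_holdout - n_valid

fractions = (n_train / n_total, n_test / n_total, n_valid / n_total)
```

Because both calls pass `stratify`, each of the three sets keeps roughly the same ratio of positive to negative labels as the full dataframe.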
In Keras, augmentation can be done via the keras.preprocessing.image.ImageDataGenerator class, which lets you configure random transformations and normalization operations to be applied to your image data during training, and instantiate generators of augmented image batches.
from tensorflow.keras.preprocessing.image import ImageDataGenerator
generator = ImageDataGenerator(
    vertical_flip=True,
    horizontal_flip=True,
    rotation_range=5,
    zoom_range=0.1,
    width_shift_range=0.1,
    height_shift_range=0.1
)
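To make the settings above concrete, the sketch below draws one random set of transform parameters for a 96 px patch. It is a simplified view, to my understanding, of what the generator does for every image: `sample_params` is a hypothetical helper, and Keras itself draws separate zoom factors for the two axes.

```python
import numpy as np

rng = np.random.default_rng(42)
img_size = 96  # PatchCamelyon patches are 96 x 96 px

# Ranges copied from the ImageDataGenerator call above.
rotation_range, zoom_range = 5, 0.1
width_shift_range = height_shift_range = 0.1

def sample_params():
    """Draw one random set of transform parameters for a single image."""
    return {
        "flip_horizontal": bool(rng.random() < 0.5),
        "flip_vertical": bool(rng.random() < 0.5),
        "rotation_deg": rng.uniform(-rotation_range, rotation_range),
        "zoom": rng.uniform(1 - zoom_range, 1 + zoom_range),
        "shift_x_px": rng.uniform(-width_shift_range, width_shift_range) * img_size,
        "shift_y_px": rng.uniform(-height_shift_range, height_shift_range) * img_size,
    }

params = sample_params()
```

Since a fresh set of parameters is drawn each time an image is loaded, the model rarely sees exactly the same pixels twice.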
Let’s prepare our data. We will use .flow_from_dataframe() to generate batches of image data (and their labels) directly from the .tif files whose paths are listed in our dataframes.
train_generator = generator.flow_from_dataframe(
    dataframe = train,
    ...
)
valid_generator = generator.flow_from_dataframe(
    dataframe = valid,
    ...
)
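The elided arguments depend on your setup. One plausible set, assuming the `path` and `label` columns built earlier, is sketched below. The parameter names are real flow_from_dataframe arguments, but the values, in particular the batch size, are my assumptions and not necessarily what was used for this post.

```python
# Hypothetical keyword arguments for flow_from_dataframe; adjust them to your setup.
flow_kwargs = dict(
    x_col="path",           # column holding the full file paths
    y_col="label",          # column holding the string labels "0" / "1"
    target_size=(96, 96),   # PatchCamelyon patches are 96 x 96 px
    class_mode="binary",    # two classes, one label per image
    batch_size=32,          # assumed batch size
)

# Usage, assuming `generator`, `train` and `valid` from the code above:
# train_generator = generator.flow_from_dataframe(dataframe=train, shuffle=True, **flow_kwargs)
# valid_generator = generator.flow_from_dataframe(dataframe=valid, shuffle=False, **flow_kwargs)
```

It is also common to build the validation generator from a plain ImageDataGenerator() without random transformations, so that validation scores are measured on unaltered images.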
We can now use these generators to train our model.
model.fit(
    train_generator,
    steps_per_epoch=2000,
    epochs=50,
    validation_data=valid_generator,
    validation_steps=800
)
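The step counts above are fixed numbers. A common alternative is to derive them from the dataset size so that one epoch corresponds to one pass over the data. The sketch below shows the arithmetic with hypothetical sample counts and batch size; `steps_for` is an illustrative helper.

```python
import math

def steps_for(n_samples, batch_size):
    """Number of batches needed to see every sample once (the last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)

# Hypothetical sizes; in practice use len(train), len(valid) and your batch size.
n_train, n_valid, batch_size = 8000, 1000, 32

steps_per_epoch = steps_for(n_train, batch_size)    # 250
validation_steps = steps_for(n_valid, batch_size)   # 32
```

With augmentation switched on, every such pass still shows the model differently transformed versions of the same patches.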