For the first time, a Generative Adversarial Network is being used to create synthetic datasets of wound imagery, in order to redress a critical lack of diverse and accessible content of this type in healthcare machine learning applications.
The system, called WG2AN, is a collaboration between the Batten College of Engineering & Technology and AI health company eKare, which specializes in applying machine learning methodologies to the measurement and identification of wounds.
The GAN is trained on 100 to 4000 labeled stereoscopic chronic wound images provided by eKare, including anonymized pictures of wounds arising from causes such as pressure, surgery, lymphovascular incidents, diabetes and burns. The source images ranged in size from 1224×1224 to 2160×2160, and were all taken by physicians under available light.
To accommodate the available latent space in the model training architecture, the images were rescaled to 512×512, and extracted from their backgrounds. To study the effect of dataset size, test runs were implemented on batches of 100, 250, 500, 1000, 2000, and 4000 images.
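The dataset-size experiment can be illustrated with a minimal Python sketch. It draws the six subset sizes listed above from a shuffled pool and reports how many mini-batches each would yield per epoch at the batch size of 64 that the article reports. Whether the researchers nested the smaller subsets inside the larger ones is not stated, so the nesting here is an assumption, and the function names and image identifiers are hypothetical.

```python
import random

# Subset sizes reported in the article.
SUBSET_SIZES = [100, 250, 500, 1000, 2000, 4000]


def make_subsets(image_ids, sizes=SUBSET_SIZES, seed=0):
    """Return {size: ids}; each smaller subset is a prefix of the larger ones.

    Nesting the subsets is an assumption made for this sketch, not a
    detail taken from the paper.
    """
    if max(sizes) > len(image_ids):
        raise ValueError("not enough images for the largest subset")
    rng = random.Random(seed)
    shuffled = list(image_ids)
    rng.shuffle(shuffled)
    return {n: shuffled[:n] for n in sorted(sizes)}


def batches_per_epoch(n_images, batch_size=64):
    """Mini-batches per epoch, counting a final partial batch."""
    return -(-n_images // batch_size)  # ceiling division


# Hypothetical identifiers standing in for the real image files.
ids = [f"wound_{i:04d}" for i in range(4000)]
subsets = make_subsets(ids)
for n, subset in subsets.items():
    print(f"{n:>4} images -> {batches_per_epoch(len(subset))} batches per epoch")
```

At a batch size of 64, the smallest run sees only two mini-batches per epoch, while the largest sees 63, which goes some way to explaining the wide spread in training times.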
The image above shows increasing detail and granularity according to the size of the contributing training set, and the number of epochs run on each pass.
WG2AN runs on PyTorch on a relatively lean consumer-style setup: a GTX 1080 GPU with 8GB of VRAM. Training took between 4 and 58 hours across dataset sizes of 100 to 4000 images and a range of epoch counts, with a batch size of 64 chosen as a trade-off between accuracy and performance. The Adam optimizer is used at a learning rate of 0.0002 for the first half of training, which then concludes with a linearly decaying learning rate until a loss of zero is achieved.
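The learning-rate schedule described above can be sketched as a small Python function: constant at 0.0002 for the first half of training, then falling linearly. The article does not give the end point of the decay, so this sketch assumes the rate reaches zero at the final epoch, as is common in similar GAN training setups. The function name and the 200-epoch example are illustrative, not taken from the paper.

```python
def learning_rate(epoch, total_epochs, base_lr=2e-4):
    """Learning rate at a given epoch.

    Constant at base_lr for the first half of training, then linear
    decay. Assumption: the decay ends at exactly 0 on the final epoch.
    """
    if total_epochs <= 0 or not 0 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs]")
    half = total_epochs / 2
    if epoch <= half:
        return base_lr
    return base_lr * (total_epochs - epoch) / (total_epochs - half)


# Example: a 200-epoch run, sampled at a few points.
for e in (0, 100, 150, 200):
    print(e, learning_rate(e, 200))
```

In a PyTorch training loop, a function of this shape (divided by the base rate, so that it returns a multiplier) could be passed to `torch.optim.lr_scheduler.LambdaLR` alongside an Adam optimizer, though the source does not say how the authors implemented it.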
In medical datasets, as with so many other sectors of machine learning, labeling is an inevitable bottleneck. In this case, the researchers used a semi-automated labeling system that leverages earlier research from eKare, which employed real-world models of wounds, created in Play-Doh and roughly colored for semantic context.
The researchers noted a problem that frequently occurs in the initial stages of training, when a dataset is quite diverse and weights are randomized – the model takes a long time (75 epochs) to ‘settle down’:
Where data is variegated, both GAN and encoder/decoder models struggle to obtain generalization in the earlier stages, as evidenced in the above graph of the training of WG2AN, which tracks the training timeline from inception to zero loss.
Care must be taken to ensure that the training process does not fixate on the features or characteristics of any one iteration or epoch, but rather continues to generalize to a usable mean loss without producing results that excessively abstract the source material. In the case of WG2AN, that would risk creating unbounded, entirely ‘fictional’ wounds, concatenated from too wide a range of unrelated wound types, rather than producing an accurate range of variations within a particular wound type.
Controlling Scope In A Machine Learning Dataset
Models with lighter training sets generalize faster, and the paper’s researchers contend that the most realistic images could be obtained at less than the maximum settings: a 1000-image dataset trained over 200 epochs.
Though smaller datasets might achieve highly realistic images in less time, the range of images and types of wound generated will necessarily be more limited as well. There is a delicate balance in GAN and encoder/decoder training regimes between the volume and variety of input data on the one hand, and the fidelity and realism of the produced images on the other. These issues of scope and weighting are certainly not confined to medical image synthesis.
Class Imbalances In Medical Datasets
In general, healthcare machine learning is beset not only by a lack of datasets, but by class imbalances, where essential data on a specific disease constitutes so small a percentage of its host dataset that it risks either being dismissed as outlier data, or being assimilated in the process of generalization throughout training.
A number of methods have been proposed to address the latter issue, such as under-sampling or over-sampling. However, the problem is frequently side-stepped by developing disease-specific datasets that are entirely bound to a single medical issue. Though this approach is effective on a per-case basis, it does contribute to the culture of Balkanization in the sphere of medical machine learning research, and arguably slows down general progress in the sector.
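As an illustration of the under-sampling and over-sampling mentioned above, here is a minimal Python sketch of the random variants of both techniques. It is not drawn from the WG2AN paper; the function names, the class labels and the 90/10 split are hypothetical and chosen only to show the mechanics.

```python
import random
from collections import Counter, defaultdict


def group_by_label(samples):
    """Group (item, label) pairs by their label."""
    groups = defaultdict(list)
    for item, label in samples:
        groups[label].append((item, label))
    return groups


def random_oversample(samples, seed=0):
    """Duplicate minority-class samples at random until all classes
    match the size of the largest class."""
    rng = random.Random(seed)
    groups = group_by_label(samples)
    target = max(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


def random_undersample(samples, seed=0):
    """Randomly discard majority-class samples until all classes
    match the size of the smallest class."""
    rng = random.Random(seed)
    groups = group_by_label(samples)
    target = min(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced


# Hypothetical imbalanced dataset: 90 common cases, 10 rare ones.
data = [(f"img_{i}", "common") for i in range(90)]
data += [(f"img_{i}", "rare") for i in range(90, 100)]

over = Counter(label for _, label in random_oversample(data))
under = Counter(label for _, label in random_undersample(data))
print(over, under)
```

Over-sampling keeps every original image but repeats the rare ones, which can encourage memorization of those few examples. Under-sampling avoids repetition at the cost of discarding most of the majority class.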