Generative Models for Synthetic Biosignal Data

Biomedical signal processing is a cornerstone of modern healthcare and assistive technologies, but machine learning applications in this domain are often hampered by data limitations. Datasets for signals such as electrocardiograms (ECG) or electroencephalograms (EEG) are typically small, expensive to collect and annotate, and subject to strict privacy constraints. They are also frequently imbalanced: rare but clinically critical events are underrepresented, which biases trained models toward the majority classes.

This project addresses these challenges by developing and evaluating deep generative models that produce realistic, high-fidelity synthetic biosignals. Using Generative Adversarial Networks (GANs) and Denoising Diffusion Probabilistic Models (DDPMs), our work provides a framework for data augmentation, class imbalance correction, and signal denoising, with the goal of improving the performance and reliability of downstream machine learning models.

From GANs to Transformers

Our initial approach focused on Generative Adversarial Networks. We progressed from standard RNN-based models to transformer-based architectures, which are better suited to capturing long-range dependencies in time-series data.

TTS-GAN (Li et al., 2022) introduced a pure transformer-based GAN capable of generating realistic single-class time-series data. To address multi-class generation and the issue of data scarcity in minority classes, we developed BioSGAN (Li et al., 2023), a conditional, label-guided model. BioSGAN can be trained on an entire multi-class dataset, leveraging transfer learning between classes to generate high-quality, class-specific signals, even for classes with very few real examples.
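To make "label-guided" concrete, the sketch below shows one common way to condition a generator on a class label: appending a one-hot label vector to the latent noise vector. This is an illustrative assumption, not a description of how BioSGAN injects the label; the function names (`one_hot`, `conditioned_latent`) and the dimensions are hypothetical.

```python
import random


def one_hot(label, num_classes):
    """Return a one-hot list encoding `label` among `num_classes` classes."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def conditioned_latent(label, num_classes, latent_dim, rng):
    """Build a generator input: Gaussian noise followed by the one-hot label.

    A conditional generator receiving this vector can learn to map the same
    noise to different class-specific signals depending on the label part.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return noise + one_hot(label, num_classes)


# Toy usage: request a sample of (hypothetical) class 2 out of 5 classes.
rng = random.Random(0)
z = conditioned_latent(label=2, num_classes=5, latent_dim=8, rng=rng)
print(len(z))   # 8 noise values + 5 label values
print(z[8:])    # the label part of the input
```

Because every class shares the same generator weights, features learned from well-represented classes remain available when generating a minority class, which is the intuition behind training on the entire multi-class dataset.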

Left: The architecture of our label-guided BioSGAN model. Right: Examples of synthetic ECG heartbeats generated by the model, which are visually hard to distinguish from real signals.

Advancing with Diffusion Models

While GANs are powerful, diffusion models have recently emerged as a state-of-the-art alternative that often produces more diverse, higher-fidelity samples. Our research has explored their application to biosignal synthesis with a focus on versatility and robustness.

BioDiffusion (Li et al., 2024) is a versatile framework for unconditional, label-conditional, and signal-conditional generation. This single model can not only generate new data from scratch but can also perform tasks like denoising, signal imputation (filling in missing values), and super-resolution, making it a comprehensive tool for signal enhancement.
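For readers unfamiliar with diffusion models, the sketch below illustrates the standard DDPM forward (noising) process on a toy one-dimensional signal: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise. The linear schedule, the 1000 steps, and the sine-wave "signal" are illustrative assumptions and are not taken from BioDiffusion; a trained model learns to reverse this process.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (values here are assumed defaults)."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t: running product of (1 - beta) up to and including step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t directly from x_0 using the closed-form forward process."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
signal = [math.sin(2 * math.pi * i / 32) for i in range(64)]  # toy stand-in
noisy, noise = add_noise(signal, t=500, alpha_bars=alpha_bars, rng=rng)
```

Under this view, denoising, imputation, and super-resolution can all be framed as running the learned reverse process while conditioning on whatever part of the signal is observed.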

We also tackled the real-world problem of noisy labels in training data. Our work (Atkinson et al., 2023) introduces an adaptation of DDPMs that incorporates label smoothing. This technique prevents the model from becoming overconfident in mislabeled examples, improving the quality and reliability of the generated signals when the model is trained on imperfectly labeled datasets.
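As a reminder of what label smoothing does, the sketch below implements the standard formulation: a one-hot label is mixed with a uniform distribution over classes, so no class is ever assigned probability exactly 1. How the smoothed label is fed into the diffusion model is not described here, so this covers only the smoothing step; the function name and the `epsilon` value are assumptions.

```python
def smooth_label(label, num_classes, epsilon=0.1):
    """Standard label smoothing: (1 - epsilon) * one_hot + epsilon / num_classes."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be in [0, 1]")
    off_value = epsilon / num_classes
    return [
        (1.0 - epsilon) + off_value if i == label else off_value
        for i in range(num_classes)
    ]


# With 4 classes and epsilon = 0.1, the labeled class keeps about 0.925 of
# the probability mass and each other class receives about 0.025.
soft = smooth_label(label=1, num_classes=4, epsilon=0.1)
print(soft)
```

If the recorded label is wrong, the small mass left on the other classes means the true class is not treated as impossible, which limits how strongly a single mislabeled example can pull the model.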

Left: BioDiffusion demonstrates its versatility by successfully denoising noisy signals and imputing missing data points. Right: A comparison of generated signals shows how label smoothing helps the diffusion model create more accurate representations, even when trained on data with mislabeled examples.

Impact on Downstream Tasks

The ultimate goal of generating synthetic data is to improve the performance of other machine learning models. We have demonstrated that augmenting imbalanced datasets with our synthetic signals improves classifier performance. For instance, in arrhythmia classification from ECG signals, adding synthetic data for rare classes substantially increased the F1-score and recall for those classes, leading to a more robust and clinically useful diagnostic model.
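The two ingredients of this evaluation can be sketched as follows: deciding how many synthetic samples each class needs, and measuring per-class recall and F1. Balancing every class up to the size of the largest one is an assumed policy, not necessarily the one used in our experiments, and the class names "N" and "V" and all counts are toy values.

```python
from collections import Counter


def synthetic_budget(labels):
    """Number of synthetic samples per class needed to match the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {cls: target - n for cls, n in counts.items()}


def recall_and_f1(y_true, y_pred, cls):
    """One-vs-rest recall and F1 for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1


# Toy imbalanced training set: 8 majority-class and 2 rare-class heartbeats.
train_labels = ["N"] * 8 + ["V"] * 2
budget = synthetic_budget(train_labels)
print(budget)  # {'N': 0, 'V': 6}

# Toy predictions used only to exercise the metric code.
y_true = ["N", "N", "V", "V"]
y_pred = ["N", "V", "V", "N"]
recall_v, f1_v = recall_and_f1(y_true, y_pred, "V")
print(recall_v, f1_v)  # 0.5 0.5
```

Reporting recall and F1 per class, rather than overall accuracy, keeps rare-class improvements visible, since a classifier that ignores the rare class can still score high accuracy on an imbalanced test set.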

The quality of the generated data is validated through both quantitative metrics and qualitative visualizations, such as the t-SNE plot below, which shows a strong overlap between the distributions of real and synthetic data.

A t-SNE projection shows the close alignment between the distribution of real data (in blue) and synthetic data generated by our model (in red), demonstrating the high fidelity of the generated samples.

References

2024

2023

2022