Distribution-Aware Data Augmentation for Imbalanced Regression

Abstract

Imbalanced regression occurs when the distribution of a continuous target variable is non-uniform, leaving some regions with low sample density and causing models to perform poorly on rare but important values. Most data-level solutions rely on linear interpolation and global thresholds, which fail to capture complex data distributions and cannot handle high-dimensional inputs like images. This thesis develops three distribution-aware data augmentation methods for imbalanced regression. SMOGAN (Synthetic Minority Oversampling with GAN Refinement) uses a Generative Adversarial Network to refine synthetic samples generated by base oversamplers, aligning them with the true joint feature-target distribution. LDAO (Local Distribution-Based Adaptive Oversampling) employs clustering and Kernel Density Estimation to generate samples that preserve local distributional structure, avoiding arbitrary global thresholds. LatentDiff extends data augmentation to high-dimensional domains through a diffusion-based framework that operates in learned feature space, generating synthetic features conditioned on continuous labels. Experiments on benchmark datasets for both tabular and image-based regression demonstrate that these methods outperform existing approaches across different data modalities.
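The clustering-plus-KDE idea behind LDAO can be illustrated with a minimal sketch: cluster the joint feature-target space, fit a kernel density estimate per cluster, and draw more synthetic points from sparsely populated clusters. This is an illustrative simplification using scikit-learn, not the thesis's actual LDAO algorithm; the function name, cluster weighting, and parameters are assumptions for demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KernelDensity


def kde_oversample(X, y, n_clusters=5, n_new=100, bandwidth=0.5, seed=0):
    """Illustrative clustering + KDE oversampler (not the thesis's LDAO).

    Clusters the joint (feature, target) space, then samples new points
    from a per-cluster KDE so local distributional structure is preserved.
    Sparse clusters receive proportionally more synthetic samples.
    """
    rng = np.random.default_rng(seed)
    # Work in the joint feature-target space so labels are generated jointly.
    Z = np.hstack([X, y.reshape(-1, 1)])
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(Z)

    # Allocate synthetic samples inversely to cluster size: rare regions
    # of the target distribution get boosted the most.
    sizes = np.bincount(labels, minlength=n_clusters).astype(float)
    weights = (1.0 / sizes) / (1.0 / sizes).sum()
    counts = rng.multinomial(n_new, weights)

    new_rows = []
    for c in range(n_clusters):
        if counts[c] == 0:
            continue
        kde = KernelDensity(bandwidth=bandwidth).fit(Z[labels == c])
        new_rows.append(kde.sample(counts[c], random_state=seed))

    Z_new = np.vstack(new_rows)
    return Z_new[:, :-1], Z_new[:, -1]  # split back into features and targets
```

Because each KDE is fit only on the points of one cluster, the synthetic samples stay close to the local distribution rather than interpolating across unrelated regions, which is the failure mode of global linear-interpolation oversamplers.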

Summary for Lay Audience

Machine learning models are good at predicting typical outcomes but often fail when predicting rare ones. For example, a model trained to estimate home prices might do well on average-priced houses but poorly on very cheap or very expensive ones, even though those predictions might matter most. Existing methods try to fix this by creating extra training examples, but they use oversimplified techniques that don’t work well with complex data like images. This thesis introduces three new methods that generate more realistic training examples for rare cases. Each method takes a different approach to understanding the structure of the data before creating new examples. Testing on multiple datasets shows that these methods lead to better predictions on rare values compared to existing techniques.

Keywords

Kernel Density Estimation
