Distribution-Aware Data Augmentation for Imbalanced Regression
Abstract
Imbalanced regression occurs when the distribution of a continuous target variable is non-uniform, leaving some regions with low sample density and causing models to perform poorly on rare but important values. Most data-level solutions rely on linear interpolation and global thresholds, which fail to capture complex data distributions and cannot handle high-dimensional inputs such as images. This thesis develops three distribution-aware data augmentation methods for imbalanced regression. SMOGAN (Synthetic Minority Oversampling with GAN Refinement) uses a Generative Adversarial Network to refine synthetic samples generated by base oversamplers, aligning them with the true joint feature-target distribution. LDAO (Local Distribution-Based Adaptive Oversampling) employs clustering and Kernel Density Estimation to generate samples that preserve local distributional structure, avoiding arbitrary global thresholds. LatentDiff extends data augmentation to high-dimensional domains through a diffusion-based framework that operates in a learned feature space, generating synthetic features conditioned on continuous labels. Experiments on benchmark datasets for both tabular and image-based regression demonstrate that these methods outperform existing approaches across different data modalities.
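To make the cluster-then-sample idea behind LDAO concrete, the following is a minimal, hedged sketch (not the thesis implementation): it clusters the joint feature-target space with a tiny k-means, fits a Gaussian KDE per cluster, and draws more synthetic samples from sparser clusters. All function names, the inverse-size allocation rule, and the fixed bandwidth are illustrative assumptions.

```python
import numpy as np

def kde_sample(points, n, bandwidth, rng):
    # Drawing from a Gaussian KDE: pick a stored point uniformly,
    # then add Gaussian noise with the kernel bandwidth.
    idx = rng.integers(0, len(points), size=n)
    return points[idx] + rng.normal(0.0, bandwidth, size=(n, points.shape[1]))

def ldao_style_oversample(X, y, n_clusters=3, n_new=100, bandwidth=0.1, seed=0):
    """Illustrative local distribution-based oversampling:
    cluster the joint (X, y) space, then sample per-cluster KDEs,
    allocating more synthetic points to sparser clusters."""
    rng = np.random.default_rng(seed)
    Z = np.column_stack([X, y])  # joint feature-target space

    # Tiny k-means (a real implementation would use a library routine).
    centers = Z[rng.choice(len(Z), n_clusters, replace=False)]
    for _ in range(20):
        labels = np.argmin(((Z[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = Z[labels == k].mean(axis=0)

    # Allocate synthetic samples inversely to cluster size (assumed heuristic).
    sizes = np.array([(labels == k).sum() for k in range(n_clusters)])
    weights = np.where(sizes > 0, 1.0 / np.maximum(sizes, 1), 0.0)
    weights /= weights.sum()
    counts = rng.multinomial(n_new, weights)

    parts = [kde_sample(Z[labels == k], c, bandwidth, rng)
             for k, c in enumerate(counts) if c > 0]
    Z_new = np.vstack(parts)
    return Z_new[:, :-1], Z_new[:, -1]  # split back into features and targets
```

Because generation happens cluster by cluster, each synthetic point stays near one local mode of the data instead of being interpolated across the whole target range, which is the failure mode of global-threshold oversamplers the abstract describes.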