Controllable and Robust Conditional Models and Manipulation Under Distribution Variation and Multimodal Condition
Abstract
Deep learning with large-scale data has achieved remarkable success in both discriminative and generative modeling. These advances have enabled applications such as image classification, object detection, and text-to-image generation and editing. A central goal across these tasks is to model a conditional distribution, for example mapping images to class labels or generating images conditioned on text.
However, learning a controllable and robust conditional distribution presents two major challenges. First, models must overcome distributional shifts and variations and remain robust across different domains. Second, they must learn and manipulate conditional relationships under multimodal conditions, such as interactions between text and images. This thesis addresses controllable conditional modeling and manipulation for both discriminative and generative tasks, particularly under distributional shift and multimodality. For discriminative tasks, we investigate how to mitigate distributional shifts across domains and learn controllable mappings from images to class labels that generalize across diverse environments. For generative tasks, we explore how to model and manipulate multimodal conditional distributions, especially those involving both text and images, to achieve flexible and precise image editing.
In Chapter 2, we address the robustness of discriminative models under distributional shift in the context of unsupervised domain adaptation. Specifically, we propose a mutual conditional alignment method that aligns the conditional distributions across blended and different domains, which enables adapting a source model to different unseen target domains. We theoretically prove that aligning the conditional distributions is overwhelmed when adapting to multiple blended target domains, and design a framework to mutually align two conditional distributions related to the class labels. In Chapter 3, we extend conditional modeling to the generative setting by learning the mapping from image pairs to textual descriptions. Specifically, we distill the transformation semantics between a pair of before-and-after images into textual form and leverage these learned representations to facilitate more complex image editing tasks. In Chapter 4, we discuss how to control the text-to-image distribution in rectified flow and diffusion transformer models, and leverage a text-to-image generative model for text-guided image editing. In Chapter 5, we provide a theoretical analysis of the inversion algorithm introduced in Chapter 4 and generalize the framework to broader scenarios of conditional image generation, encompassing both image editing and ID-consistent generation. Furthermore, we analyze the limitations of current evaluation metrics for non-rigid editing and curate a new dataset specifically designed for assessing non-rigid image editing performance.
In summary, this dissertation presents a comprehensive exploration of controllable conditional modeling under conditions of domain shift and multimodal interaction. By bridging theoretical insights and practical methodologies across discriminative and generative paradigms, the presented research contributes to a deeper understanding of how conditional distributions can be effectively aligned, modeled, and manipulated across complex visual and multimodal tasks.