Controllable and Robust Conditional Models and Manipulation Under Distribution Variation and Multimodal Condition

Abstract

Deep learning with large-scale data has achieved remarkable success in both discriminative and generative modeling, enabling applications such as image classification, object detection, and text-to-image generation and editing. A central goal across these tasks is to model a conditional distribution, for example mapping images to class labels or generating images conditioned on text.

However, learning a controllable and robust conditional distribution presents two major challenges. First, models must overcome distributional shifts and variations, and maintain robustness across different domains. Second, they must learn and manipulate conditional relationships under multimodal conditions, such as interactions between text and images. This thesis addresses the problem of controllable conditional modeling and manipulation for both discriminative and generative tasks, particularly under conditions of distributional shift and multimodality. Specifically, for discriminative tasks, we investigate how to mitigate distributional shifts across domains and learn controllable mappings from images to class labels that generalize across diverse environments. For generative tasks, we explore how to model and manipulate multimodal conditional distributions—especially those involving both text and images—to achieve flexible and precise image editing.

In Chapter 2, we address robustness under distributional shift for a discriminative model in the context of unsupervised domain adaptation. Specifically, we propose a mutual conditional alignment method that aligns the conditional distributions across blended and different domains, which enables adapting a source model to different unseen target domains. We theoretically show that aligning the conditional distributions is the dominant factor when adapting to multiple blended target domains, and we design a framework that mutually aligns two conditional distributions related to the class labels. In Chapter 3, we extend conditional modeling to the generative setting by learning the mapping from image pairs to textual descriptions. Specifically, we distill the transformation semantics between a pair of before-and-after images into textual form and leverage these learned representations to facilitate more complex image editing tasks. In Chapter 4, we discuss how to control the text-to-image distribution in rectified flow and diffusion transformer models, and we leverage the text-to-image generative model for text-guided image editing. In Chapter 5, we provide a theoretical analysis of the inversion algorithm introduced in Chapter 4 and generalize the framework to broader scenarios of conditional image generation, encompassing both image editing and ID-consistent generation. Furthermore, we analyze the limitations of current evaluation metrics for non-rigid editing and curate a new dataset specifically designed for assessing non-rigid image editing performance.

In summary, this dissertation presents a comprehensive exploration of controllable conditional modeling under conditions of domain shift and multimodal interaction. By bridging theoretical insights and practical methodologies across discriminative and generative paradigms, the presented research contributes to a deeper understanding of how conditional distributions can be effectively aligned, modeled, and manipulated across complex visual and multimodal tasks.

Summary for Lay Audience

This dissertation explores how to make artificial intelligence (AI) systems more controllable and adaptable when understanding and generating images. Although deep learning has enabled impressive progress in image recognition and text-to-image generation, AI models often struggle when used in new environments or when handling multiple types of input, such as both text and images. This thesis develops new methods to align models across different visual domains, improve text-guided image generation and editing, and analyze how these systems can be made more reliable and consistent. It also introduces a new dataset for evaluating flexible, non-rigid image edits. Together, these contributions advance the goal of creating AI systems that can better understand, generate, and manipulate visual information in a controllable and generalizable way.

Description

Keywords

conditional distribution, domain adaptation, image editing, diffusion model
