Conditional Denoising Diffusion Probabilistic Model

cDDPM for Facial Attribute Translation and Domain Conversion | GEI898

University of Sherbrooke

1. Context & Objective

This project builds a conditional denoising diffusion probabilistic model (cDDPM) for domain-to-domain image translation. The primary task converts male portraits to female and female to male using the CelebA facial attribute dataset. A secondary experiment applies the same architecture to Day2Night landscape conversion.

Technical Goal

A standard diffusion model generates images from noise with no external guidance. The conditional variant steers that generation toward a target domain using a reference image as input. The model must learn which visual attributes belong to the target domain, transfer them to the output, and do so without discarding the source content entirely.

Translation Tasks

  • Primary task: Facial attribute translation on the CelebA dataset. Male portraits are converted to female and vice versa.
  • Secondary task: Day-to-night landscape conversion, used to benchmark generalization beyond facial data.
  • Evaluation: KID (Kernel Inception Distance) measures output quality against the target distribution. Huber Loss tracks training convergence.

Technology Stack

  • Python 3.x
  • PyTorch
  • NumPy
  • Matplotlib
  • CelebA Dataset
Domain translation: source portrait and expected translated output.

2. Architecture & Models

2.1 U-Net Denoising Backbone

The core model is a U-Net adapted for iterative denoising. Forward diffusion corrupts the input image by adding noise progressively over timesteps using a linear scheduler. At each timestep, the network receives the noisy image plus the conditioning signal and predicts the noise component added at that step. Repeating this prediction in reverse across all timesteps reconstructs a clean, domain-translated image.

Core Architecture Components
  • Backbone: U-Net with contracting encoder, expanding decoder, and skip connections
  • Noise Scheduler: Linear scheduler for forward diffusion — noise variance increases uniformly across timesteps
  • Self-Attention: Captures long-range spatial dependencies within feature maps
  • Cross-Attention: Relates image features directly to the conditioning signal at each layer
  • Condition Injection: FiLM (Feature-wise Linear Modulation) applied inside every ResNet block

The encoder compresses the noisy input into a compact feature representation by progressively reducing spatial resolution. Skip connections carry high-resolution detail forward to the decoder, preventing fine structure from disappearing in the bottleneck. The decoder rebuilds spatial resolution while using those skip connections to restore sharp boundaries and textures. Predicting noise rather than the clean image directly makes training more numerically stable, since the target is always a well-behaved Gaussian rather than a complex natural image distribution.

Global U-Net architecture: FiLM conditioning modules and attention layers at each depth.

Forward diffusion process: the linear scheduler progressively corrupts the image across timesteps.
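
As a minimal sketch of the forward process described above (not the project's exact code), the snippet below builds a linear β schedule and applies the standard closed-form corruption x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε; the timestep count, β range, and tensor shapes are illustrative assumptions.

```python
import torch

# Linear beta schedule: noise variance increases uniformly across timesteps.
# T and the beta range are illustrative assumptions.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)          # cumulative product over timesteps

def q_sample(x0, t, noise):
    """Closed-form forward diffusion: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)        # broadcast over a (B, C, H, W) batch
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

# Usage: the U-Net is trained to recover `noise` from (x_t, timestep, condition).
x0 = torch.randn(4, 3, 64, 64)                     # stand-in batch of clean images
t = torch.randint(0, T, (4,))
noise = torch.randn_like(x0)
x_t = q_sample(x0, t, noise)
```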

2.2 FiLM Conditioning Mechanism

Conditioning is applied through FiLM (Feature-wise Linear Modulation). A lightweight CNN encoder processes the condition image and outputs a compact latent vector. Two linear projections map that vector to a scale factor γ and a shift factor β. These two parameters modulate every intermediate feature map inside the U-Net ResNet blocks: FiLM(x) = γ ⊙ x + β.

FiLM applied at every residual block means the conditioning signal reaches both shallow layers, where texture and local structure live, and deep layers, where high-level semantic attributes are represented. Injecting only at the input would lose influence over deep features entirely.

FiLM Modulation Details
  • Condition Encoder: Lightweight CNN maps reference image to a latent vector
  • Scale (γ): Multiplies each feature map channel-wise to amplify or suppress activations
  • Shift (β): Adds a learned offset per channel after scaling
  • Scope: Applied inside every ResNet block in the encoder and decoder
  • Formula: FiLM(x) = γ ⊙ x + β, where x is the intermediate feature tensor

Think of it this way: the conditioning image sends two instructions to every layer of the network. One says "scale up these features, scale down those." The other says "shift everything by this amount." The instructions differ at every depth: the condition encoder produces a single latent vector, and each layer applies its own linear projections to obtain its γ and β pair. The network learns which features to amplify at each depth to best perform the requested domain shift.

FiLM modulation inside the ResNet block: γ scales and β shifts each feature map channel-wise.
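
To make the mechanism concrete, here is a minimal PyTorch sketch of a FiLM-modulated residual block, assuming the condition latent has already been produced by the lightweight CNN encoder; the layer widths, latent dimension, and normalization choice are illustrative assumptions rather than the project's actual configuration.

```python
import torch
import torch.nn as nn

class FiLMResBlock(nn.Module):
    """Residual block with FiLM conditioning: per-channel gamma and beta are
    projected from the condition latent and applied to the intermediate features."""
    def __init__(self, channels: int, cond_dim: int):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        # Two linear projections map the condition latent to channel-wise scale and shift.
        self.to_gamma = nn.Linear(cond_dim, channels)
        self.to_beta = nn.Linear(cond_dim, channels)

    def forward(self, x, cond_latent):
        h = self.conv1(torch.relu(self.norm(x)))
        gamma = self.to_gamma(cond_latent).view(-1, h.shape[1], 1, 1)  # channel-wise scale
        beta = self.to_beta(cond_latent).view(-1, h.shape[1], 1, 1)    # channel-wise shift
        h = gamma * h + beta                       # FiLM(x) = gamma ⊙ x + beta
        h = self.conv2(torch.relu(h))
        return x + h                               # residual connection

# Usage: one block per depth; each block learns its own gamma/beta projections.
block = FiLMResBlock(channels=64, cond_dim=128)
feat = torch.randn(2, 64, 32, 32)
cond = torch.randn(2, 128)                         # latent from the condition encoder CNN
out = block(feat, cond)
```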

2.3 Attention Placement Strategy

Attention is not applied uniformly across the network. Each location in the U-Net has a different representational role, and the type of attention placed there was chosen to match that role. The table below summarizes what each attention type contributes at each network location.

| Attention Type | U-Net Feature Maps | Skip Connections | Latent Space (Bottleneck) |
|---|---|---|---|
| No Attention | N/A | Direct concatenation only, no filtering | N/A |
| Linear Self-Attention | Ensures spatial coherence of features modulated by the timestep embedding | Encoder features are refined before being forwarded to the decoder | N/A |
| Cross-Attention | Selects relevant information from the condition at each position, based on the current noise level | The decoder selectively filters which encoder features are actually useful for reconstruction | N/A |
| Full Self-Attention | N/A | N/A | Global coherence at the most compressed point: every position attends to every other |

Why This Placement Matters
  • Feature maps (high resolution): Linear Self-Attention is used here because full attention would be prohibitively expensive at large spatial sizes. It maintains temporal and spatial consistency across feature channels without quadratic cost.
  • Skip connections: Cross-Attention here is critical — it prevents the decoder from blindly accepting encoder features that are not relevant to the requested domain translation. The condition (via cross-attention) actively selects what the decoder receives.
  • Latent space (bottleneck): Full Self-Attention is applied only here, where the spatial resolution is at its smallest. The computational cost is manageable, and global coherence at this level propagates through the entire decoder, shaping the output structure at the highest semantic level.

Think of it as three specialized filters placed at three key checkpoints: one that keeps the feature maps internally consistent while denoising, one that decides which encoder context the decoder should trust, and one that ensures the final compressed representation is globally coherent before reconstruction begins.
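
As one hedged illustration of the skip-connection case, the sketch below implements cross-attention in which decoder features act as queries over the encoder's skip features, so the decoder filters what it receives; the channel count, head count, and use of nn.MultiheadAttention are assumptions, not the project's exact modules.

```python
import torch
import torch.nn as nn

class SkipCrossAttention(nn.Module):
    """Cross-attention over a skip connection: decoder features are queries,
    encoder skip features are keys/values, so the decoder selects which
    encoder information to keep."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(channels)
        self.norm_kv = nn.LayerNorm(channels)

    def forward(self, decoder_feat, skip_feat):
        b, c, h, w = decoder_feat.shape
        q = decoder_feat.flatten(2).transpose(1, 2)    # (B, H*W, C) queries from the decoder
        kv = skip_feat.flatten(2).transpose(1, 2)      # (B, H*W, C) keys/values from the skip path
        out, _ = self.attn(self.norm_q(q), self.norm_kv(kv), self.norm_kv(kv))
        out = out.transpose(1, 2).reshape(b, c, h, w)
        return decoder_feat + out                      # residual: filtered skip information added back

# Usage at one decoder depth:
xattn = SkipCrossAttention(channels=128)
dec = torch.randn(2, 128, 16, 16)
skip = torch.randn(2, 128, 16, 16)
fused = xattn(dec, skip)
```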

3. Implementation & Key Optimizations

Three design decisions drove the performance differences across experiment configurations: whether training pairs source and target images explicitly, how the condition enters the network, and whether guidance is applied at inference.

Dataset Strategy: Paired vs Unpaired Training

Paired training presents explicit male-female image correspondences at each update step, giving the model a direct signal for what "correct" translation looks like. Unpaired training provides no explicit correspondence. The model learns the translation from population-level statistics across source and target domains. Paired training produces stronger attribute alignment but requires a dataset of matched pairs, which is harder to curate at scale.

Input concatenation baseline: the condition is appended as extra channels at the network input, limiting its influence to early layers only.

Injection Method Comparison

Two injection strategies were evaluated. Condition Encoding (FiLM) adds the modulated latent representation inside each ResNet block. Input Concatenation appends the condition image as additional channels at the network input. FiLM propagates the condition signal through all network depths. Concatenation limits influence to early processing stages.

| Configuration | Injection Method | Data Pairing | Conditioning Reach |
|---|---|---|---|
| Baseline | Input Concatenation | Unpaired | Input layer only |
| FiLM (Unpaired) | Condition Encoding (FiLM) | Unpaired | All ResNet blocks |
| FiLM + Paired + CFG | Condition Encoding (FiLM) | Paired | All ResNet blocks + guided inference |
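
For contrast with FiLM, a minimal sketch of the input-concatenation baseline is shown below; the image shapes and channel counts are illustrative assumptions.

```python
import torch

# Input-concatenation baseline: the condition image is stacked with the noisy
# input along the channel axis, so only the first convolution sees it directly.
x_t = torch.randn(2, 3, 64, 64)               # noisy image at timestep t
cond = torch.randn(2, 3, 64, 64)              # condition (reference) image
unet_input = torch.cat([x_t, cond], dim=1)    # (2, 6, 64, 64) fed to a 6-channel first conv
```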

Classifier-Free Guidance (CFG)

Without Classifier-Free Guidance, the condition can over-constrain the denoising process, pulling outputs toward a mode that satisfies the condition but deviates from a natural image distribution. CFG addresses this by randomly dropping the condition during training with a probability of 20%. The model learns to generate under both conditioned and unconditioned paths.

CFG at Inference

At inference time, predictions from the conditioned and unconditioned passes are interpolated. This interpolation controls how strongly the condition steers the output. A high guidance weight prioritizes attribute transfer. A low weight prioritizes image naturalness. The 20% training dropout keeps both modes well-calibrated.
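
A minimal sketch of this interpolation, assuming a model callable that accepts None as the dropped (null) condition; the function and argument names are placeholders, not the project's API.

```python
import torch

def cfg_noise_prediction(model, x_t, t, cond, guidance_weight: float):
    """Classifier-free guidance at inference: interpolate the conditioned and
    unconditioned noise predictions."""
    eps_cond = model(x_t, t, cond)       # conditioned pass
    eps_uncond = model(x_t, t, None)     # unconditioned pass (condition dropped)
    # guidance_weight > 1 pushes harder toward the condition; lower values favour naturalness.
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)
```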

Training Configuration
  • Epochs (cDDPM): 54
  • CFG dropout: 20%
  • Scheduler: Linear
  • Injection: FiLM
  • Loss function: Huber
  • Eval metric: KID
  • Tasks: CelebA, Day2Night

Training Pipeline
  1. Sample source image and condition image (paired or unpaired)
  2. Forward diffusion: add noise via linear scheduler across timesteps
  3. Drop condition with 20% probability (CFG training)
  4. Forward pass through U-Net with FiLM injection at each ResNet block
  5. Compute Huber loss against ground-truth noise
  6. Backward pass, update weights
  7. At inference: interpolate conditioned and unconditioned predictions via CFG
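
The pipeline above can be summarized in a single hedged training-step sketch; the module names (unet, cond_encoder), the schedule constants, and the choice to represent a dropped condition as a zeroed latent are illustrative assumptions, not the project's actual code.

```python
import torch
import torch.nn.functional as F

def training_step(unet, cond_encoder, optimizer, x0, cond_img, p_drop=0.2, T=1000):
    """One training update following steps 1-6 of the pipeline above."""
    betas = torch.linspace(1e-4, 0.02, T, device=x0.device)     # linear scheduler
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)

    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)   # sample timesteps
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise      # forward diffusion

    cond_latent = cond_encoder(cond_img)
    if torch.rand(()) < p_drop:                                 # CFG training: drop condition 20% of the time
        cond_latent = torch.zeros_like(cond_latent)

    pred_noise = unet(x_t, t, cond_latent)                      # FiLM-conditioned U-Net pass
    loss = F.huber_loss(pred_noise, noise)                      # Huber loss against ground-truth noise

    optimizer.zero_grad()
    loss.backward()                                             # backward pass, update weights
    optimizer.step()
    return loss.item()
```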

4. Results & Visuals

4.1 Facial Attribute Translation (CelebA)

  • Training epochs: 54 (cDDPM on CelebA)
  • CFG dropout: 20% (condition dropout rate)
  • Quality metric: KID (Kernel Inception Distance)
  • Training loss: Huber (robust to noise outliers)

Generated outputs show successful transfer of gender-linked attributes including hair length, facial structure, and apparent makeup. Source identity is not consistently preserved. The conditioning signal transfers target-domain attributes but provides no explicit anchor for source-specific structure, which allows identity drift.

Adding a face recognition loss would constrain structural drift directly. Without it, extending training is unlikely to resolve the issue since the model receives no gradient signal penalizing identity change. FiLM's multi-depth injection is the design choice that produced attribute transfer quality at the level shown here — input concatenation alone would not have reached it.

Generated samples and training loss curve for the paired image configuration.

4.2 Injection Method Comparison

Generated samples and training loss curve for the unpaired image configuration.

Multi-depth FiLM injection produces stronger attribute transfer than input concatenation alone. A condition injected only at the input loses influence over deep layers where high-level semantic attributes such as facial structure and expression are represented. FiLM's reach across all ResNet blocks closes that gap by modulating both local texture features and global semantic features simultaneously. The KID difference between configurations reflects this directly: translation quality scales with how deeply the condition penetrates the network.
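
For reference, KID can be computed with an off-the-shelf metric implementation such as torchmetrics (which relies on torch-fidelity for the Inception network); the snippet below is a usage sketch with illustrative sample counts, not the project's evaluation script.

```python
import torch
from torchmetrics.image.kid import KernelInceptionDistance

# Usage sketch: compare generated translations against real target-domain images.
# Sample counts and subset_size are illustrative; KID reports a (mean, std) pair.
kid = KernelInceptionDistance(subset_size=50)
real_imgs = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)   # target-domain images
fake_imgs = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)   # generated translations
kid.update(real_imgs, real=True)
kid.update(fake_imgs, real=False)
kid_mean, kid_std = kid.compute()   # lower mean = closer to the target distribution
```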

4.3 Day2Night Benchmark: cDDPM vs GAN

The cDDPM trained for 54 epochs was benchmarked against a standard GAN baseline on Day2Night landscape conversion. This comparison tests whether diffusion-based generation is competitive against the established adversarial approach on a structurally simpler domain shift.

| Model | Training Paradigm | Conditioning | Task |
|---|---|---|---|
| cDDPM (54 epochs) | Diffusion, iterative denoising | FiLM + CFG | Day2Night |
| GAN (Baseline) | Adversarial, generator-discriminator | Standard conditional | Day2Night |

Diffusion models produce high-fidelity outputs but require significantly more compute at inference time than GANs. The GAN baseline reaches stable coherence faster on low-complexity domain shifts where the target distribution is narrow and well-defined, which makes Day2Night a competitive benchmark for it. The 54-epoch cDDPM result establishes how much training budget this architecture requires to reach comparable coherence. On more complex tasks where the target domain is semantically richer, the diffusion model's iterative refinement advantage over adversarial training would be more pronounced.
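
To make the inference-cost point concrete, here is a hedged sketch of reverse (ancestral) DDPM sampling with classifier-free guidance: every generated image requires T pairs of U-Net evaluations, versus a single generator pass for a GAN. Function names, the guidance weight, and the schedule constants are illustrative assumptions.

```python
import torch

@torch.no_grad()
def sample(unet, cond_latent, shape, guidance_weight=2.0, T=1000, device="cpu"):
    """Reverse DDPM sampling with CFG: one conditioned and one unconditioned
    U-Net pass at each of the T timesteps."""
    betas = torch.linspace(1e-4, 0.02, T, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, device=device)                      # start from pure noise
    for i in reversed(range(T)):                               # iterate timesteps in reverse
        t = torch.full((shape[0],), i, device=device, dtype=torch.long)
        eps_c = unet(x, t, cond_latent)
        eps_u = unet(x, t, None)
        eps = eps_u + guidance_weight * (eps_c - eps_u)        # classifier-free guidance
        mean = (x - betas[i] / (1 - alpha_bars[i]).sqrt() * eps) / alphas[i].sqrt()
        noise = torch.randn_like(x) if i > 0 else torch.zeros_like(x)
        x = mean + betas[i].sqrt() * noise                     # posterior std approximated by sqrt(beta_t)
    return x
```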

4.4 Stress Test: Personal Portrait

Left: condition (source portrait). Right: the model's generated output.

The model was tested on a personal portrait. The objective was simple: take a real photo of the author and translate it into a female version that preserves the original features. What it produced instead could generously be described as abstract.

The output keeps a rough facial geometry. Everything else collapses: skin tone drifts, expression dissolves, the background merges with the foreground, and the overall result looks less like a translated portrait and more like a late-night encounter in a haunted house. The model did not turn the author into a woman. It turned him into a demon.

What this actually demonstrates

This result is not an edge case. It is a precise illustration of the core limitation: the condition injection mechanism steers the domain but provides no structural anchor for source identity. The model knows what female faces look like. It does not know how to make this specific face look female. That gap is exactly what refined spatial conditioning and an identity loss are designed to close in future work.

5. Conclusion & Future Perspectives

Project Summary

This project demonstrates a functional cDDPM pipeline for image-to-image translation using FiLM conditioning, Classifier-Free Guidance, and paired training. The model transfers gender-linked facial attributes on CelebA and produces directional domain shifts on Day2Night. The conditioning signal was not fully optimized in this iteration: guidance from the condition image sometimes interferes with the denoising process rather than steering it cleanly.

Key Outcomes

  • FiLM conditioning across all ResNet blocks outperforms input concatenation for attribute transfer tasks
  • CFG at 20% dropout enables stable inference guidance without destabilizing training
  • Paired training improves attribute correspondence over unpaired population-level sampling
  • Day2Night benchmark provides a concrete reference comparison against standard GAN performance
  • Identity preservation across translation is the primary open problem in this configuration

Future Improvements

  • Spatial Placeholder Conditioning: Reserve specific spatial areas in the feature maps to anchor semantic attributes, improving condition fidelity without over-constraining the denoising path
  • Identity Loss: Add a face recognition loss term to penalize structural drift from the source identity during translation
  • Deeper Condition Optimization: Extend training beyond 54 epochs with the conditioning scheme refined to reduce interference with the denoising trajectory
  • Attention-Based Injection: Replace or supplement FiLM with cross-attention over the condition latent for finer spatial control over where attributes are transferred
  • Classifier Guidance: Combine CFG with a domain attribute classifier to apply semantic gradients at inference time

Example where the condition failed to redirect the denoising trajectory toward the target domain.

Future direction: placeholder-based spatial conditioning that reserves feature map areas for targeted semantic attribute injection.

Resources & Links

Associated Documents
  • Primary Dataset: CelebA (Large-scale CelebFaces Attributes Dataset) — facial attribute annotations across 200k+ images
  • Secondary Dataset: Day2Night landscape pairs for generalization benchmarking
  • Source Code: Training and evaluation scripts (link to be added)