Conditional image-to-image translation · PyTorch · CelebA dataset
University of Sherbrooke
I built a conditional diffusion model (cDDPM) that translates facial attributes — converting male portraits to female and vice versa — without any external pretrained backbone. A secondary experiment applies the same architecture to Day2Night landscape conversion.
The challenge: give the model a reference image and have it transfer only the relevant visual attributes — hair, facial structure, skin tone — without overwriting everything else.
CelebA dataset samples: source and target domain portraits used for training.
The model works by destroying and rebuilding: a fixed forward process adds noise to an image step by step, and the network learns to reverse that process, steered by a reference image so that the result lands in a different domain. The backbone is a U-Net that predicts the noise at each step, guided by the conditioning signal until a clean, domain-translated image emerges.
The encoder compresses the image down to a compact representation; skip connections carry fine detail forward so the decoder can reconstruct sharp edges and textures without losing them in the bottleneck. Predicting noise rather than the clean image makes training more stable — the target is always a simple Gaussian, not a complex natural image.
Global U-Net architecture: conditioning injection and attention layers at each depth.
Forward diffusion process: the linear scheduler progressively corrupts the image across timesteps.
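The forward corruption and the noise-prediction target can be sketched in a few lines. This is a minimal illustration, not the project's code: the step count `T = 1000` and the `1e-4` to `0.02` beta range are assumed values (the common DDPM defaults), and the names `linear_beta_schedule` and `q_sample` are hypothetical.

```python
import torch


def linear_beta_schedule(num_steps: int, beta_start: float = 1e-4,
                         beta_end: float = 0.02) -> torch.Tensor:
    """Noise variances beta_1..beta_T, evenly spaced (linear schedule)."""
    return torch.linspace(beta_start, beta_end, num_steps)


def q_sample(x0: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor,
             noise: torch.Tensor) -> torch.Tensor:
    """Jump directly to step t: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    signal = alpha_bar[t].sqrt().view(-1, 1, 1, 1)
    sigma = (1.0 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    return signal * x0 + sigma * noise


T = 1000  # assumed number of diffusion steps
betas = linear_beta_schedule(T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal-retention factor

x0 = torch.rand(4, 3, 64, 64) * 2 - 1   # stand-in batch of images scaled to [-1, 1]
t = torch.randint(0, T, (4,))           # one random timestep per image
noise = torch.randn_like(x0)            # Gaussian noise: the regression target
x_t = q_sample(x0, t, alpha_bar, noise)
# Training would minimise mse(unet(x_t, t, cond), noise) for some U-Net `unet`.
```

Because `alpha_bar` shrinks toward zero, late timesteps are almost pure noise, which is why sampling can start from a Gaussian draw.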
Different layers handle different jobs — I matched the attention mechanism to what each layer actually needs to do.
Think of it as three specialized filters placed at three key checkpoints: one that keeps the feature maps internally consistent while denoising, one that decides which encoder context the decoder should trust, and one that ensures the final compressed representation is globally coherent before reconstruction begins.
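The second of these checkpoints, deciding which encoder context the decoder should trust, is often implemented as an additive attention gate on the skip connection. The sketch below shows that general pattern only; it is an assumption about the kind of mechanism described, not the project's actual layer, and `SkipAttentionGate` and its channel sizes are hypothetical.

```python
import torch
from torch import nn


class SkipAttentionGate(nn.Module):
    """Additive attention gate: the decoder feature decides how much of each
    encoder (skip) location is passed through."""

    def __init__(self, skip_channels: int, gate_channels: int, inter_channels: int):
        super().__init__()
        self.proj_skip = nn.Conv2d(skip_channels, inter_channels, kernel_size=1)
        self.proj_gate = nn.Conv2d(gate_channels, inter_channels, kernel_size=1)
        self.score = nn.Conv2d(inter_channels, 1, kernel_size=1)

    def forward(self, skip: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        # Assumes skip and gate share the same spatial size.
        mixed = torch.relu(self.proj_skip(skip) + self.proj_gate(gate))
        weights = torch.sigmoid(self.score(mixed))  # (B, 1, H, W), values in (0, 1)
        return skip * weights                        # down-weight untrusted locations


attention_gate = SkipAttentionGate(skip_channels=32, gate_channels=64, inter_channels=16)
skip_feat = torch.randn(2, 32, 16, 16)   # encoder feature carried by the skip connection
dec_feat = torch.randn(2, 64, 16, 16)    # decoder feature at the same resolution
gated = attention_gate(skip_feat, dec_feat)
```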
Three design decisions drove the performance differences across experiment configurations: whether training pairs source and target images explicitly, how the condition enters the network, and whether guidance is applied at inference.
Paired training shows the model exact before/after examples — stronger results, harder to scale. Unpaired training learns the translation pattern statistically from two separate pools of images — more flexible, but weaker attribute transfer.
Input concatenation baseline: the condition is appended as extra channels at the network input, limiting its influence to early layers only.
I feed the reference image to the network by appending it as extra input channels: simple, but its influence is limited to the early layers. I tested two configurations: a baseline (unpaired, no guidance) against paired training with classifier-free guidance (CFG).
| Configuration | Injection Method | Data Pairing | Guidance |
|---|---|---|---|
| Baseline (Unpaired) | Input Concatenation | Unpaired | None |
| Paired + CFG | Input Concatenation | Paired | Classifier-Free Guidance |
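Input concatenation, the injection method shared by both configurations, amounts to stacking the reference image onto the noisy image along the channel axis before the first convolution. A minimal sketch, assuming 3-channel 64x64 images and a 64-channel first layer (both sizes are illustrative, not taken from the project):

```python
import torch
from torch import nn

x_t = torch.randn(2, 3, 64, 64)    # noisy image at the current timestep
cond = torch.randn(2, 3, 64, 64)   # reference (condition) image, same resolution

# Stack along the channel dimension: 3 + 3 = 6 input channels.
net_in = torch.cat([x_t, cond], dim=1)

# The first U-Net convolution must accept the widened input.
first_conv = nn.Conv2d(in_channels=6, out_channels=64, kernel_size=3, padding=1)
h = first_conv(net_in)
```

After this first layer the condition exists only as whatever the mixed feature maps retain, which is why its influence weakens with depth.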
Without guidance, the model over-commits to the reference image and produces outputs that look unnatural — it follows instructions too literally. CFG fixes this by randomly dropping the condition during training (20% probability), teaching the model to generate under both conditioned and unconditioned paths.
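The condition dropout used in CFG training can be sketched as a per-sample mask. The 20% rate comes from the text above; representing the "no condition" case as an all-zero image is an assumption, and `drop_condition` is a hypothetical helper name.

```python
import torch


def drop_condition(cond: torch.Tensor, p_drop: float = 0.2) -> torch.Tensor:
    """Replace each sample's condition with zeros with probability p_drop.

    Zeros as the null condition is an assumption; a learned null embedding
    is another common choice.
    """
    keep = (torch.rand(cond.shape[0], device=cond.device) >= p_drop).to(cond.dtype)
    return cond * keep.view(-1, 1, 1, 1)


torch.manual_seed(0)
cond_batch = torch.ones(1000, 3, 8, 8)       # stand-in conditions
train_cond = drop_condition(cond_batch)      # fed to the network in place of cond_batch
dropped_fraction = 1.0 - train_cond[:, 0, 0, 0].mean().item()
```

Dropping per sample rather than per batch means every batch mixes conditioned and unconditioned examples.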
At inference, I blend the conditioned and unconditioned predictions to control how strongly the reference image steers the output — high weight means stronger attribute transfer, low weight means more natural-looking result. The 20% dropout during training keeps both modes reliable.
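The blend itself is one line. The sketch below uses the common CFG formulation; the exact form and weight used in this project are not stated, so treat `cfg_blend` and `guidance_weight` as assumptions. It is written with plain arithmetic, so it works on floats as well as on PyTorch tensors.

```python
def cfg_blend(eps_cond, eps_uncond, guidance_weight):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and, for weights above 1, past) the conditional one."""
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)


# At each denoising step the network is evaluated twice, for example:
#   eps_cond   = unet(x_t, t, cond)
#   eps_uncond = unet(x_t, t, null_cond)
# where `unet` and `null_cond` are placeholders for the trained model and
# the dropped-condition input.
blended = cfg_blend(eps_cond=2.0, eps_uncond=1.0, guidance_weight=3.0)
```

A weight of 0 gives the unconditional prediction, 1 gives the purely conditional one, and larger values push further in the direction the condition suggests.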
The model successfully transfers hair length, facial structure, and makeup — but doesn't preserve who the person is. It knows what female faces look like. It doesn't know how to make this specific face female.
The fix is a face recognition loss — a signal that directly penalizes the model for changing who the person is. Without it, more training won't help; the model is never told identity matters. Injecting the reference image deeper into the network (beyond just the input) would also keep it influential where it counts.
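One way such an identity term could look is a cosine-distance penalty between face embeddings of the source and the generated image. This is a sketch of the proposed future work, not something the project implements: `identity_loss` is a hypothetical name, and the tiny linear `face_encoder` is only a stand-in for a real frozen face recognition network.

```python
import torch
import torch.nn.functional as F
from torch import nn


def identity_loss(face_encoder: nn.Module, generated: torch.Tensor,
                  source: torch.Tensor) -> torch.Tensor:
    """1 - cosine similarity between identity embeddings, averaged over the batch."""
    with torch.no_grad():
        target_emb = face_encoder(source)      # identity to preserve (no gradient)
    gen_emb = face_encoder(generated)          # gradient flows back to the generator
    return (1.0 - F.cosine_similarity(gen_emb, target_emb, dim=1)).mean()


# Stand-in encoder: a real setup would load a pretrained face recognition model.
face_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128)).eval()
for p in face_encoder.parameters():
    p.requires_grad_(False)                    # frozen: it only scores identity

torch.manual_seed(0)
source = torch.randn(2, 3, 64, 64)
generated = torch.randn(2, 3, 64, 64, requires_grad=True)
loss = identity_loss(face_encoder, generated, source)
loss.backward()
```

In a noise-prediction model, `generated` would have to be an estimate of the clean image reconstructed from `x_t` and the predicted noise, and the term would be added to the denoising loss with a small weight.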
Generated samples and training loss curve for the paired image configuration.
Generated samples and training loss curve for the unpaired image configuration.
Paired training with CFG clearly wins. The reference image only enters at the input, so it stops influencing the deeper layers that handle the visual attributes that matter most — paired data and CFG both compensate for that structural limitation.
I ran Day2Night conversion with and without guidance to isolate exactly how much CFG contributes — on a simpler task where everything else is held constant.
| Model | Training Paradigm | Conditioning | Task |
|---|---|---|---|
| cDDPM (54 epochs) | Diffusion, iterative denoising | Concatenation + CFG | Day2Night |
| GAN (Baseline) | Adversarial, generator-discriminator | Standard conditional | Day2Night |
Training and validation loss curves — Day2Night with CFG.
Generated Day2Night outputs with CFG enabled.
Training and validation loss curves — Day2Night without CFG.
Generated Day2Night outputs without CFG.
The validation loss bounces instead of converging, which points to a learning rate issue rather than a generalization failure: the optimizer is overshooting good solutions instead of settling into them. Training loss trends down as expected, so the likely fix is simple: train longer at a lower learning rate.
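A lower rate can be set by hand, or the rate can be reduced automatically when validation loss stops improving. The sketch below shows the second option with PyTorch's `ReduceLROnPlateau`; this scheduler is a suggestion, not what the experiments used, and the starting rate, `factor`, and `patience` values are assumptions.

```python
import torch
from torch import nn

model = nn.Linear(8, 8)  # stand-in for the U-Net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed starting rate
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2
)


def end_of_epoch(val_loss: float) -> float:
    """Report the epoch's validation loss; return the learning rate now in effect."""
    scheduler.step(val_loss)
    return optimizer.param_groups[0]["lr"]
```

With these settings the rate is halved once validation loss has failed to improve for more than two consecutive epochs.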
CFG produces noticeably cleaner results. Without it, the reference image has weaker influence at inference and the output drifts. Both runs learned equally well — the quality gap is entirely from the guidance mechanism, not from training.
Left: condition (source portrait). Right: the model's generated output.
I tested the model on a personal portrait. The objective was simple: take a real photo of me and translate it into a female version that preserves the original features. What it produced instead could generously be described as abstract.
The output keeps a rough facial geometry. Everything else collapses: skin tone drifts, expression dissolves, the background merges with the foreground, and the overall result looks less like a translated portrait and more like a late-night encounter in a haunted house. The model did not turn me into a woman. It turned me into a demon.
This result is not an edge case. It is a precise illustration of the core limitation: the condition injection mechanism steers the domain but provides no structural anchor for source identity. The model knows what female faces look like. It does not know how to make this specific face look female. That gap is exactly what refined spatial conditioning and an identity loss are designed to close in future work.
I built and trained a working conditional diffusion pipeline from scratch, and paired training with guidance beat the unpaired baseline in the configurations tested. The main open problem: concatenation injects the condition only at the input, so its influence fades in the deeper layers. The fix, cross-attention over the condition at every network depth, is the primary next step.
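A cross-attention block of the kind proposed here could look like the sketch below: U-Net features act as queries and features of the reference image act as keys and values. This illustrates the planned direction, not existing project code; `ConditionCrossAttention`, the channel counts, and the head count are all hypothetical.

```python
import torch
from torch import nn


class ConditionCrossAttention(nn.Module):
    """Let every spatial location of a U-Net feature map attend to the
    condition's feature map, then add the result back as a residual."""

    def __init__(self, channels: int, cond_channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=channels, num_heads=num_heads,
            kdim=cond_channels, vdim=cond_channels, batch_first=True,
        )

    def forward(self, x: torch.Tensor, cond_feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        queries = x.flatten(2).transpose(1, 2)              # (B, H*W, C)
        keys_values = cond_feat.flatten(2).transpose(1, 2)  # (B, Hc*Wc, Cc)
        attended, _ = self.attn(queries, keys_values, keys_values, need_weights=False)
        return x + attended.transpose(1, 2).reshape(b, c, h, w)


cross_attn = ConditionCrossAttention(channels=64, cond_channels=32)
feat = torch.randn(2, 64, 16, 16)        # U-Net feature map at some depth
cond_feat = torch.randn(2, 32, 8, 8)     # encoded reference image
out = cross_attn(feat, cond_feat)
```

Placing one such block at each resolution would give the condition a direct path into every depth instead of only the input.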
Example where the reference image failed to redirect the output toward the target domain.
Future direction: reserving dedicated feature map regions to anchor specific visual attributes during generation.