Pix2Pix Shoe Generator

✦ Conditional GAN · Academic Research Demo

Turn Sketches

Into Reality

AI-powered design preview using Conditional Generative Adversarial Networks. Draw a shoe sketch and watch it come to life.

Live Demo

Draw a shoe sketch or upload an edge map

Your Sketch

400 × 400

Pen size: 4px

Upload Sketch

Generated Shoe

👟

Your generated shoe will appear here

The Pipeline

How It Works

Input

Draw or upload a shoe sketch

Provide a clean edge map or line drawing of a shoe silhouette. The model expects a 256×256 binary-style sketch in which white strokes on a black background define the structural outline, much like the output of a Canny edge detector.
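As a rough illustration of that input format, the sketch below downsizes a drawing canvas to 256×256 and thresholds it into white-on-black strokes. The helper names, the threshold of 128, nearest-neighbour resizing, and the [-1, 1] value range are all assumptions for illustration; the demo's actual preprocessing may differ.

```python
def resize_nearest(img, size):
    """Nearest-neighbour resize of a 2D list of pixels to size x size."""
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]


def binarize_sketch(gray, threshold=128):
    """Map 0-255 grayscale pixels to -1.0 (background) or 1.0 (stroke)."""
    return [[1.0 if px >= threshold else -1.0 for px in row] for row in gray]


# A 400x400 canvas with one white horizontal stroke on a black background.
canvas = [[255 if 198 <= r < 202 else 0 for _ in range(400)] for r in range(400)]

model_input = binarize_sketch(resize_nearest(canvas, 256))
print(len(model_input), len(model_input[0]))  # 256 256
```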

→

U-Net Generator

AI translates structure to texture

A U-Net encoder-decoder maps your sketch to a realistic image. The encoder compresses spatial information through 8 downsampling blocks; skip connections bridge encoder and decoder layers, preserving fine-grained details that would otherwise be lost during compression.
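To see why 8 downsampling blocks fit a 256×256 input, the short sketch below tracks the spatial size through the encoder. It assumes every block halves height and width (stride 2), which this page implies but does not state outright; `encoder_sizes` is a hypothetical helper name.

```python
def encoder_sizes(input_size=256, num_blocks=8):
    """Spatial size after each downsampling block, assuming stride 2 per block."""
    sizes = []
    size = input_size
    for _ in range(num_blocks):
        size //= 2  # each block halves height and width
        sizes.append(size)
    return sizes


print(encoder_sizes())  # [128, 64, 32, 16, 8, 4, 2, 1]
```

Under that assumption the bottleneck is a single 1×1 spatial position, so fine detail can only reach the decoder through the skip connections.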

→

Output

Photorealistic shoe image

The generator outputs a 256×256 RGB image with realistic textures, lighting, and color. During training, the PatchGAN discriminator penalises unrealistic texture within each 70×70 receptive field, which pushes the generator toward crisp leather, mesh, or fabric detail instead of the blur that an L1 loss alone tends to produce.
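Because the generator ends in a Tanh activation (see Model Architecture), its raw outputs lie in [-1, 1] and need rescaling before display. A minimal sketch of that step follows; the helper name `tanh_to_uint8` and the exact rounding are assumptions, not the demo's confirmed code.

```python
def tanh_to_uint8(value):
    """Map a generator output in [-1, 1] to an 8-bit channel value in [0, 255]."""
    clipped = max(-1.0, min(1.0, value))  # guard against slight overshoot
    return int((clipped + 1.0) * 127.5 + 0.5)  # shift to [0, 255], round half up


print([tanh_to_uint8(v) for v in (-1.0, 0.0, 1.0)])  # [0, 128, 255]
```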

Technical Deep Dive

Model Architecture

Generator: U-Net

Encoder → Decoder + Skip Connections

Encoder → Bottleneck → Decoder

128 → 256 → 512 → 256 → 128 channels per block (simplified)

Input: 256×256 · Activation: Tanh · Skip links: 8 pairs

Skip connections concatenate encoder feature maps to the corresponding decoder layers, allowing the network to preserve low-level spatial details (edges, corners) that would otherwise be lost during downsampling.
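The concatenation itself is simple: channels add up while the spatial size stays fixed. The toy sketch below shows this with plain nested lists; the 4-channel, 8×8 shapes and the helper names are made up for illustration and are not the model's real layer sizes.

```python
def zeros(channels, size):
    """Toy feature map: a list of channels, each a size x size grid."""
    return [[[0.0] * size for _ in range(size)] for _ in range(channels)]


def concat_channels(decoder_feat, encoder_feat):
    """Concatenate two feature maps along the channel axis."""
    dec_hw = (len(decoder_feat[0]), len(decoder_feat[0][0]))
    enc_hw = (len(encoder_feat[0]), len(encoder_feat[0][0]))
    if dec_hw != enc_hw:
        raise ValueError("skip connection needs matching spatial sizes")
    return decoder_feat + encoder_feat


decoder = zeros(4, 8)  # upsampled decoder activation
encoder = zeros(4, 8)  # mirrored encoder activation at the same resolution
merged = concat_channels(decoder, encoder)
print(len(merged), len(merged[0]), len(merged[0][0]))  # 8 8 8
```

The next decoder block therefore sees twice as many input channels as it would without the skip link.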

Discriminator: PatchGAN

5 Conv Layers (3 with stride 2)

5 Conv Layers → Patch Grid

Input 256×256 → Conv 128×128 → Conv 64×64 → Conv 32×32 → Conv 31×31 → Patch Output (30×30 real/fake judgments)

Receptive field: 70×70 · Output: 30×30 · Input: Concat

PatchGAN judges whether 70×70 image patches look real or fake. This produces sharper textures than full-image discrimination and naturally captures high-frequency texture statistics like fabric weave and stitching detail.
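The 70×70 receptive field and the 30×30 output grid both follow from the layer arithmetic. The sketch below reproduces them, assuming 4×4 kernels with padding 1, three stride-2 layers, and two stride-1 layers; that configuration is a common PatchGAN layout and is not confirmed by this page.

```python
# (kernel, stride) for each of the 5 conv layers; padding of 1 is assumed.
LAYERS = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]


def output_size(size, layers, padding=1):
    """Spatial size of the patch grid after all conv layers."""
    for kernel, stride in layers:
        size = (size + 2 * padding - kernel) // stride + 1
    return size


def receptive_field(layers):
    """Input pixels seen by one output unit, computed from the last layer back."""
    rf = 1
    for kernel, stride in reversed(layers):
        rf = rf * stride + (kernel - stride)
    return rf


print(output_size(256, LAYERS), receptive_field(LAYERS))  # 30 70
```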

Performance

Training Results

⚡

Epochs Trained

🏆

0.1602

Best Val L1

📦

8,000

Training Pairs

Loss Curves

G Loss · D Loss · Train L1

Sample Outputs

Edge Map

Input sketch

Edge-detected outline from dataset

Generated

AI output

Pix2Pix U-Net generator output

Real Photo

Ground truth

Original paired photo from dataset

Implementation Details

Methodology

Dataset

Edges2Shoes · 50,000 paired images (8,000 pairs used for training here)
Each pair: Canny edge map + real photograph
Paired training enables direct pixel-level supervision: the model learns a mapping from each sketch to its matching photo, rather than the looser domain-level mapping of unpaired translation
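Paired datasets of this kind are often stored as one side-by-side image per sample, with the edge map on the left and the photo on the right. Assuming that layout (this page does not specify the file format), a pair could be separated as in the sketch below; `split_pair` is a hypothetical helper.

```python
def split_pair(combined):
    """Split a side-by-side image (rows of pixels) into (left, right) halves."""
    half = len(combined[0]) // 2
    left = [row[:half] for row in combined]
    right = [row[half:] for row in combined]
    return left, right


# Toy 2x4 sample: left half is the edge map, right half is the photo.
combined = [[0, 255, 120, 130],
            [255, 0, 140, 150]]
edges, photo = split_pair(combined)
print(edges)  # [[0, 255], [255, 0]]
print(photo)  # [[120, 130], [140, 150]]
```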

Augmentation

Random jitter: resize to 286×286 then crop to 256×256
Random horizontal flip with 50% probability
Together these transforms reduce overfitting and make the model robust to small shifts and left-right mirroring, without introducing unrealistic distortions
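One detail worth making explicit: the sketch and the photo must receive the same crop offset and the same flip decision, or the pair stops lining up pixel for pixel. A minimal sketch under that assumption is shown below; nearest-neighbour resizing and the helper names are illustrative choices, not the project's confirmed pipeline.

```python
import random


def resize_nearest(img, size):
    """Nearest-neighbour resize of a 2D list of pixels to size x size."""
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]


def random_jitter(sketch, photo, load_size=286, crop_size=256, rng=random):
    """Resize both images, then apply one shared random crop and flip."""
    sketch = resize_nearest(sketch, load_size)
    photo = resize_nearest(photo, load_size)
    top = rng.randint(0, load_size - crop_size)
    left = rng.randint(0, load_size - crop_size)
    flip = rng.random() < 0.5  # horizontal flip with 50% probability

    def crop_flip(img):
        rows = [row[left:left + crop_size] for row in img[top:top + crop_size]]
        return [row[::-1] for row in rows] if flip else rows

    return crop_flip(sketch), crop_flip(photo)


rng = random.Random(0)  # seeded only to make the example repeatable
sketch = [[(r + c) % 256 for c in range(256)] for r in range(256)]
photo = [[(r * c) % 256 for c in range(256)] for r in range(256)]
aug_sketch, aug_photo = random_jitter(sketch, photo, rng=rng)
print(len(aug_sketch), len(aug_sketch[0]))  # 256 256
```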

Loss Function

L1 (λ=100) + Adversarial BCE
L1 penalises per-pixel distance, which enforces structural fidelity and limits colour drift
Adversarial loss from PatchGAN pushes the generator toward photorealistic textures that L1 alone cannot produce
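The combined generator objective can be sketched in a few lines. This version assumes the discriminator emits one probability per patch and that losses are averaged over patches and pixels; both are assumptions, as are the function names.

```python
import math

LAMBDA_L1 = 100.0  # L1 weight stated on this page


def bce(prob, target, eps=1e-7):
    """Binary cross-entropy for one probability and one 0/1 target."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def l1(fake, real):
    """Mean absolute per-pixel difference."""
    return sum(abs(f - r) for f, r in zip(fake, real)) / len(fake)


def generator_loss(patch_probs, fake_pixels, real_pixels):
    """Adversarial BCE (generator wants every patch judged real) plus weighted L1."""
    adversarial = sum(bce(p, 1.0) for p in patch_probs) / len(patch_probs)
    return adversarial + LAMBDA_L1 * l1(fake_pixels, real_pixels)


# Toy values: an undecided discriminator and one slightly wrong pixel.
loss = generator_loss([0.5, 0.5], [0.0, 0.5], [0.1, 0.5])
print(round(loss, 4))  # 5.6931
```

With λ=100, even a mean pixel error of 0.05 contributes 5.0 to the loss, so the L1 term dominates the adversarial term in this toy case.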

Optimizer

Adam · β₁ = 0.5, β₂ = 0.999, LR 2×10⁻⁴
A lower β₁ (0.5 rather than Adam's default of 0.9) reduces momentum, which helps GAN stability because gradients change direction frequently
Linear LR decay after epoch 25 gently anneals the learning rate towards convergence and damps late-training oscillation
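A schedule of this shape might look like the sketch below. The total of 50 epochs and the decay to exactly zero are hypothetical, since this page states only the base rate and the epoch at which decay begins.

```python
BASE_LR = 2e-4        # from this page
BETAS = (0.5, 0.999)  # Adam betas from this page, listed here for reference
DECAY_START = 25      # from this page
TOTAL_EPOCHS = 50     # hypothetical: the total epoch count is not stated


def learning_rate(epoch):
    """Constant LR through DECAY_START, then linear decay to zero at TOTAL_EPOCHS."""
    if epoch <= DECAY_START:
        return BASE_LR
    remaining = (TOTAL_EPOCHS - epoch) / (TOTAL_EPOCHS - DECAY_START)
    return BASE_LR * max(0.0, remaining)


for epoch in (0, 25, 50):
    print(epoch, learning_rate(epoch))
# 0 0.0002
# 25 0.0002
# 50 0.0
```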