AI Image Detector — Alina Chuang

Overview

As AI-generated images become increasingly photorealistic, detecting synthetic content is a growing challenge in computer vision. This project explores how different neural network architectures — from a hand-built MLP to a convolutional network — perform on the same binary classification task: is this image real or AI-generated?

The project takes two approaches: an MLP implemented entirely from scratch in NumPy, to understand the underlying mechanics, and a CNN in PyTorch, to show what a standard convolutional architecture achieves on the same data and to compare it against the MLP baseline.

Best Test Accuracy: 96.47%
Training Images: 100K
Train/Test Gap: 0.76%
Final Model: 50 epochs

Approach 1 — MLP from Scratch (NumPy)

Built without any deep learning framework. Every component is implemented manually: parameter initialisation (He scaling), forward propagation, cost computation (Binary Cross-Entropy), backpropagation, and gradient descent with learning rate decay. Both full-batch and mini-batch variants were implemented.
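
The components listed above can be sketched end to end for the 2-layer case. This is a minimal illustration under stated assumptions, not the project's actual code: the function names, the toy input size, the hidden width of 4, and the fixed learning rate of 0.1 are all hypothetical.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """LINEAR -> ReLU -> LINEAR -> Sigmoid; each column of X is one example."""
    Z1 = W1 @ X + b1
    A1 = np.maximum(0.0, Z1)
    Z2 = W2 @ A1 + b2
    A2 = 1.0 / (1.0 + np.exp(-Z2))
    return Z1, A1, A2

def bce_cost(A2, Y, eps=1e-12):
    """Binary cross-entropy averaged over the batch (clipped to avoid log(0))."""
    A2 = np.clip(A2, eps, 1.0 - eps)
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

def backward(X, Y, Z1, A1, A2, W2):
    """Gradients of the BCE cost with respect to every parameter."""
    m = X.shape[1]
    dZ2 = A2 - Y                              # sigmoid + BCE simplify to this
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)             # ReLU passes gradient where Z1 > 0
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

# Toy problem with hypothetical sizes: 5 features, 4 hidden units, 20 examples.
rng = np.random.default_rng(0)
n_in, n_hidden, m = 5, 4, 20
X = rng.standard_normal((n_in, m))
Y = rng.integers(0, 2, size=(1, m)).astype(float)
W1 = rng.standard_normal((n_hidden, n_in)) * np.sqrt(2.0 / n_in)      # He scaling
b1 = np.zeros((n_hidden, 1))
W2 = rng.standard_normal((1, n_hidden)) * np.sqrt(2.0 / n_hidden)
b2 = np.zeros((1, 1))

costs = []
for step in range(200):                        # full-batch gradient descent
    Z1, A1, A2 = forward(X, W1, b1, W2, b2)
    costs.append(bce_cost(A2, Y))
    dW1, db1, dW2, db2 = backward(X, Y, Z1, A1, A2, W2)
    W1 -= 0.1 * dW1
    b1 -= 0.1 * db1
    W2 -= 0.1 * dW2
    b2 -= 0.1 * db2
```

A finite-difference check on a single weight is a cheap way to confirm that a hand-written backward pass matches the cost function.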

2-Layer Network

Architecture: LINEAR → ReLU → LINEAR → Sigmoid

Full-batch (7000 epochs): 75.44% test accuracy
Mini-batch (100 epochs): 78.40% test accuracy

4-Layer Network

Architecture: [LINEAR → ReLU] × 3 → LINEAR → Sigmoid

Full-batch (7000 epochs): 76.44% test accuracy
Mini-batch (100 epochs): 81.94% test accuracy
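
The [LINEAR → ReLU] × 3 → LINEAR → Sigmoid pattern generalises to any depth by looping over a list of layer sizes. The sketch below assumes flattened 32×32 RGB inputs (3072 features); the hidden widths 20, 7 and 5 are hypothetical, since the write-up does not state the ones actually used.

```python
import numpy as np

def init_params(layer_dims, rng):
    """He-initialise the weights of each LINEAR layer; biases start at zero."""
    params = []
    for n_in, n_out in zip(layer_dims[:-1], layer_dims[1:]):
        W = rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)
        b = np.zeros((n_out, 1))
        params.append((W, b))
    return params

def forward(X, params):
    """ReLU after every layer except the last, which uses a sigmoid."""
    A = X
    for W, b in params[:-1]:
        A = np.maximum(0.0, W @ A + b)
    W, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(W @ A + b)))

rng = np.random.default_rng(0)
layer_dims = [3072, 20, 7, 5, 1]      # 3072 = 32 * 32 * 3; hidden widths are hypothetical
params = init_params(layer_dims, rng)

X = rng.random((3072, 4))             # 4 flattened images, one per column
probs = forward(X, params)            # predicted probability per image
preds = (probs > 0.5).astype(int)     # threshold at 0.5 for a binary label
```

Setting layer_dims to three entries, for example [3072, 20, 1], gives the 2-layer variant with the same code.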

import numpy as np

# He initialisation (NumPy): scale by sqrt(2 / fan_in) for ReLU layers
W = np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)
b = np.zeros((n_out, 1))

# Mini-batch gradient descent with decay: learning rate recomputed each epoch
lr = lr_0 / (1 + decay_rate * epoch)
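
To show how the decay formula fits into a mini-batch loop, here is a small sketch. The helper names decayed_lr and mini_batches, the batch size of 4 and the toy data are assumptions for illustration; only the decay formula itself comes from the snippet above.

```python
import numpy as np

def decayed_lr(lr_0, decay_rate, epoch):
    """Inverse-time decay: the step size shrinks as the epoch count grows."""
    return lr_0 / (1 + decay_rate * epoch)

def mini_batches(X, Y, batch_size, rng):
    """Shuffle the columns once, then yield consecutive (X, Y) slices."""
    m = X.shape[1]
    perm = rng.permutation(m)
    for start in range(0, m, batch_size):
        idx = perm[start:start + batch_size]
        yield X[:, idx], Y[:, idx]

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 10))      # 10 toy examples, one per column
Y = X[:1].copy()                      # toy labels tied to X so alignment can be checked

batches = list(mini_batches(X, Y, batch_size=4, rng=rng))
schedule = [decayed_lr(0.1, 1.0, epoch) for epoch in range(3)]
```

With 10 examples and a batch size of 4, the last batch is smaller than the others; the same permutation indexes X and Y so each label stays attached to its example.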

Approach 2 — CNN (PyTorch)

Three progressively improved CNN versions, each addressing overfitting from the previous.

Baseline CNN

Input (3, 32, 32)
  → Conv1 (32 filters, 3×3) → ReLU → MaxPool(2×2)   # (32, 16, 16)
  → Conv2 (64 filters, 3×3) → ReLU → MaxPool(2×2)   # (64,  8,  8)
  → Flatten                                            # (4096,)
  → FC1 (4096 → 512)        → ReLU
  → FC2 (512  → 1)          → Sigmoid

Train: 99.39% / Test: 95.08% / Gap: 4.31%
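
The shape comments in the diagram can be checked with a little arithmetic. The sketch below assumes padding of 1 and stride 1 for both 3×3 convolutions, which is what keeps 32×32 inputs at 32×32 before pooling; the write-up does not state the padding, so treat it as an assumption.

```python
def conv_out(size, kernel=3, padding=1, stride=1):
    """Spatial size after a convolution (padding=1 is an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2):
    """Spatial size after non-overlapping max pooling."""
    return size // kernel

side = 32
side = pool_out(conv_out(side))       # after Conv1 + MaxPool
side_after_block1 = side
side = pool_out(conv_out(side))       # after Conv2 + MaxPool
flat = 64 * side * side               # length of the flattened feature vector

# Learnable parameters per layer: weights plus biases.
params = {
    "conv1": 3 * 32 * 3 * 3 + 32,
    "conv2": 32 * 64 * 3 * 3 + 64,
    "fc1": flat * 512 + 512,
    "fc2": 512 * 1 + 1,
}
total = sum(params.values())
fc1_share = params["fc1"] / total     # fraction of all parameters held by FC1
```

Under these assumptions FC1 holds about 99% of the roughly 2.1 million parameters, so the fully connected part is where most of the capacity to memorise sits.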

CNN + Dropout

Dropout(0.5) inserted after FC1's ReLU, zeroing each activation independently with probability 0.5 on every training forward pass to reduce memorisation.

Train: 99.89% / Test: 95.91% / Gap: 3.98%
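
For intuition, dropout can be written in a few lines of NumPy. This is a sketch of the "inverted dropout" scheme, in which surviving activations are scaled by 1 / (1 - p) during training so that nothing needs rescaling at evaluation time; the function name and the toy activation matrix are hypothetical, and the project itself uses the PyTorch layer.

```python
import numpy as np

def dropout(A, p_drop, rng, training=True):
    """Inverted dropout: zero each entry with probability p_drop while training."""
    if not training or p_drop == 0.0:
        return A                              # identity at evaluation time
    keep = 1.0 - p_drop
    mask = rng.random(A.shape) < keep         # True where the unit survives
    return A * mask / keep                    # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
A = np.ones((512, 64))                        # stand-in for FC1 activations on a batch of 64
A_train = dropout(A, 0.5, rng, training=True)
A_eval = dropout(A, 0.5, rng, training=False)
```

Because a fresh mask is drawn on every call, no single FC1 unit can be relied on across batches, which is the mechanism behind the reduced memorisation.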

CNNv3 — Augmentation + BatchNorm + Dropout (Best)

Three regularisation techniques combined:

Technique	Where Applied	Effect
Data Augmentation	Training loader	RandomHorizontalFlip + RandomCrop; discourages memorising fixed orientations and positions
Batch Normalisation	After Conv1 and Conv2	Normalises activations per batch, stabilises training
Dropout(0.5)	After FC1 ReLU	Forces distributed representations in FC layers
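
To make the augmentation row concrete, the two transforms can be mimicked in NumPy on a single (C, H, W) image. This is an illustrative sketch only: the project applies the transforms in the training loader, and the flip probability of 0.5 and zero-padding of 4 pixels before cropping are assumptions, not values stated in the write-up.

```python
import numpy as np

def random_horizontal_flip(img, rng, p=0.5):
    """Mirror the width axis of a (C, H, W) image with probability p."""
    if rng.random() < p:
        return img[:, :, ::-1]
    return img

def random_crop(img, rng, size=32, padding=4):
    """Zero-pad the spatial axes, then cut a random size x size window."""
    padded = np.pad(img, ((0, 0), (padding, padding), (padding, padding)))
    top = rng.integers(0, padded.shape[1] - size + 1)
    left = rng.integers(0, padded.shape[2] - size + 1)
    return padded[:, top:top + size, left:left + size]

rng = np.random.default_rng(0)
img = rng.random((3, 32, 32))                 # one toy RGB image
out = random_crop(random_horizontal_flip(img, rng), rng)
```

Each epoch therefore sees a slightly different version of every training image, while test images are left untouched.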

Input (3, 32, 32)
  → Conv1 → BatchNorm → ReLU → MaxPool   # (32, 16, 16)
  → Conv2 → BatchNorm → ReLU → MaxPool   # (64,  8,  8)
  → Flatten                               # (4096,)
  → FC1 (4096 → 512) → ReLU → Dropout(0.5)
  → FC2 (512 → 1)    → Sigmoid

Trained for 50 epochs (augmentation requires more epochs to converge)
Train: 97.22% / Test: 96.47% / Gap: 0.76%
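
The CNNv3 diagram maps fairly directly onto a PyTorch module. The following is a sketch rather than the project's actual code: the class layout is hypothetical, and padding=1 on both convolutions is an assumption chosen so that the shapes match the comments in the diagram.

```python
import torch
import torch.nn as nn

class CNNv3(nn.Module):
    """Conv -> BatchNorm -> ReLU -> MaxPool twice, then FC1 with dropout and FC2."""

    def __init__(self, p_drop=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),    # padding=1 is an assumption
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),                               # (32, 16, 16)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),                               # (64, 8, 8)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                  # (4096,)
            nn.Linear(64 * 8 * 8, 512),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(512, 1),
            nn.Sigmoid(),                                  # probability of one class
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = CNNv3()
model.eval()                                   # disables dropout, uses BatchNorm running stats
with torch.no_grad():
    probs = model(torch.randn(4, 3, 32, 32))   # batch of 4 random stand-in images
```

Since the network ends in a sigmoid, nn.BCELoss would be the matching loss; calling model.train() before each training epoch re-enables dropout and batch statistics.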

Results Summary

Model	Train	Test	Gap
MLP 2-layer full-batch (7000ep)	75.75%	75.44%	0.31%
MLP 4-layer full-batch (7000ep)	76.64%	76.44%	0.20%
MLP 2-layer mini-batch (100ep)	80.56%	78.40%	2.16%
MLP 4-layer mini-batch (100ep)	86.82%	81.94%	4.88%
CNN baseline (10ep)	99.39%	95.08%	4.31%
CNN + Dropout (25ep)	99.89%	95.91%	3.98%
CNNv3 + Aug + BN + Dropout (50ep)	97.22%	96.47%	0.76%

Key insight: the baseline CNN surpassed the best MLP (95.08% vs 81.94% test accuracy) in just 10 epochs without any tuning. The core reason is that the MLP discards all spatial structure by flattening the image, while the CNN preserves it through convolutional filters. The real challenge was closing the train/test gap: dropout alone narrowed it only from 4.31% to 3.98%, and it took all three regularisation techniques combined to bring it down to 0.76%.

Dataset

CIFAKE: Real and AI-Generated Synthetic Images (Kaggle). 100,000 training images (50K real / 50K fake) and 20,000 test images at 32×32 RGB. Real images sourced from CIFAR-10; fake images generated using Stable Diffusion v1.4.
