Deep Learning · Computer Vision · PyTorch · NumPy

AI Image Detector — From Scratch

Python · NumPy · PyTorch · CIFAKE dataset · 100K images

Overview

As AI-generated images become increasingly photorealistic, detecting synthetic content is a growing challenge in computer vision. This project explores how different neural network architectures — from a hand-built MLP to a convolutional network — perform on the same binary classification task: is this image real or AI-generated?

The project takes two approaches. First, an MLP implemented entirely from scratch in NumPy to understand the underlying mechanics. Second, a CNN in PyTorch to demonstrate real-world performance and compare against the baseline.

Best Test Accuracy: 96.47%
Training Images: 100K
Train/Test Gap: 0.76%
Final Model: 50 epochs

Approach 1 — MLP from Scratch (NumPy)

Built without any deep learning framework. Every component is implemented manually: parameter initialisation (He scaling), forward propagation, cost computation (Binary Cross-Entropy), backpropagation, and gradient descent with learning rate decay. Both full-batch and mini-batch variants were implemented.
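
A minimal NumPy sketch of the cost step listed above, for illustration only. The function name `bce_cost`, the `(1, m)` array layout, and the clipping constant `eps` are assumptions and are not taken from the project code.

```python
import numpy as np

def bce_cost(a, y, eps=1e-12):
    """Mean binary cross-entropy between predictions a and labels y, both shape (1, m)."""
    # Clip predictions away from 0 and 1 so log() stays finite (assumed safeguard).
    a = np.clip(a, eps, 1.0 - eps)
    m = y.shape[1]
    return float(-np.sum(y * np.log(a) + (1 - y) * np.log(1 - a)) / m)
```

A prediction of 0.5 for every image gives a cost of ln 2 ≈ 0.693, which is a useful sanity check for the first epoch of a freshly initialised network.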

2-Layer Network

Architecture: LINEAR → ReLU → LINEAR → Sigmoid

  • Full-batch (7000 epochs): 75.44% test accuracy
  • Mini-batch (100 epochs): 78.40% test accuracy
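
The LINEAR → ReLU → LINEAR → Sigmoid pipeline and its gradients can be sketched roughly as follows. This is a hedged illustration, not the project's code: the function names, the dictionary layout, and the hidden size `n_h` are hypothetical. For flattened 32×32 RGB images, `n_x` would be 3072.

```python
import numpy as np

def init_params(n_x, n_h, seed=0):
    """He-scaled weights and zero biases for a 2-layer network."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((n_h, n_x)) * np.sqrt(2.0 / n_x),
        "b1": np.zeros((n_h, 1)),
        "W2": rng.standard_normal((1, n_h)) * np.sqrt(2.0 / n_h),
        "b2": np.zeros((1, 1)),
    }

def forward(X, p):
    """X has shape (n_x, m). Returns sigmoid outputs (1, m) and a cache for backprop."""
    Z1 = p["W1"] @ X + p["b1"]
    A1 = np.maximum(0, Z1)                 # ReLU
    Z2 = p["W2"] @ A1 + p["b2"]
    A2 = 1.0 / (1.0 + np.exp(-Z2))         # Sigmoid
    return A2, (X, Z1, A1)

def backward(A2, Y, p, cache):
    """Gradients of the mean binary cross-entropy with respect to every parameter."""
    X, Z1, A1 = cache
    m = X.shape[1]
    dZ2 = A2 - Y                           # combined sigmoid + BCE derivative
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (p["W2"].T @ dZ2) * (Z1 > 0)     # ReLU passes gradient only where Z1 > 0
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

def step(p, grads, lr):
    """One plain gradient descent update."""
    return {k: p[k] - lr * grads[k] for k in p}
```

One training iteration is then `A2, cache = forward(X, p)`, `grads = backward(A2, Y, p, cache)`, `p = step(p, grads, lr)`.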

4-Layer Network

Architecture: [LINEAR → ReLU] × 3 → LINEAR → Sigmoid

  • Full-batch (7000 epochs): 76.44% test accuracy
  • Mini-batch (100 epochs): 81.94% test accuracy

# He initialisation (NumPy)
W = np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)
b = np.zeros((n_out, 1))

# Mini-batch gradient descent with learning rate decay
lr = lr_0 / (1 + decay_rate * epoch)
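
To show how the decay formula and mini-batching might fit together in an epoch loop, here is a small sketch. The helper names `decayed_lr` and `iter_minibatches`, the column-per-example layout, and the per-epoch reshuffle are assumptions made for illustration.

```python
import numpy as np

def decayed_lr(lr_0, decay_rate, epoch):
    """Inverse-time decay: lr_0 / (1 + decay_rate * epoch)."""
    return lr_0 / (1 + decay_rate * epoch)

def iter_minibatches(X, Y, batch_size, rng):
    """Yield shuffled (X_batch, Y_batch) pairs; examples are stored as columns."""
    m = X.shape[1]
    perm = rng.permutation(m)              # new random order on every call
    for start in range(0, m, batch_size):
        idx = perm[start:start + batch_size]
        yield X[:, idx], Y[:, idx]
```

Inside a training loop this would be used as `lr = decayed_lr(lr_0, decay_rate, epoch)` followed by `for X_b, Y_b in iter_minibatches(X, Y, batch_size, rng): ...`, with one parameter update per batch. The last batch is smaller whenever the dataset size is not a multiple of `batch_size`.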

Approach 2 — CNN (PyTorch)

Three progressively improved CNN versions, each targeting the overfitting left by the one before.

Baseline CNN

Input (3, 32, 32)
→ Conv1 (32 filters, 3×3) → ReLU → MaxPool(2×2)   # (32, 16, 16)
→ Conv2 (64 filters, 3×3) → ReLU → MaxPool(2×2)   # (64, 8, 8)
→ Flatten                                          # (4096,)
→ FC1 (4096 → 512) → ReLU
→ FC2 (512 → 1) → Sigmoid

Train: 99.39% / Test: 95.08% / Gap: 4.31%
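
The baseline architecture could be written in PyTorch roughly as below. This is a sketch, not the project's code: the class name `BaselineCNN` and the split into `features` and `classifier` are hypothetical, and `padding=1` is an assumption inferred from the stated shapes (a 3×3 convolution only preserves 32×32 with one pixel of padding at stride 1).

```python
import torch
from torch import nn

class BaselineCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # (32, 32, 32)
            nn.ReLU(),
            nn.MaxPool2d(2),                              # (32, 16, 16)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # (64, 16, 16)
            nn.ReLU(),
            nn.MaxPool2d(2),                              # (64, 8, 8)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                 # 64 * 8 * 8 = 4096
            nn.Linear(64 * 8 * 8, 512),
            nn.ReLU(),
            nn.Linear(512, 1),
            nn.Sigmoid(),                                 # probability of one class
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```

Because the model already ends in a Sigmoid, it would pair with `nn.BCELoss` rather than `nn.BCEWithLogitsLoss`, which expects raw logits.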

CNN + Dropout

Dropout(0.5) inserted after FC1's ReLU. During training it zeroes each FC1 activation with probability 0.5 on every forward pass, which discourages memorisation; at evaluation time it is disabled.

Train: 99.89% / Test: 95.91% / Gap: 3.98%
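
For intuition, the mechanism behind dropout can be sketched in NumPy. This is an illustrative re-implementation of inverted dropout (the scaling scheme `nn.Dropout` uses during training), not code from the project; the function name and signature are hypothetical.

```python
import numpy as np

def dropout(a, p, rng, training=True):
    """Inverted dropout: zero each activation with probability p, rescale the rest."""
    if not training or p == 0.0:
        return a                          # evaluation mode: activations pass through
    mask = rng.random(a.shape) >= p       # True where the activation is kept
    return a * mask / (1.0 - p)           # rescale so the expected value is unchanged
```

With `p = 0.5`, surviving activations are doubled, so the layer's expected output matches what the network sees at evaluation time.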

CNNv3 — Augmentation + BatchNorm + Dropout (Best)

Three regularisation techniques combined:

Technique           | Where Applied         | Effect
Data Augmentation   | Training loader       | RandomHorizontalFlip + RandomCrop; prevents memorising fixed orientations
Batch Normalisation | After Conv1 and Conv2 | Normalises activations per batch, stabilises training
Dropout(0.5)        | After FC1 ReLU        | Forces distributed representations in FC layers
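
To make the augmentation row concrete, here is a NumPy sketch of a random horizontal flip followed by a random crop. It only illustrates the idea: the project applies torchvision's RandomHorizontalFlip and RandomCrop in the training loader, and the function name, the 0.5 flip probability, and the padding of 4 pixels used here are assumptions.

```python
import numpy as np

def random_flip_crop(img, rng, pad=4):
    """img has shape (C, H, W). Returns a randomly flipped, randomly shifted copy."""
    _, h, w = img.shape
    if rng.random() < 0.5:
        img = img[:, :, ::-1]                               # mirror left-right
    padded = np.pad(img, ((0, 0), (pad, pad), (pad, pad)))  # zero-pad H and W
    top = rng.integers(0, 2 * pad + 1)                      # crop offset in [0, 2*pad]
    left = rng.integers(0, 2 * pad + 1)
    return padded[:, top:top + h, left:left + w]
```

Each call yields a slightly different view of the same image, so the network rarely sees an identical input twice across epochs.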

Input (3, 32, 32)
→ Conv1 → BatchNorm → ReLU → MaxPool   # (32, 16, 16)
→ Conv2 → BatchNorm → ReLU → MaxPool   # (64, 8, 8)
→ Flatten                              # (4096,)
→ FC1 (4096 → 512) → ReLU → Dropout(0.5)
→ FC2 (512 → 1) → Sigmoid

Trained for 50 epochs (augmentation requires more epochs to converge).

Train: 97.22% / Test: 96.47% / Gap: 0.76%
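
A compact PyTorch sketch of the CNNv3 layer order, under the same assumptions as the diagram's shapes imply (3×3 kernels with `padding=1`). The helper `conv_block` and the variable name `cnn_v3` are hypothetical and not taken from the project.

```python
import torch
from torch import nn

def conv_block(c_in, c_out):
    """Conv → BatchNorm → ReLU → MaxPool, halving the spatial size."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

cnn_v3 = nn.Sequential(
    conv_block(3, 32),           # (32, 16, 16)
    conv_block(32, 64),          # (64, 8, 8)
    nn.Flatten(),                # (4096,)
    nn.Linear(64 * 8 * 8, 512),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(512, 1),
    nn.Sigmoid(),
)
```

Both BatchNorm and Dropout behave differently in training and evaluation, so `cnn_v3.train()` should be active while fitting and `cnn_v3.eval()` before measuring test accuracy.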

Results Summary

Model                              | Train  | Test   | Gap
MLP 2-layer full-batch (7000ep)    | 75.75% | 75.44% | 0.31%
MLP 4-layer full-batch (7000ep)    | 76.64% | 76.44% | 0.20%
MLP 2-layer mini-batch (100ep)     | 80.56% | 78.40% | 2.16%
MLP 4-layer mini-batch (100ep)     | 86.82% | 81.94% | 4.88%
CNN baseline (10ep)                | 99.39% | 95.08% | 4.31%
CNN + Dropout (25ep)               | 99.89% | 95.91% | 3.98%
CNNv3 + Aug + BN + Dropout (50ep)  | 97.22% | 96.47% | 0.76%

Key insight: the baseline CNN surpassed the best MLP (95.08% vs 81.94% test accuracy) in just 10 epochs without any tuning. The core reason is that an MLP discards all spatial structure by flattening the image, while a CNN preserves it through convolutional filters. The real challenge was closing the train/test gap, which fell from 4.31% to 0.76% only once all three regularisation techniques were combined.

Dataset

CIFAKE: Real and AI-Generated Synthetic Images (Kaggle). 100,000 training images (50K real / 50K fake) and 20,000 test images at 32×32 RGB. Real images sourced from CIFAR-10; fake images generated using Stable Diffusion v1.4.