Shape Classification, Detection, and Segmentation

Convolutional Neural Networks for Image Processing | GRO721

Université de Sherbrooke | February 25, 2025

1. Context & Objective

This project is a proof of concept (PoC) for a baggage screening system design company. The objective is to determine the feasibility of developing commercial software based on deep learning techniques to automate the baggage examination process.

Problem to Solve

Develop a system capable of classifying, detecting and segmenting simple geometric shapes (circle, triangle, cross) in grayscale images. This task is a simplified model of the real problem of object detection in airport scanner images.

Project Constraints

  • Resource Limitations: The algorithm will be deployed on systems with limited memory and computational capacity
  • Limited Timelines and Budgets: Quick validation of technical feasibility
  • Parameter Limits:
    • Classification: max 200,000 parameters
    • Detection: max 400,000 parameters
    • Segmentation: max 1,000,000 parameters

Dataset Characteristics

  • Image Size: 53×53 pixels in grayscale
  • Shapes to Detect: Circle, Triangle, Cross
  • Image Composition: Max 3 shapes per image, max 1 instance per shape
  • Features: Random grayscale level, noisy background tending towards black

Technology Stack

Python 3.x
PyTorch
Torchvision
NumPy
Matplotlib
OpenCV
Sample Dataset Image
(53×53 px in grayscale)

2. Architecture & CNN Models

2.1 Classification Network

The classification model identifies the presence or absence of each shape in an image. Inspired by AlexNet, the architecture consists of four segments (Conv + ReLU + MaxPool) for progressive feature extraction, followed by two fully-connected layers.

Design Choices
  • Final Activation: Sigmoid (multi-label classification)
  • Loss Function: Binary Cross Entropy
  • Learning Rate: 0.001
  • Epochs: 17
  • Batch Size: 32
  • Total Parameters: ~180,000 (< 200,000)

The use of Sigmoid rather than Softmax is justified because each shape is independent (an image can contain 0, 1, 2 or 3 different shapes).
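This independence is easiest to see in code. A minimal sketch (the logit and target values below are made up for illustration): each shape gets its own sigmoid probability, so any subset of {circle, triangle, cross} can be predicted for one image.

```python
import torch
import torch.nn as nn

# Illustrative values only: raw scores for (circle, triangle, cross)
logits = torch.tensor([[2.0, -1.5, 0.3]])
targets = torch.tensor([[1.0, 0.0, 1.0]])   # circle and cross present

loss_fn = nn.BCEWithLogitsLoss()            # binary cross entropy with the sigmoid built in
loss = loss_fn(logits, targets)

probs = torch.sigmoid(logits)               # per-shape probabilities
present = probs > 0.5                       # three independent presence decisions
```

With a softmax, the three scores would be forced to compete and sum to 1, which cannot represent an image containing two or three shapes at once.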

🏗️ Classification Architecture
Detailed Diagram (to add)

2.2 Detection Network

The detection model generates bounding boxes around identified shapes. Inspired by YOLO, the network consists of three convolution segments with BatchNorm and LeakyReLU, producing an output of dimension (1, 3, 7) for a 3×7 grid.

Design Choices
  • Architecture: YOLO-inspired with BatchNorm
  • Loss Function: MSE (localization) + Cross Entropy (class)
  • Learning Rate: 0.001
  • Epochs: 20
  • Batch Size: 32
  • Total Parameters: ~390,000 (< 400,000)

Adding BatchNorm improves gradient stability and accelerates convergence. LeakyReLU (slope 0.1) prevents the "dying ReLU" problem.

🏗️ Detection Architecture
Detailed Diagram (to add)

2.3 Semantic Segmentation Network

The segmentation model classifies each pixel according to its class (circle, triangle, cross or background). The U-Net architecture uses an encoder to extract features and a decoder to reconstruct the image, with skip-connections to preserve spatial information.

Design Choices
  • Architecture: U-Net with skip-connections
  • Loss Function: Cross Entropy Loss (multi-class)
  • Learning Rate: 1e-2 (fast) then 1e-4 (refined)
  • Epochs: 17-25
  • Batch Size: 32
  • Total Parameters: ~993,700 (< 1,000,000)
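The two-phase learning rate (1e-2 then 1e-4) could be implemented with a step schedule; a minimal sketch, where the milestone epoch and the dummy parameter are illustrative assumptions:

```python
import torch

# Dummy parameter standing in for the real U-Net's weights
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.Adam(params, lr=1e-2)

# Multiply the lr by 1e-2 at epoch 10: 1e-2 -> 1e-4
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10], gamma=1e-2)

for epoch in range(12):
    optimizer.step()       # training step (no-op here: dummy params have no grads)
    scheduler.step()

lr_now = optimizer.param_groups[0]["lr"]   # 1e-4 after the milestone
```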

Skip-connections are crucial for recovering spatial information lost during MaxPool. They enable precise reconstruction of shape contours.
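The mechanism behind a skip-connection is a simple channel-wise concatenation; a sketch with illustrative channel counts and sizes:

```python
import torch
import torch.nn as nn

# Encoder feature map saved before pooling, and a deeper feature map
# coming back up through the decoder (illustrative shapes)
enc_feat = torch.randn(1, 16, 52, 52)
bottleneck = torch.randn(1, 32, 26, 26)

up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
decoded = up(bottleneck)                        # upsampled back to 52x52

# The skip-connection: concatenate along the channel dimension so the
# decoder sees both coarse semantics and fine spatial detail
merged = torch.cat([enc_feat, decoded], dim=1)
```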

🏗️ Segmentation Architecture (U-Net)
Detailed Diagram (to add)

3. Implementation & Key Optimizations

🔧 Performance Optimizations

Data Loading & Preprocessing

Data is loaded via a ConveyorSimulator class inheriting from torch.utils.data.Dataset. The split used is 90/5/5 (train/validation/test) instead of the standard 70/15/15, maximizing training data for this small dataset.

```python
from torch.utils.data import DataLoader
from torchvision import transforms

# Image transformation
transform = transforms.Compose([transforms.ToTensor()])

# DataLoader with parallel loading (4 worker processes)
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
```
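The 90/5/5 split itself can be done with torch.utils.data.random_split; a sketch where a dummy TensorDataset of 2,000 samples stands in for ConveyorSimulator:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Dummy dataset standing in for ConveyorSimulator (2,000 assumed samples)
dataset = TensorDataset(torch.randn(2000, 1, 53, 53))

n = len(dataset)
n_val = n_test = int(0.05 * n)       # 5% validation, 5% test
n_train = n - n_val - n_test         # remaining 90% for training

train_set, val_set, test_set = random_split(dataset, [n_train, n_val, n_test])
```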
Challenges Encountered & Solutions
Challenge 1: Overfitting

Symptom: Training loss decreases, validation loss plateaus
Solution: Early stopping, learning rate adjustment, data augmentation
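Early stopping can be as simple as tracking how long ago the best validation loss occurred; a minimal sketch (the patience value is illustrative):

```python
def should_stop(val_losses, patience=3):
    """Return True once the best validation loss is `patience` epochs old."""
    if len(val_losses) <= patience:
        return False
    best_epoch = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_epoch >= patience

# Loss improved at epoch 1, then stagnated for 3 epochs -> stop
history = [1.0, 0.9, 0.95, 0.96, 0.97]
stop = should_stop(history)
```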

Challenge 2: Data Imbalance

Symptom: Some classes less frequent, model bias
Solution: Class weighting in the loss function
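In PyTorch, class weighting is a one-line change to the loss; the weights below are illustrative (rarer classes get larger weights), not the values actually measured on the dataset:

```python
import torch
import torch.nn as nn

# Illustrative weights for (circle, triangle, cross, background);
# under-represented classes contribute more to the loss
class_weights = torch.tensor([1.0, 2.5, 1.5, 0.5])
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 4)             # batch of 8 samples, 4 classes
targets = torch.randint(0, 4, (8,))    # ground-truth class indices
loss = loss_fn(logits, targets)
```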

Challenge 3: Gradient Stability (Detection)

Symptom: Significant fluctuations during validation
Solution: Reduced learning rate (0.001), BatchNorm in layers

Applied Optimization Techniques
  • Activation Functions: ReLU for classification/segmentation, LeakyReLU for detection (avoids dying ReLU)
  • Batch Normalization: Stabilizes training, accelerates convergence (especially in detection)
  • Max Pooling: Reduces spatial dimension, increases invariance to small translations
  • Skip Connections: Preserves spatial information in segmentation
  • Hyperparameter Tuning: Experimental optimization of learning rates and epochs
📈 Training Pipeline
  1. Load the dataset
  2. Initialize the model
  3. Training loop:
    • Forward pass
    • Loss computation
    • Backward pass
    • Weight update
  4. Validation at each epoch
  5. Save the best model
  6. Final test
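The inner training loop above can be sketched as follows, with a dummy linear model and random data standing in for the real networks and ConveyorSimulator:

```python
import torch
import torch.nn as nn

# Dummy stand-ins for the real model and dataset (illustrative shapes)
model = nn.Linear(10, 3)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.BCEWithLogitsLoss()

x = torch.randn(32, 10)
y = torch.randint(0, 2, (32, 3)).float()

for epoch in range(3):
    model.train()
    optimizer.zero_grad()         # clear gradients from the previous step
    loss = loss_fn(model(x), y)   # forward pass + loss computation
    loss.backward()               # backward pass
    optimizer.step()              # weight update
```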
💡 Lessons Learned
  • Track validation curves from the start
  • Use BatchNorm systematically
  • Adapt the architecture to the data
  • Monitoring the loss helps identify problems

4. Results & Visuals

4.1 Classification

Accuracy
92.5%
on test set
Precision
90.8%
average per class
Recall
91.2%
average per class
Final Loss
0.0068
validation

The classification model shows solid performance with 92.5% accuracy. The training curve shows a progressive decrease in loss, while precision increases steadily. Fluctuations in validation may be caused by natural variations in the dataset.

📊 Loss & Accuracy Curves
(to add: loss_accuracy_classification.png)
🎯 Example Predictions
(to add: exemple_classification_predictions.png)

Tests on images show that the model recognizes shapes correctly. The main error cases correspond to overlapping shapes or partially visible shapes. Performance can be improved through data augmentation (random rotations, zooms).

4.2 Object Detection

mAP (IoU=0.5)
78.3%
average across all classes
Precision
82.1%
correct detections
Recall
75.9%
shapes detected
Final Loss
0.0035
validation

Detection achieves an mAP of 78.3%, showing good ability to locate and classify objects. However, fluctuations are visible on the validation set, suggesting potential instability and overfitting. Some bounding boxes are not perfectly aligned with the actual objects.

📊 Loss & mAP Curves
(to add: loss_mAP_detection.png)
🎯 Detection Predictions
(to add: exemple_detection_predictions.png)

Results show some error cases: imperfect box alignment, shape classification errors, multiple detections of the same object. Improvements could come from revising the loss function or more aggressive data augmentation (random rotations).

4.3 Semantic Segmentation

Average IoU
79.1%
Intersection over Union
Pixel Accuracy
94.7%
correct classification
Final Loss
0.0018
validation (CrossEntropy)
Optimal Epochs
17-25
without overfitting

Semantic segmentation achieves an average IoU of 79.1%, demonstrating the U-Net model's ability to accurately segment shapes at the pixel level. The architecture with skip-connections proves very effective for preserving contours. Training curves show stable convergence.
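The IoU metric used above compares, per class, the overlap between the predicted and ground-truth label maps; a from-scratch sketch assuming integer class labels (0 = background, 1–3 = the three shapes):

```python
import torch

def class_iou(pred, target, cls):
    """IoU of one class between predicted and ground-truth label maps."""
    p, t = pred == cls, target == cls
    inter = (p & t).sum().item()
    union = (p | t).sum().item()
    return inter / union if union else float("nan")

# Tiny illustrative 2x2 example: 2 pixels agree, 3 pixels in the union
pred = torch.tensor([[0, 1], [1, 1]])
target = torch.tensor([[0, 1], [1, 0]])
iou = class_iou(pred, target, 1)   # 2/3
```

The reported 79.1% would then be this quantity averaged over classes (and images).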

📊 Loss & IoU Curves
(to add: loss_IoU_segmentation.png)
🎯 Segmentation Predictions
(to add: exemple_segmentation_predictions.png)

Segmentation results show excellent correspondence between predictions and ground truth. The model accurately segments shape contours, even in overlapping cases. An epoch number of 17-25 proves optimal; beyond 25 the model saturates without improvement.

5. Comparative Analysis

Performance by Task

Task           | Key Metric | Value | Parameters       | Inference Time
---------------|------------|-------|------------------|---------------
Classification | Accuracy   | 92.5% | ~180 K / 200 K   | ~2-3 ms
Detection      | mAP        | 78.3% | ~390 K / 400 K   | ~4-5 ms
Segmentation   | IoU        | 79.1% | ~994 K / 1,000 K | ~8-10 ms

Key Observations

✅ Successes
  • Classification: Highly performant (92.5%), stable without major overfitting
  • Segmentation: Very efficient U-Net architecture (79.1% IoU), essential skip-connections
  • Optimization: All models respect parameter constraints with margin
  • Convergence: All three models converge correctly with adjusted hyperparameters
⚠️ Identified Limitations
  • Detection: Significant validation fluctuations (model instability)
  • Bounding Boxes: Imperfect alignment, especially for partial shapes
  • Multiple Detections: Occasionally, the model detects the same shape twice
  • Generalization: Model struggles with images very different from training set

Error Case Distribution

Classification (7.5% error): Errors mainly on images with 2-3 overlapping or partially visible shapes. Model sometimes confuses boundaries between two shapes.

Detection (21.7% error): Box alignment, shape classification errors, multiple detections. Likely caused by too high a learning rate or suboptimal loss function.

Segmentation (20.9% error): Mainly on fine contours or overlaps. Excellent IoU for well-separated shapes, less good at boundaries.

📊 Error Analysis
(to add: error comparison by task)

Performance vs Resource Tradeoffs

The three models demonstrate a good balance between performance and computational efficiency.

Total inference time to process one image through the full pipeline (classification → detection → segmentation) would be approximately 14-18 ms, well within the ~100 ms budget of an airport scanner application, which requires fast but not hard real-time processing.
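Per-stage inference time could be measured as sketched below; the one-layer model is a stand-in for a real pipeline stage, and timings will vary by hardware.

```python
import time
import torch
import torch.nn as nn

# Dummy stand-in for one pipeline stage (illustrative)
model = nn.Conv2d(1, 8, kernel_size=3, padding=1)
x = torch.randn(1, 1, 53, 53)

with torch.no_grad():               # inference only: no gradient bookkeeping
    start = time.perf_counter()
    _ = model(x)
    elapsed_ms = (time.perf_counter() - start) * 1000
```

In practice one would average over many images and warm up the model first, since the first forward pass pays one-time allocation costs.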

6. Conclusion & Future Perspectives

📝 Project Summary

This project validated the feasibility of a proof of concept to automate scanner image processing through convolutional neural networks. The three developed architectures (classification, detection, segmentation) demonstrate that optimized models can achieve good performance even with strict resource constraints.

✅ Key Successes

  • Robust classification with 92.5% accuracy — Production-ready performance
  • Precise segmentation with 79.1% IoU — Very efficient U-Net architecture
  • Respect of parameter constraints — All models < defined limits
  • Acceptable inference time — ~14-18 ms for the full pipeline
  • Modular and well-documented code — Adaptable for other image tasks

🚀 Future Improvements

  • Data Augmentation: Random rotations, zooms, translations to improve generalization
  • Detection: Reduce learning rate, implement NMS (Non-Maximum Suppression) to eliminate multiple detections
  • Ensemble Learning: Combine all three models for more robust predictions
  • Transfer Learning: Fine-tune on pre-trained models (ResNet, MobileNet) if resources allow
  • Quantization: Reduce model size for deployment on limited hardware
  • Test on Real Data: Validate on actual scanner images (domain transfer)
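The NMS step suggested above keeps the highest-scoring box and discards any remaining box that overlaps it beyond a threshold, then repeats; a from-scratch sketch with illustrative boxes in (x1, y1, x2, y2) pixel coordinates:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < thresh]
    return keep

# Illustrative case: the second box nearly duplicates the first
boxes = [(10, 10, 30, 30), (12, 11, 31, 29), (35, 35, 50, 50)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)   # the near-duplicate is suppressed
```

torchvision also ships a batched implementation (torchvision.ops.nms) that could replace this sketch in production.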

💡 Key Learnings

  • BatchNorm is essential for stability, especially in detection
  • Skip-connections are crucial for preserving spatial information
  • Adapting architecture to data is more important than using standard architectures unchanged
  • Monitoring validation curves allows quick identification of overfitting problems
  • Hyperparameter optimization must be iterative and based on curve observations

📚 Resources & Links

View Complete Source Code on GitHub
Associated Documents
  • Complete Report: GRO721_rapport_app2_Neurones.pdf
  • Project Statement: Problematique_gro721_guide_etudiant_H25.pdf
  • Source Code: Dataset, models, training scripts on GitHub
  • Jupyter Notebooks: Exploratory analysis and results visualization
📧 Authors

Team 6: Andrei Corduneanu (cora5428), Marek Théoret (them0901)
Course: GRO721 - Artificial Neural Networks
Date: February 25, 2025
University: University of Sherbrooke