Convolutional neural networks in image processing | GRO721
University of Sherbrooke | February 25, 2025
This university course project (GRO721) explores the feasibility of CNN-based shape detection applied to a simplified baggage screening scenario.
The goal is to develop a system capable of classifying, detecting, and segmenting simple geometric shapes (circle, triangle, cross) in grayscale images. This task is a simplified model of the real problem of object detection in airport scanner images.
Sample Dataset Image (53×53 px, grayscale)
The classification model identifies the presence or absence of each shape in an image. A 4-layer CNN with progressive feature extraction (Conv + ReLU + MaxPool) feeds two fully-connected layers for multi-label output.
The network scans the image layer by layer, each pass compressing what it sees into a shorter, more abstract description. Think of it like reading a paragraph, then a sentence, then a single keyword. That final summary feeds a decision layer that answers three independent yes/no questions: circle present? Triangle? Cross?
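The progressive compression described above can be traced with simple shape arithmetic. The sketch below assumes 3×3 convolutions with padding 1 (which preserve spatial size) followed by 2×2 max pooling with stride 2. These hyperparameters and the helper names are illustrative assumptions, not confirmed details of the project's network.

```python
def conv_out(size, kernel=3, padding=1, stride=1):
    """Spatial size after a convolution (standard output-size formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Spatial size after max pooling without padding."""
    return (size - kernel) // stride + 1


def trace_sizes(input_size, num_blocks=4):
    """Side length of the feature map after each Conv + ReLU + MaxPool block."""
    sizes = [input_size]
    for _ in range(num_blocks):
        sizes.append(pool_out(conv_out(sizes[-1])))
    return sizes


# ReLU does not change the spatial size, so it is not modeled here.
print(trace_sizes(48))  # [48, 24, 12, 6, 3]
print(trace_sizes(53))  # [53, 26, 13, 6, 3]
```

Under these assumptions, both image sizes quoted in this write-up (48 and 53 pixels) end up as a 3×3 feature map after four blocks, which is the "keyword" that the fully-connected layers read.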
The use of Sigmoid rather than Softmax is justified because each shape is independent (an image can contain 0, 1, 2 or 3 different shapes).
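A small numeric sketch makes the difference concrete. The logits below are hypothetical values for an image containing both a circle and a cross; they are not taken from the trained model.

```python
import math


def sigmoid(x):
    """Squash one logit into an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Turn logits into probabilities that compete and sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for (circle, triangle, cross).
logits = [3.0, -2.0, 3.0]

sigmoid_probs = [sigmoid(v) for v in logits]
softmax_probs = softmax(logits)

# Each shape gets its own yes/no decision at a 0.5 threshold.
present = [p > 0.5 for p in sigmoid_probs]

print(present)        # [True, False, True]
print(softmax_probs)  # circle and cross each stay below 0.5
```

With Sigmoid, circle and cross can both be confidently "present". With Softmax, the two shapes have to share the probability mass, so neither can exceed 0.5 when both are equally likely.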
Classification Architecture
The detection model generates bounding boxes around identified shapes. A grid-based single-pass detection head uses three convolution segments with BatchNorm and LeakyReLU, producing an output of dimension (1, 3, 7) for a 3×7 grid.
Detection goes further than classification: the model must also draw a box around each shape it finds. The network divides the image into a 3×7 grid and asks every cell whether a shape is present and, if so, exactly where. Batch normalization keeps each layer's outputs in a consistent numerical range, which prevents training from becoming erratic as the model learns.
Adding BatchNorm improves gradient stability and accelerates convergence. LeakyReLU (slope 0.1) prevents the "dying ReLU" problem.
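As a minimal sketch of these two operations, the code below normalizes a batch of hypothetical activations and then applies LeakyReLU with the 0.1 slope mentioned above. The learnable scale and shift of a real BatchNorm layer are omitted, and the input values are made up for illustration.

```python
import math


def leaky_relu(x, negative_slope=0.1):
    """Pass positives through; scale negatives instead of zeroing them."""
    return x if x > 0 else negative_slope * x


def batch_norm(values, eps=1e-5):
    """Normalize a batch to zero mean and (approximately) unit variance.

    Simplified: uses batch statistics only, without the learnable
    scale and shift parameters of a full BatchNorm layer.
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


activations = [-4.0, -1.0, 2.0, 7.0]  # hypothetical pre-activation values
normalized = batch_norm(activations)
activated = [leaky_relu(v) for v in normalized]

print(normalized)
print(activated)
```

A plain ReLU would map every negative input to exactly 0, where its gradient is also 0. LeakyReLU keeps a small non-zero output and gradient for negative inputs, which is why such units can keep learning.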
The segmentation model classifies each pixel according to its class (circle, triangle, cross or background). The U-Net architecture uses an encoder to extract features and a decoder to reconstruct the image, with skip-connections to preserve spatial information.
Where detection draws a box, segmentation labels every single pixel. The encoder works like squinting at the image: you lose fine detail, but the big shapes become obvious, layer by layer. Skip connections carry fine-grained detail from earlier layers forward, so the decoder can reconstruct sharp boundaries when rebuilding the pixel map.
Skip-connections are crucial for recovering spatial information lost during MaxPool. They enable precise reconstruction of shape contours.
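The loss of spatial detail can be shown with a one-dimensional toy example. This is only an illustration: a real U-Net concatenates feature maps and learns convolutions over them, whereas the `min` fusion below is a hand-picked stand-in, and the signal is invented.

```python
def max_pool_1d(signal, size=2):
    """Keep the maximum of each non-overlapping window."""
    return [max(signal[i:i + size]) for i in range(0, len(signal), size)]


def upsample_nearest_1d(signal, factor=2):
    """Repeat each value to restore the original length."""
    return [v for v in signal for _ in range(factor)]


# A toy "shape" occupying positions 3 to 5 of an 8-pixel row.
edge = [0, 0, 0, 1, 1, 1, 0, 0]

pooled = max_pool_1d(edge)              # [0, 1, 1, 0]
restored = upsample_nearest_1d(pooled)  # [0, 0, 1, 1, 1, 1, 0, 0]

# Skip connection: the decoder sees the encoder's full-resolution
# values next to its own coarse estimate.
skip_pairs = list(zip(restored, edge))

# Stand-in for the learned fusion: keep a pixel only if both agree.
fused = [min(coarse, fine) for coarse, fine in skip_pairs]

print(restored)  # left boundary has drifted from index 3 to index 2
print(fused)     # boundary is back at index 3
```

Without the skip connection, the decoder only has `restored`, in which the left boundary is off by one pixel. With access to the encoder's values, the exact contour can be recovered.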
U-Net Segmentation Architecture
Getting from 83% to 96% accuracy came down to one problem: the model was memorizing the training data instead of learning from it. Here's what that investigation looked like.
A custom ConveyorSimulator Dataset class loads ~270 48×48 grayscale images. Split: 90% train (243 images) + 10% validation (27 images). Preprocessing: a simple ToTensor() conversion, which scales pixel values to [0, 1]. No augmentation was used, since we found that it destroys classification performance on these geometric shapes.
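The 90/10 split can be sketched on sample indices alone. The function name, the fixed seed, and the exact count of 270 are assumptions made for illustration; the project's own splitting code is not shown here.

```python
import random


def split_indices(num_samples, val_fraction=0.1, seed=0):
    """Shuffle sample indices and split them into train and validation sets."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    num_val = round(num_samples * val_fraction)
    return indices[num_val:], indices[:num_val]


train_idx, val_idx = split_indices(270)

print(len(train_idx), len(val_idx))  # 243 27
```

With only 27 validation images, every misclassified image moves the validation metrics visibly, which is worth keeping in mind when reading the curves.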
Starting from 83.7% accuracy with significant overfitting (a 9% train-validation gap), we applied systematic optimizations to reach 96.0%:
Applied rotation, affine transforms, and horizontal flipping → Accuracy dropped to 50%
Root Cause: The classes are defined by their exact geometry, and triangles and crosses in particular have a characteristic orientation. Aggressive augmentation distorts these cues and breaks shape recognition. For this task, no augmentation proved optimal.
Learning: Not all augmentation is beneficial. The strategy has to fit the specific task.
| Optimization Phase | Train Acc | Val Acc | Train-Val Gap | Status |
|---|---|---|---|---|
| Baseline (83.7%) | 84.2% | 75.2% | 9.0% | Severe overfitting |
| + BatchNorm + Dropout | 81.1% | 78.9% | 2.2% | Better generalization |
| Final (100 epochs, 85 FC neurons) | 93.9% | 93.0% | 0.95% | Optimal, minimal overfitting |
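The Train-Val Gap column is simply the difference between the two accuracy columns. The short check below recomputes it from the rounded values in the table; the dictionary keys are shortened labels chosen for this sketch.

```python
# (train accuracy %, validation accuracy %) as printed in the table.
phases = {
    "Baseline": (84.2, 75.2),
    "+ BatchNorm + Dropout": (81.1, 78.9),
    "Final": (93.9, 93.0),
}

gaps = {name: round(train - val, 2) for name, (train, val) in phases.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} points")
```

The rounded figures give 0.90 points for the final row, so the 0.95% shown in the table was presumably computed from unrounded accuracies.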
96% test accuracy means the model correctly identifies which shapes are present about 24 times out of 25, after training on fewer than 270 images.
The classification model achieves 96.0% test accuracy. The training curve shows a progressive decrease in loss while accuracy rises steadily. With only 27 validation images, a single misclassified image shifts validation accuracy noticeably, which likely explains the fluctuations in the validation curve.
Loss & Accuracy Curves
Tests on images show that the model recognizes shapes correctly in most cases. The main error cases correspond to overlapping or partially visible shapes. Careful, shape-preserving augmentation may help, though aggressive transforms hurt performance on this dataset, as the augmentation experiment above showed.
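A 78.3% mAP means that, averaged over the three shape classes, the model's ranked detections are mostly correct in both class and location. Loosely, it finds the right object in the right place about 4 times out of 5, within the strict parameter budget required for embedded deployment.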
Detection achieves an mAP of 78.3%, showing good ability to locate and classify objects. However, fluctuations are visible on the validation set, suggesting potential instability and overfitting. Some bounding boxes are not perfectly aligned with the actual objects.
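Box alignment is usually quantified with intersection over union (IoU), the overlap measure that mAP relies on to decide whether a predicted box matches a ground-truth box. The sketch below uses an (x_min, y_min, x_max, y_max) box format and made-up coordinates; both are assumptions, since the project's box encoding is not detailed here.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap, clamped at 0 when boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


truth = (10, 10, 30, 30)    # hypothetical ground-truth box
shifted = (15, 10, 35, 30)  # same size, shifted 5 px to the right

print(box_iou(truth, shifted))  # 0.6
```

A 5-pixel shift on a 20-pixel box already lowers IoU to 0.6, which shows how quickly imperfect alignment eats into the score. A common convention counts a detection as correct at IoU ≥ 0.5, though the threshold used in this project is not stated.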
Loss & mAP Curves
Detection Predictions
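Results show some error cases: imperfect box alignment, shape classification errors, and multiple detections of the same object. Improvements could come from revising the loss function or from more aggressive data augmentation (random rotations), although the classification experiments showed that such transforms can hurt on this dataset, so any augmentation would need careful validation.
An 86% IoU means the predicted and ground-truth masks agree on 86% of their combined area, roughly 6 out of every 7 pixels in that union. This was achieved with under 1 million parameters and fewer than 300 training images.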
Semantic segmentation achieves a best validation IoU of 86.0% (epoch ~142), demonstrating the U-Net model's ability to accurately segment shapes at the pixel level. The final train IoU of 96.2% yields a train-val gap of ~10%, indicating moderate overfitting that stabilized after epoch 80. The architecture with skip-connections proves very effective for preserving contours.
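For reference, the per-class IoU of a segmentation mask can be computed as in the sketch below. The 4×4 masks and the class ids (0 for background, 1 for circle) are invented for illustration, and the project may average IoU across classes differently.

```python
def mask_iou(pred, target, cls):
    """IoU for one class between two label masks (lists of rows of class ids)."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += (p == cls and t == cls)
            union += (p == cls or t == cls)
    # Convention chosen here: 0.0 when the class is absent from both masks.
    return inter / union if union else 0.0


def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label matches the target."""
    pairs = [(p, t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(p == t for p, t in pairs) / len(pairs)


target = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],  # one extra pixel labeled as circle
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

print(mask_iou(pred, target, cls=1))  # 0.8
print(pixel_accuracy(pred, target))   # 0.9375
```

A single wrong pixel leaves pixel accuracy at about 94% but drops the circle IoU to 80%, because IoU ignores the many easy background pixels. This makes IoU the stricter and more informative metric for small shapes.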
Training ran for 150 epochs, with the model improving steadily from epoch 1 (IoU ~27%) to a plateau around epochs 80–100 (IoU ~84–85%). Beyond epoch 100, improvements are marginal. The final validation loss of 0.00114 is consistent with stable training.
Loss & IoU Curves
Tests on segmentation show that the model accurately delineates shape boundaries and generates precise pixel-level masks. The main error cases correspond to overlapping shapes, where boundaries become ambiguous, and to partially visible shapes at image edges. Performance might be improved through carefully validated data augmentation (random rotations, zooms, and edge variations) to handle more diverse segmentation scenarios, keeping in mind that aggressive transforms hurt classification on this dataset.
This project served as a proof of concept for automating scanner image processing with convolutional neural networks. The three architectures developed (classification, detection, segmentation) demonstrate that optimized models can achieve good performance even under strict resource constraints.