Comprehension - A Three-Class Classification CNN
Let's consider a CNN-based architecture designed to classify an image into one of three classes: a pedestrian, a tree, or a traffic signal. Each input image is of size (512, 512, 3) (RGB).
The network contains the following 11 layers, in order. Layers are referred to by their number in the list: the input is the first layer, the convolution that follows it is the second layer, and so on.
1. Input image (512, 512, 3)
2. Convolution: 32 5x5 filters, stride "s1", padding "p1"
3. Convolution: 32 3x3 filters, stride 1, padding 1
4. Max pooling: 2x2 filter, stride 2
5. Convolution: 64 3x3 filters, stride 1, padding 1
6. Convolution: 64 3x3 filters, stride 1, padding 1
7. Max pooling: 2x2 filter, stride 2
8. Layer 'l'
9. Fully connected: 4096 neurons
10. Fully connected: 512 neurons
11. Fully connected: 'F' neurons
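The way spatial size and depth change through layers 3 to 7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (out_size, trace_shapes) are hypothetical, the pooling layers are assumed to use no padding, and the 512 x 512 input to layer 3 is an assumed value, since the real one depends on 's1' and 'p1'. Layer 'l' and 'F' are left out because the list does not specify them.

```python
def out_size(n, k, s, p):
    """Spatial output size of a conv or pooling layer (floor convention)."""
    return (n + 2 * p - k) // s + 1


# (name, kernel, stride, padding, output channels) for layers 3 to 7.
# Assumption: pooling uses padding 0. Pooling does not change the number
# of channels, so it repeats the value of the preceding convolution.
LAYERS_3_TO_7 = [
    ("conv3", 3, 1, 1, 32),
    ("pool4", 2, 2, 0, 32),
    ("conv5", 3, 1, 1, 64),
    ("conv6", 3, 1, 1, 64),
    ("pool7", 2, 2, 0, 64),
]


def trace_shapes(n_in):
    """Return (layer name, (height, width, channels)) for layers 3 to 7."""
    shapes = []
    n = n_in
    for name, k, s, p, channels in LAYERS_3_TO_7:
        n = out_size(n, k, s, p)
        shapes.append((name, (n, n, channels)))
    return shapes


if __name__ == "__main__":
    # Assumed input size to layer 3; change it to explore other cases.
    for name, shape in trace_shapes(512):
        print(name, shape)
```

Under these assumptions, each 3x3 convolution with stride 1 and padding 1 keeps the width and height unchanged, and each 2x2 max pooling with stride 2 halves them.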
Comprehension:
Suppose the output of the second layer (that is, the input to the third layer) has the same spatial dimensions (width and height) as the input that the second layer receives from the first layer. What are the possible values of stride 's1' and padding 'p1'?
A. stride 1, padding 1 (Incorrect option)
B. stride 2, padding 2
C. stride 1, padding 2
D. stride 2, padding 1
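Each option can be checked with the standard output-size formula for a convolution. The symbols below are introduced here for illustration and do not appear in the question: n is the input width or height (512), k is the filter size (5 for the second layer), p is the padding, and s is the stride.

```latex
\begin{align*}
n_{\text{out}} &= \left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1 \\[4pt]
\text{A } (s = 1,\ p = 1):\quad & \left\lfloor \frac{512 + 2 - 5}{1} \right\rfloor + 1 = 510 \\
\text{B } (s = 2,\ p = 2):\quad & \left\lfloor \frac{512 + 4 - 5}{2} \right\rfloor + 1 = 256 \\
\text{C } (s = 1,\ p = 2):\quad & \left\lfloor \frac{512 + 4 - 5}{1} \right\rfloor + 1 = 512 \\
\text{D } (s = 2,\ p = 1):\quad & \left\lfloor \frac{512 + 2 - 5}{2} \right\rfloor + 1 = 255
\end{align*}
```

Under this formula, only stride 1 with padding 2 reproduces the 512 x 512 input size. More generally, with stride 1 the size is preserved when p = (k - 1) / 2.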