Convolutional Neural Networks (CNNs) are now ubiquitous in computer vision. After foundational work in the late 1990s, the last decade has seen an explosion of CNN advances, leaving us with a plethora of architectures to choose from. In this post we sketch the details of the more important CNNs in the hope of giving you a starting point in the selection process.

LeNet-5

Released in 1998, LeNet-5 was one of the first CNNs. It has 2 convolutional and 3 fully-connected layers, with about 60,000 trainable parameters. It became the standard template for modern CNNs, which follow the pattern of stacking convolutional and pooling layers and terminating the model with one or more fully-connected layers.
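As a sanity check on that figure, the roughly 60,000 parameters can be reproduced by summing the weights and biases of the five trainable layers. This is a sketch using the commonly cited modern layer sizes (two 5x5 convolutions with 6 and 16 filters, then fully-connected layers of 120, 84 and 10 units); the original paper's subsampling layers also had a few trainable scalars, which this omits:

```python
# Parameter count for a LeNet-5-style stack (weights + biases per layer).

def conv_params(in_ch, out_ch, k):
    return out_ch * (in_ch * k * k + 1)  # +1 bias per filter

def fc_params(n_in, n_out):
    return n_out * (n_in + 1)            # +1 bias per output unit

total = (
    conv_params(1, 6, 5)          # C1: 1x32x32 -> 6x28x28
    + conv_params(6, 16, 5)       # C3: 6x14x14 -> 16x10x10
    + fc_params(16 * 5 * 5, 120)  # flatten 16x5x5 after final pooling
    + fc_params(120, 84)
    + fc_params(84, 10)
)
print(total)  # 61706, i.e. about 60,000
```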

AlexNet

Released in 2012, AlexNet builds on the ideas of LeNet but has 8 layers (5 convolutional and 3 fully-connected), giving a total of 60 million parameters. Its developers successfully used overlapping pooling and introduced Rectified Linear Units (ReLUs) as activation functions.
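The ReLU itself is just max(0, x). A minimal sketch of why it was attractive: it is cheap to compute, and its gradient is 1 for any positive input, so it does not saturate the way sigmoid or tanh do:

```python
def relu(x):
    # Rectified Linear Unit: pass positives through, clamp negatives to zero.
    return max(0.0, x)

def relu_grad(x):
    # Gradient is 1 for positive inputs and 0 otherwise -- no saturation
    # for positive activations, unlike sigmoid/tanh.
    return 1.0 if x > 0 else 0.0

print([relu(v) for v in (-2.0, 0.0, 3.5)])  # [0.0, 0.0, 3.5]
```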

VGG-16

VGG-16 was released in 2014 by the Visual Geometry Group (VGG), whose developers believed that the best way to improve the performance of a CNN was to stack more layers onto it. VGG-16 has 13 convolutional and 3 fully-connected layers, giving it a total of 138 million parameters. Like AlexNet, it uses ReLUs as activation functions.
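The payoff of stacking small filters shows up in a quick parameter comparison: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, but with fewer weights. A sketch, ignoring biases, for an assumed channel width of 64:

```python
# Weights (ignoring biases) for a layer with C input and C output channels.
C = 64                            # assumed channel width, for illustration
two_3x3 = 2 * (3 * 3 * C * C)     # two stacked 3x3 layers
one_5x5 = 5 * 5 * C * C           # a single 5x5 layer, same receptive field
print(two_3x3, one_5x5)           # 73728 vs 102400 -- about 28% fewer weights
```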

Inception-v1

Released in 2014, Inception-v1 made heavy use of the Network in Network approach; it has 22 layers and 5 million parameters. The network improves the usage of computing resources by stacking dense modules, each containing several convolutional layers, instead of simply stacking convolutional layers on top of each other.
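Part of that efficiency comes from the 1x1 convolutions inside each module (the Network in Network idea), which shrink the channel count before the expensive 3x3 convolution. A rough multiplication count, with assumed channel sizes chosen for illustration:

```python
# Multiplications per output position (ignoring spatial size and biases).
in_ch, mid_ch, out_ch = 192, 96, 128  # assumed channel sizes, for illustration

direct = 3 * 3 * in_ch * out_ch                        # plain 3x3 conv
bottleneck = in_ch * mid_ch + 3 * 3 * mid_ch * out_ch  # 1x1 reduce, then 3x3
print(direct, bottleneck)  # 221184 vs 129024
```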

Inception-v3

Released in 2015 as a successor to Inception-v1, Inception-v3 has 24 million parameters and is 48 layers deep. It was trained on more than one million images from the ImageNet database and can classify images into 1,000 categories. Inception-v3 was among the first architectures to use batch normalization, and it applies a factorization method for more efficient computation using spatially separable convolutions. Simply put, a 3x3 kernel is decomposed into two smaller ones, a 3x1 and a 1x3 kernel, which are applied sequentially.
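When a 3x3 kernel is separable (an outer product of a column and a row), applying the 3x1 and 1x3 kernels in sequence gives exactly the same result as the full 3x3 convolution, at 6 multiplications per position instead of 9. A small pure-Python check of that equivalence:

```python
def conv2d_valid(img, kernel):
    # "Valid" cross-correlation of a 2-D list image with a 2-D list kernel.
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(img[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(len(img[0]) - kw + 1)
        ]
        for i in range(len(img) - kh + 1)
    ]

col = [[1], [2], [1]]   # 3x1 kernel
row = [[1, 0, -1]]      # 1x3 kernel
full = [[c[0] * r for r in row[0]] for c in col]  # outer product: 3x3 kernel

img = [[(i * 5 + j) % 7 for j in range(5)] for i in range(5)]  # toy 5x5 image
assert conv2d_valid(img, full) == conv2d_valid(conv2d_valid(img, col), row)
```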

Inception-v4

Released in 2016 with 43 million parameters, Inception-v4 dramatically improved training speed due to residual connections.

ResNet-50

Released in 2015 and consisting of 50 layers organized into ResNet blocks (each with 2 or 3 convolutional layers), ResNet-50 has 26 million parameters. Its basic building blocks are convolutional and identity blocks. To address the degradation in accuracy seen in very deep networks, Microsoft researchers added skip connections. ResNet-50 was also among the first CNNs to use batch normalization.
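The skip connection is simple to state: a block learns a residual F(x) and outputs F(x) + x, so when F is near zero the block passes its input through unchanged, which is what keeps very deep stacks trainable. A toy sketch on plain vectors (the real blocks use convolutions and batch normalization; the element-wise add and final ReLU are the essential part):

```python
def identity_block(x, residual_fn):
    # Add the learned residual to the unchanged input, then apply ReLU.
    return [max(0.0, r + s) for r, s in zip(residual_fn(x), x)]

# With a zero residual the block acts as an identity (followed by ReLU),
# so stacking many such blocks cannot make the representation worse.
zero_fn = lambda x: [0.0] * len(x)
print(identity_block([1.0, 2.0, 3.0], zero_fn))  # [1.0, 2.0, 3.0]
```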

MobileNet

Released in 2017, MobileNet has 3.4 million parameters. It makes use of a novel layer, the depthwise separable convolution, which deals with image depth (channels) rather than only the spatial dimensions, greatly reducing the number of multiplications. Because of its small size, the model is considered very useful for mobile and embedded devices.
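The reduction is easy to quantify. Per output position, a standard convolution costs Dk·Dk·M·N multiplications, while a depthwise separable one costs Dk·Dk·M (one Dk x Dk filter per input channel) plus M·N (a 1x1 pointwise convolution to mix channels), a ratio of 1/N + 1/Dk². A sketch with assumed example sizes:

```python
# Multiplications per output position. Dk = kernel size, M = input channels,
# N = output channels (assumed example sizes, for illustration).
Dk, M, N = 3, 32, 64

standard = Dk * Dk * M * N                 # full convolution
depthwise_separable = Dk * Dk * M + M * N  # depthwise pass + 1x1 pointwise
print(standard, depthwise_separable)       # 18432 vs 2336, about 8x fewer
```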

DenseNet-169

Released in 2017, DenseNet was developed specifically to counter the decline in accuracy caused by the vanishing gradient problem. Maximum information flow is ensured by simply connecting every layer directly with every other layer. This requires fewer parameters than an equivalent traditional CNN, as there is no need to learn redundant feature maps. Some variations of ResNet showed that many layers barely contribute and can be dropped; DenseNet instead uses very narrow layers that each add only a small set of new feature maps.
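Those narrow layers are governed by a "growth rate" k: each layer outputs only k new feature maps, but its input is the concatenation of the block input and all earlier layers' outputs, so the channel count grows linearly through the block. A sketch with an assumed initial width and growth rate:

```python
# Input channel count seen by each layer in one dense block.
k0, k = 64, 32   # assumed initial channels and growth rate, for illustration
layers = 6

for l in range(layers):
    in_ch = k0 + l * k  # concatenation of block input and all l earlier outputs
    print(f"layer {l}: {in_ch} channels in, {k} channels out")
```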