Car Image Recognition with Convolutional Neural Network Applications

Team Members

Sujay Chebbi | Lai Jiang | Immanuel Ponminissery | Jenny Robinson | Patricia Schutter


Image recognition is an exciting field with many important applications, ranging from delivery services and payment processing to public safety. We wanted to take image recognition and apply it to cars. Have you ever wondered what the car model is when you receive an Amber Alert? What kind of car is the suspect driving? We believe that by installing cameras with high-speed image recognition capability at key highway intersections, it is possible to match a car to its model and help police find the criminal faster. We used the Stanford car dataset and a variety of transfer learning models. We found ResNet50 to be our best model.

Project Summary


The goal of this project is to build a fine-grained image classification model that classifies car models from different brands. Achieving high accuracy is challenging because similar sedan, truck, and SUV styles from different brands differ only slightly in their visual features.


Car classification can help police and the owners of stolen cars locate a suspect by matching the model to images of cars taken at major highway or road intersections.

As image recognition becomes faster and more accurate, it can help maintain public safety by capturing the perpetrator’s license plate and car model and reporting them to the relevant authorities in real time.

Besides public safety, image recognition could play an important role in marketing and sales. Automakers can apply image recognition technology to car enthusiasts’ blogs and Instagram and Twitter accounts to gauge consumer sentiment towards a certain model and the associations between brands.

Overall, image recognition has a wide range of applications that deserve further exploration. Our main focus here, however, is to compare and contrast different image recognition models and select the most reliable one for classification.


We used the Stanford car dataset of 16,185 images of cars, which covers 49 different car makers and 157 different car models after similar models were merged. The dataset was split 80/20 into train and test sets, and different variations of CNN image recognition models were trained and compared on test accuracy.

Introduction to CNNs

Convolutional Neural Networks (CNNs) are a category of neural networks that are effective for classification and image recognition. They are commonly used to identify faces and traffic signs, and in robot vision and self-driving cars.

A CNN has four main components, which are the basic building blocks of every CNN. Figure 1 below shows this basic structure.

Figure 1

The first layer is the convolution layer, followed by the ReLU layer, which introduces non-linearity into the data. The third layer is pooling, sometimes called sub-sampling, and the last is the classification layer, also called the fully connected layer.

The convolution layer preserves the spatial relationship between pixels by learning image features using small squares of input data. Here, the number of filters, filter size, stride, and zero padding are important parameters to manipulate.
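To make the roles of filter size, stride, and zero padding concrete, here is a minimal pure-Python sketch of a single-channel convolution. The `conv2d` helper, the toy image, and the edge filter are all illustrative assumptions, not code from the project; deep learning libraries do the same thing far more efficiently.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Cross-correlate a 2D list with a square kernel, with optional zero padding."""
    k = len(kernel)
    # Zero-pad the image on all four sides.
    width = len(image[0]) + 2 * padding
    padded = [[0.0] * width for _ in range(padding)]
    for row in image:
        padded.append([0.0] * padding + list(row) + [0.0] * padding)
    padded += [[0.0] * width for _ in range(padding)]

    # Output size: floor((input + 2 * padding - kernel) / stride) + 1
    out_h = (len(padded) - k) // stride + 1
    out_w = (width - k) // stride + 1
    output = []
    for i in range(out_h):
        out_row = []
        for j in range(out_w):
            total = 0.0
            for di in range(k):
                for dj in range(k):
                    total += padded[i * stride + di][j * stride + dj] * kernel[di][dj]
            out_row.append(total)
        output.append(out_row)
    return output


# Toy 4x4 "image" and a hypothetical 3x3 vertical-edge filter.
image = [
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
    [1, 0, 2, 3],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

feature_map = conv2d(image, kernel, stride=1, padding=0)  # 2x2 output
same_size = conv2d(image, kernel, stride=1, padding=1)    # 4x4 output
```

Without padding, the 4x4 input shrinks to 2x2. With one pixel of zero padding, the output keeps the input's 4x4 size.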

ReLU is an element-wise operation that replaces all negative values in the feature map with zero. Its purpose is to introduce non-linearity into the CNN. Other non-linear functions such as tanh or sigmoid can be used instead, but ReLU is the usual default activation when developing multilayer perceptrons and convolutional neural networks. ReLU helps mitigate the vanishing gradient problem, allowing models to learn faster and perform better.
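A small sketch may help illustrate both points. It applies ReLU to a toy feature map, then compares how a gradient shrinks when it is multiplied through ten layers of sigmoid versus ReLU derivatives. The depth of ten and the input values are arbitrary choices for illustration.

```python
import math


def relu(feature_map):
    """Replace every negative value in a 2D feature map with zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def sigmoid_grad(x):
    """Derivative of the sigmoid; never larger than 0.25."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


activated = relu([[-6.0, 12.0], [-5.0, 7.0]])

# Best case for sigmoid (x = 0) versus an active ReLU unit, chained over 10 layers.
depth = 10
sigmoid_chain = sigmoid_grad(0.0) ** depth
relu_chain = relu_grad(1.0) ** depth
```

Even in its best case the sigmoid chain shrinks to roughly one millionth of the original gradient, while the chain of active ReLU units stays at 1.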

The function of pooling is to progressively reduce the spatial size of the input representation and make the feature dimension smaller and more manageable. Note that a pooling layer is not necessary after every convolution layer; you can have multiple convolution + ReLU operations before a pooling operation.
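As an illustration, here is a minimal max-pooling sketch. The 2x2 window with stride 2 is a common choice that we assume here; the source does not state which pooling settings the project used.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the largest value in each size x size window of a 2D list."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + di][j * stride + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 0],
    [7, 2, 9, 8],
    [0, 1, 3, 4],
]
pooled = max_pool2d(feature_map)  # 4x4 input becomes 2x2
```

Each 2x2 window collapses to its maximum, so the spatial size is halved in both directions while the strongest responses are kept.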

Fully connected means that every neuron in the previous layer is connected to every neuron in the next layer. Adding a fully connected layer is also a useful way of learning a non-linear combination of the extracted features. When softmax is used as the activation function in the output layer, the output probabilities sum to 1. The convolution and pooling layers act as feature extractors from the input image, while the fully connected layer acts as a classifier.
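The softmax property mentioned above can be checked with a short sketch. The three logits are made-up scores for three hypothetical classes.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three classes
probs = softmax(logits)
predicted_class = probs.index(max(probs))
```

The class with the largest score receives the largest probability, and the probabilities always add up to 1 regardless of the scale of the scores.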

Data Description

The data was obtained from the ‘Cars’ dataset publicly available on Stanford University’s Artificial Intelligence Laboratory website. A few of the images are shown below.

Figure 2

There are 196 models in the Stanford car dataset. However, for the purposes of the classifiers built in this project, cars with multiple variants but virtually identical visual characteristics and similar cars from different model years were put in the same class. For instance, images that belong to Bentley Continental GT Coupe 2012 and Bentley Continental GT Coupe 2007 were put in the same class since the cars look similar. This brought the total number of classes down to 157.

The car_ims.tgz file contains all the images used for this project and the cars_annos.mat file contains the names of the cars corresponding to each image file. Code was written to create folders that were named after the 157 distinct classes for the train and test sets. After the directories were created, the images were copied from the unzipped car_ims.tgz file and put into the respective train and test folders. An 80/20 split was used to split the train and test data; there were 12948 train images and 3237 test images. The train images were further split using an 80/20 split for the purposes of validation.
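The split sizes quoted above can be reproduced with a simple shuffled index split. This is only a sketch: the `split_indices` helper and the seed are assumptions, and the project may have split the images differently (for example, class by class).

```python
import random


def split_indices(n_items, holdout_fraction=0.2, seed=0):
    """Shuffle item indices and return (kept, held_out) index lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_holdout = round(n_items * holdout_fraction)
    return indices[n_holdout:], indices[:n_holdout]


# 80/20 train/test split of the 16,185 images.
train_idx, test_idx = split_indices(16185)

# A second 80/20 split of the training images for validation.
fit_idx, val_idx = split_indices(len(train_idx), seed=1)
```

With 16,185 images this yields 12,948 training and 3,237 test indices, matching the counts in the text.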

Image Preprocessing

Since the different models make use of distinct preprocessing techniques, the image preprocessing was done on a model-by-model basis. For the transfer learning techniques, the images were preprocessed using each model's specific Keras ‘preprocess_input’ function. It was also ensured that the images were loaded at the appropriate image size before the preprocessing functions were run.
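To show why preprocessing has to match the model, the sketch below imitates, on a single pixel, the two preprocessing styles that the Keras ‘preprocess_input’ functions apply as we understand them. VGG16 and the ResNet models convert RGB to BGR and subtract per-channel ImageNet means, while InceptionV3 and Xception scale pixels to the range [-1, 1]. The function names, the mean constants, and the default input sizes are written from memory of the Keras implementation and should be treated as assumptions to verify against the library.

```python
# Per-channel ImageNet means in BGR order (assumed values, "caffe" style).
IMAGENET_BGR_MEAN = (103.939, 116.779, 123.68)

# Default input sizes of the pretrained models (assumed Keras defaults).
DEFAULT_INPUT_SIZE = {
    "VGG16": (224, 224),
    "ResNet50": (224, 224),
    "ResNet101": (224, 224),
    "InceptionV3": (299, 299),
    "Xception": (299, 299),
}


def preprocess_caffe_style(rgb_pixel):
    """VGG/ResNet style: RGB -> BGR, then subtract the channel means (no scaling)."""
    r, g, b = rgb_pixel
    return tuple(v - m for v, m in zip((b, g, r), IMAGENET_BGR_MEAN))


def preprocess_tf_style(rgb_pixel):
    """Inception/Xception style: scale each channel from [0, 255] to [-1, 1]."""
    return tuple(v / 127.5 - 1.0 for v in rgb_pixel)


pixel = (255.0, 128.0, 0.0)  # a hypothetical orange pixel
caffe_out = preprocess_caffe_style(pixel)
tf_out = preprocess_tf_style(pixel)
```

The same pixel ends up with very different values under the two schemes, which is why feeding a model images prepared for a different architecture can hurt accuracy.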

Basic CNN Model

The basic CNN model makes use of five convolutional layers with ‘relu’ activation and the ‘adam’ optimizer. Initially, there was clear evidence of overfitting, so dropout layers were added to reduce it. Regularization in the form of ‘l1’ and ‘l2’ regularizers was also added to help prevent overfitting. In the end, there were 1,818,679 total parameters, all of them trainable. Below is a snippet of the code:

Figure 3
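For readers unfamiliar with the two overfitting remedies mentioned above, here is a minimal sketch of what they compute. It is not the project's code: the weights, loss value, regularization strengths, and dropout rate are all hypothetical.

```python
import random


def l1_l2_penalty(weights, l1=0.0, l2=0.0):
    """Extra loss term: l1 * sum(|w|) + l2 * sum(w^2)."""
    l1_term = l1 * sum(abs(w) for w in weights)
    l2_term = l2 * sum(w * w for w in weights)
    return l1_term + l2_term


def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


weights = [0.5, -1.0, 2.0]
data_loss = 1.25  # hypothetical loss before regularization
total_loss = data_loss + l1_l2_penalty(weights, l1=0.01, l2=0.001)

activations = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(activations, rate=0.5, rng=random.Random(0))
```

The penalty makes large weights more expensive, and dropout randomly silences units during training so the network cannot rely on any single one.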

The model was set to train for 100 epochs with a callback that monitored the loss. Training stopped after 59 epochs, and the accuracy and loss curves below were obtained:
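The logic behind such a loss-monitoring callback can be sketched in a few lines. This is a simplified stand-in for an early stopping callback, not the project's code; the loss values and the patience of 3 are made up.

```python
def early_stopping_epoch(losses, patience):
    """Return the 1-based epoch where training stops, or None if it runs to the end."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss  # new best loss: reset the counter
            wait = 0
        else:
            wait += 1    # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


# Hypothetical monitored loss per epoch.
losses = [2.0, 1.5, 1.2, 1.3, 1.25, 1.21, 1.1, 1.0]
stopped_at = early_stopping_epoch(losses, patience=3)
never_stopped = early_stopping_epoch(losses, patience=5)
```

With a patience of 3, training stops at epoch 6 because epochs 4 to 6 never beat the best loss of 1.2. A more patient callback would have continued and seen the later improvement.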

Figure 4

For this model, the test accuracy was only 15.6%. This called for transfer learning techniques to be used in the hope that they would yield higher accuracies.

Transfer Learning

Most convolutional neural networks are not trained from scratch; instead, people use models that have already been trained on another dataset and use said models to help classify the data. This method of solving problems, where knowledge gathered from solving problem A is used to solve problem B is called transfer learning. The idea is that the model will still be able to detect edges and other features based on knowledge it gained from solving a previous problem.

For the purpose of this project, five different transfer learning models have been trained and tested:

  1. VGG16
  2. ResNet50
  3. ResNet101
  4. InceptionV3
  5. Xception


The first model we used to upgrade from the basic CNN model was VGG-16. This model has 16 weight layers (13 convolutional and 3 fully connected), each composed of trainable parameters. Although it is more complex, with about 138 million parameters, its architecture is very uniform. The convolution kernels are all 3 x 3 with a stride of 1. Figure 5 shows the process: two convolutional layers followed by pooling, then another two followed by pooling, then three sets of three convolutional layers, each followed by pooling, all the way down to the three dense layers, followed by a softmax function for output. The hidden layers all use ReLU, and the three dense layers at the end are fully connected. We used a library implementation of VGG-16 with pre-trained weights, but we unfroze the last two layers before fitting on our own data set to allow for more flexibility. We also pre-processed the images for the training and test sets.
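The layer count and the figure of roughly 138 million parameters can be reproduced with some arithmetic. The sketch below assumes the standard ImageNet configuration of VGG-16 (224 x 224 input, 1,000 output classes), not the 157-class model fitted in this project.

```python
# (in_channels, out_channels) of the 3x3 conv layers, grouped by pooling block.
CONV_BLOCKS = [
    [(3, 64), (64, 64)],
    [(64, 128), (128, 128)],
    [(128, 256), (256, 256), (256, 256)],
    [(256, 512), (512, 512), (512, 512)],
    [(512, 512), (512, 512), (512, 512)],
]
# (inputs, outputs) of the dense layers; 224 / 2**5 = 7, so 7 * 7 * 512 inputs.
DENSE_LAYERS = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]


def conv_params(c_in, c_out, k=3):
    """Weights plus one bias per output channel."""
    return k * k * c_in * c_out + c_out


def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


n_conv_layers = sum(len(block) for block in CONV_BLOCKS)
n_weight_layers = n_conv_layers + len(DENSE_LAYERS)
conv_total = sum(conv_params(i, o) for block in CONV_BLOCKS for i, o in block)
dense_total = sum(dense_params(i, o) for i, o in DENSE_LAYERS)
total_params = conv_total + dense_total
```

Most of the parameters sit in the three dense layers; the 13 convolutional layers together account for only about 14.7 million.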

Figure 5

Figure 6

Figure 7

Our results with VGG-16 were better than those of the basic CNN model, but still not very impressive. The model was trained for 10 epochs, and the batch size of 50 gave us a total of 259 batches per epoch.
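The batch count follows directly from the training set size, assuming all 12,948 training images were used and the final partial batch was kept:

```python
import math

n_train_images = 12948
batch_size = 50

batches_per_epoch = math.ceil(n_train_images / batch_size)
last_batch_size = n_train_images - (batches_per_epoch - 1) * batch_size
```

This gives 259 batches per epoch, with the last batch holding the remaining 48 images.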

We decided on these numbers after some trial and error. The left side of Figure 7 shows the accuracy curve, where the test accuracy leveled out after about 10 epochs at around 0.4963. The training accuracy continued to rise (as expected) to 0.8165 after 10 epochs.

The right side of Figure 7 shows the loss curve, which also levels out at a test loss just above 2, the minimum being 2.26.

An accuracy score of about 50% is unsatisfactory, but VGG-16 does have potential in other circumstances. For instance, with access to more powerful hardware, we could train all of the weights rather than a subset. This could potentially help us achieve higher accuracy scores when predicting the car images. VGG-16 has also helped pave the way for more advanced neural networks like ResNet.


Key difference: ResNet50 has 50 layers and adds identity mapping.

Figure 8

In Figure 8, you can see that the plain network flows in one direction only, from layer to layer. Residual networks add a shortcut (identity) mapping that skips layers: the input to a block is added to the block's output. This helps gradient descent because gradients can flow back through the shortcut directly, which makes deep networks easier to train and lets a block fall back to the identity if its additional layers do not help.
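The idea of a skip connection can be sketched with plain lists. The two "transforms" below are hypothetical stand-ins for the convolution layers inside a block, and the ReLU that real ResNet blocks apply after the addition is left out for brevity.

```python
def residual_block(x, transform):
    """Output = F(x) + x: the block's input is added back to its output."""
    fx = transform(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_transform(x):
    """Layers that have learned nothing useful (all-zero output)."""
    return [0.0 for _ in x]


def small_transform(x):
    """Layers that have learned a small correction to the input."""
    return [0.1 * v for v in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_transform)   # falls back to the input
adjusted_out = residual_block(x, small_transform)  # input plus a correction
```

When the inner layers contribute nothing, the block simply passes its input through, so adding more blocks does not have to make the network worse.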

The ResNet50 architecture is shown in Figure 9. It consists of five stages built from convolution blocks and identity blocks, each of which contains three convolution layers.

Figure 9

This model was tested using ReLU activation and the Adam optimizer. The initial ResNet50 weights were imported from ImageNet, which resulted in 23,909,405 total parameters. The top few layers were then frozen, resulting in 9,257,117 parameters that were trainable and 14,652,288 that were not. A snippet of the layers is shown below in Figure 10.
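As a sanity check, the reported counts add up, and they are consistent with one plausible reading of the model: a pretrained backbone followed by a single dense layer over 2,048 pooled features. That head structure is an assumption on our part; the source does not describe it.

```python
# Counts reported in the text.
total_params = 23_909_405
trainable_params = 9_257_117
frozen_params = 14_652_288

# Assumption: a single dense classification layer on 2,048 pooled features.
n_classes = 157
pooled_features = 2048
head_params = pooled_features * n_classes + n_classes  # weights + biases

# Whatever is left would belong to the pretrained backbone.
backbone_params = total_params - head_params
```

Under this assumption the head has 321,693 parameters and the backbone 23,587,712, which, to our knowledge, matches the size commonly reported for ResNet50 without its original classification layer.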

Figure 10

The model was set to train for 50 epochs with a callback option that monitored the loss. The model stopped after 13 epochs, resulting in the following accuracy and loss curves:

Figure 11

The model appears to have performed quite well, resulting in a test accuracy of 76.68%.


Key Difference: ResNet101 is essentially the big brother to ResNet50. Rather than 50 layers, it has 101 layers.

As seen in Figure 12, ResNet101 obtains its additional layers by using more three-layer blocks in the fourth stage.

Figure 12

This model was again tested using ReLU activation and the Adam optimizer. The initial ResNet101 weights were imported from ImageNet, which resulted in 42,979,869 total parameters. The top few layers were then frozen, resulting in 15,563,421 parameters that were trainable and 27,416,448 that were not. A snippet of the layers is shown below in Figure 13.

Figure 13

The model was set to train for 40 epochs with a callback option that monitored the loss. The model stopped after seven epochs, resulting in the following accuracy and loss curves:

Figure 14

ResNet101 resulted in a test accuracy of 68.18%, which is a bit disappointing as it is about 8.5 percentage points lower than that of ResNet50. Looking at the training and validation accuracy graph, we can see that the validation accuracy of the model was higher at one point than where it actually finished, indicating that towards the end, the model began fitting noise in the training data, that is, overfitting.


The InceptionV3 model makes use of stacked smaller kernels rather than 7 x 7 or 5 x 5 convolutions. This helps reduce the number of parameters and the risk of overfitting. Furthermore, InceptionV3 makes use of auxiliary classifiers. Although it has been previously stated by researchers that auxiliary classifiers “promote more stable learning and better convergence”, it was found that auxiliary classifiers function as regularizers in the InceptionV3 framework. A schematic that demonstrates the InceptionV3 structure is shown below:
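The parameter saving from stacking small kernels is easy to quantify. The sketch below compares weight counts for a single large convolution and a stack of 3 x 3 convolutions that covers the same receptive field; the channel count of 64 is an arbitrary choice and biases are ignored.

```python
def conv_weights(kernel_size, channels):
    """Weights of one conv layer with `channels` inputs and outputs (no biases)."""
    return kernel_size * kernel_size * channels * channels


def stacked_receptive_field(n_layers, kernel_size=3):
    """Receptive field of `n_layers` stacked stride-1 convolutions."""
    return 1 + n_layers * (kernel_size - 1)


channels = 64  # hypothetical

one_5x5 = conv_weights(5, channels)
two_3x3 = 2 * conv_weights(3, channels)

one_7x7 = conv_weights(7, channels)
three_3x3 = 3 * conv_weights(3, channels)
```

Two stacked 3 x 3 layers see the same 5 x 5 region with 28% fewer weights (18 versus 25 per channel pair), and three stacked layers cover 7 x 7 with about 45% fewer (27 versus 49).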

Figure 15

For the purposes of this project, the InceptionV3 model was imported with weights from ImageNet. The top layers were frozen so that only 6,641,501 parameters out of the total 22,124,477 parameters could be trained on the cars dataset. A snippet of the model summary is shown below in Figure 16 along with the accuracy and loss curves obtained in Figure 17.

Figure 16

Figure 17

As can be seen, the model stopped training after 8 epochs based on a callback that monitored the loss. The validation accuracy plateaued at approximately 70%. When the model was evaluated on the test set, it achieved an accuracy of 73.5%.


François Chollet, creator of Keras, proposed an “extreme” variation of the Inception V3 convolutional network under the name Xception. In Xception, the Inception modules are replaced with depthwise separable convolutions, which are almost identical to an “extreme” version of an Inception module. One notable difference is the order of operations. In Xception, the channel-wise spatial convolution is performed first, followed by the 1x1 convolution; in Inception V3, the 1x1 convolution is performed first. The other difference is the non-linearity between the two operations: in Inception V3, the non-linear ReLU (rectified linear unit) follows both operations, whereas Xception's depthwise separable convolutions have no non-linearity between the spatial and 1x1 convolutions. The 1x1 convolution is used to reduce the number of depth channels and speed up convolution. Like all the other transfer learning models mentioned above, the Xception model is pre-trained on ImageNet. Figure 18 shows the Xception architecture that Chollet proposed.
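To see why depthwise separable convolutions are attractive, compare their weight count with that of a standard convolution. The layer size below (3 x 3 kernel, 128 input channels, 256 output channels) is a hypothetical example, and biases are ignored.

```python
def standard_conv_weights(k, c_in, c_out):
    """Every output channel has its own k x k filter over every input channel."""
    return k * k * c_in * c_out


def separable_conv_weights(k, c_in, c_out):
    """Depthwise (one k x k filter per input channel) + pointwise (1x1) weights."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


standard = standard_conv_weights(3, 128, 256)
separable = separable_conv_weights(3, 128, 256)
reduction = standard / separable
```

For this layer the separable version needs 33,920 weights instead of 294,912, a reduction of roughly 8.7 times.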

Figure 18

The Xception base model contained 132 layers. We froze the first 110 layers and trained the remaining 22 layers to conserve computational resources. We also used a dropout rate of 0.2. The model had a total of 21,183,173 parameters, of which 8,186,061 were trainable. Figure 19 details the layers used in the Xception model.

Figure 19

Unfortunately, we were unable to repeat the success of the VGG, ResNet, and Inception models. The Xception model performed by far the worst; it performed even worse than the basic CNN model. We ran 20 epochs, and the training and test accuracy and loss values did not improve from epoch to epoch. Figure 20 shows the accuracy and loss curves.

Figure 20

We implemented an early stopping callback that would stop training after 5 epochs without a decrease in validation loss. As can be seen in Figure 20, the validation loss stayed nearly constant at a value around 5 and barely decreased. The final test accuracy and test loss values were 5.19% and 5.0138, respectively. Some improvements could perhaps be made by changing the pre-processing and/or training the model with different sets of parameters.
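A loss of about 5 is informative in itself. Assuming the loss is categorical cross-entropy with the natural logarithm (the source does not say), a model that spreads its probability evenly over all 157 classes would score as follows:

```python
import math

n_classes = 157

# Cross-entropy of a uniform prediction: -ln(1 / n_classes) = ln(n_classes).
uniform_loss = -math.log(1.0 / n_classes)

# Accuracy expected from picking a class at random.
chance_accuracy = 1.0 / n_classes

reported_loss = 5.0138    # from the text
reported_accuracy = 0.0519  # from the text
loss_gap = uniform_loss - reported_loss
```

The uniform baseline is about 5.056, so the reported loss of 5.0138 is only slightly better than guessing evenly across classes, even though the accuracy of 5.19% is above the chance level of about 0.64%.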


Model Summary:

After creating and testing all six of these different models, we found that ResNet50 was the model that produced the best test accuracy. To truly understand what these different accuracies mean when classifying the car make and model of an image, we decided to try each model on four different images.

Image 1: Ferrari 458

Model Classification:

Image of Incorrect classification:

As you can see from the results, four of the models (CNN, VGG16, ResNet50, and Inception V3) classified the Ferrari correctly. ResNet101 and Xception incorrectly classified the Ferrari as a Corvette. Although these two models were wrong, they were not far off: both predicted a car of similar shape, and neither predicted a truck or a van for what is a sports car.

Image 2: AM General

Model Classification:

Images of Incorrect classification:

Again, from the results, we can see that only three models correctly classified the AM General: ResNet50, ResNet101, and Inception V3. CNN, VGG16, and Xception incorrectly classified this car. Diving a bit deeper, these results make sense if we consider the accuracy scores of the models. Xception and CNN had the worst results of the six models and are far off in this classification. VGG16 was somewhere in the middle, and given its classification of a Jeep Wrangler, it’s understandable from the image why the model may have predicted it as such.

Image 3: BMW 3

Model Classification:

Images of Incorrect classification:

For this image of a BMW 3 Series, we wanted to test a more common car. Looking at the results, four of the models (VGG16, ResNet50, ResNet101, and Inception V3) correctly classified the BMW. Again, this makes sense, as CNN and Xception had the worst results.

Image 4: Acura TL

Model Classification:

Images of Incorrect classification:

For the final test, we wanted to use an original image; if you look at this photo, one of our group members, Immanuel, is actually in it. The car in the image is an Acura TL. Two of the models, ResNet50 and ResNet101, correctly classified this car. Based on the accuracy values of the models, this was a bit surprising, as we would have expected Inception V3 and VGG16 to do better given that ResNet101 was able to correctly classify the image.

Summary of Image Tests:


We found that the model that performed best was ResNet50, which achieved a test accuracy of 76.68% and was the only model to successfully classify all four test images. It’s interesting to note that only the ResNet models (ResNet50 and ResNet101) were able to correctly classify the final image of the Acura TL, which also contained an “extraneous” subject (our group member, Immanuel). It seems that the ResNet models, out of this particular set of models, are best at classifying cars.
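The four image tests can be tallied per model. The dictionary below simply encodes which models the text above reports as correct for each image; the variable names are ours.

```python
# Models reported as classifying each test image correctly.
correct_by_image = {
    "Ferrari 458": {"CNN", "VGG16", "ResNet50", "InceptionV3"},
    "AM General": {"ResNet50", "ResNet101", "InceptionV3"},
    "BMW 3 Series": {"VGG16", "ResNet50", "ResNet101", "InceptionV3"},
    "Acura TL": {"ResNet50", "ResNet101"},
}
MODELS = ["CNN", "VGG16", "ResNet50", "ResNet101", "InceptionV3", "Xception"]

# Number of images (out of four) each model got right.
correct_counts = {
    model: sum(model in winners for winners in correct_by_image.values())
    for model in MODELS
}
ranking = sorted(MODELS, key=lambda m: correct_counts[m], reverse=True)
```

ResNet50 gets all four images right, ResNet101 and Inception V3 three each, VGG16 two, the basic CNN one, and Xception none.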

Some future improvement opportunities include testing more models. The Keras library contains a total of 25 different applications, and in this project, we only utilized 5. By no means is ResNet50 the best application model possible, as ResNet50V2 or ResNet152 could achieve better results. We would also have liked more variety in our data, that is, more car brands and models. However, this also increases the number of classes our model needs to distinguish. This does require more computational power, but we feel that it could lead to better accuracy in the future. We also could have done more fine-tuning of the models, specifically with regard to the batch size, number of layers, and input shape of the images. Finally, this project could most definitely be improved by utilizing computer vision to classify cars in real time, with the help of libraries such as OpenCV.





GitHub Link for Project