Exploring Art Classification Using Convolutional Neural Networks
Introduction
Neural networks are becoming increasingly ubiquitous in our lives today, appearing in a wide variety of areas including cars, defense, space travel, and search engines. For this project I wanted to use the power of neural networks to help classify one of my favorite things: art. The dataset I'm using originally contained over 9,000 images, classified into 5 types of art: drawings, paintings, sculptures, engravings, and iconography.
What is a Convolutional Neural Network (CNN)?
A Convolutional Neural Network (or CNN) is most popularly used with images because it excels at feature detection, achieving high accuracy on datasets such as collections of different animals or street signs. In a CNN, the spatial relationship between pixels is maintained, since images are fed to the network in their original two-dimensional format and then transformed.
A CNN is made up of two different types of layers: Feature Learning Layers, which consist of alternating convolutional and pooling layers, and Classification Layers, which are fully connected layers that interpret the extracted features and produce the final classification. Each layer type has its own hyper-parameters that need to be tuned by hand and can have a large impact on the accuracy of the model.
Here we can see a simple example that shows the use cases for each type of layer.
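For instance, a pooling layer simply downsamples each feature map while keeping the strongest responses. Here is a minimal NumPy sketch of 2x2 max pooling (the 4x4 values are made up for illustration):

```python
import numpy as np

# A made-up 4x4 feature map, as if produced by a convolutional layer.
feature_map = np.array([
    [1, 3, 2, 0],
    [4, 6, 1, 2],
    [0, 2, 7, 5],
    [1, 1, 3, 8],
])

# 2x2 max pooling with stride 2: keep the largest value in each 2x2 block,
# halving the spatial dimensions while preserving the strongest activations.
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 2]
#  [2 8]]
```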
The Dataset
For this project, I used this dataset from Kaggle, which consists of over 9,000 images of different types of art: drawings, engravings, iconography (old Russian art), paintings, and sculptures. I cleaned the data by removing duplicates, corrupted images, and other unsupported image types, which resulted in a final dataset of 6,642 images. In order to input this dataset into the CNN, I had to normalize the data, i.e. resize all images to the same dimensions.
Example images from each of the five classes: drawings, engravings, iconography, paintings, and sculptures.
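The cleaning step can be sketched roughly as follows. This is an illustrative version rather than the exact script: the `dataset` directory layout and the use of an MD5 hash of the file bytes for duplicate detection are assumptions.

```python
import hashlib
from pathlib import Path
from PIL import Image

def clean_dataset(root="dataset"):
    """Remove corrupted/unsupported files and exact duplicates in place."""
    seen_hashes = set()
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Drop anything Pillow cannot open (corrupted or unsupported format).
        try:
            with Image.open(path) as img:
                img.verify()
        except Exception:
            path.unlink()
            continue
        # Drop exact duplicates by hashing the raw file bytes.
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        if digest in seen_hashes:
            path.unlink()
        else:
            seen_hashes.add(digest)

clean_dataset()
```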
In order to determine how the size and qualities of an image affect the resulting accuracy, I generated 5 new datasets: 25x25 grayscale, 50x50 grayscale, 75x75 grayscale, 50x50 RGB, and 50x50 RGB cropped, which doesn't resize the images but instead takes a 50x50 cutout from the center of each image. If I had had more processing power, I would also have trained on a 75x75 RGB dataset.
Example images at 25x25, 50x50, and 75x75 resolution.
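Generating these variants can be sketched with Pillow as below. The helper and file paths are illustrative assumptions; only the target sizes, color modes, and the center-crop idea come from the description above.

```python
from PIL import Image

def make_variant(path, size, mode="RGB", crop=False):
    """Return one preprocessed copy of the image at `path`.

    size -- target edge length (25, 50, or 75)
    mode -- "L" for grayscale, "RGB" for color
    crop -- if True, take a centered size x size cutout instead of resizing
    """
    img = Image.open(path).convert(mode)
    if crop:
        w, h = img.size
        left, top = (w - size) // 2, (h - size) // 2
        return img.crop((left, top, left + size, top + size))
    return img.resize((size, size))

# The five dataset variants described above (illustrative settings):
variants = {
    "25x25_gray":     dict(size=25, mode="L"),
    "50x50_gray":     dict(size=50, mode="L"),
    "75x75_gray":     dict(size=75, mode="L"),
    "50x50_rgb":      dict(size=50, mode="RGB"),
    "50x50_rgb_crop": dict(size=50, mode="RGB", crop=True),
}

# Example usage (hypothetical path):
# thumb = make_variant("some_image.jpg", **variants["50x50_rgb"])
```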
Network Structure
1. Convolutional Layer (16 filters)
2. Pooling Layer
3. Convolutional Layer (32 filters)
4. Pooling Layer
5. Convolutional Layer (64 filters)
6. Pooling Layer
7. Dense Layer (512 neurons)
8. Dense Layer (5 neurons, one for each class)
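In code, this structure corresponds to something like the following Keras sketch. The 3x3 kernels, 2x2 pooling windows, activations, and input shape are assumptions, since only the filter and neuron counts are listed above.

```python
from tensorflow.keras import layers, models

def build_model(input_shape=(50, 50, 3), num_classes=5):
    # Assumed 3x3 kernels, 2x2 pooling, and ReLU activations; only the
    # filter/neuron counts below come from the structure listed above.
    return models.Sequential([
        layers.Conv2D(16, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),  # one neuron per art class
    ])

model = build_model()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```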
Most of my work on this project went into hyper-parameter tuning: I spent many hours adjusting the number of convolutional layers, the number of dense layers, the number of filters per convolutional layer, and the number of neurons per dense layer in order to optimize accuracy. After I had finished writing my research paper, I spent some more time tuning these parameters and ended up with the exact opposite conclusion from the one I had stated in the paper.
In this network, we can see that the numbers of filters are powers of 2 and increase by a factor of 2 with each convolutional layer. A filter is a randomly initialized matrix of a predefined size that extracts "features" from an image by sliding over portions of the input, computing the dot product between the filter and each patch, and storing the result in a feature map. So the more filters you have, the more features you are trying to extract from each image. In the first layer, we are only trying to detect 16 different low-level features, like edges and small shapes. By the time we get to the last convolutional layer, we are trying to detect broader portions of the image, like a nose or a dog.
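To make the filter idea concrete, here is a tiny NumPy sketch of a single filter sliding over an image. The hand-written edge filter and random image are purely illustrative; in a real CNN the filter weights are learned during training rather than fixed by hand.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` and store each dot product in a feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel)  # dot product with this patch
    return feature_map

image = np.random.rand(6, 6)            # toy grayscale "image"
edge_filter = np.array([[-1, 0, 1],     # a hand-written vertical-edge detector;
                        [-1, 0, 1],     # in a CNN these weights start random
                        [-1, 0, 1]])    # and are learned during training
print(convolve2d(image, edge_filter).shape)  # (4, 4) feature map
```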
Results
Grayscale Images
25x25: 34.0% accuracy
50x50: 46.3% accuracy
75x75: 54.0% accuracy
Color Images
50x50: 60.3% accuracy
50x50 (cropped): 72.1% accuracy
The results clearly indicate that image resolution significantly affects the model's accuracy: the 75x75 grayscale dataset outperforms the 25x25 dataset by 20 percentage points. As predicted, the 50x50 RGB dataset is 14 percentage points more accurate than its grayscale counterpart. It's also worth noting that the cropped images yield better accuracy (by about 12 percentage points here) than their resized counterparts.
There was an issue with the prediction component of the model when using the testing dataset, so I ended up using a separate validation dataset (which wasn't part of the training or testing sets) to create the confusion matrix for the 50x50 dataset. From the results, it's evident that our model excels at identifying iconography but struggles with sculptures. This is understandable given that images in the iconography dataset are quite consistent, while sculptures can vary widely in appearance.
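Producing such a confusion matrix from a held-out set can be sketched with scikit-learn as below. Here `model`, `val_images`, and `val_labels` are placeholders for the trained network and the separate validation data mentioned above, and the one-hot label format is an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# `model`, `val_images`, and `val_labels` are placeholders for the trained
# network and the held-out validation set described above.
class_names = ["drawings", "engravings", "iconography", "paintings", "sculptures"]

pred_probs = model.predict(val_images)        # shape: (n_samples, 5)
pred_classes = np.argmax(pred_probs, axis=1)  # most likely class per image
true_classes = np.argmax(val_labels, axis=1)  # assumes one-hot labels

cm = confusion_matrix(true_classes, pred_classes)
print(cm)  # rows: true class, columns: predicted class (ordered as class_names)
```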
Points of Interest
I originally started with 32, 64, and 128 filters in the three convolutional layers, tuned against the RGB datasets. When I shifted to testing the model on the grayscale images, I was consistently getting around 30-32% accuracy, and the model's accuracy stopped increasing after the first few epochs. I decreased the number of filters per layer by a factor of 2 and reduced the number of neurons in the fully connected layer by a factor of 4. The accuracy on the 50x50 grayscale dataset went up by 16 percentage points, and the accuracy on the 75x75 dataset increased to 54%. So in this model, more isn't always better. With this architecture, the accuracy of the 50x50 RGB and cropped versions only decreased by a few percentage points.
Lessons Learned & Next Steps
During this project, I made many mistakes, but each one was a learning opportunity. If I were to start again, I might opt for a cleaner dataset, since I spent about half the project time preprocessing the data. I'd also begin with convolutional layers rather than solely dense layers, and compare variations of the CNN against each other rather than contrasting them with dense-only models.
If there were more time, I'd experiment with different architecture combinations to aim for better accuracy. I'd also like to test larger image sizes, which I was unable to do because of processing-power limitations. Considering that most images are larger than 150x150, training the model on that size may well yield better results than the 50x50 or 75x75 images. The question is not whether a larger input raises the accuracy of the model, but by how much, and at what point increasing the input size stops helping.