Audio and Face Recognition
One of the most interesting and modern projects I worked on as a Data Scientist is the one I discuss here. The project is divided into three tasks, each concerning the creation of an application of the kind we deal with daily. Every day most people on Earth use devices driven by Data Science solutions such as artificial intelligence or machine learning models. For example, when you unlock your smartphone with your face, the device relies on an application powered by a model trained to detect and recognize your face. The same concept applies to audio recognition. The first two tasks of the project covered these two examples, while the last one focused on image retrieval.
I worked alongside two Data Scientists to develop these three applications.
The first one, Audio Recognition, required creating a program able to recognize the speaker among three people. After the data collection phase, in which we gathered voice samples, we extracted features from each audio track to obtain a numerical representation to feed the model. A convolutional neural network allowed us to predict the speaker almost perfectly. Despite the great result, the task wasn't easy: the voice tracks were recorded both in quiet and noisy places, and the speakers read different spans of text to make the problem more challenging.
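As a rough illustration of this kind of pipeline, here is a minimal sketch. It assumes the features are MFCCs extracted with librosa and the classifier is a small Keras CNN; the actual feature set, architecture, and hyperparameters we used may differ, and the file paths are purely hypothetical.

```python
# Sketch: MFCC features + a small CNN speaker classifier (illustrative only).
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers, models

N_MFCC = 40        # number of MFCC coefficients (assumed)
MAX_FRAMES = 200   # pad/trim every clip to a fixed length (assumed)
N_SPEAKERS = 3     # the project distinguished three speakers

def audio_to_mfcc(path: str) -> np.ndarray:
    """Load an audio file and return a fixed-size MFCC matrix."""
    signal, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=N_MFCC)
    # Pad or trim along the time axis so every sample has the same shape.
    if mfcc.shape[1] < MAX_FRAMES:
        mfcc = np.pad(mfcc, ((0, 0), (0, MAX_FRAMES - mfcc.shape[1])))
    return mfcc[:, :MAX_FRAMES]

def build_model() -> tf.keras.Model:
    """Small 2D CNN over the MFCC 'image' (coefficients x time)."""
    model = models.Sequential([
        layers.Input(shape=(N_MFCC, MAX_FRAMES, 1)),
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(N_SPEAKERS, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage (hypothetical file list and labels):
# X = np.stack([audio_to_mfcc(p) for p in wav_paths])[..., np.newaxis]
# y = np.array(speaker_ids)  # integers 0, 1, 2
# build_model().fit(X, y, epochs=20, validation_split=0.2)
```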
The second task focused on Face Recognition. We proceeded the same way as in the first task, taking pictures of our faces under different conditions and with several facial expressions. Then we applied Transfer Learning with a pre-trained Convolutional Neural Network. Thanks to data augmentation and regularization, we achieved great results.
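The sketch below shows the general shape of such a transfer-learning setup: a frozen pre-trained backbone with a small classification head, plus on-the-fly augmentation and dropout as regularization. The specific backbone (MobileNetV2 pre-trained on ImageNet), the augmentation choices, and the directory layout are assumptions for illustration, not the exact configuration we used.

```python
# Sketch: transfer learning with a frozen ImageNet backbone (illustrative only).
import tensorflow as tf
from tensorflow.keras import layers, models

N_CLASSES = 3          # one class per team member (assumed)
IMG_SIZE = (160, 160)  # input resolution (assumed)

# Light data augmentation applied on the fly during training.
augment = models.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
])

base = tf.keras.applications.MobileNetV2(
    input_shape=IMG_SIZE + (3,), include_top=False, weights="imagenet")
base.trainable = False  # keep the pre-trained weights frozen

inputs = layers.Input(shape=IMG_SIZE + (3,))
x = augment(inputs)
x = tf.keras.applications.mobilenet_v2.preprocess_input(x)
x = base(x, training=False)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.3)(x)  # regularization
outputs = layers.Dense(N_CLASSES, activation="softmax")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Usage (hypothetical dataset directory, one subfolder per person):
# train_ds = tf.keras.utils.image_dataset_from_directory(
#     "faces/", image_size=IMG_SIZE, batch_size=32)
# model.fit(train_ds, epochs=10)
```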
The final task was oriented toward developing an application able to retrieve images from a database. The application receives a query image and returns the ten most similar images. The database contained 10,000 images of famous people, from actors to athletes. We selected a couple of models pre-trained on VGGFace, a well-known dataset containing thousands of people's faces, which gave us models whose parameters were already suited to the task. We chose the most accurate model by extracting a test set and computing precision and recall.
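To give an idea of the retrieval step itself, here is a minimal sketch that compares a query embedding against pre-computed database embeddings with cosine similarity and returns the ten closest matches. The embedding model (a CNN pre-trained on VGGFace) is treated as a black-box `embed` function; that name, the file names, and the shapes are hypothetical.

```python
# Sketch: top-10 image retrieval by cosine similarity over embeddings.
import numpy as np

def cosine_similarity(query: np.ndarray, database: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and every database row."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    return db @ q

def top_k(query_emb: np.ndarray, db_embs: np.ndarray, k: int = 10) -> np.ndarray:
    """Indices of the k most similar database images, best match first."""
    sims = cosine_similarity(query_emb, db_embs)
    return np.argsort(sims)[::-1][:k]

# Usage (hypothetical):
# db_embs = np.load("vip_embeddings.npy")       # shape (10000, d)
# query_emb = embed(load_image("my_face.jpg"))  # shape (d,)
# best_ten = top_k(query_emb, db_embs, k=10)
```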
Then, we used our own faces as queries to retrieve the ten VIPs most similar to us. The results were quite decent. If you wish, you can check them out on my GitHub page.