Audio and Face Recognition
One of the most interesting and modern projects I worked on as a Data Scientist is the one I discuss here. The project is divided into three tasks, each concerning the creation of an application of the kind we deal with daily. Every day most people on Earth use devices driven by Data Science solutions such as artificial intelligence or machine learning models. For example, when you unlock your smartphone with your face, the device relies on an application powered by a model trained to detect and recognize your face. The same concept applies to audio recognition. The first two tasks of the project covered these two examples, while the last one focused on image retrieval.
I worked alongside two Data Scientists to develop these three applications.
The first one, Audio Recognition, required creating a program able to recognize the speaker among three people. After the data collection phase, in which we gathered voice samples, we extracted features from each audio track to obtain a numerical representation to feed the model. A convolutional neural network allowed us to predict the speaker almost perfectly. Despite the great result, the task wasn't easy: the voice tracks were recorded both in quiet and noisy places, and the speakers read different spans of text to make the problem more challenging.
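As a rough illustration of this kind of pipeline, here is a minimal sketch. It assumes the features are MFCCs extracted with librosa and the classifier is a small Keras CNN; the actual feature set, architecture, and hyperparameters we used may differ, and the file paths are purely hypothetical.

```python
# Sketch: MFCC features + a small CNN speaker classifier (illustrative only).
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers, models

N_MFCC = 40        # number of MFCC coefficients (assumed)
MAX_FRAMES = 200   # pad/trim every clip to a fixed length (assumed)
N_SPEAKERS = 3     # the project distinguished three speakers

def audio_to_mfcc(path: str) -> np.ndarray:
    """Load an audio file and return a fixed-size MFCC matrix."""
    signal, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=N_MFCC)
    # Pad or trim along the time axis so every sample has the same shape.
    if mfcc.shape[1] < MAX_FRAMES:
        mfcc = np.pad(mfcc, ((0, 0), (0, MAX_FRAMES - mfcc.shape[1])))
    return mfcc[:, :MAX_FRAMES]

def build_model() -> tf.keras.Model:
    """Small 2D CNN over the MFCC 'image' (coefficients x time)."""
    model = models.Sequential([
        layers.Input(shape=(N_MFCC, MAX_FRAMES, 1)),
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(N_SPEAKERS, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage (hypothetical file list and labels):
# X = np.stack([audio_to_mfcc(p) for p in wav_paths])[..., np.newaxis]
# y = np.array(speaker_ids)  # integers 0, 1, 2
# build_model().fit(X, y, epochs=20, validation_split=0.2)
```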
The second task focused on Face Recognition. We proceeded the same way as in the first task, taking pictures of our faces under different conditions and with several facial expressions. Then we applied Transfer Learning with a pre-trained Convolutional Neural Network. Thanks to data augmentation and regularization, we achieved great results.
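The sketch below shows the general shape of such a transfer-learning setup: a frozen pre-trained backbone with a small classification head, plus on-the-fly augmentation and dropout as regularization. The specific backbone (MobileNetV2 pre-trained on ImageNet), the augmentation choices, and the directory layout are assumptions for illustration, not the exact configuration we used.

```python
# Sketch: transfer learning with a frozen ImageNet backbone (illustrative only).
import tensorflow as tf
from tensorflow.keras import layers, models

N_CLASSES = 3          # one class per team member (assumed)
IMG_SIZE = (160, 160)  # input resolution (assumed)

# Light data augmentation applied on the fly during training.
augment = models.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
])

base = tf.keras.applications.MobileNetV2(
    input_shape=IMG_SIZE + (3,), include_top=False, weights="imagenet")
base.trainable = False  # keep the pre-trained weights frozen

inputs = layers.Input(shape=IMG_SIZE + (3,))
x = augment(inputs)
x = tf.keras.applications.mobilenet_v2.preprocess_input(x)
x = base(x, training=False)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.3)(x)  # regularization
outputs = layers.Dense(N_CLASSES, activation="softmax")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Usage (hypothetical dataset directory, one subfolder per person):
# train_ds = tf.keras.utils.image_dataset_from_directory(
#     "faces/", image_size=IMG_SIZE, batch_size=32)
# model.fit(train_ds, epochs=10)
```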
The final task was oriented toward developing an application able to retrieve images from a database. The application receives a query image and returns the ten most similar images. The database contained 10,000 images of famous people, from actors to athletes. We selected a couple of models pre-trained on VGGFace, a well-known dataset containing thousands of people's faces, which gave us models whose parameters were already suited to the task. We chose the most accurate model by extracting a test set and computing precision and recall.
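To give an idea of the retrieval step itself, here is a minimal sketch that compares a query embedding against pre-computed database embeddings with cosine similarity and returns the ten closest matches. The embedding model (a CNN pre-trained on VGGFace) is treated as a black-box `embed` function; that name, the file names, and the shapes are hypothetical.

```python
# Sketch: top-10 image retrieval by cosine similarity over embeddings.
import numpy as np

def cosine_similarity(query: np.ndarray, database: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and every database row."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    return db @ q

def top_k(query_emb: np.ndarray, db_embs: np.ndarray, k: int = 10) -> np.ndarray:
    """Indices of the k most similar database images, best match first."""
    sims = cosine_similarity(query_emb, db_embs)
    return np.argsort(sims)[::-1][:k]

# Usage (hypothetical):
# db_embs = np.load("vip_embeddings.npy")       # shape (10000, d)
# query_emb = embed(load_image("my_face.jpg"))  # shape (d,)
# best_ten = top_k(query_emb, db_embs, k=10)
```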
Then, we used our own faces as queries to retrieve the ten VIPs most similar to us. The results were quite decent. If you wish, you can check them out on my GitHub page.