Design, implementation and evaluation of an acoustic source localization system using Deep Learning techniques
Authors
Vera Díaz, Juan Manuel
Director
Pizarro Pérez, Daniel
Date
2019
Keywords
Acoustic source localization
Microphone arrays
Deep Learning
CNN (Convolutional Neural Network)
Document type
info:eu-repo/semantics/masterThesis
Version
info:eu-repo/semantics/acceptedVersion
Rights
Attribution-NonCommercial-NoDerivatives 4.0 International
Access rights
info:eu-repo/semantics/openAccess
Abstract
This Master's Thesis presents a novel approach for indoor acoustic source localization using microphone
arrays, based on a Convolutional Neural Network (CNN) that we call the ASLNet. It directly estimates
the three-dimensional position of a single acoustic source using as inputs the raw audio signals from a set
of microphones. We use supervised learning methods to train our network end-to-end. The amount of
labeled training data available for this problem is, however, small. This Thesis presents a training strategy
based on two steps that mitigates this problem. We first train our network using semi-synthetic data
generated from close-talk speech recordings and a mathematical model for signal propagation from the
source to the microphones. The amount of semi-synthetic data can be virtually as large as needed. We
then fine-tune the resulting network using a small amount of real data. Our experimental results, evaluated
on a publicly available dataset recorded in a real room, show that this approach improves on existing
localization methods based on SRP-PHAT strategies and also on those presented in very recent proposals
based on Convolutional Recurrent Neural Networks (CRNN). In addition, our experiments show that the
performance of the ASLNet does not show a relevant dependency on the speaker’s gender, nor on the
size of the signal window being used. This work also investigates methods to improve the generalization
properties of our network using only semi-synthetic data for training. This is a highly important objective
due to the cost of labeling localization data. We proceed by including specific effects in the input signals
to force the network to be insensitive to multipath, high noise and distortion likely to be present in real
scenarios. We obtain promising results with this strategy, although they still lag behind strategies based
on fine-tuning.
Files in this item
Files | Size | Format
---|---|---
TFM_Vera_Diaz_2019.pdf | 2.763Mb | PDF