This is the repository for the method presented in the paper "Language Modeling for Sound Event Detection with Teacher Forcing and Scheduled Sampling", by K. Drossos, S. Gharib, P. Magron, and T. Virtanen.
Our paper was presented at the Detection and Classification of Acoustic Scenes and Events (DCASE) Workshop 2019. You can find an online version of our paper on arXiv: https://arxiv.org/abs/1907.08506
If you use our method, please cite our paper.
You can get the version of the code used in the paper from
- Method introduction
- Dependencies, pre-requisites, and setting up the project
- Using SEDLM
- Acknowledgements
Sound event detection (SED) is the task of identifying activities of sound events from
short time representations of audio. For example, given an audio feature vector that
is extracted from 0.04 seconds, a SED method should identify the activities of different
sound events in this vector. Usually, SED is applied over a sequence of short time audio
feature vectors and the identification of activities of sound events is performed for
every input feature vector. That is, as an input is given a matrix
,
with T
and F
to be the amount of feature vectors and features, respectively, the output is the matrix
,
which holds the predictions for each of the C
classes at every t
feature vector.
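To make the shapes concrete, here is a minimal numpy sketch of the input and output described above; the values of `T`, `F`, and `C` are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Illustrative sizes only (assumptions, not values from the paper).
T, F, C = 1024, 40, 16  # feature vectors, features per vector, sound event classes

x = np.random.rand(T, F).astype(np.float32)      # input matrix X, one feature vector per row
y_hat = np.random.rand(T, C).astype(np.float32)  # output matrix Y-hat, one prediction vector per row

# There is one prediction of the C class activities for every input feature vector.
assert x.shape[0] == y_hat.shape[0] == T
```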
In real-life recordings, the various sound events are likely to have temporal structures within and across events. For instance, a "footsteps" event might be repeated with pauses in between (intra-event structure). On the other hand, a "car horn" is likely to follow or precede the "car passing by" sound event (inter-event structure). Such temporal structures are exploited in other machine learning tasks, for example in machine translation, image captioning, and speech recognition. In these tasks, the developed method also learns a model of the temporal associations of the targeted classes; these associations are usually termed a language model.
SED methods can benefit from a language model, and this is exactly what the method in this repository does: it takes advantage of a language model for SED.
In order to take advantage of the above mentioned temporal structures, we use the teacher forcing technique [1]. Teacher forcing is the conditioning of the input to an RNN on the activities of the sound events at the previous time step. That is,

$$\mathbf{h}_t = \text{RNN}([\mathbf{x}_t; \mathbf{y}_{t-1}]),$$

where $\mathbf{h}_t$ is the output of the RNN at time-step $t$, $\mathbf{x}_t$ is the input to the RNN (from a previous layer) at time-step $t$, and $\mathbf{y}_{t-1}$ holds the activities of the sound events at time-step $t-1$.
If the ground truth values are used as $\mathbf{y}_{t-1}$, then the RNN will not be robust to cases where $\mathbf{y}_{t-1}$ is not a correct class activity, for example in the testing process, where there are no ground truth values.
If the predictions of the classifier are used as $\mathbf{y}_{t-1}$, then the RNN will have a difficult time learning any dependencies between the sound events, because during training (and especially at the beginning of the training process) it will be fed incorrect class activities.
To tackle both of the above, we employ the scheduled sampling technique [2]. That is, at the beginning of the training we use the ground truth values as $\mathbf{y}_{t-1}$. As the training proceeds and the classifier learns to predict more and more correct class activities, we gradually employ the predictions of the classifier as $\mathbf{y}_{t-1}$.
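The following is a minimal PyTorch sketch of this idea, assuming a single recurrent cell and a per-step coin flip for the sampling decision; the module, the layer sizes, and the sampling policy are illustrative assumptions and not the actual SEDLM implementation.

```python
import torch
import torch.nn as nn


class TeacherForcedRNN(nn.Module):
    """Sketch of an RNN conditioned on the previous class activities."""

    def __init__(self, feature_dim: int, hidden_dim: int, n_classes: int) -> None:
        super().__init__()
        # The RNN input is the feature vector of the current time step
        # concatenated with the class activities of the previous one.
        self.rnn = nn.GRUCell(feature_dim + n_classes, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, n_classes)

    def forward(self, x: torch.Tensor, y_true: torch.Tensor, tf_prob: float) -> torch.Tensor:
        """x: (batch, T, feature_dim), y_true: (batch, T, n_classes)."""
        batch_size, n_steps, _ = x.shape
        h = x.new_zeros(batch_size, self.rnn.hidden_size)
        y_prev = x.new_zeros(batch_size, self.classifier.out_features)
        outputs = []
        for t in range(n_steps):
            h = self.rnn(torch.cat([x[:, t, :], y_prev], dim=-1), h)
            y_hat = torch.sigmoid(self.classifier(h))
            outputs.append(y_hat)
            # Scheduled sampling: with probability tf_prob use the ground
            # truth as y_{t-1} (teacher forcing), otherwise feed back the
            # binarized prediction of the classifier.
            if self.training and torch.rand(1).item() < tf_prob:
                y_prev = y_true[:, t, :]
            else:
                y_prev = (y_hat > 0.5).float().detach()
        return torch.stack(outputs, dim=1)
```

During training, `tf_prob` would start near 1 and decay towards 0 according to the chosen schedule, so the model is progressively exposed to its own predictions.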
[1] R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural Computation, vol. 1, no. 2, pp. 270–280, June 1989.
[2] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 1, ser. NIPS'15. Cambridge, MA, USA: MIT Press, 2015, pp. 1171–1179. [Online]. Available: http://dl.acm.org/citation.cfm?id=2969239.2969370
To start using our project, you have to:
- Use Python 3.6. The code in this repository is tested and works with Python 3.6. Other Python 3.X versions will probably work too, but please keep in mind that this code is meant for Python 3.6.
- Set up the dependencies using either the `pip` (`pip_requirements.txt`) or the `conda` (`conda_requirements.txt`) file. Navigate with your terminal inside the root directory of the project (i.e. the directory that is created after cloning this repository) and then issue the proper command at the terminal:
  - pip: To set up the dependencies with `pip`, use: `$ pip install -r requirements/pip_requirements.txt`
  - conda: To set up the dependencies with `conda`, you can issue the command: `$ conda install --yes --file requirements/conda_requirements.txt`
- Download the audio data. You can download the three audio datasets from:
  - The TUT-SED Synthetic 2016 dataset is available here. Download the audio files (i.e. Audio 1/5, Audio 2/5, ..., Audio 5/5), do your feature extraction, and follow the instructions in the Data Set-up section.
  - The TUT Sound Events 2016 dataset is available here. Download the audio files, do your feature extraction, and follow the instructions in the Data Set-up section.
  - The TUT Sound Events 2017 dataset is available here. Download the audio files, do your feature extraction, and follow the instructions in the Data Set-up section.
Now the project is set up and you can use it with the data that you got from step 3.
You can use SEDLM directly on your data, check the code and adapt SEDLM to your own SED task, or repeat the process described in our paper.
SEDLM code is based on PyTorch, version 1.1.0.
In its current form, the different variables of the code are specified in a YAML file holding all the settings for the code. All the YAML files are in the `settings` directory, and the YAML loading function searches the `settings` directory for YAML files. In general, you can just alter the values of the settings in the YAML file and then run the code.
The data have to be in the `data` directory.

If you want to use the existing data loaders, then you have to have your data organized in a specific way. First of all, you have to have different files for the input features and the target values, for example `input_features.npy` and `target_values.npy`. Then, depending on the dataset that you will use, you have to have your data in different directories. That is:
- TUTSED Synthetic 2016

  The data have to be in a directory called `synthetic`, inside the `data` directory; that is, `data/synthetic`. Then, the training, validation, and testing data each have to be in a different directory:

  ```
  data/synthetic/training
  data/synthetic/validation
  data/synthetic/testing
  ```

  You have to have different numpy files for the input features and the target values. You can specify the name of each of the input or target files in the YAML settings file. For example, the training files should be like:

  ```
  data/synthetic/training/input_features.npy
  data/synthetic/training/target_values.npy
  ```

  The code will load the numpy files and use them for training the SEDLM method. You have to make sure, though, that the input features and the target values are properly aligned; that is, the first element of the input features corresponds to the first element of the target values (see the first sketch after this list).
- TUT Real Life 2016

  The data have to be in a directory called `real_life_2016`, inside the `data` directory; that is, `data/real_life_2016`. Then, the files for each of the folds have to be in a different directory:

  ```
  data/real_life_2016/fold_1
  data/real_life_2016/fold_2
  data/real_life_2016/fold_3
  data/real_life_2016/fold_4
  ```

  You have to have different pickle files for the input features and the target values, and for the training and the testing of each fold. Since there are multiple files per scene and per fold, you cannot have all the features in a single numpy array. Thus, you have to have all the data in a list and serialize (i.e. store to disk) that list using the pickle package. Also, there are separate files for training and testing in each fold.

  For convenience, SEDLM automatically uses a prefix for the file names; that is, it automatically adds "train" and "test" to the specified file name. You can specify the name of each of the input or target files in the YAML settings file. For example, the files should be like:

  ```
  input_features.p
  target_values.p
  ```

  Then, the SEDLM code will search for the proper files for each scene and fold. For example, for fold 1 and the home scene, the following files will be sought:

  ```
  data/real_life_2016/home/fold_1/train_input_features.p
  data/real_life_2016/home/fold_1/train_target_values.p
  data/real_life_2016/home/fold_1/test_input_features.p
  data/real_life_2016/home/fold_1/test_target_values.p
  ```

  The code will load the pickle files and use them for training the SEDLM method. You have to make sure, though, that the input features and the target values are properly aligned; that is, the first element of the input features corresponds to the first element of the target values (see the second sketch after this list).
- TUT Real Life 2017

  The data have to be in a directory called `real_life_2017`, inside the `data` directory; that is, `data/real_life_2017`. Then, the files for each of the folds have to be in a different directory:

  ```
  data/real_life_2017/fold_1
  data/real_life_2017/fold_2
  data/real_life_2017/fold_3
  data/real_life_2017/fold_4
  ```

  You have to have different pickle files for the input features and the target values, and for the training and the testing of each fold. Since there are multiple files per fold, you cannot have all the features in a single numpy array. Thus, you have to have all the data in a list and serialize (i.e. store to disk) that list using the pickle package. Also, there are separate files for training and testing in each fold.

  For convenience, SEDLM automatically uses a prefix for the file names; that is, it automatically adds "train" and "test" to the specified file name. You can specify the name of each of the input or target files in the YAML settings file. For example, the files should be like:

  ```
  input_features.p
  target_values.p
  ```

  Then, the SEDLM code will search for the proper files for each fold. For example, for fold 1, the following files will be sought:

  ```
  data/real_life_2017/fold_1/train_input_features.p
  data/real_life_2017/fold_1/train_target_values.p
  data/real_life_2017/fold_1/test_input_features.p
  data/real_life_2017/fold_1/test_target_values.p
  ```

  The code will load the pickle files and use them for training the SEDLM method. You have to make sure, though, that the input features and the target values are properly aligned; that is, the first element of the input features corresponds to the first element of the target values (again, see the second sketch after this list).
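As a first sketch, referenced from the TUTSED Synthetic 2016 item above, here is a minimal example of writing the two numpy files; the shapes and values are placeholder assumptions, and the actual feature extraction is up to you.

```python
import os

import numpy as np

# Placeholder arrays with assumed shapes: (examples, time steps, features)
# for the input and (examples, time steps, classes) for the targets.
features = np.random.rand(100, 1024, 40).astype(np.float32)
targets = (np.random.rand(100, 1024, 16) > 0.5).astype(np.float32)

# The file names must match the ones specified in the YAML settings file,
# and the two arrays must be aligned (element i of one corresponds to
# element i of the other).
os.makedirs('data/synthetic/training', exist_ok=True)
np.save('data/synthetic/training/input_features.npy', features)
np.save('data/synthetic/training/target_values.npy', targets)
```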
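As a second sketch, referenced from the two TUT Real Life items above, here is a minimal example of serializing the per-fold lists with pickle; again, the shapes, the number of classes, and the example paths are assumptions.

```python
import os
import pickle

import numpy as np

# Placeholder lists with one entry per audio file; the entries can have
# different lengths, which is why a list is used instead of a single array.
train_features = [np.random.rand(1024, 40), np.random.rand(512, 40)]
train_targets = [np.random.rand(1024, 18), np.random.rand(512, 18)]

# Example path for fold 1 of the home scene of TUT Real Life 2016; for
# TUT Real Life 2017 there is no scene directory.
out_dir = 'data/real_life_2016/home/fold_1'
os.makedirs(out_dir, exist_ok=True)

# Note the "train_"/"test_" prefixes that SEDLM adds to the base names
# given in the YAML settings file (here, input_features.p and target_values.p).
with open(os.path.join(out_dir, 'train_input_features.p'), 'wb') as f:
    pickle.dump(train_features, f)
with open(os.path.join(out_dir, 'train_target_values.p'), 'wb') as f:
    pickle.dump(train_targets, f)
```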
The hyper-parameters can be tuned from the YAML settings files. The available hyper-parameters for tuning are:
- Number of CNN channels
- Dropout for the CNNs and the RNN
- Scheduled sampling parameters
- Learning rate of the Adam optimizer
- Batch size
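For instance, a quick way to check the current values is to load a settings file with PyYAML; the file name and the keys below are hypothetical, so consult the actual files in the `settings` directory for the real names.

```python
import yaml

# Hypothetical settings file and keys; use the actual YAML files found
# in the `settings` directory of the project.
with open('settings/example_settings.yaml') as f:
    settings = yaml.safe_load(f)

print(settings['batch_size'])     # hypothetical key
print(settings['learning_rate'])  # hypothetical key
```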
You can run the system using a bash script. Examples of such scripts are the files:
- `example_bash_script_baseline.sh`, which runs the baseline configuration of SEDLM, and
- `example_bash_script_tf.sh`, which runs SEDLM with the TUT Real Life 2017 dataset.
- Part of the computations leading to these results was performed on a TITAN-X GPU donated by NVIDIA to K. Drossos.
- The authors wish to acknowledge CSC-IT Center for Science, Finland, for computational resources.
- The research leading to these results has received funding from the European Research Council under the European Union’s H2020 Framework Programme through ERC Grant Agreement 637422 EVERYSOUND.
- P. Magron is supported by the Academy of Finland, project no. 290190.