The IITG MV Phase-I database contains two sessions of speech data from 100 speakers, collected in an office environment. The following three types of variability were considered while collecting the speech data:

  • Multi-sensor: Speech data were recorded over five different sensors simultaneously.
  • Multi-lingual: Every speaker spoke in two different languages, namely, English and his/her favorite Indian language.
  • Multi-style: Every speaker spoke in reading and conversational styles.

One of the faculty chambers (measuring 15 × 12 × 12 feet) was used as the office environment. Apart from the air conditioner and a fan, no other electrical equipment was operating in the room. Both the windows and the doors were kept closed during recording to prevent external noise from entering the room, the idea being to capture the reverberation and ambient noise of the room effectively. The following recording sensors were used:

  • Headset microphone: The headset microphone that comes with most personal computers (PCs) was used as one sensor. It has wideband characteristics with a flat response up to 16 kHz, enabling speech data to be collected at a 16 kHz sampling frequency with 16 bits per sample resolution. The headset microphone is omni-directional. It was mounted close to the speaker, so that it captured the cleanest speech of all the sensors.
  • In-built microphone: The in-built microphone of a Tablet PC was used as another sensor. It also has wideband characteristics and is likewise omni-directional. Speech data from this sensor were also collected at a 16 kHz sampling frequency with 16 bits per sample resolution.
  • Digital voice recorder: A digital voice recorder was also used for recording the data. Its sensor has wideband characteristics (at least 22 kHz) and is omni-directional. It was the most sensitive of all the sensors and recorded two-channel (stereo) audio in MP3 format at a 44 kHz sampling frequency with 16 bits per sample resolution. After the data were transferred from the recorder to the PC, the MP3 files were converted to mono WAV format at a 16 kHz sampling frequency with 16 bits per sample resolution.
  • Mobile phones in offline mode: Two mobile phones of different makes were used for recording the data. Both phones have voice-recorder software that stores files in AMR format at an 8 kHz sampling frequency with 16 bits per sample resolution. The phones were placed on the table at a distance of 2-3 feet from the subject. After being transferred to the PC, the AMR files were converted to WAV format.
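The document does not state which tool was used for the stereo-to-mono, 44 kHz-to-16 kHz WAV conversion; in practice this is typically done with a utility such as SoX or FFmpeg. As an illustrative sketch of the same transformation, the following self-contained Python function (using only the standard-library `wave` and `struct` modules, with simple channel averaging and linear-interpolation resampling; the function name and paths are hypothetical) converts a 16-bit stereo WAV file to 16 kHz mono:

```python
import struct
import wave


def convert_to_16k_mono(in_path, out_path, target_rate=16000):
    """Convert a 16-bit PCM WAV file to 16 kHz mono.

    Channels are averaged to mono and the signal is resampled by
    linear interpolation. This is an illustrative sketch; production
    conversion (as presumably done for this corpus) would use a tool
    with a proper anti-aliasing low-pass filter, e.g. SoX or FFmpeg.
    """
    with wave.open(in_path, "rb") as src:
        n_ch = src.getnchannels()
        rate = src.getframerate()
        n_frames = src.getnframes()
        raw = src.readframes(n_frames)

    # Unpack interleaved 16-bit little-endian samples.
    samples = struct.unpack("<%dh" % (n_frames * n_ch), raw)

    # Average the channels of each frame to obtain a mono signal.
    mono = [sum(samples[i * n_ch:(i + 1) * n_ch]) // n_ch
            for i in range(n_frames)]

    # Resample to the target rate by linear interpolation.
    ratio = rate / target_rate
    out_len = int(n_frames * target_rate / rate)
    out = []
    for j in range(out_len):
        pos = j * ratio
        i = int(pos)
        frac = pos - i
        a = mono[i]
        b = mono[min(i + 1, n_frames - 1)]
        out.append(int(a + (b - a) * frac))

    with wave.open(out_path, "wb") as dst:
        dst.setnchannels(1)       # mono
        dst.setsampwidth(2)       # 16 bits per sample
        dst.setframerate(target_rate)
        dst.writeframes(struct.pack("<%dh" % out_len, *out))
```

The decoding steps that precede this (MP3 and AMR to WAV) require external codecs and are not shown.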

    Figure-1: Placement of sensors for Phase-I data recording.

    Figure-2: Snapshot depicting the recording scenario for Phase-I data collection.

    Figures 1 and 2 show the placement of the sensors and the recording setup in the office environment, respectively. The subjects included students, staff, and faculty members from IIT Guwahati in the age group of 20-40 years. About 3-5 minutes of reading-style speech from an English passage was collected first. This was followed by two conversational-style recordings of about 6-8 minutes each, one in English and one in the speaker's favorite language, which in most cases was the mother tongue. A facilitator was present throughout to direct the subject and to converse with him or her during recording. The second recording session for each speaker was conducted after a gap of around one week. Please refer to the IITG DIT MV database documentation for more details.