Graduation Year

2008

Document Type

Dissertation

Degree

Ph.D.

Degree Granting Department

Computer Science and Engineering

Major Professor

Rangachar Kasturi, Ph.D.

Co-Major Professor

Sudeep Sarkar, Ph.D.

Keywords

Speaker localization, Speaker diarization, Audio-visual association, Meeting indexing, Multimedia analysis

Abstract

This dissertation documents research on the problems of localization, diarization and indexing in meeting archives. It surveys existing work in these areas, identifies opportunities for improvement and proposes novel solutions for each of these problems. The framework resulting from this dissertation enables various kinds of queries, such as identifying the participants of a meeting, finding all meetings for a particular participant, locating a particular individual in the video, and retrieving all instances of speech from a particular individual. Moreover, since the proposed solutions are computationally efficient, require no training and use little domain knowledge, they can be easily ported to other domains of multimedia analysis. Speaker diarization involves determining the number of distinct speakers in an audio recording and identifying the durations when each of them spoke.

We propose novel solutions for the segmentation and clustering sub-tasks, based on graph spectral clustering. The resulting system yields a diarization error rate of around 20%, a relative improvement of 16% over the currently popular diarization technique based on hierarchical clustering. The most significant contribution of this work lies in performing speaker localization using only a single camera and a single microphone, by exploiting long-term audio-visual co-occurrence. Our novel computational model allows identifying regions in the image belonging to the speaker even when the speaker's face is non-frontal and even when the speaker is only partially visible. This approach achieves a hit ratio of 73.8%, compared to 52.6% for a mutual information (MI) based approach, illustrating its suitability for the meeting domain.
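The graph spectral clustering idea above can be sketched in miniature: segments are nodes of a similarity graph, the graph Laplacian's leading eigenvectors give a low-dimensional embedding, and clustering in that embedding groups segments by speaker. This is a minimal illustrative sketch in NumPy, not the dissertation's actual system; the Gaussian affinity, the feature vectors and the simple k-means step are all assumptions made for the example.

```python
import numpy as np

def spectral_speaker_clusters(features, n_speakers, sigma=1.0):
    """Group audio segments by speaker via graph spectral clustering.

    features: (n_segments, d) array of per-segment audio feature vectors
    (hypothetical embeddings; the dissertation's actual features differ).
    """
    # Gaussian affinity between every pair of segments (graph edge weights)
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
    deg = W.sum(axis=1)
    Dm = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = np.eye(len(W)) - Dm @ W @ Dm
    # Embed each segment using the eigenvectors of the smallest eigenvalues
    _, vecs = np.linalg.eigh(L)
    emb = vecs[:, :n_speakers]
    emb = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12)
    # Deterministic farthest-point initialization, then plain k-means
    centers = emb[[0]]
    while len(centers) < n_speakers:
        nearest = ((emb[:, None] - centers[None]) ** 2).sum(-1).min(1)
        centers = np.vstack([centers, emb[nearest.argmax()]])
    for _ in range(50):
        labels = ((emb[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for k in range(n_speakers):
            if (labels == k).any():
                centers[k] = emb[labels == k].mean(axis=0)
    return labels
```

In a full diarization pipeline the number of speakers would itself be estimated (e.g. from the eigenvalue spectrum); here it is passed in for simplicity.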

The third problem addresses indexing meeting archives to enable retrieving, in a query-by-example framework, all segments from the archive during which a particular individual speaks. By performing audio-visual association and clustering, a target cluster is generated per individual, containing multiple multimodal samples for that individual, to which a query sample is matched. The use of multiple samples results in a retrieval precision of 92.6% at 90% recall, compared to a precision of 71% at the same recall achieved by a unimodal, single-sample system.
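The retrieval step above can be illustrated with a minimal sketch: each individual's target cluster holds several samples, and a query is assigned to the individual whose nearest cluster member is closest. This is an assumption-laden toy, not the dissertation's system; the feature vectors and the Euclidean nearest-neighbor matching rule are placeholders for the actual multimodal features and distance measure.

```python
import numpy as np

def match_query(query, target_clusters):
    """Match a query sample against per-individual target clusters.

    target_clusters: dict mapping an individual's id to an (n_i, d) array
    of that individual's samples (hypothetical multimodal features).
    Because each cluster holds multiple samples, one representative sample
    close to the query suffices for a correct match.
    """
    best_id, best_dist = None, np.inf
    for person, samples in target_clusters.items():
        dist = np.linalg.norm(samples - query, axis=1).min()
        if dist < best_dist:
            best_id, best_dist = person, dist
    return best_id
```

A single-sample system would store one vector per individual, so an atypical stored sample (poor lighting, background noise) hurts every query; matching against the whole cluster is what drives the precision gain reported above.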
