Graduation Year


Document Type




Degree Name

Doctor of Philosophy (Ph.D.)


Computer Science

Degree Granting Department

Computer Science and Engineering

Major Professor

Sudeep Sarkar, Ph.D.

Co-Major Professor

Rangachar Kasturi, Ph.D.

Committee Member

Rangachar Kasturi, Ph.D.

Committee Member

Yu Sun, Ph.D.

Committee Member

Andrew Raij, Ph.D.

Committee Member

Thomas Sanocki, Ph.D.


conditional distance, distance measure, gesture recognition, level building, one-shot, warp vector


Designing generalized data-driven distance measures for both ordered and unordered set data is the core focus of the proposed work. An ordered set is a set in which the time-linear property is maintained when computing the distance between a pair of temporal segments. One application of ordered sets is human gesture analysis from RGBD data. Human gestures are fast becoming a natural form of human-computer interaction, which motivates the modeling, analysis, and recognition of gestures. The large number of gesture categories, such as sign language, traffic signals, and everyday actions, together with subtle cultural variations within gesture classes, makes gesture recognition a challenging problem. As part of the generalization, an algorithm is also proposed for an unordered-set application, overlap speech detection.

Any gesture recognition task involves comparing an incoming (query) gesture against a training set of gestures. Having only one or a few samples per class precludes classification approaches that learn class statistics, because the full range of variation is not covered. Due to the large variability in gesture classes, temporally segmenting individual gestures also becomes hard. A matching algorithm in such scenarios needs to handle single-sample classes and to label multiple gestures without prior temporal segmentation.

Each gesture sequence is considered a class, and each class is a data point in an input space. The pattern of pair-wise distances between two gesture frame sequences, conditioned on a third (anchor) sequence, is referred to as a warp vector, and the resulting measure is called a conditional distance. At the algorithmic core are two dynamic time warping processes: one computes the warp vectors with respect to the anchor sequences, and the other compares these warp vectors. We show that a class-dependent distance function can disambiguate classification when the samples of different classes are close to each other. When the model base is large (that is, when the number of classes is large), the disadvantage of such a distance is its computational cost. A distributed version combined with sub-sampling of anchor gestures is proposed as a speedup strategy. To label multiple connected gestures in a query, we use a simultaneous segmentation and recognition matching algorithm called the level building algorithm, in its dynamic programming implementation. The core of this algorithm depends on a distance function that compares two gesture sequences. We propose replacing this distance function with the proposed conditional distance; the resulting version of level building is called conditional level building (clb). We present results on a large dataset of 8000 RGBD sequences spanning over 200 gesture classes, extracted from the ChaLearn Gesture Challenge dataset. The results show that the conditional distance gives a significant improvement over the underlying distance used to compute it.
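The two-stage use of dynamic time warping described above can be illustrated with a minimal Python sketch. The abstract does not give the exact definition of a warp vector, so the version here is an assumption made only for illustration: for each anchor frame, it stores the mean frame distance to the query frames aligned with it. The function names (`dtw`, `warp_vector`, `conditional_distance`) and the scalar frame features are likewise hypothetical and are not taken from the thesis.

```python
def dtw(a, b, dist):
    """Classic dynamic time warping.

    Returns the accumulated alignment cost and the warp path as a list of
    (index_in_a, index_in_b) pairs.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end of both sequences to recover the alignment.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


def warp_vector(seq, anchor, dist):
    """ASSUMED definition: one value per anchor frame, the mean distance to
    the frames of `seq` that DTW aligns with that anchor frame."""
    _, path = dtw(anchor, seq, dist)
    sums = [0.0] * len(anchor)
    counts = [0] * len(anchor)
    for i, j in path:
        sums[i] += dist(anchor[i], seq[j])
        counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]


def conditional_distance(x, y, anchor, dist):
    """Second DTW pass: compare the warp vectors of x and y, both computed
    against the same anchor sequence."""
    wx = warp_vector(x, anchor, dist)
    wy = warp_vector(y, anchor, dist)
    total, _ = dtw(wx, wy, lambda p, q: abs(p - q))
    return total


# Toy usage with one scalar feature per frame (hypothetical data).
frame_dist = lambda p, q: abs(p - q)
anchor = [0.0, 2.0]
print(conditional_distance([0.0, 1.0, 2.0], [5.0, 5.0, 5.0], anchor, frame_dist))
```

In this sketch the anchor plays the role of the third sequence: two queries are judged similar when they deviate from the anchor in the same way over time, rather than when they are directly close to each other.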

As an application to unordered sets and non-visual data, an overlap speech segment detection algorithm is proposed. Speech recognition systems have a wide variety of applications, but they fail when overlapping speech is involved. This is especially true in a meeting-room setting. The ability to recognize a speaker and localize him or her in the room is an important step towards a higher-level representation of the meeting dynamics. As in gesture recognition, a new distance function is defined, and it serves as the core of the algorithm for distinguishing between individual-speech and overlap-speech temporal segments. The overlap speech detection problem is framed as an outlier detection problem. Incoming audio is broken into temporal segments based on the Bayesian Information Criterion (BIC). Each of these segments is considered a node, and the conditional distance between the nodes is determined. The underlying distance for the triples used in the conditional distance is the symmetric KL distance. As each node is modeled as a Gaussian, the distance between two segments (nodes) is given by a Monte Carlo estimate of the KL distance. An MDS-based global embedding is created from the pairwise distances between the nodes, and RANSAC is applied to compute the outliers. The NIST meeting room data set is used for the overlap speech detection experiments. An improvement of more than 20% is achieved with the conditional-distance-based approach when compared to a KL-distance-based approach.
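As a rough illustration of the underlying distance mentioned above, the following Python sketch estimates the symmetric KL distance between two Gaussian segment models by Monte Carlo sampling. It assumes one-dimensional Gaussians given as hypothetical `(mean, std)` pairs; the features and model dimensionality used in the thesis are not stated in the abstract. A closed-form value is included only as a sanity check on the estimate.

```python
import math
import random


def log_pdf(x, mean, std):
    """Log density of a 1-D Gaussian N(mean, std^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - (x - mean) ** 2 / (2.0 * std * std)


def kl_monte_carlo(p, q, n_samples, rng):
    """Estimate KL(p || q) as the average of log p(x) - log q(x) over x ~ p."""
    mean_p, std_p = p
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(mean_p, std_p)
        total += log_pdf(x, *p) - log_pdf(x, *q)
    return total / n_samples


def symmetric_kl(p, q, n_samples=20000, seed=0):
    """Symmetric KL distance: KL(p || q) + KL(q || p), both estimated by sampling."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return kl_monte_carlo(p, q, n_samples, rng) + kl_monte_carlo(q, p, n_samples, rng)


def kl_closed_form(p, q):
    """Exact KL(p || q) for 1-D Gaussians, used here only for comparison."""
    (m1, s1), (m2, s2) = p, q
    return math.log(s2 / s1) + (s1 * s1 + (m1 - m2) ** 2) / (2.0 * s2 * s2) - 0.5


# Two hypothetical segment models, each a (mean, std) pair.
seg_a, seg_b = (0.0, 1.0), (1.0, 1.0)
print(symmetric_kl(seg_a, seg_b))
print(kl_closed_form(seg_a, seg_b) + kl_closed_form(seg_b, seg_a))
```

For single Gaussians the closed form makes sampling unnecessary; the sampling route is shown because it matches the description in the text and carries over to models without a closed-form KL.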