Paper: ICASSP (2008) “Discriminative Feature Selection for Hidden Markov Models using Segmental Boosting”

April 3rd, 2008 Irfan Essa Posted in Face and Gesture, James Rehg, Numerical Machine Learning, PAMI/ICCV/CVPR/ECCV, Papers, Pei Yin, Thad Starner No Comments »

Pei Yin, Irfan Essa, James Rehg, Thad Starner (2008) “Discriminative Feature Selection for Hidden Markov Models using Segmental Boosting”, ICASSP 2008 - March 30 - April 4, 2008 - Las Vegas, Nevada, U.S.A. (Paper: MLSP-P3.D8, Session: Pattern Recognition and Classification II, Time: Thursday, April 3, 15:30 - 17:30, Topic: Machine Learning for Signal Processing: Learning Theory and Modeling) (PDF|Project Site)

ABSTRACT

icassp08We address the feature selection problem for hidden Markov models (HMMs) in sequence classification. Temporal correlation in sequences often causes difficulty in applying feature selection techniques. Inspired by segmental k-means segmentation (SKS), we propose Segmentally Boosted HMMs (SBHMMs), where the state-optimized features are constructed in a segmental and discriminative manner. The contributions are twofold. First, we introduce a novel feature selection algorithm, where the temporal dynamics are decoupled from the static learning procedure by assuming that the sequential data are piecewise independent and identically distributed. Second, we show that the SBHMM consistently improves traditional HMM recognition in various domains. The reduction of error compared to traditional HMMs ranges from 17% to 70% in American Sign Language recognition, human gait identification, lip reading, and speech recognition.

AddThis Social Bookmark Button

Paper: IEEE CVPR (2004) “Asymmetrically boosted HMM for speech reading”

June 2nd, 2004 Irfan Essa Posted in James Rehg, Papers, Pei Yin No Comments »

Asymmetrically boosted HMM for speech reading (IEEEXplore#)

Pei Yin Essa, I. Rehg, J.M.
GVU Center, Georgia Inst. of Technol., Atlanta, GA, USA
This paper appears in: Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on
Publication Date: 27 June-2 July 2004
Volume: 2
On page(s): II-755 - II-761 Vol.2
Number of Pages: 2001
ISSN: 1063-6919
ISBN: 0-7695-2158-4
INSPEC Accession Number:8161546
Digital Object Identifier: 10.1109/CVPR.2004.1315240
Posted online: 2004-07-19 11:09:26.0

Abstract

Speech reading, also known as lip reading, is aimed at extracting visual cues of lip and facial movements to aid in recognition of speech. The main hurdle for speech reading is that visual measurements of lip and facial motion lack information-rich features like the Mel frequency cepstral coefficients (MFCC), widely used in acoustic speech recognition. These MFCC are used with hidden Markov models (HMM) in most speech recognition systems at present. Speech reading could greatly benefit from automatic selection and formation of informative features from measurements in the visual domain. These new features can then be used with HMM to capture the dynamics of lip movement and eventual recognition of lip shapes. Towards this end, we use AdaBoost methods for automatic visual feature formation. Specifically, we design an asymmetric variant of AdaBoost M2 algorithm to deal with the ill-posed multi-class sample distribution inherent in our problem. Our experiments show that the boosted HMM approach outperforms conventional AdaBoost and HMM classifiers. Our primary contributions are in the design of (a) boosted HMM and (b) asymmetric multi-class boosting.

AddThis Social Bookmark Button

Paper: Asilomar Conference (2003) “Boosted audio-visual HMM for speech reading”

November 9th, 2003 Irfan Essa Posted in Face and Gesture, James Rehg, Numerical Machine Learning, Papers, Pei Yin No Comments »

Boosted audio-visual HMM for speech reading (IEEEXplore)

Yin, P. Essa, I. Rehg, J.M.
GVU Center, Georgia Inst. of Technol., Atlanta, GA, USA
This paper appears in: Signals, Systems and Computers, 2003. Conference Record of the Thirty-Seventh Asilomar Conference on
Publication Date: 9-12 Nov. 2003
Volume: 2
On page(s): 2013 - 2018 Vol.2
Number of Pages: 2361
ISSN:
ISBN: 0-7803-8104-1
INSPEC Accession Number:8555396
Digital Object Identifier: 10.1109/ACSSC.2003.1292334
Posted online: 2004-05-04 13:54:35.0
Abstract
We propose a new approach for combining acoustic and visual measurements to aid in recognizing lip shapes of a person speaking. Our method relies on computing the maximum likelihoods of (a) HMM used to model phonemes from the acoustic signal, and (b) HMM used to model visual features motions from video. One significant addition in this work is the dynamic analysis with features selected by AdaBoost, on the basis of their discriminant ability. This form of integration, leading to boosted HMM, permits AdaBoost to find the best features first, and then uses HMM to exploit dynamic information inherent in the signal.

AddThis Social Bookmark Button

Funding: NSF/ITR (2002) “Analysis of Complex Audio-Visual Events Using Spatially Distributed Sensors”

October 1st, 2002 Irfan Essa Posted in Funding, James Rehg No Comments »

Award#0205507 - ITR: Analysis of Complex Audio-Visual Events Using Spatially Distributed Sensors
ABSTRACT

We propose to develop a comprehensive framework for the joint analysis of audio-visual signals obtained from spatially distributed microphones and cameras. We desire solutions to the audio-visual sensing problem that will scale to an arbitrary number of cameras and microphones and can address challenging environments in which there are multiple speech and nonspeech sound sources and multiple moving people and objects. Recently it has become relatively inexpensive to deploy tens or even hundreds of cameras and microphones in an environment. Many applications could benefit from ability to sense in both modalities. There are two levels at which joint audio-visual analysis can take place. At the signal level, the challenge is to develop representations that capture the rich dependency structure in the joint signal and deal success-fully issues such as variable sampling rates and varying temporal delays between cues. At the spatial level the challenge is to compensate for the distortions introduced by the sensor location and pool information across sensors to recover 3-D information about the spatial environment. For many applications, it is highly desirable if the solution method is self-calibrating, and does not require an extensive manual calibration process every time a new sensor is added or an old sensor is moved or replaced. Removing the burden of manual calibration also makes it possible to exploit ad hoc sensor networks which could arise, for example, from wearable microphones and cameras. We propose to address the following four research topics: 1. Representations and learning methods for signal level fusion. 2. Volumetric techniques for fusing spatially distributed audio-visual data. 3. Self-calibration of distributed microphone-camera systems 4. Applications of audio-visual sensing. For example, this proposal includes considerable work on lip and facial analysis to improve voice communications.

AddThis Social Bookmark Button