July 2003 (Vol. 25, No. 7)   pp. 828-836
A Graphical Model for Audiovisual Object Tracking

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TPAMI.2003.1206512

Abstract
We present a new approach to modeling and processing multimedia data. This approach is based on graphical models that combine audio and video variables. We demonstrate it by developing a new algorithm for tracking a moving object in a cluttered, noisy scene using two microphones and a camera. Our model uses unobserved variables to describe the data in terms of the process that generates them. It is therefore able to capture and exploit the statistical structure of the audio and video data separately, as well as their mutual dependencies. Model parameters are learned from data via an EM algorithm, and automatic calibration is performed as part of this procedure. Tracking is done by Bayesian inference of the object location from data. We demonstrate successful performance on multimedia clips captured in real-world scenarios using off-the-shelf equipment.
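
To make the idea of tracking by Bayesian inference concrete, the following is a minimal illustrative sketch, not the model from the article. It assumes a one-dimensional horizontal position on a discrete grid, a uniform prior, a Gaussian video observation of the position, and a Gaussian inter-microphone time delay that depends linearly on position. The function names, the linear delay mapping (alpha, beta), and all numeric values are assumptions introduced here for illustration; in the article such parameters would be learned by EM rather than fixed by hand.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))


def location_posterior(positions, video_obs, audio_delay,
                       alpha, beta, video_std, audio_std):
    """Posterior over candidate positions under a uniform prior.

    Assumed (hypothetical) observation models:
      video_obs   ~ Normal(position, video_std)
      audio_delay ~ Normal(alpha * position + beta, audio_std)
    """
    log_post = []
    for x in positions:
        log_video = gaussian_log_pdf(video_obs, x, video_std)
        log_audio = gaussian_log_pdf(audio_delay, alpha * x + beta, audio_std)
        log_post.append(log_video + log_audio)
    # Normalize in a numerically stable way (subtract the maximum first).
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical values: a noisy video cue near pixel 40 and a sharper audio
# cue whose time delay corresponds to pixel 50.
positions = list(range(0, 101))
posterior = location_posterior(
    positions,
    video_obs=40.0,
    audio_delay=0.5,
    alpha=0.01,
    beta=0.0,
    video_std=20.0,
    audio_std=0.05,
)
map_position = positions[max(range(len(positions)), key=posterior.__getitem__)]
print("MAP position:", map_position)
```

Because the audio cue is given the smaller effective uncertainty in this toy setup, the fused estimate lands much closer to the audio-implied position than to the video observation, which illustrates how the two modalities can be weighted by their reliability.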
Additional Information
Index Terms: Audio, video, audiovisual, graphical models, generative models, probabilistic inference, Bayesian inference, variational methods, expectation-maximization (EM) algorithm, multimodal, multimedia, tracking, speaker modeling, speech, vision, microphone arrays, cameras, automatic calibrations.

Citation: Matthew J. Beal, Nebojsa Jojic, Hagai Attias, "A Graphical Model for Audiovisual Object Tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 7, pp. 828-836, Jul. 2003
