Learning flexible sprites

Starting with a paper published at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in 2000, BJ Frey and his colleagues introduced a framework for learning about object appearance and shape by modeling the 3-dimensional scene as a composition of 2-dimensional layers.

In a paper titled "Filling In Scenes by Propagating Probabilities through Layers and Into Appearance Models" (CVPR 2000), Frey described how loopy belief propagation could be used to disambiguate the depth ordering of image patches by modeling both texture (appearance) and shape, the latter in the form of a transparency map. This work predated later work that used a similar approach to infer stereo depth (J Sun, HY Shum and NN Zheng, ECCV 2002).

In 2001, N Jojic and BJ Frey developed a technique for automatically learning layers of "flexible sprites" -- probabilistic 2-dimensional appearance maps and masks of moving, occluding objects. The model explains each input image as a layered composition of flexible sprites. A variational expectation maximization algorithm is used to learn a mixture of sprites from a video sequence. For each input image, probabilistic inference is used to infer the sprite class, translation, mask values and pixel intensities (including occluded pixels) in each layer. Exact inference is intractable, but they showed how a variational inference technique could be used to process 320 x 240 images at 1 frame/second. The only inputs to the learning algorithm are the video sequence, the number of layers and the number of flexible sprites. They obtained results on several tasks, including summarizing a video sequence with sprites, point-and-click video stabilization, and point-and-click object removal.
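The idea of explaining an image as a layered composition of sprites can be illustrated with a small sketch. This is not the authors' implementation; it only shows one plausible way that appearance maps, masks and translations combine into an image. The function names `shift` and `compose`, the back-to-front layer ordering, and the wrap-around translation are all assumptions made for illustration, and sensor noise is left out.

```python
def shift(grid, dy, dx):
    """Translate a 2-D grid by (dy, dx) with wrap-around (assumed for simplicity)."""
    h, w = len(grid), len(grid[0])
    return [[grid[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]


def compose(layers):
    """Composite layers back to front.

    `layers` is a list of (appearance, mask, (dy, dx)) tuples ordered from the
    rearmost layer to the frontmost. Mask values lie in [0, 1]; 1 means opaque.
    """
    h, w = len(layers[0][0]), len(layers[0][0][0])
    image = [[0.0] * w for _ in range(h)]
    for appearance, mask, (dy, dx) in layers:
        a = shift(appearance, dy, dx)
        m = shift(mask, dy, dx)
        for y in range(h):
            for x in range(w):
                # Where the nearer layer's mask is opaque it hides what lies behind it.
                image[y][x] = m[y][x] * a[y][x] + (1.0 - m[y][x]) * image[y][x]
    return image


# Background plane: fully opaque, intensity 0.2 everywhere (3 rows x 4 columns).
background = ([[0.2] * 4 for _ in range(3)], [[1.0] * 4 for _ in range(3)], (0, 0))

# One sprite: a single bright, opaque pixel at row 0, column 0, moved one pixel right.
sprite_appearance = [[0.9, 0.0, 0.0, 0.0], [0.0] * 4, [0.0] * 4]
sprite_mask = [[1.0, 0.0, 0.0, 0.0], [0.0] * 4, [0.0] * 4]
sprite = (sprite_appearance, sprite_mask, (0, 1))

image = compose([background, sprite])
```

In this toy example the sprite pixel lands at row 0, column 1 and hides the background there, while every other pixel shows the background. Learning runs in the opposite direction: given only the images, it has to recover the appearance maps, masks and translations.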

The movie below shows the decomposition of an input video sequence into two "flexible sprites" and a background plane after learning using the variational algorithm.

[Video: cutouts]

In 2003, BJ Frey, N Jojic and A Kannan published a paper titled "Learning appearance and transparency manifolds of occluded objects in layers" in CVPR. Often, variability in sprites is due to interesting features that lie in a much lower-dimensional subspace of the original space of images. Some examples are facial expressions, movements of legs and hands, and lighting variations. Instead of treating these sources of variability as noise, Frey, Jojic and Kannan directly accounted for them by learning factors representing their effects. The generative modeling approach makes this incorporation rather straightforward. They extended the original flexible sprites model, in which each sprite has a mean appearance plus Gaussian noise, by modeling the appearance and shape of each sprite using a linear subspace model (factor analyzer). The figure below shows the generative process. When the dimensionality of the subspace is zero, the original flexible sprites model is recovered.
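The relationship between the two models can be made concrete with a minimal sketch of sampling from a factor analyzer. It is not the authors' code. The function name `sample_sprite`, the flat-list pixel layout and the parameter names are hypothetical, and only appearance is sampled here, although the same form could in principle be applied to the mask.

```python
import random


def sample_sprite(mean, loading, noise_std, rng):
    """Draw one sprite appearance (a flat list of P pixels) from a factor analyzer.

    mean:      list of P pixel means
    loading:   P rows, each holding K factor loadings (K may be 0)
    noise_std: list of P per-pixel noise standard deviations
    rng:       a random.Random instance
    """
    k = len(loading[0]) if loading else 0
    # Low-dimensional hidden factors, e.g. an expression or lighting coordinate.
    factors = [rng.gauss(0.0, 1.0) for _ in range(k)]
    pixels = []
    for i in range(len(mean)):
        subspace_part = sum(l * f for l, f in zip(loading[i], factors))
        pixels.append(mean[i] + subspace_part + rng.gauss(0.0, noise_std[i]))
    return pixels


rng = random.Random(0)
mean = [0.5, 0.5, 0.5]

# K = 1: a single factor moves all pixels together along one direction.
with_factor = sample_sprite(mean, [[1.0], [2.0], [0.0]], [0.0, 0.0, 0.0], rng)

# K = 0: no factors, so the sample is just the mean plus per-pixel Gaussian noise,
# which matches the description of the original flexible sprites model.
no_factor = sample_sprite(mean, [[], [], []], [0.0, 0.0, 0.0], rng)
```

With the noise set to zero, the K = 0 sample equals the mean exactly, and the K = 1 sample differs from the mean only along the single loading direction.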

References

  • B. J. Frey 2000. Filling In Scenes by Propagating Probabilities through Layers and Into Appearance Models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 00).

  • N. Jojic and B. J. Frey 2001. Learning flexible sprites in video layers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 01).

  • B. J. Frey, N. Jojic and A. Kannan 2003. Learning appearance and transparency manifolds of occluded objects in layers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 03).