Computational vision

The ‘epitome’: A new model of patterns

In 2002, BJ Frey and N Jojic were interested in incorporating models of texture into their appearance-based models of images and videos. To do this efficiently, they invented the 'epitome' model of an image, which can be viewed as a miniature, condensed image that contains many of the texture and shape properties of the input image (Frey and Jojic, UTTR 2002). More specifically, the epitome of an image consists of a smaller image of pixel intensities along with an image of pixel variances. What relates an image to its epitome? Simply this: The epitome is learned so that if small patches are sampled in an unordered fashion from the epitome, they will have nearly the same appearance as patches sampled from the original input image.

The input image is described by its epitome together with a mapping that associates each patch in the image with a patch in the epitome. Given the epitome and the mapping, the image can be composed from epitome patches. Since the epitome is much smaller than the input image, many image patches necessarily map to the same epitome patch.
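As a rough illustration, the sketch below reconstructs an image by averaging overlapping epitome patches at the locations given by the mapping. The names (e_mean, mappings) and the fixed square patch size are illustrative assumptions, not the authors' implementation.

    # A minimal sketch of reconstructing an image from its epitome, assuming a
    # learned epitome of mean intensities and a mapping that assigns each image
    # patch a source location in the epitome.
    import numpy as np

    def reconstruct_from_epitome(e_mean, mappings, image_shape, patch=8):
        # e_mean      : (He, We) epitome of mean pixel intensities
        # mappings    : dict {(i, j): (ei, ej)} giving, for the image patch with
        #               top-left corner (i, j), the top-left corner of the
        #               epitome patch it maps to
        # image_shape : (H, W) of the image being reconstructed
        recon = np.zeros(image_shape)
        counts = np.zeros(image_shape)
        for (i, j), (ei, ej) in mappings.items():
            recon[i:i + patch, j:j + patch] += e_mean[ei:ei + patch, ej:ej + patch]
            counts[i:i + patch, j:j + patch] += 1.0
        # Overlapping patches are averaged; unmapped pixels stay zero.
        return recon / np.maximum(counts, 1.0)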

The figure below illustrates the appearance epitome. An image (a) is epitomized in the texture (b), which is enlarged to twice its size for visual clarity. (c) is the image reconstructed using the mappings that map each patch in (a) to a patch in (b). Panels (d) to (f) show some of the learned mappings, illustrating that a single patch in the epitome maps to multiple patches in the image:

Epitome

Epitomes of shape

We can use the epitome to represent both the appearance and the shape of an object. This results in a generative model that combines appearance and shape epitomes to describe the image as a composition of sprites (see the project web page for flexible sprites). However, unlike in our flexible-sprites work, the compression abilities of the epitome make it possible to discover layered structure from a single input image. The generative model for a two-layer description is shown below. See also the video illustrating the learning process.

Layered Epitome

In addition to segmenting the image, the model can fill in occluded regions with similar appearance. This is because the model learns the continuity of the texture and uses it to explain the occluded pixels within patches.

Video epitomes

The epitome of a video sequence is a spatially and/or temporally compact representation of the video that retains the video's essential textural, shape, and motion components. The figure below shows how a video epitome is learnt from a video. The video is treated as a three-dimensional volume obtained by stacking the frames: two spatial dimensions plus time. Three-dimensional patches of varying spatial and temporal sizes drawn from this volume are used to learn the video epitome in an unsupervised manner. The video epitome is itself a three-dimensional construct that represents the video in a spatially and temporally compact form. Under a probabilistic generative model, the video patches are considered to have come from a smaller video sequence - the video epitome.

Learning Video Epitome
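As a rough sketch of the patch-extraction step, the code below draws three-dimensional patches of a few different spatial and temporal sizes from the stacked video volume. The particular sizes and the random sampling scheme are illustrative assumptions, not the settings used in the paper.

    import numpy as np

    def sample_video_patches(video, sizes=((8, 8, 4), (16, 16, 2)),
                             n_per_size=1000, seed=0):
        # video : (H, W, T) array obtained by stacking the frames along time
        rng = np.random.default_rng(seed)
        H, W, T = video.shape
        patches = []
        for (ph, pw, pt) in sizes:
            for _ in range(n_per_size):
                # Pick a random top-left-front corner for a (ph, pw, pt) patch.
                i = rng.integers(0, H - ph + 1)
                j = rng.integers(0, W - pw + 1)
                k = rng.integers(0, T - pt + 1)
                patches.append(video[i:i + ph, j:j + pw, k:k + pt])
        return patches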

Figure (a) below shows a video (click image for video) of a toy car moving around a rectangular object. A variety of video epitomes can be learnt from the video, as the size of the epitome acts as a knob that can be turned to adjust the amount of compression in both space and time. Four frames from one such epitome are shown in (b), where a strong emphasis is put on spatial compression. The epitome isolates the basic horizontal motion of the car in these few frames (note the wrapping of the epitome along the edges). Conversely, figure (c) shows a video epitome (displayed at 2.5 times smaller than its original size) that greatly compresses the time dimension of the video. With just a few frames to work with, this video epitome models multiple motion patterns simultaneously within its frames. These two video epitomes contain approximately the same total number of pixels and both are 20 times smaller than the original video, yet they have very different appearances. In both epitomes, however, the essential structural and motion components are maintained. While the video epitome can itself be useful for visualization, its true power arises when it is used within a larger model for applications such as motion analysis, super-resolution, video inpainting, and compression.

(a)  Toy Car First Frame

(b)  Toy Car Compress Space Epitome 0      Toy Car Compress Space Epitome 1      Toy Car Compress Space Epitome 2      Toy Car Compress Space Epitome 3

(c)  Toy Car Compress Time Epitome 0      Toy Car Compress Time Epitome 1      Toy Car Compress Time Epitome 2      Toy Car Compress Time Epitome 3      Toy Car Compress Time Epitome 4

Video inpainting using video epitomes

Video inpainting is the process of filling in missing portions of a video sequence, a problem that arises with damaged film and with occluding objects that need to be removed. Because the video epitome models both the spatial and temporal characteristics of the video, it can be used to perform inpainting by reconstructing the missing pixels from the epitome.
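A minimal sketch of the reconstruction step is shown below: assuming each incomplete video patch has already been matched to a location in the epitome, observed pixels are kept and missing pixels are read off from the epitome means. This is an illustrative hard-assignment version of the reconstruction, not the full probabilistic inference.

    import numpy as np

    def inpaint_patch(patch, mask, epitome_patch):
        # patch, mask, epitome_patch : arrays of the same shape
        # mask is 1 where a pixel is observed and 0 where it is missing
        return mask * patch + (1.0 - mask) * epitome_patch

In the full model, overlapping reconstructed patches are combined, so each missing pixel is explained by several epitome patches rather than a single one.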

The following figure shows, on the left, a video of a car from which some pixels have been artificially removed. The video epitome is then used to fill in the missing pixels, resulting in the inpainted video on the right. Click the images to see the videos.


Car Apartment Cut-Out Frame      Car Apartment Inpainting Frame
Source Video      Inpainted Video

A second inpainting example is shown below. In this video, the fire hydrant is removed by treating its pixels as missing, and video inpainting is then performed with the video epitome. Click the images to see the videos.


Street Walking Frame Street Walking Cutout Frame Street Walking Inpainting Frame
Source Video Cut-out Inpainted Video

Video super-resolution

Video super-resolution is defined here as taking a spatially low-resolution video sequence and increasing its spatial resolution. To achieve super-resolution, a video epitome is learnt from a high-resolution sequence and then used to reconstruct a high-resolution version of the low-resolution video. The following figure shows a portion of one frame of a low-resolution video sequence along with the bicubic-interpolation and video-epitome super-resolution results. See the project website for more results and video demos.

Plant Superresolution
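The sketch below illustrates the idea under strong simplifying assumptions: the epitome means were learnt from high-resolution footage, low resolution is modelled as simple block-averaging by an integer factor f, and each output patch is filled with the epitome patch whose downsampled version best matches the observed low-resolution patch. This nearest-patch approximation stands in for the probabilistic inference used in the actual system.

    import numpy as np

    def downsample(block, f):
        # Block-average an (h, w) patch by an integer factor f.
        h, w = block.shape
        return block.reshape(h // f, f, w // f, f).mean(axis=(1, 3))

    def superresolve_patch(lr_patch, e_mean, f=4, hr_size=8):
        # Return the hr_size x hr_size epitome patch that best explains lr_patch.
        best, best_err = None, np.inf
        He, We = e_mean.shape
        for i in range(He - hr_size + 1):
            for j in range(We - hr_size + 1):
                cand = e_mean[i:i + hr_size, j:j + hr_size]
                err = np.sum((downsample(cand, f) - lr_patch) ** 2)
                if err < best_err:
                    best, best_err = cand, err
        return best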

Project website

http://www.psi.toronto.edu/~vincent/videoepitome.html

References

  • BJ Frey and N Jojic. Learning the 'epitome' of an image, University of Toronto Technical Report TR PSI-2002-14, Nov 10 2002. [PDF]

  • N Jojic, BJ Frey and A Kannan. Epitomic Analysis of Appearance and Shape, Proceedings of the International Conference on Computer Vision (ICCV), Nice, France, Oct. 2003. [PDF] [BibTeX]

  • V Cheung, BJ Frey and N Jojic. Video Epitomes, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2005; honorable mention for best paper award. [PDF]

Transformation invariant clustering and dimensionality reduction

Fast transformation-invariant component analysis

Transformation-invariant component analysis (TCA) is a probabilistic dimensionality reduction method that accounts for global transformations such as translations and rotations while learning local linear appearance deformations. The computational requirement for learning this model using the EM algorithm is on the order of O(N²), where N is the number of elements in each training example. This is prohibitive for many applications of interest, such as modeling mid- to large-size images.

In this work, we present an efficient EM algorithm for TCA that reduces the computational requirements to O(N log N). For 256×256 images, this is roughly 4000 times faster!

The proposed algorithm allows TCA to be used in the analysis of realistic data. In addition, it facilitates using TCA as a sub-module in other applications that require transformation-invariant subspace learning. An example is modeling images using a layered decomposition, where each layer is explained by a mixture of TCA models.
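The sketch below illustrates the kind of FFT-based computation that is assumed to underlie the O(N log N) cost: the squared-error match between an image and a (latent) template can be evaluated under every cyclic translation at once by a correlation in the Fourier domain. This shows the general trick, not the exact fast-TCA updates.

    import numpy as np

    def match_all_translations(image, template):
        # Returns S with S[dy, dx] = sum over pixels of
        # (image[x] - template[x - (dy, dx)])**2 for all N cyclic shifts,
        # computed in O(N log N) time via the FFT.
        F_img = np.fft.fft2(image)
        F_tmp = np.fft.fft2(template)
        cross = np.real(np.fft.ifft2(F_img * np.conj(F_tmp)))
        S = np.sum(image ** 2) + np.sum(template ** 2) - 2.0 * cross
        # A posterior over translations would then be proportional to
        # exp(-S / (2 * sigma**2)), up to normalization.
        return S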

Project website

Fast Transformation-invariant component analysis

References

  • A. Kannan, N. Jojic, B. J. Frey. Fast Transformation-Invariant Component Analysis, submitted to the International Journal of Computer Vision, special issue on Learning for Vision and Vision for Learning. [PDF]

  • A. Kannan, N. Jojic, B. J. Frey. Fast Transformation-Invariant Component Analysis (or Factor Analysis), In Advances in Neural Information Processing Systems (NIPS), 2003 [PDF]

Learning layered models of images and videos

Starting with a paper published in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in 2000, BJ Frey and his colleagues have introduced a framework for learning about object appearance and shape by modeling the 3-dimensional scene as a composition of 2-dimensional layers.

In a paper titled "Filling In Scenes by Propagating Probabilities through Layers and Into Appearance Models" (CVPR 2000), Frey described how loopy belief propagation could be used to disambiguate the depth ordering of image patches by modeling texture (appearance) and shape, in the form of a transparency map. This work predated later work that used a similar approach to solve the problem of inferring stereo depth (J Sun, HY Shum and NN Zheng, ECCV 2002).

In 2001, N Jojic and BJ Frey developed a technique for automatically learning layers of "flexible sprites'' -- probabilistic 2-dimensional appearance maps and masks of moving, occluding objects. The model explains each input image as a layered composition of flexible sprites. A variational expectation maximization algorithm is used to learn a mixture of sprites from a video sequence. For each input image, probabilistic inference is used to infer the sprite class, translation, mask values and pixel intensities (including obstructed pixels) in each layer. Exact inference is intractable, but they showed how a variational inference technique could be used to process 320 x 240 images at 1 frame/second. The only inputs to the learning algorithm are the video sequence, the number of layers and the number of flexible sprites. They obtained results on several tasks, including summarizing a video sequence with sprites, point-and-click video stabilization, and point-and-click object removal.
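The layered composition itself can be sketched as below: each layer contributes an appearance map and a transparency mask, and the layers are composited over the background, back to front. The function assumes the sprites have already been translated into place, and the names are illustrative rather than taken from the paper.

    import numpy as np

    def compose_layers(background, appearances, masks):
        # background  : (H, W) background image
        # appearances : list of (H, W) sprite appearance maps, front layer first
        # masks       : list of (H, W) transparency maps with values in [0, 1]
        image = background.copy()
        for appearance, mask in zip(reversed(appearances), reversed(masks)):
            # Each layer shows through wherever its mask is close to 1 and
            # leaves the layers behind it visible elsewhere.
            image = mask * appearance + (1.0 - mask) * image
        return image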

The movie below shows the decomposition of an input video sequence into two "flexible sprites" and a background plane after learning using the variational algorithm.

cutouts
click image to see video

In 2003, BJ Frey, N Jojic and A Kannan published a paper titled "Learning appearance and transparency manifolds of occluded objects in layers" in CVPR. Often, variability in sprites is due to interesting features that lie in a much lower-dimensional subspace of the original image space. Some examples are facial expressions, movements of legs and hands, and lighting variations. Instead of treating these sources of variability as noise, Frey, Jojic and Kannan accounted for them directly by learning factors representing their effects; the generative modeling approach makes this incorporation rather straightforward. They extended the original flexible-sprites model, in which each sprite has a mean appearance plus Gaussian noise, by modeling the appearance and shape of each sprite with a linear subspace model (a factor analyzer). The figure below shows the generative process. When the dimensionality of the subspace is zero, the original flexible-sprites model is recovered.
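A minimal sketch of the subspace extension, under assumed names and dimensions: the appearance (and, analogously, the mask) of a sprite is generated from a low-dimensional factor analyzer rather than a single mean plus noise, and setting the subspace dimensionality K to zero recovers the original flexible-sprites appearance model.

    import numpy as np

    def sample_sprite_appearance(mu, Lambda, noise_var, seed=0):
        # mu        : (D,) mean appearance (image flattened to a vector)
        # Lambda    : (D, K) factor loadings spanning the appearance subspace
        # noise_var : (D,) per-pixel noise variances
        rng = np.random.default_rng(seed)
        K = Lambda.shape[1]
        y = rng.standard_normal(K)                      # low-dimensional factor
        noise = rng.standard_normal(mu.shape[0]) * np.sqrt(noise_var)
        # With K = 0 the middle term vanishes: mean appearance plus noise.
        return mu + Lambda @ y + noise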

References

  • B. J. Frey 2000. Filling In Scenes by Propagating Probabilities through Layers and Into Appearance Models, In Proceedings of IEEE conference on Computer Vision and Pattern Recognition, 2000. (CVPR 00)

  • N. Jojic and B. J. Frey 2001. Learning flexible sprites in video layers, In Proceedings of IEEE conference on Computer Vision and Pattern Recognition, 2001. (CVPR 01) [PS | PDF] [BibTeX]

  • B. J. Frey, N. Jojic and A. Kannan 2003. Learning appearance and transparency manifolds of occluded objects in layers, In Proceedings of IEEE conference on Computer Vision and Pattern Recognition, 2003. (CVPR 03) [PS.gz | PDF] [BibTeX]