The epitome of a video sequence is a spatially and/or temporally compact representation of the video that retains the video's essential textural, shape, and motion components. The figure below shows how a video epitome is learnt from a video. The video is treated as a two-dimensional image with an added time dimension: stacking the video frames together yields a three-dimensional construct. Three-dimensional patches of varying spatial and temporal sizes from the video are used to learn the video epitome in an unsupervised manner. The video epitome itself is a three-dimensional construct that can represent the video in both a spatially and temporally compact form. Under a probabilistic generative model, the video patches are considered to have come from a smaller video sequence, the video epitome.
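As a rough illustration of the patch-sampling step described above, the following sketch stacks frames into a three-dimensional array (nested lists indexed `[t][y][x]`) and draws space-time patches of several sizes. The helper names (`make_video`, `extract_patch`, `sample_patches`), the toy pixel values, and the uniform random sampling are assumptions made for this example; the sketch covers only the extraction of training patches, not the learning of the epitome itself.

```python
import random

def make_video(n_frames, height, width):
    """Build a toy video as nested lists indexed [t][y][x] (hypothetical data)."""
    return [[[(t + y + x) % 256 for x in range(width)]
             for y in range(height)]
            for t in range(n_frames)]

def extract_patch(video, t0, y0, x0, dt, dy, dx):
    """Cut a dt x dy x dx space-time block whose corner is (t0, y0, x0)."""
    return [[row[x0:x0 + dx] for row in frame[y0:y0 + dy]]
            for frame in video[t0:t0 + dt]]

def sample_patches(video, sizes, n_per_size, seed=0):
    """Draw random 3D patches for each (dt, dy, dx) size in `sizes`."""
    rng = random.Random(seed)
    n_frames, height, width = len(video), len(video[0]), len(video[0][0])
    patches = []
    for dt, dy, dx in sizes:
        for _ in range(n_per_size):
            # randint is inclusive, so the patch always fits inside the video
            t0 = rng.randint(0, n_frames - dt)
            y0 = rng.randint(0, height - dy)
            x0 = rng.randint(0, width - dx)
            patches.append(extract_patch(video, t0, y0, x0, dt, dy, dx))
    return patches

video = make_video(n_frames=8, height=12, width=16)
patches = sample_patches(video, sizes=[(2, 4, 4), (4, 6, 8)], n_per_size=3)
```

Each entry of `patches` is itself a small video, which is what the generative model assumes was copied out of the (much smaller) epitome.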
Figure (a) below shows a video (click image for video) of a toy car moving around a rectangular object. A variety of video epitomes can be learnt from the video, as the size of the epitome acts as a knob that can be turned to adjust the amount of compression in both space and time. Four frames from one such epitome are shown in (b), where a strong emphasis is put on spatial compression. The epitome isolates the basic horizontal motion of the car in these few frames (note the wrapping of the epitome along the edges). Conversely, figure (c) shows a video epitome (shown 2.5 times smaller than original size) that greatly compresses the time dimension of the video. With just a few frames to work with, the video epitome models multiple motion patterns simultaneously within its frames. These two video epitomes contain approximately the same total number of pixels and both are 20 times smaller than the original video, but they have very different appearances. In both epitomes, however, the essential structural and motion components are maintained. While the video epitome can itself be useful for visual purposes, its true power arises when used within a larger model for applications such as motion analysis, super-resolution, video inpainting, and compression.
(b)
(c)
Video inpainting is the process of filling in missing portions of a video sequence, a problem that arises with damaged films and occluded objects. The video epitome models both the spatial and temporal characteristics of the video and can be used to perform inpainting by reconstructing the missing pixels from the video epitome.
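The idea of reconstructing missing pixels from an epitome can be sketched on a single frame as follows. This is a simplified stand-in rather than the method used for the results shown here: it replaces probabilistic inference with a hard nearest-patch search over the observed pixels, works in two dimensions only, and assumes the epitome is already given. The function names and the toy striped data are hypothetical.

```python
def best_epitome_patch(image, mask, epitome, y0, x0, k):
    """Find the k x k epitome patch that best matches the observed pixels."""
    best, best_cost = None, float("inf")
    for ey in range(len(epitome) - k + 1):
        for ex in range(len(epitome[0]) - k + 1):
            cost = 0.0
            for dy in range(k):
                for dx in range(k):
                    if mask[y0 + dy][x0 + dx]:  # compare observed pixels only
                        diff = image[y0 + dy][x0 + dx] - epitome[ey + dy][ex + dx]
                        cost += diff * diff
            if cost < best_cost:
                best, best_cost = (ey, ex), cost
    return best

def inpaint(image, mask, epitome, k):
    """Fill pixels where mask is False by averaging overlapping patch predictions."""
    h, w = len(image), len(image[0])
    total = [[0.0] * w for _ in range(h)]
    count = [[0] * w for _ in range(h)]
    for y0 in range(h - k + 1):
        for x0 in range(w - k + 1):
            ey, ex = best_epitome_patch(image, mask, epitome, y0, x0, k)
            for dy in range(k):
                for dx in range(k):
                    total[y0 + dy][x0 + dx] += epitome[ey + dy][ex + dx]
                    count[y0 + dy][x0 + dx] += 1
    # Observed pixels are kept; missing pixels take the averaged prediction.
    return [[image[y][x] if mask[y][x] else total[y][x] / count[y][x]
             for x in range(w)] for y in range(h)]

# Toy data: vertical stripes, with one pixel marked as missing.
image = [[10 * (x % 2) for x in range(6)] for _ in range(6)]
mask = [[True] * 6 for _ in range(6)]
image[2][3], mask[2][3] = 0, False          # true value would be 10
epitome = [[10 * (x % 2) for x in range(3)] for _ in range(3)]
result = inpaint(image, mask, epitome, k=2)
```

In this toy case every patch overlapping the hole agrees on the stripe pattern, so the missing pixel is restored to the stripe value. For video, the same search would run over three-dimensional patches so that motion is taken into account as well.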
The following figure on the left shows a video of a car where some pixels have been artificially removed. The video epitome is then used to fill in the missing pixels, resulting in the inpainted video on the right. Click the images to see the videos.
| Source Video | Inpainted Video |
| --- | --- |
| ![]() | ![]() |
A second inpainting example is shown below. In this video, the fire hydrant is removed by treating its pixels as missing and then performing video inpainting with the video epitome. Click the images to see the videos.
| Source Video | Cut-out | Inpainted Video |
| --- | --- | --- |
| ![]() | ![]() | ![]() |
Video super-resolution is defined here as taking a spatially low-resolution video sequence and increasing its spatial resolution. In order to achieve super-resolution, a video epitome is learnt from a high-resolution sequence and then used to reconstruct the low-resolution video. The following figure shows a portion of one frame of a low-resolution video sequence along with the bicubic interpolation and the video epitome super-resolution result. See the project website for more results and video demos.
V. Cheung, B. J. Frey, and N. Jojic 2005. Video Epitomes, IEEE Intern. Conf. Computer Vision and Pattern Recognition. (Best paper honourable mention award) [PDF]
Learning object-based appearance/shape models and estimating motion fields are highly interdependent problems. At one extreme, all the motion can be represented as an excessively large set of appearance exemplars. However, a more efficient representation of a video sequence would save on frame description if it described the motion from the previous frame instead. The extreme in this direction is also problematic, as there are usually causes of appearance variability other than motion. The flexible sprite model illustrates the benefits of joint modelling of motion, shape and appearance using very simple models. The advantage of such a model is that each part of the model tries to capture some of the variability in the data until all the variability is decomposed and explained through either appearance, shape or transformation changes. Yet the set of motions modelled is very limited, and the residual motion is simply captured in the variance maps of the sprites. Despite this appearance flexibility, the model requires an excessive number of appearance classes to capture many types of non-uniform large motions for which the translational variable is not a sufficient descriptor. In this work, we develop a better balance between the transformation and appearance models by explicitly modelling arbitrarily large, non-uniform motion.
The figure below shows the generative model of image formation using two layers. For each layer, an appearance and a mask are generated from appropriate prior distributions associated with object classes. We sample deformation vectors for each pixel. The deformation field is then applied to both the appearance and the mask. The position variables are randomly selected and the appropriate latent images are shifted accordingly. The final image is composed from the layers according to the masks, which can be either continuous or discrete.

The deformation model is fully expressive and nonlinear. The additional level of global transformation provides regularization and a computational advantage, but could be absorbed into the deformation model. The images used in the model are actually inferred from a video sequence of a person walking diagonally towards the camera (an example of which is shown in the last level, where the observed image is formed).
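The last two steps of the generative process (shifting a layer by its position variable and compositing the layers according to the mask) can be sketched as below. This is a minimal illustration under stated assumptions: the per-pixel deformation field and the class priors are omitted, only a whole-pixel translation is applied, and the blend rule `mask * foreground + (1 - mask) * background` is the usual alpha-compositing form, which handles both continuous and binary masks. The function names and toy layers are hypothetical.

```python
def shift(layer, dy, dx, fill=0.0):
    """Translate a 2D layer by (dy, dx); uncovered pixels receive `fill`."""
    h, w = len(layer), len(layer[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = layer[sy][sx]
    return out

def compose(background, fg_appearance, fg_mask):
    """Blend a foreground layer over a background using its mask (0..1)."""
    return [[m * f + (1.0 - m) * b
             for b, f, m in zip(brow, frow, mrow)]
            for brow, frow, mrow in zip(background, fg_appearance, fg_mask)]

# Toy layers: a uniform background and a one-pixel foreground object.
background = [[1.0] * 4 for _ in range(3)]
appearance = [[0.0] * 4 for _ in range(3)]
mask = [[0.0] * 4 for _ in range(3)]
appearance[0][0], mask[0][0] = 5.0, 1.0

# The same position variable is applied to both appearance and mask.
moved_appearance = shift(appearance, dy=1, dx=2)
moved_mask = shift(mask, dy=1, dx=2)
observed = compose(background, moved_appearance, moved_mask)
```

Applying the same shift to the appearance and the mask keeps them aligned, mirroring the way the deformation field is applied to both in the model.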
The epitome of an image is its miniature, condensed version containing the essence of the textural and shape properties of the image. It is a novel, simple appearance and shape model that is considerably smaller than the image or the object it represents, but has all the constituent elements needed to reconstruct the image.
The epitome of an M x N image is a condensed version of size Me x Ne (Me << M and Ne << N) that retains the textural information of the original image. The image is described by its epitome together with a mapping from the epitome to the image pixels. The mapping sends each patch in the image to a patch in the epitome so that, given the epitome and the mapping, the larger image can be composed from patches of the epitome. Since the epitome is much smaller than the image (usually about one-fourth of its size), many patches in the image necessarily map to the same patch in the epitome.
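The role of the mapping can be made concrete with a small sketch that rebuilds an image from an epitome and a patch mapping. The data layout is an assumption made for this example: `mapping` is a dictionary from the top-left corner of an image patch to the top-left corner of the epitome patch it copies, epitome coordinates wrap around at the edges, and overlapping predictions are averaged. The epitome and mapping here are written by hand rather than learned.

```python
def reconstruct(epitome, mapping, height, width, k):
    """Rebuild an image by averaging the epitome patches covering each pixel.

    mapping[(y0, x0)] = (ey, ex) sends the k x k image patch at (y0, x0)
    to the k x k epitome patch at (ey, ex), with wrap-around in the epitome.
    """
    eh, ew = len(epitome), len(epitome[0])
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for (y0, x0), (ey, ex) in mapping.items():
        for dy in range(k):
            for dx in range(k):
                total[y0 + dy][x0 + dx] += epitome[(ey + dy) % eh][(ex + dx) % ew]
                count[y0 + dy][x0 + dx] += 1
    # Pixels not covered by any patch are returned as None.
    return [[total[y][x] / count[y][x] if count[y][x] else None
             for x in range(width)] for y in range(height)]

# A 2 x 2 epitome describing a 4 x 4 checkerboard (one-fourth the pixels).
epitome = [[0, 1],
           [1, 0]]
mapping = {
    (0, 0): (0, 0), (0, 2): (0, 0),
    (2, 0): (0, 0), (2, 2): (0, 0),
    (1, 1): (1, 1),  # overlapping patch that wraps around the epitome edge
}
image = reconstruct(epitome, mapping, height=4, width=4, k=2)
```

Five image patches map onto only two distinct epitome locations, which is the many-to-one behaviour described above.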
The figure below illustrates the appearance epitome. An image (a) is epitomized in the texture (b), which is shown enlarged to twice its size. (c) is the image reconstructed using the mappings that send each patch in (a) to a patch in (b). In (d) to (f), some of the learned mappings are shown, illustrating that a single patch in the epitome maps to multiple patches in the image:

We can use the epitomic representation for object appearances and shapes in multiple image layers. This results in a generative model that can composite appearance and shape epitomes to provide a description of the image as a combination of sprites, as in flexible sprites. Moreover, due to the compression abilities of the epitome, it is possible to discover layer structure in a single image. The generative model for a two-layer description is shown below. See also the video illustration of the learning.

In addition to segmenting the image, the model can also fill in occluded regions with similar appearance. This is because the model learns the continuity in the texture and uses this to explain the occluded regions in the patches.
An interesting and potentially useful vision/graphics task is to render an input image in an enhanced form or in an unusual style, for example with increased sharpness or with some artistic qualities. In previous work, researchers showed that by estimating the mapping from an input image to a registered (aligned) image of the same scene in a different style or resolution, the mapping could be used to render a new input image.
Frequently a registered pair is not available; instead, the user may have (besides the input image) only a source image of an unrelated scene that contains the desired style. In this case, the task of inferring the output image is much more difficult, since the algorithm must both infer correspondences between features in the input image and the source image, and infer the unknown mapping between the images. Given the style (A) and the input image (B) in the above example, we want to be able to infer the image that corresponds to applying the style in A to B. The output image is shown above A and B.
We describe a Bayesian technique for inferring the most likely output image. The prior P(X) on the output image is a patch-based Markov random field obtained from the source image. The likelihood of the input, P(Y|X), is a Bayesian network that can represent different rendering styles (lq). The graphical model (chain graph) is shown below.

We describe a computationally efficient, probabilistic inference and learning algorithm for inferring the most likely output image and learning the rendering style.
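In the notation used above, with $X$ the output image and $Y$ the input image, inferring the most likely output image corresponds to maximum a posteriori estimation. The following is the generic Bayes-rule form of that objective, written as a sketch; the symbol $\hat{X}$ is introduced here for illustration, and the dependence of the likelihood on the learned rendering-style parameters is left implicit.

```latex
\hat{X} \;=\; \arg\max_{X} P(X \mid Y)
        \;=\; \arg\max_{X} \frac{P(Y \mid X)\,P(X)}{P(Y)}
        \;=\; \arg\max_{X} P(Y \mid X)\,P(X)
```

The evidence $P(Y)$ does not depend on $X$, so it can be dropped from the maximization, leaving the product of the style likelihood and the patch-based prior.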
We also show that current techniques for image restoration or reconstruction proposed in the vision literature (e.g., image super-resolution or de-noising) and image-based non-photorealistic rendering can be seen as special cases of our model.
Style transfer (click image to enlarge)

| Desired style | Input image | Output image with desired style |
| --- | --- | --- |
| ![]() | ![]() | ![]() |
R. Rosales, K. Achan, and B. J. Frey 2003 Unsupervised Image Translation, in Proceedings of International Conference on Computer Vision (ICCV 03), Nice, France, Oct. 2003. [PS.gz | PDF] [BibTeX]
R. Rosales, K. Achan, and B. J. Frey 2003 Translating Images by Unsupervised Estimation of Switching Filters, invited paper in Proceedings of Workshop on Statistical Signal Processing (SSP), Sep. 2003. [PS.gz | PDF]
Transformation-invariant component analysis (TCA) is a probabilistic dimensionality reduction method that accounts for global transformations such as translations and rotations while learning local linear appearance deformations. The computational cost of learning this model using the EM algorithm is on the order of O(N^2), where N is the number of elements in each training example. This is prohibitive for many applications of interest, such as modeling mid- to large-size images.
In this work, we present an efficient EM algorithm for TCA that reduces the computational requirements to O(N log N). For 256x256 images, this is roughly 4000 times faster!
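The quoted speed-up can be checked with a short calculation. The sketch below assumes a base-2 logarithm and ignores the constant factors hidden by the O-notation, so it gives only the ratio of the two growth rates, not a measured running time; the function name `speedup` is introduced for this example.

```python
import math

def speedup(n):
    """Ratio N^2 / (N log2 N), ignoring constant factors."""
    return (n * n) / (n * math.log2(n))

n = 256 * 256             # number of pixels in a 256 x 256 image
ratio = speedup(n)
print(n, round(ratio))    # prints: 65536 4096
```

With N = 65536 and log2 N = 16, the ratio is N / 16 = 4096, which matches the figure of about 4000 quoted above.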
The proposed algorithm allows TCA to be used in the analysis of realistic data. In addition, it facilitates using TCA as a sub-module in other applications that require learning a transformation-invariant subspace. An example is modeling images using a layered decomposition, where each layer is explained using a mixture of TCA models.
A. Kannan, N. Jojic, B. J. Frey. Fast Transformation-Invariant Component Analysis, Submitted to Intl. Journal of Computer Vision special issue: Learning for Vision and Vision for Learning [PDF]
A. Kannan, N. Jojic, B. J. Frey. Fast Transformation-Invariant Component Analysis (or Factor Analysis), In Advances in Neural Information Processing Systems (NIPS), 2003 [PDF]
We propose a technique for automatically learning layers of "flexible sprites": probabilistic 2-dimensional appearance maps and masks of moving, occluding objects. The model explains each input image as a layered composition of flexible sprites. A variational expectation maximization algorithm is used to learn a mixture of sprites from a video sequence. For each input image, probabilistic inference is used to infer the sprite class, translation, mask values and pixel intensities (including obstructed pixels) in each layer. Exact inference is intractable, but we show how a variational inference technique can be used to process 320 x 240 images at 1 frame/second. The only inputs to the learning algorithm are the video sequence, the number of layers and the number of flexible sprites. We have obtained results on several tasks, including summarizing a video sequence with sprites, point-and-click video stabilization, and point-and-click object removal.
The movie below shows the decomposition of the original sequence into two "flexible sprites" and a background image after learning using the variational algorithm.
Often, variability in the "sprites" is due to interesting features that lie in a subspace of much lower dimension than the original space of images. Some examples are facial expressions and the movement of legs and hands. Instead of treating this variability as noise, we can account for it directly. Our generative modelling approach makes this incorporation rather straightforward. Thus, we extend the flexible sprites model, which models each sprite as a Gaussian, by modelling each sprite as a factor analyzer. The figure below shows the generative process. When the dimensionality of the subspace is 0, we recover the flexible sprites model.
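The relationship between the two sprite models can be illustrated by sampling from a factor analyzer, in which an appearance is the mean plus a linear combination of a few latent factors plus independent pixel noise. This is a generic textbook form written as a sketch: the function name, the argument layout (`loading` as one row of factor weights per pixel) and the toy numbers are assumptions made for this example, and masks, layers and transformations are left out.

```python
import random

def sample_sprite(mean, loading, noise_std, rng):
    """Draw one appearance: mean + loading * z + independent Gaussian noise.

    mean:      list of N pixel means
    loading:   list of N rows, each with K factor weights (K may be 0)
    noise_std: list of N per-pixel noise standard deviations
    """
    n_factors = len(loading[0])
    z = [rng.gauss(0.0, 1.0) for _ in range(n_factors)]  # latent subspace coordinates
    return [m + sum(w * zj for w, zj in zip(row, z)) + rng.gauss(0.0, s)
            for m, row, s in zip(mean, loading, noise_std)]

rng = random.Random(0)

# K = 0: no subspace, so the sample is just the Gaussian mean plus noise
# (noise is set to zero here to make the reduction visible).
gaussian_sample = sample_sprite([1.0, 2.0], [[], []], [0.0, 0.0], rng)

# K = 1: both pixels move together along a single direction of variation.
factor_sample = sample_sprite([0.0, 0.0], [[1.0], [2.0]], [0.0, 0.0], rng)
```

With an empty loading matrix the factor term vanishes and each pixel is an independent Gaussian around its mean, which is the sense in which a zero-dimensional subspace recovers the flexible sprites model.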

N. Jojic and B. J. Frey 2001 Learning flexible sprites in video layers, In Proceedings of IEEE conference on Computer Vision and Pattern Recognition, 2001. (CVPR 01) [PS | PDF] [BibTeX]
B. J. Frey, N. Jojic and A. Kannan 2003 Learning appearance and transparency manifolds of occluded objects in layers, In Proceedings of IEEE conference on Computer Vision and Pattern Recognition, 2003. (CVPR 03) [PS.gz | PDF] [BibTeX]