FAST Transformation-Invariant Component Analysis

 

Anitha Kannan, Nebojsa Jojic & Brendan Frey

 

Transformation-invariant component analysis (TCA) [1] is a probabilistic dimensionality reduction method that accounts for global transformations such as translations and rotations while learning local linear appearance deformations. The computational cost of learning this model with the EM algorithm is on the order of O(N^2), where N is the number of elements in each training example. This is prohibitive for many applications of interest, such as modeling mid- to large-size images.

 

In this work, we present an efficient EM algorithm for TCA that reduces the computational requirements to O(N log N). For 256x256 images, this is roughly 4000 times faster!
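As a rough check on the figure above, the ratio N^2 / (N log N) can be evaluated directly. This is only a back-of-the-envelope sketch: it assumes a base-2 logarithm and ignores constant factors, and the function name `speedup_ratio` is made up for illustration.

```python
import math


def speedup_ratio(height: int, width: int) -> float:
    """Ratio N^2 / (N log2 N) for an image with N = height * width pixels."""
    n = height * width
    return (n * n) / (n * math.log2(n))


# For 256x256 images, N = 65536 and log2(N) = 16.
print(speedup_ratio(256, 256))  # 4096.0
```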

 

The proposed algorithm allows TCA to be used in the analysis of realistic data. In addition, it facilitates using TCA as a sub-module in other applications that require learning a transformation-invariant subspace. An example is modeling images using a layered decomposition, where each layer is explained using a mixture of TCA models.

 

Publication:

Fast Transformation-Invariant Component Analysis [Long version pdf]

A. Kannan, N. Jojic, B. Frey

Submitted to Intl. Journal of Computer Vision special issue: Learning for Vision and Vision for Learning

 

Fast Transformation-Invariant Component Analysis (or Factor Analysis) [pdf]

A. Kannan, N. Jojic, B. Frey

In Advances in Neural Information Processing Systems (NIPS) 2003

MATLAB® software

 

Illustration of the model:

 

                                               

(a): Generative model for mixture of TCA:

 

The class index c is sampled from a discrete distribution p(c), and the subspace coordinates y are drawn from p(y) = N(y; 0, I) to obtain the latent image:

                          z = m_c + L_c y + noise,

where m_c and L_c are the mean and the factor loading matrix of class c. The latent image z is then transformed by a transformation T sampled from P(T) to obtain the observation:

                            x = Tz + noise.

 

T can be viewed as a transformation matrix that acts on the latent image z to generate the observation x.
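The sampling steps of the generative model can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the transformations are restricted to cyclic 2-D translations drawn uniformly, both noise terms are isotropic Gaussians, and the function and parameter names (`sample_mtca`, `latent_noise_std`, `obs_noise_std`) are hypothetical.

```python
import numpy as np


def sample_mtca(rng, class_probs, means, loadings, latent_noise_std, obs_noise_std):
    """Draw one image from a mixture-of-TCA-style generative model.

    means:    (C, H, W) array of class means m_c
    loadings: (C, K, H, W) array of class factor loadings L_c
    """
    num_classes, num_factors, height, width = loadings.shape
    c = rng.choice(num_classes, p=class_probs)   # c ~ p(c)
    y = rng.standard_normal(num_factors)         # y ~ N(0, I)
    # Latent image: z = m_c + L_c y + noise
    z = means[c] + np.tensordot(y, loadings[c], axes=1)
    z = z + latent_noise_std * rng.standard_normal((height, width))
    # T ~ P(T): assumed here to be uniform over cyclic shifts
    shift = (int(rng.integers(height)), int(rng.integers(width)))
    # Observation: x = T z + noise
    x = np.roll(z, shift, axis=(0, 1))
    x = x + obs_noise_std * rng.standard_normal((height, width))
    return c, y, shift, z, x


rng = np.random.default_rng(0)
means = rng.standard_normal((2, 8, 8))        # 2 classes, 8x8 images
loadings = rng.standard_normal((2, 3, 8, 8))  # 3 factors per class
c, y, shift, z, x = sample_mtca(rng, [0.5, 0.5], means, loadings, 0.1, 0.1)
```

Inference runs this process in reverse: given x, the model estimates the class c, the coordinates y, and the transformation T.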

 

An illustration of TCA is shown in (b). The subspace coordinates y and the image position (T_x, T_y) are inferred by training the model on a captured video sequence.

      

 
    

 

 

 

Ideas and mild assumptions behind the speedup:

 

Central Idea:

All of the expensive computations involve computing correlations or convolutions. When performed using the fast Fourier transform (FFT), these cost O(N log N), as opposed to O(N^2) when done naively. The paper shows how these computations arise when some mild assumptions are made.
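The gap between the two costs can be seen in a small NumPy sketch. It assumes the transformations are cyclic 2-D translations, so that scoring a latent image against an observation under every shift is a circular cross-correlation; the function names are illustrative and not taken from the paper or its software.

```python
import numpy as np


def correlation_naive(x, z):
    """out[i, j] = sum of x * (z cyclically shifted by (i, j)).

    Each of the N shifts costs O(N), so the total is O(N^2).
    """
    height, width = x.shape
    out = np.empty((height, width))
    for i in range(height):
        for j in range(width):
            out[i, j] = np.sum(x * np.roll(z, (i, j), axis=(0, 1)))
    return out


def correlation_fft(x, z):
    """Same quantity for all shifts at once via the FFT, in O(N log N)."""
    fx = np.fft.fft2(x)
    fz = np.fft.fft2(z)
    # Correlation theorem: correlation in space is a product with the
    # complex conjugate in the frequency domain (x and z are real).
    return np.real(np.fft.ifft2(fx * np.conj(fz)))


rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
z = rng.standard_normal((16, 16))
assert np.allclose(correlation_naive(x, z), correlation_fft(x, z))
```

Both functions return the same array; only the FFT version stays practical as the image size grows.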

  

Assumptions:

 

 

Experiments:

        NOTE: The datasets used in these experiments cannot be processed with the TCA model without the tremendous speedup obtained in this work.

 

Modeling a walk sequence:

A TCA model is trained on a 250-frame, 165x285 video sequence of a person walking. The goal is to learn a compact representation of the dynamically and periodically changing hand and leg movements. The video shows the entire inference process, along with the learned parameters. The inferred subspace coordinates and position estimates over the entire sequence can be used to learn an autoregressive model, which can then generate a longer sequence, also known as a video texture. This video shows the video texture obtained in two ways: in one case, frames are generated from the model using the learned parameters; in the other, each generated frame is replaced with an example from the training set whose subspace coordinates match the coordinates generated by the autoregressive model (up to 5th order match).
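One way such an autoregressive model could be fit and rolled forward is sketched below. The page does not say how the model was estimated, so the least-squares fit, the Gaussian innovation noise, and the names `fit_ar` and `generate_ar` are all assumptions made for illustration. Each row of `series` stands for one frame's inferred state, for example the subspace coordinates concatenated with the position estimate.

```python
import numpy as np


def fit_ar(series, order):
    """Least-squares fit of s_t = sum_k A_k s_{t-k} + b + e_t.

    series: (T, D) array of per-frame states.
    Returns coefficients of shape (order * D + 1, D) and the residual std (D,).
    """
    num_frames = series.shape[0]
    # Each design row stacks the previous `order` states, most recent first,
    # followed by a constant 1 for the bias term.
    lagged = [series[order - k:num_frames - k] for k in range(1, order + 1)]
    design = np.hstack(lagged + [np.ones((num_frames - order, 1))])
    targets = series[order:]
    coeffs, *_ = np.linalg.lstsq(design, targets, rcond=None)
    residual_std = np.std(targets - design @ coeffs, axis=0)
    return coeffs, residual_std


def generate_ar(coeffs, residual_std, seed_states, num_steps, rng):
    """Roll the fitted model forward from `order` seed states (oldest first)."""
    order = seed_states.shape[0]
    history = list(seed_states)
    for _ in range(num_steps):
        recent = history[-1:-order - 1:-1]  # most recent state first
        features = np.concatenate(recent + [np.ones(1)])
        noise = residual_std * rng.standard_normal(residual_std.shape)
        history.append(features @ coeffs + noise)
    return np.array(history[order:])
```

Seeding `generate_ar` with the last `order` inferred states and running it for more steps than the training sequence has frames yields a longer state trajectory. Frames can then be rendered from the model or, as in the second case described above, looked up among the training examples.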

 

Modeling motion of the lips:

While a person speaks or reads a text aloud, it is difficult for them to keep still while a video camera records the motion of the mouth. Whereas preprocessing for alignment is a rather arduous task, our approach jointly estimates position and appearance in an unsupervised fashion. As shown in the figure below, we learn a much sharper mean and extract a more interesting subspace representation than PCA does. A video of the position-normalized sequence, shown alongside the original data, gives a sense of the variability in position.

 

 

Clustering face poses:

We can use the mixture of TCA model to cluster objects while at the same time learning a linear subspace representation for each cluster. The figure below shows the results of clustering three different poses of a person walking across a cluttered background. Variations present in this sequence, such as changes in lighting and small out-of-plane rotations, are captured in the subspace. TMG (transformed mixture of Gaussians) also performs clustering invariant to transformations, but because it can capture such changes only in the noise or by using more classes, it does not perform as well as TCA when the number of classes is fixed to be the same:

(a): Parameters of a 3-cluster TMG model and a 3-cluster TCA model

 

(b): Some frames from the video sequence, with the corresponding representations in the TMG and MTCA (mixture of TCA) models, where the class with the highest posterior probability is chosen.

 

(c): Illustration of the role of the components in the first class.

Factor y1 tends to model lighting variation, while y2 tends to model small out-of-plane rotations.