Fast Transformation-Invariant Component Analysis
Anitha Kannan, Nebojsa Jojic & Brendan Frey
Transformation-invariant component analysis (TCA) [1] is a probabilistic
dimensionality reduction method that accounts for global transformations, such
as translations and rotations, while learning local linear appearance
deformations. Learning this model with the EM algorithm requires on the order
of O(N^2) computation, where N is the number of elements in each training
example. This is prohibitive for many applications of interest, such as
modeling mid- to large-size images.
In this work, we present an efficient EM algorithm for TCA that reduces the
computational requirements to O(N log N). For 256x256 images, this is about
4000 times faster!
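As a rough check of the quoted factor, assuming base-2 logarithms and ignoring the constant factors hidden by the O(.) notation (both are assumptions, not statements from the paper), the ratio of the two costs for a 256x256 image works out as follows:

```latex
N = 256 \times 256 = 65536, \qquad
\frac{N^2}{N \log_2 N} = \frac{N}{\log_2 N} = \frac{65536}{16} = 4096 \approx 4000
```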
The proposed algorithm allows TCA to be used in the analysis of realistic
data. In addition, it facilitates using TCA as a sub-module in other
applications that require learning transformation-invariant subspaces. An
example is modeling images using a layered decomposition, where each layer is
explained by a mixture of TCA models.
Publication:
Fast Transformation-Invariant Component Analysis [Long version pdf]
A. Kannan, N. Jojic, B. Frey
Submitted to Intl. Journal of Computer Vision special issue: Learning for Vision and Vision for Learning
Fast Transformation-Invariant Component Analysis (or Factor Analysis) [pdf]
A. Kannan, N. Jojic, B. Frey
In Advances in Neural Information Processing Systems (NIPS) 2003
MATLAB® software
Illustration of the model:
(a): Generative model for a mixture of TCA: the class index c is sampled from
the discrete distribution p(c), and the subspace coordinates y are generated
from p(y) = N(y; 0, I) to obtain the latent image z = μ_c + Λ_c y + noise. z is
then transformed according to a transformation T sampled from p(T) to obtain
the observation x = Tz + noise. We can view T as a transformation matrix that
acts on the latent image z to generate x. (b): An illustration of TCA. The
subspace coordinates y and the image position (T_x, T_y) are inferred by
training the model on a captured video sequence.
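The generative process in (a) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' MATLAB code: the function name `sample_mtca`, the isotropic noise levels, the toy sizes, and the choice of cyclic 2-D shifts with a uniform p(T) are all hypothetical.

```python
import numpy as np

def sample_mtca(pi, mu, Lam, shape, rng, latent_noise=0.01, obs_noise=0.01):
    """Draw one image from a toy mixture-of-TCA model.

    pi:    (C,) class prior p(c)
    mu:    (C, N) class means
    Lam:   (C, N, K) class subspace bases
    shape: (H, W) image shape with H * W == N
    """
    H, W = shape
    C, N, K = Lam.shape
    c = rng.choice(C, p=pi)            # class index c ~ p(c)
    y = rng.standard_normal(K)         # subspace coordinates y ~ N(0, I)
    # Latent image: z = mu_c + Lam_c y + noise (isotropic noise is an assumption)
    z = mu[c] + Lam[c] @ y + latent_noise * rng.standard_normal(N)
    # Assumption: T is a cyclic 2-D translation drawn uniformly at random
    tx, ty = int(rng.integers(W)), int(rng.integers(H))
    x = np.roll(z.reshape(H, W), shift=(ty, tx), axis=(0, 1))
    # Observation: x = T z + noise
    x = x + obs_noise * rng.standard_normal((H, W))
    return x, (c, y, (tx, ty))

# Toy usage with random parameters (sizes chosen only for illustration)
rng = np.random.default_rng(0)
C, K, H, W = 2, 3, 8, 8
pi = np.array([0.5, 0.5])
mu = rng.standard_normal((C, H * W))
Lam = rng.standard_normal((C, H * W, K))
x, (c, y, (tx, ty)) = sample_mtca(pi, mu, Lam, (H, W), rng)
```

Learning reverses this process: given only x, EM infers the class c, the coordinates y, and the shift (T_x, T_y).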
Ideas / mild assumptions behind the speedup:
Central idea:
All of the expensive computations involve computing a correlation or a
convolution! When performed using the fast Fourier transform (FFT), these are
O(N log N), as opposed to O(N^2) when done naively. The paper shows how these
computations arise when some mild assumptions are made.
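To illustrate the central idea, the sketch below scores an image against a template under every cyclic 2-D shift, first naively and then with the FFT via the correlation theorem. It is a minimal example assuming real-valued images and cyclic translations. The function names are hypothetical, and this is not the paper's full E-step.

```python
import numpy as np

def shift_scores_naive(x, z):
    """scores[ty, tx] = sum of x * (z cyclically shifted by (ty, tx)).

    N shifts times N pixels gives O(N^2) work for an N-pixel image.
    """
    H, W = x.shape
    scores = np.empty((H, W))
    for ty in range(H):
        for tx in range(W):
            scores[ty, tx] = np.sum(x * np.roll(z, (ty, tx), axis=(0, 1)))
    return scores

def shift_scores_fft(x, z):
    """Same scores for all shifts at once, in O(N log N).

    Uses the correlation theorem for real inputs:
    corr(x, z) = IFFT( FFT(x) * conj(FFT(z)) ).
    """
    X = np.fft.fft2(x)
    Z = np.fft.fft2(z)
    return np.real(np.fft.ifft2(X * np.conj(Z)))

# Toy usage: recover a known shift of a random template
rng = np.random.default_rng(0)
z = rng.standard_normal((16, 16))
x = np.roll(z, (3, 5), axis=(0, 1))
scores = shift_scores_fft(x, z)
ty, tx = np.unravel_index(np.argmax(scores), scores.shape)
```

Both functions return the same array up to floating-point error. Only the cost differs, which is where the speedup comes from when the image has tens of thousands of pixels.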
Assumptions:
Experiments:
NOTE:
The datasets used in these experiments cannot be processed using the TCA model
without the tremendous speedup obtained in this work.
Modeling a walk sequence:
A TCA model is trained on a 250-frame, 165x285 video sequence of a person
walking. The goal here is to learn a compact representation for the
dynamically and periodically changing hand and leg movements. The video
shows the entire inference process, along with the learned parameters. The
inferred subspace coordinates and position estimates over the entire sequence
can be used to learn an autoregressive model, which can then be used to
generate a longer sequence, also known as a video texture. This video shows
the video texture so obtained in two ways: in one case we use the learned
parameters to generate frames from the model, while in the other we replace
each generated frame with an example from the training set whose subspace
coordinates match the coordinates generated by the autoregressive model
(up to 5th order match).
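The autoregressive step can be sketched as a least-squares fit on the inferred per-frame coordinates, followed by a forward rollout. This is a minimal sketch under assumptions: the page does not say how the autoregressive model was fitted, so the linear least-squares estimator, the function names `fit_ar` and `generate_ar`, and the Gaussian driving noise are all hypothetical choices.

```python
import numpy as np

def fit_ar(Y, order):
    """Least-squares fit of Y[t] ~ sum_k A_k Y[t-k] for a (T, D) sequence Y."""
    T, D = Y.shape
    # Row for time t holds [Y[t-1], ..., Y[t-order]] concatenated.
    past = np.hstack([Y[order - k:T - k] for k in range(1, order + 1)])
    target = Y[order:]
    coef, *_ = np.linalg.lstsq(past, target, rcond=None)
    return coef  # shape (order * D, D)

def generate_ar(coef, seed, n_steps, noise_std, rng):
    """Roll the model forward from `seed`, an (order, D) array, oldest row first."""
    order, D = seed.shape
    history = [row for row in seed]
    out = []
    for _ in range(n_steps):
        # Most recent frame first, matching the column layout used in fit_ar
        past = np.concatenate(history[-order:][::-1])
        nxt = past @ coef + noise_std * rng.standard_normal(D)
        history.append(nxt)
        out.append(nxt)
    return np.array(out)

# Toy usage: Y stands in for inferred subspace coordinates and positions
rng = np.random.default_rng(0)
t = np.arange(250)
Y = np.stack([np.sin(0.2 * t), np.cos(0.2 * t)], axis=1)
coef = fit_ar(Y, order=2)
longer = generate_ar(coef, Y[-2:], n_steps=500, noise_std=0.01, rng=rng)
```

Each generated row can then either be decoded through the learned model parameters or matched to the nearest training frame, corresponding to the two variants described above.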
Modeling motion of the lips:
While a person speaks or reads a text aloud, it is difficult for them to hold
still while a video camera records the motion of the mouth. Whereas
preprocessing for alignment is a rather arduous task, our approach jointly
estimates the position and the appearance in an unsupervised fashion. As shown
in the figure below, we learn a much sharper mean and extract a more
interesting subspace representation than PCA. We can also see a video of the
position-normalized sequence along with the original data to get a sense of
the variability in position.

Clustering face poses:
We can use the mixture of TCA model to cluster objects while at the same time
learning the linear subspace representation for each cluster. The figure below
shows the results of clustering three different poses of a person walking
across a cluttered background. Variations such as changes in lighting and
small out-of-plane rotations that are present in this sequence are captured in
the subspace. TMG (transformed mixture of Gaussians) performs clustering
invariant to transformations, but since it can capture such changes only in
the noise or through a larger number of classes, it does not perform as well
as TCA when the number of classes is fixed to be the same:

(a): Parameters of a 3-cluster TMG model and a 3-cluster TCA model.
(b): Some frames from the video sequence and the corresponding representation
in the TMG and MTCA models, where the class with the highest posterior
probability is chosen.
(c): Illustration of the role of the components in the first class. Factor y_1
tends to model lighting variation, while y_2 tends to model small out-of-plane
rotations.