How would you identify a
small number of face images that together accurately represent a data set of
face images? How would you identify a small number of sentences that
accurately reflect the content of a document? How would you identify a small
number of cities that are most easily accessible from all other cities by
commercial airline? How would you identify segments of DNA that reflect the
expression properties of genes? Data centers, or exemplars, are
traditionally found by randomly choosing an initial subset of data points
and then iteratively refining it, but this only works well if that initial
choice is close to a good solution. Affinity propagation is a new algorithm
that takes as input measures of similarity between pairs of data points and
simultaneously considers all data points as potential exemplars.
Real-valued messages are exchanged between data points until a high-quality
set of exemplars and corresponding clusters gradually emerges. We have used
affinity propagation to solve a variety of clustering problems and we found
that it uniformly found clusters with much lower error than those found by
other methods, and it did so in less than one-hundredth the amount of time.
Because of its simplicity, general applicability, and performance, we
believe affinity propagation will prove to be of broad value in science and
engineering.
The following
video shows affinity propagation gradually identifying three
examplars among a small set of two-dimensional data points.
(Higher-quality video)
Software
OUR BEST SOFTWARE TO DATE
OTHER MATLAB FUNCTIONS
Click here for M
ATLAB function
preferenceRange.m,
which tells you what range of preferences is appropriate for your data set
Click here for M
ATLAB function
apclusterK.m,
which finds a user-specified number of clusters (apcluster.m is called by this function).
Warning: This function calls apcluster.m many times, while searching for the preference value that gives the requested number of clusters. If you really want to find exactly
K clusters, this function may work for your problem, but if you're willing to play with the preference value yourself (recommended), we recommend you use the basic apcluster.m script.
Click here for an older (Feb. 2007) version of the M
ATLAB function or
unsupported C-code (warning: does not handle sparse or infinite cases reliably).
Click here for an archive of other methods (e.g.
k-centers, quantized
k-means, etc.)
Click here for the Leveraged Affinity Propagation script (see
FAQ for details)
Comparison between Affinity Propagation and the Vertex Substitution Heuristic (VSH).
ULTRA-SHORT MATLAB SCRIPT
Click here for MATLAB script
The only input to this 24-line version is the M
ATLAB matrix of similarities,
S, where
S(
i,
k) is
the similarity that data instance
k is
the exemplar for data instance
i (e.g.,
S(
i,
k) = -(
xi -
xk)
2
is the similarity based on squared error). The preferences should be placed
on the diagonal of this matrix; if appropriate preferences are not known,
usually, a good choice for the preferences is to set them all equal to the
median of the other similarities. The M
ATLAB code executes 100 iterations of
affinity propagation. After execution, the combined evidence
r(
i,
k)+
a(
i,
k)
is stored in the
N×
N
matrix
E, the number of exemplars is stored in
K, the data instance indices of the exemplars
are stored in the
K-vector
I, and the exemplar indices of the data
instances are stored in the
N-vector
idx.
(Note, instance
i
is assigned to the data instance with index
idx(i).)
Data Sets
CLUSTERING TWO-DIMENSIONAL DATA POINTS
The similarity between every pair of 2D data points was set to the negative
squared distance between the points. To prevent degenerate solutions, where
affinity propagation tries to place two points in one cluster, but both data
points are equally good as cluster centers, Gaussian noise with σ=10
-12 was added to the
similarities, before affinity propagation was applied.
CLUSTERING IMAGES DERIVED FROM OLIVETTI FACE DATABASE
Each 64×64 face image from the first 100 images in the Olivetti database was
smoothed using a Gaussian kernel with σ=0.5 and then rotated by
-10°, 0° and 10° and scaled by a factor of 0.9, 1.0 and 1.1 (using nearest-neighbor
interpolation), to produce a total of 900 images. To avoid including the
background behind each face, a central window of size 50×50 pixels was
extracted. Finally, the pixels in each 50×50 image were normalized to have
mean 0 and variance 0.1. The similarity between two images was set to the
negative sum of squared pixel differences.
FINDING GENES AND EXONS USING PUTATIVE EXON EXPRESSION DATA
The microarray data reported in Frey et al. (
Nature Genetics, 2005) and
available at
http://www.psi.toronto.edu/genrate
was used to construct the following set of pair-wise similarities for 75,066
putative exons from mouse Chromosome 1. By clustering similar putative exons,
it is possible to detect bona fide exons and genes. 'Only' 13 million
pair-wise similarities were computed, because it is biologically implausible
that putative exons that are far apart in the genome are part of the same
gene and thus the similarities between distant putative exons need not be
computed. To vary the number of detected exons (and genes) increase or
decrease all preferences by the same common value, before running affinity
propagation. Make sure to use the 'sparse' version, or the software will try to
create several 75,067×75,067 full matrices! Also, M
ATLAB requires 1.2 GB of
available RAM for this data, so a clean workspace is important.
DETECTING REPRESENTATIVE SENTENCES IN A MANUSCRIPT
Sentences from the main text of the paper 'Clustering by passing messages
between data points' (
Science 315, 972976) were used to construct a
similarity matrix, based on how useful one sentence was for encoding the
words in another sentence.
DETECTING CITIES THAT ARE EASILY ACCESSED FROM OTHER CITIES BY COMMERCIAL
AIRLINES
A web site providing estimated airline travel times (including stop-overs,
headwinds, etc) between pairs of Canadian and American cities was queried to
obtain a pair-wise similarity matrix, where the similarity of city
i to city
k was set to the negative estimated travel time.
Note: Results reported in Science were obtained using a
damping factor of 0.8.