Interested in a commercial license for the extended software toolkit?
Send email to email to Brendan Frey
Data sets and software are provided at the bottom of this page
Clustering by Passing Messages Between Data Points. [PDF] [BibTeX]
Brendan J. Frey and Delbert Dueck, University of Toronto
Science 315, 972976, February 2007
Applications and Extensions
Constructing Treatment Portfolios Using Affinity Propagation.
Delbert Dueck, Brendan J. Frey, Nebojsa Jojic, Vladimir Jojic, Guri Giaever, Andrew Emili, Gabe Musso, Robert Hegele
International Conference on Research in Computational Molecular Biology (RECOMB), March 2008, Singapore
Non-metric affinity propagation for unsupervised image categorization.
Delbert Dueck and Brendan J. Frey
International Conference on Computer Vision (ICCV), October 2007, Rio de Janeiro, Brazil
FLoSS: Facility Location for Subspace Segmentation
Nevena Lazic, Inmar E. Givoni, Parham Aarabi and Brendan J. Frey
12th International Conference on Computer Vision (ICCV), October 2009, Kyoto, Japan.
Semi-supervised Affinity Propagation With Instance-Level Constraints
Inmar E. Givoni and Brendan J. Frey
12th International Conference on Artificial Intelligence and Statistics (AISTATS), April 2009, Clearwater, FL.
Solving the Uncapacitated Facility Location Problem Using Message Passing Algorithms.
Nevena Lazic, Brendan J. Frey, Parham Aarabi
International Conference on Artificial Intelligence and Statistics (AISTATS), May 2010, Sardinia
Hierarchical Affinity Propagation
Inmar E. Givoni, Clement Chung, Brendan J. Frey
27th Conference on Uncertainty in Artificial Intelligence (UAI), July 2011, Barcelona, Spain.
What is Affinity Propagation?
How would you identify a small number of face images that together accurately represent a data set of face images? How would you identify a small number of sentences that accurately reflect the content of a document? How would you identify a small number of cities that are most easily accessible from all other cities by commercial airline? How would you identify segments of DNA that reflect the expression properties of genes? Data centers, or exemplars, are traditionally found by randomly choosing an initial subset of data points and then iteratively refining it, but this only works well if that initial choice is close to a good solution.
Affinity propagation is a new algorithm that takes as input measures of similarity between pairs of data points and simultaneously considers all data points as potential exemplars. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. We have used affinity propagation to solve a variety of clustering problems and we found that it uniformly found clusters with much lower error than those found by other methods, and it did so in less than one-hundredth the amount of time. Because of its simplicity, general applicability, and performance, we believe affinity propagation will prove to be of broad value in science and engineering.
The following video shows affinity propagation gradually identifying three examplars among a small set of two-dimensional data points.
OUR BEST SOFTWARE TO DATE
|MATLAB functions||DOWNLOAD apcluster.m
(easy to use/modify, but no sparse capability)
(includes sparse capability)
|R package||APCluster is available through CRAN
(ported from the Matlab code listed above)
|Compiled MEX-file for MATLAB
(faster + sparse, but more complicated)
|Executables and Libraries
(sparse + no MATLAB, but complicated)
.EXE + .DLL
.bin + .so
.bin + .so
Click here for MATLAB function preferenceRange.m, which tells you what range of preferences is appropriate for your data set
Click here for MATLAB function apclusterK.m, which finds a user-specified number of clusters (apcluster.m is called by this function). Warning: This function calls apcluster.m many times, while searching for the preference value that gives the requested number of clusters. If you really want to find exactly K clusters, this function may work for your problem, but if you're willing to play with the preference value yourself (recommended), we recommend you use the basic apcluster.m script.
Click here for an older (Feb. 2007) version of the MATLAB function or unsupported C-code (warning: does not handle sparse or infinite cases reliably).
Click here for an archive of other methods (e.g. k-centers, quantized k-means, etc.)
Click here for the Leveraged Affinity Propagation script (see FAQ for details)
Comparison between Affinity Propagation and the Vertex Substitution Heuristic (VSH).
ULTRA-SHORT MATLAB SCRIPT
Click here for MATLAB script
The only input to this 24-line version is the MATLAB matrix of similarities, S, where S(i,k) is the similarity that data instance k is the exemplar for data instance i (e.g., S(i,k) = -(xi - xk)2 is the similarity based on squared error). The preferences should be placed on the diagonal of this matrix; if appropriate preferences are not known, usually, a good choice for the preferences is to set them all equal to the median of the other similarities. The MATLAB code executes 100 iterations of affinity propagation. After execution, the combined evidence r(i,k)+a(i,k) is stored in the N×N matrix
E, the number of exemplars is stored in
K, the data instance indices of the exemplars are stored in the K-vector
I, and the exemplar indices of the data instances are stored in the N-vector
idx. (Note, instance i is assigned to the data instance with index
CLUSTERING TWO-DIMENSIONAL DATA POINTS
The similarity between every pair of 2D data points was set to the negative squared distance between the points. To prevent degenerate solutions, where affinity propagation tries to place two points in one cluster, but both data points are equally good as cluster centers, Gaussian noise with σ=10-12 was added to the similarities, before affinity propagation was applied.
- Text files containing 2-D data points, similarities, and preferences (median of S)
- MATLAB file containing data, similarities and preferences
CLUSTERING IMAGES DERIVED FROM OLIVETTI FACE DATABASE
Each 64×64 face image from the first 100 images in the Olivetti database was smoothed using a Gaussian kernel with σ=0.5 and then rotated by -10°, 0° and 10° and scaled by a factor of 0.9, 1.0 and 1.1 (using nearest-neighbor interpolation), to produce a total of 900 images. To avoid including the background behind each face, a central window of size 50×50 pixels was extracted. Finally, the pixels in each 50×50 image were normalized to have mean 0 and variance 0.1. The similarity between two images was set to the negative sum of squared pixel differences.
- Image showing all the data instances
- Text files containing similarities (5MB) and preferences
- MATLAB file containing similiarities based on squared error (19MB)
FINDING GENES AND EXONS USING PUTATIVE EXON EXPRESSION DATA
The microarray data reported in Frey et al. (Nature Genetics, 2005) and available at http://www.psi.toronto.edu/genrate was used to construct the following set of pair-wise similarities for 75,066 putative exons from mouse Chromosome 1. By clustering similar putative exons, it is possible to detect bona fide exons and genes. 'Only' 13 million pair-wise similarities were computed, because it is biologically implausible that putative exons that are far apart in the genome are part of the same gene and thus the similarities between distant putative exons need not be computed. To vary the number of detected exons (and genes) increase or decrease all preferences by the same common value, before running affinity propagation. Make sure to use the 'sparse' version, or the software will try to create several 75,067×75,067 full matrices! Also, MATLAB requires 1.2 GB of available RAM for this data, so a clean workspace is important.
- MATLAB file containing similiarities based on proximity of putative exon probes in the genome and similarity in expression of putative exon probes across 12 tissue samples (174MB!)
- Click here for a script that computes the true positive and false positive exon detection rates
DETECTING REPRESENTATIVE SENTENCES IN A MANUSCRIPT
Sentences from the main text of the paper 'Clustering by passing messages between data points' (Science 315, 972976) were used to construct a similarity matrix, based on how useful one sentence was for encoding the words in another sentence.
- Text file containing sentences
- Text file containing similarities based on information coding costs (see supporting online material in Science paper)
- Text file containing the preference that each sentence is a summary sentence, based on information coding costs (longer summary sentences have lower preferences, because they take more bits to encode)
- MATLAB file containing similarities and summary sentence preferences, based on information coding costs
DETECTING CITIES THAT ARE EASILY ACCESSED FROM OTHER CITIES BY COMMERCIAL AIRLINES
A web site providing estimated airline travel times (including stop-overs, headwinds, etc) between pairs of Canadian and American cities was queried to obtain a pair-wise similarity matrix, where the similarity of city i to city k was set to the negative estimated travel time. Note: Results reported in Science were obtained using a damping factor of 0.8.
- Text files containing city names, similarities, and preferences.
- MATLAB file containing similarities and preferences, based on commercial airline transit time between cities
ADDITIONAL DATA SETS CAN BE FOUND HERE