How would you identify a small number of face images that together accurately represent a data set of face images? How would you identify a small number of sentences that accurately reflect the content of a document? How would you identify a small number of cities that are most easily accessible from all other cities by commercial airline? How would you identify segments of DNA that reflect the expression properties of genes? Data centers, or exemplars, are traditionally found by randomly choosing an initial subset of data points and then iteratively refining it, but this only works well if that initial choice is close to a good solution. Affinity propagation is a new algorithm that takes as input measures of similarity between pairs of data points and simultaneously considers all data points as potential exemplars. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. We have used affinity propagation to solve a variety of clustering problems and found that it uniformly identifies clusters with much lower error than those found by other methods, and it did so in less than one-hundredth the amount of time. Because of its simplicity, general applicability, and performance, we believe affinity propagation will prove to be of broad value in science and engineering.
Traditional graphical models express a decomposition of a joint distribution as a product of functions of subsets of variables. However, in some situations, such as when modeling uncertainties in ratings data, it is more natural to think of each function as providing evidence about possible orderings of its variables. We formalize this by introducing a new type of graphical model called a `cumulative distribution network' (CDN), which uses a product of functions of subsets of variables to describe the joint cumulative distribution. The conditional independence properties of CDNs are quite different from other graphical models. Unlike Bayesian networks, Markov networks and factor graphs, in a CDN the marginal distribution of any variable can be computed in constant time whilst allowing for complex conditional dependencies.

J.C. Huang and B.J. Frey. (2008) Cumulative distribution networks and the derivative-sum-product algorithm. Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI). (Best Student Paper Runner-up) [PDF]
We propose a new method of finding sparse matrix factorizations employing probabilistic inference techniques, and apply them to discover structures underlying gene expression data.
Given a data matrix X∈ℜG×T, containing G T-dimensional data points, we perform matrix factorization (X≈Y·Z). Specifically, we find a matrix of hidden factors (or causes), Z∈ℜC×T, and a factor loading matrix, Y∈ℜG×C, under a structural sparseness constraint that each row of Y contains at most N (of C possible) non-zero entries. Intuitively, this corresponds to explaining each row-vector of X as a linear combination (weighted by the corresponding row in Y) of a small subset of factors (causes) given by rows of Z. Note that this framework includes simple clustering (N=1) and ordinary low-rank approximation (N=C) as special cases.
We construct a probability model presuming Gaussian sensor noise in X (i.e. X=Y·Z+noise) and uniformly distributed factor assignments. We utilize a factorized variational method to perform tractable inference on the latent variables, and also account for noise in the factor matrix (Z) and uncertainty in assigning factors (Y). We explore several different factorization depths to allow a tradeoff between increased model accuracy and decreased computational complexity.

We apply our algorithm to mouse gene expression data from the Hughes Lab, as shown in the figure above. We show a data set of G=22709 gene expression profiles (expressed across T=55 tissues) approximated by linear combinations of at most N=3 (of a possible C=50) hidden transcription factor profiles. Experimental results demonstrate that our method can identify genetic functional categories with higher statistical significance than the method currently in greatest use, hierarchical agglomerative clustering.
D. Dueck, J. Huang, Q. D. Morris, and B. J. Frey 2004 Iterative Analysis of Microarray Data 42nd Annual Allerton Conference on Communication, Control, and Computing, September 2004 [PDF]
D. Dueck and B. J. Frey 2004 Probabilistic sparse matrix factorization technical report PSI-2004-23, University of Toronto [PDF]
This work introduces an approach for clustering/classification which is based on the use of local, high-order structure present in the data. We define a cluster as a collection of data points that have similar local properties. Local properties are defined in terns of a relevant probability distribution over subsets of the points in the cluster. This probability distribution serves as a representation of local structure.
For some problems, this local structure might be more relevant for classification than other measures of point similarity used by popular unsupervised and semi-supervised clustering methods. Under this approach, changes in the class label are associated to changes in the local properties of the data. Using this idea, we also pursue to learn how to cluster given examples of clustered data (including from different datasets). We make these concepts formal by presenting a probability model that captures their fundamentals and show that in this setting, learning to cluster is a well defined and tractable task. Based on probabilistic inference methods, we also present an algorithm for computing the posterior probability distribution of class labels for each data point. We have tested this approach in several experiments in the domain of spatial grouping, manifold-noise separation, and functional gene classification.
The figure below illustrates how our method separates a sampled manifold from noise points using a training data set that helps revealing the structure of the data.
![]() Sampled manifold+noise |
![]() Training clusters |
![]() Separated samples |
![]() Ground-truth |
A factor graph (Kschischang, Frey, Loeliger, IEEE Trans Info Theory 2001) is a graphical model that more clearly decomposes interactions between variables, compared to Bayesian networks (BNs) and Markov random fields (MRFs). While functional relationships between variables in BNs and MRFs must be determined by identifying parent-child clusters or maximal cliques, factor graphs explicitly identify functional relationships. Consequently, factor graphs are more useful for describing models that involve a large number of overlapping relationships between variables. Areas where factor graphs are now being used include: gene interaction networks, low-level vision, signal processing, error-correcting codes and n-SAT (satisfiability) problems. In each of these applications, a large number of sets of interacting variables are subsets of other sets of interacting variables, so that many of the relationships are hidden in the conditional probability tables and the clique potentials used in BNs and MRFs. In contrast, every relationship is identified by a function node in a factor graph. The following figure shows a simple example of a factor graph (a) that describes a function on x, y and z of the form f(x,y)g(y,z)h(z,x). To describe these relationships, the MRF on the same 3 variables must have the form shown in (b), but this MRF indicates a functional form m(x,y,z), which does not convey information about the decomposition f(x,y)g(y,z)h(z,x). The BN on the same 3 variables must have the form shown in (c) (in fact, there are 3 forms that are the same up to an exchange of variables). However, this BN indicates a functional form b(x)q(y)s(x,y,z), which does not identify the correct decomposition.

While this example is quite trivial, it illustrates the kind of problem that arises in larger-scale models. For example, in a gene interaction network or an error-correcting code, the overall complexity of the interactions may be so large that the BN or MRF is fully-connected. In contrast, the factor graph may be sparsely-connected. Additional dummy variables may be added to MRFs and BNs so that they can emulate the variable-function decomposition of factor graphs, but since there is no common notation for doing this, the resulting graphical models are not easily interpreted.
Another advantage of factor graphs is that optimization algorithms for inferring the values of hidden variables can take advantage of the extensive decomposition of relationships between variables. In particular, the sum-product algorithm (a.k.a. belief propagation, loopy belief propagation, etc.) and its max-product or min-sum variant are naturally described as message-passing algorithms between the variable nodes and function nodes in the factor graph. The message update rules for BNs and MRFs are more complex than for factor graphs. More importantly, if a BN or MRF is not extended by including dummy variables, optimization in the BN or MRF can (and in many applications does) produce significantly worse results than in the factor graph.
Until recently, the main advantage of BNs and MRFs over factor graphs was their ability to succinctly describe conditional- and marginal-independence relationships. However, an extension of the factor graph notation to include directed edges, has enabled factor graphs to express all conditional- and marginal-independence relationships that can be expressed by BNs and MRFs. Interestingly, the rules for determining independence in a directed factor graph can be viewed as the intersection of the rules for BNs and MRFs, so that the resulting set of rules is simpler, than for BNs or MRFs on there own. The following figure shows an example of a BN (a), its factor graph (b) and its directed factor graph (c). In (d), this directed factor graph is modified to capture an undirected relationship between u and v, a relationship that cannot be expressed by a BN. The MRF corresponding to this modified directed factor graph is shown in (e). The MRF captures the undirected relationship between u and v, but does not capture the directed relationships in the other part of the model. The MRF in (e) is converted to a factor graph in (f), which further illustrates that the MRF has lost information.

Interestingly, it turns out there are useful probability models whose independencies can be exactly expressed by directed factor graphs, but which cannot be expressed by any BN or MRF. It turns out that factor graphs for a strict super-set of BNs and MRFs, both in terms of expressing factorization and expressing independencies.
F. R. Kschischang, B. J. Frey, H. A. Loeliger 2001 Factor graphs and the sum-product algorithm, IEEE Transactions on Information Theory 47:2, 498-519, February. [PDF]
B. J. Frey 2003 Extending factor graphs so as to unify directed and undirected graphical models, in Proceedings of Conference on Uncertainty in Artificial Intelligence 19 (UAI 03), Morgan Kaufmann, CA, Acapulco, Mexico, August