Thanks for the reference to Newman et al. Indeed,...

2008-12-30T15:16:00.000-08:00

Thanks for the reference to Newman et al. Indeed, that work and this one trace their lineage to Blei and Jordan's SIGIR '03 paper "Modeling annotated data," which extends LDA with the same trick of generating more than one set of observations from the per-document multinomial (in that case, the extra observations were image regions). I look forward to checking out the additions to LingPipe.

What you're calling "MM-LDA" is what Newman, Chemu...

2008-12-08T10:00:00.000-08:00

What you're calling "MM-LDA" is what Newman, Chemudugunta and Smyth in their KDD '06 paper "Statistical Entity-Topic Models" called "CI-LDA". The "CI" was for "conditionally independent", because the two views (tags and words here; entities and words in their paper) are being generated independently given the document's topic distribution.
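
To make the conditional independence concrete, here is a minimal sketch of that generative story for a single document. It is only an illustration under stated assumptions: a symmetric Dirichlet prior `alpha`, per-topic word and tag distributions supplied by hand, and function names (`sample_dirichlet`, `sample_categorical`, `generate_document`) that are hypothetical and come from neither paper nor from LingPipe.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(word_topics, tag_topics, n_words, n_tags, alpha, rng):
    """Generate words and tags for one document (CI-LDA-style sketch).

    word_topics[k] and tag_topics[k] are the topic-k distributions over the
    word and tag vocabularies (assumed given here, not learned).
    """
    num_topics = len(word_topics)
    # One topic distribution per document, shared by both views.
    theta = sample_dirichlet([alpha] * num_topics, rng)

    # View 1: each word picks its own topic from theta, then a word.
    words = []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)
        words.append(sample_categorical(word_topics[z], rng))

    # View 2: each tag also picks its own topic from theta. Nothing links
    # a tag's topic to any word's topic except the shared theta.
    tags = []
    for _ in range(n_tags):
        z = sample_categorical(theta, rng)
        tags.append(sample_categorical(tag_topics[z], rng))

    return theta, words, tags


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy setup: 2 topics, 3 word types, 2 tag types (made-up numbers).
    word_topics = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
    tag_topics = [[0.9, 0.1], [0.1, 0.9]]
    theta, words, tags = generate_document(
        word_topics, tag_topics, n_words=10, n_tags=3, alpha=0.5, rng=rng
    )
    print(theta, words, tags)
```

Because words and tags interact only through `theta`, nothing in this setup forces a given topic to be informative in both views.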

They found that the CI-LDA model tended to devote some of the K topics to entities and others to words, so they went on to develop several more latent topic models to account for the correlations between the two. They also cite a bunch of previous work on related author-topic models.

There's also a bunch of somewhat related work on co-clustering (or bi-clustering), which is popular in genomic microarray analysis. It's aimed at clustering the rows and columns of a co-occurrence matrix simultaneously to get a single clustering that explains both.

Part of our NIH SBIR involves using similar latent topic models to deal with text along with gene mentions (entities) plus MeSH terms (tags) to cluster query-specific subsets of MEDLINE. I'll be adding some more collapsed Gibbs samplers to LingPipe over the next nine months to deal with a range of these entity-topic-tag models.

Comments on Stanford InfoBlog: Clustering the Tagged Web (Posted by Daniel Ramage)
