original:http://sujitpal.blogspot.com/2011/10/computing-document-similarity-using.html
Someone asked me a question recently about implementing document similarity, and since he was using Lucene, I pointed him to the Lucene Term Vector API. I hadn’t used the API myself, but I knew in general how it worked, so when he came back and told me that he could not make it work, I wanted to try it out for myself, to give myself a basis for asking further questions.
I already had a Lucene index (built by SOLR) of about 3000 medical articles for whose content field I had enabled term vectors as part of something I was trying out for highlighting, so I decided to use that. If you want to follow along and have to build your index from scratch, you can either use a field definition in your SOLR schema.xml file similar to this:
1 2 |
<field name="content" type="text" indexed="true" stored="true"
termVectors="true" termPositions="true" termOffsets="true"/>
|
or enable term vectors in your Lucene indexing code like so:
1 2 |
doc.addField(new Field("content", content, Store.YES, Index.ANALYZED,
TermVector.WITH_POSITIONS_OFFSETS));
|
Note that you don’t need positions and offsets enabled for term vectors for doing what I am doing, but I had it on in my schema.xml file already, so decided (to avoid confusion) to show the equivalent Lucene call in the code snippet above.
Once you have the index, find the list of all the terms in the “content” field across the entire index. These terms would represent the entries in the document vector for a document. Then for the documents which you want to consider for your similarity computation, extract its term vector. The term vector gives you two arrays, an array of terms within this document, and the corresponding frequency of that term in this document. Using these three data structures, it is easy to construct a (sparse) document vector representing the document(s).
Once you have the two document vectors, the similarity between the two can be computed using Cosine Similarity (or other measure – most of them work with document vectors).
For my test, I chose 4 documents out of my corpus, the titles of which (and their associated Lucene document IDs) are shown below:
1 2 3 4 |
Document#0: Mitral valve surgery - minimally invasive (31825) Document#1: Mitral valve surgery - open (31835) Document#2: Laryngectomy (31706) Document#3: Shaken baby syndrome (30) |
Notice that the first two are both about surgery and about the same body part, so intuitively I should expect a very high similarity between the documents 0 and 1. Document 2 is still about surgery, so I should expect a good, but smaller similarity value between documents 0 and 2. Document 3 is completely different, so I should expect an even smaller similarity value between documents 0 and 3.
The code for this is shown below (as a JUnit test).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 |
package com.mycompany.termpos;
import java.io.File;
import java.util.HashMap;
import java.util.Map;
import org.apache.commons.math.linear.OpenMapRealVector;
import org.apache.commons.math.linear.RealVectorFormat;
import org.apache.commons.math.linear.SparseRealVector;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.FSDirectory;
import org.junit.Assert;
import org.junit.Test;
public class TermVectorBasedSimilarityTest {
@Test
public void testSimilarity() throws Exception {
IndexReader reader = IndexReader.open(
FSDirectory.open(new File("/path/to/my/index")));
// first find all terms in the index
Map<String,Integer> terms = new HashMap<String,Integer>();
TermEnum termEnum = reader.terms(new Term("content"));
int pos = 0;
while (termEnum.next()) {
Term term = termEnum.term();
if (! "content".equals(term.field()))
break;
terms.put(term.text(), pos++);
}
int[] docIds = new int[] {31825, 31835, 31706, 30};
DocVector[] docs = new DocVector[docIds.length];
int i = 0;
for (int docId : docIds) {
TermFreqVector[] tfvs = reader.getTermFreqVectors(docId);
Assert.assertTrue(tfvs.length == 1);
docs[i] = new DocVector(terms);
for (TermFreqVector tfv : tfvs) {
String[] termTexts = tfv.getTerms();
int[] termFreqs = tfv.getTermFrequencies();
Assert.assertEquals(termTexts.length, termFreqs.length);
for (int j = 0; j < termTexts.length; j++) {
docs[i].setEntry(termTexts[j], termFreqs[j]);
}
}
docs[i].normalize();
i++;
}
// now get similarity between doc[0] and doc[1]
double cosim01 = getCosineSimilarity(docs[0], docs[1]);
System.out.println("cosim(0,1)=" + cosim01);
// between doc[0] and doc[2]
double cosim02 = getCosineSimilarity(docs[0], docs[2]);
System.out.println("cosim(0,2)=" + cosim02);
// between doc[0] and doc[3]
double cosim03 = getCosineSimilarity(docs[0], docs[3]);
System.out.println("cosim(0,3)=" + cosim03);
reader.close();
}
private double getCosineSimilarity(DocVector d1, DocVector d2) {
return (d1.vector.dotProduct(d2.vector)) /
(d1.vector.getNorm() * d2.vector.getNorm());
}
private class DocVector {
public Map<String,Integer> terms;
public SparseRealVector vector;
public DocVector(Map<String,Integer> terms) {
this.terms = terms;
this.vector = new OpenMapRealVector(terms.size());
}
public void setEntry(String term, int freq) {
if (terms.containsKey(term)) {
int pos = terms.get(term);
vector.setEntry(pos, (double) freq);
}
}
public void normalize() {
double sum = vector.getL1Norm();
vector = (SparseRealVector) vector.mapDivide(sum);
}
public String toString() {
RealVectorFormat formatter = new RealVectorFormat();
return formatter.format(vector);
}
}
}
|
The results of running this test are shown below, and match up with my expectations, as shown below:
1 2 3 |
cosim(0,1)=0.9672563304581603 cosim(0,2)=0.7496880467405691 cosim(0,3)=0.23968406019250035 |
Using Lucene’s term vectors to generate document vectors can be useful not only for similarity calculations, but for other tasks where you need document vectors, such as clustering and classification. Computing document vectors from raw data is typically more involved (requires development time and can have scalability issues) than loading data into a Lucene index.
The other scalability issue is the size of the document vectors. My index is relatively tiny (about 3.000 documents), but we are still talking 10,000 unique terms after stopwording and stemming. However, document vectors are typically quite sparse (my test document vectors have an average density of 0.01), so an OpenMapRealVector is a good choice of data structure, and should scale to much larger sizes as well.