Sunday, 18 August 2013

Construct document-term matrix via Java and MapReduce

Background:
I'm trying to make a "document-term" matrix in Java on Hadoop using
MapReduce. A document-term matrix is like a huge table where each row
represents a document and each column represents a possible word/term.
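To make that concrete, here is a tiny self-contained Java sketch (the class name, terms, and documents are made up for illustration) that builds such a matrix in memory: one row per document, one column per term, and each cell holding the term's count in that document.

```java
import java.util.*;

public class DocTermMatrixDemo {

    // Build a dense document-term matrix: rows are documents,
    // columns follow the given term -> column-number index,
    // and each cell holds the count of that term in that document.
    static int[][] buildMatrix(String[] docs, Map<String, Integer> termIndex) {
        int[][] matrix = new int[docs.length][termIndex.size()];
        for (int row = 0; row < docs.length; row++) {
            for (String term : docs[row].split("\\s+")) {
                Integer col = termIndex.get(term);
                if (col != null) {          // ignore terms not in the index
                    matrix[row][col]++;
                }
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        // Toy term index list: term -> column number.
        Map<String, Integer> termIndex = Map.of("hadoop", 0, "java", 1, "matrix", 2);
        String[] docs = { "hadoop java hadoop", "matrix java" };
        System.out.println(Arrays.deepToString(buildMatrix(docs, termIndex)));
        // [[2, 1, 0], [0, 1, 1]]
    }
}
```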
Problem Statement:
Assuming that I already have a term index list (so that I know which term
is associated with which column number), what is the best way to look up
the index for each term in each document so that I can build the matrix
row-by-row (i.e., document-by-document)?
So far I can think of two approaches:
Approach #1:
Store the term index list on the Hadoop distributed file system. Each time
a mapper reads a new document for indexing, spawn a new MapReduce job --
one job for each unique word in the document, where each job queries the
term index list on HDFS for its term. This approach sounds like overkill,
since I am guessing there is some overhead associated with starting up a
new job, and since this approach might call for tens of millions of jobs.
Also, I'm not sure if it's possible to call a MapReduce job within another
MapReduce job.
Approach #2:
Append the term index list to each document so that each mapper ends up
with a local copy of the term index list. This approach is pretty wasteful
with storage (there will be as many copies of the term index list as there
are documents). Also, I'm not sure how to merge the term index list with
each document -- would I merge them in a mapper or in a reducer?
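As a sketch of the mapper-side lookup in Approach #2 (class and method names below are hypothetical): rather than physically appending the index to every document, each mapper could hold a single in-memory copy of the term index -- e.g. loaded once in the mapper's setup() from a file shipped via Hadoop's distributed cache -- so the merge happens in the mapper, not the reducer. The lookup itself then looks something like this, emitting one sparse row (column -> count) per document:

```java
import java.util.*;

// Hypothetical helper showing the per-mapper half of Approach #2:
// the term index is held once in memory per mapper task instead of
// being appended to every document on disk.
public class SparseRowBuilder {
    private final Map<String, Integer> termIndex;

    public SparseRowBuilder(Map<String, Integer> termIndex) {
        this.termIndex = termIndex;   // in a real job: filled in Mapper.setup()
    }

    // Turn one document into a sparse row: column number -> count.
    // This is roughly what each map() call would emit, keyed by document id.
    public SortedMap<Integer, Integer> rowFor(String document) {
        SortedMap<Integer, Integer> row = new TreeMap<>();
        for (String term : document.toLowerCase().split("\\W+")) {
            Integer col = termIndex.get(term);
            if (col != null) {
                row.merge(col, 1, Integer::sum);
            }
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Integer> index = Map.of("hadoop", 0, "java", 1, "matrix", 2);
        SparseRowBuilder builder = new SparseRowBuilder(new HashMap<>(index));
        System.out.println(builder.rowFor("Java and Hadoop, Java again"));
        // {0=1, 1=2}
    }
}
```

Emitting sparse rows also sidesteps writing mostly-zero dense rows, which matters once the vocabulary reaches millions of terms.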
