Discuss New Concept,New Technic,New Tools, Including EAI,BPM,SOA,Tibco,IBM MQ,Tuxedo, Cloud,Hadoop,NoSQL,J2EE,Ruby,Scala,Python, Performance,Scalability,Distributed,HA, Social Network,Machine Learning.

NLP

Apr 302013
 
 [repost ]A Survey on Automatic Text Summarization  April 30, 2013  Posted by on April 30, 2013 at 10:56 pm NLP Tagged with:  No Responses »

original:http://www.cs.cmu.edu/~nasmith/LS2/das-martins.07.pdf

A Survey on Automatic Text Summarization
Dipanjan Das Andre F.T. Martins
Language Technologies Institute
Carnegie Mellon University
fdipanjan, afmg@cs.cmu.edu
November 21, 2007
Abstract
The increasing availability of online information has necessitated intensive
research in the area of automatic text summarization within the Natural Lan-
guage Processing (NLP) community. Over the past half a century, the prob-
lem has been addressed from many di erent perspectives, in varying domains
and using various paradigms. This survey intends to investigate some of the
most relevant approaches both in the areas of single-document and multiple-
document summarization, giving special emphasis to empirical methods and
extractive techniques. Some promising approaches that concentrate on speci c
details of the summarization problem are also discussed. Special attention is
devoted to automatic evaluation of summarization systems, as future research
on summarization is strongly dependent on progress in this area.
1 Introduction
The sub eld of summarization has been investigated by the NLP community for
nearly the last half century. Radev et al. (2002) de ne a summary as \a text that
is produced from one or more texts, that conveys important information in the
original text(s), and that is no longer than half of the original text(s) and usually
signi cantly less than that”. This simple de nition captures three important aspects
that characterize research on automatic summarization:
 Summaries may be produced from a single document or multiple documents,
 Summaries should preserve important information,
 Summaries should be short.
Even if we agree unanimously on these points, it seems from the literature that
any attempt to provide a more elaborate de nition for the task would result in
disagreement within the community. In fact, many approaches di er on the manner
of their problem formulations. We start by introducing some common terms in the
1summarization dialect: extraction is the procedure of identifying important sections
of the text and producing them verbatim; abstraction aims to produce important
material in a new way; fusion combines extracted parts coherently; and compression
aims to throw out unimportant sections of the text (Radev et al., 2002).
Earliest instances of research on summarizing scienti c documents proposed
paradigms for extracting salient sentences from text using features like word and
phrase frequency (Luhn, 1958), position in the text (Baxendale, 1958) and key
phrases (Edmundson, 1969). Various work published since then has concentrated on
other domains, mostly on newswire data. Many approaches addressed the problem
by building systems depending of the type of the required summary. While extractive
summarization is mainly concerned with what the summary content should be, usu-
ally relying solely on extraction of sentences, abstractive summarization puts strong
emphasis on the form, aiming to produce a grammatical summary, which usually
requires advanced language generation techniques. In a paradigm more tuned to
information retrieval (IR), one can also consider topic-driven summarization, that
assumes that the summary content depends on the preference of the user and can
be assessed via a query, making the nal summary focused on a particular topic.
A crucial issue that will certainly drive future research on summarization is
evaluation. During the last fteen years, many system evaluation competitions like
TREC,1 DUC2 and MUC3 have created sets of training material and have estab-
lished baselines for performance levels. However, a universal strategy to evaluate
summarization systems is still absent.
In this survey, we primarily aim to investigate how empirical methods have been
used to build summarization systems. The rest of the paper is organized as fol-
lows: Section 2 describes single-document summarization, focusing on extractive
techniques. Section 3 progresses to discuss the area of multi-document summariza-
tion, where a few abstractive approaches that pioneered the eld are also considered.
Section 4 brie y discusses some unconventional approaches that we believe can be
useful in the future of summarization research. Section 5 elaborates a few eval-
uation techniques and describes some of the standards for evaluating summaries
automatically. Finally, Section 6 concludes the survey.
2 Single-Document Summarization
Usually, the ow of information in a given document is not uniform, which means
that some parts are more important than others. The major challenge in summa-
rization lies in distinguishing the more informative parts of a document from the
less ones. Though there have been instances of research describing the automatic
creation of abstracts, most work presented in the literature relies on verbatim ex-
traction of sentences to address the problem of single-document summarization. In
1See http://trec.nist.gov/.
2See http://duc.nist.gov/.
3See http://www.itl.nist.gov/iad/894.02/related projects/muc/proceedings/.
muc 7 toc.html
2this section, we describe some eminent extractive techniques. First, we look at early
work from the 1950s and 60s that kicked o research on summarization. Second,
we concentrate on approaches involving machine learning techniques published in
the 1990s to today. Finally, we brie y describe some techniques that use a more
complex natural language analysis to tackle the problem.
2.1 Early Work
Most early work on single-document summarization focused on technical documents.
Perhaps the most cited paper on summarization is that of (Luhn, 1958), that de-
scribes research done at IBM in the 1950s. In his work, Luhn proposed that the
frequency of a particular word in an article provides an useful measure of its sig-
ni cance. There are several key ideas put forward in this paper that have assumed
importance in later work on summarization. As a rst step, words were stemmed to
their root forms, and stop words were deleted. Luhn then compiled a list of content
words sorted by decreasing frequency, the index providing a signi cance measure of
the word. On a sentence level, a signi cance factor was derived that re ects the
number of occurrences of signi cant words within a sentence, and the linear distance
between them due to the intervention of non-signi cant words. All sentences are
ranked in order of their signi cance factor, and the top ranking sentences are nally
selected to form the auto-abstract.
Related work (Baxendale, 1958), also done at IBM and published in the same
journal, provides early insight on a particular feature helpful in nding salient parts
of documents: the sentence position. Towards this goal, the author examined 200
paragraphs to nd that in 85% of the paragraphs the topic sentence came as the rst
one and in 7% of the time it was the last sentence. Thus, a naive but fairly accurate
way to select a topic sentence would be to choose one of these two. This positional
feature has since been used in many complex machine learning based systems.
Edmundson (1969) describes a system that produces document extracts. His
primary contribution was the development of a typical structure for an extractive
summarization experiment. At rst, the author developed a protocol for creating
manual extracts, that was applied in a set of 400 technical documents. The two
features of word frequency and positional importance were incorporated from the
previous two works. Two other features were used: the presence of cue words
(presence of words like signi cant, or hardly), and the skeleton of the document
(whether the sentence is a title or heading). Weights were attached to each of these
features manually to score each sentence. During evaluation, it was found that about
44% of the auto-extracts matched the manual extracts.
2.2 Machine Learning Methods
In the 1990s, with the advent of machine learning techniques in NLP, a series of semi-
nal publications appeared that employed statistical techniques to produce document
extracts. While initially most systems assumed feature independence and relied on
naive-Bayes methods, others have focused on the choice of appropriate features and
3on learning algorithms that make no independence assumptions. Other signi cant
approaches involved hidden Markov models and log-linear models to improve ex-
tractive summarization. A very recent paper, in contrast, used neural networks and
third party features (like common words in search engine queries) to improve purely
extractive single document summarization. We next describe all these approaches
in more detail.
2.2.1 Naive-Bayes Methods
Kupiec et al. (1995) describe a method derived from Edmundson (1969) that is able
to learn from data. The classi cation function categorizes each sentence as worthy
of extraction or not, using a naive-Bayes classi er. Let s be a particular sentence,
S the set of sentences that make up the summary, and F1; : : : ; Fk the features.
Assuming independence of the features:
P(s 2 S j F1; F2; ::Fk) =
Qk
i=1 P(Fi
j s 2 S)  P(s 2 S)
Qk
i=1 P(Fi)
(1)
The features were compliant to (Edmundson, 1969), but additionally included the
sentence length and the presence of uppercase words. Each sentence was given a
score according to (1), and only the n top sentences were extracted. To evaluate
the system, a corpus of technical documents with manual abstracts was used in
the following way: for each sentence in the manual abstract, the authors manually
analyzed its match with the actual document sentences and created a mapping
(e.g. exact match with a sentence, matching a join of two sentences, not matchable,
etc.). The auto-extracts were then evaluated against this mapping. Feature analysis
revealed that a system using only the position and the cue features, along with the
sentence length sentence feature, performed best.
Aone et al. (1999) also incorporated a naive-Bayes classi er, but with richer
features. They describe a system called DimSum that made use of features like
term frequency (tf ) and inverse document frequency (idf) to derive signature words.
4
The idf was computed from a large corpus of the same domain as the concerned
documents. Statistically derived two-noun word collocations were used as units for
counting, along with single words. A named-entity tagger was used and each entity
was considered as a single token. They also employed some shallow discourse analysis
like reference to same entities in the text, maintaining cohesion. The references
were resolved at a very shallow level by linking name aliases within a document
like \U.S.” to \United States”, or \IBM” for \International Business Machines”.
Synonyms and morphological variants were also merged while considering lexical
terms, the former being identi ed by using Wordnet (Miller, 1995). The corpora
used in the experiments were from newswire, some of which belonged to the TREC
evaluations.
4Words that indicate key concepts in a document.
42.2.2 Rich Features and Decision Trees
Lin and Hovy (1997) studied the importance of a single feature, sentence position.
Just weighing a sentence by its position in text, which the authors term as the
\position method”, arises from the idea that texts generally follow a predictable
discourse structure, and that the sentences of greater topic centrality tend to occur in
certain speci able locations (e.g. title, abstracts, etc). However, since the discourse
structure signi cantly varies over domains, the position method cannot be de ned
as naively as in (Baxendale, 1958). The paper makes an important contribution by
investigating techniques of tailoring the position method towards optimality over a
genre and how it can be evaluated for e ectiveness. A newswire corpus was used, the
collection of Zi -Davis texts produced from the TIPSTER5 program; it consists of
text about computer and related hardware, accompanied by a set of key topic words
and a small abstract of six sentences. For each document in the corpus, the authors
measured the yield of each sentence position against the topic keywords. They then
ranked the sentence positions by their average yield to produce the Optimal Position
Policy (OPP) for topic positions for the genre.
Two kinds of evaluation were performed. Previously unseen text was used for
testing whether the same procedure would work in a di erent domain. The rst
evaluation showed contours exactly like the training documents. In the second eval-
uation, word overlap of manual abstracts with the extracted sentences was measured.
Windows in abstracts were compared with windows on the selected sentences and
corresponding precision and recall values were measured. A high degree of coverage
indicated the e ectiveness of the position method.
In later work, Lin (1999) broke away from the assumption that features are
independent of each other and tried to model the problem of sentence extraction
using decision trees, instead of a naive-Bayes classi er. He examined a lot of fea-
tures and their e ect on sentence extraction. The data used in this work is a
publicly available collection of texts, classi ed into various topics, provided by the
TIPSTER-SUMMAC6
evaluations, targeted towards information retrieval systems.
The dataset contains essential text fragments (phrases, clauses, and sentences) which
must be included in summaries to answer some TREC topics. These fragments were
each evaluated by a human judge. The experiments described in the paper are with
the SUMMARIST system developed at the University of Southern California. The
system extracted sentences from the documents and those were matched against
human extracts, like most early work on extractive summarization.
Some novel features were the query signature (normalized score given to sen-
tences depending on number of query words that they contain), IR signature (the
m most salient words in the corpus, similar to the signature words of (Aone et al.,
1999)), numerical data (boolean value 1 given to sentences that contained a num-
ber in them), proper name (boolean value 1 given to sentences that contained a
proper name in them), pronoun or adjective (boolean value 1 given to sentences
5See http://www.itl.nist.gov/iaui/894.02/related_projects/tipster/.
6See http://www-nlpir.nist.gov/related_projects/tipster_summac/index.html.
5that contained a pronoun or adjective in them), weekday or month (similar as pre-
vious feature) and quotation (similar as previous feature). It is worth noting that
some features like the query signature are question-oriented because of the setting
of the evaluation, unlike a generalized summarization framework.
The author experimented with various baselines, like using only the positional
feature, or using a simple combination of all features by adding their values. When
evaluated by matching machine extracted and human extracted sentences, the deci-
sion tree classi er was clearly the winner for the whole dataset, but for three topics,
a naive combination of features beat it. Lin conjectured that this happened because
some of the features were independent of each other. Feature analysis suggested
that the IR signature was a valuable feature, corroborating the early ndings of
Luhn (1958).
2.2.3 Hidden Markov Models
In contrast with previous approaches, that were mostly feature-based and non-
sequential, Conroy and O’leary (2001) modeled the problem of extracting a sentence
from a document using a hidden Markov model (HMM). The basic motivation for
using a sequential model is to account for local dependencies between sentences.
Only three features were used: position of the sentence in the document (built into
the state structure of the HMM), number of terms in the sentence, and likeliness of
the sentence terms given the document terms.
no 1 no 2 no 3 no
Figure 1: Markov model to extract to three summary sentences from a document
(Conroy and O’leary, 2001).
The HMM was structured as follows: it contained 2s + 1 states, alternating be-
tween s summary states and s+ 1 nonsummary states. The authors allowed \hesita-
tion” only in nonsummary states and \skipping next state” only in summary states.
Figure 1 shows an example HMM with 7 nodes, corresponding to s = 3. Using the
TREC dataset as training corpus, the authors obtained the maximum-likelihood
estimate for each transition probability, forming the transition matrix estimate M^ ,
whose element (i; j) is the empirical probability of transitioning from state i to j.
Associated with each state i was an output function, bi(O) = Pr(O j state i) where
O is an observed vector of features. They made a simplifying assumption that the
features are multivariate normal. The output function for each state was thus esti-
mated by using the training data to compute the maximum likelihood estimate of
its mean and covariance matrix. They estimated 2s+ 1 means, but assumed that all
of the output functions shared a common covariance matrix. Evaluation was done
6by comparing with human generated extracts.
2.2.4 Log-Linear Models
Osborne (2002) claims that existing approaches to summarization have always as-
sumed feature independence. The author used log-linear models to obviate this
assumption and showed empirically that the system produced better extracts than
a naive-Bayes model, with a prior appended to both models. Let c be a label, s
the item we are interested in labeling, fi the i-th feature, and i the corresponding
feature weight. The conditional log-linear model used by Osborne (2002) can be
stated as follows:
P(c j s) = 1
Z(s)
exp X
i
ifi(c; s)
!
; (2)
where Z(s) = P
c
exp (P
i
ifi(c; s)). In this domain, there are only two possible
labels: either the sentence is to be extracted or it is not. The weights were trained
by conjugate gradient descent. The authors added a non-uniform prior to the model,
claiming that a log-linear model tends to reject too many sentences for inclusion in
a summary. The same prior was also added to a naive-Bayes model for comparison.
The classi cation took place as follows:
label(s) = arg max
c2C
P(c)  P(s; c) = arg max
c2C

log P(c) +X
i
ifi(c; s)
!
: (3)
The authors optimized the prior using the f2 score of the classi er as an objective
function on a part of the dataset (in the technical domain). The summaries were
evaluated using the standard f2 score where f2 = 2pr
p+r
, where the precision and recall
measures were measured against human generated extracts. The features included
word pairs (pairs of words with all words truncated to ten characters), sentence
length, sentence position, and naive discourse features like inside introduction or
inside conclusion. With respect to f2 score, the log-linear model outperformed the
naive-Bayes classi er with the prior, exhibiting the former’s e ectiveness.
2.2.5 Neural Networks and Third Party Features
In 2001-02, DUC issued a task of creating a 100-word summary of a single news
article. However, the best performing systems in the evaluations could not outper-
form the baseline with statistical signi cance. This extremely strong baseline has
been analyzed by Nenkova (2005) and corresponds to the selection of the rst n
sentences of a newswire article. This surprising result has been attributed to the
journalistic convention of putting the most important part of an article in the initial
paragraphs. After 2002, the task of single-document summarization for newswire
was dropped from DUC. Svore et al. (2007) propose an algorithm based on neu-
ral nets and the use of third party datasets to tackle the problem of extractive
summarization, outperforming the baseline with statistical signi cance.
7The authors used a dataset containing 1365 documents gathered from CNN.com,
each consisting of the title, timestamp, three or four human generated story high-
lights and the article text. They considered the task of creating three machine
highlights. The human generated highlights were not verbatim extractions from the
article itself. The authors evaluated their system using two metrics: the rst one
concatenated the three highlights produced by the system, concatenated the three
human generated highlights, and compared these two blocks; the second metric con-
sidered the ordering and compared the sentences on an individual level.
Svore et al. (2007) trained a model from the labels and the features for each
sentence of an article, that could infer the proper ranking of sentences in a test
document. The ranking was accomplished using RankNet (Burges et al., 2005), a
pair-based neural network algorithm designed to rank a set of inputs that uses the
gradient descent method for training. For the training set, they used ROUGE-1
(Lin, 2004) to score the similarity of a human written highlight and a sentence
in the document. These similarity scores were used as soft labels during training,
contrasting with other approaches where sentences are \hard-labeled”, as selected
or not.
Some of the used features based on position or n-grams frequencies have been
observed in previous work. However, the novelty of the framework lay in the use
of features that derived information from query logs from Microsoft’s news search
engine7 and Wikipedia8
entries. The authors conjecture that if a document sentence
contained keywords used in the news search engine, or entities found in Wikipedia
articles, then there is a greater chance of having that sentence in the highlight. The
extracts were evaluated using ROUGE-1 and ROUGE-2, and showed statistically
signi cant improvements over the baseline of selecting the rst three sentences in a
document.
2.3 Deep Natural Language Analysis Methods
In this subsection, we describe a set of papers that detail approaches towards single-
document summarization involving complex natural language analysis techniques.
None of these papers solve the problem using machine learning, but rather use a set
of heuristics to create document extracts. Most of these techniques try to model the
text’s discourse structure.
Barzilay and Elhadad (1997) describe a work that used considerable amount of
linguistic analysis for performing the task of summarization. For a better under-
standing of their method, we need to de ne a lexical chain: it is a sequence of related
words in a text, spanning short (adjacent words or sentences) or long distances (en-
tire text). The authors’ method progressed with the following steps: segmentation
of the text, identi cation of lexical chains, and using strong lexical chains to identify
the sentences worthy of extraction. They tried to reach a middle ground between
(McKeown and Radev, 1995) and (Luhn, 1958) where the former relied on deep
7See http://search.live.com/news.
8See http://en.wikipedia.org.
8semantic structure of the text, while the latter relied on word statistics of the doc-
uments. The authors describe the notion of cohesion in text as a means of sticking
together di erent parts of the text. Lexical cohesion is a notable example where
semantically related words are used. For example, let us take a look at the following
sentence.9
John bought a Jag: He loves the car: (4)
Here, the word car refers to the word Jag in the previous sentence, and exempli es
lexical cohesion. The phenomenon of cohesion occurs not only at the word level,
but at word sequences too, resulting in lexical chains, which the authors used as
a source representation for summarization. Semantically related words and word
sequences were identi ed in the document, and several chains were extracted, that
form a representation of the document. To nd out lexical chains, the authors used
Wordnet (Miller, 1995), applying three generic steps:
1. Selecting a set of candidate words.
2. For each candidate word, nding an appropriate chain relying on a relatedness
criterion among members of the chains,
3. If it is found, inserting the word in the chain and updating it accordingly.
The relatedness was measured in terms of Wordnet distance. Simple nouns and
noun compounds were used as starting point to nd the set of candidates. In the
nal steps, strong lexical chains were used to create the summaries. The chains were
scored by their length and homogeneity. Then the authors used a few heuristics to
select the signi cant sentences.
In another paper, Ono et al. (1994) put forward a computational model of dis-
course for Japanese expository writings, where they elaborate a practical procedure
for extracting the discourse rhetorical structure, a binary tree representing relations
between chunks of sentences (rhetorical structure trees are used more intensively in
(Marcu, 1998a), as we will see below). This structure was extracted using a series
of NLP steps: sentence analysis, rhetorical relation extraction, segmentation, can-
didate generation and preference judgement. Evaluation was based on the relative
importance of rhetorical relations. In the following step, the nodes of the rhetori-
cal structure tree were pruned to reduce the sentence, keeping its important parts.
Same was done for paragraphs to nally produce the summary. Evaluation was done
with respect to sentence coverage and 30 editorial articles of a Japanese newspaper
were used as the dataset. The articles had corresponding sets of key sentences and
most important key sentences judged by human subjects. The key sentence coverage
was about 51% and the most important key sentence coverage was 74%, indicating
encouraging results.
Marcu (1998a) describes a unique approach towards summarization that, unlike
most other previous work, does not assume that the sentences in a document form
a at sequence. This paper used discourse based heuristics with the traditional
9Example from http://www.cs.ucd.ie/staff/jcarthy/home/Lex.html.
9features that have been used in the summarization literature. The discourse theory
used in this paper is the Rhetorical Structure Theory (RST) that holds between
two non-overlapping pieces of text spans: the nucleus and the satellite. The author
mentions that the distinction between nuclei and satellites comes from the empir-
ical observation that the nucleus expresses what is more essential to the writer’s
purpose than the satellite; and that the nucleus of a rhetorical relation is compre-
hensible independent of the satellite, but not vice versa. Marcu (1998b) describes
the details of a rhetorical parser producing a discourse tree. Figure 2 shows an
example discourse tree for a text example detailed in the paper. Once such a disAntithesis
2
Elaboration
Elaboration
2
2
Elaboration
3
Justification
8
Exemplification
1 2 3 4 5
7 8
4 5
8 10
9 10
5 6
Contrast
Evidence
Concession
Figure 2: Example of a discourse tree from Marcu (1998a). The numbers in the
nodes denote sentence numbers from the text example. The text below the number
in selected nodes are rhetorical relations. The dotted nodes are SATELLITES and
the normals ones are the NUCLEI.
course structure is created, a partial ordering of important units can be developed
from the tree. Each equivalence class in the partial ordering is derived from the
new sentences at a particular level of the discourse tree. In Figure 2, we observe
that sentence 2 is at the root, followed by sentence 8 in the second level. In the
third level, sentence 3 and 10 are observed, and so forth. The equivalence classes
are 2 > 8 > 3;10 > 1;4;5;7;9 > 6.
If it is speci ed that the summary should contain the top k% of the text, the rst
k% of the units in the partial ordering can be selected to produce the summary. The
author talks about a summarization system based just on this method in (Marcu,
1998b) and in one of his earlier papers. In this paper, he merged the discourse
based heuristics with traditional heuristics. The metrics used were clustering based
10metric (each node in the discourse tree was assigned a cluster score; for leaves the
score was 0, for the internal nodes it was given by the similarity of the immediate
children; discourse tree A was chosen to be better than B if its clustering score
was higher), marker based metric (a discourse structure A was chosen to be better
than a discourse structure B if A used more rhetorical relations than B), rhetorical
clustering based technique (measured the similarity between salient units of two text
spans), shape based metric (preferred a discourse tree A over B if A was more skewed
towards the right than B), title based metric, position based metric, connectedness
based metric (cosine similarity of an unit to all other text units, a discourse structure
A was chosen to be better than B if its connectedness measure was more than B).
A weighted linear combination of all these scores gave the score of a discourse
structure. To nd the best combination of heuristics, the author computed the
weights that maximized the F-score on the training dataset, which was constituted
by newswire articles. To do this, he used a GSAT-like algorithm (Selman et al.,
1992) that performed a greedy search in a seven dimensional space of the metrics.
For a part of his corpus (the TREC dataset), a best F-score of 75.42% was achieved
for the 10% summaries which was 3.5% higher than a baseline lead based algorithm,
which was very encouraging.
3 Multi-Document Summarization
Extraction of a single summary from multiple documents has gained interest since
mid 1990s, most applications being in the domain of news articles. Several Web-
based news clustering systems were inspired by research on multi-document summa-
rization, for example Google News,
10 Columbia NewsBlaster,
11 or News In Essence.
12
This departs from single-document summarization since the problem involves mul-
tiple sources of information that overlap and supplement each other, being contra-
dictory at occasions. So the key tasks are not only identifying and coping with
redundancy across documents, but also recognizing novelty and ensuring that the
nal summary is both coherent and complete.
The eld seems to have been pioneered by the NLP group at Columbia University
(McKeown and Radev, 1995), where a summarization system called SUMMONS13
was developed by extending already existing technology for template-driven message
understanding systems. Although in that early stage multi-document summariza-
tion was mainly seen as a task requiring substantial capabilities of both language
interpretation and generation, it later gained autonomy, as people coming from dif-
ferent communities added new perspectives to the problem. Extractive techniques
have been applied, making use of similarity measures between pairs of sentences.
Approaches vary on how these similarities are used: some identify common themes
through clustering and then select one sentence to represent each cluster (McKeown
10See http://news.google.com.
11See http://newsblaster.cs.columbia.edu.
12See http://NewsInEssence.com.
13SUMMarizing Online NewS articles.
11et al., 1999; Radev et al., 2000), others generate a composite sentence from each
cluster (Barzilay et al., 1999), while some approaches work dynamically by includ-
ing each candidate passage only if it is considered novel with respect to the previous
included passages, via maximal marginal relevance (Carbonell and Goldstein, 1998).
Some recent work extends multi-document summarization to multilingual environ-
ments (Evans, 2005).
The way the problem is posed has also varied over time. While in some pub-
lications it is claimed that extractive techniques would not be e ective for multi-
document summarization (McKeown and Radev, 1995; McKeown et al., 1999), some
years later that claim was overturned, as extractive systems like MEAD14 (Radev
et al., 2000) achieved good performance in large scale summarization of news arti-
cles. This can be explained by the fact that summarization systems often distinguish
among themselves about what their goal actually is. While some systems, like SUM-
MONS, are designed to work in strict domains, aiming to build a sort of brie ng
that highlights di erences and updates accross di erent news reports, putting much
emphasis on how information is presented to the user, others, like MEAD, are large
scale systems that intend to work in general domains, being more concerned with
information content rather than form. Consequently, systems of the former kind re-
quire a strong e ort on language generation to produce a grammatical and coherent
summary, while latter systems are probably more close to the information retrieval
paradigm. Abstractive systems like SUMMONS are dicult to replicate, as they
heavily rely on the adaptation of internal tools to perform information extraction
and language generation. On the other hand, extractive systems are generally easy
to implement from scratch, and this makes them appealing when sophisticated NLP
tools are not available.
3.1 Abstraction and Information Fusion
As far as we know, SUMMONS (McKeown and Radev, 1995; Radev and McKeown,
1998) is the rst historical example of a multi-document summarization system. It
tackles single events about a narrow domain (news articles about terrorism) and
produces a brie ng merging relevant information about each event and how reports
by di erent news agencies have evolved over time. The whole thread of reports is
then presented, as illustrated in the following example of a \good” summary:
\In the afternoon of February 26, 1993, Reuters reported that a suspect
bomb killed at least ve people in the World Trade Center. However,
Associated Press announced that exactly ve people were killed in the
blast. Finally, Associated Press announced that Arab terrorists were
possibly responsible for the terrorist act.”
Rather than working with raw text, SUMMONS reads a database previously
built by a template-based message understanding system. A full multi-document
14Available for download at http://www.summarization.com/mead/.
12summarizer is built by concatenating the two systems, rst processing full text as
input and lling template slots, and then synthesizing a summary from the extracted
information. The architecture of SUMMONS consists of two major components: a
content planner that selects the information to include in the summary through
combination of the input templates, and a linguistic generator that selects the right
words to express the information in grammatical and coherent text. The latter
component was devised by adapting existing language generation tools, namely the
FUF/SURGE system15. Content planning, on the other hand, is made through
summary operators, a set of heuristic rules that perform operations like \change of
perspective”, \contradiction”, \re nement”, etc. Some of these operations require
resolving con icts, i.e., contradictory information among di erent sources or time
instants; others complete pieces of information that are included in some articles
and not in others, combining them into a single template. At the end, the linguis-
tic generator gathers all the combined information and uses connective phrases to
synthesize a summary.
While this framework seems promising when the domain is narrow enough so that
the templates can be designed by hand, a generalization for broader domains would
be problematic. This was improved later by McKeown et al. (1999) and Barzilay
et al. (1999), where the input is now a set of related documents in raw text, like
those retrieved by a standard search engine in response to a query. The system starts
by identifying themes, i.e., sets of similar text units (usually paragraphs). This is
formulated as a clustering problem. To compute a similarity measure between text
units, these are mapped to vectors of features, that include single words weighted
by their TF-IDF scores, noun phrases, proper nouns, synsets from the Wordnet
database and a database of semantic classes of verbs. For each pair of paragraphs, a
vector is computed that represents matches on the di erent features. Decision rules
that were learned from data are then used to classify each pair of text units either
as similar or dissimilar; this in turn feeds a subsequent algorithm that places the
most related paragraphs in the same theme.
Once themes are identi ed, the system enters its second stage: information fu-
sion. The goal is to decide which sentences of a theme should be included in the
summary. Rather than just picking a sentence that is a group representative, the
authors propose an algorithm which compares and intersects predicate argument
structures of the phrases within each theme to determine which are repeated often
enough to be included in the summary. This is done as follows: rst, sentences are
parsed through Collins’ statistical parser (Collins, 1999) and converted into depen-
dency trees, which allows capturing the predicate-argument structure and identify
functional roles. Determiners and auxiliaries are dropped; Fig. 3 shows a sentence
representation.
The comparison algorithm then traverses these dependency trees recursively,
adding identical nodes to the output tree. Once full phrases (a verb with at least
two constituents) are found, they are marked to be included in the summary. If two
15FUF, SURGE, and other tools developed by the Columbia NLP group are available at

http://www1.cs.columbia.edu/nlp/tools.cgi.

13and Kan 1998]. We match two verbsthatshare the same
semantic classin this classification.
In addition to the above primitive features that all compare single itemsfromeach text unit, we use composite featuresthat combine pairs of primitive features. Our composite featuresimpose particular constraints on the order of the
two elementsin the pair, on the maximumdistance between
the two elements, and on the syntactic classes that the two
elements come from. They can vary from a simple combination (e.g., “two text units must share two words to be
similar”)to complex cases with many conditions(e.g., “two
text units must have matching noun phrases that appear in
the same order and with relative difference in position no
more than five”). In this manner, we capture information
on how similarly related elements are spaced out in the two
text units, as well as syntactic information on word combinations. Matches on composite features indicate combined
evidence forthe similarity ofthe two units.
To determinewhetherthe units match overall, we employ
amachine learning algorithm[
Cohen 1996]that induces decision rules using the featuresthat really make a difference.
A set of pairs of units already marked assimilar or not by a
human is used for training the classifier. We have manually
marked a set of 8,225 paragraph comparisonsfromtheTDT
corpusfortraining and evaluating oursimilarity classifier.
For comparison, we also use an implementation of the
TF*IDF method which isstandard for matching textsin information retrieval. We compute the totalfrequency (TF) of
wordsin each text unit and the number of unitsin ourtraining set each word appearsin (DF, or document frequency).
Then each text unit is represented as a vector of TF*IDF
scores, calculated as
TF(wordi
)· log
Total number of units
DF(wordi
)
Similarity between text units is measured by the cosine of
the angle between the corresponding two vectors (i.e., the
normalized inner product of the two vectors), and the optimal value of a threshold for judging two units as similar is
computed fromthe training set.
After all pairwise similarities between text units have
been calculated, we utilize a clustering algorithm to identify themes. As a paragraphmay belong to multiple themes,
most standard clustering algorithms, which partition their
input set, are not suitable for our task. We use a greedy,
one-pass algorithmthatfirst constructsgroupsfromthemost
similar paragraphs, seeding the groups with the fully connected subcomponents of the graph that the similarity relationship induces over the set of paragraphs, and then places
additional paragraphs within a group if the fraction of the
members of the group they are similar to exceeds a preset
threshold.
Language Generation
Given a group of similar paragraphs—a theme—the problemisto create a concise and fluentfusion ofinformation in
this theme, reflecting facts common to all paragraphs. A
straightforward method would be to pick a representative
subject
class: noun
27
class: cardinal
bombing
class: noun
McVeigh with
class: preposition
definite: yes
charge
class: verb voice :passive
tense: past polarity: +
Figure 4: Dependency grammar representation of the sentence “McVeigh, 27, was charged with the bombing”.
sentence that meets some criteria (e.g., a threshold number
of common content words). In practice, however, any representative sentence will usually include embedded phrase(s)
containing information that is not common to all sentences
in the theme. Furthermore, othersentencesin the theme often contain additional information not presented in the representative sentence. Our approach, therefore, uses intersection among theme sentencesto identify phrases common
to most paragraphs and then generates a new sentence from
identified phrases.
Intersection amongTheme Sentences
Intersection is carried out in the content planner, which uses
a parser for interpreting the input sentences, with our new
work focusing on the comparison of phrases. Theme sentences are firstrun through a statistical parser[
Collins 1996]
and then, in order to identify functional roles(e.g.,subject,
object), are converted to a dependency grammarrepresentation [
Kittredge and Mel’cuˇ k 1983], which makes predicateargumentstructure explicit.
We developed a rule-based component to produce functionalroles, which transformsthe phrase-structure output of
Collins’ parserto dependencygrammar;functionwords(determiners and auxiliaries) are eliminated from the tree and
corresponding syntactic features are updated. An example
of a theme sentence and its dependency grammar representation are shown inFigure 4. Each non-auxiliaryword in the
sentence has a node in the representation, and this node is
connected to its direct dependents.
The comparison algorithm starts with all subtreesrooted
at verbsfrom the input dependency structure, and traverses
them recursively: if two nodes are identical, they are added
to the output tree, and their children are compared. Once
a full phrase (verb with at least two constituents) has been
found, it is confirmed forinclusion in the summary.
Difficulties arisewhen two nodes are notidentical, but are
similar. Such phrases may be paraphrases of each other and
still convey essentially the same information. Since theme
sentences are a priori close semantically, this significantly
Figure 3: Dependency tree representing the sentence \McVeigh, 27, was charged
with the bombing” (extracted from (McKeown et al., 1999)).
phrases, rooted at some node, are not identical but yet similar, the hypothesis that
they are paraphrases of each other is considered; to take this into account, corpus-
driven paraphrasing rules are written to allow paraphrase intersection.16 Once the
summary content (represented as predicate-argument structures) is decided, a gram-
matical text is generated by translating those structures into the arguments expected
by the FUF/SURGE language generation system.
3.2 Topic-driven Summarization and MMR
Carbonell and Goldstein (1998) made a major contribution to topic-driven sum-
marization by introducing the maximal marginal relevance (MMR) measure. The
idea is to combine query relevance with information novelty; it may be applicable
in several tasks ranging from text retrieval to topic-driven summarization. MMR
simultaneously rewards relevant sentences and penalizes redundant ones by consid-
ering a linear combination of two similarity measures.
Let Q be a query or user pro le and R a ranked list of documents retrieved by
a search engine. Consider an incremental procedure that selects documents, one at
a time, and adds them to a set S. So let S be the set of already selected documents
in a particular step, and R n S the set of yet unselected documents in R. For each
candidate document Di 2 R n S, its marginal relevance MR(Di) is computed as:
MR(Di) := Sim1(Di
; Q) (1 ) max
Dj2S
Sim2(Di
; Dj) (5)
where  is a parameter lying in [0;1] that controls the relative importance given
to relevance versus redundancy. Sim1 and Sim2 are two similarity measures; in the
16A full description of the kind of paraphrasing rules used can be found in (Barzilay et al.,
1999). Examples are: ordering of sentence components, main clause vs. relative clause, realization
in di erent syntactic categories (e.g. classi er vs. apposition), change in grammatical features
(active/passive, time, number, etc.), head omission, transformation from one POS to another,
using semantically related words (e.g. synonyms), etc.
14