Learning Algorithms for Keyphrase Extraction

DOI: http://doi.org/10.1023/A:1009976227802
Journal title: Information Retrieval
Pages: 303-336; # of pages: 31
Subject: machine learning; summarization; indexing; keywords; keyphrase extraction; réduction; indexation; mots clés; extraction de phrases clés.
Abstract: Many academic journals ask their authors to provide a list of about five to fifteen keywords, to appear on the first page of each article. Since these key words are often phrases of two or more words, we prefer to call them keyphrases. There is a wide variety of tasks for which keyphrases are useful, as we discuss in this paper. We approach the problem of automatically extracting keyphrases from text as a supervised learning task. We treat a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases. Our first set of experiments applies the C4.5 decision tree induction algorithm to this learning task. We evaluate the performance of nine different configurations of C4.5. The second set of experiments applies the GenEx algorithm to the task. We developed the GenEx algorithm specifically for automatically extracting keyphrases from text. The experimental results support the claim that a custom-designed algorithm (GenEx), incorporating specialized procedural domain knowledge, can generate better keyphrases than a general-purpose algorithm (C4.5). Subjective human evaluation of the keyphrases generated by Extractor suggests that about 80% of the keyphrases are acceptable to human readers. This level of performance should be satisfactory for a wide variety of applications.
Publication date
Affiliation: National Research Council Canada; NRC Institute for Information Technology
Peer reviewed: No
NRC number: 44105
NPARC number: 8913713
Record identifier: c3c43a82-5ef9-4179-b820-763ad2d9ec62
Record created: 2009-04-22
Record modified: 2016-05-09