Probabilistic models for focused web crawling

Journal title: Computational Intelligence
Pages: 289–328; # of pages: 40
Subject: Best first search; Conditional random field; Domain specific; Experimental validations; Focused crawler; Focused crawling; Focused web crawling; Global optimal solutions; Graphical model; Hidden state; Hop distance; Link analysis; Local classifier; Maximum entropy Markov model; Overlapping features; Personalized search; Probabilistic models; Sequential patterns; Text content; Topical crawlers; Web mining; Web searches; Hidden Markov models; Learning systems; Websites; Learning algorithms
Abstract: A focused crawler is an efficient tool used to traverse the Web to gather documents on a specific topic. It can be used to build domain-specific Web search portals and online personalized search tools. Focused crawlers can only use information obtained from previously crawled pages to estimate the relevance of a newly seen URL. Therefore, good performance depends on powerful modeling of context as well as the quality of the current observations. To address this challenge, we propose capturing sequential patterns along paths leading to targets based on probabilistic models. We model the process of crawling by a walk along an underlying chain of hidden states, defined by hop distance from target pages, from which the actual topics of the documents are observed. When a new document is seen, prediction amounts to estimating the distance of this document from a target. Within this framework, we propose two probabilistic models for focused crawling, Maximum Entropy Markov Model (MEMM) and Linear-chain Conditional Random Field (CRF). With MEMM, we exploit multiple overlapping features, such as anchor text, to represent useful context and form a chain of local classifier models. With CRF, a form of undirected graphical models, we focus on obtaining global optimal solutions along the sequences by taking advantage not only of text content, but also of linkage relations. We conclude with an experimental validation and comparison with focused crawling based on Best-First Search (BFS), Hidden Markov Model (HMM), and Context-graph Search (CGS). © 2012 Wiley Periodicals, Inc.
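The abstract's core idea can be illustrated with a minimal sketch: a focused crawler keeps a priority queue over unvisited URLs, expanding the one with the smallest estimated hop distance to a target. This is a toy illustration only; the graph, the `EST_DISTANCE` table (a stand-in for the learned MEMM/CRF predictor), and all names are hypothetical and not from the paper.

```python
import heapq

# Toy link graph: page -> outlinks. In the paper's framing, each page has a
# hidden state equal to its hop distance from a target page; the crawler
# scores each newly seen URL by estimating that distance from observations.
GRAPH = {
    "seed": ["a", "b"],
    "a": ["c", "target"],
    "b": ["c"],
    "c": ["target"],
    "target": [],
}

# Hypothetical stand-in for a learned probabilistic model: a fixed table
# giving each page's estimated hop distance to the nearest target.
EST_DISTANCE = {"seed": 2, "a": 1, "b": 2, "c": 1, "target": 0}

def focused_crawl(seed, is_target, budget=10):
    """Best-first crawl: always expand the frontier URL with the
    smallest estimated distance to a target (lower = higher priority)."""
    frontier = [(EST_DISTANCE[seed], seed)]
    visited = set()
    found = []
    while frontier and len(visited) < budget:
        dist, page = heapq.heappop(frontier)
        if page in visited:
            continue
        visited.add(page)
        if is_target(page):
            found.append(page)
        for link in GRAPH[page]:
            if link not in visited:
                heapq.heappush(frontier, (EST_DISTANCE[link], link))
    return found

print(focused_crawl("seed", lambda p: p == "target"))  # -> ['target']
```

The design choice the paper argues over this baseline is how `EST_DISTANCE` is obtained: BFS uses a local relevance classifier, while MEMM/CRF condition the distance estimate on the whole observed path of pages leading to the URL.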
Affiliation: National Research Council Canada (NRC-CNRC); NRC Institute for Information Technology (IIT-ITI)
Peer reviewed: Yes
NPARC number: 21269253
Record identifier: 29ad1645-e1b5-4487-8572-ed43fc07353c
Record created: 2013-12-12
Record modified: 2016-05-09