Automatic Identification of Genre in Web Pages

Year: 2011
Author: Marina Santini
Publisher: LAP Lambert Academic Publishing
Genre is a complex but intuitively understood concept. Home pages, FAQs, blogs, etc. are examples of genres currently thriving on the web. Automatically identifying web genres would help us find documents that are more relevant to our information needs. The aim of the research described in this book is to develop automatic genre classification algorithms. There are several challenges, however, that affect the modelling of these algorithms. First, genres on the web are instantiated in web pages, which can be considered documents of a new type, much more unpredictable and individualised than documents on paper. Second, the web is unstable and fluid, undergoing a fast-paced evolution, so genre identification is influenced by phenomena such as the formation of novel genres, genre hybridism, individualisation, and intra-genre and inter-genre variation. Finally, the automatically extractable genre-revealing features used up to now are not adequate to define existing and novel web genres. The…
