Evaluating the potential of graph algorithms for extracting the main text from HTML files

Type: Bachelor
Student name: Lukas Lechner
Assignment Date: November 9, 2022

The objective of this thesis is to explore graph algorithms as a novel approach for identifying and extracting the main text content from web pages.
At the time of writing, graph algorithms have not been proposed for main text extraction from web pages. This is a missed opportunity: the DOM tree of an HTML document is already a hierarchical graph, which makes graph algorithms a natural fit for the associated ML problems. The main goals of the thesis are the following:

investigate the potential of graph algorithms for main text extraction in web pages,
design and test different graph representations of a given HTML document (nodes, links, node attributes) to determine the optimal graph representation for main text extraction, and
evaluate and compare graph algorithms against state-of-the-art approaches.
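
To make the second goal more concrete, the following is a minimal sketch of one possible graph representation, not the representation the thesis will settle on. It assumes that nodes are HTML elements, links are parent-child relations in the DOM tree, and node attributes are the tag name and the length of the text directly contained in the element. The names `DomGraphBuilder`, `html_to_graph`, and `text_len` are hypothetical and chosen for illustration only; the parser is Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DomGraphBuilder(HTMLParser):
    """Builds a node dictionary and an edge list from an HTML string.

    Assumed representation (illustrative only):
      nodes: {node_id: {"tag": str, "text_len": int}}
      edges: [(parent_id, child_id), ...]
    """

    def __init__(self):
        super().__init__()
        # Node 0 is an artificial root standing in for the document itself.
        self.nodes = {0: {"tag": "#document", "text_len": 0}}
        self.edges = []
        self._stack = [0]  # ids of the currently open elements

    def handle_starttag(self, tag, attrs):
        node_id = len(self.nodes)
        self.nodes[node_id] = {"tag": tag, "text_len": 0}
        self.edges.append((self._stack[-1], node_id))
        if tag not in VOID_TAGS:
            self._stack.append(node_id)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self.nodes[self._stack[i]]["tag"] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # Attribute the text to the innermost open element.
        self.nodes[self._stack[-1]]["text_len"] += len(data.strip())


def html_to_graph(html):
    """Returns (nodes, edges) for the given HTML string."""
    builder = DomGraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.nodes, builder.edges


sample = ("<html><body><nav><a href='#'>Home</a></nav>"
          "<p>Main text here.</p><br></body></html>")
nodes, edges = html_to_graph(sample)
```

Other representations worth comparing under the same goal could add sibling links between adjacent elements or further node attributes such as link density. The edge list produced here can be passed to a graph library of choice for the actual algorithms.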