Spreading excellence and disseminating the cutting-edge results of our research and development efforts is crucial to our institute. Check out our educational offerings for Bachelor, Master and PhD studies at the University of Innsbruck!
The objective of this thesis is to explore graph algorithms as a novel approach for identifying and extracting the main text content from web pages.
At the time of writing, main text extraction from web pages has not been proposed as an application domain for graph algorithms. This is a missed opportunity: the DOM tree of an HTML document is naturally represented as a hierarchical graph, which practically suggests the use of graph algorithms for the associated machine learning problems. The main goals of the thesis are the following:
Contact person in charge.
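To illustrate the observation that the DOM tree of an HTML document is naturally a hierarchical graph, the following is a minimal sketch in Python using only the standard library. It is not the method proposed for the thesis. It builds the tree as an adjacency list, sums the text length of every subtree, and then descends from the root while a single child still holds most of the page's text. The names `DomGraphBuilder`, `extract_main_node`, `SAMPLE_HTML` and the 0.5 threshold are assumptions made for this example only.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose character data is not visible page text.
SKIP_TAGS = {"script", "style"}


class DomGraphBuilder(HTMLParser):
    """Builds the DOM tree as an adjacency list (node id -> child ids)."""

    def __init__(self):
        super().__init__()
        self.tags = ["#root"]      # node id -> tag name
        self.attrs = [{}]          # node id -> attribute dict
        self.children = {0: []}    # node id -> list of child node ids
        self.text_len = {0: 0}     # node id -> length of its own text
        self._stack = [0]          # path from the root to the open element

    def handle_starttag(self, tag, attrs):
        node = len(self.tags)
        self.tags.append(tag)
        self.attrs.append(dict(attrs))
        self.children[node] = []
        self.text_len[node] = 0
        self.children[self._stack[-1]].append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self.tags[self._stack[i]] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        node = self._stack[-1]
        if self.tags[node] not in SKIP_TAGS:
            self.text_len[node] += len(data.strip())


def extract_main_node(html_text, threshold=0.5):
    """Return (tag, attrs, subtree_text_length) of a candidate main-text node.

    Illustrative heuristic only: follow the child that holds at least
    `threshold` of all page text until no single child does.
    """
    builder = DomGraphBuilder()
    builder.feed(html_text)
    builder.close()

    # Node ids are assigned in pre-order, so every child id is larger than
    # its parent id; a reverse sweep therefore sums subtrees bottom-up.
    subtree = dict(builder.text_len)
    for node in range(len(builder.tags) - 1, -1, -1):
        subtree[node] += sum(subtree[c] for c in builder.children[node])

    total = subtree[0]
    node = 0
    while True:
        heavy = [c for c in builder.children[node]
                 if total > 0 and subtree[c] >= threshold * total]
        if not heavy:
            break
        node = max(heavy, key=lambda c: subtree[c])
    return builder.tags[node], builder.attrs[node], subtree[node]


# Hypothetical page used only to exercise the sketch.
SAMPLE_HTML = (
    "<html><body>"
    "<div id='nav'><a>Home</a><a>About</a></div>"
    "<div id='main'>"
    "<p>Graph algorithms operate on nodes and edges.</p>"
    "<p>The DOM tree is one such hierarchical graph.</p>"
    "</div>"
    "<script>var x = 1;</script>"
    "</body></html>"
)

if __name__ == "__main__":
    print(extract_main_node(SAMPLE_HTML))
```

On the sample page, the descent stops at the element that contains both paragraphs, because neither paragraph alone holds half of the page's text. Real pages would need richer node features (for example link density) and more robust graph algorithms than this single traversal.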