EggShell — A workbench for modeling scientific communities

Dominik Seliner. EggShell — A workbench for modeling scientific communities. Bachelor’s thesis, University of Bern, August 2016.


Collaboration in a scientific community can be analyzed through the publication record of its members. Analyzing the metadata of those publications (e.g., title and authors) can help researchers identify collaboration groups, their evolution, and key authors. However, the criteria for collecting the papers of some communities may exceed the expressiveness of the available public databases and search engines. In such cases, the data has to be retrieved from the paper files themselves. Scientific papers are usually available in unstructured file formats, for which automatic data extraction poses a challenge. To model the metadata of a community, users have to define a pipeline in which each step contributes to the accuracy of the extracted data. The main challenge is to identify which field of the document a piece of text corresponds to. Previous research proposed heuristics that identify certain fields, such as the title and authors, from paper files by analyzing their layout. The performance of such heuristics may vary across papers that use different layouts, so ensuring the accuracy of a given heuristic is a challenging problem. Small improvements in a heuristic that tackles a popular layout can have a large impact on its overall performance. However, identifying popular layouts and evaluating the impact of improvements can be laborious. Visualization offers techniques that suit the analysis of such multivariate data. Through visualization, a developer who is implementing a data extraction heuristic can obtain an overview of how it performs and find hotspots that lead to improvements in its overall efficacy.
In this thesis, we propose EggShell, a workbench that incorporates visualization to assess the performance of modeling pipelines for scientific papers in PDF format. We elaborate on examples of how EggShell allows users to define multiple pipelines, which can then be improved by assessing their output using visualization. We collected a corpus of 300 papers published at the SOFTVIS/VISSOFT venues. We used a subset of 100 papers as a learning set to develop the pipelines, and the remaining 200 papers to evaluate their performance for modeling collaboration in the community. We observed that our best-performing pipeline exhibits an accuracy of 70%.
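
To make the idea of assessing a heuristic per layout concrete, the following is a minimal sketch in Python. It is not EggShell's implementation, which the abstract does not describe at code level. The `Paper` record, the layout labels, the `largest_font_title` heuristic, and the exact-match accuracy criterion are all assumptions introduced only for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple, Tuple


class Paper(NamedTuple):
    """Hypothetical record for one paper (not an EggShell type)."""
    layout: str                      # assumed layout label, e.g. "A" or "B"
    lines: List[Tuple[float, str]]   # (font size, text) pairs from page one
    true_title: str                  # hand-labelled ground truth


def largest_font_title(paper: Paper) -> str:
    """Toy heuristic: the title is the text set in the largest font size."""
    if not paper.lines:
        return ""
    top = max(size for size, _ in paper.lines)
    return " ".join(text for size, text in paper.lines if size == top)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing titles."""
    return " ".join(text.lower().split())


def accuracy_by_layout(
    papers: List[Paper], heuristic: Callable[[Paper], str]
) -> Dict[str, float]:
    """Fraction of exactly matched titles, grouped by layout label."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for paper in papers:
        totals[paper.layout] += 1
        if normalize(heuristic(paper)) == normalize(paper.true_title):
            hits[paper.layout] += 1
    return {layout: hits[layout] / totals[layout] for layout in totals}


# Made-up sample data: two papers in layout "A", one in layout "B".
sample = [
    Paper("A", [(18.0, "EggShell"), (10.0, "Some Author")], "EggShell"),
    Paper("A", [(18.0, "A Long"), (18.0, "Title"), (10.0, "x")], "A long title"),
    Paper("B", [(12.0, "VISSOFT 2016"), (11.0, "Real Title")], "Real Title"),
]
result = accuracy_by_layout(sample, largest_font_title)
```

In this made-up sample the heuristic succeeds on both layout "A" papers and fails on the layout "B" paper, where a venue banner is set in a larger font than the title. A per-layout breakdown of this kind is the sort of overview that can point a developer to the layout where an improvement would pay off most.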

Posted by scg at 26 August 2016, 11:15 am