Community

Why not extract the data from the Web?

Because the information is unreliable and spread across different sites. Although Google Scholar provides a feature for filtering papers from a specific venue, it returns many records that do not belong to that venue and misses others. IEEE Xplore and the ACM Digital Library are more accurate, but some proceedings are still missing (e.g. VISSOFT 2003), they do not seem to provide an API for extracting data programmatically, and retrieving the data directly from the sites seems error-prone. So our approach is to extract the data directly from a corpus of PDF files.

What did not work

Corpora

Extract text from PDF files

Parse the text to extract the title and authors.

Extract title and authors from XML file (Pharo).

Using XMLSupport

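"Load the configuration package for XMLSupport from SqueakSource."
Gofer new
  squeaksource: 'XMLSupport';
  package: 'ConfigurationOfXMLSupport';
  load.

"Run the configuration's loadDefault to install XMLSupport."
(Smalltalk at: #ConfigurationOfXMLSupport)
  perform: #loadDefault.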

Analysing one file

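"Parse the XML file into a DOM tree."
file := 'output.head'.
xml := XMLDOMParser parseFileNamed: file.
"The title is the first child node of the first title element."
title := ((xml allElementsNamed: #title) collect: [:e | e nodes first])
  first asString.
"Each author is the first child node of an author element."
authors := ((xml allElementsNamed: #author) collect: [:e | e nodes first])
  asOrderedCollection.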
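
The snippet above assumes that every file contains a title and that every author element has content. With either one missing, `first` raises an error. Below is a minimal sketch of a more defensive variant. It only uses the XMLSupport calls already shown, plus `XMLDOMParser parse:` to read from a string. The XML fragment in `sample` is made up for illustration and is not the real layout of 'output.head'.

```smalltalk
| sample xml titles title authors |
"Hypothetical XML fragment; the real structure of 'output.head' may differ."
sample := '<paper><title>A Sample Title</title><author>Ann Author</author><author></author></paper>'.
xml := XMLDOMParser parse: sample.

"Keep only title elements that actually have a child node."
titles := (xml allElementsNamed: #title) asOrderedCollection
	select: [ :e | e nodes notEmpty ].
"Use nil when no usable title was found instead of failing on 'first'."
title := titles isEmpty
	ifTrue: [ nil ]
	ifFalse: [ titles first nodes first asString ].

"Skip empty author elements before taking their first child node."
authors := ((xml allElementsNamed: #author) asOrderedCollection
	select: [ :e | e nodes notEmpty ])
	collect: [ :e | e nodes first asString ].
```

To run this on a real file, replace the `XMLDOMParser parse: sample` line with `XMLDOMParser parseFileNamed: 'output.head'`. Files where `title` ends up nil can then be listed for manual inspection instead of interrupting the run.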

Running the visualisation

VCExtractor new visualise

First attempt. ICSM (from '95 to 2013: 15 editions)

[Image: Icsm]