Hapax analyses the vocabulary of software systems. I apply search engine technology (Latent Semantic Analysis) to analyse the vocabulary and topics of software. In my research, I used Hapax for

  • aspect mining,
  • software clustering, and
  • source code classification.

Hapax’s approach is programming-language independent, as it is based on identifier names and comments only.
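To illustrate why working from identifier names and comments alone is language independent, here is a minimal Python sketch. It is not Hapax's actual code (the original is written in Smalltalk); the function names `extract_terms` and `cosine` and the three sample snippets are hypothetical. It splits source text into lowercase terms and compares snippets by the cosine of their term-frequency vectors. Latent Semantic Analysis would additionally apply a singular value decomposition to the term-document matrix before comparing; that step is left out here to keep the sketch dependency free.

```python
import math
import re
from collections import Counter


def extract_terms(source: str) -> list[str]:
    """Split source text into lowercase terms, breaking camelCase and snake_case."""
    terms = []
    # Runs of letters only: digits, punctuation and underscores act as separators.
    for word in re.findall(r"[A-Za-z]+", source):
        # Break each run at case boundaries, keeping acronyms such as "HTTP" whole.
        parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", word)
        terms.extend(part.lower() for part in parts)
    return terms


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical snippets in three different languages; no parser is involved.
parser_doc = Counter(extract_terms(
    "def parseHttpRequest(raw_request):  # parse the request header"))
server_doc = Counter(extract_terms(
    "class HTTPServer { void handle_request() {} }  // serve http"))
matrix_doc = Counter(extract_terms(
    "double matrix_multiply(double a, double b)  /* multiply matrices */"))

print(cosine(parser_doc, server_doc))
print(cosine(parser_doc, matrix_doc))
```

The two HTTP-related snippets share the terms "http" and "request" and therefore score above zero, while the matrix snippet shares no term with the parser snippet. The sketch keeps language keywords such as "def" and "class" as terms; a real pipeline would presumably filter those out as stop words.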

Hapax Download

The original Hapax is written in VisualWorks Smalltalk. Recently, Romain and I started porting Hapax to Squeak, but the port is not yet done; see


  • Adrian Kuhn, Stéphane Ducasse and Tudor Gîrba, "Semantic Clustering: Identifying Topics in Source Code," Information and Software Technology (IST), vol. 49, no. 3, March 2007, pp. 230–243.
  • This journal paper is an extension of a WCRE paper and my Master’s thesis.


