Aaron Karper. Efficient regular expressions that produce parse trees. Masters thesis, University of Bern, December 2014. Details.
Regular expressions naturally and intuitively define parse trees that describe the text that they’re parsing. We describe a technique for building up the complete parse tree resulting from matching a text against a regular expression. In previous tagged deterministic finite-state automaton (TDFA) matching implementations, all paths through the non-deterministic finite-state automaton (NFA) are walked simultaneously, in different co-routines, where inside each co-routine, it is fully known when which capture group was entered or left. We extend this model to keep track of not just the last opening and closing of capture groups, but all of them. We do this by storing in every co-routine a history of the all groups using the flyweight pattern. Thus, we log enough information during parsing to build up the complete parse tree after matching in a single pass, making it possible to use our algorithm with strings exceeding the machine’s memory. Further we construct the automata such that a simulation of backtracking like behaviour is possible. This is achieved in worst case time $Theta$(mn), providing full parse trees with only constant slowdown compared to matching.
Boris Spasojević, Mircea Lungu, and Oscar Nierstrasz. Overthrowing the Tyranny of Alphabetical Ordering in Documentation Systems. In 2014 IEEE International Conference on Software Maintenance and Evolution, p. 511-515, September 2014. Details.
Software developers are often unsure of the exact name of the API method they need to use to invoke the desired behavior. Most state-of-the-art documentation browsers present API artefacts in alphabetical order. Albeit easy to implement, alphabetical order does not help much: if the developer knew the name of the required method, he could have just searched for it in the first place. In a context where multiple projects use the same API, and their source code is available, we can improve the API presentation by organizing the elements in the order in which they are more likely to be used by the developer. Usage frequency data for methods is gathered by analyzing other projects from the same ecosystem and this data is used then to improve tools. We present a preliminary study on the potential of this approach to improve the API presentation by reducing the time it takes to find the method that implements a given feature. We also briefly present our experience with two proof-of-concept tools implemented for Smalltalk and Java.
Boris Spasojević, Mircea Lungu, and Oscar Nierstrasz. Mining the Ecosystem to Improve Type Inference For Dynamically Typed Languages. In Onward! ’14, ACM, New York, NY, USA, 2014. Details.
Dynamically typed languages lack information about the types of variables in the source code. Developers care about this information as it supports program comprehension. Ba- sic type inference techniques are helpful, but may yield many false positives or negatives. We propose to mine information from the software ecosys- tem on how frequently given types are inferred unambigu- ously to improve the quality of type inference for a single system. This paper presents an approach to augment existing type inference techniques by supplementing the informa- tion available in the source code of a project with data from other projects written in the same language. For all available projects, we track how often messages are sent to instance variables throughout the source code. Predictions for the type of a variable are made based on the messages sent to it. The evaluation of a proof-of-concept prototype shows that this approach works well for types that are sufficiently popular, like those from the standard librarie, and tends to create false positives for unpopular or domain specific types. The false positives are, in most cases, fairly easily identifiable. Also, the evaluation data shows a substantial increase in the number of correctly inferred types when compared to the non-augmented type inference.
Jan Kurs, Mircea Lungu, and Oscar Nierstrasz. Bounded Seas: Island Parsing Without Shipwrecks. In Benoit Combemale, DavidJ. Pearce, Olivier Barais, and JurgenJ. Vinju (Ed.), Software Language Engineering, Lecture Notes in Computer Science 8706 p. 62-81, Springer International Publishing, 2014. Details.
Imprecise manipulation of source code (semi-parsing) is useful for tasks such as robust parsing, error recovery, lexical analysis, and rapid development of parsers for data extraction. An island grammar precisely defines only a subset of a language syntax (islands), while the rest of the syntax (water) is defined imprecisely. Usually, water is defined as the negation of islands. Albeit simple, such a definition of water is naive and impedes composition of islands. When developing an island grammar, sooner or later a programmer has to create water tailored to each individual island. Such an approach is fragile, however, because water can change with any change of a grammar. It is time-consuming, because water is defined manually by a programmer and not automatically. Finally, an island surrounded by water cannot be reused because water has to be defined for every grammar individually. In this paper we propose a new technique of island parsing - bounded seas. Bounded seas are composable, robust, reusable and easy to use because island-specific water is created automatically. We integrated bounded seas into a parser combinator framework as a demonstration of their composability and reusability.
Andrea Caracciolo, Andrei Chis, Boris Spasojevic, and Mircea Lungu. Pangea: A Workbench for Statically Analyzing Multi-Language Software Corpora. In Source Code Analysis and Manipulation (SCAM), 2014 IEEE 14th International Working Conference on, IEEE, September 2014. Details.
Software corpora facilitate reproducibility of analyses, however, static analysis for an entire corpus still requires considerable effort, often duplicated unnecessarily by multiple users. Moreover, most corpora are designed for single languages increasing the effort for cross-language analysis. To address these aspects we propose Pangea, an infrastructure allowing fast development of static analyses on multi-language corpora. Pangea uses language-independent meta-models stored as object model snapshots that can be directly loaded into memory and queried without any parsing overhead. To reduce the effort of performing static analyses, Pangea provides out-of-the box support for: creating and refining analyses in a dedicated environment, deploying an analysis on an entire corpus, using a runner that supports parallel execution, and exporting results in various formats. In this tool demonstration we introduce Pangea and provide several usage scenarios that illustrate how it reduces the cost of analysis.