Debugging Spark Applications — A Study on Debugging Techniques of Spark Developers

Melike Geçer. Debugging Spark Applications — A Study on Debugging Techniques of Spark Developers. Masters thesis, University of Bern, May 2020. Details.


Debugging is the main activity to investigate software failures, identify their root causes, and eventually fix them. Debugging distributed systems in particular is burdensome, due to the challenges of managing numerous devices and concurrent operations, detecting the problematic node, lengthy log files, and real-world data being inconsistent. Apache Spark is a distributed framework which is used to run analyses on large-scale data. Debugging Apache Spark applications is difficult as no tool, apart from log files, is available on the market. However, an application may produce a lengthy log file, which is challenging to examine. In this thesis, we aim to investigate various techniques used by developers on a distributed system. In order to achieve that, we interviewed Spark application developers, presented them with buggy applications, and observed their debugging behaviors. We found that most of the time, they formulate hypotheses to allay their suspicions and check the log files as the first thing to do after obtaining an exception message. Afterwards, we use these findings to compose a debugging flow that can help us to understand the way developers debug a project.

Posted by scg at 20 May 2020, 10:15 am link
Last changed by admin on 21 April 2009