Readings in Empirical Evaluation for 
Budding Software Engineering Researchers

Philip Johnson
Collaborative Software Development Laboratory
University of Hawaii 

Last Update: 08/03/2005 11:47 AM

Motivation

In CSDL, students usually write a thesis describing research in software engineering.  In the CSDL research culture, this typically means (a) designing and implementing a new kind of technology, (b) performing an empirical study using the technology, and (c) writing up the results.

In many cases, the hardest part is coming up with an appropriate and effective empirical study.  In some circumstances, it appears to me that students look at old theses, find a survey that was given out in a classroom, and modify it slightly.  While learning from examples is not a bad thing to do, it is important to develop a deeper understanding of how to do empirical studies in order to make sure that the example you are leveraging is appropriate to your situation.

To help all of us become more sophisticated in our approach to empirical studies in the context of software engineering research, this technical report provides pointers to a set of readings.  

As you proceed with your research, you are attempting to create a "tight" research project: one that contains (a) a motivation in terms of a 'big' problem in software engineering whose solution would be important, (b) a set of research questions regarding some concrete aspect of the 'big' problem, (c) a set of testable hypotheses corresponding to these research questions, (d) an evaluation methodology that yields evidence either for or against the hypotheses, (e) the data you collected, and (f) your interpretation of the results.  I find that in the initial phases of research, some students come up with interesting technological ideas without good research questions; in other words, they have a solution looking for a problem.  That's not good.  Others come up with interesting hypotheses that aren't actually tested by the data they intend to collect.  That's not good either.  A good research project "hangs together" as a whole.

This is much, much harder than it might seem at first glance.  Empirical design is a skill that takes practice to master.  Just as you don't read a book on software design with the expectation of finding the exact design required for your software system, you shouldn't read the following papers with the expectation of finding the exact evaluation design appropriate for your research.  Similarly, just as you wouldn't expect to become a good software designer just by reading a book or two, you can't become a good experimentalist just by reading through the following links (although that's a good first step).  To be a good designer of either software or empirical evaluation takes practice and experience.  But you have to start somewhere, so this document contains citations that I hope will help you get some traction.

CSDL Empirical Evaluation Strategies

As I noted above, almost all CSDL theses involve the development of a novel software-based technology.  This greatly influences the approach to evaluation. (In contrast, theses in "information technology" might study a pre-existing technology, and thus their whole research process can be devoted to evaluation.  In CSDL, a substantial amount of the overall research effort involves the design and implementation of the new technology.)

If you look at prior CSDL theses, the evaluation tends to address one (or more) of the following issues:

(1) What happens when users employ my technology?  This is the most basic form of evaluation, and is essentially a usability evaluation.  The common pitfall when developing this kind of evaluation in the context of CSDL is to focus too much on superficial characteristics of the technology, such as fonts, colors, menu items, etc. (This is because much of the usability literature seems to focus on these issues.)  Although some evaluation of the "look and feel" is important, your usability evaluation should focus primarily on ways to discover whether the technology worked correctly, whether users used your technology correctly, whether they feel they benefited from the use of your technology, what obstacles they encountered while using it, and so forth.  In general, you want to find out what you would want to know if you were going to embark on the 2.0 version of your technology.  By reading the literature on usability evaluation, you will find that there are many different ways to accomplish this task, from questionnaires, to video-taped observations, to interviews.  All of these have been used in CSDL in the past; you must determine what approach is best suited to your situation.

(2) Is my technology effective?  Unlike the first form of evaluation, which is essentially descriptive in nature, determining whether your technology is effective involves some form of comparison.  Basically, you want to see if your technology produces some kind of change, or effect.  For this to work, you need to gather at least two different groups of data and compare the values you obtained.  How you create these groups really depends upon the nature of your technology and research questions.  For example, Danu Tjahjono's evaluation involved splitting up two classes of students into multiple groups to test different inspection approaches.  Aaron Kagawa's evaluation involved splitting up Hackystat packages into two groups: one group being packages that should be less in need of inspection and the other group being packages that should be more in need of inspection.  Cedric Zhang's evaluation involves a comparison over time: first a set of baseline measures is obtained, then the technology is introduced, and the new values for the measures are compared to the original ones.

(3) What broader issue(s) in software engineering can be investigated using my technology?  In most cases, CSDL research results in technology that not only provides automated support for some kind of practice, it also provides infrastructure that helps address more fundamental questions in software engineering.  In other words, it can serve as experimental infrastructure.  As an example, Danu Tjahjono's research on CSRS not only resulted in technology support for code inspection, it also provided infrastructure that allowed him to investigate whether the group meeting phase of inspection was actually cost-effective for defect removal. 

In general, B.S. theses tend to address (1) and perhaps (2), M.S. theses tend to address (1) and (2), and Ph.D. theses tend to address (1), (2), and (3).  
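To make the comparison in (2) concrete, here is a minimal sketch in Python of one way to compare two groups of data.  It is only an illustration under assumptions of my own: the data values are made up, the function names are hypothetical, and an exact permutation test on the difference in means is just one of many possible analyses.  It is not the analysis used in any of the theses mentioned above.

```python
from itertools import combinations
from statistics import mean


def mean_difference(treatment, control):
    """Observed effect: mean of the treatment group minus mean of the control group."""
    return mean(treatment) - mean(control)


def exact_permutation_p_value(treatment, control):
    """Two-sided exact permutation test on the difference in means.

    Enumerates every way of relabeling the pooled observations into groups
    of the original sizes, and returns the fraction of relabelings whose
    absolute mean difference is at least as large as the observed one.
    Only practical for small groups, since the number of relabelings grows
    very quickly.
    """
    pooled = list(treatment) + list(control)
    observed = abs(mean_difference(treatment, control))
    indices = range(len(pooled))
    extreme = 0
    total = 0
    for chosen in combinations(indices, len(treatment)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating point round-off.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical data (invented for illustration): defects found per
# reviewer, with and without the technology under study.
with_tool = [9, 11, 10, 12]
without_tool = [5, 6, 7, 4]

effect = mean_difference(with_tool, without_tool)
p_value = exact_permutation_p_value(with_tool, without_tool)
print(f"mean difference = {effect:.2f}, p = {p_value:.3f}")
```

A small p-value here only says that the observed difference would be unusual if the group labels were arbitrary.  Whether the difference is due to your technology still depends on how the groups were formed, which is exactly the design question the readings below address.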

General Introductions to Empirical Research Design 

One question you should be prepared to answer at your thesis defense: why did you _not_ choose another empirical evaluation approach (e.g., a controlled experiment or an ethnographic study)?  In other words, why is the design that you chose the best one for your research situation?  In order to answer that question, you need to first understand what the alternatives are.  This section provides links to two sites: allpsych.com and socialresearchmethods.net, both of which have well-written overviews of the various empirical design methods.

AllPsych.com:

Research Methods, Chapter 4: Single Subject Design. This chapter shows an experimental design method for comparing treatment effects on a single subject or a group of single subjects.  The basic idea is to begin with a pre-test, or collection of baseline information, then introduce the treatment, and see if the baselines change. 

Research Methods, Chapter 5: Experimental Design. This chapter introduces three basic experimental designs: (1) pre-experimental design, (2) quasi-experimental design, and (3) true experimental design. 

Research Methods, Chapter 6: Historical, Developmental, and Qualitative Research Design.  These are 'non-experimental', or qualitative research designs.  If properly designed and executed, these designs aren't any "less" valid than experimental designs; they are just "differently" valid.
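As a rough illustration of the baseline-then-treatment idea behind the single subject design (Chapter 4 above), the following Python sketch computes each subject's change from the baseline phase to the treatment phase and applies an exact two-sided sign test to those changes.  The subject names and measurements are invented, the function names are hypothetical, and the sign test is my own choice of a simple analysis, not something prescribed by the AllPsych chapters.

```python
from math import comb
from statistics import mean


def phase_change(baseline, treatment):
    """Change in one subject's mean measure from baseline to treatment phase."""
    return mean(treatment) - mean(baseline)


def sign_test_p_value(changes):
    """Two-sided exact sign test on a list of per-subject changes.

    Asks how unusual the split between increases and decreases would be
    if each were equally likely.  Changes of exactly zero are dropped.
    """
    nonzero = [c for c in changes if c != 0]
    n = len(nonzero)
    if n == 0:
        return 1.0
    positives = sum(1 for c in nonzero if c > 0)
    k = max(positives, n - positives)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical data (invented for illustration): for each subject, a few
# measurements before and after the treatment is introduced.
subjects = {
    "s1": ([4, 5, 4], [7, 8, 7]),
    "s2": ([6, 6, 5], [7, 7, 8]),
    "s3": ([3, 4, 3], [5, 5, 6]),
    "s4": ([5, 5, 6], [6, 7, 6]),
    "s5": ([2, 3, 3], [4, 4, 5]),
    "s6": ([4, 4, 4], [5, 6, 5]),
}

changes = [phase_change(before, after) for before, after in subjects.values()]
p_value = sign_test_p_value(changes)
print(f"changes = {[round(c, 2) for c in changes]}, p = {p_value:.3f}")
```

The sign test ignores the size of each change and looks only at its direction, which makes it easy to explain but not very sensitive.  The statistical analysis links later in this report are a better guide to choosing a test that fits your design.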

SocialResearchMethods.net:

Experimental Design.  This chapter provides another overview of experimental (and quasi-experimental) design, and discusses the trade-offs between them.

Books and other offline resources:

Experimental and Quasi-Experimental Designs for Research, Donald Campbell and Julian Stanley, Houghton-Mifflin, 1963. (Recommended by Victor Basili)

Research Methods in Social Relations, Charles Judd, Eliot Smith, Louise Kidder, Harcourt Brace Jovanovich, 1991. (Recommended by Larry Votta).

Case Study Research, Robert Yin, Sage Publications, 1994.

Guidelines for Empirical Research

Although most of the papers cited on this page present guidelines of one sort or another, the following articles provide an excellent overview. 

Preliminary guidelines for empirical research in software engineering, Barbara Kitchenham et al. Once you have decided on a possible evaluation approach for your research, this paper can help you identify the key pieces of information that you must specify to carry out the evaluation appropriately.

Writing good software engineering research papers, Mary Shaw. Discusses research paradigms present in typical software engineering conference paper submissions, the concerns of program committee members, and suggestions on how to design research and present results for optimal acceptance.

Ethical issues in empirical studies of software engineering, Singer et al. Introduces ethical issues that arise in software engineering research and how to best address them.

Perspectives on the state of empirical research in software engineering

These papers combine practical guidance on empirical research with a discussion of the state of the discipline. 

Experimentation in Software Engineering, Victor Basili et al. This paper provides a framework for classification of experimental research in software engineering, and recommendations for future experimental research.

Should computer scientists experiment more?, Walter Tichy. An article describing (and refuting) common misperceptions regarding empirical evaluation in software engineering.

What makes good research in software engineering?, Mary Shaw. Provides an overview of research paradigms in software engineering.

Empirical Studies of Software Engineering, Perry et al. Discusses why we need empirical studies, common problems in empirical research in software engineering, and guidelines to support useful software engineering empirical research.

Experimental Validation in Software Engineering, Marvin Zelkowitz and Dolores Wallace. Discusses a 12-model classification scheme for experimental software engineering research and uses it to evaluate how software engineering research validates its theories and how software engineering compares to other scientific disciplines.

Qualitative Research

Most CSDL research has a qualitative component.  These links provide useful insight into how to do effective qualitative research.

Case Studies for Method and Tool Evaluation, Barbara Kitchenham et al. Provides a nice introduction to case studies, indicating how they differ from experimental studies, and what guidelines to follow to help improve the usefulness of the results.

Qualitative Methods in Empirical Studies of Software Engineering, Seaman et al.  Discusses qualitative techniques for gathering information about the human aspects of software engineering and how they can be integrated with quantitative techniques.

Using Qualitative Methods in Software Engineering, Jeff Carver, Carolyn Seaman, Ross Jeffery.  These are slides from the 2004 International Advanced School of Empirical Software Engineering. Provides insights on effective interviewing, participant observation, surveys, and coding, along with case studies of qualitative software engineering research.

Grounded theory: a thumbnail sketch, Bob Dick. Provides a very nice overview of Grounded Theory, a data-driven, qualitative, emergent approach to theory building. Includes references and online links to further information.

Specialized issues in software engineering empirical research

This section provides more detailed information about specific kinds of empirical research issues.  They will be more or less relevant to you depending upon the empirical design you choose. In many cases, these links are only a starting point; you will have to do more research to learn how to apply the techniques introduced below. 

Usability Evaluation:

Usability Evaluation Methods, Zhijun Zhang.  Provides an overview of the three types of usability methods: Testing, Inspection, and Inquiry.

Comparison of Usability Evaluation Methods, Jeff Axup.  Provides a taxonomy of 10 usability evaluation methods with comparisons and contrasts. 

Surveys and questionnaires:

Questionnaires in Usability Engineering, Jurek Kirakowski. Provides a good introduction to the design of questionnaires, including concepts of reliability and validity.

Questionnaire Design, John Stasko.  Provides an introduction to questionnaire design, including how to formulate objectives, when to use questionnaires, and how to write questionnaire questions. 

Web-Based Surveys for Corporate Information Gathering: A Bias-Reducing Design Framework, Jake Burkey et al. This paper reviews literature on web-based surveying and discusses how to design statistically useful web-based surveys.

Use of students:

Issues in using students in empirical studies in software engineering education, Jeff Carver et al. As the title suggests, this paper provides insights into the strengths and weaknesses of student subjects.

Theory:

What Theory is NOT, Robert Sutton et al.  An interesting paper about the importance of grounding empirical research in an underlying theory.

Comments on "What Theory is Not", Paul DiMaggio. Feedback on the article, with additional perspectives on theory.

Theoretically Speaking, Ron Weber.  A perspective on the nature of theories, what makes a good theory, and how to make theories useful.

Metric validation:

A methodology for validating software product metrics, Khaled El Emam.  If you are proposing a new kind of measure in your research, you must consider the issue of its validation: how to assess what the numbers produced by the measure actually mean (if anything).

Field studies:

A set of principles for conducting and evaluating interpretive field studies in information systems, Klein et al. If your research involves the study of an external organization, this paper can help you understand how to best collect and evaluate your data.

Technology adoption:

User acceptance of information technology: Toward a unified view, Viswanath Venkatesh et al. If your evaluation focuses on the adoption of your technology, this article provides a good overview of some of the issues.

Statistical analysis:

Case study examples from the Rice Virtual Lab in Statistics. A very cool site that shows various kinds of analyses on real case study data. Gives you a nice, practical introduction to statistical analyses appropriate to various experimental designs.

Model generation:

Building Parametric Models, Barry Boehm. These are slides from the 2003 International Advanced School of Empirical Software Engineering. Provides insights on how to build parametric models, including an 8-step model development process. Uses examples from the COCOMO model family.

Help improve this technical report

I am sure there are useful articles I have missed in this collection.  Please email me with suggestions on how to improve the content and structure of this technical report.