A Comparative Review of LOCC and CodeCount

Philip M. Johnson
Collaborative Software Development Laboratory
Department of Information and Computer Sciences
University of Hawaii

http://csdl.ics.hawaii.edu/techreports/00-10/00-10.html

November, 2000
Small changes: November 2001

Introduction

This paper provides one review of the comparative strengths and weaknesses of LOCC and CodeCount, two tools for calculating the size of software source code. The next two sections provide quick overviews of CodeCount and LOCC. The final section presents the perceived strengths and weaknesses of the two tools. A caveat: although I am attempting to be objective in this review, I have in-depth knowledge of LOCC and only very superficial knowledge of CodeCount. Comments and corrections are solicited and welcome.

CodeCount

The incarnations of CodeCount can be divided into two basic flavors. The first flavor, which I will dub "Classic" CodeCount, has been in production use for over a decade. The second flavor, dubbed "CSCI" CodeCount, is the result of a set of student projects to support various "object" languages by extending Classic CodeCount.

Classic CodeCount is a relatively old and mature project. The first work on Classic CodeCount began in the late 1980s. CodeCount reflects its early origins in three important ways.

  1. It is written in ANSI C, with all the typical strengths (speed) and weaknesses (lack of higher level program structuring mechanisms, etc.) of that language.

  2. It was originally designed to provide size metric information appropriate to the kinds of languages targeted at that time (assembly languages and non-OO languages such as C and Fortran). Classic CodeCount does not employ a grammar and so cannot reliably recognize object-oriented language structures within a file, such as packages, classes, methods, inner classes, etc.

  3. It has been used extensively by many industrial partners for over a decade, and has presumably been used to count many millions of lines of code. It can be presumed to be well-tested and reliable for the kinds of measures it produces.
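
To make the second point concrete, here is a minimal sketch of the kind of counting that is possible without a grammar. It is not CodeCount's algorithm: the function name count_lines, the category names, and the sample source are my own assumptions for illustration. The sketch classifies each physical line by inspecting its text alone, so it can report blank, comment, and source lines, but it has no notion of which class or method a line belongs to.

```python
def count_lines(source: str) -> dict:
    """Classify each physical line of Java-like source without parsing it.

    Known limitations of this simplification: comment markers inside string
    literals are miscounted, and code that follows a closing "*/" on the
    same line is counted as comment.
    """
    counts = {"total": 0, "blank": 0, "whole_comment": 0,
              "embedded_comment": 0, "physical_sloc": 0}
    in_block = False  # True while inside an unterminated /* ... */ comment
    for line in source.splitlines():
        counts["total"] += 1
        text = line.strip()
        if in_block:
            counts["whole_comment"] += 1
            if "*/" in text:
                in_block = False
        elif not text:
            counts["blank"] += 1
        elif text.startswith("//"):
            counts["whole_comment"] += 1
        elif text.startswith("/*"):
            counts["whole_comment"] += 1
            in_block = "*/" not in text
        else:
            counts["physical_sloc"] += 1
            if "//" in text or "/*" in text:
                counts["embedded_comment"] += 1
    return counts


# Hypothetical sample input, used only to exercise the sketch.
SAMPLE = """\
/* Greets
   the world. */
public class HelloWorld {

  // Entry point.
  public static void main(String[] args) {
    System.out.println("Hello");  // print
  }
}
"""

result = count_lines(SAMPLE)
print(result)
```

Nothing in this loop knows where the class or the main method begins or ends, which is why per-class and per-method sizes are out of reach for a purely line-oriented counter.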

Here is an example of the kind of output produced by CodeCount:

                                    Temporary Project Name              
The Totals
   Total   Blank |      Comments     |  Compiler  Data    Exec.  |  Number  |          File  SLOC
   Lines   Lines |   Whole  Embedded |  Direct.   Decl.   Instr. | of Files |   SLOC   Type  Definition
------------------------------------------------------------------------------------------------------------------------------------
     399      61 |      94         5 |       11      37      196 |      5   |     244  CODE  Physical
     399      61 |      94         5 |       11      13      135 |      5   |     159  CODE  Logical
       0       0 |       0         0 |        0       0        0 |      0   |       0  DATA  Physical

Number of files successfully accessed........................     5 out of 5

Ratio of Physical to Logical SLOC............................     1.53

Number of files with :
        Executable Instructions        >     100          =       1
        Data Declarations              >     100          =       0
        Percentage of Comments to SLOC <      60.0 %      =       4      Ave. Percentage of Comments to Physical SLOC =  40.6

Total occurrences of these Java Keywords :
   Compiler Directives                Data Keywords                      Executable Keywords
 import.............     11         abstract...........      0         goto...............      0
 export.............      0         const..............      0         if.................      4
                                    boolean............      0         else...............      1
                                    int................      6         for................      0
                                    long...............      0         do.................      0
                                    byte...............      0         while..............      1
                                    short..............      0         continue...........      0
                                    char...............      0         switch.............      0
                                    extends............      2         case...............      0
                                    float..............      0         break..............      0
                                    double.............      0         default............      0
                                    implements.........      1         return.............      2
                                    class..............      2         super..............      0
                                    function...........      9         this...............      1
                                    interface..........      0         new................      9
                                    native.............      0         try................      1
                                    void...............     12         throw..............      0
                                    static.............      0         throws.............      0
                                    package............      0         catch..............      1
                                    private............      3         with...............      0
                                    public.............     17        
                                    protected..........      0        
                                    operator...........      0        
                                    volatile...........      0        

REVISION AG1  SOURCE PROGRAM -> JAVA_LINES          This output produced on Tue Apr 27 10:21:37 1999
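
The summary figures in this output can be cross-checked against the table rows. The short calculation below uses only numbers copied from the output above. The reading of the comment percentage as whole plus embedded comments divided by physical SLOC is my inference from the figures, not something taken from CodeCount documentation.

```python
# Figures copied from the "Physical" and "Logical" CODE rows of the output.
total_lines, blank_lines = 399, 61
whole_comments, embedded_comments = 94, 5
physical = {"directives": 11, "data_decl": 37, "exec_instr": 196}
logical = {"directives": 11, "data_decl": 13, "exec_instr": 135}

# SLOC is the sum of directives, data declarations, and executable instructions.
physical_sloc = sum(physical.values())
logical_sloc = sum(logical.values())

# Blank lines, whole-line comments, and physical SLOC account for every line.
accounted_for = blank_lines + whole_comments + physical_sloc

# "Ratio of Physical to Logical SLOC" as reported in the output.
ratio = round(physical_sloc / logical_sloc, 2)

# Assumed formula for "Ave. Percentage of Comments to Physical SLOC".
comment_pct = round(100 * (whole_comments + embedded_comments) / physical_sloc, 1)

print(physical_sloc, logical_sloc, accounted_for, ratio, comment_pct)
```

Under these assumptions the calculation reproduces the reported values: 244 physical and 159 logical SLOC, a ratio of 1.53, and a comment percentage of 40.6.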

"CSCI" CodeCount appears to be a collection of student projects done in the past year or so to modify the original CodeCount tool to support "object" languages. The level of counting sophistication appears to be constrained by the design of Classic CodeCount. Essentially, the code was modified with the addition of counter variables to count the occurrence of method/function declarations, and the output routines were enhanced to print the number of methods/functions found in the file. CSCI CodeCount provides this kind of support for languages like HTML, XML, Excel, JavaScript, C/C++, etc.

As a result, CSCI CodeCount output consists of the standard Classic CodeCount output, interspersed with additional output lines indicating the number of objects found. Here is an excerpt of CSCI CodeCount output illustrating the kinds of object information provided:

    79     15 |    27        1 |       0       1      36 |     37 | CODE  CybraMenu.html
 Number of User Defined Object of CybraMenu.html = 3
    39      7 |    16        1 |       0       0      15 |     15 |ObjectType  ObjectName
 Number of Methods in Object ObjectName = 2
    32      6 |    13        1 |       0       0      12 |     12 |Data  Test
 Number of Methods in Object Test = 2
    12      3 |     5        0 |       0       0       4 |      4 |Logic  NestedObj
 Number of Methods in Object NestedObj = 1
 Number of User Defined Method in CybraMenu.html = 2

LOCC

Unlike CodeCount, work on LOCC began relatively recently. Joe Dane started work on LOCC in 1998, and the first public release occurred in 1999. LOCC arose from our earlier work on a system called JavaLOC. Interestingly, JavaLOC was a Java-based lines of code counter for Java source that was designed in a manner quite similar to CodeCount, in that it was not grammar-based. However, it attempted to go beyond CodeCount by determining the number of lines of code in each individual method in a Java class, then summing those to determine the number of lines of code in an entire class, and summing the class sizes to determine the size of the corresponding package.

Calculating this "hierarchical" size data at these differing grain sizes of methods, classes, and packages (as opposed to files) has significant advantages: you can determine things like the average number of methods per class, the distribution of method sizes, and so forth. Concrete applications of this information include triggering for inspection any methods that exceed a certain size, or characterizing the size of "small", "medium", and "large" methods for project estimation.
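
As a rough illustration of what hierarchical size data makes possible, the sketch below stores a size for each method and class and then derives the two applications just mentioned. The type names MethodSize and ClassSize, the helper functions, and the sample figures are all hypothetical; they do not reflect the internal data model of JavaLOC or LOCC.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class MethodSize:
    name: str
    loc: int


@dataclass
class ClassSize:
    name: str
    loc: int  # size of the whole class, including lines outside any method
    methods: list = field(default_factory=list)


def methods_per_class(classes):
    """Average number of methods per class."""
    return mean(len(c.methods) for c in classes)


def oversized_methods(classes, limit):
    """Qualified names of methods whose size exceeds the inspection limit."""
    return [f"{c.name}.{m.name}"
            for c in classes for m in c.methods if m.loc > limit]


# Invented figures for a two-class package.
package = [
    ClassSize("Parser", 120, [MethodSize("parse", 70), MethodSize("reset", 10)]),
    ClassSize("Token", 25, [MethodSize("text", 5)]),
]

package_loc = sum(c.loc for c in package)   # package size as the sum of class sizes
average = methods_per_class(package)
flagged = oversized_methods(package, limit=50)
print(package_loc, average, flagged)
```

With file-level totals alone, neither the average nor the list of methods to inspect could be computed, since both depend on knowing where each method starts and ends.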

After struggling with JavaLOC for about a year, we came to the conclusion that a non-grammar-based approach to providing such "hierarchical" size data was unmanageable: the system's algorithms were kludgey and either failed or produced incorrect results rather frequently.

Joe Dane, the original developer of LOCC, recognized that a grammar-based approach would provide hierarchical data robustly and, as a bonus, could allow the system to be designed so that new languages could be supported simply by "plugging in" a new grammar. He found a Java-based parser generator called JavaCC that provided appropriate infrastructure and grammars for many popular languages.
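
The "plugging in" idea can be sketched as a registry that maps a file extension to a language-specific counter. This is only an analogy under stated assumptions: LOCC itself is written in Java and builds on JavaCC grammars, whereas the names register and count below are hypothetical and the single registered counter is a trivial stand-in rather than a real parser.

```python
import os
from typing import Callable, Dict

# A counter takes source text and returns {unit name: lines of code}.
Counter = Callable[[str], Dict[str, int]]
_COUNTERS: Dict[str, Counter] = {}


def register(extension: str):
    """Decorator that plugs a counter in for one file extension."""
    def decorator(counter: Counter) -> Counter:
        _COUNTERS[extension] = counter
        return counter
    return decorator


@register(".txt")
def count_plain_text(source: str) -> Dict[str, int]:
    # Stand-in for a grammar-based counter: counts non-blank lines only.
    return {"file": sum(1 for line in source.splitlines() if line.strip())}


def count(filename: str, source: str) -> Dict[str, int]:
    """Dispatch to whichever counter is registered for the file's extension."""
    extension = os.path.splitext(filename)[1]
    if extension not in _COUNTERS:
        raise ValueError(f"no counter registered for {extension!r}")
    return _COUNTERS[extension](source)


sizes = count("notes.txt", "first line\n\nsecond line\n")
print(sizes)
```

The dispatching code never changes when a language is added; supporting a new language means registering one more counter, which is the property the grammar-based design aims for.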

As well as designing the system for ease of extension to any language with a JavaCC grammar, he also designed the system for ease of extension to different input and output formats. Thus, there are command line, programmatic, and GUI input formats, and text, Leap data file, and CSV (comma-separated value) output formats. Here is some example output in text format from LOCC for the canonical HelloWorld program:

Java Source: HelloWorld.java (6)
Number of Classes: 1
Number of Interfaces: 0
Number of Methods: 1
Package: 
 Class: HelloWorld (5)
1 Method(s):
  Method: main (3)

As you can see, LOCC provides the total LOC for the entire file (6), the total LOC for the class (5), the total number of classes and methods (1 each), and the total LOC for each method (3 LOC for main).

LOCC, however, does something quite useful beyond providing a hierarchical size-based measurement of the total lines of code in one or more files. It also produces a "structural diff" by comparing two versions of a system, matching classes and methods (by name), and determining the number of lines of code that were changed in each method, class, package, etc. After making some minor modifications to the canonical HelloWorld program, the text output for the "diff" between the two versions might look as follows:

Size Difference Info for HelloWorld.java
2 lines added in method main
0 lines added in class HelloWorld
0 lines added in package 

This indicates that two lines were added (or simply modified) in the main method, that no lines were added/modified in the class HelloWorld apart from the method-level modifications already listed, and that no lines were added/modified in the package apart from those already indicated at the class level.

At least in our experience, this "diff" mechanism is even more useful than the "total" mechanism, because most projects do not start from scratch. Instead, in an incremental development scenario, a developer wants to estimate how many previously written methods might be "touched" and how many new methods might be written in the upcoming increment of the project. LOCC's diff method provides a much more accurate measure of these changes than simply computing the total LOC before and after the project increment. For example, an increment of development might not change the total size of the system significantly, but might require extensive rewriting of many hundreds of lines of code.
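
The difference between the two measures can be sketched as follows. This is not LOCC's implementation: it assumes the methods have already been extracted into a mapping from method name to body lines, and it swaps in Python's difflib.SequenceMatcher as the line-matching technique. The function names and the sample method bodies are hypothetical.

```python
from difflib import SequenceMatcher


def lines_added_or_modified(old, new):
    """Count the lines of `new` that are inserted or replaced relative to `old`."""
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return sum(j2 - j1
               for tag, _i1, _i2, j1, j2 in matcher.get_opcodes()
               if tag in ("insert", "replace"))


def structural_diff(old_methods, new_methods):
    """Match methods by name and report added/modified lines for each one."""
    return {name: lines_added_or_modified(old_methods.get(name, []), body)
            for name, body in new_methods.items()}


# Two invented versions of a class, as {method name: body lines}.
old_version = {
    "main": ["int n = 1;", "print(n);", "return;"],
    "legacy": ["cleanup();"],
}
new_version = {
    "main": ["int n = 1;", "log(n);", "exit(0);"],
    "usage": ["print(help);"],
}

diff = structural_diff(old_version, new_version)
net_change = (sum(len(body) for body in new_version.values())
              - sum(len(body) for body in old_version.values()))
print(diff, net_change)
```

In this example the total size is unchanged (a net change of 0), yet the method-level diff reports two modified lines in main and one line in the new method usage, which is the effort a before-and-after total would hide.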

Although the LOCC output examples above come from trivial programs, LOCC has been used to count systems of significant size. For example, it was used to count a Java system of over 500,000 lines of source code.

Strengths and Weaknesses of CodeCount and LOCC

Given this difference in background and functionality, here are some factors to keep in mind when deciding whether to adopt CodeCount or LOCC. There may be additional important factors that are not included here. Please let us know and we will add them to this report.