Replication Package for the Paper on
"The Use of Summation to Aggregate Software Metrics Hinders the Performance of Defect Prediction Models"

Abstract

Defect prediction models help software organizations to anticipate where defects will appear in the future. When training a defect prediction model, historical defect data is often mined from a Version Control System (VCS, e.g., Subversion), which records software changes at the file level. Software metrics, on the other hand, are often calculated at the class or method level (e.g., McCabe’s Cyclomatic Complexity). To address the disagreement in granularity, the class- and method-level software metrics are aggregated to the file level, often using summation (i.e., the McCabe value of a file is the sum of the McCabe values of all methods within the file). A recent study shows that summation significantly inflates the correlation between lines of code (Sloc) and cyclomatic complexity (Cc) in Java projects. While there are many other aggregation schemes (e.g., central tendency, dispersion), they have remained unexplored in the scope of defect prediction.
In this study, we set out to investigate how different aggregation schemes impact defect prediction models. Through an analysis of 11 aggregation schemes using data collected from 255 open source projects, we find that: (1) aggregation schemes can significantly alter correlations among metrics, as well as the correlations between metrics and the defect count; (2) when constructing models to predict defect proneness, applying only the summation scheme (i.e., the most commonly used aggregation scheme in the literature) achieves the optimal performance in just 11% of the studied projects, while applying all of the studied aggregation schemes achieves the optimal performance in 40% of the studied projects; (3) when constructing models to predict defect rank or count, either applying only the summation or applying all of the studied aggregation schemes achieves similar performance, with both achieving the closest to the optimal performance more often than the other studied aggregation schemes; and (4) when constructing models for effort-aware defect prediction, the mean or median aggregation schemes yield performance values that are significantly closer to the optimal performance than any of the other studied aggregation schemes. Broadly speaking, the performance of defect prediction models is often underestimated due to our community’s tendency to use only the summation aggregation scheme. Given the potential benefit and the negligible cost of applying additional aggregation schemes, we advise that future defect prediction models should explore a variety of aggregation schemes.
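
To make the idea of aggregating method-level metrics to the file level concrete, here is a minimal Python sketch. It is an illustration only and is not taken from the replication scripts (which are written in R). The file names, the metric values, and the four schemes shown (sum, mean, median, population standard deviation) are assumptions chosen for the example; this page does not list the 11 schemes used in the study.

```python
from statistics import mean, median, pstdev

# Hypothetical method-level cyclomatic complexity values, keyed by file.
# These numbers are invented for illustration, not taken from the dataset.
method_cc = {
    "Parser.java": [1, 2, 3, 10],
    "Util.java": [4, 4],
}

# A few example aggregation schemes: one summation, two central-tendency
# measures, and one dispersion measure (an assumed, partial selection).
SCHEMES = {
    "sum": sum,
    "mean": mean,
    "median": median,
    "stdev": pstdev,
}


def aggregate(values_by_file, schemes=SCHEMES):
    """Return {file: {scheme_name: aggregated_value}} for every file."""
    return {
        path: {name: fn(values) for name, fn in schemes.items()}
        for path, values in values_by_file.items()
    }


file_level = aggregate(method_cc)
for path, row in file_level.items():
    print(path, row)
```

In this toy input, both files have the same mean complexity, yet their sums differ by a factor of two simply because one file contains more methods. This is the kind of size effect that summation can introduce.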

If anything is unclear, please don't hesitate to contact any of the authors.

Data Sources

Our empirical study uses 255 projects collected from SourceForge and GoogleCode as subject projects.
Our primary tool for computing metrics is Understand, a commercial tool.
Our raw data can be downloaded from this link.

Experimental Results

RQ1. Correlation Analysis

The R script used to answer this RQ can be downloaded from this link (RQ1.R).

RQ2. Defect Prediction Models

The R script used to answer this RQ can be downloaded from this link (RQ2.R). This script requires AUC.R, which is also available for download.

Authors

Feng Zhang
(first name <at> cs.queensu.ca)
Ahmed E. Hassan
(first name <at> cs.queensu.ca)
Shane McIntosh
(last name <at> cs.queensu.ca)
Ying Zou
(first name <dot> last name <at> queensu.ca)