Trace retrieval techniques should be evaluated according to the purpose of measurement. Traceability research addresses a wide range of tasks in response to the Grand Challenges, including improving the quality of automatically generated traceability links and reducing the cost of creating them. While a variety of evaluation methods exist, the criteria for selecting an evaluation method for a specific tracing task should be well defined to allow a fair comparison of techniques. The evaluation of trace retrieval techniques has three dimensions: metrics, summarization methods, and classification threshold types.
A trace link is defined as a ‘specified association between a pair of artifacts, one comprising the source artifact and one comprising the target artifact.’ The task of tracing therefore involves discovering the set of target artifacts that are related to a given source artifact, and then establishing trace links between them. This can be accomplished manually or in a semi-automated way. For example, Information Retrieval (IR) techniques, such as the Vector Space Model (VSM), can be used to automatically compute the degree of relevance between source and target artifacts. The results are typically sorted in descending order of relevance score, and pairs of artifacts scoring over a certain threshold are classified as candidate links and are said to be retrieved. Following normal IR convention, a source artifact is often referred to as a query and a target artifact as a document.
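The retrieval step described above can be illustrated with a minimal sketch. It assumes a simplified VSM that uses raw term frequencies and cosine similarity (practical implementations usually add TF-IDF weighting, stop-word removal, and stemming). The function names, the sample artifacts, and the threshold value are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tf_vector(text):
    """Represent an artifact as a bag of lower-cased terms (raw term frequency)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, threshold):
    """Score every document against the query, sort by descending relevance,
    and keep the pairs whose score reaches the threshold (candidate links)."""
    q_vec = tf_vector(query)
    scored = [(doc_id, cosine(q_vec, tf_vector(text)))
              for doc_id, text in documents.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(doc_id, score) for doc_id, score in scored if score >= threshold]


# Hypothetical source artifact (query) and target artifacts (documents).
documents = {
    "D1": "user login password",
    "D2": "print monthly report",
    "D3": "user profile page",
}
candidate_links = retrieve("user login", documents, threshold=0.3)
```

With a threshold of 0.3, D1 and D3 are retrieved in that order and D2 is discarded, since it shares no term with the query.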
To evaluate the accuracy of a traceability technique, the retrieved results are compared against a predefined answer set. The accuracy of traceability techniques is typically measured using either classification accuracy metrics or rank accuracy metrics. Classification accuracy metrics count the number of correctly or incorrectly retrieved links. Rank accuracy metrics measure the accuracy of the relative ordering of correct links in the ordered retrieval results. Some studies, such as those studying the behavior of human trace analysts, do not use relevance scores and therefore use only classification accuracy metrics.
Classification Accuracy Metrics
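Recall measures the fraction of relevant documents that are correctly retrieved and is defined as follows:
\[ \mathit{Recall} = \frac{|\mathit{RelevantDocs} \cap \mathit{RetrievedDocs}|}{|\mathit{RelevantDocs}|} \]
Precision measures the fraction of retrieved documents that are relevant and is defined as follows:
\[ \mathit{Precision} = \frac{|\mathit{RelevantDocs} \cap \mathit{RetrievedDocs}|}{|\mathit{RetrievedDocs}|} \]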
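Fβ-measure is a weighted harmonic mean of recall and precision and is defined as follows:
\[ F_{\beta} = \frac{(1 + \beta^{2}) \cdot \mathit{Precision} \cdot \mathit{Recall}}{\beta^{2} \cdot \mathit{Precision} + \mathit{Recall}} \]
where β > 1 puts more emphasis on recall and β < 1 puts more emphasis on precision. The term F-measure, used without a subscript, commonly refers to the F1-measure (β = 1), which weights recall and precision equally.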
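As a small worked illustration of the three classification accuracy metrics, the following sketch computes recall, precision, and Fβ from a set of retrieved links and an answer set. The function name and the sample link identifiers are hypothetical, and returning 0 for empty sets is a convention assumed here rather than one stated in the text.

```python
def classification_metrics(retrieved, relevant, beta=1.0):
    """Return (recall, precision, F-beta) for one set of retrieved links.

    `retrieved` and `relevant` are collections of link identifiers.
    Empty denominators yield 0.0 (an assumed convention).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    recall = true_positives / len(relevant) if relevant else 0.0
    precision = true_positives / len(retrieved) if retrieved else 0.0
    denominator = beta ** 2 * precision + recall
    f_beta = ((1 + beta ** 2) * precision * recall / denominator
              if denominator else 0.0)
    return recall, precision, f_beta


# Hypothetical example: 4 retrieved links, 3 relevant links, 2 in common.
retrieved = {"D1", "D2", "D3", "D4"}
relevant = {"D1", "D3", "D5"}
recall, precision, f1 = classification_metrics(retrieved, relevant)
_, _, f2 = classification_metrics(retrieved, relevant, beta=2.0)
```

In this example recall is 2/3 and precision is 1/2, so F1 is 4/7 while F2 rises to 0.625, reflecting the heavier weight that β = 2 places on recall.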
Rank Accuracy Metrics
Average precision measures how well a traceability technique retrieves relevant documents at the top of the ranked retrieval results and is defined as follows:
\[ \mathit{AP} = \frac{\sum_{r=1}^{N} P(r) \times \mathit{relevant}(r)}{|\mathit{RelevantDocs}|} \]
where r is a rank position in the ordered list of N retrieved documents, relevant(r) is a binary function that equals 1 if the document at rank r is relevant and 0 otherwise, and P(r) is the precision computed after truncating the list immediately below that ranked position. Average precision returns a higher value (approaching 1) when more relevant documents are retrieved towards the top of the ranked list according to the computed relevance score.
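The behaviour of average precision can be checked with a short sketch. It assumes that the sum is normalized by the total number of relevant documents in the answer set; the function name and the two sample rankings are hypothetical.

```python
def average_precision(ranked_docs, relevant):
    """Average precision of one ranked result list.

    `ranked_docs` is ordered by descending relevance score and `relevant`
    is the answer set for the query. The sum of P(r) over relevant ranks
    is divided by the number of relevant documents (assumed normalization).
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for r, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / r  # P(r): precision of the list truncated at rank r
    return total / len(relevant)


relevant = {"D1", "D3"}
ap_good = average_precision(["D1", "D2", "D3", "D4"], relevant)  # relevant near the top
ap_poor = average_precision(["D2", "D4", "D1", "D3"], relevant)  # relevant near the bottom
```

The first ranking places the relevant documents at ranks 1 and 3 and scores 5/6; the second places them at ranks 3 and 4 and scores 5/12, even though both lists contain the same documents.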
DiffAR measures the difference between the average relevance scores of correctly and incorrectly retrieved links and is defined as follows:
\[ \mathit{DiffAR} = \frac{\sum_{(q,d) \in L_T} \mathit{rel}(q,d)}{|L_T|} - \frac{\sum_{(q,d) \in L_F} \mathit{rel}(q,d)}{|L_F|} \]
where q is a query, d is a document, rel(q,d) is the relevance score between q and d, L_T is the set of correctly retrieved links, and L_F is the set of incorrectly retrieved links.
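DiffMR measures the difference between the median relevance scores of correctly and incorrectly retrieved links and is defined as follows:
\[ \mathit{DiffMR} = \operatorname{med}_{(q,d) \in L_T} \mathit{rel}(q,d) - \operatorname{med}_{(q,d) \in L_F} \mathit{rel}(q,d) \]
where med denotes the median over the indicated set of links.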
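Lag measures how many incorrect links are retrieved above each correct link and is defined as follows:
\[ \mathit{Lag} = \frac{\sum_{(q,d) \in L_T} \mathit{Lag}(q,d)}{|L_T|} \]
where Lag(q,d) is the number of incorrectly retrieved links (q,d') for the same query q whose relevance score is higher than rel(q,d). A lower Lag indicates that correct links are better separated from incorrect ones in the ranked list.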
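The three score-based metrics above can be sketched together. The sketch assumes that retrieved links are given as (query, document, score) triples, that both the correct and incorrect link sets are non-empty, and that Lag is counted per query and averaged over the correct links. The function names and the sample data are hypothetical.

```python
from statistics import mean, median


def split_scores(links, answer_set):
    """Split relevance scores into correctly and incorrectly retrieved links."""
    correct = [s for q, d, s in links if (q, d) in answer_set]
    incorrect = [s for q, d, s in links if (q, d) not in answer_set]
    return correct, incorrect


def diff_ar(links, answer_set):
    """Difference between the mean scores of correct and incorrect links."""
    correct, incorrect = split_scores(links, answer_set)
    return mean(correct) - mean(incorrect)


def diff_mr(links, answer_set):
    """Difference between the median scores of correct and incorrect links."""
    correct, incorrect = split_scores(links, answer_set)
    return median(correct) - median(incorrect)


def lag(links, answer_set):
    """Average number of incorrect links ranked above each correct link,
    counted within the same query (assumed interpretation)."""
    per_link = []
    for q, d, s in links:
        if (q, d) in answer_set:
            above = sum(1 for q2, d2, s2 in links
                        if q2 == q and (q2, d2) not in answer_set and s2 > s)
            per_link.append(above)
    return mean(per_link)


# Hypothetical retrieved links and answer set.
links = [("R1", "D1", 0.9), ("R1", "D2", 0.7), ("R1", "D3", 0.4),
         ("R2", "D1", 0.8), ("R2", "D4", 0.6)]
answer_set = {("R1", "D1"), ("R1", "D3"), ("R2", "D4")}
```

For this data DiffAR and DiffMR are both negative (about -0.117 and -0.15), which signals that incorrect links tend to score higher than correct ones, and Lag is 2/3.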
AUC is the area under an ROC curve, where an ROC curve represents the trade-off between recall and false positive rate. False positive rate is defined as the fraction of non-relevant documents that are incorrectly retrieved. AUC measures how well a traceability technique discriminates relevant documents from non-relevant documents.
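One way to compute AUC without plotting the curve relies on a standard equivalence: the area under the ROC curve equals the probability that a randomly chosen relevant document receives a higher score than a randomly chosen non-relevant one, with ties counted as one half. The sketch below uses this pairwise formulation; the function name and the sample scores are hypothetical, and the quadratic loop is chosen for clarity rather than efficiency.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    `scores` are relevance scores and `labels` are truthy for relevant
    documents. Each (relevant, non-relevant) pair contributes 1 if the
    relevant document scores higher, 0.5 on a tie, and 0 otherwise.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one relevant and one non-relevant document")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical scores; 1 marks a relevant document, 0 a non-relevant one.
scores = [0.9, 0.8, 0.7, 0.6, 0.4]
labels = [1, 0, 1, 1, 0]
value = auc(scores, labels)
```

Here four of the six relevant/non-relevant pairs are ordered correctly, giving an AUC of 2/3. A value of 0.5 corresponds to random ordering and 1.0 to perfect discrimination.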
Result Summarization Methods
Trace retrieval results can be summarized in three ways: the average method, the aggregation method, and the per-query method.
The average method computes the metric value for each individual trace query and then averages these values across queries.
The aggregation method pools the retrieved documents of all queries, orders them in descending order of relevance score, and computes a single metric value over the pooled list.
The per-query method reports the tracing results of each query separately, using statistical tests or graphs.
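The difference between the average and aggregation methods can be seen in a small sketch that applies both to average precision. The data layout (a dictionary of scored documents per query), the function names, and the sample values are assumptions made for illustration.

```python
def average_precision(ranked_flags, n_relevant):
    """Average precision from a ranked list of booleans (True = relevant)."""
    hits, total = 0, 0.0
    for r, is_relevant in enumerate(ranked_flags, start=1):
        if is_relevant:
            hits += 1
            total += hits / r
    return total / n_relevant if n_relevant else 0.0


def average_method(results, answer):
    """Compute the metric per query, then average across queries."""
    values = []
    for q, scored in results.items():
        ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
        flags = [d in answer[q] for d, _ in ranked]
        values.append(average_precision(flags, len(answer[q])))
    return sum(values) / len(values)


def aggregation_method(results, answer):
    """Pool all (query, document) pairs, rank them by score, compute once."""
    pooled = [(q, d, s) for q, scored in results.items() for d, s in scored]
    pooled.sort(key=lambda triple: triple[2], reverse=True)
    flags = [d in answer[q] for q, d, _ in pooled]
    n_relevant = sum(len(docs) for docs in answer.values())
    return average_precision(flags, n_relevant)


# Hypothetical results for two queries.
results = {"R1": [("D1", 0.9), ("D2", 0.5)],
           "R2": [("D3", 0.8), ("D4", 0.6)]}
answer = {"R1": {"D1"}, "R2": {"D4"}}
avg_value = average_method(results, answer)
agg_value = aggregation_method(results, answer)
```

The two methods disagree on the same data (0.75 versus about 0.83), because aggregation compares scores across queries while averaging treats each query as an independent unit.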
Ranking with bootstrapping. Mean rank measures how well a traceability technique performs relative to other traceability techniques by ranking their metric values, where the metric can be any metric, such as average precision. For each query, the ranks of the traceability techniques are computed. The ranks are then averaged across queries for each traceability technique per data set. To obtain robust estimates, the current implementation resamples queries using 1000 bootstrap repetitions. In each bootstrap, the same number of queries as in the original data set is randomly selected with replacement and the mean rank is computed. The same procedure is repeated for all data sets, so five data sets yield 5000 bootstraps. Because mean rank ignores the magnitudes of metric values, each data set contributes to the final results with the same weight. Mean rank is therefore more robust than the raw metric values.
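The procedure can be sketched for a single data set as follows. This is an illustrative sketch under stated assumptions, not the implementation referred to above: rank 1 is assigned to the highest metric value, tied techniques share the average of their ranks, and the technique names and metric values are made up.

```python
import random


def ranks_for_query(metric_by_technique):
    """Rank techniques for one query (1 = best); ties share the average rank."""
    ordered = sorted(metric_by_technique, key=metric_by_technique.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while (j + 1 < len(ordered)
               and metric_by_technique[ordered[j + 1]] == metric_by_technique[ordered[i]]):
            j += 1
        shared_rank = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = shared_rank
        i = j + 1
    return ranks


def bootstrap_mean_ranks(per_query_metrics, repetitions=1000, seed=0):
    """Return, per technique, one mean rank for each bootstrap sample of queries."""
    rng = random.Random(seed)
    queries = list(per_query_metrics)
    query_ranks = {q: ranks_for_query(per_query_metrics[q]) for q in queries}
    techniques = list(per_query_metrics[queries[0]])
    samples = {t: [] for t in techniques}
    for _ in range(repetitions):
        # Resample as many queries as the data set has, with replacement.
        sample = rng.choices(queries, k=len(queries))
        for t in techniques:
            samples[t].append(sum(query_ranks[q][t] for q in sample) / len(sample))
    return samples


# Hypothetical per-query average precision for two techniques.
per_query_metrics = {
    "R1": {"VSM": 0.8, "LSI": 0.6},
    "R2": {"VSM": 0.7, "LSI": 0.5},
    "R3": {"VSM": 0.9, "LSI": 0.9},
}
mean_ranks = bootstrap_mean_ranks(per_query_metrics, repetitions=1000)
```

Repeating the call for each data set and concatenating the resulting lists gives the pooled bootstrap distribution described above, which can then be compared across techniques.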
Selection of Evaluation Methods
The combination of the aforementioned metrics, result summarization methods, and classification threshold types results in a large set of possible evaluation methods. To guide the selection among them, we suggest six desirable properties of evaluation methods. This work is ongoing research, and more details can be found in the technical report by Shin, Huffman Hayes, and Cleland-Huang (2012).
Goal satisfiability. An evaluation metric should be able to measure how well a traceability technique achieves the goal of tracing. Goal satisfiability is the ability of a metric to measure the extent to which a technique achieves the given goal of the benchmarking task.
Generalizability. An evaluation method that suits a single study within a certain context of techniques and data sets may not be general enough for benchmarking, which uses more diverse sets of techniques and data sets. Generalizability is the ability of an evaluation method to cover diverse data sets and traceability techniques in order to support benchmarking.
Discriminability. A metric that returns the same value for all compared techniques may not be a good metric, unless the techniques actually do achieve the same degree of accuracy. It is therefore important to determine whether evaluation methods are sensitive enough to discriminate between high and low accuracy techniques. Discriminability is the ability to summarize the performance of a technique in a way that distinguishes high accuracy techniques from low accuracy techniques.
Orthogonality. Orthogonality is the ability of a set of evaluation methods to evaluate different aspects of a traceability technique. Simply applying a large number of overlapping evaluation metrics makes the results harder to interpret and undermines the goal of using benchmark results to help practitioners make informed decisions.
Objectivity. Objectivity is the ability to evaluate a technique without being affected by subjective criteria. For example, it is hard to select and justify an objective and agreeable classification threshold for benchmarking, although it may be possible for a specific use of a technique in a specific organization.
Robustness. Robustness is the ability to measure the essential accuracy of a technique without being affected by random chance due to peculiarities in a data set, such as outliers.
Y. Shin, J. Huffman Hayes, and J. Cleland-Huang. A framework for evaluating traceability benchmark metrics. Technical Report TR:12–001, School of Computing, DePaul University, 2012.