Metric for the Task 3 / Lesion Diagnosis

noelcodella · May 22, 2018, 3:28pm

You should be selecting the highest confidence as the multi-class label (without thresholding). The threshold is only being specified so that additional binary classification analyses can be performed, such as disease-specific sensitivity and specificity. These additional binary analyses will not affect ranking, which is based on balanced multi-class accuracy alone.

If you have a tie, I would recommend you implement your own tie-breaker. It’s possible that we will design the evaluation system to consider ties as simply “wrong” – because in a clinical system, this would be equivalent to a system saying “I don’t know.” We will decide on this and put up the answer on the challenge page.
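To make the label selection and the tie case concrete, here is a minimal sketch in Python with NumPy. It assumes the confidences are held in an array with one row per image and one column per class; the function name `pick_labels` and the sample scores are made up for illustration and are not part of the challenge tooling. It only flags ties, since how they are scored has not been decided yet.

```python
import numpy as np

def pick_labels(scores):
    """Return (labels, tied) for an (n_images, n_classes) confidence array.

    labels: index of the highest confidence per image (no thresholding).
    tied:   True where the highest confidence is shared by several classes.
    """
    scores = np.asarray(scores, dtype=float)
    # argmax returns the first maximal index, so ties silently go to the
    # lowest class index unless you handle them yourself.
    labels = scores.argmax(axis=1)
    top = scores.max(axis=1, keepdims=True)
    tied = (scores == top).sum(axis=1) > 1
    return labels, tied

# Hypothetical confidences for three images and three classes.
scores = [[0.1, 0.7, 0.2],
          [0.4, 0.4, 0.2],   # two classes share the top confidence
          [0.2, 0.3, 0.5]]
labels, tied = pick_labels(scores)
print(labels, tied)
```

Rows where `tied` is True are the ones that need your own tie-breaker, for example adding a small class-specific offset before taking the argmax.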

Balanced multi-class accuracy is essentially a multi-class accuracy metric computed as if the test set had an equal distribution of classes. The diagonal cells of your confusion matrix are the TPs, the off-diagonal cells in each row are the FPs, and the off-diagonal cells in each column are the FNs (assuming predictions across rows and ground truth along columns). The balanced multi-class accuracy is essentially the diagonal, with each TP cell scaled by the total number of elements in its category, i.e. its column total, so each category's score is lowered by the off-diagonal values in the same column. We will go with this metric for this year, but feedback is welcome if the community thinks a different metric would be more informative for future years (such as average AUC or something similar).
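As a rough illustration of the description above, the sketch below builds the confusion matrix with predictions across rows and ground truth along columns, divides each diagonal cell by its column total, and averages the results. Averaging the per-category values with equal weight is my reading of "as if the test set had equal distribution of classes"; the official scoring code may differ in details. The function name `balanced_accuracy`, the choice to skip categories with no ground-truth samples, and the toy labels are all assumptions made for this example.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred, n_classes):
    """Mean over categories of TP / (number of ground-truth samples)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[p, t] += 1                  # rows = predictions, columns = ground truth
    tp = np.diag(cm)                   # correctly classified samples per category
    totals = cm.sum(axis=0)            # column totals = TP + FN per category
    present = totals > 0               # assumption: skip categories with no samples
    per_class = tp[present] / totals[present]
    return per_class.mean(), cm

# Toy, imbalanced example: 4 samples of class 0, 2 of class 1, 1 of class 2.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2]
score, cm = balanced_accuracy(y_true, y_pred, n_classes=3)
plain = float(np.mean(np.array(y_true) == np.array(y_pred)))
print(cm)
print(score, plain)
```

In this toy case the per-category values are 3/4, 1/2 and 1/1, so the balanced score is 0.75, while plain accuracy is 5/7 because the largest class dominates it.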

Comments are welcome.