Metric for the Task 3 / Lesion Diagnosis

The goal metric for Task 3 / Lesion Diagnosis is described simply as “multi-class accuracy metric”. Is this the simple/raw accuracy or the balanced accuracy? (Considering the contingency table of absolute numbers, with the rows being the actual classes and the columns being the predictions, the simple accuracy would be the sum of the diagonal divided by the sum of the table, and the balanced accuracy would be the average of the diagonal after dividing each row by the sum of the row.)
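To make the two candidate definitions concrete, here is a minimal sketch in Python. The 3-class table of counts is invented for illustration (the task has 7 classes) and the function names are hypothetical. It also assumes every class has at least one actual sample, so no row sum is zero.

```python
# Hypothetical 3-class contingency table of counts (invented numbers):
# rows = actual class, columns = predicted class.
table = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 0, 80],
]


def simple_accuracy(table):
    """Sum of the diagonal divided by the sum of the whole table."""
    correct = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    return correct / total


def balanced_accuracy(table):
    """Average of the diagonal after dividing each row by its row sum.

    Assumes every row sum is non-zero.
    """
    per_class = [table[i][i] / sum(table[i]) for i in range(len(table))]
    return sum(per_class) / len(per_class)


print(simple_accuracy(table))    # 94 correct out of 100
print(balanced_accuracy(table))  # mean of 0.8, 0.6 and 1.0
```

On this deliberately imbalanced table the simple accuracy is 0.94 while the balanced accuracy is roughly 0.8, which is why the distinction matters.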

Hi @dr.eduardo.valle,

Great question! We’ll follow up soon with a clarification, here and on the Task 3 description page.

Hi @dr.eduardo.valle,

The metric will indeed be the normalized (balanced) multi-class accuracy. Thanks for pointing this out. We will update the task description page.

The description page states “The response data are sets of binary classifications for each of the 7 disease states, indicating the diagnosis of each input lesion image.” If they are binary classifications (i.e. the probabilities for all classes do not sum up to one, similar to a multi-label problem) how is the confusion matrix going to be created? If the predicted class is taken as the output with the highest probability, how are clashes resolved when two or more classes are predicted to be equally likely and also have the highest confidence measures among the 7? Can you please clarify?

Hi @joshuapeterebenezer,

Thanks for this question. We’ll get a clarification for you soon.

Hi @joshuapeterebenezer,

Sorry for the confusion. You should be submitting real-valued numbers between 0 and 1. Further down, the page reads “Diagnosis confidences are expressed as floating-point values in the closed interval [0.0, 1.0], where 0.5 is used as the binary classification threshold.”

Thanks for the feedback – We will clarify the text to avoid further confusion.

I’m sorry, but I did understand that we should be submitting real valued numbers between 0 and 1. My question was that if these probabilities/confidence-values are independent and do not sum up to one, and if you are using binary classification thresholds for each confidence-value, how is the confusion matrix going to be created (for the calculation of balanced multi-class accuracy). For example, if ‘mel’ class has a confidence value of 0.7 and ‘nv’ also has a confidence value of 0.7 (which is possible if you use binary classification confidence values), what will the predicted class be taken as? If confidence-values are to be dependent, and must sum up to one, please clarify that as well (because then 0.5 cannot be used as a threshold).

I have to agree with Joshua Ebenezer, we cannot create a confusion matrix with the current binary classification threshold as it might lead to ambiguous predictions. However, the balanced multi-class accuracy metric requires a clearly defined confusion matrix. It also does not seem very intuitive for me to take this approach as the dataset is perfectly suited for mutually exclusive classification with softmax predictions (instead of sigmoid, as suggested right now).

Predicted diagnosis confidence values may vary independently, though exactly one disease state is actually present in each input lesion image.

Also, regarding the balanced accuracy metric I am still a bit confused. The way Eduardo Valle described the metric, it sounds like a multi-class sensitivity for me. Diagonal elements are the true positives (TP), the rows, except for diagonal elements, are the false negatives (FN). And typically, sensitivity is defined as TP/(TP+FN). Did I understand that correctly? Is it reasonable to only focus on that metric?

Hi @joshuapeterebenezer, @gessert,

You should be selecting the highest confidence as the multi-class label (without thresholding). The threshold is only being specified so that additional binary classification analyses can be performed, such as disease-specific sensitivity and specificity. These additional binary analyses will not affect ranking, which is based on balanced multi-class accuracy alone.

If you have a tie, I would recommend you implement your own tie-breaker. It’s possible that we will design the evaluation system to consider ties as simply “wrong” – because in a clinical system, this would be equivalent to a system saying “I don’t know.” We will decide on this and put up the answer on the challenge page.
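As a rough sketch of what detecting a tie and applying your own tie-breaker might look like (this is not the evaluation code, and the function names and the `tie_breaker` parameter are hypothetical), in Python:

```python
def top_classes(confidences):
    """Return the indices of all classes that share the highest confidence."""
    top = max(confidences)
    return [i for i, c in enumerate(confidences) if c == top]


def predict_label(confidences, tie_breaker=None):
    """Return one class index, or None when a tie is left unresolved.

    tie_breaker is a hypothetical hook: any function that takes the list
    of tied class indices and returns one of them.
    """
    winners = top_classes(confidences)
    if len(winners) == 1:
        return winners[0]
    if tie_breaker is None:
        return None  # unresolved tie, i.e. "I don't know"
    return tie_breaker(winners)


tied = [0.7, 0.7, 0.1, 0.0, 0.0, 0.0, 0.0]
print(predict_label(tied))                   # None: tie left unresolved
print(predict_label(tied, tie_breaker=min))  # 0: lowest tied index wins
```

Picking the lowest tied index matches what `np.argmax` does by default, since it returns the first occurrence of the maximum. Ties here are detected by exact floating-point equality.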

Balanced multi-class accuracy is essentially a multi-class accuracy metric computed as if the test set had an equal distribution of classes. The diagonal entries of your confusion matrix are the TP, the rows (except for the diagonal) are FP, and the columns (except for the diagonal) are FN (assuming predictions across rows, and ground truth along columns). The balanced multi-class accuracy is essentially the diagonal, scaled by the total number of elements for each category (the TP cell will be affected by the other off-diagonal values in the same column). We will go with this metric for this year, but feedback is welcome if the community thinks a different metric would be more informative for future years (such as average AUC or something similar).
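With the layout described here (predictions across rows, ground truth along columns), the scaling runs over column sums, so each class contributes TP / (TP + FN), i.e. its sensitivity. A minimal NumPy sketch, using an invented 3-class matrix and assuming every class has at least one ground-truth sample:

```python
import numpy as np

# Hypothetical counts (invented): rows = predicted class, columns = ground truth.
confusion = np.array([
    [8, 2, 0],
    [1, 6, 0],
    [1, 2, 80],
])

true_positives = np.diag(confusion)
# Column sum = TP + FN = number of ground-truth samples in that class.
samples_per_class = confusion.sum(axis=0)
per_class_sensitivity = true_positives / samples_per_class

# Averaging with equal weight per class is what makes the metric "balanced".
balanced_accuracy = per_class_sensitivity.mean()
print(per_class_sensitivity, balanced_accuracy)
```

Here the per-class sensitivities are 0.8, 0.6 and 1.0, so the balanced accuracy is roughly 0.8 even though 94 of the 100 samples are classified correctly.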

Comments are welcome.



This became very confusing. I thought we were to submit only predictions in the csv and no label. Could you specify how we should select/submit this label?

My approach was to simply ignore the “binary” and live with the fact that for some images I have no valid prediction submitted. If I switched to binary, there would be no way to avoid (with a fixed threshold of 0.5) an image being classified into multiple categories (as described by @joshuapeterebenezer).

Hi @humptydumpty
As far as I understand from the discussion, we only need to submit our predictions in the csv format. The label will be taken as the class with the highest prediction. If there is a tie, i.e. two classes share the highest prediction, then we need to come up with a tie-breaker ourselves.
The threshold of 0.5 is for other metrics, which will not appear on the leaderboard but which the organisers encourage us to work on.


I understand the following from @noelcodella . The thresholding is not relevant for the final ranking. For the ranking, a single class with the highest confidence is determined for each test example. Based on that, the balanced multi-class accuracy is calculated. I think you can simply go with softmax predictions and you are safe.

What is still not completely clear for me is whether we are supposed to submit final labels (e.g. a vector [0 1 0 0 0 0 0]) or the output probabilities (e.g. [0.1 0.5 0.2 0.1 0.01 0.01 0.01]). I assume the latter is correct.

In summary, for calculating the key metric, ignore this:

Diagnosis confidences are expressed as floating-point values in the closed interval [0.0, 1.0], where 0.5 is used as the binary classification threshold. Note that arbitrary score ranges and thresholds can be converted to the range of 0.0 to 1.0, with a threshold of 0.5, trivially using the following sigmoid conversion:

1 / (1 + e^(-(a(x - b))))
where x is the original score, b is the binary threshold, and a is a scaling parameter (i.e. the inverse measured standard deviation on a held-out dataset). Predicted responses should set the binary threshold b to a value where the classification system is expected to achieve 89% sensitivity, although this is not required.

… and assume that the organizers perform label_predicted = np.argmax(x) (e.g. x = [0.1 0.5 0.2 0.1 0.01 0.01 0.01]). Then, they use “label” and “label_predicted” to build a standard confusion matrix and calculate the balanced multi-class accuracy, following the description of @dr.eduardo.valle .
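A minimal sketch of that assumed pipeline follows. It is a guess at the procedure, not the organizers' code: the function name and example data are hypothetical, and skipping classes with no ground-truth samples is my own assumption to avoid dividing by zero.

```python
import numpy as np


def balanced_accuracy_from_confidences(confidences, labels):
    """confidences: (n_images, n_classes) array; labels: true class indices."""
    confidences = np.asarray(confidences, dtype=float)
    labels = np.asarray(labels)
    n_classes = confidences.shape[1]

    # One predicted label per image: the column with the highest confidence.
    # np.argmax returns the first index when several columns tie.
    label_predicted = np.argmax(confidences, axis=1)

    # Confusion matrix of counts: rows = actual class, columns = predicted.
    confusion = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(confusion, (labels, label_predicted), 1)

    # Diagonal divided by row sums, averaged over the classes that occur
    # in the ground truth (assumption: absent classes are skipped).
    row_sums = confusion.sum(axis=1)
    present = row_sums > 0
    per_class = np.diag(confusion)[present] / row_sums[present]
    return confusion, per_class.mean()


# Invented example: 4 images, 7 classes, only classes 0 and 1 present.
confidences = [
    [0.1, 0.5, 0.2, 0.1, 0.01, 0.01, 0.01],
    [0.7, 0.1, 0.1, 0.05, 0.02, 0.02, 0.01],
    [0.2, 0.6, 0.1, 0.05, 0.02, 0.02, 0.01],
    [0.3, 0.4, 0.1, 0.1, 0.05, 0.03, 0.02],
]
labels = [1, 0, 1, 0]
confusion, score = balanced_accuracy_from_confidences(confidences, labels)
print(score)  # class 0 recall 0.5, class 1 recall 1.0, mean 0.75
```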

For your paper submission, you might consider the thresholding for additional evaluation.
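For that additional thresholded evaluation, the sigmoid conversion quoted above can be sketched as follows. The function name is hypothetical, and the split into two branches is only a numerical-stability choice of mine to avoid overflow in `math.exp`; it computes the same value as the quoted formula.

```python
import math


def to_unit_confidence(x, b, a=1.0):
    """Map a raw score x into [0, 1] so that the threshold b lands on 0.5.

    x: original score, b: binary threshold, a: scaling parameter,
    as in the quoted formula 1 / (1 + e^(-(a(x - b)))).
    """
    z = a * (x - b)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Algebraically identical form that avoids a huge exponent for very negative z.
    ez = math.exp(z)
    return ez / (1.0 + ez)


print(to_unit_confidence(3.0, b=3.0, a=2.0))  # 0.5: a score at the threshold
print(to_unit_confidence(4.0, b=3.0, a=2.0))  # above 0.5
print(to_unit_confidence(2.0, b=3.0, a=2.0))  # below 0.5
```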

@gessert Indeed, it is the latter: we expect submissions to contain diagnosis probabilities.

From the Task instructions:

File columns must be:

  • … diagnosis confidence

Diagnosis confidences are expressed as floating-point values in the closed interval [0.0, 1.0]

Since the ground truth is, by definition, 100% confident in its diagnosis predictions, our ground truth CSV only contains 0.0 and 1.0 values. We would expect that participant algorithms very rarely predict with this degree of total confidence.


Could a Challenge Organizer please confirm that the maximum prediction will be scored regardless of any threshold?

Hi @simonhschaefer, please have a look at this previous post: