###

Question:

I am sure there are a lot of Software Testing Engineers and Algorithm Validation Engineers on Stack Overflow. Could someone please tell me how one would proceed in the following scenario.

Say we have a mammogram and 5 different algorithms which take this mammogram as input and identify whether the patient has cancer. If 3 out of 5 algorithms say there is cancer and 2 say there is not, which algorithm should I believe? How should I proceed with testing these algorithms? Is there any statistical concept used in such scenarios?

I was asked this question in an interview for an Algorithm Validation Engineer position. I believe they were trying to see how I would think in such a scenario. How should I have answered?

Thanks for your time

-Sashi

###

Solution:1

You can't say anything with only this information. What if some of the algorithms reuse parts of other algorithms among those 5? Then they could be prone to the same defects.

Say A, B, and C actually use the same sub-algorithm for preprocessing data, and that sub-algorithm gives suboptimal results on some specific image, so the preprocessed image causes the later phases to produce wrong results. It won't matter that you have three algorithms saying the same thing.

You need more specific data on how the algorithms correlate and what statistical characteristics are known about their error rates to be able to perform any analysis.

###

Solution:2

This is actually pretty tough to answer. Each algorithm is likely good at picking up different types of input triggers. More than likely, you'll need some statistical analysis to determine what each algorithm will usually detect as cancer. You may even go so far as to build a Bayesian model to describe/determine whether a patient has cancer based on the algorithms' results.

You may find that 3 algorithms consistently miss the specific type of cancer that the other two are moderately good at picking up. You may discover similar relationships: for example, when algorithms 2, 3, and 5 say there is no cancer, algorithm 1 says there is, and algorithm 4 is inconclusive, there are usually typically-benign spots of a certain shape and color intensity that should be analyzed but probably aren't cancer.
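
The kind of pattern analysis described above can be sketched in a few lines. This is a minimal illustration, assuming we have historical records pairing each algorithm's verdict with a confirmed diagnosis; all names and data here are invented:

```python
from collections import defaultdict

def pattern_stats(records):
    """Group historical cases by the 5-tuple of algorithm verdicts and
    report how often each pattern coincided with confirmed cancer."""
    counts = defaultdict(lambda: [0, 0])  # pattern -> [cancer, no_cancer]
    for verdicts, has_cancer in records:
        counts[tuple(verdicts)][0 if has_cancer else 1] += 1
    # Empirical P(cancer | pattern) for each observed verdict pattern
    return {p: pos / (pos + neg) for p, (pos, neg) in counts.items()}

# Toy historical data: (verdicts of algorithms 1..5, confirmed diagnosis)
records = [
    ((1, 0, 0, 1, 0), True),
    ((1, 0, 0, 1, 0), True),
    ((1, 0, 0, 1, 0), False),
    ((0, 0, 0, 0, 0), False),
]
stats = pattern_stats(records)
print(stats[(1, 0, 0, 1, 0)])  # 2 of 3 such cases were cancer -> ~0.667
```

With enough tagged data, this is exactly how you would discover that some verdict patterns are much more (or less) predictive than a naive 3-out-of-5 vote suggests.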

###

Solution:3

Choosing the best classifier for a job, or combining various classifiers is a whole separate field. This general article about classification is as good a start as any to learn about choosing the best classifier for a job. And this article about classifier ensembles is a good place to start learning about classifier combining.

To give the basis for an answer to your (quite broad) question: The best classifier for a job depends on several factors:

- The required quality of classification (in your case this would be high)
- The allowed complexity of the classification (i.e. can you compute for days to reach an answer, or do you have only several milliseconds?) (time is probably not a limitation in your case, I'd guess)
- The cost associated with misclassification. This is a very important factor in your case. If you tell people they have cancer when they do not, you cause great stress, but (one hopes) further testing (which costs money) will eventually find out they are healthy. On the other hand, if you miss cancer in a patient, she could die. This means that the "best" classifier (the one that makes the fewest errors) may not be the best for your problem.

On that last point: say 1 in 1000 women has cancer, and I have some classifiers:

- One misses 20% of cancer cases, and says a healthy woman has cancer in 2% of cases. This classifier would make about 200 errors in a population of 10,000 people.
- One just says "This person does not have cancer" in all cases. Only 10 errors in 10,000 cases!
- One just says "This person has cancer". It will make 9,990 errors in 10,000 cases.

The second classifier makes the lowest number of errors, but after a few months of using it, people who could have been saved start dying. The third classifier sends everybody on to the next test (which would have the same problem as this one), or maybe it inflicts a useless, life-changing operation on 9,990 healthy people. The first classifier makes a trade-off: two people might become very sick or even die, and about 198 people experience painful and stressful procedures and operations for naught. Obviously, in your case all classifiers were like classifier 1, with slight variations in the percentages. In those cases you have to make a trade-off between missing cancer cases and inflicting the rest of the procedure (including its cost!) on healthy people. The starting point for research about this trade-off is the receiver operating characteristic (ROC) curve.
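
The arithmetic above can be checked, and the cost trade-off made explicit, with a short sketch. The cost weights here are made-up illustrations, not medical data; the point is only that weighting false negatives more heavily reorders the classifiers:

```python
def errors_and_cost(pop, prevalence, miss_rate, false_pos_rate,
                    cost_fn=50.0, cost_fp=1.0):
    """Count errors for a classifier on a population, then weight
    false negatives (missed cancers) more heavily than false positives."""
    sick = pop * prevalence
    healthy = pop - sick
    fn = sick * miss_rate            # cancers the classifier misses
    fp = healthy * false_pos_rate    # healthy people flagged as sick
    return fn + fp, fn * cost_fn + fp * cost_fp

pop, prev = 10_000, 1 / 1000
print(errors_and_cost(pop, prev, 0.20, 0.02))  # classifier 1: ~202 errors
print(errors_and_cost(pop, prev, 1.00, 0.00))  # "never cancer": 10 errors
print(errors_and_cost(pop, prev, 0.00, 1.00))  # "always cancer": 9990 errors
```

Under the raw error count, "never cancer" wins; under the weighted cost, classifier 1 wins, which is the whole point of the trade-off.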

###

Solution:4

Put your interviewee hat on: it's a psychological assessment. Questions such as this algorithm assessment have more than one correct answer. I learned about these questions from my wife, who worked as a recruiter for 5+ years. The interviewer wants to see how you react. It's best to just make assumptions and drive to a logical conclusion. Do not say "I don't know", become argumentative, or ask a ton of questions; you will appear difficult and argumentative (as many programmers are).

Now that you know this is not a programming question, consider asking about it on careeroverflow.com. I like these questions because it shows an ability to adapt and become nonrigid.

Why is a manhole cover round? <-- Microsoft's version

###

Solution:5

Well, clearly false negatives are far more serious here than false positives, so all things being equal, we may want to show a preference for algorithms that find more cancer.

If we feed lots more mammograms to the software and we find a collection of the algorithms seem to agree on a large sample of the mammograms, then we may want to prefer those algorithms, since their results are supported by more algorithms.

Something like that.

###

Solution:6

Everything else being equal, you could say that the patient has a 60% chance of cancer. To give a better answer, you need to know more about how the algorithms work. Some points to consider:

- Perhaps some algorithms are newer than others, or have been proven to be less reliable. It would be good to know the accuracy of each algorithm, using historical mammogram data tagged as "Cancerous" and "Non-Cancerous".
- Each person's cancer is slightly different - maybe there are characteristics that a certain algorithm is better at identifying? Is a domain expert required in order to determine which diagnosis is correct, based on the algorithm conclusions and the mammogram (image?) data?
- As sharptooth mentioned, perhaps certain algorithms use the same techniques as other algorithms, so both are likely to have the same bias.
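
As a concrete starting point for the first bullet, per-algorithm accuracy on tagged historical data could be estimated with something like the following sketch; the data and algorithm names are invented for illustration:

```python
def accuracy(predictions, labels):
    """Fraction of historical cases where an algorithm's verdict
    matched the confirmed 'Cancerous'/'Non-Cancerous' tag."""
    hits = sum(p == l for p, l in zip(predictions, labels))
    return hits / len(labels)

# Confirmed diagnoses for 6 historical mammograms (1 = cancerous)
labels = [1, 0, 0, 1, 0, 1]
# Hypothetical verdicts of two of the five algorithms on the same mammograms
alg_a = [1, 0, 0, 0, 0, 1]   # misses one cancer
alg_b = [1, 1, 0, 1, 1, 1]   # two false positives
print(accuracy(alg_a, labels), accuracy(alg_b, labels))
```

In practice you would also separate the miss rate from the false-positive rate, since (as other answers note) the two kinds of error carry very different costs here.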

###

Solution:7

It is *not* a trivial problem, and highly dependent on what risks you are willing to take.

Formalisms like decision theory and Bayesian inference are really worth considering here. They allow you to take into account the various probabilities of false positives/negatives, and to weigh them differently if you choose.

###

Solution:8

I don't think you *should* have answered in any specific way. The interviewer probably wanted to analyze *how* you would evaluate this problem, not your final answer. In other words, they were probably interested in your own algorithm for making a decision.

In a real-life environment, I can't think of any serious choice between 5 algorithms to find cancer, especially when they give such different results.

###

Solution:9

This is a good opportunity for implementing what is sometimes called an "expert system". You take a large sample of your data (in your case, mammogram images and the output of the various algorithms) and run it past a series of real-life, flesh and blood experts in the field (here, oncologists or lab techs). Record the responses for each image along with the outputs of the algorithms. At the end, you should have enough data to map algorithm output to expert output. To verify that your mapping works, run a bunch of test images through your system (samples that were *not* part of the original data set) and ask your panel of experts to double-check the results. Ideally, the experts should agree with your system's output a very high percentage of the time.

Without knowing anything about the algorithms themselves, it's difficult to make a decision based on 3 "yes" and 2 "no" results (especially for something as important as a cancer screening). Getting as close as possible to the same results as a trained expert is your goal (at least at first), and systems like this can sometimes be made more accurate by basing decisions on the knowledge and experience of experts in the field rather than on mathematical algorithms alone.
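
One simple way to realize the mapping described above is a lookup table from the 5-tuple of algorithm outputs to the label the expert panel gave most often, falling back to a human reviewer for patterns never seen in training. This is a hedged sketch; the labels, patterns, and fallback string are all hypothetical:

```python
from collections import Counter, defaultdict

def build_mapping(training):
    """Map each observed pattern of algorithm outputs to the label
    the expert panel gave most often for that pattern."""
    votes = defaultdict(Counter)
    for pattern, expert_label in training:
        votes[tuple(pattern)][expert_label] += 1
    return {p: c.most_common(1)[0][0] for p, c in votes.items()}

def classify(mapping, pattern):
    # Unseen patterns go to a human reviewer instead of a guess.
    return mapping.get(tuple(pattern), "refer-to-expert")

# (algorithm outputs 1..5, expert panel's label) -- invented examples
training = [
    ((1, 1, 1, 0, 0), "cancer"),
    ((1, 1, 1, 0, 0), "cancer"),
    ((1, 1, 1, 0, 0), "no-cancer"),
    ((0, 0, 0, 0, 1), "no-cancer"),
]
m = build_mapping(training)
print(classify(m, (1, 1, 1, 0, 0)))  # "cancer" (2 of 3 experts said so)
print(classify(m, (0, 1, 0, 1, 0)))  # "refer-to-expert" (never seen)
```

The verification step described above would then run held-out images through `classify` and measure agreement with the expert panel.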

###

Solution:10

I would have asked if using a computer to determine whether someone has cancer is the correct course of action, given that the use of algorithms is error prone.

But if for some reason a set of algorithms must be used, then have a human operator (i.e., a doctor) examine the mammogram personally in the case where there is some uncertainty. The doctor can then decide if further tests are warranted, based on the disagreement of the algorithms used.

The one thing we overlook as programmers is that humans can solve some problems we cannot predict: what if the doctor notices something in the mammogram that the algorithms were *not* designed to detect?

###

Solution:11

I think that if you had some statistical information about each algorithm's previous performance (how many times it was right or wrong over a number of experiments), then you could calculate the probability of each algorithm being right. Then you could somehow combine those probabilities to get the odds of that person having cancer. Just speculation...

###

Solution:12

To accomplish much in such a situation, you generally want to have a "gold" standard -- e.g., an opinion from a physician about whether a set of mammograms shows cancer, or use historical information where you know that one set of mammograms show cancer and another set does not. Along with that, if possible, you'd like information about what indicators each algorithm is using in a particular case.

With a standard, you can start to get an estimate of which algorithm(s) is/are more "accurate" (i.e., agree with the expert opinion the most often). Information about indicators allows you to generate more detail about the times and circumstances under which each appears to be more or less accurate, so you can start to form a judgement about times/circumstances under which to trust one over another. With this, you can (at least hope to) combine the results from your five existing algorithms into a single overall result that's (with care and perhaps a bit of luck) more accurate than any of them individually.

###

Solution:13

Basically, if you know that the results of the algorithms are conditionally independent (i.e. independent given the true but unknown class label) then using Naive Bayes is an optimal meta-classifier.

Otherwise, this question is unanswerable without knowing the conditional dependency structure among the classifiers. For example, if classifier A, B, C and D are weak, identical classifiers (i.e. they always give the same results) and have accuracies of 0.51, while classifier E is conditionally independent of classifiers A, B, C and D and has an accuracy of 0.99 then I think it's pretty obvious that voting is a bad idea.
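
The point about voting can be made concrete with a small simulation, sketched here under the stated assumptions: A, B, C, and D always give the identical verdict with accuracy 0.51, while E is independent with accuracy 0.99. Majority voting is then dominated by the four correlated classifiers and collapses to their accuracy:

```python
import random

random.seed(0)

def simulate(n_trials=100_000):
    vote_correct = e_correct = 0
    for _ in range(n_trials):
        truth = random.random() < 0.5                           # true label
        abcd = truth if random.random() < 0.51 else not truth   # A = B = C = D
        e = truth if random.random() < 0.99 else not truth      # independent E
        voted = sum([abcd] * 4 + [e]) >= 3                      # majority of 5
        vote_correct += voted == truth
        e_correct += e == truth
    return vote_correct / n_trials, e_correct / n_trials

vote_acc, e_acc = simulate()
print(vote_acc, e_acc)  # voting lands near 0.51; E alone is near 0.99
```

Since A through D contribute four of the five votes, the majority always equals their shared verdict, so voting throws away the one classifier worth listening to, which is exactly why the conditional dependency structure matters.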

###

Solution:14

Since the algorithm produces a "yes" or "no" answer, this is pretty easy. You need actual test data to compare your algorithms to. You should probably gather long-term data on the success rates of the various heuristics, and do some statistical analysis of which ones are more likely to be right.

Validating things like Google's search algorithm - which has no "right" answer - would be harder.

###

Solution:15

Based on the information given, you would not be able to answer. You would have to take all 5 algorithms and test them on patients diagnosed with cancer, as well as on ones known to be cancer-free. This would allow you to determine which algorithm is the most accurate.

You could also make a sixth algorithm from the 5 (presuming they are all good and valid algorithms) by taking the side of whichever answer has more votes. This may or may not be a valid sixth algorithm, depending on how good the first 5 are.
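
The majority-vote "sixth algorithm" is nearly a one-liner; sketched here with hypothetical verdicts (1 = cancer):

```python
def majority_vote(verdicts):
    """Return 1 (cancer) if more than half of the algorithms say cancer."""
    return 1 if sum(verdicts) > len(verdicts) / 2 else 0

print(majority_vote([1, 1, 1, 0, 0]))  # the interview scenario -> 1
print(majority_vote([0, 1, 0, 0, 0]))  # -> 0
```

As other answers point out, whether this combined vote is trustworthy depends entirely on how accurate and how correlated the underlying five algorithms are.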
