The Use of a Confusion Matrix in Data Science

Jmstipanowich
5 min read · Feb 28, 2021


Suppose a doctor’s office wants to know how effective its practice is at detecting whether or not its patients have heart disease. Every day, some patients who actually have heart disease are correctly diagnosed with it, and other patients who do not have heart disease are correctly diagnosed as healthy. Occasionally, however, patients are misdiagnosed as having heart disease when they do not have it, or are told they are healthy only to return to the doctor’s office later and learn the disease was initially missed. What is a visual and informative way, using data science algorithms, to organize and evaluate these heart disease cases so the office can improve its decision-making? The answer is the creation and implementation of a data science confusion matrix.

The Data Science Confusion Matrix

A data science confusion matrix is a table that summarizes binary classification results, where every item in the dataset has a ground-truth label of 0 or 1 and a predicted label of 0 or 1. It is a 2x2 table whose rows correspond to the actual labels and whose columns correspond to the predicted labels, giving four cells: TN, FP, FN, and TP.

“TN” stands for True Negative. That means a person who came into the doctor’s office and was predicted as not having heart disease (0) truly did not have heart disease (0).

“TP” stands for True Positive. That means a person who came into the doctor’s office and was predicted as having heart disease (1) truly had heart disease (1).

“FP” and “FN” are a bit more complicated because they represent errors in evaluation.

“FP” stands for False Positive, also known as a Type 1 error. A false positive result means that a person who came into the doctor’s office was predicted as having heart disease (1) but did not actually have heart disease (0). Hence why it is an error.

“FN” stands for False Negative, also known as a Type 2 error. A false negative result means that a person who came into the doctor’s office was predicted as not having heart disease (0) but actually did have heart disease (1). Hence why it is also an error.

The confusion matrix provides the “TN”, “TP”, “FP”, and “FN” counts, so a doctor’s office has an idea of how its cases are distributed and what might be done to improve further and avoid future errors.
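As a minimal sketch of these four counts in code, scikit-learn’s `confusion_matrix` can tally them from a list of actual and predicted labels. The labels below are made up for illustration (1 = heart disease, 0 = healthy):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]  # what the patients actually had
y_pred = [0, 0, 1, 1, 1, 0, 1, 0, 1, 0]  # what the model predicted

# scikit-learn lays out the 2x2 matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # → TN=4 FP=1 FN=1 TP=4
```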

Coding an Example Data Science Confusion Matrix

Here is an example of a data science confusion matrix for heart disease. The matrix was coded after instantiating a logistic regression model on a heart disease dataset, and it procures the following information:
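The original code was shown as screenshots; here is a hedged reconstruction of the same steps, using a synthetic dataset from `make_classification` as a stand-in for the real heart disease data (the dataset, feature count, and random seeds are assumptions, not from the article):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the heart disease data: 1 = disease, 0 = healthy
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Instantiate and fit a logistic regression model, then predict on the test set
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)             # rows: actual labels, columns: predicted labels
print(cm / cm.sum())  # same matrix as proportions of the test set
```

In recent scikit-learn versions, `ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)` draws this matrix as a plot like the one described below.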

This plot shows information from the heart disease test set about true positives, true negatives, false positives, and false negatives for heart disease diagnoses.

Of the 76 patients in the test set:

- There were 24 cases (31.6 percent of the data) where a patient was diagnosed as healthy and was actually healthy.
- There were 38 cases (50 percent of the data) where a patient was diagnosed with heart disease and actually had heart disease.
- There were 9 cases (11.8 percent of the data) where a patient received a false positive score: the patient was diagnosed with heart disease but did not actually have it.
- There were 5 cases (6.6 percent of the data) where a patient received a false negative score: the patient was not diagnosed with heart disease but actually had it.

Most cases were correctly predicted, but there is room for improvement: some cases still fall into the Type 1 and Type 2 error categories.
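The percentages above can be reproduced directly from the four raw counts:

```python
# Raw counts from the heart disease test set described above
tn, tp, fp, fn = 24, 38, 9, 5
total = tn + tp + fp + fn  # 76 test-set patients

for name, count in [("TN", tn), ("TP", tp), ("FP", fp), ("FN", fn)]:
    print(f"{name}: {count} cases, {100 * count / total:.1f}% of the data")

# Share of cases that were correctly predicted (TN + TP)
accuracy = (tn + tp) / total
print(f"Correctly predicted: {100 * accuracy:.1f}%")
```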

What To Do With The Matrix Information

Where should the doctor’s office focus, given the information the confusion matrix provides? Both the correct counts and the error counts matter when drawing conclusions and planning solutions, and which error matters more depends on the situation.

In the confusion matrix above, there were more false positives than false negatives: more patients were predicted as having heart disease who did not actually have it than patients predicted as healthy who actually had heart disease. Is this helpful information?

This is most likely a good outcome for a doctor’s office. Its focus is better spent minimizing false negatives than minimizing false positives: a situation where a patient is diagnosed with heart disease and later finds out they do not have the disease is better than a situation where a patient is told they do not have heart disease and later finds out that they do.
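One concrete way to act on this preference, sketched below as an assumption rather than anything from the article, is to lower the logistic regression decision threshold: classifying a patient as positive at a lower predicted probability converts some false negatives into true positives, at the cost of more false positives.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Synthetic stand-in data again: 1 = heart disease, 0 = healthy
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]  # P(heart disease) per patient

results = {}
for threshold in (0.5, 0.2):
    # Flag anyone whose predicted probability clears the threshold
    y_pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    results[threshold] = {"FP": int(fp), "FN": int(fn)}
    print(f"threshold={threshold}: FP={fp} FN={fn}")
```

Lowering the threshold can only turn 0-predictions into 1-predictions, so the false negative count never increases while the false positive count never decreases.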

The opposite preference, where a false negative is better than a false positive, might suit a pregnancy test. A couple hoping to have a child would likely rather discover they were actually having a baby when they did not initially know they were (a false negative) than believe they were having a child and find out they were not (a false positive).

The confusion matrix is helpful for identifying the variety of patient diagnoses. Because most of the diagnoses fall into the true positive and true negative categories, or the less harmful false positive category, the doctor’s office is largely on track with what it would want for the majority of its diagnoses. The confusion matrix shows that, for the most part, the doctors truly are doing their job of diagnosing disease, and that is valuable information to know.
