DIAGNOSIS




Are the results of this DIAGNOSIS study valid?

1. Was there an independent, blind comparison with a reference ("gold") standard?

Patients in the study should have undergone both the diagnostic test in question and the reference or "gold" standard. 

The "gold" standard refers to the commonly accepted "proof" that they do or do not have the target disorder; the "gold" standard might be an autopsy or biopsy. The "gold" standard provides objective criteria (e.g., laboratory test not requiring interpretation) OR a current clinical standard (e.g., a venogram for deep venous thrombosis) for diagnosis.

Sometimes there may not be a widely accepted "gold" standard; the authors will then need to clearly justify their selection of the reference test. The results of one test should not be known to those who are conducting or evaluating the other test.

2. Did the patient sample include an appropriate spectrum of patients to whom the diagnostic test will be applied in clinical practice?

For the information to be truly useful, the test should be applied to a broad spectrum of patients: those with mild and severe cases, as well as early and late cases, and with patients both treated and untreated for the target disease. The test should also be applied to patients with disorders that are commonly confused with the target disease. 

3. Did the results of the test being evaluated influence the decision to perform the reference standard?

Researchers should conduct both tests regardless of the results of the test in question. Researchers should not be tempted to forgo the "gold" standard test if the outcome of the test in question is negative. (There are additional issues to consider related to the invasive or risky nature of some "gold" standard tests. These issues should be addressed in the design of the study.) 

4. Were the methods for performing the test described in sufficient detail to permit replication?

The methodology for conducting the test should be presented in enough detail so that it can be conducted again within the appropriate setting. This may include dosage levels, patient preparations, timing, etc.


Key issues for Diagnostic Studies

  • blinding (personnel who administer and evaluate or read the tests should be blinded to the results of the other test) 
  • patient group (include spectrum of disease, from no disease to severe disease) 
  • diagnostic or gold standard test (already exists) 
  • patients get all tests (everyone in the study gets all the tests) 
  • sensitivity and specificity



What are the results?

1. Sensitivity and Specificity

Sensitivity and specificity are the statistics most widely used to describe a diagnostic test. 
Unfortunately, they are not very helpful to clinicians trying to revise the probability of disease. 
Reviewing the definitions of prevalence, sensitivity, and specificity will help us understand why:

  • prevalence = probability of disease in the entire population at any point in time (e.g., 2% of the population has diabetes mellitus)

  • incidence = probability that a patient without disease develops the disease during an interval (e.g., the incidence of diabetes mellitus is 0.2% per year)

  • sensitivity = probability of a positive test among patients with disease

  • specificity = probability of a negative test among patients without disease

As clinicians, though, we don’t generally know whether or not the patient has disease; that’s why we’re ordering the test in the first place! Thus, sensitivity and specificity do not give us the information we need to interpret the test results.

What do we want to know? Ideally, we’d like to know what the probability of disease is, given a positive or negative test. 

The Reverend Thomas Bayes first described a way to do this in the 1700s, developing an equation that relates the probability of disease before ordering the test (the pre-test probability) to the probability of disease given a positive or negative test (the post-test probability). Bayes’ equation is shown below:

Probability of disease given a positive test =

    (prevalence x sensitivity) / ((prevalence x sensitivity) + ((1 - prevalence) x (1 - specificity)))

Clearly, this is not an equation you can carry around in your head! It is also not a simple transformation, which explains why it is hard to "guesstimate" the post-test probability from the sensitivity and specificity.
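
Although the equation is awkward to do in your head, it is easy to hand to a computer. The following is a minimal sketch of Bayes' equation as written above; the function name and the example inputs (2% prevalence, 90% sensitivity, 90% specificity) are illustrative assumptions, not values from a study.

```python
def post_test_probability_positive(prevalence, sensitivity, specificity):
    """Bayes' equation: probability of disease given a positive test."""
    # Fraction of the whole population that is diseased AND tests positive.
    true_positive_fraction = prevalence * sensitivity
    # Fraction of the whole population that is healthy AND tests positive.
    false_positive_fraction = (1 - prevalence) * (1 - specificity)
    return true_positive_fraction / (true_positive_fraction + false_positive_fraction)


# Hypothetical inputs chosen only for illustration (not from any study).
example = post_test_probability_positive(0.02, 0.90, 0.90)
print(f"Post-test probability after a positive test: {example:.1%}")
```

With these made-up inputs the result is roughly 16%, which shows how far the post-test probability can sit from the sensitivity and specificity when the disease is uncommon.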

Take a look at the "2 x 2" table below:

 

                 Patients with disease    Patients without disease
Test positive    a                        b
Test negative    c                        d

We’ll refer to similar tables in other discussions, so it’s a good idea to get familiar with how they work. 
Using the above table, the definitions of sensitivity and specificity can also be written as:

sensitivity = a /(a+c)

specificity = d /(b+d)

Sensitivity and specificity by themselves are only useful when either is very high (typically 95% or higher).
A test with very high sensitivity, when negative, rules out disease. For example, consider the complaint of "dyspnea on exertion" in the diagnosis of congestive heart failure (CHF) [1]:

 

                               CHF    no CHF
Dyspnea on exertion present    41     183
Dyspnea on exertion absent     0      35

The sensitivity of dyspnea on exertion for the diagnosis of CHF is 100% (41/(41+0)), and the specificity is 16% (35/(183+35)). If the finding is negative (the patient does not complain of dyspnea on exertion), it is very unlikely that they have CHF (none of the 41 patients with CHF lacked this symptom). 
An easy way to remember this rule of thumb is the acronym "SnNOut", which is taken from the phrase: "Sensitive test when Negative rules Out disease".

Conversely, a very specific test, when positive, rules in disease. 
The acronym for this kind of test is "SpPIn"!  Consider a gallop (S3) murmur in the diagnosis of congestive heart failure, with data taken from the same study:

                      CHF    no CHF
Gallop (S3) murmur    10     3
No gallop murmur      31     215

The sensitivity of gallop for CHF is only 24% (10/41), but the specificity is 99% (215/218).  
Thus, if a patient has a gallop murmur, they probably have CHF (10 out of 13, or 77%).
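
The column-wise definitions sensitivity = a/(a+c) and specificity = d/(b+d) can be checked against both CHF tables with a short sketch. The function and variable names are our own; the counts are the ones given in the tables above.

```python
def sensitivity(a, c):
    """Proportion of patients with disease who test positive: a / (a + c)."""
    return a / (a + c)


def specificity(b, d):
    """Proportion of patients without disease who test negative: d / (b + d)."""
    return d / (b + d)


# Cell counts (a, b, c, d) from the two CHF tables.
findings = {
    "dyspnea on exertion": (41, 183, 0, 35),
    "gallop": (10, 3, 31, 215),
}

for name, (a, b, c, d) in findings.items():
    print(f"{name}: sensitivity {sensitivity(a, c):.0%}, "
          f"specificity {specificity(b, d):.0%}")
```

Dyspnea on exertion behaves as a "SnNOut" finding (very high sensitivity, low specificity), while the gallop behaves as a "SpPIn" finding (low sensitivity, very high specificity).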

[1] Davie AP, Caruana FL, Sutherland GR, et al. Assessing diagnosis in heart failure: which features are any use? Q J Med 1997;90:335-9.

2. Predictive Values

Predictive values help us answer the question:

    "Given a positive (or negative) test result, what is the new probability of disease?"

Let’s fill some numbers in. In his article on the clinical diagnosis of strep, Frank Dobbs took consecutive patients, prospectively did the same history and physical exam manoeuvres on all of them, and then did a throat culture. 

One of the things our nurses ask patients who call with a sore throat is how long they’ve had it. If it’s only been a couple of days, they are more likely to ask patients to try symptomatic remedies. If the duration is longer, they are more likely to ask them to come in. The presence of fever is another important factor in advising the patients, with febrile patients more likely to be asked to come in for evaluation. Is there any evidence to support these strategies?

Essentially we are asking, "If a patient has fever, what is the likelihood of strep pharyngitis?" and "If a patient has symptoms for 3 or more days, what is the likelihood of strep pharyngitis?" 
Let’s start by creating 2 x 2 tables for each question, using data from Dobbs’ article [2]:

 

            Strep    No strep    Total
Fever       58       80          138
No fever    14       54          68
Total       72       134         206

                      Strep    No strep    Total
Duration >= 3 days    16       62          78
Duration < 3 days     56       72          128
Total                 72       134         206

 

Using the equations for sensitivity and specificity, we find that for fever:

sensitivity = 58 / (58 + 14) = 0.81

specificity = 54 / (54 + 80) = 0.40

Note that the sensitivity can be written as "0.81" or "81%".   One is no better than the other - just be consistent!   Similarly, for duration of symptoms >= 3 days,

sensitivity = 16 / (56 + 16) = 0.22

specificity = 72 / (72 + 62) = 0.54

At this point, you should be getting a little uneasy about your triage policy. While most patients with strep had fever (81%), only 22% of patients with strep had symptoms for 3 or more days! We still haven’t answered our question, though. To do that, we have to calculate predictive values. They are defined as:

Positive predictive value = probability of disease among patients with a positive test

Negative predictive value = probability of no disease among patients with a negative test

The probability of disease given a positive test can therefore be called the "post-test probability of disease given a positive test", the "positive predictive value", or the "posterior probability of disease given a positive test". These names are interchangeable. Similarly, the probability of disease given a negative test is called the "post-test probability of disease given a negative test" or the "posterior probability of disease given a negative test"; this is equal to one minus the negative predictive value. Note this last point: the negative predictive value does not equal the post-test probability of disease given a negative test. The two are complements of one another (they add up to 1).

What about our old friend, the 2 x 2 table? Here is the standard 2 x 2 table:

 

                    Patients with disease    Patients without disease
Test is positive    a                        b
Test is negative    c                        d

We can now define positive and negative predictive value as follows:

Positive predictive value = a/(a+b)

Negative predictive value = d/(c+d)

Post-test probability of disease given a positive test = a/(a+b)

Post-test probability of disease given a negative test = c/(c+d)

Notice that we are now working along the rows of the table, rather than down the columns as we did for sensitivity and specificity. 
What about our original question on the diagnosis of strep throat? Recall:

 

            Strep    No strep    Total
Fever       58       80          138
No fever    14       54          68
Total       72       134         206

                      Strep    No strep    Total
Duration >= 3 days    16       62          78
Duration < 3 days     56       72          128
Total                 72       134         206

 

We can quickly calculate that for fever:

Positive predictive value = 58 / (58+80) = 0.42

Negative predictive value = 54 / (54+14) = 0.79

And for duration of symptoms of 3 or more days:

Positive predictive value = 16 / (16+62) = 0.21

Negative predictive value = 72 / (72+56) = 0.56

So, if a patient has fever, there is a 42% chance of strep, and if they have symptoms for 3 or more days, only a 21% chance. It appears that it may be appropriate to revise our triage policy!
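
The row-wise calculations can be reproduced with a few lines of code. This is a minimal sketch using the cell counts from the two strep tables; the function name and the dictionary layout are our own choices.

```python
def predictive_values(a, b, c, d):
    """Return (positive predictive value, negative predictive value)."""
    ppv = a / (a + b)  # disease among those with a positive test (top row)
    npv = d / (c + d)  # no disease among those with a negative test (bottom row)
    return ppv, npv


# Cell counts (a, b, c, d) from the strep tables.
strep_tables = {
    "fever": (58, 80, 14, 54),
    "duration >= 3 days": (16, 62, 56, 72),
}

results = {name: predictive_values(*cells) for name, cells in strep_tables.items()}
for name, (ppv, npv) in results.items():
    print(f"{name}: PPV {ppv:.2f}, NPV {npv:.2f}, "
          f"post-test probability after a negative test {1 - npv:.2f}")
```

The last figure printed for each finding is c/(c+d), the post-test probability of disease given a negative test, which is one minus the negative predictive value.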

Note that these are the same values we got earlier. This is an important point, and one of the strengths of sensitivity and specificity:

"The sensitivity and specificity do not depend on the prevalence or pre-test probability of disease"

In other words, they are not affected by how common or rare the disease is!  On the other hand, 

"The predictive value varies with the pre-test probability of disease"

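To see why the predictive value moves with the pre-test probability while sensitivity and specificity stay put, the sketch below holds the sensitivity and specificity of fever fixed at the values from the strep table and varies only the pre-test probability. The pre-test probabilities other than the study's own (72/206) are hypothetical, chosen for illustration.

```python
def positive_predictive_value(pre_test_probability, sensitivity, specificity):
    """Bayes' equation for the probability of disease given a positive test."""
    true_pos = pre_test_probability * sensitivity
    false_pos = (1 - pre_test_probability) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


# Fixed test characteristics of fever, from the strep table.
fever_sensitivity = 58 / 72
fever_specificity = 54 / 134
study_prevalence = 72 / 206  # proportion with strep in the study sample

# 0.05, 0.20 and 0.60 are hypothetical settings; study_prevalence is from the table.
pre_test_probabilities = [0.05, 0.20, study_prevalence, 0.60]
ppvs = [
    positive_predictive_value(p, fever_sensitivity, fever_specificity)
    for p in pre_test_probabilities
]

for p, ppv in zip(pre_test_probabilities, ppvs):
    print(f"pre-test probability {p:.2f} -> positive predictive value {ppv:.2f}")
```

At the study's own prevalence the function returns the same 0.42 obtained from the table as 58/138; at lower or higher pre-test probabilities the same fever finding gives a lower or higher predictive value.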


3. Likelihood Ratios

When we decide to order a diagnostic test, we want to know which test (or tests) will best help us rule-in or rule-out disease in our patient. In the language of clinical epidemiology, we take our initial assessment of the likelihood of disease ("pre-test probability"), do a test to help us shift our suspicion one way or the other, and then determine a final assessment of the likelihood of disease ("post-test probability"). Take a look at the diagram below, which graphically illustrates this process of "revising the probability of disease".

Likelihood ratios tell us how much we should shift our suspicion for a particular test result. Because tests can be positive or negative, there are at least two likelihood ratios for each test. The "positive likelihood ratio" (LR+) tells us how much to increase the probability of disease if the test is positive, while the "negative likelihood ratio" (LR-) tells us how much to decrease it if the test is negative. The formula for calculating the likelihood ratio is:

   LR = (probability of an individual with the condition having the test result) / (probability of an individual without the condition having the test result)

Thus, the positive likelihood ratio is:

   LR+ = (probability of an individual with the condition having a positive test) / (probability of an individual without the condition having a positive test)

Similarly, the negative likelihood ratio is:

   LR- = (probability of an individual with the condition having a negative test) / (probability of an individual without the condition having a negative test)

You can also define the LR+ and LR- in terms of sensitivity and specificity:

LR+ = sensitivity / (1 - specificity)

LR- = (1 - sensitivity) / specificity

(Of course, if you're using sensitivity and specificity on a scale of 0 to 100 instead of 0 to 1, the equations would be sensitivity / (100-specificity) and (100-sensitivity)/specificity, respectively).
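
As a quick illustration, here is a minimal sketch of those two formulas on the 0-to-1 scale, applied to fever as a test for strep (sensitivity 58/72 and specificity 54/134, from the 2 x 2 table earlier). The function name is our own, and the sketch assumes the specificity is strictly between 0 and 1, since otherwise one of the divisions is by zero.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for sensitivity and specificity on a 0-to-1 scale."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


# Fever as a test for strep, using the counts from the 2 x 2 table.
lr_pos, lr_neg = likelihood_ratios(58 / 72, 54 / 134)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

Both ratios come out fairly close to 1, which is another way of saying that the presence or absence of fever shifts the probability of strep only a little.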

*     *     *

The first thing to realize about LRs is that an LR > 1 indicates an increased probability that the target disorder is present, and an LR < 1 indicates a decreased probability that the target disorder is present. Correspondingly, an LR = 1 means that the test result does not change the probability of disease at all!   
The following are general guidelines, which must be correlated with the clinical scenario:

LR           Interpretation
> 10         Large and often conclusive increase in the likelihood of disease
5 - 10       Moderate increase in the likelihood of disease
2 - 5        Small increase in the likelihood of disease
1 - 2        Minimal increase in the likelihood of disease
1            No change in the likelihood of disease
0.5 - 1.0    Minimal decrease in the likelihood of disease
0.2 - 0.5    Small decrease in the likelihood of disease
0.1 - 0.2    Moderate decrease in the likelihood of disease
< 0.1        Large and often conclusive decrease in the likelihood of disease

The terms "odds of disease" and "probability of disease" get thrown around a lot as if they were the same thing, but they are not. Let’s consider a group of 10 patients, 3 of whom have strep and 7 of whom don’t.  If we randomly choose a patient, the probability that they will have strep is 3/10 or 0.3 or 30%. On the other hand, the odds of having strep in this group are 3 : 7. Here is a table which relates the odds to the probability:

Probability Odds
1% 1:99
5% 1:19
10% 1:9
20% 1:4
33% 1:2
50% 1:1
67% 2:1
80% 4:1
90% 9:1
99% 99:1

Stated as a mathematical formula (yuck!) this relationship is:

  • for an odds of a : b, probability = a / (a + b)
  • for a probability of x%, the odds are x : (100-x)

Thus, if the odds are 4:9, the probability is 4 / (4+9) = 4/13 = 0.31 (or 31%). Similarly, if the probability is 15%, then the odds are 15 : (100-15) = 15 : 85.   With a little practice, you can easily convert from probability to odds and back again in your head.
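
If you would rather check your mental arithmetic, the two conversions can be sketched as below. Here the odds a : b are represented as the single fraction a/b, and Python's standard `fractions.Fraction` keeps the arithmetic exact; the function names are our own.

```python
from fractions import Fraction


def probability_to_odds(probability):
    """Odds (as the fraction a/b for odds of a : b) from a probability."""
    return probability / (1 - probability)


def odds_to_probability(odds):
    """Probability from odds expressed as the fraction a/b."""
    return odds / (1 + odds)


# The two examples from the text: odds of 4 : 9, and a probability of 15%.
p = odds_to_probability(Fraction(4, 9))
odds = probability_to_odds(Fraction(15, 100))

print(f"odds 4 : 9 -> probability {p} = {float(p):.2f}")
print(f"probability 15% -> odds {odds.numerator} : {odds.denominator}")
```

Note that `Fraction` reduces automatically, so the odds of 15 : 85 are reported in lowest terms as 3 : 17.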

Why should you possibly care about doing this? Well, the likelihood ratio has a very interesting property:

post-test odds of disease = likelihood ratio x pre-test odds of disease

So, for positive and negative tests:

odds of disease for (+) test = odds of disease before testing  x  LR+

odds of disease for (-) test = odds of disease before testing  x  LR-

Now, with a little practice, we can actually estimate the probability of disease given a positive or negative test in our heads! Let’s go through a couple of examples:

You estimate, based on your knowledge of the community, the patient’s age of 10 years, and his symptoms (sore throat, fever, exudate, and adenopathy) that the pre-test probability of strep is approximately 40%. The rapid antigen test for strep is positive; looking at the package insert, you see that it has a sensitivity of 90% and specificity of 90%. The LR+ and LR- are therefore 9 and approximately 0.1 (0.1 / 0.9 = 0.11, rounded here to 0.1 to keep the arithmetic simple). Before proceeding, make sure you understand how we calculated those LRs using the formulas described above.

First, notice that knowing the sensitivity and specificity doesn’t help you much when it comes to calculating the likelihood of disease in your patient. However, in 3 simple steps, we’ll use the LR’s to do just that:

Step    Description                                                            Calculation
1.      Convert the pre-test probability to odds form                          40% = 40 : (100-40) = 40 : 60 = 4 : 6 *
2.      Multiply the pre-test odds by the LR to calculate the post-test odds   (4 : 6) x 9 = 36 : 6
3.      Convert the post-test odds back to a probability                       36 : 6 = 36 / (36 + 6) = 36/42 = 0.86 or 86%

* It simplifies calculations somewhat to reduce the odds to their lowest terms. Thus, 40:60 is the same as 4:6, and is also the same as 2:3. Similarly, 30 : 70 is the same as 3:7.

What if the test is negative? Let’s go through that, using the LR- of 0.1 this time in our calculations:

Step    Description                                                            Calculation
1.      Convert the pre-test probability to odds form                          40% = 40 : (100-40) = 40 : 60 = 4 : 6 *
2.      Multiply the pre-test odds by the LR to calculate the post-test odds   (4 : 6) x 0.1 = 0.4 : 6
3.      Convert the post-test odds back to a probability                       0.4 : 6 = 0.4 / (0.4 + 6) = 0.4/6.4 = 0.06 or 6%

Now, instead of just knowing that a positive strep test makes disease more likely, and a negative one makes it less likely (or worse yet, thinking that a positive test means the patient has disease and a negative test means they don’t) you can estimate the specific likelihood of disease for your patient. This is truly "patient-centred" medicine, since your interpretation of the laboratory test is specific to your patient’s pre-test probability of disease, which is in turn based on his or her age, symptoms, and signs.

In the above example, a positive test provides pretty convincing evidence of strep (86% probability). On the other hand, many physicians would be uncomfortable not treating a child who had a negative strep test, and therefore still had a 6% chance of having strep. After going through this calculation once, you might decide that in similar patients, you will empirically treat them, since a negative test does not rule out disease. Or, you might decide to get a throat culture in patients with a negative strep screen, while giving antibiotics to those with a positive strep screen.
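
The three steps (probability to odds, multiply by the LR, odds back to probability) can be sketched as a single function. The function name is our own; the inputs are the pre-test probability of 40% and the LRs of 9 and 0.1 from the example above.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Revise a probability of disease using a likelihood ratio."""
    # Step 1: convert the pre-test probability to odds.
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    # Step 2: multiply the pre-test odds by the LR to get the post-test odds.
    post_test_odds = pre_test_odds * likelihood_ratio
    # Step 3: convert the post-test odds back to a probability.
    return post_test_odds / (1 + post_test_odds)


after_positive = post_test_probability(0.40, 9)    # positive rapid strep test
after_negative = post_test_probability(0.40, 0.1)  # negative rapid strep test

print(f"Positive test: {after_positive:.0%}")
print(f"Negative test: {after_negative:.0%}")
```

The same function can be reused for any patient simply by changing the pre-test probability, which is the point of the approach: the test characteristics stay fixed while the starting estimate is individualized.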

Let’s consider another example: an older patient with much less typical symptoms of strep (age 20, sore throat, cough, no adenopathy, and no exudate) and a pre-test probability of disease of 5% by your estimate. If the test is positive (remember, LR+ = 9):

Step    Description                                                            Calculation
1.      Convert the pre-test probability to odds form                          5% = 5 : (100-5) = 5 : 95 = 1 : 19
2.      Multiply the pre-test odds by the LR to calculate the post-test odds   (1 : 19) x 9 = 9 : 19
3.      Convert the post-test odds back to a probability                       9 : 19 = 9 / (9 + 19) = 9/28 = 0.32 or 32%

If the test is negative:

Step    Description                                                            Calculation
1.      Convert the pre-test probability to odds form                          5% = 5 : (100-5) = 5 : 95 = 1 : 19
2.      Multiply the pre-test odds by the LR to calculate the post-test odds   (1 : 19) x 0.1 = 0.1 : 19
3.      Convert the post-test odds back to a probability                       0.1 : 19 = 0.1 / (0.1 + 19) = 0.1/19.1 = 0.005 or 0.5%

In this case, a negative test does rule out disease, and a positive test gives a high enough likelihood of disease that you would probably treat the patient, but remain open to other causes for his or her symptoms. Individualizing treatment in this way is much more powerful than simply doing the same thing for every patient.

Getting the most information from a test

When we order a test, we’re accustomed to thinking in terms of the results being positive or negative. However, the actual information in the result is often much richer. Consider the diagnosis of iron deficiency anemia (IDA) from the serum ferritin level. Labs generally report a single cut-off for abnormal around 65 mmol/l, with low values suggesting a diagnosis of iron deficiency anemia. Using that value as a "positive" test, the LR+ is 6 and the LR- is 0.12.

But there is more information hidden in these results. You can also calculate a likelihood ratio for each range of ferritin, as shown below:

Serum ferritin (mmol/l)    # with IDA (% of total)    # without IDA (% of total)    LR      Comment
< 15                       474 (59%)                  20 (1.1%)                     52      Strong evidence for IDA
15-34                      175 (22%)                  79 (4.5%)                     4.8     Moderate evidence for IDA
35-64                      82 (10%)                   171 (10%)                     1       No evidence either way
65-94                      30 (3.7%)                  168 (9.5%)                    0.39    Weak evidence against IDA
> 94                       48 (5.9%)                  1332 (75%)                    0.08    Strong evidence against IDA

Doing these calculations is easy. Set up your table as above, with a column showing the percentage of patients with the disease that have a test value in that range, and a second column showing the percentage of patients without disease that have a test value in that range. Then, divide the first column by the second column to calculate the LR for that range. In the table above, for example, 59% / 1.1% = 52 (using the unrounded percentages, 58.6% / 1.13%).
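
The same procedure can be sketched in code, working from the raw counts in the table so that no rounding creeps in before the division. The variable names are our own.

```python
# Number of patients in each serum ferritin range, from the table above.
with_ida = {"< 15": 474, "15-34": 175, "35-64": 82, "65-94": 30, "> 94": 48}
without_ida = {"< 15": 20, "15-34": 79, "35-64": 171, "65-94": 168, "> 94": 1332}

total_with = sum(with_ida.values())
total_without = sum(without_ida.values())

likelihood_ratio = {}
for ferritin_range in with_ida:
    # Proportion of IDA patients whose result falls in this range ...
    share_with = with_ida[ferritin_range] / total_with
    # ... divided by the proportion of non-IDA patients in the same range.
    share_without = without_ida[ferritin_range] / total_without
    likelihood_ratio[ferritin_range] = share_with / share_without

for ferritin_range, lr in likelihood_ratio.items():
    print(f"{ferritin_range:>6}: LR = {lr:.2f}")
```

Each range gets its own LR, so a ferritin of 10 and a ferritin of 60 are no longer lumped together as the same "positive" result.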

Once again, likelihood ratios help us provide individualized care, and get the most possible information from a test result.  
This is an important advantage of using likelihood ratios!

4. ROC Curves
Because a test generally becomes less specific as it becomes more sensitive, and vice versa, receiver-operating characteristic (ROC) curves are an excellent way to compare diagnostic tests.
Consider again the data for serum ferritin as a test for iron deficiency anemia:

Serum ferritin (mmol/l)    # with IDA    # without IDA
< 15                       474           20
15-34                      175           79
35-64                      82            171
65-94                      30            168
> 94                       48            1332

If we just want to calculate sensitivity and specificity for this test, we have to choose a "cutpoint" which separates 'normal' from 'abnormal'.  If we choose <= 34 as an abnormal ferritin, we can "collapse" some rows and get the following table:

Serum ferritin (mmol/l)    # with IDA      # without IDA
<= 34                      474 + 175       20 + 79
> 34                       82 + 30 + 48    171 + 168 + 1332

Doing the math, we now have a familiar 2 x 2 table:

 

Serum ferritin (mmol/l)    # with IDA    # without IDA
<= 34                      649           99
> 34                       160           1671

Finally, we can calculate sensitivity and specificity for this cutpoint of 34:

Sensitivity = 649 / (649 + 160) = 649 / 809 = 80.2%

Specificity = 1671 / (1671 + 99) = 1671 / 1770 = 94.4%

Remember, though, that the sensitivity and specificity depend on where we make the cutpoint.  I have done the math, and calculated the sensitivity and specificity for each of 4 different cutpoints in the table below:

Cutpoint which defines an abnormally low serum ferritin (mmol/l)    Sensitivity    Specificity
< 15                                                                58.6%          98.9%
<= 34                                                               80.2%          94.4%
<= 64                                                               90.4%          84.7%
<= 94                                                               94.1%          75.3%

This confirms that as the sensitivity increases, the specificity drops, and vice versa.  
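
The "collapsing" of rows can be automated by keeping a running total as the cutpoint moves up through the ferritin ranges. This is a minimal sketch using the counts from the ferritin table; each range is labelled by the cutpoint at its upper edge, and the variable names are our own.

```python
# (cutpoint at the upper edge of the range, # with IDA, # without IDA),
# ordered from the lowest ferritin range to the highest.
bands = [
    ("< 15", 474, 20),
    ("<= 34", 175, 79),
    ("<= 64", 82, 171),
    ("<= 94", 30, 168),
    ("> 94", 48, 1332),
]
total_with = sum(with_ida for _, with_ida, _ in bands)
total_without = sum(without_ida for _, _, without_ida in bands)

cutpoints = []
abnormal_with = abnormal_without = 0
for label, with_ida, without_ida in bands[:-1]:  # the top range is never "abnormal"
    # Everyone at or below this cutpoint now counts as having an abnormal test.
    abnormal_with += with_ida
    abnormal_without += without_ida
    sens = abnormal_with / total_with
    spec = 1 - abnormal_without / total_without
    cutpoints.append((label, sens, spec))

for label, sens, spec in cutpoints:
    print(f"{label:>6}  sensitivity {sens:.1%}  specificity {spec:.1%}")
```

Raising the cutpoint can only add patients to the "abnormal" group, which is why sensitivity never falls and specificity never rises as you move down the list.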
Here is a graphical example, using creatine kinase (CK) and the diagnosis of myocardial infarction (MI):

[Figure CK_curves.gif: distributions of CK values in patients with and without MI, with Cutpoint 1 and Cutpoint 2 marked]

This diagram graphs the creatine kinase values for two groups of patients, those with MI and those without MI.  As we know from our clinical experience, there is an overlap in the CK values between the two groups, shown in the middle of the diagram.   
Using "Cutpoint 1" means that almost all of the patients with MI will be considered 'abnormal', but so will many without MI. 
This is highly sensitive, but not very specific.  
Cutpoint 2 does not misclassify nearly as many of the patients without MI, but also misses more of those who actually had an MI.  When setting our cut points, we have to keep in mind that we are making a trade-off, and have to think about what is worse:  a false positive or a false negative.  In this case, most would agree that a false negative (i.e. telling a patient with an MI that he or she doesn't have one) is the worse error, so we would choose Cutpoint 1.

What about ROC curves?  We're getting there, but the above concepts are important.   Make sure you understand how you can derive multiple pairs of sensitivity and specificity for a diagnostic test, and why sensitivity and specificity are inversely related.

An ROC curve is simply a graph of sensitivity vs. (1-specificity).  Why not sensitivity vs. specificity?  Well, you could do that, but because the area under the curve for sensitivity vs. (1-specificity) has special meaning, whereas it does not for sensitivity vs. specificity, we choose the former.  You'll see.

Below, we've graphed the values from the table of sensitivities and specificities for the diagnosis of iron deficiency anemia using serum ferritin:

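[Figure ROC.gif: ROC curve, sensitivity vs. (1-specificity), for serum ferritin in the diagnosis of iron deficiency anemia]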

The area under the ROC curve (AUROCC) is a reflection of how good the test is at distinguishing (or "discriminating") between patients with and without IDA.  The greater the area, the better the test.  
Let's look at another graph, which shows where a really good test (that has a high sensitivity and specificity) and a perfectly bad test (which classifies diseased patients as healthy, and vice versa) would fall on the ROC curve:

[Figure ROC2.gif: ROC curves for a really good test, a worthless test (diagonal red line), and a perfectly bad test]

A worthless test, which does not discriminate between IDA and non-IDA patients, would have a curve shown by the diagonal red line, with an area under the curve of 0.5. By contrast, the best possible test (100% sensitive and 100% specific) would have an area under the curve of 1.0.
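
To put a number on the area for serum ferritin, one common approach (an assumption here, not necessarily how the original graph was analysed) is the trapezoidal rule: join the ROC points with straight lines, add the corners (0, 0) and (1, 1), and sum the areas of the resulting trapezoids. The sketch below uses the four cutpoints derived from the ferritin counts.

```python
def trapezoid_auc(points):
    """Area under a piecewise-linear curve through (1 - specificity, sensitivity) points."""
    ordered = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        # Width of the strip times the average of its two heights.
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# (1 - specificity, sensitivity) at each ferritin cutpoint, from the counts
# (809 patients with IDA, 1770 without), plus the two corners of the plot.
roc_points = [
    (0.0, 0.0),
    (20 / 1770, 474 / 809),   # cutpoint < 15
    (99 / 1770, 649 / 809),   # cutpoint <= 34
    (270 / 1770, 731 / 809),  # cutpoint <= 64
    (438 / 1770, 761 / 809),  # cutpoint <= 94
    (1.0, 1.0),
]

ferritin_auc = trapezoid_auc(roc_points)
diagonal_auc = trapezoid_auc([(0.0, 0.0), (1.0, 1.0)])

print(f"Serum ferritin: area under the curve = {ferritin_auc:.2f}")
print(f"Worthless test (diagonal): area under the curve = {diagonal_auc:.2f}")
```

With only four cutpoints the straight-line segments give a rough estimate of the area, but it is enough to show that ferritin sits far closer to the ideal value of 1.0 than to the diagonal.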