Race, age, and other surrogate variables: are we really thinking this through?


A bigger treatment hurdle for Black kidney patients, and is it fair?

I recently read an article (Powe, 2020) defending the use of race in determining whether a key variable in diagnosing chronic kidney disease (CKD), i.e., glomerular filtration rate (GFR), is low enough to greenlight treatment.  The author argues that the differences in estimated GFR between those who identified as "Black" and others were solidly proven, and that this justifies a corresponding diagnostic double standard.  According to Vyas et al. (2020), the race adjustment was based on the observation that Black people had "higher average serum creatinine concentrations".  Powe also acknowledges that he is swimming against the tide, mentioning several prominent hospitals that have abandoned the use of race in this respect; in fact, a cursory review of popular sources online suggests that this double standard has been dropped.  Referring to an old lab report of my own, I note that my estimated GFR was calculated separately for "Africn Am" and "NonAfricn Am": the former value was about 15% higher than the latter, although the reference ranges (>59) were the same.  However, using different equations to calculate estimated GFR while using the same reference range is in effect applying a different reference range.
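To make that last point concrete, here is a minimal sketch of the arithmetic in Python.  It assumes that the race adjustment acts as a simple multiplier of about 1.15 (the figure I read off my own lab report, not an official coefficient) and that the printed cutoff is 59; the function names are mine and purely illustrative.

```python
# Illustrative arithmetic only. RACE_FACTOR is the "about 15%" difference
# read off one lab report, and REPORTED_CUTOFF is the ">59" printed on it.
RACE_FACTOR = 1.15
REPORTED_CUTOFF = 59.0


def effective_cutoff(reported_cutoff: float, factor: float) -> float:
    """Unadjusted eGFR at which the adjusted value just reaches the cutoff."""
    return reported_cutoff / factor


def flagged_low(unadjusted_egfr: float, factor: float = 1.0) -> bool:
    """True if the reported (possibly adjusted) eGFR is at or below the cutoff."""
    return unadjusted_egfr * factor <= REPORTED_CUTOFF


if __name__ == "__main__":
    print("effective cutoff:", round(effective_cutoff(REPORTED_CUTOFF, RACE_FACTOR), 1))
    for egfr in (50, 55, 58):
        # Same unadjusted estimate, judged without and with the multiplier.
        print(egfr, flagged_low(egfr), flagged_low(egfr, RACE_FACTOR))
```

Under these assumptions, the adjusted value stays above 59 until the unadjusted estimate falls to about 51.3, so a patient with an unadjusted estimate of 55 is flagged as low by one calculation and not by the other.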

The higher estimated GFR for Black people implies that they have to meet a higher standard for treatment, which could deprive Black patients of needed treatment if it is based on the wrong assumptions; in at least one case, a Black kidney patient was almost kept off a kidney transplant list on which a non-Black patient with the same actual GFR would have been placed.  Fortunately, his nephrologist used another, race-neutral, method to estimate his GFR, which yielded the same level as the one that the race-biased method estimated for white men, allowing him to be put on the list (Waddell, 2020).  This is a special point of concern, since, as Powe (2020) admits, Black individuals have a higher rate of CKD than their white counterparts.

The difference between "on the average" and "no overlap"

A difference between the averages of a measurement taken from two groups is not enough to justify jumping to conclusions about those groups.  Can you look at one of those measurements and determine from it what group it comes from?  If you cannot, then you need to be careful about generalizing about what that measurement says about those groups.  This is the problem we face when treating race as a predictor in algorithms.
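The question can be put in numbers.  The following sketch uses two made-up normal distributions (the means and spreads are hypothetical, not taken from any study cited here) and Python's standard statistics.NormalDist to compute how much the two curves overlap, and how often one could guess the group correctly from a single measurement.

```python
from statistics import NormalDist

# Hypothetical numbers chosen only for illustration; they are not
# measurements from any study cited in this article.
GROUP_A = NormalDist(mu=100.0, sigma=15.0)
GROUP_B = NormalDist(mu=110.0, sigma=15.0)   # average is 10% higher


def overlap(a: NormalDist, b: NormalDist) -> float:
    """Shared area under the two density curves (0 = none, 1 = identical)."""
    return a.overlap(b)


def best_guess_accuracy(a: NormalDist, b: NormalDist) -> float:
    """Chance of naming the right group from one value, using a midpoint cut.

    Assumes equal group sizes and equal spreads, in which case the best
    single cut point lies halfway between the two means.
    """
    low, high = (a, b) if a.mean <= b.mean else (b, a)
    cut = (low.mean + high.mean) / 2
    return 0.5 * low.cdf(cut) + 0.5 * (1.0 - high.cdf(cut))


if __name__ == "__main__":
    print("overlap:", round(overlap(GROUP_A, GROUP_B), 3))
    print("accuracy:", round(best_guess_accuracy(GROUP_A, GROUP_B), 3))
```

With these invented figures, the two curves share roughly three quarters of their area, and a single measurement identifies the group correctly only about 63% of the time, not much better than a coin toss, even though the averages differ by 10%.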

No overlap, i.e., predictability, is the typical efficacy requirement in clinical trials of a new drug; in the trade, it's called "statistical significance".  Customers need to be assured that the drug will almost certainly improve the condition that they want to be treated, especially if there are major side effects or the drug is very expensive.  What is confusing, however, is that sometimes no one knows how the drug achieves its benefit, i.e., its "mechanism", so the clinical trial provides the only information we have to base a marketing approval on.

Sometimes, however, the means of two groups are substantially different, but there is also a substantial overlap.  In other types of situations, this can matter.  If these are the safety profiles of a treated group and a placebo group in a clinical trial, this is a red flag (or, in unusual cases, a sign of an unexpected benefit).  It generally calls for further investigation, because it strongly suggests that the study drug has side effects in some patients.  Applying the efficacy criterion, i.e., no overlap, to safety data would effectively dismiss many real side effects.  It's probably safe to say that this type of pattern tells us that there is something we don't know that we need to know: it's best interpreted as a red light rather than a green one.  In sum, when this pattern appears in a comparison of demographic groups, it's not good enough to justify different reference ranges.

Sometimes, demographic group membership should be used to determine reference ranges.  For instance, testosterone levels in normal men and women who are not receiving sex hormone supplementation have no overlap, and a normal level in a man would be abnormal in a woman (and vice versa).  Not using separate reference ranges for these two groups would be likely to result in misdiagnosis of relevant problems.

It's about creatinine, and that doesn't seem to be well understood

The big question is, why do these racial differences exist, and how predictable are they, i.e., how small is the overlap in the frequency distributions of the two race categories?  I find the complex calculations to estimate GFR very puzzling, but one thing is clear: the only lab measurement on which they are based is the level of creatinine, which is a muscle breakdown product.  Getting rid of "dead" muscle tissue, as we learned from the statins/rhabdomyolysis flap, is a job that taxes the kidneys and, at least above a certain level, damages them and in extreme cases destroys them.  On the other hand, well-functioning kidneys manage to clear more creatinine than those with poorer function.  If one has a higher muscle mass and/or eats a high-protein diet, that might mean more creatinine that the kidneys have to get rid of, which might cause them greater stress and perhaps damage them in the long run, especially if intense exercise brings about more muscle remodeling.  Another consideration is that the more protein is devoted to muscle mass, the less is available for the maintenance of vital organs, including the kidneys.

How easy can it be to figure out what causes these racial differences, though, enough to justify this higher standard for Black people?  Do they really all have more muscle mass than all "nonblack" people?

The specifics of the traditional formula

Powe (2020) makes reference to the traditional estimation of GFR (Levey et al., 2009).  Their equation has a few weighting variables in addition to creatinine levels, i.e., sex, race category ("Black" and "White or other"), and age, with sex-specific creatinine cutpoints at 62 µmol/L (0.7 mg/dL) for women and 80 µmol/L (0.9 mg/dL) for men.  It is an interesting description of the population in general, but what explains the differences between these categories, and what is their health significance for individuals?  If a 63-year-old woman receives the news that her estimated GFR has dropped from the previous year, should she be alarmed, given that the equation itself lowers the estimate with each year of age while the reference range she is given, i.e., 59 and above, is the same for everyone?  What is interesting is that the sexes are distinguished by different exponents on creatinine in their GFR calculation formulas, while the races are distinguished by a simple multiplier.  Indeed, the multiplier is a little more than 15% higher for Blacks than for "White or other".
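For readers who want to see the pieces side by side, here is a sketch of the 2009 creatinine equation as it is commonly stated in its single-formula form.  In that form the sex-specific cutpoints apply to serum creatinine (0.7 and 0.9 mg/dL, roughly 62 and 80 µmol/L) and age enters as a continuous factor.  The function and parameter names are my own; this is meant only to show where sex, age, and race enter, not to serve as a clinical calculator.

```python
def ckd_epi_2009(scr_mg_dl: float, age_years: float,
                 female: bool, black: bool) -> float:
    """Estimated GFR (mL/min/1.73 m^2), 2009 CKD-EPI creatinine equation.

    Coefficients are those of the commonly published single-formula form;
    the function name and signature are illustrative assumptions.
    """
    kappa = 0.7 if female else 0.9          # creatinine cutpoint, mg/dL
    alpha = -0.329 if female else -0.411    # exponent used below the cutpoint
    ratio = scr_mg_dl / kappa
    egfr = (141.0
            * min(ratio, 1.0) ** alpha      # applies when creatinine <= kappa
            * max(ratio, 1.0) ** -1.209     # applies when creatinine > kappa
            * 0.993 ** age_years)           # steady decline with each year of age
    if female:
        egfr *= 1.018                       # sex multiplier
    if black:
        egfr *= 1.159                       # race multiplier: a flat 15.9%
    return egfr


if __name__ == "__main__":
    # Same creatinine, age, and sex; only the race flag differs.
    other = ckd_epi_2009(1.0, 50, female=False, black=False)
    black = ckd_epi_2009(1.0, 50, female=False, black=True)
    print(round(other, 1), round(black, 1), round(black / other, 3))
```

Note how differently the categories are handled: sex changes both an exponent and a cutpoint, age is a smooth per-year factor, and race is nothing more than a constant multiplier applied at the end.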

In sum, if we have different expectations for patients according to their race, age, or gender category membership, shouldn't there be a scientific justification for that?  Obvious anatomical differences between men and women exist, and there is minimal overlap in their levels of sex hormones.  But what would explain differences in kidney function, and do these differences imply that one sex is healthier than the other in that respect?  And wouldn't it be a good idea to understand why African ancestry or, in this case, reported "Black" race, is apparently associated with higher GFRs relative to creatinine levels, and how that might be associated with a (known) higher risk of CKD?  Given that the Black population in the U.S. today includes immigrants from all over the large, biologically and ethnically diverse African continent as well as from various Caribbean countries, the 2009 formulas might not even fit in the purest mathematical sense.  Besides, if there is a big overlap in kidney function between these two racial groups, that would be considered a major problem in other scientific contexts, i.e., those in which predictability is demanded, such as clinical trials associated with New Drug Applications (NDAs).

Multiple reference ranges, surrogate variables, and other forms of double standards in medical research

Multiple health standards for different demographic groups of adults are actually quite rare in medicine today.  The cutoff for an unhealthy "bad" cholesterol level is officially the same for all adults, even though the level typically rises with age.  The same BMI cutoffs are applied to all, although this has been questioned with respect to differences in muscle and fat composition.  In fact, nearly every standard blood test applies the same reference ("normal") range to all adults.  But there are a few notable exceptions other than the estimated GFR.  One such lab test is the thyroid stimulating hormone (TSH) level, for which some endocrinologists (Biondi, 2013) have proposed different reference ranges for several age groups and different races.  It turns out that, according to the NHANES III study (Hollowell et al., 2002), "Black, Non-Hispanic" subjects had substantially lower rates of hypothyroidism and higher urine iodine levels than their "White, Non-Hispanic" counterparts.  But we cannot jump to conclusions about why this is the case: different average levels of dietary iodine might explain these differences.  On the other hand, if well-understood inherent iodine metabolism differences explain this, and these differences are great enough for measured levels to be used reliably to identify the racial group of an individual, then different reference ranges might be appropriate.  But we do not have this information as far as I know.

The TSH reference range issue in the case of older people is more clear-cut, with cause-effect arguments having been laid out, with supporting statistics.  Older adults have higher average TSH levels than their younger counterparts, although these differences in averages are rather small.  However, the TSH distribution curve for the oldest people is much more skewed to the right:  in other words, there are more individuals with very high values.  This means, in sum, there is much greater diversity in thyroid health in older people than in younger ones, suggesting a diversity of circumstances.  Old people are more likely to have chronic and progressive diseases, and it makes sense that these would put extra stress on the thyroid, raising TSH levels.  Besides, older people are no less likely than younger ones to have primary hypothyroidism, for which there is no cure, and for some that might be their only disease.  But the argument for setting the hypothyroidism treatment bar higher for older patients is similar to that used to support setting a higher standard for Black people to receive treatment for kidney disease: there is an observed measurement difference in the groups considered, accompanied by claims that this difference is obviously natural and normal and therefore not needing an explanation.  Alas, aging, decline in health, and death are normal by these standards!
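The combination described above, averages that are close together but a much heavier right tail in one group, is easy to illustrate.  The sketch below models each group as a lognormal distribution; every number in it (the medians, the spreads, and the cutoff) is hypothetical and chosen only to show the pattern, not taken from NHANES III or any other source cited here.

```python
import math
from statistics import NormalDist

# Hypothetical parameters, for illustration only.
YOUNGER = {"median": 1.5, "log_sigma": 0.50}
OLDER = {"median": 1.6, "log_sigma": 0.75}   # similar center, wider spread
CUTOFF = 4.5                                  # hypothetical "high" threshold


def tail_fraction(median: float, log_sigma: float, cutoff: float) -> float:
    """Fraction of a lognormal population with values above `cutoff`."""
    log_dist = NormalDist(mu=math.log(median), sigma=log_sigma)
    return 1.0 - log_dist.cdf(math.log(cutoff))


def lognormal_mean(median: float, log_sigma: float) -> float:
    """Mean of a lognormal distribution; the right tail pulls it above the median."""
    return median * math.exp(log_sigma ** 2 / 2.0)


if __name__ == "__main__":
    for label, group in (("younger", YOUNGER), ("older", OLDER)):
        print(label,
              "median:", group["median"],
              "mean:", round(lognormal_mean(**group), 2),
              "share above cutoff:", round(tail_fraction(cutoff=CUTOFF, **group), 3))
```

With these made-up parameters the medians differ by well under 10%, yet the share of individuals above the cutoff is several times larger in the wider distribution.  A single summary number for each group hides exactly the individuals one ought to be most curious about.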

What do reference ranges mean, what should they mean, and how can we agree on how they can be useful?

We should never lose sight of the main function of the reference range, i.e., to separate the healthy from the sick for treatment purposes.  Reference ranges are inevitably based in part on observations of what is typical of the population as a whole.  But when model developers use what I like to call the "stone soup" method of model development, i.e., throwing miscellaneous variables into the model to see what happens with available data, they tend to lose sight of what the model should be telling us.  Using surrogate variables, i.e., easy-to-measure variables that are highly correlated with those that are meaningful to use but difficult or dangerous to obtain, is inevitable in clinical studies.  But using them without having some understanding of why these correlations exist might result in treating members of groups defined by such surrogate variables not simply unfairly but wrongly.  In some cases, this could be a life-and-death matter.


Biondi B (2013) The normal TSH reference range: what has changed in the last decade? J Clin Endocrinol Metab 98(9):3584-87.  Retrieved 6 Nov 2019 from https://academic.oup.com/jcem/article/98/9/3584/2833082

Hollowell JG, Staehling NW, Flanders WD, Hannon WH, Gunter EW, Spencer CA, and Braverman LE (2002) Serum TSH, T4, and thyroid antibodies in the United States population (1988 to 1994): National Health and Nutrition Examination Survey (NHANES III). J Clin Endocrinol Metab 87(2):489-99. Retrieved 7 Dec 2019 from https://academic.oup.com/jcem/article/87/2/489/2846568

Levey AS, Stevens LA, Schmid CH, et al. (2009) A new equation to estimate glomerular filtration rate. Ann Intern Med 150(9):604-12. Retrieved 2 Sep 2020 from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2763564/

Powe NR (2020) Black kidney function matters: use or misuse of race?  JAMA 324(8): 737-8. Retrieved 1 Sep 2020 from https://jamanetwork.com/journals/jama/article-abstract/2769035 (abstract only, but I had access to the full article via subscription)

Vyas DA, Eisenstein LG, and Jones DS (2020) Hidden in plain sight -- reconsidering the use of race correction in clinical algorithms. N Engl J Med 383(9):874-82. Retrieved 5 Oct 2020 from https://www.nejm.org/doi/full/10.1056/NEJMms2004740

Waddell K (2020) Medical algorithms have a race problem. Consumer Reports, 17 Sep 2020.  Retrieved 17 Sep 2020 from https://www.consumerreports.org/medical-tests/medical-algorithms-have-a-race-problem/

Copyright © 2020 by Dorothy E. Pugh.  All rights reserved.