# Composite End Points in Clinical Trials of Heart Failure Therapy

## How Do We Measure the Effect Size?

## Abstract

Composite end points are popular outcomes in clinical trials of heart failure therapies. For example, a global rank composite is typically analyzed using a Mann–Whitney U test, and the results are summarized by the mean of ranks and a corresponding *P* value. The mean of ranks is uninformative, and a clinically meaningful estimate of the treatment effect is needed to communicate study results and facilitate an assessment of heterogeneity (the consistency of the effect across outcomes). The probability index is intuitive for clinicians, easy to calculate, and may be applied to various composites. We suggest a simple and familiar plot to assess heterogeneity across outcomes, which should be routine when analyzing composites. We think that the probability index provides an immediate and simple solution to an overt problem.

## Introduction

Composite end points are increasingly popular outcomes in clinical trials of heart failure (HF) therapy.^{1,2} HF has a complex presentation and pathophysiology, and its outcomes are diverse, leading to the inclusion of clinical events, as well as symptom resolution and biomarker changes. Some composite end points amalgamate these outcomes of different types with the goal of increasing statistical power and presenting results more economically. Hence, the analysis and handling of composite end points is a current and persistent issue, especially in early phase trials, where the sample size precludes the use of mortality as a primary outcome or an adjusted significance level for multiple testing.

However, composite outcomes may yield ambivalent results,^{3} and the uptake of composites has outpaced guidance on their use. In particular, 2 issues are commonly neglected when presenting trial results: (1) an effect measure summarizing the magnitude of the treatment difference; and (2) an explicit assessment of heterogeneity of this effect across the component outcomes. Regarding heterogeneity, some authors have described multiple testing; however, an advantage of composite end points is that they obviate the issue of multiple testing by creating a univariate outcome. An assessment of heterogeneity may imply multiple testing; however, the composite is taken as the primary end point, and adjusting alpha may mean statistical significance is unattainable for any single outcome in phase II research. Alternatively, Pogue et al^{4,5} described a statistically rigorous assessment of heterogeneity; yet, such a modeling approach may be neither easy to implement nor persuasive, and such a test is inappropriate for some composites, for example, those that measure risk benefit.

In meta-analysis, a forest plot of odds ratios (OR; with each OR representing a study) provides a visual inspection of heterogeneity. The OR is a measure that summarizes the magnitude of the treatment difference, termed the effect size. We require an analogous measure for composite end points to enable such a graphical assessment of heterogeneity and to communicate study findings.

## Need for an Effect Size

The effect size should be “[t]he primary focus in interpreting therapeutic clinical research data,”^{6} which is also stipulated in ICH E9: “it is important to bear in mind the need to provide statistical estimates of the size of treatment effects together with confidence intervals (in addition to significance tests).” The Table summarizes some well-known composites and their typical effect sizes, such as the win ratio^{7} and days alive and out of hospital. The choice of effect size is less obvious for composites that combine noncommensurate outcomes (percent change for biomarkers, survival end points, etc.), such as the global rank^{8} and the average *Z* score.^{9} The global rank composite is similar to an unmatched win ratio^{7} and arranges outcomes in a hierarchy, with the most definitive at the top and, accordingly, patients may be ranked from the most adverse response (rank=1, eg, mortality) to the most favorable (rank=n, if there are no ties); see Figure 1 for an example which follows Felker and Maisel.^{8} These ranks are analyzed using a Wilcoxon–Mann–Whitney rank-sum test (*U* test). The average *Z* score, unlike the global rank, is unweighted and is calculated by first translating each patient response on each outcome to a *Z* score and then taking the average across outcomes for each patient. The average *Z* scores obtained are analyzed in the same manner as the global rank. These composites have been compared elsewhere.^{9,10}
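As a rough illustration of the hierarchical ranking described above, the following Python sketch ranks patients for a 3-tier composite (death, then hospital readmission, then a biomarker change). The `Patient` fields, the rule that earlier events are worse within a tier, and the use of midranks for ties are assumptions made for this example; they are not taken from Felker and Maisel or from any trial protocol.

```python
from typing import NamedTuple, Optional


class Patient(NamedTuple):
    pid: str
    death_day: Optional[float]    # None if alive at end of follow-up
    readmit_day: Optional[float]  # None if never readmitted
    biomarker_change: float       # e.g. percent change; a larger increase is worse (assumption)


def severity_key(p: Patient):
    """Sort key: smaller key = more adverse response."""
    if p.death_day is not None:
        return (0, p.death_day)        # tier 0: death, earlier is worse
    if p.readmit_day is not None:
        return (1, p.readmit_day)      # tier 1: readmission, earlier is worse
    return (2, -p.biomarker_change)    # tier 2: biomarker, larger increase is worse


def global_ranks(patients):
    """Return {pid: rank}, rank 1 = most adverse; tied patients share a midrank."""
    ordered = sorted(patients, key=severity_key)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # extend j over the run of patients tied with ordered[i]
        while j + 1 < len(ordered) and severity_key(ordered[j + 1]) == severity_key(ordered[i]):
            j += 1
        midrank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[ordered[k].pid] = midrank
        i = j + 1
    return ranks


if __name__ == "__main__":
    # Hypothetical patients, for illustration only.
    cohort = [
        Patient("A", death_day=10, readmit_day=None, biomarker_change=0.0),
        Patient("B", death_day=None, readmit_day=5, biomarker_change=0.0),
        Patient("C", death_day=None, readmit_day=None, biomarker_change=10.0),
        Patient("D", death_day=None, readmit_day=None, biomarker_change=-30.0),
    ]
    print(global_ranks(cohort))
```

The ranks returned here are what would then be passed to the rank-sum test; a patient who dies is always ranked below a survivor, whatever the biomarker shows, which is the sense in which the composite is weighted.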

Because these composites tend to be rank based, we often see the sum^{8} or mean of ranks displayed by treatment group to summarize the main findings, but the former is misleading and the latter has little clinical meaning. Neither gauges the effect of the investigational drug in the way that, for example, a hazard ratio would for a time-to-first event composite or an OR would for a major adverse cardiac event outcome. Consequently, *P* values are emphasized,^{8} contradicting the guidance on the value of reporting an estimate of the treatment effect (and a confidence interval) to supplement *P* values.

As an example, consider 2 recent studies in HF, one for each of the composites of interest. The FIGHT study (Functional Impact of GLP-1 for Heart Failure Treatment) compared Liraglutide and placebo groups using a global rank composite comprising mortality, hospital readmission, and time-averaged proportional change in N-terminal pro-B-type natriuretic peptide level.^{11} The mean global rank score was presented for each group (146 for Liraglutide and 156 for placebo) without any group difference or confidence interval because a difference between these rank scores (ie, 10) is not readily interpretable in terms of a clinical effect. However, analyses of the component outcomes were summarized with effect sizes, namely hazard ratios for time to death and time to first hospital readmission, and difference in percentage change from baseline for N-terminal pro-B-type natriuretic peptide. The omission of an effect size for the overall composite in the table of results makes the overall interpretation of this result challenging.

The BLAST-AHF study (Biased Ligand of the Angiotensin Receptor Study in Acute Heart Failure) used an average *Z* score as the primary outcome comparing 3 dose groups and a placebo group in patients with acute HF.^{12} The average *Z* score was an average across *Z* scores for 5 outcomes: time to death (≤30 days), time to HF-related hospitalization (≤30 days), worsening HF at 5 days, change in dyspnea visual analogue scale area under the curve, and length of hospital stay. In this case, the results were presented for each outcome and for the overall composite using the difference in mean *Z* scores against the placebo group (this difference was displayed with a confidence interval). This is an effect measure but not an intuitive one. Clearly, a difference of zero indicates no difference between the groups, and a positive difference favors the active treatment. But the magnitude of the effect measure is not informative. For example, is a difference of 1 compelling and worthy of affecting clinical practice? It is not a quantity that allows us to gauge the clinical benefit of the therapy. A solution is needed, and we will now describe an effect measure for HF composite end points.

## Probability Index: An Effect Size for Composite End Points

A likely candidate for an effect size measure for rank-based composites is the probability index (PI), which ascribes a probability to the strength of superiority of the investigational treatment over the control, that is, it represents the probability that a randomly selected patient from the investigational treatment has a superior response to a randomly selected patient from the control group. The PI has been evaluated extensively, although mostly for the case of continuous data.^{13–15} Potentially inhibiting wider adoption is inconsistent terminology, including the individual exceedance probability,^{16} nonparametric relative effects,^{17} relative effect size,^{18} relative treatment effect,^{19} a measure of stochastic superiority,^{20} the global treatment effect,^{21} generalized treatment effect,^{22} theta,^{23} the probability of concordance, and the common language effect size,^{24} and more explicitly, it is referred to as *P*(*X*>*Y*).^{25}

The PI is easily derived from the Wilcoxon–Mann–Whitney *U* statistic (which is the default approach to analysis as noted above) and is equivalent to a more common measure, the area under the receiver operating characteristic curve. *U* may be thought of as the number of wins resulting if every patient in the active group were compared with every patient in the control group. The PI is this number divided by the total number of such comparisons (ie, the number of patients in one group multiplied by the number in the other). The PI is suitable for ranked data but has also been described for normally distributed data,^{26} time-to-event data,^{27} and non-normally distributed continuous data.^{19} This is a key advantage of the PI because it allows the effect on every outcome to be estimated using the same measure, rather than a mix of hazard ratios, differences between means, and so forth; the effects across outcomes are thus comparable, and heterogeneity (ie, the inconsistency in the effect across outcomes) may be readily assessed. The PI has been promoted in the statistical literature, but despite its intuitive appeal, it remains underutilized in medical research.
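The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the SAS macro referred to in this article: the function names are invented for the example, higher values are assumed to be better, and ties are counted as half a win, which is the usual convention for the Mann–Whitney statistic. The second function shows that the mean of ranks already contains the PI once the group sizes are known.

```python
def probability_index(active, control):
    """PI = U / (n_active * n_control), with U counted over all pairs.

    A pair scores 1 when the active patient has the better (higher) response,
    0.5 when the two are tied, and 0 otherwise.
    """
    wins = 0.0
    for x in active:
        for y in control:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(active) * len(control))


def pi_from_mean_rank(mean_rank_active, n_active, n_control):
    """Recover the PI from the mean rank of the active group.

    Ranks are taken over both groups combined (midranks for ties), with
    higher ranks better. Uses U = R - n(n + 1)/2, where R is the rank sum
    of the active group and n is its size.
    """
    return (mean_rank_active - (n_active + 1) / 2) / n_control


if __name__ == "__main__":
    # Hypothetical responses, for illustration only.
    active = [3, 5, 6]
    control = [1, 2, 4]
    print(probability_index(active, control))
    # Ranks of the active group in the pooled sample are 3, 5, and 6.
    print(pi_from_mean_rank((3 + 5 + 6) / 3, 3, 3))
```

Both calls give the same value because 8 of the 9 active-versus-control comparisons are wins for the active group.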

The confidence intervals of the PI require more computation. Newcombe^{28} evaluated several methods for obtaining confidence intervals, and we follow their method 5, which was shown to be superior to alternatives and is in use elsewhere.^{13,25,29,30} Note that a PI of 0.5 implies no difference between the groups, and therefore, we are interested in testing the null hypothesis *H*_{0}: PI=0.5. If the 95% confidence interval does not encompass 0.5, the null hypothesis is rejected (eg, we can say active treatment is superior to control). We assume that the variance of responses in each group is roughly equal. An SAS macro that yields the PI and its confidence interval is provided at the following link: https://paulmbrown-programs.blogspot.com/.
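Newcombe's method 5 is not reproduced here. As a stand-in for readers who want to experiment, the sketch below uses a percentile bootstrap, which is a different and generally cruder technique; it should not be expected to match the intervals reported in this article. The function names, the number of resamples, and the seed are all assumptions chosen for the example.

```python
import random


def probability_index(active, control):
    """Proportion of active-versus-control pairs won by active (ties count 0.5)."""
    wins = 0.0
    for x in active:
        for y in control:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(active) * len(control))


def bootstrap_ci(active, control, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the PI (NOT Newcombe's method 5).

    Each group is resampled with replacement, independently of the other.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        a = rng.choices(active, k=len(active))
        c = rng.choices(control, k=len(control))
        estimates.append(probability_index(a, c))
    estimates.sort()
    alpha = 1.0 - level
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


def rejects_null(lower, upper):
    """True when the interval excludes PI = 0.5."""
    return lower > 0.5 or upper < 0.5


if __name__ == "__main__":
    # Hypothetical ranks, for illustration only.
    active = [12, 15, 9, 20, 18, 14, 17, 11]
    control = [8, 13, 7, 10, 16, 6, 5, 19]
    lo, hi = bootstrap_ci(active, control)
    print(probability_index(active, control), (lo, hi), rejects_null(lo, hi))
```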

## Interpreting the Magnitude of PI

Unlike a hazard ratio or OR, the PI is bounded and falls between 0 and 1, with a value of 0.5 implying no difference between the treatment groups and values above and below this indicating supportive and negative results, respectively. Figure 2 shows the separation between density curves for different values of the PI for a 3-tier global rank composite and a sample size of n=100; the closer the PI is to 1, the stronger the benefit of the investigational treatment over the control. In particular, note the separation of the peaks of the distributions. As a probability, it is reminiscent of a *P* value, and it is tempting to provide a threshold or region indicating evidentiary strength, for example, a large effect. Acion et al^{14} suggest 0.7 is a large, 0.64 a medium, and 0.56 a small difference, but it is too simplistic to apply such an interpretation across different study populations, end points, follow-up, and outcomes.

The interpretation of the magnitude of the PI will depend on the composite being used and other design features that affect the variability of outcomes, for example, eligibility criteria. If clinical outcomes, including mortality, are prioritized as per a global rank composite, then a PI of 0.6 would be impressive; however, for an unweighted average *Z* score potentially dominated by a biomarker, a value of 0.6 may not be so compelling (for the average *Z* score, the contribution of outcomes is not limited in the way it is for the hierarchical global rank composite^{10}). But because these composites are essentially different outcomes, it stands to reason that we interpret them differently (as we would for a hazard ratio that corresponds to time to death versus a hazard ratio corresponding to time to hospital readmission). However, if for each of the component outcomes it is known what is deemed a clinically important difference, then it is possible to gauge what this translates to in terms of the PI for a particular composite using data simulations analogous to an anchor-based approach.^{31} For a composite end point, the magnitude of the PI will of course depend on the strength of the treatment effect across the component outcomes, and the effect an individual outcome has on the PI will be limited according to the construction of the composite. One could use data simulations to plot the PI for the component outcomes versus the PI for the composite to gauge the influence of individual outcomes. The slope of the line would indicate how sensitive the composite is to the effect on the outcome, that is, it would be suggestive of the weighting or what Cordoba et al referred to as the inflation factor.^{32} For example, the average *Z* score would show a more congruent relationship with the individual outcomes because it is unweighted.
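For a single, normally distributed outcome with a common standard deviation in both groups, the translation from a clinically important difference to a PI does not need simulation, because the PI has a closed form: PI = Φ(Δ / (σ√2)), where Δ is the difference in means and σ the common standard deviation. The sketch below applies that formula. It covers only this special case, not a full composite, and the function name is an assumption made for the example.

```python
from math import sqrt
from statistics import NormalDist


def pi_from_mean_difference(delta, sd):
    """PI for two normal groups with equal SD.

    If X ~ N(mu + delta, sd^2) and Y ~ N(mu, sd^2) are independent, then
    X - Y ~ N(delta, 2 * sd^2), so P(X > Y) = Phi(delta / (sd * sqrt(2))).
    """
    return NormalDist().cdf(delta / (sd * sqrt(2)))


if __name__ == "__main__":
    # Standardized differences (delta / sd) of 0.2, 0.5, and 0.8.
    for d in (0.2, 0.5, 0.8):
        print(d, round(pi_from_mean_difference(d, 1.0), 2))
```

Standardized differences of 0.2, 0.5, and 0.8 map to PIs of roughly 0.56, 0.64, and 0.71, which is close to the small, medium, and large values quoted earlier from Acion et al. For a composite, the same idea would have to be applied by simulation, as the text describes, because the composite combines several outcomes with different distributions.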

## Assessing Heterogeneity Among Component Outcomes

For the summary of results for a composite and its component outcomes, tabulations have been suggested.^{33,34} But this can seem cluttered and inadequate on its own. As noted earlier, the benefit of the PI is that it may be applied to various types of outcome, and hence, we may summarize results using a common measure.

We replicate the hypothetical phase II data of Felker and Maisel^{33} for illustration (recall Figure 1). Figure 3 shows the familiar forest plot (typically used for meta-analyses and subgroup analysis) with the PI estimates and their 95% confidence intervals for each outcome and overall for the composites. We have considered 2 scenarios, that is, concordant effects and discordant effects (a negative effect is included for dyspnea). The results are plausible, that is, most outcomes are suggestive of an effect, but statistical significance is unattainable in the small study sample. The PI for mortality is 0.515, which is equivalent to a hazard ratio near 1.^{35} Thus, the optimism of a lone biomarker (pro-B-type natriuretic peptide in this example) is dampened by a low mortality rate for the weighted composite but proves influential in the unweighted composite (average *Z* score), where the result might be reported as a significant result by the investigators. Statistical significance is achieved for the composites, but the effect seems modest. For example, for the average *Z* score, we would report a PI of 0.639 (0.559, 0.710), which means the probability that a randomly selected patient in the active arm has a superior response to a patient in the control arm is 0.64, or in other words, the ranks on active tend to be larger than those on control (higher ranks are better; as in Figure 2). When discordant effects are present, interpretation of the composite becomes problematic^{1}; note that the average *Z* score remains statistically significant, and the global rank does not. Claiming a positive result overall based on discordant effects may give a false impression of the value of the treatment; thus, caution is warranted, and the forest plot is recommended to enable a complete interpretation.
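To make the layout of such a display concrete, the sketch below renders one row of a text-only forest plot on the 0 to 1 scale of the PI, with a reference mark at 0.5. It is a toy rendering under assumed names (`forest_row`, `excludes_null`) and an assumed fixed width; it is not how Figure 3 was produced. The only numbers used are the average *Z* score estimate and interval quoted above, and rows for the component outcomes would be added in the same way.

```python
def forest_row(label, est, lo, hi, width=41):
    """One text row: '-' spans the interval, 'o' marks the estimate, '|' marks PI = 0.5."""

    def col(p):
        # Map a probability in [0, 1] to a column index in [0, width - 1].
        return round(p * (width - 1))

    cells = [" "] * width
    for i in range(col(lo), col(hi) + 1):
        cells[i] = "-"
    cells[col(0.5)] = "|"
    cells[col(est)] = "o"
    return f"{label:<16}{''.join(cells)}  {est:.3f} ({lo:.3f}, {hi:.3f})"


def excludes_null(lo, hi):
    """True when the interval lies entirely on one side of PI = 0.5."""
    return lo > 0.5 or hi < 0.5


if __name__ == "__main__":
    # Estimate and 95% confidence interval for the average Z score, as quoted in the text.
    print(forest_row("Average Z score", 0.639, 0.559, 0.710))
    print(excludes_null(0.559, 0.710))
```

Reading down the rows, discordant effects show up as estimates on opposite sides of the reference mark, which is the pattern the forest plot is meant to expose.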

Adopting the hypothetical example of Felker and Maisel, we have suggested an alternative or supplementary presentation of their results that allows a ready assessment of heterogeneity of the effect across outcomes. This graphical assessment could be applied routinely in the analysis of composite end points to aid the interpretation of results. Few solutions have been offered for investigating heterogeneity in the context of composite end points. Pogue et al describe a statistical test for binary^{5} and survival^{4} outcomes; however, for the rank-based composites that may measure risk benefit, we prefer a graphical assessment and think that heterogeneity cannot be discerned by a hypothesis test with a yes/no declaration at some significance level; clinical reasoning is needed. For these types of composites, a certain amount of variation in the magnitude of the effect is expected, with some outcomes more sensitive than others (eg, N-terminal pro-B-type natriuretic peptide in our illustration); yet, the components should demonstrate directional concordance,^{36} and discussion regarding the presence of heterogeneity should not, therefore, hinge on a single *P* value testing homogeneity of effects, especially given a likely lack of power for the test in phase II studies. For clarity, we would even suggest dropping the term heterogeneity in this context and instead referring to discordant effects or opposing effects, that is, effects that are counteracting, or to qualitative versus quantitative interactions between treatment and outcomes. Our method may also be simpler, familiar, and applicable in varying circumstances. The forest plot satisfies the recommendation that component outcomes should be analyzed separately and appear alongside the results for the overall composite.^{1,2,32,37} Such a display reiterates that statistical significance has not been achieved on the component outcomes and highlights that treatment effects vary across components.

In the context of phase II research, a *P* value aids the impending decision of whether to proceed to phase III. However, the PI estimate and its confidence interval provide an enhanced interpretation supplementary to a *P* value regarding the magnitude of the effect, distinguishing between statistical significance and clinical significance. It is conceivable that a *P* value may reach the threshold for significance while the PI suggests a negligible effect, as we see for the global rank in Figure 3, where the *P* value is borderline significant. We should then look at the estimates for the component outcomes to see what is driving the result. This is a discussion that cannot be informed by *P* values alone; however, the PI, like composites themselves, should be used when concordant effects are anticipated.

## Potential Caveats and Critiques of the PI

The PI has been criticized^{16,38} because the estimate depends on the variance, and thus, comparing results across studies is problematic. However, statisticians have responded to the issues raised,^{13,39–41} and research is ongoing. Nunney et al^{13} are looking at covariate adjustment when data are non-normally distributed. Whether composite end points that combine disparate outcomes are clinically and statistically meaningful may be questioned,^{36} especially for phase III trials. However, in phase II trials, combining uncorrelated outcomes increases the efficiency of the composite, and they have become increasingly popular because ranking patient responses using a set of HF end points has intuitive appeal in early phase research. The forest plot described seems especially pertinent for such composite end points where tentative conclusions are derived from a single summation of disparate outcomes. An assessment of the composite and its components is needed,^{2} and the PI may facilitate communication between biostatisticians, clinical trialists, cardiologists, and the wider patient and medical community. Previous evidence suggests that a physician’s willingness to prescribe is affected by the way in which trial results are reported,^{42} and medical journals have requested that study results include estimates to supplement *P* values.^{43} An SAS macro is provided at the following link to enable ready calculation of the PI and its confidence interval: https://paulmbrown-programs.blogspot.com/.

The PI is nonparametric and appropriate for ranked data such as the global rank composites. We restricted attention to 2 particular composites, but the PI may be applied to a variety of composites, for example, time to first event and days alive and out of hospital,^{44} or those that are 3-tier composites, for example, combining mortality, hospital readmission, and a biomarker (typically 3 or 4 outcomes are combined^{1}). In addition, some end points like days alive and out of hospital may not be familiar to the entire readership; some readers may lack a sense of what constitutes an important difference on this scale, and the PI can clarify this. Finally, because the PI is derived from the test that is commonly applied (ie, the Wilcoxon–Mann–Whitney U statistic), there is no change to the analysis.

## Conclusions

The PI was described decades earlier^{45,46} but has been slow to appear in the results of clinical trials using composite end points. Its value has been expressed in statistics journals,^{14} although some have argued that the quantity is somewhat convoluted and may not be easily grasped.^{16} However, we and others^{14,40} think that it is a value clinicians will find intuitive (more so than a hazard ratio^{35}) because its interpretation is phrased in terms of individual patients rather than population averages, and it is no more esoteric than the interpretation of a *P* value.^{47} Califf et al^{48} noted in 1990: “we have become interested in the use of combined end points. ... The major disadvantage ... is that the scale that is developed may not be readily interpretable.” Yet 25 years later, top-line results are typically reported without meaningful effect estimates, and researchers have noted that the presentation of results should improve.^{32} Thus, the PI provides an immediate solution to an overt problem; it is apt, easily calculated, and ought to gain wider use, especially when the end point is an amalgamation of noncommensurate outcomes.

## Sources of Funding

Alberta Innovates–Health Solutions (AIHS) and the Canadian Institutes of Health Research (CIHR) provided grant support for AHF-EM. AIHS provided support for Dr Ezekowitz. Motyl Studentship in Cardiac Sciences provides support for P. Brown.

## Disclosures

P. Brown has no industry relationships to declare. Dr Ezekowitz has received grants or honoraria from Novartis, Servier, Bayer, Merck, Trevena, Amgen, Canadian Institutes of Health Research, National Institutes of Health, and Heart and Stroke Foundation of Canada.

- © 2017 American Heart Association, Inc.

## References

- 1.
- 2.
- 3. Johnston SC, Amarenco P, Albers GW, Denison H, Easton JD, Evans SR, Held P, Jonasson J, Minematsu K, Molina CA, Wang Y, Wong KS.
- 4.
- 5.
- 6. Mark DB, Lee KL, Harrell FE Jr.
- 7. Pocock SJ, Ariti CA, Collier TJ, Wang D.
- 8. Felker GM, Maisel AS.
- 9. Sun H, Davison BA, Cotter G, Pencina MJ, Koch GG.
- 10. Brown PM, Anstrom KJ, Felker GM, Ezekowitz JA.
- 11. Margulies KB, Hernandez AF, Redfield MM, Givertz MM, Oliveira GH, Cole R, Mann DL, Whellan DJ, Kiernan MS, Felker GM, McNulty SE, Anstrom KJ, Shah MR, Braunwald E, Cappola TP.
- 12. Felker GM, Butler J, Collins SP, Cotter G, Davison BA, Ezekowitz JA, Filippatos G, Levy PD, Metra M, Ponikowski P, Soergel DG, Teerlink JR, Voors AA.
- 13. Nunney I, Clark A, Shepstone L.
- 14.
- 15.
- 16.
- 17. Grabcanovic-Musija F, Obermayer A, Stoiber W, Krautgartner WD, Steinbacher P, Winterberg N, Bathke AC, Klappacher M, Studnicka M.
- 18. Konietschke F, Placzek M, Schaarschmidt F, Hothorn LA.
- 19. Fokianos K, Troendle JF.
- 20. Vargha A, Delaney HD.
- 21.
- 22.
- 23. Dahlqvist HZ, Landstedt E, Gådin KG.
- 24.
- 25.
- 26. Reiser B, Guttman I.
- 27. Jiang S, Tu D.
- 28.
- 29. Gelston EA, Coller JK, Lopatko OV, James HM, Schmidt H, White JM, Somogyi AA.
- 30.
- 31. Julious SA, Walters SJ.
- 32. Cordoba G, Schwartz L, Woloshin S, Bae H, Gøtzsche PC.
- 33. Felker GM, Maisel AS.
- 34.
- 35. Moser BK, McCann MH.
- 36. Guideline on clinical investigation of medicinal products for the treatment of acute heart failure. European Medicines Agency: Committee for Medicinal Products for Human Use 2015.
- 37.
- 38. Senn S.
- 39.
- 40.
- 41.
- 42.
- 43.
- 44. Ariti CA, Cleland JG, Pocock SJ, Pfeffer MA, Swedberg K, Granger CB, McMurray JJ, Michelson EL, Ostergren J, Yusuf S.
- 45.
- 46. Glass GV.
- 47. Statistical challenges in assessing and fostering the reproducibility of scientific results: summary of a workshop. National Academies of Science, Engineering and Medicine, 2016. Available at: https://www.nap.edu/catalog/21915/statistical-challenges-in-assessing-and-fostering-the-reproducibility-of-scientific-results. Accessed January 4, 2017.
- 48. Califf RM, Harrelson-Woodlief L, Topol EJ.


- Paul M. Brown and Justin A. Ezekowitz. Composite End Points in Clinical Trials of Heart Failure Therapy. Circulation: Heart Failure. 2017;10:e003222. Originally published January 11, 2017. https://doi.org/10.1161/CIRCHEARTFAILURE.116.003222