Gastroesophageal reflux disease (GERD) is a prevalent condition characterized by the retrograde flow of gastric contents into the esophagus, significantly impacting quality of life. Traditional diagnostic approaches often lack precision due to symptom overlap with other conditions. This review introduces the American Foregut Society (AFS) classification, Milan score, pHoenix score, COuGH RefluX score, and Lyon score, five novel tools designed to enhance the objectivity, reproducibility, and clinical relevance of GERD diagnosis.
The AFS classification refines the endoscopic assessment of esophagogastric junction (EGJ) integrity by incorporating measurable parameters (hiatal hernia length, hiatal aperture diameter, and flap valve), overcoming the subjectivity of the Hill classification. The Milan score, derived from high-resolution manometry, integrates four parameters (ineffective esophageal motility, EGJ-contractile integral, EGJ morphology, and straight leg raise response) to quantify anti-reflux barrier (ARB) disruption.
The pHoenix score, developed for prolonged wireless pH monitoring, weights supine AET more heavily, addressing limitations of the DeMeester score and Lyon 2.0 consensus. The COuGH RefluX score, a clinical prediction model for laryngopharyngeal symptoms, uses six parameters (cough, obesity, globus, hiatal hernia, regurgitation, male sex) to stratify GERD likelihood. Finally, the Lyon score integrates endoscopic and pH-impedance data, categorizing patients into phenotypes (from no GERD to severe GERD) and predicting treatment outcomes.
These tools collectively address diagnostic challenges by standardizing assessments and improving patient stratification. By reducing diagnostic ambiguity and guiding personalized therapy, these innovations hold promise for transforming GERD management, particularly in selecting candidates for escalated medical or surgical interventions.
Gastroesophageal reflux disease (GERD) is a common condition where gastric contents flow backward into the esophagus, causing symptoms that can significantly affect quality of life.1 GERD presents with a wide range of symptoms, which are often categorized into two groups: “typical” symptoms, such as heartburn and regurgitation, which have a high likelihood of being linked to GERD, and “atypical symptoms”, like cough, asthma or hoarseness, which are less reliably associated with the condition.2
For patients experiencing typical GERD symptoms with no alarm symptoms, such as unintended weight loss or dysphagia, the American Gastroenterological Association recommends an 8-week trial of proton pump inhibitor (PPI) therapy, taken once daily before a meal. This recommendation is supported by moderate-level evidence and is considered a strong guideline.3 Diagnostic workup is typically indicated after the PPI trial, in case of inadequate response, or upfront, if the presenting symptoms are atypical.4
The first-line diagnostic test for GERD is typically endoscopy. This procedure can confirm GERD by identifying specific findings, such as Los Angeles grade B, C, or D esophagitis, histologically confirmed Barrett’s esophagus, or a peptic stricture. Additionally, endoscopy provides valuable insights into the extent of esophagogastric junction (EGJ) disruption, assessed using the Hill classification, and can detect a hiatal hernia, a condition where the lower esophageal sphincter (LES) and the crural diaphragm (CD) are separated, which is a major risk factor for GERD.5,6
If endoscopy does not provide conclusive evidence of GERD or if surgical intervention is being considered, esophageal function testing becomes necessary. High-resolution manometry (HRM) is a key tool in this phase. While HRM cannot directly diagnose GERD, it plays a critical role, beyond simply guiding catheter placement for reflux testing, by ruling out other conditions that may mimic GERD, such as achalasia, other motility disorders, or behavioral conditions.7–9 Over the past decade, HRM has been extensively explored to identify variables and provocative maneuvers that can quantify the disruption of the anti-reflux barrier (ARB) and differentiate GERD patients from healthy individuals.10–15
For cases in which endoscopy is inconclusive, reflux monitoring is considered the gold standard for diagnosing GERD. Over the years, different criteria have been developed for interpreting 24 h catheter-based reflux monitoring studies. In 1974, Johnson and DeMeester introduced the DeMeester score, which evaluates 6 reflux parameters and remains widely used by surgeons to select candidates for anti-reflux surgery (ARS).16 More recently, the Lyon 2.0 consensus proposed updated benchmarks: an acid exposure time (AET) greater than 6% indicates definitive GERD, an AET between 4 and 6% is inconclusive, and an AET below 4% suggests no GERD. For inconclusive cases (AET 4-6%), additional criteria are used to confirm or rule out GERD, including a total of more than 80 reflux episodes per day, a mean nocturnal baseline impedance (MNBI) below 1,500 ohms, or an association between reflux events and symptoms.4,17
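The Lyon 2.0 benchmarks quoted above amount to a small decision rule, which can be sketched as follows. This is an illustrative sketch only, not a clinical tool: the function name, argument names, and output labels are hypothetical, and it assumes that any single adjunctive criterion is enough to support GERD in the inconclusive range, which simplifies the consensus.

```python
from typing import Optional


def classify_aet(
    aet_percent: float,
    reflux_episodes: Optional[int] = None,
    mnbi_ohms: Optional[float] = None,
    symptom_association: Optional[bool] = None,
) -> str:
    """Apply the AET benchmarks quoted in the text (hypothetical helper)."""
    if aet_percent > 6.0:
        return "definitive GERD"
    if aet_percent < 4.0:
        return "no GERD"
    # AET 4-6%: look for adjunctive evidence (assumption: any one suffices).
    supportive = (
        (reflux_episodes is not None and reflux_episodes > 80)
        or (mnbi_ohms is not None and mnbi_ohms < 1500)
        or bool(symptom_association)
    )
    if supportive:
        return "inconclusive AET, adjunctive evidence supports GERD"
    return "inconclusive AET, no adjunctive evidence"


print(classify_aet(5.0, reflux_episodes=95))
```

Under these assumptions, a study with an AET of 5% and 95 reflux episodes falls in the inconclusive range but is flagged as having adjunctive support.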
While these established criteria and tests provide solid confidence in objective GERD diagnosis, several novel tools to assess GERD severity and treatment response have been developed in the past decade. These innovations aim to make GERD diagnosis more precise and less invasive, and to better identify patients who may benefit from treatment escalation. Our review discusses these new tools, focusing on current evidence, practicality, potential impact on clinical practice, and limitations.
The American Foregut Society classification
Introduction
Endoscopy is the first-line diagnostic test for gastrointestinal symptoms, enabling direct visualization of the upper gastrointestinal tract. It plays a critical role in excluding conditions, such as malignancies, gastric atrophy, and peptic ulcer disease, from the GERD diagnostic pathway.3,18 The Lyon 2.0 consensus further solidifies endoscopy’s crucial role by defining specific findings, such as Los Angeles grade B, C, or D esophagitis, Barrett’s esophagus, or peptic stricture, as diagnostic for GERD. Beyond diagnosis, endoscopic evaluation of the EGJ provides essential insights into the anatomic factors contributing to reflux. To address limitations in earlier EGJ classifications, the American Foregut Society (AFS) has introduced a novel classification system for assessing EGJ integrity, offering a more comprehensive and standardized approach to GERD evaluation.19
Background and limitations of previous scores
Since 1996, the Hill classification has been considered the primary tool for endoscopic assessment of the gastroesophageal flap valve, a critical component of the anti-reflux barrier (ARB).5 Despite its relevance, the Hill classification has not been fully adopted in routine practice due to several limitations. It primarily focuses on the flap valve, giving only minimal attention to hiatal hernia. Additionally, its reliance on subjective assessments rather than measurable parameters, without a standardized endoscopic technique or nomenclature, has reduced its reliability. For these reasons, the distinction between grades I and II, both considered normal, lacks clinical significance, whereas the differentiation between grades II and III shows poor correlation with pathologic reflux rates. These shortcomings have limited the Hill classification’s utility in both clinical practice and research, highlighting the need for a more robust system.
Description of the score
The AFS classification represents a significant improvement by incorporating objective measurements of three key components of the ARB, summarized by the acronym “LDF”: L (length) measures the axial length of the hiatal hernia, D (diameter) assesses the hiatal aperture size, using the standard endoscope diameter of approximately 1 cm as a reference, and F (flap valve) evaluates the presence (F+) or absence (F–) of a functioning gastroesophageal flap valve at the angle of His (Fig. 1).
This system assigns grades from 1 (normal) to 4 (severe anatomic disruption), with the final grade determined by the component exhibiting the most significant abnormality, reflecting the weakest link in the ARB. To ensure consistency, the AFS classification provides a clear methodology, recommending prolonged insufflation (30-45 s) and rotational maneuvers in the retroflexed position, to assess for sliding hiatal herniation. These standardized techniques may reduce inter-observer variability and help to prevent under-grading of EGJ disruption, addressing a key limitation of previous systems.
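The “weakest link” rule can be sketched in a few lines. The sketch assumes that the endoscopist has already mapped each LDF component to a grade from 1 to 4 using the published AFS criteria, which are not reproduced here; the function and argument names are hypothetical.

```python
def afs_grade(length_grade: int, diameter_grade: int, flap_grade: int) -> int:
    """Return the overall grade as the worst of the three component grades.

    Assumption: each component grade (1-4) was assigned beforehand from the
    published AFS criteria; this helper only applies the weakest-link rule.
    """
    components = {"L": length_grade, "D": diameter_grade, "F": flap_grade}
    for name, grade in components.items():
        if grade not in (1, 2, 3, 4):
            raise ValueError(f"{name} component grade must be 1-4, got {grade}")
    return max(components.values())


print(afs_grade(length_grade=1, diameter_grade=3, flap_grade=2))
```

In this example, a normal hernia length does not offset a widened hiatal aperture, so the overall grade follows the D component.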
Evidence and validation studies
Recent prospective studies have validated the AFS classification using objective physiologic measurements.12–20 In a study involving 56 patients with suspected GERD who underwent endoscopy, HRM, and reflux monitoring, the AFS classification showed a strong correlation with both pathologic reflux and manometric EGJ disruption.20
The study found a progressive increase in the prevalence of pathologic AET (>6%) across the AFS grades: 0% in grade I, 5.9% in grade II, 52% in grade III, and 77.8% in grade IV (p < 0.001). In contrast, the Hill classification demonstrated poor discriminatory ability in the same population, with similar rates of pathologic reflux in grades II and III (42.1% vs. 37.9%, p = 0.411), underscoring its limitations.
In the same study, each component of the AFS classification correlated with specific HRM parameters, reflecting different mechanisms of EGJ disruption: the L component correlated with EGJ morphology and intra-abdominal LES length, the D component showed significant association with EGJ-contractile integral (EGJ-CI) and LES basal pressure, and the F component demonstrated correlation with the Straight Leg Raise (SLR) maneuver. These findings confirm the physiologic relevance of the AFS classification components and its ability to stratify patients according to the severity of EGJ disruption. Another recent paper demonstrated the superiority of the AFS classification to the Hill classification, in terms of inter-observer variability, and confirmed its superior ability to predict AET.21
Clinical implications
Though still in the early stages of validation, the AFS classification offers a promising advancement in endoscopic EGJ assessment. Its precise, reproducible protocol, with measurable parameters, minimizes subjective interpretation, potentially improving patient selection for further investigations and reducing unnecessary diagnostic testing in patients with an intact EGJ.
Limitations and future research
Despite its advancements, the AFS classification has certain limitations that warrant consideration. The detailed assessment methodology may require additional training for endoscopists, and further studies on inter-observer variability are needed to ensure consistency among practitioners. Additionally, the recommended procedures, such as prolonged insufflation and provocative maneuvers, extend the duration of routine endoscopy, which could pose challenges in busy clinical settings. Moreover, current validation studies have been limited to tertiary referral centers for esophageal diseases, potentially limiting applicability to community practice. Finally, the AFS classification does not yet address the evaluation of patients who have undergone antireflux interventions, although the American Foregut Society has recently proposed a separate comprehensive endoscopic evaluation for post-ARS patients.22
Additional studies in different clinical settings are needed to confirm the classification’s utility in real-world scenarios outside of tertiary care centers, to assess its long-term predictive value for guiding intervention selection, and to further validate its ability to predict pathologic GERD and stratify patients effectively.
Conclusion
The AFS classification is a promising advancement in the endoscopic assessment of EGJ integrity, providing a standardized and objective framework for evaluating patients with suspected GERD. Although further validation is needed, particularly regarding its generalizability across different practice settings and prognostic value, the classification holds substantial potential to improve both clinical practice and research in GERD management.
The Milan score
Introduction
HRM plays an ancillary role in the diagnostic pathways of GERD, primarily used to exclude major motility disorders that mimic GERD symptoms and to accurately localize the LES for reflux monitoring catheter placement.8 However, despite its secondary role, consensus papers emphasize HRM’s growing importance in evaluating ARB disruption and in the preoperative assessment prior to ARS.9,23–26
Unlike traditional gold-standard methods (pH monitoring and endoscopy), which focus on detecting acid reflux or esophageal lesions, HRM offers a unique ability to quantify underlying functional abnormalities. The recent introduction of the Milan score leverages this capability, providing a single parameter to assess ARB disruption and predict objective GERD.
Background and limitations of previous scores
The exploration of manometric abnormalities correlating with objective GERD measures in patients with upper gastrointestinal symptoms is well-documented.27 One of the fathers of modern esophageal surgery, Dr. DeMeester, extensively studied the potential of conventional manometry. He demonstrated the impact of LES characteristics, including total and intra-abdominal LES length and basal pressure,28 body contraction,29 hiatal hernia, and EGJ response to increased abdominal pressure,30,31 on the pathophysiology of GERD. These factors have been adapted into HRM metrics, including the LES Pressure Integral (LESPI),32 EGJ-CI,10 Ineffective Esophageal Motility (IEM),33 EGJ morphology,11,34–37 thoracoabdominal pressure gradient (TAPG),38 and the Straight Leg Raise (SLR) maneuver.12,13
In 2020, Masuda et al.14 proposed a manometric index combining EGJ morphology, LESPI, and TAPG. However, its receiver operating characteristics (ROC) analysis yielded an area under the curve (AUC) of 0.615, with a sensitivity of 56% and specificity of 60.7% at the optimal cut-off, indicating limited diagnostic accuracy and highlighting the need for a more robust scoring system.
Description of the score
The Milan score is designed to quantify ARB disruption and to estimate the risk of objective GERD in patients undergoing HRM and pH-studies for persistent symptoms. It integrates 4 key HRM parameters:
1. IEM: defined as > 70% ineffective swallows or ≥ 50% failed swallows, it is a measure of esophageal clearance time.
2. EGJ-CI: calculated during the reference period using the distal contractile integral (DCI) tool placed over the EGJ, and adjusted for respiration and gastric pressure, it quantifies the strength and duration of EGJ contraction, reflecting its competency.
3. EGJ morphology: classified as type 1 in cases of superimposed LES and CD, type 2 in cases of LES-CD separation < 3 cm, and type 3 with separation ≥ 3 cm. EGJ morphology identifies the presence and size of hiatal hernia.
4. SLR response: performed in the supine position as previously described,13 with double leg raise if intra-abdominal pressure (IAP) augmentation was insufficient with single leg raising. The SLR measures the ability of the EGJ to counteract the IAP increase, a major pathophysiologic factor of GERD.
The Milan score employs a mathematical formula to weigh these parameters, yielding a final score. The Milan score is computed using the online calculator tool (www.milanscore.com) (Fig. 2). A value ≥ 137 indicates a 50% risk for objective GERD. This structured approach provides a reproducible measure of ARB dysfunction.
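Two of the four inputs have explicit categorical definitions in the text and can be illustrated directly. The sketch below covers only those definitions; it does not reproduce the Milan score formula, which is not given here and is available through the online calculator. The function names are hypothetical, and a measured LES-CD separation of 0 cm is assumed to represent superimposed LES and CD.

```python
def has_iem(ineffective_pct: float, failed_pct: float) -> bool:
    """IEM as defined in the text: > 70% ineffective or >= 50% failed swallows."""
    return ineffective_pct > 70 or failed_pct >= 50


def egj_morphology_type(les_cd_separation_cm: float) -> int:
    """EGJ morphology type from LES-CD separation in cm.

    Assumption: a separation of 0 cm stands for superimposed LES and CD.
    """
    if les_cd_separation_cm < 0:
        raise ValueError("separation cannot be negative")
    if les_cd_separation_cm == 0:
        return 1
    if les_cd_separation_cm < 3:
        return 2
    return 3


print(has_iem(ineffective_pct=80, failed_pct=10))
print(egj_morphology_type(2.5))
```

These helpers only derive the categorical inputs; the weighting of the four parameters remains that of the published calculator.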
Evidence and validation studies
The Milan score was developed using a multicenter cohort of 295 patients and externally validated in a separate cohort of 233 patients.39 The ROC analysis showed an AUC of 0.880 in predicting pathologic GERD, with strong discrimination and calibration (corrected Harrell’s c-index = 0.90, integrated calibration index 0.07) in the validation cohort. Subsequent studies have explored its applicability across different clinical scenarios. In a paper published by the same multicenter group, the effectiveness of the Milan score was confirmed in patients with type 2 EGJ morphology, a challenging group due to partial LES-CD separation, achieving an AUC of 0.858 in identifying those at risk for objective GERD.40
Another study demonstrated its utility in patients with laryngopharyngeal symptoms (LPS), showing a sensitivity of 57.1% and a specificity of 91.3%, suggesting the potential to streamline diagnostic pathways and better select candidates for further testing.41
Additionally, a single-center study of 160 patients demonstrated the Milan score’s ability to predict successful outcomes after ARS, offering an objective ARB assessment after intervention. Despite limitations, such as its retrospective design, lack of objective outcome measures, and inclusion of varied surgical techniques, this finding underscores the score’s promise in surgical contexts.42
Clinical implications
The Milan score’s primary strength lies in its ability to quantify the degree of ARB disruption. Because it integrates multiple HRM parameters, the score is applicable to most patients undergoing HRM. By providing risk rates for GERD, it can also serve as an upfront test to identify patients who may benefit from further invasive tests or to exclude low-risk individuals from further evaluation.
Limitations and future research
Despite its advantages, the Milan score has some limitations that should be considered. The score relies on HRM parameters, which require specialized equipment and expertise that may not be available in all clinical settings.
The SLR maneuver, a cornerstone of the score, poses challenges, as not all patients can perform it correctly, potentially skewing results. The current definition of an effective SLR (a 50% increase in intra-esophageal pressure during the maneuver over baseline) is based on expert opinion and lacks objective validation, warranting further studies. Additionally, the substantial weight of the SLR in the score may be overestimated; larger studies could refine this balance or confirm its predictive power. The weighting of all four components might also benefit from integration with clinical parameters to improve accuracy.
Given the multifactorial nature of GERD and the overlap with other conditions, the Milan score cannot replace a definitive diagnosis but should complement clinical assessment, endoscopy, and reflux monitoring studies. Future research should also evaluate its predictive value for outcomes after medical or surgical treatments to fully validate its clinical utility.
Conclusion
The Milan score represents a significant advancement in GERD diagnosis by offering a physiologically grounded method to stratify risk, based on HRM parameters. Its comprehensive evaluation of esophageal function and ARB integrity enhances the ability to identify objective GERD, paving the way for personalized diagnostic and therapeutic strategies. While further validation is needed, particularly regarding its applicability in diverse clinical settings and outcomes after treatment, the Milan score has substantial potential to improve GERD management.
The pHoenix score
Introduction
Until the DeMeester Score (DMS) was introduced in 1974, GERD diagnosis had relied heavily on symptom assessment, a method often inaccurate due to overlap between patients and healthy individuals.16,43 The DMS was the first attempt to provide an objective measure of esophageal AET, establishing normative values for GERD diagnosis. More recently, the Lyon 2.0 consensus defined objective GERD based on AET and provided criteria for prolonged wireless reflux monitoring.4 However, both methods have limitations, prompting the development of the pHoenix score to address these gaps and improve diagnostic precision.44
Background and limitations of previous scores
Over the years, the DMS has become a cornerstone in GERD diagnosis, considered the gold standard by the surgical community for selecting patients for ARS. The DMS is a composite score that integrates six parameters: recumbent, upright, and total reflux time, number of total episodes, number of episodes over 5 minutes, and longest episode, weighting each one based on standard deviations from healthy controls. Parameters with greater variability (e.g., number of episodes, upright reflux) contribute less to the score, whereas parameters that are seldom found in the controls (e.g., long episodes or recumbent reflux) carry more weight. This approach revolutionized GERD diagnosis by introducing objective AET metrics and has been correlated with endoscopic findings like esophagitis and Barrett’s esophagus.45–48
However, the DMS presents certain limitations: its development cohort was small, it requires patients to accurately report supine and meal periods, and it was validated only for 24 h catheter-based pH studies, making it susceptible to day-to-day variability and lacking thresholds for prolonged reflux monitoring.
In contrast, the Lyon 2.0 consensus defines pathologic GERD solely by AET (> 6%), introducing a borderline range (4-6%) that requires additional criteria and establishing specific indications for definitive diagnosis in patients undergoing wireless 96 h reflux monitoring tests. Lyon 2.0 does not take into account other pH parameters that might affect the severity of GERD, in particular supine reflux; however, its 6% threshold seems to effectively identify true GERD with excellent specificity. To address these gaps, specifically the DMS’s lack of thresholds for prolonged monitoring and the absence of weighting for positional reflux in Lyon 2.0, Latorre-Rodriguez et al. developed the pHoenix score.
Description of the score
The pHoenix score is a composite measure derived from a cohort of patients with AET 2-6% undergoing 48 h wireless pH monitoring that uses the following formula to weigh supine and upright reflux: pHoenix = (% upright AET × 0.991) + (% supine AET × 1.286).
The score sets cut-offs at 8.45 (upper) and 7.06 (lower), defining a grey area for patients with scores in between. Its innovation lies in assigning greater weight to supine AET, reflecting its stronger association with GERD complications compared with upright reflux.
This positional weighting distinguishes the pHoenix score from Lyon 2.0, which treats all reflux equally, based on total AET.
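The formula and cut-offs above can be worked through in a short sketch. The weights and cut-off values are taken from the text; the function names, the output labels, and the treatment of scores exactly equal to a cut-off (assigned here to the grey area) are assumptions made for illustration.

```python
UPRIGHT_WEIGHT = 0.991
SUPINE_WEIGHT = 1.286
LOWER_CUTOFF = 7.06
UPPER_CUTOFF = 8.45


def phoenix_score(upright_aet_pct: float, supine_aet_pct: float) -> float:
    """Weighted sum of upright and supine AET (both in percent)."""
    return upright_aet_pct * UPRIGHT_WEIGHT + supine_aet_pct * SUPINE_WEIGHT


def interpret_phoenix(score: float) -> str:
    """Place a score relative to the two cut-offs.

    Assumption: values exactly on a cut-off are treated as grey area.
    """
    if score > UPPER_CUTOFF:
        return "above upper cut-off"
    if score < LOWER_CUTOFF:
        return "below lower cut-off"
    return "grey area"


score = phoenix_score(upright_aet_pct=4.0, supine_aet_pct=3.0)
print(round(score, 3), interpret_phoenix(score))
```

For example, 4% upright AET and 3% supine AET give 3.964 + 3.858 = 7.822, which falls in the grey area, whereas raising upright AET to 5% gives 8.813, above the upper cut-off.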
Evidence and validation studies
The pHoenix score’s initial validation showed promising results, with an AUC of 0.957 in the identification of pathologic GERD, defined as pathologic DMS.44
Notably, it reduced the proportion of patients classified as borderline per Lyon 2.0 criteria from 77.2 to 13.2% (p < 0.001), offering clearer diagnoses for patients with inconclusive total AET.
The score demonstrated good sensitivity and specificity across thresholds, with internal validation via bootstrapping, confirming its robustness.
Clinical implications
The pHoenix score offers an objective measure of reflux that prioritizes supine AET, which is more closely linked to GERD complications. Unlike the DMS, it defines a grey area, which is crucial, given the progressive nature of GERD, and establishes clear positivity criteria for prolonged wireless reflux monitoring. Compared with Lyon 2.0 criteria, it could be helpful in the identification of pathologic GERD in patients with borderline AET (4-6%).
Limitations and future research
Despite its potential, the pHoenix score has some limitations. Its initial validation was conducted at a single center, limiting generalizability across diverse populations. Multicenter studies are needed to confirm its performance in varied clinical settings and larger cohorts, which would also help refine diagnostic thresholds.
The original study’s focus on patients with AET between 2 and 6% may have introduced selection bias, potentially excluding those with clearly normal or pathologic reflux patterns. Additionally, the accuracy of self-reported supine periods during 48 h monitoring, despite the use of electronic diaries, could affect reliability, as inconsistencies in reporting may skew positional data.
Further research should include head-to-head comparisons with other GERD diagnostic methods, such as impedance-pH monitoring and symptom assessment tools, to clarify the pHoenix score’s advantages. Prospective studies evaluating patient outcomes after medical or surgical therapy are also essential to validate its clinical utility. Finally, since the score was developed using wireless reflux monitoring, its applicability to 24 h catheter-based pH studies requires validation.
Conclusion
The pHoenix score represents an advancement in GERD diagnosis by incorporating positional reflux patterns into a weighted composite measure. Its emphasis on supine AET and the proposal of diagnostic criteria for prolonged wireless monitoring address key limitations of the DMS and Lyon 2.0. While further validation across diverse settings is needed, the pHoenix score has the potential to enhance GERD diagnosis.
The COuGH RefluX score
Introduction
Laryngopharyngeal symptoms, including chronic cough, globus sensation, and hoarseness, have an increasing impact on quality of life and social performance, leading to a significant increase in gastroenterology and ENT clinic consultations.49 Determining which of these patients are affected by true GERD remains difficult, as the LPS spectrum includes a wide range of non-specific clinical manifestations that can be caused by multiple conditions. This diagnostic uncertainty often leads to empiric PPI therapy, which may be ineffective when GERD is not the underlying cause.50 In order to avoid PPI over-prescription, upfront esophageal physiologic tests should be performed,4,51 but they are not always available in all clinical settings. The COuGH RefluX score emerged as a novel clinical prediction tool designed specifically to address this diagnostic gap by stratifying patients with laryngopharyngeal symptoms based on their likelihood of having GERD.52
Background and limitations of previous scores
Traditionally, GERD diagnosis has been preceded by empiric PPI trials that, while convenient, lead to diagnostic uncertainty. The PPI trial lacks specificity and may lead to false positive diagnoses due to the placebo effect or symptomatic improvement unrelated to acid suppression.53 This approach can result in unnecessary long-term medication use with associated costs and potential side effects.54
Although alternative tests and parameters have been proposed,55,56 definitive diagnosis relies on endoscopy and reflux monitoring study, preferably using prolonged wireless monitoring techniques.57 Endoscopy has limited sensitivity in patients with LPS, as up to 70% of LPS patients have non-erosive reflux disease.58
Ambulatory reflux monitoring is invasive, costly, often unavailable in primary care settings, and may not be well-tolerated by patients.
Prior to the COuGH RefluX score, no validated clinical prediction tool existed specifically for identifying GERD in patients presenting with LPS, as questionnaires and scoring systems did not demonstrate sufficient diagnostic accuracy.59,60
Description of the score
The COuGH RefluX score is a clinical prediction model that incorporates 6 readily assessable parameters: Cough, Overweight/Obesity, Globus sensation, Hiatal hernia, Regurgitation, and male seX.52 Each parameter is assigned a specific point value based on its relative predictive strength for GERD (Table 1).
Table 1. Calculation of the COuGH RefluX score.
Variables | Points |
---|---|
Cough | 1.5 |
Overweight (BMI 25-30) | 1.5 |
Obesity (BMI ≥ 30) | 2.0 |
Globus | -1.0 |
Hiatal hernia ≥ 1 cm | 1 |
Regurgitation | 1.5 |
Male sex | 1.5 |

Total score | Result |
---|---|
≤ 2.5 | Unlikely GERD |
3.0-4.5 | Inconclusive GERD |
≥ 5 | Likely GERD |
BMI: body mass index; GERD: gastroesophageal reflux disease.
The score calculation is simple, requiring only basic clinical information that can be obtained during routine patient evaluation without specialized testing, with the possible exception of hiatal hernia detection, which typically requires endoscopy. With the point values in Table 1, the total score can range from -1.0 to 7.5, with a lower threshold of 2.5 and an upper threshold of 5.0 to predict proven GERD.
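The calculation in Table 1 can be sketched as a short function. The point values and thresholds are those listed in the table; the function names and output labels are hypothetical, and a BMI of exactly 30 is assumed to fall in the higher category.

```python
def cough_reflux_score(
    cough: bool,
    bmi: float,
    globus: bool,
    hiatal_hernia: bool,  # hiatal hernia >= 1 cm
    regurgitation: bool,
    male_sex: bool,
) -> float:
    """Sum the Table 1 points for one patient."""
    score = 0.0
    if cough:
        score += 1.5
    if bmi >= 30:  # assumption: BMI of exactly 30 scores as obesity
        score += 2.0
    elif bmi >= 25:
        score += 1.5
    if globus:
        score -= 1.0  # globus lowers the score
    if hiatal_hernia:
        score += 1.0
    if regurgitation:
        score += 1.5
    if male_sex:
        score += 1.5
    return score


def interpret_cough_reflux(score: float) -> str:
    """Map a total score to the Table 1 categories."""
    if score <= 2.5:
        return "unlikely GERD"
    if score >= 5.0:
        return "likely GERD"
    return "inconclusive GERD"


total = cough_reflux_score(True, 27.0, False, False, False, True)
print(total, interpret_cough_reflux(total))
```

In this example, a man with cough and a BMI of 27 but no globus, hiatal hernia, or regurgitation scores 4.5, which places him in the inconclusive category.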
Evidence and validation studies
The original paper developed and validated the COuGH RefluX score in 856 patients (304 training, 552 validation). The training cohort established a model with an AUC of 0.68. A threshold of 2.5 was 82% sensitive, and 5 was 79% specific for proven GERD. In the validation cohort (AUC 0.67), sensitivity was 79% at 2.5, and specificity was 81% at 5. A lower threshold of 1.5 increased sensitivity to 93%, enhancing the negative predictive value, whereas excluding hiatal hernia slightly reduced performance (AUC 0.63).
A subsequent study evaluated the practical application of the COuGH RefluX score in 232 patients, classifying them as unlikely (126 patients), inconclusive (74 patients), or likely (32 patients) GERD. The authors found proven GERD rates increasing progressively (8%, 36.5%, 65.5%; p < 0.001). After excluding erosive esophagitis and Barrett’s esophagus, 196 patients showed increasing PPI response rates across groups (12.4%, 45.0%, 73.3%; p < 0.001). Multivariate analysis identified COuGH RefluX scores and lower MNBI as independent predictors of PPI response (scores ≥ 5.0, OR = 15.772; MNBI, OR = 0.915; p < 0.001).61
Clinical implications
The COuGH RefluX score provides a standardized risk stratification for GERD in patients with LPS. Patients in the “unlikely GERD” category showed clinical characteristics similar to the healthy population, with a low rate of pathologic GERD, low likelihood of PPI response, and higher MNBI. The application of this tool has the potential to avoid unnecessary PPI therapy, reducing inappropriate prescribing.
In the “likely GERD” category, the score provides support for initiating PPI therapy with increased confidence in cases of initial medical evaluation, or to consider invasive physiology tests in cases of chronic LPS. The documented correlation between higher scores and better PPI response allows for more personalized treatment planning and setting of realistic expectations with patients.
Additionally, the COuGH RefluX score may be a practical tool in the hands of specialists in many areas, helping to create a common ground, in particular among gastroenterologists, otolaryngologists, pulmonologists, surgeons, and primary care physicians, who frequently co-manage these patients.
Limitations and future research
Despite its promising utility, several limitations of the COuGH RefluX score must be acknowledged. The score relies partly on subjective symptom reporting, which introduces potential variability and reporting bias. Standardized symptom assessment tools could enhance the reliability of this component.
The detection of hiatal hernia, one of the score’s parameters, requires endoscopy, which may not be readily available in all clinical settings. This could limit its applicability in primary care environments without easy access to endoscopic services. The score retained moderate discriminative performance even without the hiatal hernia parameter (AUC 0.63), but hiatal hernia remains an important component of the score. Moreover, the small cohort of patients with isolated laryngeal symptoms was underpowered for subgroup modeling. Such patients are nonetheless common in clinical practice and deserve specific validation.
Additionally, the COuGH RefluX score does not replace the need for clinical judgment or physiologic tests. The moderate AUC values indicate that while the score is a valuable and practical predictive tool, it is not a perfect discriminator and should be interpreted within the broader clinical context.
The Lyon score
Introduction
In 2024, the Lyon Consensus 2.0 updated the modern definition of reflux disease and introduced the concept of actionable GERD, which refers to settings where a high confidence in GERD diagnosis is essential, including long-term acid suppression following a successful PPI trial, patients with GERD symptoms refractory to PPIs or with atypical symptoms, and patients requiring escalation of medical management or invasive treatment. Endoscopic and ambulatory reflux monitoring criteria were defined as conclusive, supportive, borderline, or non-supportive to diagnose objective GERD. Moreover, the Lyon 2.0 introduced adjunctive parameters, such as the number of reflux episodes (REs), MNBI, or reflux symptom association, to define or refute GERD in cases of borderline AET (4-6%), described the thresholds to be used with prolonged wireless monitoring and with on-PPI testing, and added Los Angeles grade B to the definition of objective GERD.62
The Lyon score was subsequently developed to integrate findings from different diagnostic modalities (impedance pH study [MII-pH] and endoscopy), based on the Lyon 2.0 framework. This novel scoring system aims to translate the concept and evidence behind the Lyon 2.0 into clinical practice, condensing multiple testing metrics into a simple tool.63
Background and limitations of previous scores
Conventional approaches to GERD diagnosis have relied heavily on single parameters or limited composite metrics, each with significant limitations. The DeMeester score, although widely used, particularly by the surgical community, focuses primarily on acid exposure parameters, without considering endoscopic data or impedance parameters.16 Similarly, isolated assessment of AET may result in a substantial number of inconclusive cases (AET 4-6%) and a greater dependence on the day-to-day variability of catheter-based reflux monitoring.
Because these approaches do not jointly account for mucosal integrity, reflux episode frequency, and findings from multiple diagnostic modalities, they leave uncertainty both in the diagnosis and in the expected response to antireflux therapy.
The updated Lyon Consensus 2.0 addresses many of these limitations by establishing a comprehensive framework for GERD diagnosis that incorporates multiple diagnostic parameters. The Lyon score attempts to fill the gap between sophisticated diagnostic criteria and a practical tool for clinical use.
Description of the score
The Lyon score integrates multiple parameters from endoscopy and pH-impedance monitoring into a single, comprehensive scoring system.
The Lyon score includes two endoscopic and four MII-pH parameters:
1. Esophagitis, graded according to the Los Angeles classification as follows: grade A, one or more mucosal erosions < 5 mm; grade B, erosions > 5 mm; grade C, mucosal breaks covering less than 75% of the esophageal circumference; and grade D, mucosal breaks covering ≥ 75% of the esophageal circumference.64
2. Hiatal hernia, detected on endoscopy or HRM.
3. AET, weighted based on established thresholds from the Lyon Consensus 2.0, with higher scores for increased acid exposure.
4. MNBI, a measure of mucosal integrity, with lower values indicating compromised esophageal mucosa, calculated manually as the mean of three 10-min nighttime measurements.65
5. Number of reflux episodes, quantifying the frequency of reflux events, identified by software and then manually verified according to the Wingate consensus.66
6. Reflux-symptom association, defined as either symptom index > 50% or symptom association probability > 95%.67
Each parameter is assigned a specific weight based on its relative importance in GERD diagnosis and correlation with treatment outcomes, and the weighted values are added together to create a composite score (Table 2).
Calculation and phenotypes of the Lyon score.
Category | Esophagitis | Points | AET | Points | Reflux episodes | Points | MNBI (Ω) | Points | Reflux-symptom association | Points | Hiatus hernia | Points |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Normal | None | 0 | <4.0% | 0 | <40 | 0 | >2500 | 0 | No | 0 | No | 0 |
Abnormal | LA-A | 2 | 4-6% | 3.5 | >40 | 2.5 | 1500-2499 | 1.5 | Yes | 1.5 | Yes | 1 |
Abnormal | | | | | | | <1500 | 2.0 | | | | |
Conclusive | LA-B | 4.5 | >6.0% | 4.5 | NA | NA | NA | NA | NA | NA | NA | NA |
Severe | LA-C | 5.5 | >10% | 5 | NA | NA | NA | NA | NA | NA | NA | NA |
Severe | LA-D | 6.5 | | | | | | | | | | |
Phenotype | No GERD | Isolated permeability defect | Reflux hypersensitivity | Borderline GERD | Conclusive GERD | Severe GERD |
---|---|---|---|---|---|---|
Lyon score | 0-0.5 | 0.5-2.0 | 1.0-3.0 | ≥ 3.0 | ≥ 5.0 | ≥ 10.0 |
AET: acid exposure time; GERD: gastroesophageal reflux disease; LA: Los Angeles classification; MNBI: mean nocturnal baseline impedance; NA: not applicable.
Moreover, the Lyon score provides thresholds for categorizing patients into six diagnostic phenotypes (Table 2): no GERD (functional heartburn), isolated permeability defect, reflux hypersensitivity, borderline (inconclusive) GERD, conclusive GERD, and severe GERD.
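The additive calculation in Table 2 can be sketched as follows. This is a hedged illustration, not the published algorithm: all names (`LyonInputs`, `lyon_score`, `gerd_tier`) are hypothetical, and the treatment of boundary values is an assumption wherever the table is silent (for example, exactly 40 reflux episodes or an MNBI of exactly 2500 Ω are scored as normal, and an AET of exactly 6% is scored in the 4-6% band). Because the score ranges listed for the three lowest phenotypes overlap, the tier function only applies the three GERD thresholds (≥ 3.0, ≥ 5.0, ≥ 10.0).

```python
from dataclasses import dataclass
from typing import Optional

# Points per Los Angeles grade, taken from Table 2 (None = no esophagitis).
ESOPHAGITIS_POINTS = {None: 0.0, "A": 2.0, "B": 4.5, "C": 5.5, "D": 6.5}


@dataclass
class LyonInputs:
    la_grade: Optional[str]      # None, "A", "B", "C" or "D"
    aet_percent: float           # acid exposure time, %
    reflux_episodes: int         # number of reflux episodes on MII-pH
    mnbi_ohms: float             # mean nocturnal baseline impedance
    symptom_association: bool    # SI > 50% or SAP > 95%
    hiatal_hernia: bool


def aet_points(aet: float) -> float:
    if aet > 10:
        return 5.0
    if aet > 6:
        return 4.5
    if aet >= 4:                 # 4-6% band; boundary handling is an assumption
        return 3.5
    return 0.0


def mnbi_points(mnbi: float) -> float:
    if mnbi < 1500:
        return 2.0
    if mnbi < 2500:
        return 1.5
    return 0.0


def lyon_score(x: LyonInputs) -> float:
    """Sum the weighted components listed in Table 2."""
    return (
        ESOPHAGITIS_POINTS[x.la_grade]
        + aet_points(x.aet_percent)
        + (2.5 if x.reflux_episodes > 40 else 0.0)
        + mnbi_points(x.mnbi_ohms)
        + (1.5 if x.symptom_association else 0.0)
        + (1.0 if x.hiatal_hernia else 0.0)
    )


def gerd_tier(score: float) -> str:
    """Apply only the GERD thresholds; lower phenotypes overlap in Table 2."""
    if score >= 10.0:
        return "severe GERD"
    if score >= 5.0:
        return "conclusive GERD"
    if score >= 3.0:
        return "borderline GERD"
    return "below borderline threshold"


if __name__ == "__main__":
    patient = LyonInputs("B", 7.0, 50, 1200, True, True)
    total = lyon_score(patient)
    print(total, gerd_tier(total))
```

Keeping each component in its own small function makes it easy to adjust a single cut-off if the published boundary conventions differ from the assumptions made here.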
Evidence and validation studies
In the validation study published by Gyawali et al. in 2024, the development cohort consisted of 281 patients with GERD symptoms, 263 of whom were on PPI therapy and 18 of whom underwent ARS. The median Lyon score was 6.5 (IQR 2.5-10.5). Good outcome was defined as 50% symptom improvement using a visual analog scale (VAS) ranging from 0 (no symptoms) to 100 (severe symptoms).68 Response rates rose with GERD evidence: from 7.1% in the functional heartburn phenotype, to 37.5% in borderline GERD, to ≥ 76.6% in conclusive GERD (p < 0.001). The AUC for predicting good outcomes was 0.819. The Lyon score outperformed the DeMeester score (AUC 0.779, p = 0.019) in predicting outcomes after GERD treatment, with an optimal threshold of 6.25 (81.2% sensitivity, 73.4% specificity).
Two different validation cohorts were evaluated. The European cohort (215 patients from two centers in Italy) was heartburn-predominant, while the Asian cohort (258 patients from Taiwan) was regurgitation-predominant and consisted of previously untreated patients. Median Lyon scores were 2.5 in the European cohort and 4.3 in the Asian cohort, with AUCs of 0.908 and 0.637, respectively (p < 0.001), in predicting treatment response. The Lyon score consistently outperformed individual components and the DeMeester score, especially in the regurgitation-heavy Asian cohort. At a 2.75 threshold (95% sensitivity), it identified symptom response in 85.6% of the European cohort and in 77.9% of the Asian cohort; at the 11.5 threshold (95% specificity), only 14.4% of patients in the European cohort and 12.3% in the Asian cohort had no symptom response.
Clinical implications
The Lyon score has the potential to offer significant clinical advantages for the diagnosis of GERD. By integrating and weighting metrics from endoscopy and MII-pH, it provides a comprehensive tool to stratify patients with GERD symptoms into distinct phenotypes, ranging from functional heartburn to severe GERD. The Lyon score’s ability to predict symptom response to antireflux therapy, ranging from 7.1% in the functional heartburn phenotype to ≥ 76.6% in conclusive GERD, may reduce unnecessary treatments and improve overall therapeutic outcomes, enabling tailored treatment decisions. Patients with low scores might benefit from conservative approaches, such as lifestyle changes or minimal medical therapy, whereas those with higher scores could be prioritized for more invasive options, such as ARS. The score’s reliance on widely available diagnostic tests and potential for automation further enhance its practicality, offering a reproducible, readily available, user-friendly tool. If validated prospectively, the Lyon score could make the GERD diagnosis more efficient, offering a better selection of good candidates for GERD treatment escalation.
Limitations and future research
Despite its promising performance, some limitations of the Lyon score must be acknowledged. The retrospective design of the developmental studies introduces potential biases related to data collection and patient selection. Prospective validation in diverse clinical settings remains necessary to confirm the score’s robustness and generalizability.
The abovementioned study focused on patients with typical GERD symptoms off antisecretory therapy; validation in atypical presentations and on-PPI cohorts is therefore warranted. Variability in symptom profiles across the development and validation cohorts, such as heartburn predominance in the European cohort versus regurgitation predominance in the Asian cohort, suggests potential ethnic or geographic influences on performance. Additionally, the score’s reliance on pH-impedance metrics raises questions about its applicability in settings where wireless pH monitoring is preferred. Future research should prioritize prospective, multicenter trials to externally validate the score across diverse populations and symptom phenotypes. Studies targeting specific therapeutic cohorts (medical vs. surgical) and integration with other scores (e.g., Milan score or COuGH RefluX score) may provide more robust validation and enhance its clinical utility.
Discussion
The diagnosis and management of GERD have evolved significantly over the past few decades, driven by advancements in diagnostic technologies and a deeper understanding of its pathophysiology. This review highlights five novel tools (the AFS classification, Milan score, pHoenix score, COuGH RefluX score, and Lyon score) that aim to enhance the precision, objectivity, and clinical relevance of GERD assessment. These tools address longstanding challenges in GERD diagnosis, such as the subjectivity of earlier methods, the limitations of single-parameter assessments, and the need for treatment stratification. A summary of the modality and performance of the discussed scores is shown in Table 3.
Summary of the setting, modality, and performance of the novel tools.
Tool | Setting | Application | Modality | Threshold | Performance |
---|---|---|---|---|---|
AFS Classification | Suspected GERD | Endoscopic assessment of EGJ integrity | Endoscopy | Pathologic: AFS type III-IV | Correlation with pathologic AET (>6%): 0% (Grade I); 5.9% (Grade II); 52% (Grade III); 77.8% (Grade IV) |
Milan Score | Typical and atypical GERD symptoms; post-operative assessment | Assessment of ARB disruption | HRM | 137 | AUC 0.880; sensitivity 80.4%; specificity 85.2% |
pHoenix Score | GERD symptoms | Diagnosis of GERD in borderline AET cases (2-6%) with prolonged monitoring | Wireless pH monitoring | Lower cut-off: 7.06; higher cut-off: 8.45 | AUC 0.957 in predicting pathologic DeMeester |
COuGH RefluX Score | Atypical GERD symptoms (LPS) | Risk stratification for GERD in patients with laryngopharyngeal symptoms | Questionnaire | Low: ≤ 2.5; intermediate: 3-4.5; high: ≥ 5 | AUC 0.67; sensitivity 79%; specificity 81% |
Lyon Score | Typical GERD symptoms | Diagnosis and phenotyping of GERD, predicting treatment outcomes | Multi-modal (Endoscopy + MII-pH) | 6.25 | AUC 0.819; sensitivity 81.2%; specificity 73.4% |
AFS: American Foregut Society; ARB: anti-reflux barrier; AUC: area under the curve; EGJ: esophagogastric junction; GERD: gastroesophageal reflux disease; HRM: high-resolution manometry; LPS: laryngopharyngeal symptoms; MII-pH: multichannel intraluminal impedance-pH.
All the scores discussed above share the common intent of providing a more objective evaluation. Historically, GERD diagnosis relied heavily on symptom reporting and empiric PPI trials, but this approach has failed, given the significant symptom overlap with other conditions and placebo effects. The introduction of the DMS in 1974 marked a pivotal shift toward objective measurement by quantifying AET and other reflux parameters.16 Several decades later, the Lyon Consensus established AET benchmarks and incorporated adjunctive metrics, such as the MNBI and reflux-symptom association.4 However, even though its AET threshold (6%) correctly identifies true GERD, it did not take into account other reflux monitoring parameters (e.g., supine reflux), and it lacked a simple, practical tool to convert its concepts into clinical practice. To overcome these shortcomings, the pHoenix and the Lyon scores have recently been introduced.
Because clinical assessment, endoscopy, and HRM are an integral part of the GERD work-up and traditional metrics suffered from subjectivity and a lack of diagnostic accuracy, different research groups proposed the COuGH RefluX Score, the AFS Classification, and the Milan score, with the aim to objectively stratify LPS for GERD likelihood, standardize the endoscopic assessment of the EGJ, and quantify ARB disruption.19,39
Together, these tools reflect a paradigm shift toward precision medicine in GERD, aiming to reduce diagnostic ambiguity and optimize therapeutic decision-making.
Each tool brings unique strengths to the diagnostic landscape. The AFS classification improves upon the Hill classification by introducing measurable parameters (length, diameter, and flap valve integrity) and a standardized endoscopic protocol, thereby reducing subjectivity and enhancing reproducibility. Its correlation with physiologic measures, such as the AET and HRM parameters, underscores its potential to identify patients with significant EGJ disruption, who may benefit from escalated therapy. Similarly, the Milan score enhances the capability of HRM to assess functional ARB defects, offering a single, predictive metric with excellent diagnostic accuracy for objective GERD. Its ability to guide preoperative assessment for ARS highlights its clinical utility in complex cases.
The pHoenix score addresses limitations in the DeMeester score and Lyon 2.0 by emphasizing supine AET, a known risk factor for GERD complications, and establishing thresholds for prolonged wireless monitoring. Its ability to reclassify borderline AET cases (4-6%) into clearer diagnostic categories could streamline management decisions. The COuGH RefluX score fills a critical gap in evaluating LPS, a notoriously challenging symptom cluster, by providing a simple, clinic-based tool with approximately 80% sensitivity at its lower threshold and approximately 80% specificity at its upper threshold. Finally, the Lyon score stands out for its integrative approach, combining endoscopic and pH-impedance data to stratify patients into actionable phenotypes, with demonstrated predictive power for treatment outcomes (AUC 0.819).
Despite their promise, these tools face certain limitations that temper their immediate adoption into routine practice, such as their validation in tertiary care centers, a fact that raises questions about applicability in community practice.
Another common thread of these tools is their developmental stage. Most of them have been validated in controlled or specialized settings, with retrospective and monocentric populations, and their performance in diverse populations remains underexplored. Additionally, the multifactorial nature of GERD implies that no single tool can fully capture its complexity, underscoring the need for complementary use with clinical judgment and established diagnostics.
Despite these limitations, the potential impact of these tools on clinical practice is substantial, as they provide a more sophisticated and tailored approach to GERD diagnosis. Some diagnostic tests are not readily available in several countries, thus making clinical evaluation and first-line endoscopic assessment crucial for patient stratification.69 In this light, the COuGH RefluX score and the AFS classification during endoscopy might help to select patients presenting with a high risk of GERD and profound EGJ disruption for confirmatory pathophysiologic tests and to reduce unnecessary testing in those with low risk or intact barriers. The COuGH RefluX score stands out as particularly suitable for resource-limited settings due to its reliance on clinical parameters that can be assessed during routine patient evaluations, with the exception of hiatal hernia detection, which may require endoscopy. In primary care or underserved areas where endoscopy is not readily available, the score’s performance without hiatal hernia remains moderate, suggesting potential utility as a triage tool. However, its dependence on subjective symptom reporting may introduce variability, necessitating standardized patient questionnaires to enhance reliability. In contrast, the AFS classification offers a standardized endoscopic assessment of EGJ integrity but poses challenges in general practice due to the need for trained specialists. To improve accessibility, simplified training protocols or telemedicine-guided endoscopic assessments could be explored.
It is well known that PPI response and good outcomes after ARS rely heavily on objective GERD diagnosis.70,71 The pHoenix score’s focus on supine reflux could refine treatment escalation, particularly in patients with nocturnal symptoms or borderline AET, whereas the Lyon score’s comprehensive nature and phenotypes might offer objective evidence to select patients for surgery (conclusive and severe GERD) or less invasive therapeutic options (neuromodulators, PPI).
Finally, the Milan score’s capability to quantify ARB disruption and identify cases of severe GERD makes it a particularly appropriate and useful test before ARS.
Future research should focus on multicenter, prospective validations to establish the robustness of these tools across diverse populations and clinical scenarios. Integrating clinical parameters (e.g., symptom severity, BMI) with physiologic metrics could enhance diagnostic accuracy, as seen in the COuGH RefluX score’s use of obesity and sex. Comparative studies challenging these tools against each other, or using them in combination, will clarify their relative advantages and combined performance.
Based on the performance and features of each tool, we propose a diagnostic algorithm to integrate these new scores into the diagnostic pathway for GERD, as shown in Fig. 3. Long-term outcome studies are critical for validating these scores as useful tools to predict treatment response to medical therapy escalation or ARS.
Integration proposal of the novel tools into the diagnostic pathway for GERD.
AFS: American Foregut Society; ARS: anti-reflux surgery; GERD: gastroesophageal reflux disease; HRM: high-resolution manometry; LPS: laryngopharyngeal symptoms; MII-pH: multichannel intraluminal impedance-pH; PPI: proton pump inhibitor.
The AFS classification, Milan score, pHoenix score, COuGH RefluX score, and Lyon score represent significant advancements in GERD diagnosis, offering objective, standardized tools to enhance precision and personalize treatment. The AFS classification and COuGH RefluX score streamline endoscopic and clinical assessments, respectively, aiding patient triage in primary care. The Milan score and pHoenix score refine pre-operative and borderline AET evaluations, and the Lyon score integrates multimodal data to predict treatment outcomes. The proposed diagnostic pathway (Fig. 3) integrates these tools to guide clinicians from initial symptom assessment to confirmatory testing and treatment escalation, ensuring efficient and tailored GERD management. As validation progresses and practical barriers are addressed, these innovations have the potential to transform GERD management, improving diagnostic accuracy, reducing overtreatment, and optimizing patient outcomes.
Funding
No funding of any kind was received for this article.
The authors declare no conflicts of interest.