External validation of an opioid misuse machine learning classifier in hospitalized adult patients

Background Opioid misuse screening in hospitals is resource-intensive and rarely done. Many hospitalized patients are never offered opioid treatment. An automated approach leveraging routinely captured electronic health record (EHR) data may be easier for hospitals to institute. We previously derived and internally validated an opioid classifier in a separate hospital setting. The aim is to externally validate our previously published and open-source machine-learning classifier at a different hospital for identifying cases of opioid misuse. Methods An observational cohort of 56,227 adult hospitalizations was examined between October 2017 and December 2019 during a hospital-wide substance use screening program with manual screening. Manually completed Drug Abuse Screening Test served as the reference standard to validate a convolutional neural network (CNN) classifier with coded word embedding features from the clinical notes of the EHR. The opioid classifier utilized all notes in the EHR and sensitivity analysis was also performed on the first 24 h of notes. Calibration was performed to account for the lower prevalence than in the original cohort. Results Manual screening for substance misuse was completed in 67.8% (n = 56,227) with 1.1% (n = 628) identified with opioid misuse. The data for external validation included 2,482,900 notes with 67,969 unique clinical concept features. The opioid classifier had an AUC of 0.99 (95% CI 0.99–0.99) across the encounter and 0.98 (95% CI 0.98–0.99) using only the first 24 h of notes. In the calibrated classifier, the sensitivity and positive predictive value were 0.81 (95% CI 0.77–0.84) and 0.72 (95% CI 0.68–0.75). For the first 24 h, they were 0.75 (95% CI 0.71–0.78) and 0.61 (95% CI 0.57–0.64). Conclusions Our opioid misuse classifier had good discrimination during external validation. 
Our model may provide a comprehensive and automated approach to opioid misuse identification that augments current workflows and overcomes manual screening barriers. Supplementary Information The online version contains supplementary material available at 10.1186/s13722-021-00229-7.


Background
Opioid-related inpatient and emergency department (ED) visits have increased 64% since 2009, and the rate of opioid-related ED visits has nearly doubled through 2014, including recent rises during the COVID-19 pandemic [1,2]. Many patients engage the health system for the first time after a physical health complication related to opioid misuse such as endocarditis or respiratory infection [3]. The care team is frequently focused on the acute physical ailment and not the patient's opioid misuse. Large treatment gaps continue to exist as hospitals serve a high concentration of individuals with opioid misuse who do not receive screening, especially when admitted for another condition [4]. Existing universal screening questionnaires such as the interviewer-administered screening questions [5] require significant staff time and training to administer. Further, patients may be reluctant to report stigmatized behavior to an interviewer [6,7]. Overall, conventional screening methods are resource-intensive and face significant barriers to successful implementation in a hospital setting [8].
Routinely collected data in the electronic health record (EHR) may be leveraged to identify cases of opioid misuse. Patients are more likely to disclose substance use to their hospital primary care team than to designated screeners who are not part of the care team [9,10]. The admission notes and social history sections of notes written by the provider teams frequently contain details about substance use but are rarely accessed for surveillance or screening programs. Computational methods of natural language processing (NLP) can derive discrete representations of clinical notes, from which machine learning can predict opioid misuse better than rule-based approaches [11][12][13][14].
We previously published and made publicly available an opioid misuse classifier using NLP and machine learning from the clinical notes [15]. In hospitalized patients, our convolutional neural network (CNN) outperformed a rule-based approach and other machine learning methods. Our CNN opioid classifier had 79% sensitivity and 91% specificity, and our results showed that clinical notes from the hospitalization can be used to identify opioid misuse and serve as an alternative to manual screening by staff. Our opioid classifier was originally trained and calibrated in a source cohort of high-risk inpatients at Loyola University Medical Center. The trained model comprised 15,651 medical concepts from 63,301 notes fed into the CNN [15]. The top positive features included terms such as 'heroin', 'opiates', 'drug abuse', and 'polysubstance abuse'. However, the CNN is a non-linear model with many potential interactions and combinations of concepts, so external validation is vital prior to deployment. The previously developed CUI-based opioid classifier is accessible at https://github.com/AfsharJoyceInfoLab/OpioidNLP_Classifier.
We aimed to externally validate our opioid classifier against manual screening in an independent health system (Rush University Medical Center) that has conducted hospital-wide screening of all hospital admissions since 2017. We hypothesized that our opioid classifier would provide sensitivity and specificity above 80%.

Source of data and participants
Rush University Medical Center (Rush) is a 727-bed hospital, tertiary care academic center serving Chicago with Epic (Epic Systems Corporation, Verona, Wisconsin) as its EHR vendor. Rush launched a multidisciplinary Substance Use Intervention Team (SUIT) to address the opioid epidemic through a Screening, Brief Intervention, and Referral to Treatment (SBIRT) program with an inpatient Addiction Consult Service in October 2017 [16]. Part of the SUIT program included the following single question universal drug screen: "How many times in the past year have you used an illegal drug or used a prescription medication for non-medical reasons?" (> 1 is positive). The single-question screen was administered by nursing staff to patients admitted to Rush's 18 inpatient medical and surgical wards. Patients with a positive universal screen were referred for a full screen with the 10-item Drug Abuse Screening Test (DAST-10) [17].
The inclusion criteria were all unplanned adult inpatient encounters (patients ≥ 18 years of age) screened between October 23, 2017 and December 31, 2019. Unplanned admissions were defined using the Centers for Medicare & Medicaid Services (CMS) rules for unplanned admission [18]. Outpatient encounters and discharges from the ED were excluded, as were patients who did not receive a universal screen and/or DAST-10. The original development and internal validation cohort for training the NLP classifier was from Loyola University Medical Center using a sampling of hospitalized patients with an over-sampling of individuals with risk factors for opioid misuse [15].

Reference standard: outcome for testing opioid misuse classifier
Reference cases of opioid misuse for testing against the machine learning algorithm were determined using the DAST-10 [19]. We used a cutoff score of ≥ 2 for a positive screen for any substance misuse, which has been shown to have favorable sensitivity and specificity in identifying substance misuse in healthcare settings [17]. The type of substance use, including opioid misuse, was also collected in patients with a positive DAST-10. Opioid misuse was defined as patients with a DAST ≥ 2 and taking an opioid for reasons other than prescribed or as an illicit drug [20]. The final labels of positive cases in the reference cohort included patients with opioid misuse, either alone or in combination with other drugs.
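The labelling rule described above can be sketched as a small function. This is an illustration only, not the study's code: the function name, the argument names, and the category string "opioid" are hypothetical, and only the cutoff of ≥ 2 comes from the text.

```python
from typing import Iterable

DAST_CUTOFF = 2  # a DAST-10 score >= 2 counts as a positive screen (from the text)


def is_reference_opioid_misuse(dast_score: int, substances: Iterable[str]) -> bool:
    """Return True when the screen is positive and opioids are among the substances.

    `substances` is a hypothetical list of substance categories recorded for
    patients with a positive DAST-10; other drugs may be listed alongside opioids.
    """
    positive_screen = dast_score >= DAST_CUTOFF
    reports_opioid = any(s.strip().lower() == "opioid" for s in substances)
    return positive_screen and reports_opioid
```

Under this sketch, an encounter with a positive screen and both cocaine and opioid use is still labelled a positive case, mirroring the "alone or in combination with other drugs" definition.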

Predictors from clinical notes
The manual screen data (e.g., questionnaire data) collected by hospital staff into EHR flowsheets were excluded from the extraction of predictors (i.e., features) to avoid any contamination of the reference data in the test dataset. Linguistic processing to extract all features from the clinical notes (i.e., admission note, progress note, consult note, ancillary notes, etc.) was performed using the clinical Text Analysis and Knowledge Extraction System (cTAKES; http://ctakes.apache.org) [21]. cTAKES recognizes words or phrases in text as medical terms and maps them to the National Library of Medicine's Unified Medical Language System (UMLS), which includes over 2 million clinical concepts merged into the National Library of Medicine Metathesaurus. The spans of the UMLS Metathesaurus named entity mentions (diseases, symptoms, anatomical sites, drugs, and procedures) were mapped from the Rush EHR clinical notes and organized into Concept Unique Identifiers (CUIs), which are structured codes derived from multiple medical vocabularies. For instance, 'heroin abuse' is assigned C0600241 as its CUI, which also includes eight other synonyms. 'Heroin abuse' is mapped to a different CUI from 'history of heroin abuse', which is C3266350. The classifier was fed all the CUI predictors/features as inputs into a CUI embedding that was analyzed by our previously trained convolutional neural network (CNN).
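To make the idea of feeding CUIs into an embedding concrete, the sketch below shows one common way to turn an encounter's CUIs into a fixed-length sequence of integer indices. It is an assumption-laden illustration, not the published pipeline: the padding index, the unknown-token index for CUIs not seen in training, the function names, and the CUI `C9999999` are all hypothetical.

```python
from typing import Dict, List

PAD_ID = 0  # assumed padding index
UNK_ID = 1  # assumed index for CUIs not seen during training


def build_vocab(training_cuis: List[str]) -> Dict[str, int]:
    """Assign each distinct training CUI an integer id, reserving 0 and 1."""
    vocab: Dict[str, int] = {}
    for cui in training_cuis:
        if cui not in vocab:
            vocab[cui] = len(vocab) + 2
    return vocab


def encode_encounter(cuis: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map an encounter's CUIs to ids, then truncate or right-pad to max_len."""
    ids = [vocab.get(cui, UNK_ID) for cui in cuis][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


# C0600241 ('heroin abuse') and C3266350 ('history of heroin abuse') are from the text;
# C9999999 is a made-up CUI standing in for a concept unseen in training.
vocab = build_vocab(["C0600241", "C3266350", "C0600241"])
encoded = encode_encounter(["C3266350", "C9999999", "C0600241"], vocab, max_len=5)
```

An embedding layer would then look up a vector for each index before the convolutional layers; how the original model handles unseen concepts is not stated here.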

Error analysis on misclassifications between the automated NLP opioid misuse classifier and the self-report manual screen
Post-hoc chart review was performed in cases where the NLP classifier was deemed a false-positive or false-negative against the manual screen. A trained annotator (SB) performed chart review to provide a final likelihood for opioid misuse using all the data available in the EHR. The annotator met an inter-rater reliability of > 0.80 with an addiction specialist (KP) before independent review was performed. The final likelihood for opioid misuse was recorded on a Likert scale of definite, highly probable, probable, definitely not, and uncertain. These criteria were developed by consensus using a Delphi approach between a board-certified clinical informatics specialist and internist, a board-certified addiction medicine specialist, and a board-certified psychiatrist [15]. Substance use characteristics and treatments were compared across Likert groups and displayed in Table 3.
Probable cases required any one of the following: (1) history of opioid misuse evident in the clinical notes but no current documentation for the encounter; (2) provider mention of aberrant drug behavior; (3) evidence of other drug misuse (except alcohol) in addition to prescription opioid use; (4) documented history of opioid misuse but in remission, thus remaining at-risk. Highly probable cases were classified by more than one of the probable case criteria, or provider mention of opioid dependence plus suspicion of misuse in the clinical notes. Definite cases were classified as the patient self-reporting opioid misuse to a provider or documentation by provider of patient currently misusing an opioid. The remainder of cases were categorized as no opioid misuse.
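The chart-review criteria above can be written as a simple decision rule. The sketch below is illustrative only: the function and argument names are hypothetical, and checking the definite criteria first is an assumed precedence that the text does not spell out.

```python
def chart_review_category(
    n_probable_criteria: int,
    dependence_with_suspicion: bool = False,
    self_report_to_provider: bool = False,
    provider_documents_current_misuse: bool = False,
) -> str:
    """Classify a reviewed encounter using the criteria described in the text.

    n_probable_criteria counts how many of the four 'probable' criteria are met.
    """
    # Definite: patient self-report to a provider, or provider documents current misuse.
    if self_report_to_provider or provider_documents_current_misuse:
        return "definite"
    # Highly probable: more than one probable criterion, or dependence plus suspicion.
    if n_probable_criteria > 1 or dependence_with_suspicion:
        return "highly probable"
    # Probable: exactly one of the four criteria.
    if n_probable_criteria == 1:
        return "probable"
    # Remainder of cases.
    return "no opioid misuse"
```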

Analysis plan
Statistical tests to compare baseline patient characteristics between opioid misuse and no misuse groups were conducted using the chi-square test for proportions and Wilcoxon-Mann-Whitney nonparametric tests for integer variables. Comorbid conditions were defined with International Classification of Diseases (ICD) codes based on the Elixhauser comorbidity categories [22]. Missing data analysis was performed to compare the manually screened hospitalizations to the hospitalized group that did not receive screening (Additional file 1: Table S1).
The primary outcome was discrimination of the opioid classifier for identifying opioid misuse versus no opioid misuse as measured by the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Precision-Recall Curve (PR Curve). The PR Curve is a better discrimination metric for unbalanced datasets [23]. The following test characteristics and their 95% confidence intervals (CI) were reported: sensitivity, specificity, negative predictive value (NPV), and positive predictive value (PPV). An optimal cut-point was derived by examining a range of cut-points, including the Youden index [24]. Sensitivity analysis was performed by running the classifier using only the first 24 h of notes to better reflect its potential use as an inpatient screening tool.
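As a minimal sketch of the metrics named above, the following pure-Python functions compute the test characteristics at a given cut-point, pick the cut-point that maximizes the Youden index (sensitivity + specificity − 1), and estimate the AUROC by pairwise comparison. The function names and the toy data are hypothetical and are not study results; confidence intervals are omitted.

```python
from typing import Dict, Optional, Sequence


def _ratio(num: int, den: int) -> float:
    """Return num / den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")


def screening_metrics(y_true: Sequence[int], y_prob: Sequence[float], cutoff: float) -> Dict[str, float]:
    """Sensitivity, specificity, PPV and NPV when predictions >= cutoff are positive."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted_positive = p >= cutoff
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


def youden_cutoff(y_true: Sequence[int], y_prob: Sequence[float]) -> Optional[float]:
    """Return the observed probability that maximizes sensitivity + specificity - 1."""
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(y_prob)):
        m = screening_metrics(y_true, y_prob, cutoff)
        j = m["sensitivity"] + m["specificity"] - 1.0
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff


def auroc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Probability that a random positive scores above a random negative (ties count half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Toy data for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]
best = youden_cutoff(y_true, y_prob)
metrics_at_best = screening_metrics(y_true, y_prob, best)
```

The pairwise AUROC is quadratic in the number of encounters, so a rank-based implementation would be preferable for a cohort of this size.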
The adjustment of models across settings with different prevalence rates, so-called model updating or recalibration, is recommended by the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) guidelines [25] to avoid over- or underestimation of a patient's risk. We anticipated these issues may occur in external validation and provided results for both uncalibrated and calibrated models. Calibration plots were examined to assess the reliability and agreement of the classifier predictions against the reference standard. Calibration was formally assessed by the calibration slope, intercept, and visually with a calibration plot. A non-parametric regression with isotonic calibration was used to account for the decrease in prevalence. Isotonic calibration provides a piecewise linear model to predict the sequences of observations that preserves the order as a monotonic function for uncalibrated estimates from our model [26]. Analysis was performed using Python Version 3.6.5 (Python Software Foundation) and RStudio Version 1.1.463 (RStudio Team, Boston, MA). The Institutional Review Board of Rush approved this study. We followed the 2015 guideline for Transparent Reporting of a multivariable Prediction Model for Individual Prognosis Or Diagnosis (TRIPOD): Prediction Model Validation Checklist (Additional file 1: Table S3).
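To illustrate what isotonic calibration does, the sketch below fits a non-decreasing map from uncalibrated scores to observed event rates with the pool-adjacent-violators algorithm and applies it by linear interpolation. It is a stand-alone illustration under stated assumptions, not the study's implementation (which is not described in this section); the function names and toy data are hypothetical. Libraries such as scikit-learn offer an `IsotonicRegression` estimator for the same purpose.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence, Tuple


def fit_isotonic(scores: Sequence[float], labels: Sequence[int]) -> Tuple[List[float], List[float]]:
    """Pool-adjacent-violators fit; returns knots (xs) and non-decreasing values (ys)."""
    # Aggregate tied scores so each distinct score has one label sum and one count.
    totals: Dict[float, List[float]] = {}
    for x, y in zip(scores, labels):
        t = totals.setdefault(x, [0.0, 0.0])
        t[0] += y
        t[1] += 1
    # Each block is [label_sum, count, lowest_score, highest_score].
    blocks: List[List[float]] = []
    for x in sorted(totals):
        s, n = totals[x]
        blocks.append([s, n, x, x])
        # Merge backwards while the previous block's mean exceeds the newest block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s2, n2, _, hi = blocks.pop()
            blocks[-1][0] += s2
            blocks[-1][1] += n2
            blocks[-1][3] = hi
    xs: List[float] = []
    ys: List[float] = []
    for s, n, lo, hi in blocks:
        xs.append(lo)
        ys.append(s / n)
        if hi > lo:  # keep both ends of a pooled block so it stays flat
            xs.append(hi)
            ys.append(s / n)
    return xs, ys


def calibrate(score: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Piecewise-linear interpolation of the fitted map, clipped at the ends."""
    if score <= xs[0]:
        return ys[0]
    if score >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, score)  # xs[i - 1] <= score < xs[i]
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (score - x0) / (x1 - x0)


# Toy data for illustration only.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
labels = [0, 0, 0, 1, 0, 0, 1, 1]
xs, ys = fit_isotonic(scores, labels)
```

In the toy data, the scores 0.4 to 0.6 are pooled into one block with an observed rate of one third, so an uncalibrated score of 0.5 is mapped down to about 0.33 while the ordering of predictions is preserved.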

Results
During the study period, there were 82,881 unplanned adult hospitalizations and DAST screening for substance misuse was completed in 67.8% (n = 56,227) with 1.1% (n = 628) of the screened cohort identified with opioid misuse (Fig. 1). The cohort that did not have screening data recorded in the EHR was similar in demographics to the cohort with screening data (Additional file 1: Table S1). Comparisons were made between self-report opioid misuse (via questionnaire) and no misuse (Table 1). A lower proportion of comorbidities and substance misuse by ICD codes and a higher proportion of in-hospital death were found in the group without opioid misuse versus the group with opioid misuse by manual screen. The median age of patients with opioid misuse was younger than that of patients without misuse, and a greater proportion of patients with opioid misuse were male and non-Hispanic black (p < 0.01) (Table 1). A greater proportion also had chronic lung disease, depression, polysubstance drug use, and psychiatric conditions (p < 0.01). More patients with opioid misuse were on Medicaid and were discharged against medical advice than patients without misuse (Table 1). [...] all deciles of predicted probabilities; therefore, the model was calibrated for the lower prevalence in our hospitalized cohort and provided a better model fit (Fig. 2, Table 2).
Error analysis by chart review of the uncalibrated model identified 1.9% (n = 1091) misclassifications between the opioid classifier and manual screening. In 99% (n = 1081) of the discordant cases between the NLP classifier and the self-report manual screen reference standard, the NLP classifier labelled cases as positive but the self-report manual screens were negative. However, 49.3% (n = 533) of these cases were noted to have at least a probable likelihood for opioid misuse after post-hoc chart review by the annotator, suggesting the NLP classifier correctly labelled the cases as positive and underreporting occurred during the self-report manual screen (Table 3). Of the cases with at least a probable likelihood, 64.7% (n = 345) were determined by the reviewer to be true positives because of prior evidence for misuse in the EHR notes. As the likelihood for opioid misuse increased on the Likert scale by the chart reviews, the predicted probability of the opioid classifier increased as well (Table 3).
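The percentages in the error analysis follow directly from the reported counts, as the short check below shows. The variable names are hypothetical; every number is taken from the text.

```python
screened = 56_227            # hospitalizations with completed manual screening
misclassified = 1_091        # discordant cases between classifier and manual screen
false_positives = 1_081      # classifier positive, manual screen negative
at_least_probable = 533      # false positives rated at least 'probable' on chart review
prior_evidence = 345         # of those, judged true positives from prior evidence in notes

pct_misclassified = 100 * misclassified / screened              # about 1.9%
pct_false_positive = 100 * false_positives / misclassified      # about 99%
pct_at_least_probable = 100 * at_least_probable / false_positives  # about 49.3%
pct_prior_evidence = 100 * prior_evidence / at_least_probable   # about 64.7%

# False positives that chart review did not re-label as at least probable.
remaining_false_positives = false_positives - at_least_probable
```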

Discussion
Our opioid misuse classifier had good discrimination and calibration in external validation in a cohort of hospitalized patients, and it provided a sensitivity and specificity above 95% using the full encounter of notes. Limiting the data to the first 24 h of the hospital encounter, which accounted for 25% of the clinical notes, led to a small drop in performance but continued to demonstrate a sensitivity and specificity at 75% or above. Error analysis revealed that under-reporting is common, with many of the false-positives being deemed true-positives based on chart review. Our model may provide a comprehensive and automated approach to opioid misuse identification that may augment current workflow and potentially overcome the current manual screening rate of 68%. Currently, single-question screens for drug use represent universal screening tools supported by national practice guidelines [19,27]. Published results demonstrate 82% sensitivity and 74% specificity for illicit or nonmedical prescription drug misuse [5]. Our opioid misuse classifier achieved similar results given a large available clinical narrative, including sensitivity analysis with the first 24 h of notes. In 2016, there were about 35.7 million hospital stays with a mean length of stay of 4.6 days [28], leaving ample time for the classifier to also be used for screening and providing an intervention after the first 24 h of the encounter.
The United States Preventive Services Task Force (USPSTF) emphasizes screening tools that do not include drug testing [19,29,30]. The USPSTF conducted a systematic review and identified 30 different screening tools, often with a sensitivity of more than 75% for detecting substance misuse, and reported that most studies used structured clinical or diagnostic interviews [31]. Post-hoc chart review of the misclassifications by the NLP opioid misuse classifier against the manual self-report questionnaire data showed that nearly half of the false-positives by the NLP classifier were re-labelled as true positives after in-depth chart review across the patient's hospital encounter. The discrepancy between self-report and clinical documentation possibly reflects underreporting to the screener or missing information not captured in the structured interviews but available in the provider notes. This highlights the value of notes for additional information that may not be captured in self-report. Poor screening and treatment options have led to less than a quarter of patients with opioid misuse receiving treatment, suggesting we need better approaches to identify and treat patients [32,33]. A systematic review on automatable algorithms for opioid misuse revealed the data used in many published algorithms are not routinely available in the EHR, or some algorithms rely solely on diagnostic billing codes which have poor sensitivity [34]. To date, the best-performing algorithms depend on pharmacy claims data, which are not available in EHRs and are therefore impractical for providers and hospitals [35][36][37]. There is little direct evidence to demonstrate the application of NLP and machine learning in routinely collected EHR notes to identify patients with opioid misuse. Validation of our opioid misuse classifier enables a standardized approach to perform screening on patient encounters.
This study is a step toward a more automated screening tool that can potentially overcome the current screening rate of 68%. In addition, the NLP classifier may identify additional cases missed by the DAST, which accounted for another 533 positive cases during post-hoc chart review. Because the tool is derived from notes collected during routine care, it may also benefit health systems that do not have mature screening programs with customized data entry.
The descriptive statistics about our cohort support the classifier's performance from the notes. Many of the patients identified by our opioid classifier also had ICD codes for drug misuse with high rates of mental health conditions and alcohol use disorders which are risk factors associated with opioid misuse [35,38,39]. In our chart review of over 1000 encounters, those with a higher likelihood for opioid misuse by the human reviewer also had, on average, a greater predicted probability for opioid misuse by the classifier.
The prevalence of opioid misuse in our health system of 1.1% was similar to other reports in hospitalized patients [40,41]. A national study from the National Emergency Department Sample with data from over 234 million adult ED visits found 1.23% of all visits were related to opioid-related diagnoses (opioid use, dependence, withdrawal, and other related conditions with opioid use such as mental and behavioral disorders associated with opioid use) [42]. In our original development study for the opioid classifier [15], we derived a source cohort in which approximately a third had opioid misuse, which led to overprediction by our classifier when applied at Rush University Medical Center, which has a lower case-rate. Calibration is frequently under-reported for published models, yet poor calibration is a major reason that models fail to perform well in external settings and to capture appropriate risk among groups [25,43]. As the severity of the opioid crisis varies over time and by region, prediction performance may also change across hospitals and the populations they serve. This phenomenon is well-described and has been labelled as calibration drift [25]. Continually training a new model is not feasible because it is time-consuming, requires abundant data, and wastes potentially useful information from existing models [44]. Therefore, updating trained models is the appropriate alternative and we demonstrate that calibration of our model to account for changes in case-rates and setting improved the PPV from approximately 37% to 71%. The higher PPV confers a lower number needed to evaluate and is more effective at limiting false alerts to reduce alarm fatigue.
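The link between case-rate, PPV, and number needed to evaluate can be made explicit with Bayes' rule. The sketch below is illustrative: the function names are hypothetical, and applying the development-cohort sensitivity (79%) and specificity (91%) at two prevalences is a thought experiment, not a result from this study. Only the PPV values of 37% and 71% and the prevalences of one third and 1.1% come from the text.

```python
def ppv_from_prevalence(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from Bayes' rule at a fixed operating point."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def number_needed_to_evaluate(ppv: float) -> float:
    """Average number of flagged encounters reviewed per true case found."""
    return 1.0 / ppv


# Same operating point, two case-rates: about one third (source cohort) vs 1.1% (this cohort).
ppv_high_prevalence = ppv_from_prevalence(0.79, 0.91, 1 / 3)
ppv_low_prevalence = ppv_from_prevalence(0.79, 0.91, 0.011)

# Number needed to evaluate before and after calibration, from the reported PPVs.
nne_uncalibrated = number_needed_to_evaluate(0.37)
nne_calibrated = number_needed_to_evaluate(0.71)
```

With the reported PPVs, roughly 2.7 flagged encounters would need review per true case before calibration, compared with about 1.4 afterwards.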
Several limitations for application of the opioid misuse classifier exist. There remains a paucity of evidence on the benefits and harms of screening, so the role for automated algorithms to improve health outcomes remains unclear. Quasi-experimental designs, such as an interrupted time-series to evaluate the replacement of current practice with automated tools, are next steps in evaluating the effectiveness of the NLP classifier. We lend credence to the predictive and face validity of the tool, but prospective designs are needed for causal inference on health outcomes. Current experiments for deployment of the opioid misuse classifier are registered in clinicaltrials.gov (NCT03833804). Receipt of motivational interviewing, initiation of buprenorphine, and re-hospitalization are among the outcomes of interest. Further, the capital costs for informatics teams at health systems to process clinical notes at point-of-care are substantial [45] and cost analyses are needed to evaluate the resource allocation needs for machine learning algorithms.
Our health system's screening system with the DAST was among the recommended instruments by the National Institute on Drug Abuse (NIDA) [46]. However, it is not the gold standard and has been largely evaluated in psychiatric outpatient settings with little data on its predictive validity in hospitalized patients [17]. Other instruments have reported similar or better sensitivity, including the Tobacco, Alcohol, Prescription medications, and other Substance (TAPS) tool or the World Health Organization World Mental Health Composite International Diagnostic Interview [46,47]. Although our tool identified over 500 potential cases not detected by the manual screen, there were also approximately another 500 cases that were confirmed to be false-positives. Mislabeling individuals with opioid misuse can be highly stigmatizing and additional bias and equity assessments are needed prior to deployment. Treatment may vary across individuals with different levels of misuse, such as unhealthy use versus substance use disorders, but our data did not allow such differentiation to be analyzed. Clinical trials are needed to examine the benefit of machine learning algorithms over existing screening methods and their efficacy in hospitalized patients.
In conclusion, in external validation of our opioid classifier, we demonstrate high accuracy after calibration for identifying hospitalized patients with opioid misuse. An automated NLP algorithm using routinely gathered EHR data may help health systems provide comprehensive screening for targeted interventions.
Abbreviations PHI: Protected health information; CUI: Concept unique identifier; NLP: Natural language processing; EHR: Electronic health record; ICD: International classification of diseases; cTAKES: Clinical text analysis and knowledge extraction system; UMLS: Unified medical language system; ROC AUC: Receiver operating characteristic area under the curve; PPV: Positive predictive value; NPV: Negative predictive value; CNN: Convolutional neural network.