import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import re
from sentence_transformers import SentenceTransformer, util
np.random.seed(1)
model = SentenceTransformer('all-MiniLM-L6-v2')

# load raw data
falls = pd.read_csv("../../../data/falls/falls.csv")
neis = pd.read_csv("../../../data/falls/neis.csv")

# process datetime
falls['treatment_date'] = pd.to_datetime(falls['treatment_date'])
Natural Language Processing and Deep Learning
There’s no question that natural language processing (NLP) facilitated by deep learning has exploded in popularity (much of it popularized by the ChatGPT family of models). This is an exciting time to be involved in AI and machine learning. However, for the kinds of tasks I typically work on in my day job, a lot of the deep learning models don’t provide much benefit. In fact, for most tabular data problems, random forests + boosting tend to work incredibly well. Areas where deep learning excels, like unstructured text or image input, are not things I find myself working on. That being said, I am always sharpening my skills and dipping my toes into areas with which I am least familiar.
A huge advantage today, compared to even ten years ago, is the ecosystem of open data and pre-trained models. HuggingFace in particular has a lot of easily obtainable pre-trained models, and libraries like Transformers make it easy for a neophyte like me to hop in and start doing work without too much overhead.
Predicting Elderly Falls from Medical Narratives
For this example I am going to rely on some data from DrivenData, an organization that hosts data competitions. The data here are verified fall events for adults aged 65+. This sample comes more broadly from the National Electronic Injury Surveillance System (NEISS). This is useful because the cases here are human-verified falls, so we have a source of truth. You could probably get pretty far with a simple regex like str.match("FALL|FELL|SLIPPED"), but it would likely miss more subtle cases. This is where having something like a BERT model is useful.
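Just to make that baseline concrete, here is a minimal sketch of what a keyword screen could look like. The pattern and the use of pandas' str.contains are my own illustrative choices; the raw narratives are assumed to live in the Narrative_1 column, as they do later in this post.

# crude keyword baseline: flag narratives that mention an obvious fall term
# anywhere in the text (pattern is illustrative, not exhaustive)
fall_pattern = "FALL|FELL|SLIPPED"
keyword_hits = neis[neis['Narrative_1'].str.contains(fall_pattern, case=False, na=False)]

Anything the embedding model surfaces that a pattern like this misses is exactly the kind of subtle case we are after.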
Let’s say we have a set of verified falls narratives (which we do) and we have a large set of miscellaneous narratives that contain falls cases, as well as other injuries that are not falls. Our goal is to find narratives that are likely to be related to elderly fall cases. To do this, we will use the verified falls narratives from DrivenData as our “training data,” so to speak, and we will use an NLP model to find cases that are semantically similar to these verified falls cases.
Data Setup
To get set up we read in the verified falls narratives, as well as the full sample of NEIS cases from 2022. After reading in our data we can perform some minor cleaning of the narratives. Specifically, because we want to isolate narrative characteristics associated with falls, we should exclude the leading information about the patient’s age and sex, as well as some other medical terminology. We can also remap some abbreviations to plain English and properly extract the patient’s age from the narrative.
# define remappings of abbreviations
# and strings to remove from narratives
remap = {
    "FX": "FRACTURE",
    "INJ": "INJURY",
    "LAC": "LACERATION",
    "CONT": "CONTUSION",
    "CHI": "CLOSED HEAD INJURY",
    "ETOH": "ALCOHOL",
    "SDH": "SUBDURAL HEMATOMA",
    "NH": "NURSING HOME",
    "PT": "PATIENT",
    "LT": "LEFT",
    "RT": "RIGHT",
    "&": " AND "
}

str_remove = "YOM|YOF|MOM|MOF|C/O|S/P|H/O|DX"

def process_text(txt):
    # expand abbreviations word by word
    words = txt.split()
    new_words = [remap.get(word, word) for word in words]
    txt = " ".join(new_words)

    # drop non-alphabetic characters, then the medical shorthand
    txt = re.sub("[^a-zA-Z ]", "", txt)
    txt = re.sub(str_remove, "", txt)

    # strip any leading whitespace left behind
    return re.sub(r"^\s+", "", txt)

def narrative_age(string):
    # narratives typically lead with the patient's age (e.g. "87YOF ...")
    age = re.match(r"^\d+", string)

    if not age:
        age = 0
    else:
        age = age[0]

    return age
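As a quick sanity check on narrative_age: it returns the leading digits as a string (or 0 when there is no age prefix), which is why we cast the result to an integer when applying it.

# leading digits come back as a string, hence the .astype(int) below
narrative_age("87YOF HAD A FALL TO THE FLOOR AT THE NH")   # '87'
narrative_age("PRESENTS WITH HEAD INJURY AFTER A FALL")    # 0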
We then apply these to our verified falls data and our raw NEIS data from 2022:
# process narrative text and extract patient age from narrative
falls['processed_narrative'] = falls['narrative'].apply(process_text)
neis['processed_narrative'] = neis['Narrative_1'].apply(process_text)

falls['narrative_age'] = falls['narrative'].apply(narrative_age).astype(int)
neis['narrative_age'] = neis['Narrative_1'].apply(narrative_age).astype(int)

# neis cases are from 2022, remove from verified falls
falls = falls[falls['treatment_date'] < "2022-01-01"]

# filter narrative ages to 65+
falls = falls[falls['narrative_age'] >= 65]
neis = neis[neis['narrative_age'] >= 65]
We can see that our coding changes the narratives subtly. For example this string:
falls['narrative'][15]
'87YOF HAD A FALL TO THE FLOOR AT THE NH STRUCK BACK OF HEAD HEMATOMA TO SCALP'
Is changed to this:
falls['processed_narrative'][15]
'HAD A FALL TO THE FLOOR AT THE NURSING HOME STRUCK BACK OF HEAD HEMATOMA TO SCALP'
This minimal amount of pre-processing should help the model identify similar cases without being affected by too much extraneous information. In addition, because the typical model has a vocabulary of only about 30,000 tokens, we want to avoid abbreviations that are likely to be absent from the model dictionary.
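One way to see this directly is to peek at the tokenizer. This assumes the wrapped Hugging Face tokenizer is exposed as model.tokenizer (as it is for standard sentence-transformers models); a raw abbreviation like SDH tends to get chopped into subword pieces that carry little of the intended meaning, unlike the phrase it stands for.

# compare how an abbreviation versus its expansion gets tokenized
# (assumes the underlying tokenizer is available as model.tokenizer)
print(model.tokenizer.tokenize("SDH AT THE NH"))
print(model.tokenizer.tokenize("SUBDURAL HEMATOMA AT THE NURSING HOME"))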
Implementing the Transformer model
We can grab all of our verified fall narratives as well as a random sample of narratives from the 2022 NEIS data. Below we’ll take a sample of 250 cases and run them through our model.
N = 250
idx = np.random.choice(neis.shape[0], N, replace=False)

fall_narrative = np.array(falls['processed_narrative'])
neis_narrative = np.array(neis['processed_narrative'])[idx]
We take the processed narratives and convert them to embeddings using the pre-trained sentence transformer:
embed_train = model.encode(fall_narrative)
embed_test = model.encode(neis_narrative)
We then compute the cosine similarity between the two sets of embeddings. What we end up with is, for each NEIS narrative, a measure of how close it sits to the verified fall cases. Narratives with lower similarity should be less likely to contain information about elderly fall cases.
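For reference, the cosine similarity between two embedding vectors \(u\) and \(v\) is just their normalized dot product, so it depends only on the angle between them and not on their magnitudes:

\[
\text{cos\_sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
\]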
cos_sim = util.cos_sim(embed_test, embed_train)
For simplicity we average each NEIS narrative’s similarity to the verified falls and scale the scores between 0 and 1, so that 1 is most similar and 0 is least similar. We can then just compare the rank-ordered narratives.
dists = cos_sim.mean(1)
d_min, d_max = dists.min(), dists.max()

dists = (dists - d_min)/(d_max - d_min)
dists = np.array(dists)

out = dict(zip(neis_narrative, dists))
Plotting a histogram of the minmax scaled cosine similarity scores shows a lot of narratives that are very similar and a long tail of those that are not so similar. Of course, there isn’t a single cut point of what we would consider acceptable for classification purposes, but we could certainly use these scores in a regression to determine a suitable cut point if we were so interested (a rough sketch of that idea follows the plot below).
cparams = {
    "axes.spines.left": False,
    "axes.spines.right": False,
    "axes.spines.top": False,
    "axes.spines.bottom": False,
    "grid.linestyle": "--"
}

sns.set_theme(style="ticks", rc=cparams)

sns.histplot(dists, color="#004488")
plt.xlabel("Cosine Similarity (minmax scaled)")
plt.show()
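As a sketch of the regression idea mentioned above: if we had a labeled 0/1 flag marking which of the sampled NEIS narratives are true fall cases (we don’t here, so labels below is purely hypothetical), a simple logistic regression on the scaled similarity score would give us a probability curve we could threshold.

# hypothetical sketch: `labels` is an imagined 0/1 array marking true fall
# cases for the sampled narratives -- we do not actually have these labels
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(dists.reshape(-1, 1), labels)

# predicted probability of being a fall case at each similarity score
p_fall = clf.predict_proba(dists.reshape(-1, 1))[:, 1]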
Results
Time to actually see the results. Our results are stored in a dictionary which allows us to just pull narratives by similarity score. Let’s test it out by looking at the top 10 most similar NEIS narratives:
sorted(out, key=out.get, reverse=True)[:10]
['FELL TO THE FLOOR AT THE NURSING HOME CLOSED HEAD INJURY',
'FELL ON THE FLOOR CLOSED HEAD INJURY',
'WAS AT THE NURSING HOME AND SLIPPED AND FELL TO THE FLOOR STRIKING HIS HEAD SCALP LACERATION',
'PRESENTS AFTER A FALL WHILE WALKING ACROSS THE LIVING ROOM AND HE TRIPPED AND FELL TO THE FLOOR REPORTS HE HIT HIS HEAD AND LEFT SHOULDER ON A RUG FALL CLOSED FRACTURE OF CERVICAL VERTEBRA',
'FELL TO THE FLOOR STRUCK HEAD CLOSED HEAD INJURY ABRASION TO KNEES',
'WAS GETTING INTO BED AND FELL TO THE FLOOR ONTO HEAD CLOSED HEAD INJURY',
'FELL ON FLOOR AT NH INJURY AND BODY PATIENT NS FALL',
'FELL BACKWARDS FROM STEPS CLOSED HEAD INJURY ABRASION HAND',
'PRESENTS WITH HEAD INJURY AFTER A FALL REPORTS HE WAS FOUND ON THE FLOOR IN A RESTAURANT AFTER HE SLIPPED AND FELL HITTING HIS HEAD INJURY OF HEAD',
'WENT TO SIT DOWN AND SOMEONE MOVED HER CHAIR AND SHE FELL BACKWARDS HITTING HER HEAD ON THE FLOOR FALL BLUNT HEAD TRAUMA TAIL BONE PAIN']
And the 10 least similar narratives:
sorted(out, key=out.get, reverse=False)[:10]
['SYNCOPAL EPISODE WHILE FOLDING NS CLOTHINGSYNCOPE',
'WAS COOKING SOME SALMON AND THEN SPRAYED A AEROSOL DEODORANT DUE TO THE SMELL WHICH CAUSED HER TO FEEL THAT SOMETHING WAS STUCK IN HER THROAT FOREIGN BODY SENSATION IN THROAT',
'WAS PLAYING GOLF AND DEVELOPED AMS AND PASSED OUT SYNCOPE',
'CO STABBING RIGHT CHEST PAIN RADIATES TO HER BACK SHORTNESS OF BREATH AFTER HER HHA WAS MOPPING THE FLOOR LAST NIGHT W STRONGPOTENT CLEANING AGENT THAT TRIGGERED HER ASTHMA CHEST PAIN ASTHMA',
'WITH FISH HOOK IN HIS RIGHT INDEX FNIGER HAPPENED AT A LAKE FB RIGHT INDEX FINGER',
'CUT THUMB WITH BROKEN BOTTLE NO OTHER DETAILS LWOT NO ',
'EXERCISING FELT PAIN IN RIGHT LOWER LEG LOWER LEG PAIN',
'CO LEFT SIDED CHEST PAIN FOR THE PAST THREE DAYS AFTER WORKING OUT AT THE GYM LEFT PECTORALIS MUSCLE STRAIN',
'PRESENTS AFTER BEING IN A ROOM FILLED WITH SMOKE FOR HOURS AFTER THERE WAS A FIRE IN HER NEIGHBORS APARTMENT UNKNOWN IF FIRE DEPARTMENT INVOLVED SMOKE INHALATION PAIN IN THROAT ELEVATED TROPONIN',
'ON FOR AF WAS WASHING DISHES AND SLASHED ARM ON A KNIFE LACERATION OF RIGHT FOREARM']
So in general, it did a pretty good job. The most similar cases are all clearly related to falls, while the least similar ones are a mix of other injuries. While I don’t have any formal tests here (coming soon!), I suspect this does better than very simple regex queries, if only because it can find similarities without needing to match on specific strings.
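One quick, informal way to probe that hunch (not a proper evaluation, and the keyword pattern and cutoff here are arbitrary choices of mine) is to count how many of the highest-scoring narratives the simple pattern would have missed entirely:

# rough check: of the 25 highest-scoring narratives, how many contain
# no obvious fall keyword at all? (pattern and cutoff are illustrative)
top_hits = sorted(out, key=out.get, reverse=True)[:25]
sum(1 for n in top_hits if not re.search("FALL|FELL|SLIP", n))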
Singular Queries
We can extend this model a bit and create a small class that takes a single query in and returns the \(K\) most similar narratives. Below, we bundle our functions into a NarrativeQuery class. After encoding the narratives we can provide query strings to find semantically similar ones.
class NarrativeQuery:
    def __init__(self, narrative):
        self.narrative = narrative
        self.narrative_embedding = None
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    def encode(self):
        # embed the full set of narratives once up front
        self.narrative_embedding = self.model.encode(self.narrative)

    def search_narrative(self, query, K = 5):
        # embed the query and return the K most similar narratives
        embed_query = self.model.encode(query)

        query_out = self.cos_sim(self.narrative_embedding, embed_query)

        return sorted(query_out, key=query_out.get, reverse=True)[:K]

    def cos_sim(self, embed, embed_query):
        # minmax-scaled cosine similarity between each narrative and the query
        cs = util.cos_sim(embed, embed_query)

        dists = cs.mean(1)
        d_min, d_max = dists.min(), dists.max()

        dists = (dists - d_min)/(d_max - d_min)
        dists = np.array(dists)

        return dict(zip(self.narrative, dists))
This sets it up:
FallsQuery = NarrativeQuery(neis_narrative)
FallsQuery.encode()
…and this performs the search. Here we’re just looking for narratives where a person slipped in a bathtub.
="SLIPPED IN BATHTUB", K = 10) FallsQuery.search_narrative(query
['SLIPPED AND FELL IN THE SHOWER LANDING ONTO BUTTOCKS CONTUSION TO BUTTOCKS',
'SLIPPED ON FLOOR AND FELL AT HOME FALL',
'PRESENTS AFTER A SLIP AND FALL IN THE TUB STRIKING HER HEAD ON THE WALL SYNCOPE FALL HEAD STRIKE',
'FELL IN THE SHOWER FRACTURED UPPER BACK',
'PATIENT FELL IN THE SHOWER AND HIT HER HEAD AND HER LEFT ELBOW LACERATION OF SCALP WITHOUT FOREIGN BODY STRUCK BATH TUB WITH FALL ABRASION OF LEFT ELBOW',
'WAS WALKING FROM THE BATHROOM TO THE BEDROOM AND PASSED OUT FALLING TO THE FLOOR CAUSING A SKIN TEAR TO HIS LEFT ELBOW SKIN TEAR OF LEFT ELBOW',
'SLIPPED AND FELL IN FLOOR AT HOME R HIP FRACTURE',
'FELL IN THE SHOWER AT HOME TWISTING RIGHT KNEE RUPTURE RIGHT PATELAR TENDON',
'WEARING SOCKS SLIPPED AND FELLHEAD INJURYFX FEMUR',
'SLIPEPD AND FELL IN THE SHOWER STRUCK HEAD CLOSED HEAD INJURY CONTUSION TO LEFT HIP']
Now this is cool. Using the sentence transformer we are able to get passages that are similar in meaning to what we searched, without sharing the exact same language. For example, the search query is "SLIPPED IN BATHTUB" but we get results like "FELL IN THE SHOWER" and "SLIP AND FALL IN THE TUB". If we were looking specifically for passages related to falls in the bathtub these obviously make sense (many bathtubs double as showers).
Finally
Now, this probably isn’t news to most people who regularly work with language models. However, it is quite impressive that with a pre-trained model and very minimal pre-processing, you can obtain reasonable results off the shelf. I’ll definitely be keeping an eye on these models in the future and looking for ways they can improve my workflow.