An A/B Testing Approach to Prompt Refinement

Testing a text extraction model using ChatGPT

Python
Data Science Applications
Author

Gio Circo, Ph.D.

Published

February 24, 2025

There are synthesized depictions of self-harm and suicide in this blog post.

What’s up with Prompts?

I am still a relative novice with large language models (LLMs) when it comes to anything beyond trivial tasks (like asking ChatGPT to summarize an email). Applying these models to complex real-world problems is not nearly as simple. In fact, I think applying LLMs to business problems is still new enough that I am not sure how many people can claim bona fide expertise here. Regardless, in my work I have increasingly been asked to use LLMs to automate the processing of large volumes of free-form text and extract structured output. Luckily, this is one task that LLMs are actually well-suited for, unlike many of the very silly attempts to plug them in where they are clearly not useful.

In my ongoing work, I have learned a few things. One of the biggest is that LLMs are very sensitive to the prompt they are given. Often it comes down to wording, question structure, or even where the prompt is placed relative to the input. A few of the LLM companies, like Anthropic, even have some interesting suggestions, such as structuring long documents with XML tags to block off important sections. Outside of some of these more esoteric suggestions, I have found that common good practice for creating LLM prompts generally follows these rules:

  1. Be extremely specific
  2. Spell out tasks using bullets or numbered points
  3. Provide examples

This could be as simple as the difference between:

“Summarize this legal document”

or

“Summarize this housing contract in 3 to 5 sentences. Include a section with numbered bullet points for the costs.”

The first gives the LLM a lot of latitude to do whatever it thinks is right, while the second provides structure and key instructions. Generally, the less you let the LLM fill in the blanks on its own, the better your results will be.

Setting up a testing framework

With that in mind, let’s talk a bit about how I think about prompting.

My “mental model” of prompting

When I think of a prompt, I tend to conceptualize it as containing a few key pieces:

  1. A role, which defines the key tasks of the LLM and gives it some structure for its intended job.
  2. A header, which contains some initial instructions guiding the next steps in the prompt.
  3. A body, which holds the bulk of the instructions, text, and other materials that make up the primary tasks of the prompt.
  4. An example of what the intended output is supposed to look like. In a structured text extraction task, this would be the format of the json or other file I want created.
  5. An optional footer that contains some final instructions. Oftentimes I have found these useful for providing a list of things I definitively do NOT want the LLM to do.

A python framework

Applying this to an automated workflow in Python is actually pretty easy. Since we are just working with text prompts, all I need to do is create a system that concatenates the necessary pieces into a full prompt. Below I set up a class to help test multiple prompt versions more quickly than creating each one individually. My goal is to be able to test many different prompts, in different configurations, without having to manually copy, paste, and store each one. To accomplish this, I first built a small prompt-creation class:

Code
from itertools import product

class Prompt:
    def __init__(self):
        pass

    def prompt_concat(self, text_list):
        """Concat a list of text, dropping None values"""
        output_text = "\n"  .join(filter(None, text_list))
        output_text += "\n"

        return output_text

    def standard_prompt(
        self,
        header: str | list = None,
        narrative: str | list = None,
        body: str | list = None,
        example_output: str | list = None,
        footer: str | list = None,
        **kwargs
    ) -> list:
        """Create multiple standard prompts based on all combinations of list elements."""
        
        # Ensure all inputs are lists for consistent iteration
        params = [header, narrative, body, example_output, footer]
        param_lists = [[item] if not isinstance(item, list) else item for item in params]
        
        # unpack params, then pass to concat
        prompt_combinations = product(*param_lists)
        prompts = [self.prompt_concat(combination) for combination in prompt_combinations]
        
        return prompts
    
    def standard_prompt_caching(self,
        header: str | list = None,
        narrative: str | list = None,
        body: str | list = None,
        example_output: str | list = None,
        footer: str | list = None,
        **kwargs
    ) -> list:
        """Create multiple standard prompts based on all combinations of list elements.
        This puts the narrative at the end to support OpenAI prompt caching.
        """
        
        # Ensure all inputs are lists for consistent iteration
        params = [body, example_output, footer, header, narrative]
        param_lists = [[item] if not isinstance(item, list) else item for item in params]
        
        # unpack params, then pass to concat
        prompt_combinations = product(*param_lists)
        prompts = [self.prompt_concat(combination) for combination in prompt_combinations]
        
        return prompts
    
    def unstructured_prompt(self, prompt_text_list: list[str]) -> str:
        """Create an unstructured prompt, given a list of text"""
        return self.prompt_concat(prompt_text_list)

All this class really does is take text strings and paste them together according to my mental framework. The trick is that you can pass a list for any of the parameters, and itertools.product() then creates all possible combinations. For example, passing in 2 versions each of the header, body, and example output would give you \(2^3=8\) different prompts. I also add a “narrative” field, which serves as the input text for each narrative we will be extracting data from.
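As a quick illustration (using hypothetical placeholder strings rather than the real prompt text), the 2 x 2 configuration used later in this post would be assembled like so:

Code
# Hypothetical usage of the Prompt class with placeholder strings.
prompt_creator = Prompt()

prompts = prompt_creator.standard_prompt(
    header="You will read a narrative and return a json object.",
    narrative="V was discovered deceased at home...",
    body=["INSTRUCTIONS: version A...", "INSTRUCTIONS: version B..."],
    example_output=["EXAMPLE OUTPUT: version A...", "EXAMPLE OUTPUT: version B..."],
    footer=None,
)

# 1 header x 1 narrative x 2 bodies x 2 example outputs x 1 footer = 4 prompts
print(len(prompts))  # 4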

Classifying Youth Suicide Narratives

The example problem here is based on a competition that was hosted by the CDC last year on DrivenData. The stated goal of the contest was to extract structured information from free-text narratives derived from police and medical examiner reports. The free text reports look something like this simulated example below:

[Simulated Example]

V was a YYYY. V was discovered deceased at home from a self-inflicted gunshot wound to the head. No medical history was documented in the report. According to V’s close friend and coworker, V had struggled with periods of severe anxiety but had never spoken about self-harm. V’s friend mentioned that several years ago, V had driven recklessly after a personal loss, but it was unclear whether it was an intentional act. On the night of the incident, V had been drinking and sent a message to a relative expressing regret and affection for them and other family members. Toxicology results confirmed the presence of alcohol. No additional details regarding the event were available.

The contest required taking the short narratives and extracting 24 different variables based on definitions from the National Violent Death Reporting System (NVDRS). Most of these are binary indicators about specific behaviors observed in the narrative. Therefore, we need to create a prompt that instructs our LLM to read each narrative, extract features based on the rules for each of the 24 features, and return the output in a format we can use for scoring.

Prompt creation

As a demonstration of what this testing can look like, my idea is to run 4 different prompt versions on a sample of narratives and evaluate whether we can observe any improvement in performance based solely on the prompt. As a baseline, the body and example output of my primary prompt look like this:

BODY1 = """
INSTRUCTIONS:
    Closely follow these instructions:
        - For each variable below return a 0 for 'no' and a 1 for 'yes' unless otherwise stated.
        - If more than two answers are available, return ONE of the numbered values.
        - Rely ONLY on information available in the narrative. Do NOT extrapolate.
        - Return a properly formatted json object where the keys are the variables and the values are the numeric labels.
        - Do NOT return anything other than the label. Do NOT include any discussion or commentary.

VARIABLES:   
    DepressedMood: The person was perceived to be depressed at the time
    MentalIllnessTreatmentCurrnt: Currently in treatment for a mental health or substance abuse problem
    HistoryMentalIllnessTreatmnt: History of ever being treated for a mental health or substance abuse problem
    SuicideAttemptHistory: History of attempting suicide previously
    SuicideThoughtHistory: History of suicidal thoughts or plans
    SubstanceAbuseProblem: The person struggled with a substance abuse problem. This combines AlcoholProblem and SubstanceAbuseOther from the coding manual
    MentalHealthProblem: The person had a mental health condition at the time
    DiagnosisAnxiety: The person had a medical diagnosis of anxiety
    DiagnosisDepressionDysthymia: The person had a medical diagnosis of depression
    DiagnosisBipolar: The person had a medical diagnosis of bipolar
    DiagnosisAdhd: The person had a medical diagnosis of ADHD
    IntimatePartnerProblem: Problems with a current or former intimate partner appear to have contributed
    FamilyRelationship: Relationship problems with a family member (other than an intimate partner) appear to have contributed
    Argument: An argument or conflict appears to have contributed
    SchoolProblem: Problems at or related to school appear to have contributed
    RecentCriminalLegalProblem: Criminal legal problem(s) appear to have contributed
    SuicideNote: The person left a suicide note
    SuicideIntentDisclosed: The person disclosed their thoughts and/or plans to die by suicide to someone else within the last month
    DisclosedToIntimatePartner: Intent was disclosed to a previous or current intimate partner
    DisclosedToOtherFamilyMember: Intent was disclosed to another family member
    DisclosedToFriend: Intent was disclosed to a friend
    InjuryLocationType: The type of place where the suicide took place.
        - 1: House, apartment
        - 2: Motor vehicle (excluding school bus and public transportation)
        - 3: Natural area (e.g., field, river, beaches, woods)
        - 4: Park, playground, public use area
        - 5: Street/road, sidewalk, alley
        - 6: Other
    WeaponType1: Type of weapon used 
        - 1: Blunt instrument
        - 2: Drowning
        - 3: Fall
        - 4: Fire or burns
        - 5: Firearm
        - 6: Hanging, strangulation, suffocation
        - 7: Motor vehicle including buses, motorcycles
        - 8: Other transport vehicle, eg, trains, planes, boats
        - 9: Poisoning
        - 10: Sharp instrument
        - 11: Other (e.g. taser, electrocution, nail gun)
        - 12: Unknown
"""

EXAMPLE_OUTPUT1 = """
EXAMPLE OUTPUT:
{
    "DepressedMood": 1,
    "MentalIllnessTreatmentCurrnt": 0,
    "HistoryMentalIllnessTreatmnt": 0,
    "SuicideAttemptHistory": 0,
    "SuicideThoughtHistory": 0,
    "SubstanceAbuseProblem": 1,
    "MentalHealthProblem": 0,
    "DiagnosisAnxiety": 0,
    "DiagnosisDepressionDysthymia": 0,
    "DiagnosisBipolar": 0,
    "DiagnosisAdhd": 1,
    "IntimatePartnerProblem": 0,
    "FamilyRelationship": 0,
    "Argument": 1,
    "SchoolProblem": 1,
    "RecentCriminalLegalProblem": 0,
    "SuicideNote": 1,
    "SuicideIntentDisclosed": 0,
    "DisclosedToIntimatePartner": 0,
    "DisclosedToOtherFamilyMember": 0,
    "DisclosedToFriend": 0,
    "InjuryLocationType": 1,
    "WeaponType1": 5
}
"""

This prompt is basically a copy-paste of the instructions from the contest, with an example json object to ensure the LLM knows what to output. This is probably the absolute minimum you would need to get some useful output. That being said, there are at least two areas where I think we can improve this baseline prompt:

  1. Add detailed descriptions to each feature
  2. Add “few shot” examples

The first will have us spell out more precisely what we actually mean by each variable. For example, we might turn this:

DepressedMood: The person was perceived to be depressed at the time

into this:

DepressedMood: The person was perceived to be depressed at the time
    - 1: Specific signs of depression were noted in the narrative (e.g., sad, withdrawn, hopeless)
    - 0: No mention of depressive symptoms

The first is vague and gives the LLM a lot of freedom to guess what “perceived to be depressed” means. In contrast, the second asks it to mark the variable only if specific mentions of depression were noted in the narrative.

We can also add “few shot” examples to the prompt. This just means we provide the LLM with a few example narratives and their expected outputs to give it a better idea of what we want. Here is a synthesized example of what this might look like using a simulated narrative:

[Simulated Example]

Code
EXAMPLE_OUTPUT2 = """
Here is an example narrative and expected output:

EXAMPLE NARRATIVE 1:
Officers responded at 0745 hours to a report of a self-inflicted gunshot wound. The V was found in a bedroom with two acquaintances present, a gunshot wound to the left temple, and no exit wound. The V’s girlfriend stated he had been struggling with anxiety and depression, especially with the anniversary of his cousin’s suicide approaching. Before pulling the trigger, the V said, “You won’t believe me until I do it.” A .380 caliber firearm, a handwritten note, alcohol, and unidentified substances were found at the scene.

EXAMPLE OUTPUT 1:
{
    "DepressedMood": 1,
    "MentalIllnessTreatmentCurrnt": 0,
    "HistoryMentalIllnessTreatmnt": 1,
    "SuicideAttemptHistory": 0,
    "SuicideThoughtHistory": 1,
    "SubstanceAbuseProblem": 1,
    "MentalHealthProblem": 1,
    "DiagnosisAnxiety": 1,
    "DiagnosisDepressionDysthymia": 1,
    "DiagnosisBipolar": 0,
    "DiagnosisAdhd": 0,
    "IntimatePartnerProblem": 0,
    "FamilyRelationship": 0,
    "Argument": 0,
    "SchoolProblem": 0,
    "RecentCriminalLegalProblem": 0,
    "SuicideNote": 1,
    "SuicideIntentDisclosed": 1,
    "DisclosedToIntimatePartner": 1,
    "DisclosedToOtherFamilyMember": 0,
    "DisclosedToFriend": 0,
    "InjuryLocationType": 1,
    "WeaponType1": 6
}
"""

If you want to see all of the prompt versions, you can look at the script under src.prompts.py in the git repo for this post.

Running through ChatGPT

To test all variants of these prompts, I took a sample of 200 cases from the 4,000 original narratives, oversampling from rare categories to ensure I had at least a few examples for each variable (if you are curious, you can look at my uid_weighting.R script for how I did this). Using these 200 cases, I wanted to test the 4 different prompt types by running them through ChatGPT. Most of the work is handled by the code below, where I take each sample narrative, construct the 4 prompts, and append them to a JSON request template. All of the resulting requests are appended to a JSON list, then bulk run as a batch using the OpenAI API. I use one of the cheaper models here, GPT-4o mini, which is probably fine for testing purposes. I should also note that tokens are cheap: in a batch run with 1.7 million tokens, I paid 13 cents for the input tokens and 6 cents for the output tokens. I actually messed up the prompt caching for this test, but had I done it correctly it would have been about 50% cheaper.

One important note: I specify that the output must come as a JSON object by adding "response_format": { "type": "json_object" }. This is ChatGPT’s “json mode”, which is quite handy here. Without it, the LLM might occasionally hallucinate output that is not valid JSON and break the workflow.

Code
for _, single_narrative in narratives.iterrows():

    # grab the unique id and text
    uid = single_narrative["uid"]
    txt = single_narrative["NarrativeLE"]

    prompt_input = {
        "header": HEADER1,
        "narrative": txt,
        "body": [BODY1, BODY2],
        "example_output": [EXAMPLE_OUTPUT1, EXAMPLE_OUTPUT2],
        "footer": None,
    }

    # create the prompt versions, passing in the text narrative
    prompt_versions = prompt_creator.standard_prompt_caching(**prompt_input)

    for version_num, prompt in enumerate(prompt_versions):
        # now append each request to the batch list
        json_list.append(
            {
                "custom_id": f"{uid}_{version_num}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",
                    "messages": [
                        {"role": "system", "content": ROLE},
                        {"role": "user", "content": prompt},
                    ],
                    "max_tokens": 500,
                    "response_format": {"type": "json_object"},
                },
            }
        )

After submitting the JSON list, the job takes about an hour to run using the cheapest batch service. We can then download the output JSON files and parse our results!
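For reference, here is a rough sketch of what the batch submission and result parsing can look like with the OpenAI Python SDK; the file name and the exact response handling are illustrative assumptions rather than my exact code:

Code
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Write the accumulated requests to a .jsonl file, one request per line
# ("prompt_batch.jsonl" is a placeholder name).
with open("prompt_batch.jsonl", "w") as f:
    for request in json_list:
        f.write(json.dumps(request) + "\n")

# Upload the file and start a batch job against the chat completions endpoint
batch_file = client.files.create(file=open("prompt_batch.jsonl", "rb"), purpose="batch")
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Once the batch finishes, download the output .jsonl and parse each response.
output = client.files.content(client.batches.retrieve(batch_job.id).output_file_id)
parsed = {}
for line in output.text.splitlines():
    record = json.loads(line)
    content = record["response"]["body"]["choices"][0]["message"]["content"]
    parsed[record["custom_id"]] = json.loads(content)  # json mode keeps this parseable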

Results

To evaluate the results, we use the same metrics provided by the contest: a macro-weighted F1 score for the binary variables, and a micro-weighted F1 score for the multi-class categorical variables.
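As a rough sketch of how that scoring might be computed with scikit-learn (this is my interpretation of the metric, not the contest’s official scorer; the variable lists are abbreviated and the DataFrame layout is an assumption):

Code
import pandas as pd
from sklearn.metrics import f1_score

# Abbreviated for illustration; the full lists cover every variable in the prompt.
BINARY_VARS = ["DepressedMood", "SuicideNote", "Argument"]
CATEGORICAL_VARS = ["InjuryLocationType", "WeaponType1"]

def score_prompt_version(truth: pd.DataFrame, preds: pd.DataFrame) -> dict:
    """Per-variable F1 scores, assuming truth and preds share a uid index
    and have one column per variable (column layout is hypothetical)."""
    scores = {}
    for var in BINARY_VARS:
        scores[var] = f1_score(truth[var], preds[var], average="macro")
    for var in CATEGORICAL_VARS:
        scores[var] = f1_score(truth[var], preds[var], average="micro")
    return scores

Per-variable scores like these can then be averaged within each prompt version to get an aggregate comparison.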

Below we see that, relative to the baseline prompt (model 0), adding descriptions and adding examples both had a positive impact on the aggregate F1 score and accuracy. The model with the highest score was the one with both detailed variable descriptions and 3 scored examples. Overall, all of the models did pretty well given how little tweaking I did. The more interesting question is how much quality was really affected by different prompting styles, relative to the general randomness we typically expect out of an LLM.

Prompt Version                           F1 Score   Accuracy
0: Baseline                                 0.763      0.824
1: Add 3 few-shot examples                  0.775      0.832
2: Add more descriptions to features        0.778      0.837
3: Add few-shot and more descriptions       0.781      0.839

One way to answer this: remember analysis of variance from your stats 101 course? We can use it here to see whether there is non-zero variation in F1 score attributable to a change in prompts, relative to the variation across question types. Looking below we see, unsurprisingly, that almost all of the variation is explained by the question type (meaning that differences in F1 scores are mostly driven by which question is being asked).

Code
res <- with(results, aov(f1score ~ as.factor(model_ver) + as.factor(feature)))
broom::tidy(res) %>% knitr::kable(digits = 3)
term                    df   sumsq   meansq   statistic   p.value
as.factor(model_ver)     3   0.004    0.001       4.054      0.01
as.factor(feature)      22   0.584    0.027      76.276      0.00
Residuals               66   0.023    0.000          NA        NA

If we break this down by question type and plot it out, the results are a bit clearer. Below I have ordered the questions by the variance of their F1 scores, so that questions whose scores changed more across prompts are nearer the top. Interestingly, a few questions see a large improvement from the baseline prompt to the more advanced one. DepressedMood has an F1 score of .5 on the original prompt, which increases to about .65 on the final prompt. We see similar results with SchoolProblem, FamilyRelationship, and Argument. Questions that were already doing quite well see virtually no change, like WeaponType1. The LLM has a very easy time identifying the weapon because it is almost always clearly disclosed in the narrative (and is very often a firearm).
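The ordering in that plot can be computed with something like the following pandas sketch, assuming the per-question results have been saved to a long-format file with the same columns used in the ANOVA above (the file name is a placeholder):

Code
import pandas as pd

# Assumes one row per (prompt version, question) pair with columns:
# model_ver, feature, f1score ("results.csv" is a hypothetical file name).
results = pd.read_csv("results.csv")

# Order questions by how much their F1 score varies across prompt versions;
# questions most sensitive to the prompt sort to the top.
f1_variance = (
    results.groupby("feature")["f1score"]
    .var()
    .sort_values(ascending=False)
)
print(f1_variance.head())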

Summary

To wrap up this very long blog post, I have a few things to note:

  1. Writing good prompts isn’t hard, but it does require structuring questions so you get what you expect to see.
  2. Setting up a testing environment can help automate comparisons of many different prompts.
  3. Principled testing can help reduce manual prompt testing down the road.

We’re still in uncharted waters, comparatively speaking. Things on the LLM front are moving fast!