Gio Circo, Ph.D. – A Blog for Data Stuff

Recipe Linking using LLMs

A deep dive into antiquated French cuisine

Python

Large Language Models

Auguste Escoffier’s 1903 “Le guide culinaire” is a legendary French cookbook. At just under 1,000 pages, this book codified much of what modern French cuisine has become. In…

Jul 28, 2026

Gio Circo, Ph.D.

Structured Data Extraction Using Local Models

Small LLM Image Processing using NuExtract3

Python

Large Language Models

For this post, I’m using another small model called NuExtract3. This model is actually just a fine-tuned version of a Qwen3.5-4B model, which I had previously run locally on…

Jun 1, 2026

Gio Circo, Ph.D.

A Few Practical Notes on PII Redaction

A ‘layered’ approach using OpenAI’s privacy filter

Python

Large Language Models

For those of us working in the healthcare industry, ensuring the safety of personally-identifiable information (PII) and personal health information (PHI) is a critical…

May 11, 2026

Gio Circo, Ph.D.

How Confident are AI Classifiers About Their Own Confidence?

Injury classification using the 2024 NEISS

Python

Large Language Models

LLMs have found a lot of practical uses as text classifiers across a ton of different areas. In my day job we commonly use them to sort and classify documents into different…

Apr 8, 2026

Gio Circo, Ph.D.

Structured Data Extraction with Local Models

Classifying 311 complaints with Qwen3.5

Python

Large Language Models

I work with LLMs all day at my day job. I am lucky to be in a position where I have access to the most up-to-date models from all the major players (OpenAI, Anthropic, etc.).…

Mar 11, 2026

Gio Circo, Ph.D.

Visualizing predictor effects using accumulated local effect (ALE) plots

Identifying ‘hot spots’ of car crashes in NYC

Python

Data Science Applications

Data Visualization

One of the major drawbacks with many machine learning models is that explainability is often challenging. Unlike methods that rely on more conventional statistical…

Feb 20, 2026

Gio Circo, Ph.D.

Introducing ‘GridPred’

Spatial grid prediction using machine learning models

Python

Data Science Applications

There is no shortage of crime prediction models available today. In fact, prediction models are fairly ubiquitous nowadays in almost every field. Crime prediction…

Jan 11, 2026

Gio Circo, Ph.D.

Re-Discovering Spatial Scan Statistics

Detecting ‘hot spots’ of SIDS cases

Anomaly Detection

Spatial scan statistics were one of the first topics I studied in grad school. The

Oct 24, 2025

Gio Circo, Ph.D.

And Now For Something Completely Different…

Building a personal wine recommendation bot

Python

Data Science Applications

Large Language Models

I haven’t ever really talked about any of my other hobbies on this blog. Mostly because there isn’t really a lot of overlap with the things I do for work, and what I do for…

Jul 14, 2025

Gio Circo, Ph.D.

Fine Tuning Your LLM for Fun and Profit

Part I: Building a working model

Python

Data Science Applications

Large Language Models

This is part one of a two-part blog series where I will be walking through the steps of building and evaluating a fine-tuned version of an LLM. I initially became interested…

May 27, 2025

Gio Circo, Ph.D.

How to Calculate a Simple Difference-in-Differences

A very short example using lm

Causal Inference

I was recently scanning through some posts on my LinkedIn page, and saw something interesting (or at least more noteworthy than the 1000 rage-bait or AI posts). This was a…

Mar 30, 2025

Gio Circo, Ph.D.

‘Vibe Coding’ my way into a RAG pipeline

Retrieval-augmented generation with a little help from a friend.

Python

Data Science Applications

For most people who are up-to-date in tech, large language models (LLMs) are nothing new. In fact, they are downright pervasive. One of the largest challenges with LLMs…

Mar 14, 2025

Gio Circo, Ph.D.

Finding Fake Traffic Tickets in CT State Police Data

A decidedly old-school approach

Anomaly Detection

In 2022, news broke that Connecticut State Police officers had likely been submitting tens of thousands of fake traffic tickets. What started from a report by CT Insider about…

Mar 7, 2025

Gio Circo, Ph.D.

An A/B Testing Approach to Prompt Refinement

Testing a text extraction model using ChatGPT

Python

Data Science Applications

There are synthesized depictions of self-harm and suicide in this blog post.

Feb 24, 2025

Gio Circo, Ph.D.

Information Retrieval Using the Retrieve and Rerank Method

Extracting injury narratives from the NEISS

Python

Data Science Applications

The logic behind the “retrieve and rerank” method is that we have two sets of tools that each excel at one specific task. Specifically, we want to use a combination of a…

Jan 24, 2025

Gio Circo, Ph.D.

I Made An Election Model

A (very, very late) discussion of election forecasting

Bayesian Statistics

The 2024 election is over. Donald Trump beat Kamala Harris by about 2 million popular votes and 86 electoral votes. All the posturing, forecasting, and betting is over. So…

Jan 13, 2025

Gio Circo, Ph.D.

Where Do Crime Guns in Your State Come From?

Tracking gun seizures by state

For this post I rely on some data posted by David Johnson. He kindly made ATF gun trace data freely available here, which I learned about via a post on Twitter:

Dec 5, 2024

Gio Circo, Ph.D.

Executing Python Scripts from the Command Line

Or: how to pass a job interview

Python

Data Science Applications

Here’s a short one for a Friday afternoon.

Sep 20, 2024

Gio Circo, Ph.D.

Going Back to (Bayesian) School

Self-study with Regression and Other Stories

Bayesian Statistics

This is just a bit of fun, mostly for myself. In my day-to-day as a data scientist, I don’t get to use Bayesian statistics as much as I would like. In my role, a lot of the…

Aug 24, 2024

Gio Circo, Ph.D.

Dear Crime Analysts: Why You Should Use SQL Inside of R

Using duckDB in R to speed up analysis

SQL

When I was in grad school working on my Ph.D., I learned a lot about math, statistics, research methods, and experimental design (among a LOT of other things). For a good…

Jul 23, 2024

Gio Circo, Ph.D.

Creating Synthetic Spatial Data

Simulating Gas Stations and Robberies

Spatial Statistics

In my work as a data scientist I have been working increasingly with synthetic data generation. Synthetic data can be very useful when you are working on products that…

Jun 15, 2024

Gio Circo, Ph.D.

If You Order Chipotle Online, You Are Probably Getting Less Food

Comparing weights of orders

Miscellaneous

Here’s a quick one. The question posed here is “do you get less food if you place your Chipotle order online versus in person?” There are plenty of posts going back years…

Apr 3, 2024

Gio Circo, Ph.D.

Don’t Evaluate Your Model On a SMOTE Dataset

or: try this one weird trick to increase your AUC

Machine Learning & Prediction

I recently found a published paper called “Advancing Recidivism Prediction for Male Juvenile Offenders: A Machine Learning Approach Applied to Prisoners in Hunan Province”.…

Mar 26, 2024

Gio Circo, Ph.D.

An Outsider’s Perspective On Media Mix Modelling

A Bayesian Approach to MMM

Bayesian Statistics

I’m trying something a bit new this time. Typically how I learn is that I see something interesting (either in a blog post, an academic article, or through something a…

Mar 18, 2024

Gio Circo, Ph.D.

How to Draw Lines Between Pairs of Points in R

Visualizing journeys between cities

Spatial Statistics

Here’s a quick one. I was recently asked how you might plot the travel of individuals over time on a map. For example, if you had longitudinal data recording the residences…

Jan 16, 2024

Gio Circo, Ph.D.

Using NLP To Classify Medical Falls

Or: An Old Dog Learning New Tricks

Python

Data Science Applications

There’s no question that natural language processing (NLP) facilitated by deep learning has exploded in popularity (much of which is popularized by the ChatGPT family of…

Jan 8, 2024

Gio Circo, Ph.D.

The Great American Coffee Taste Test

A deeper dive with Bayes

Bayesian Statistics

Miscellaneous

In October I was lucky enough to participate in popular coffee YouTuber James Hoffmann’s Great American Coffee Taste Test. In short, participants got 4 samples of coffee and…

Dec 1, 2023

Gio Circo, Ph.D.

Synthetic Controls and Small Areas

A short discussion on ‘microsynthetic’ controls

Causal Inference

Bayesian Statistics

Andrew Gelman recently covered a mildly controversial paper in criminology that suggested that a policy of “de-prosecution” by the Philadelphia District Attorney’s office…

Oct 25, 2023

Gio Circo, Ph.D.

Generating Spatial Risk Features using R

Creating ‘RTM’ style map data

Spatial Statistics

Machine Learning & Prediction

In criminology there is considerable research on the role that fixed spatial features in the environment play in crime. These spatial risk factors have criminogenic…

Oct 6, 2023

Gio Circo, Ph.D.

Good Mythical Morning: Blind Fast Food Taste Tests

How good are Rhett and Link?

Miscellaneous

So, it’s no secret that my wife and I are fans of Rhett and Link’s YouTube series Good Mythical Morning. Specifically, we are fans of the large variety of food-related…

Jul 27, 2023

Gio Circo, Ph.D.

Building an Outlier Ensemble from ‘Scratch’

Part 3: Histogram-based anomaly detector

Anomaly Detection

Machine Learning & Prediction

This is the third part of a 3-part series. In the first two posts I described how I built a principal components analysis anomaly detector and a k-nearest neighbors anomaly…

May 14, 2023

Gio Circo, Ph.D.

Building an Outlier Ensemble from ‘Scratch’

Part 2: K-nearest neighbors anomaly detector

Anomaly Detection

Machine Learning & Prediction

This is the second part of a 3-part series. In the previous post I talked a bit about my desire to work on building the pieces of an outlier ensemble from “scratch”…

Apr 25, 2023

Gio Circo, Ph.D.

Anomaly Detection for Time Series

Applying a PCA anomaly detector

Anomaly Detection

Machine Learning & Prediction

Identifying outliers in time series is one of the more common applications for unsupervised anomaly detection. Some of the most common examples come from network intrusion…

Apr 24, 2023

Gio Circo, Ph.D.

Building an Outlier Ensemble from ‘Scratch’

Part 1: Principal components anomaly detector

Anomaly Detection

Machine Learning & Prediction

I strongly believe in “learning by doing”. One of the things I have been working on quite a bit lately is unsupervised anomaly detection. As with many machine-learning…

Apr 7, 2023

Gio Circo, Ph.D.

An Alternative to Buffers for Spatial Merging

Car Crashes in Austin

Spatial Statistics

This is a bit of a mini-blog post based on a workflow that I have used in some of my own work. A common issue in spatial analysis - and especially in criminology - is…

Mar 20, 2023

Gio Circo, Ph.D.

The Power of Ensembles

Adventures in outlier detection

Python

Anomaly Detection

Machine Learning & Prediction

It’s no secret that ensemble methods are extremely powerful tools in statistical inference, data science, and machine learning. It’s long been known that many “imperfect”…

Feb 15, 2023

Gio Circo, Ph.D.

Injuries at Amazon Warehouses - A Bayesian Approach

Bayesian Statistics

My friend, Andy Wheeler, just recently posted on his blog about reported injuries at Amazon warehouses. As he rightly points out, the apparently high number of injuries at…

Nov 11, 2022

Gio Circo, Ph.D.
