Substance Use

Women's Health

Substance use during pregnancy is one of the places where stigma and policy do the most quiet damage. People stop disclosing, providers stop asking, and the data we rely on stop describing real experience. My current work asks whether large-scale digital traces, paired with multiple large language models as a kind of triangulated qualitative team, can surface what surveillance and clinic data miss.

Postdoctoral · Current · Ongoing · Submitted to APHA 2026

Qualitative Assessment of Substance Use–Related Perinatal Health Needs in Reddit Discourse Using Multiple Large Language Models (LLM): A Proof-of-Concept Study

Introduction

Substance use presents unique barriers to prenatal care. Stigma and punitive policy environments discourage disclosure and deter care-seeking among pregnant people, and traditional public health surveillance and clinical data offer only a partial picture of how people actually navigate substance use and the health needs that surround it in real time. Social media platforms such as Reddit fill some of that gap: anonymity and peer community lower the barriers to self-disclosure that clinical encounters tend to raise.

Methods

This proof-of-concept study uses multi-LLM consensus annotation to classify self-reported perinatal experiences in a Reddit corpus assembled from five pregnancy- and substance-use-related subreddits between 2020 and 2024, yielding approximately two million comments. From a random sample of 5,000 posts, three local LLMs (Llama 3.3-70B, GPT-OSS-20B, and Qwen3-30B) independently annotated each post for self-disclosure. The 1,370 posts (27.4%) that reached full three-model consensus were retained as a high-precision label set, and 309 explicit self-report posts were carried into thematic analysis with Claude Sonnet 4.6 and human review.
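The consensus step described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's pipeline: the label names (`self_report`, `other`), the post IDs, and the function names are hypothetical, and "consensus" is assumed here to mean that all models assigned the same label to a post.

```python
from typing import Dict, List


def unanimous_consensus(annotations: Dict[str, List[str]]) -> Dict[str, str]:
    """Keep only the posts on which every model assigned the same label.

    `annotations` maps a post ID to the list of labels produced by each
    model (one label per model). Label names are hypothetical.
    """
    consensus = {}
    for post_id, labels in annotations.items():
        # A post reaches consensus when its set of labels has exactly one member.
        if labels and len(set(labels)) == 1:
            consensus[post_id] = labels[0]
    return consensus


def consensus_rate(annotations: Dict[str, List[str]],
                   consensus: Dict[str, str]) -> float:
    """Share of annotated posts that reached full consensus."""
    return len(consensus) / len(annotations) if annotations else 0.0


# Toy input: four invented posts, each labeled by three models.
toy_annotations = {
    "post_a": ["self_report", "self_report", "self_report"],
    "post_b": ["self_report", "other", "self_report"],
    "post_c": ["other", "other", "other"],
    "post_d": ["other", "self_report", "other"],
}

toy_consensus = unanimous_consensus(toy_annotations)
toy_rate = consensus_rate(toy_annotations, toy_consensus)

# Only unanimous posts labeled as self-report would move on to thematic analysis.
toy_self_reports = [p for p, label in toy_consensus.items() if label == "self_report"]
```

Requiring unanimity trades coverage for precision: posts on which the models disagree are set aside rather than resolved by majority vote. The reported rate follows the same arithmetic as `consensus_rate`: 1,370 of 5,000 sampled posts is 27.4%.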

Results

Medication-use concerns and healthcare-system navigation challenges co-occurred frequently, with insurance barriers and provider communication difficulties showing up together. Across the corpus, a cross-cutting pattern of clinical dismissal emerged. Reddit functioned as both an emotional support network and an alternative health information source for pregnant people who felt unheard inside medical care.

Conclusion

Multi-LLM consensus annotation with human-in-the-loop validation looks like a feasible, cost-efficient, and scalable way to identify unmet prenatal health needs from large-scale social media data, surfacing attitudes and beliefs that are routinely underreported in clinical settings. These findings establish the protocol for a full-corpus analysis of substance-use-related barriers and inform communication and outreach strategies for reaching pregnant people who are not engaged in adequate care.

Methodological note: Extends a line of Reddit-based public health surveillance developed in the Broniatowski lab at GW. Collaborator Dian Hu, PhD (Systems Engineering, GW; postdoctoral, University of Maryland) brings nearly a decade of expertise in social media data pipelines and NLP for health discourse.

Perinatal Substance Use · Reddit Corpus · Multi-LLM · Thematic Analysis · Health Communication

Future Work

In the coming years, my substance use research will draw on three complementary data infrastructures: digital self-disclosure on Reddit, local clinical intervention trial data, and national longitudinal data. The goal is to use these together to surface unmet needs that any single data source would miss, and to inform care models and outreach that meet pregnant people where they actually are.

Reddit · ~2 million comments · 2020–2024 · Lay digital voice

Clinical RCT · GWU + Children’s National · Local clinical

HBCD · HEALthy Brain & Child Development · NIH Study · National longitudinal