You Have the Floor: A Speaker-Aligned Corpus Derived from the Congressional Record
Journal article   Open access

Jennifer L Bochenek and Jake Ryland Williams
Proceedings of the International AAAI Conference on Web and Social Media, v 19, pp 2385-2395
07 Jun 2025
https://doi.org/10.1609/icwsm.v19i1.35941
Published, Version of Record (VoR) Open

Abstract

The United States Congressional Record serves as a comprehensive archive of legislative discourse, yet its sheer volume and unstructured format pose significant challenges for researchers interested in analyzing political language, speaker behavior, and ideological framing. This paper presents a new dataset that organizes Congressional speeches by individual speakers. Data was obtained using a mixture of the Congress.gov Application Programming Interface (API) and web-scraping techniques to retrieve the full text of the Congressional Record. After extracting roll-call votes and standardizing the transcripts to remove noisy artifacts and normalize formatting, each speaker's speeches are separated into individual files and annotated with metadata including name, political affiliation, years active, district or state represented, and professional social media accounts, if known. This enables fine-grained analysis of rhetorical patterns and linguistic strategies across different political groups as well as time periods. By making the dataset publicly available, we aim to support interdisciplinary research utilizing natural language processing (NLP).
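
To illustrate the speaker-separation step described in the abstract, the following is a minimal sketch and not the authors' pipeline. It assumes that a speaker turn in a cleaned transcript opens with a title and an upper-case surname followed by a period (for example "Mr. SMITH. "). The pattern `SPEAKER_RE`, the function `split_by_speaker`, and the sample text are all hypothetical and exist only for illustration.

```python
import re
from collections import defaultdict

# Assumption: a turn starts at the beginning of a line with a title,
# an upper-case surname, a period, and a space. Real transcripts have
# more variants (e.g. "The SPEAKER pro tempore", state qualifiers).
SPEAKER_RE = re.compile(
    r"^[ \t]*(?:Mr\.|Ms\.|Mrs\.) ([A-Z][A-Z'\-]+)\. ",
    re.MULTILINE,
)


def split_by_speaker(text):
    """Group the speech text of each turn under the speaker's surname."""
    turns = defaultdict(list)
    matches = list(SPEAKER_RE.finditer(text))
    for i, match in enumerate(matches):
        # A turn runs until the next speaker marker or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # Collapse line breaks and repeated spaces inside the turn.
        speech = " ".join(text[match.end():end].split())
        turns[match.group(1)].append(speech)
    return dict(turns)


# Invented sample text, not taken from the Congressional Record.
sample = (
    "  Mr. SMITH. Madam Speaker, I rise today\n"
    "in support of the bill.\n"
    "  Ms. JONES. I thank the gentleman.\n"
    "  Mr. SMITH. I yield back.\n"
)

by_speaker = split_by_speaker(sample)
for surname, speeches in by_speaker.items():
    print(surname, len(speeches))
```

Each key of the returned dictionary could then be written to its own file and joined with a separate metadata table (name, affiliation, years active, district or state), which is the kind of per-speaker organization the abstract describes.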
