Journal article
You Have the Floor: A Speaker-Aligned Corpus Derived from the Congressional Record
Proceedings of the International AAAI Conference on Web and Social Media, v 19, pp 2385-2395
07 Jun 2025
Abstract
The United States Congressional Record serves as a comprehensive archive of legislative discourse, yet its sheer volume and unstructured format pose significant challenges for researchers interested in analyzing political language, speaker behavior, and ideological framing. This paper presents a new dataset that organizes Congressional speeches by individual speaker. The full text of the Congressional Record was retrieved using a combination of the Congress.gov Application Programming Interface (API) and web scraping. After extracting roll-call votes and standardizing the transcripts to remove noisy artifacts and normalize formatting, the speeches are separated into individual files by speaker and annotated with metadata including name, political affiliation, years active, district or state represented, and professional social media accounts, if known. This enables fine-grained analysis of rhetorical patterns and linguistic strategies across political groups and time periods. By making the dataset publicly available, we aim to support interdisciplinary research using natural language processing (NLP).
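The speaker-separation step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline. It assumes that a new speech turn opens with a title and an upper-case surname followed by a period (for example "Mr. SMITH." or "Ms. JONES of Texas."), and the names SPEAKER_RE and split_by_speaker are hypothetical, introduced here only for illustration.

```python
import re
from collections import defaultdict

# Assumed heading pattern (not taken from the paper): a title, a surname whose
# last letter is upper-case, an optional "of <State>", then a period.
SPEAKER_RE = re.compile(
    r"^\s*(?P<speaker>(?:Mr|Mrs|Ms|Miss|Dr)\.\s+[A-Z][A-Za-z'\-]*[A-Z]"
    r"(?:\s+of\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)?)\.\s+(?P<text>.*)$"
)


def split_by_speaker(transcript: str) -> dict[str, list[str]]:
    """Group the speech turns in a cleaned transcript by speaker heading."""
    speeches = defaultdict(list)
    current = None  # speaker of the turn currently being read
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SPEAKER_RE.match(line)
        if match:
            # A new heading starts a new turn for that speaker.
            current = match.group("speaker")
            speeches[current].append(match.group("text"))
        elif current is not None:
            # Continuation lines are appended to the current turn.
            speeches[current][-1] += " " + line
    return dict(speeches)


sample = """
  Mr. SMITH. Mr. Speaker, I rise today.
I yield back.
  Ms. JONES of Texas. I thank the gentleman.
  Mr. SMITH. I thank the gentlewoman.
"""

for speaker, turns in split_by_speaker(sample).items():
    print(speaker, len(turns))
```

Forms of address such as "Mr. Speaker," do not trigger a new turn in this sketch, because the pattern requires the surname to end in an upper-case letter and to be followed by a period. A real pipeline would also need to resolve each heading to a unique legislator before attaching metadata such as party or district.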
Details
- Title
- You Have the Floor: A Speaker-Aligned Corpus Derived from the Congressional Record
- Creators
- Jennifer L Bochenek - Drexel University; Jake Ryland Williams - Drexel University
- Publication Details
- Proceedings of the International AAAI Conference on Web and Social Media, v 19, pp 2385-2395
- Publisher
- Association for the Advancement of Artificial Intelligence
- Resource Type
- Journal article
- Language
- English
- Academic Unit
- Information Science
- Other Identifier
- 991022061654004721