OntExtract: PROV-O Provenance Tracking for Document Analysis Workflows

Christopher B. Rauch; Hyung Wook Choi; Mat Kelly

doi:10.1109/JCDL67857.2025.00038

Conference paper

2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp 249-252

15 Dec 2025

Featured in Collection: UN Sustainable Development Goals @ Drexel

Keywords: Databases; document processing workflows; Faces; LLM orchestration; Large language models; Libraries; LLM tool use; NLP tool integration; PROV-O provenance; reproducible analysis; Text analysis; Semantics

Abstract

Researchers have developed sophisticated methods for semantic change detection. These methods depend on foundational document processing operations, such as segmentation and entity extraction, yet users face significant obstacles when trying to combine these underlying tools into analytical workflows. We present OntExtract, a system that provides a unified interface for document processing with integrated provenance tracking. In OntExtract, PROV-O provenance concepts are embedded directly in the database schema. Each processing operation creates a versioned output with corresponding provenance records. The system operates in two modes: API-enhanced mode, which uses large language models to orchestrate tool selection, and standalone mode, which relies on established NLP libraries (spaCy, NLTK, sentence-transformers). Users can apply different processing strategies to the same documents and compare results, while the system tracks complete analytical provenance. The PostgreSQL implementation with pgvector enables semantic similarity search and supports reproducible semantic change analysis.
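
The idea of embedding PROV-O concepts in the schema, with each operation producing a versioned output and a provenance record, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the table names, column names, and helper functions are hypothetical and are not taken from OntExtract, and SQLite from the Python standard library stands in for the PostgreSQL and pgvector implementation the abstract describes. The sketch maps prov:Entity, prov:Activity, and prov:Agent to tables, and prov:wasGeneratedBy, prov:wasDerivedFrom, and prov:wasAssociatedWith to foreign keys.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: names are illustrative, not OntExtract's actual tables.
SCHEMA = """
CREATE TABLE prov_agent (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE            -- a tool or model acting as prov:Agent
);
CREATE TABLE prov_activity (
    id INTEGER PRIMARY KEY,
    operation TEXT NOT NULL,             -- e.g. "segmentation"
    agent_id INTEGER NOT NULL REFERENCES prov_agent(id),  -- prov:wasAssociatedWith
    started_at TEXT NOT NULL
);
CREATE TABLE prov_entity (
    id INTEGER PRIMARY KEY,
    document_key TEXT NOT NULL,
    version INTEGER NOT NULL,
    content TEXT NOT NULL,
    generated_by INTEGER REFERENCES prov_activity(id),    -- prov:wasGeneratedBy
    derived_from INTEGER REFERENCES prov_entity(id),      -- prov:wasDerivedFrom
    UNIQUE (document_key, version)
);
"""


def connect():
    """Open an in-memory database and create the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_source(conn, document_key, content):
    """Store an uploaded document as version 1 with no generating activity."""
    cur = conn.execute(
        "INSERT INTO prov_entity (document_key, version, content) VALUES (?, 1, ?)",
        (document_key, content),
    )
    return cur.lastrowid


def run_operation(conn, source_id, operation, agent_name, transform):
    """Apply `transform` to a stored entity and record the output as a new
    version, linked to the activity and agent that produced it."""
    document_key, content = conn.execute(
        "SELECT document_key, content FROM prov_entity WHERE id = ?", (source_id,)
    ).fetchone()
    conn.execute("INSERT OR IGNORE INTO prov_agent (name) VALUES (?)", (agent_name,))
    (agent_id,) = conn.execute(
        "SELECT id FROM prov_agent WHERE name = ?", (agent_name,)
    ).fetchone()
    activity_id = conn.execute(
        "INSERT INTO prov_activity (operation, agent_id, started_at) VALUES (?, ?, ?)",
        (operation, agent_id, datetime.now(timezone.utc).isoformat()),
    ).lastrowid
    (latest,) = conn.execute(
        "SELECT MAX(version) FROM prov_entity WHERE document_key = ?", (document_key,)
    ).fetchone()
    cur = conn.execute(
        "INSERT INTO prov_entity "
        "(document_key, version, content, generated_by, derived_from) "
        "VALUES (?, ?, ?, ?, ?)",
        (document_key, latest + 1, transform(content), activity_id, source_id),
    )
    conn.commit()
    return cur.lastrowid


def lineage(conn, entity_id):
    """Follow derived_from links back to the source document.
    Returns a list of (version, operation, agent_name), newest first."""
    steps = []
    while entity_id is not None:
        row = conn.execute(
            "SELECT e.version, a.operation, g.name, e.derived_from "
            "FROM prov_entity e "
            "LEFT JOIN prov_activity a ON a.id = e.generated_by "
            "LEFT JOIN prov_agent g ON g.id = a.agent_id "
            "WHERE e.id = ?",
            (entity_id,),
        ).fetchone()
        steps.append(row[:3])
        entity_id = row[3]
    return steps
```

In this sketch, calling run_operation twice on the same source with two different transforms yields two separate versions that both point back to version 1, which is one way the "apply different strategies and compare results" behavior could be recorded. The transform argument is a plain Python callable here; in a real deployment it would wrap a tool such as a spaCy or NLTK pipeline.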

Details

Title: OntExtract: PROV-O Provenance Tracking for Document Analysis Workflows
Creators: Christopher B. Rauch - Drexel University, Information Science
Hyung Wook Choi - Drexel University, Information Science
Mat Kelly - Drexel University, Information Science
Publication Details: 2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp 249-252
Conference: 2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL) (DeKalb, Illinois, United States, 15 Dec 2025–19 Dec 2025)
Publisher: IEEE
Number of pages: 4
Resource Type: Conference paper
Language: English
Academic Unit: Information Science
Web of Science ID: WOS:001710657900028
Scopus ID: 2-s2.0-105033906284
Other Identifier: 9798331568030; 991022172970204721

InCites Highlights

Data related to this publication, from InCites Benchmarking & Analytics tool:

Web of Science research areas: Computer Science, Information Systems; Computer Science, Interdisciplinary Applications; Information Science & Library Science
