OntExtract: PROV-O Provenance Tracking for Document Analysis Workflows
Conference paper

Christopher B. Rauch, Hyung Wook Choi and Mat Kelly
2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp 249-252
15 Dec 2025

Abstract

Keywords: Databases; document processing workflows; Faces; llm orchestration; Large language models; Libraries; llm tool use; NLP tool integration; PROV-O provenance; reproducible analysis; Text analysis; Semantics
Researchers have developed sophisticated methods for semantic change detection. These methods depend on foundational document processing operations, such as segmentation and entity extraction, yet users face significant obstacles when trying to combine these underlying tools into analytical workflows. We present OntExtract, a system that provides a unified interface for document processing with integrated provenance tracking. In OntExtract, PROV-O provenance concepts are embedded directly in the database schema. Each processing operation creates a versioned output with corresponding provenance records. The system operates in two modes: API-enhanced mode, which uses large language models to orchestrate tool selection, and standalone mode, which relies on established NLP libraries (spaCy, NLTK, sentence-transformers). Users can apply different processing strategies to the same documents and compare results, while the system tracks complete analytical provenance. The PostgreSQL implementation with pgvector enables semantic similarity search and supports reproducible semantic change analysis.
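
To make the idea of embedding PROV-O concepts in a database schema concrete, the following is a minimal sketch, not the authors' schema. It assumes three hypothetical tables (`prov_entity`, `prov_activity`, `prov_agent`) that mirror the PROV-O classes Entity, Activity, and Agent, and it uses Python's built-in `sqlite3` in place of PostgreSQL so that it runs without a server. The helper names `run_operation` and `lineage`, the column names, and the two toy processing strategies are all illustrative assumptions.

```python
import sqlite3

# Hypothetical schema: one table per core PROV-O class.
SCHEMA = """
CREATE TABLE prov_agent (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE            -- prov:Agent (a tool, model, or user)
);
CREATE TABLE prov_entity (
    id INTEGER PRIMARY KEY,
    document_key TEXT NOT NULL,          -- groups all versions of one document
    version INTEGER NOT NULL,
    content TEXT NOT NULL,
    UNIQUE (document_key, version)
);
CREATE TABLE prov_activity (
    id INTEGER PRIMARY KEY,
    operation TEXT NOT NULL,             -- e.g. 'segmentation'
    agent_id INTEGER NOT NULL REFERENCES prov_agent(id),            -- prov:wasAssociatedWith
    used_entity_id INTEGER NOT NULL REFERENCES prov_entity(id),     -- prov:used
    generated_entity_id INTEGER NOT NULL REFERENCES prov_entity(id) -- inverse of prov:wasGeneratedBy
);
"""


def run_operation(conn, source_id, operation, agent_name, transform):
    """Apply `transform` to a stored entity; save the result as a new
    version and record which activity and agent produced it."""
    key, content = conn.execute(
        "SELECT document_key, content FROM prov_entity WHERE id = ?",
        (source_id,),
    ).fetchone()
    (next_version,) = conn.execute(
        "SELECT MAX(version) + 1 FROM prov_entity WHERE document_key = ?",
        (key,),
    ).fetchone()
    new_id = conn.execute(
        "INSERT INTO prov_entity (document_key, version, content) VALUES (?, ?, ?)",
        (key, next_version, transform(content)),
    ).lastrowid
    # Reuse the agent row if this tool has been seen before.
    conn.execute("INSERT OR IGNORE INTO prov_agent (name) VALUES (?)", (agent_name,))
    (agent_id,) = conn.execute(
        "SELECT id FROM prov_agent WHERE name = ?", (agent_name,)
    ).fetchone()
    conn.execute(
        "INSERT INTO prov_activity (operation, agent_id, used_entity_id, generated_entity_id) "
        "VALUES (?, ?, ?, ?)",
        (operation, agent_id, source_id, new_id),
    )
    return new_id


def lineage(conn, entity_id):
    """Walk provenance records backwards from an entity to its source.
    Returns (operation, agent name) pairs, most recent first."""
    chain = []
    while True:
        row = conn.execute(
            "SELECT a.operation, g.name, a.used_entity_id "
            "FROM prov_activity a JOIN prov_agent g ON g.id = a.agent_id "
            "WHERE a.generated_entity_id = ?",
            (entity_id,),
        ).fetchone()
        if row is None:
            return chain
        chain.append((row[0], row[1]))
        entity_id = row[2]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
raw_id = conn.execute(
    "INSERT INTO prov_entity (document_key, version, content) VALUES (?, ?, ?)",
    ("doc-1", 1, "First sentence. Second sentence."),
).lastrowid

# Two toy strategies applied to the same source document.
seg_id = run_operation(
    conn, raw_id, "segmentation", "hypothetical-splitter",
    lambda text: "\n".join(s.strip() for s in text.split(".") if s.strip()),
)
tok_id = run_operation(
    conn, raw_id, "tokenization", "hypothetical-tokenizer",
    lambda text: "\n".join(text.split()),
)
print(lineage(conn, seg_id))
print(lineage(conn, tok_id))
```

Because both derived versions point back to the same source entity, their contents can be compared side by side while each keeps its own record of the operation and agent that produced it. A production schema would likely also record timestamps and tool parameters, which this sketch leaves out.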

Web of Science research areas
Computer Science, Artificial Intelligence