Conference paper
OntExtract: PROV-O Provenance Tracking for Document Analysis Workflows
2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp 249-252
15 Dec 2025
Abstract
Researchers have developed sophisticated methods for semantic change detection. These methods depend on foundational document processing operations, such as segmentation and entity extraction, yet users face significant obstacles when trying to combine these underlying tools into analytical workflows. We present OntExtract, a system that provides a unified interface for document processing with integrated provenance tracking. In OntExtract, PROV-O provenance concepts are embedded directly in the database schema. Each processing operation creates a versioned output with corresponding provenance records. The system operates in two modes: API-enhanced mode, which uses large language models to orchestrate tool selection, and standalone mode, which relies on established NLP libraries (spaCy, NLTK, sentence-transformers). Users can apply different processing strategies to the same documents and compare results, while the system tracks complete analytical provenance. The PostgreSQL implementation with pgvector enables semantic similarity search and supports reproducible semantic change analysis.
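As a rough illustration of what embedding PROV-O concepts in a database schema can look like, the sketch below records each processing operation as a prov:Activity and each versioned output as a prov:Entity, linked by prov:used, prov:wasGeneratedBy, and prov:wasDerivedFrom. It is not OntExtract's actual schema: the table names, column names, and helper functions are hypothetical, the tool name stands in for a full prov:Agent record, and Python's built-in sqlite3 module is used in place of PostgreSQL so the example runs on its own.

```python
import sqlite3

# Hypothetical schema: columns are named after the PROV-O relations they model.
SCHEMA = """
CREATE TABLE activity (
    id   INTEGER PRIMARY KEY,
    tool TEXT NOT NULL,                      -- simplified prov:wasAssociatedWith
    used INTEGER                             -- prov:used (input entity id)
);
CREATE TABLE entity (
    id               INTEGER PRIMARY KEY,
    document         TEXT NOT NULL,
    version          INTEGER NOT NULL,
    content          TEXT NOT NULL,
    was_derived_from INTEGER REFERENCES entity(id),    -- prov:wasDerivedFrom
    was_generated_by INTEGER REFERENCES activity(id)   -- prov:wasGeneratedBy
);
"""

def add_source(conn, document, content):
    """Store the original document as version 1 with no provenance links."""
    cur = conn.execute(
        "INSERT INTO entity (document, version, content) VALUES (?, 1, ?)",
        (document, content),
    )
    return cur.lastrowid

def run_operation(conn, source_id, tool, fn):
    """Apply fn to an entity's content and store the result as a new version."""
    document, content = conn.execute(
        "SELECT document, content FROM entity WHERE id = ?", (source_id,)
    ).fetchone()
    activity_id = conn.execute(
        "INSERT INTO activity (tool, used) VALUES (?, ?)", (tool, source_id)
    ).lastrowid
    next_version = conn.execute(
        "SELECT MAX(version) + 1 FROM entity WHERE document = ?", (document,)
    ).fetchone()[0]
    return conn.execute(
        "INSERT INTO entity (document, version, content,"
        " was_derived_from, was_generated_by) VALUES (?, ?, ?, ?, ?)",
        (document, next_version, fn(content), source_id, activity_id),
    ).lastrowid

def lineage(conn, entity_id):
    """Walk prov:wasDerivedFrom links back to the source document."""
    chain = []
    while entity_id is not None:
        version, tool, parent = conn.execute(
            "SELECT e.version, a.tool, e.was_derived_from FROM entity e"
            " LEFT JOIN activity a ON a.id = e.was_generated_by"
            " WHERE e.id = ?",
            (entity_id,),
        ).fetchone()
        chain.append((version, tool))
        entity_id = parent
    return chain

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
src = add_source(conn, "doc-1", "First sentence. Second sentence.")

# Two different processing strategies applied to the same source document.
by_sentence = run_operation(
    conn, src, "sentence_split",
    lambda t: "\n".join(s.strip() for s in t.split(".") if s.strip()),
)
by_word = run_operation(conn, src, "word_split", lambda t: "\n".join(t.split()))

print(lineage(conn, by_sentence))
print(lineage(conn, by_word))
```

Because each strategy writes a new version instead of overwriting the source, both outputs remain available for comparison, and each can be traced back to the operation and input that produced it.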
Details
- Title
- OntExtract: PROV-O Provenance Tracking for Document Analysis Workflows
- Creators
- Christopher B. Rauch - Drexel University, Information Science
- Hyung Wook Choi - Drexel University, Information Science
- Mat Kelly - Drexel University, Information Science
- Publication Details
- 2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp 249-252
- Conference
- 2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL) (DeKalb, Illinois, United States, 15 Dec 2025–19 Dec 2025)
- Publisher
- IEEE
- Number of pages
- 4
- Resource Type
- Conference paper
- Language
- English
- Academic Unit
- Information Science
- Web of Science ID
- WOS:001710657900028
- Scopus ID
- 2-s2.0-105033906284
- Other Identifier
- 9798331568030; 991022172970204721
InCites Highlights
Data related to this publication, from InCites Benchmarking & Analytics tool:
- Web of Science research areas
- Computer Science, Artificial Intelligence