Unified visual language modeling for zero-shot multitask inspection of civil infrastructure

Farzad Azizi Zade; Pedram Bazrafshan; Arvin Ebrahimkhanlou

doi:10.1117/12.3090628


Conference proceeding


Proceedings of SPIE, the international society for optical engineering, v 13951, 139510U

16 Apr 2026

DOI: https://doi.org/10.1117/12.3090628


Abstract

Keywords: Multi-task computer vision; prompt engineering; captioning; detection; segmentation; infrastructure inspection

This paper examines multitask computer vision for civil infrastructure maintenance by evaluating a unified visual language approach for captioning, semantic segmentation, and object detection on transportation infrastructure imagery. Two pretrained model variants (Florence-2 base and large) were tested in a zero-shot setting, using prompt-conditioned sequence-to-sequence processing and generating three captions per image as prompts for downstream tasks. Evaluation used average precision (AP) and average recall (AR) for open-vocabulary detection, caption-to-phrase grounding detection, and referring expression segmentation. Computational profiling on an NVIDIA T4 indicates that the base variant requires roughly 2 GB of GPU memory with runtimes from 0.2 to 2.3 seconds, while the large variant requires nearly 4 GB with runtimes from 0.4 to 3.9 seconds and higher CPU and GPU utilization. A complex scene revealed complementary strengths across variants: the base variant succeeded at bridge segmentation where the large variant failed, while the large variant detected all small instances (e.g., graffiti on concrete) that the base variant missed. This study's prompt engineering across various prompts identified that a geometry-focused prompt can optimize inspection outcomes, providing practical guidance for deploying these models in real-world infrastructure monitoring applications.
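
As an illustration of the detection metrics named in the abstract, the following minimal sketch computes precision and recall for one image at a single intersection-over-union (IoU) threshold. It is not the authors' evaluation code: the function names, the corner-coordinate box format, and the 0.5 threshold are assumptions made here, and reported AP and AR values are normally obtained by further averaging over confidence rankings and IoU thresholds, which this sketch omits.

```python
from typing import List, Tuple

# Hypothetical box format for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    preds: List[Box], truths: List[Box], thr: float = 0.5
) -> Tuple[float, float]:
    """Greedy one-to-one matching of predictions to ground-truth boxes.

    A prediction counts as a true positive if it overlaps a not-yet-matched
    ground-truth box with IoU >= thr (thr = 0.5 is an assumed default).
    """
    matched = set()
    true_pos = 0
    for p in preds:
        best_j, best_iou = -1, thr
        for j, t in enumerate(truths):
            if j in matched:
                continue
            value = iou(p, t)
            if value >= best_iou:
                best_j, best_iou = j, value
        if best_j >= 0:
            matched.add(best_j)
            true_pos += 1
    precision = true_pos / len(preds) if preds else 0.0
    recall = true_pos / len(truths) if truths else 0.0
    return precision, recall
```

For example, calling `precision_recall` with two predicted boxes and three ground-truth boxes, where only one prediction overlaps a ground-truth box, gives a precision of one half and a recall of one third.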


Details

Title: Unified visual language modeling for zero-shot multitask inspection of civil infrastructure
Creators: Farzad Azizi Zade - Independent Researcher (Iran, Islamic Republic of)
Pedram Bazrafshan - Drexel University
Arvin Ebrahimkhanlou - Drexel University
Contributors: Kara J. Peters (Editor) - North Carolina State University
Fabrizio Ricci (Editor) - Univ. degli Studi di Napoli Federico II (Italy)
Piervincenzo Rizzo (Editor) - University of Pittsburgh
Christoph Schaal (Editor) - California State University, Northridge
Publication Details: Proceedings of SPIE, the international society for optical engineering, v 13951, 139510U
Series: Proceedings of SPIE
Publisher: SPIE
Number of pages: 6
Resource Type: Conference proceeding
Language: English
Academic Unit: Civil, Architectural, and Environmental Engineering; Mechanical Engineering and Mechanics
Web of Science ID: WOS:001776710100020
Scopus ID: 2-s2.0-105040390009
Other Identifier: 991022180805704721
