Unified visual language modeling for zero-shot multitask inspection of civil infrastructure
Conference proceeding

Farzad Azizi Zade, Pedram Bazrafshan and Arvin Ebrahimkhanlou
Proceedings of SPIE, the International Society for Optical Engineering, v. 13951
16 Apr 2026

Abstract

This paper examines multitask computer vision for civil infrastructure maintenance by evaluating a unified visual language approach for captioning, semantic segmentation, and object detection on transportation infrastructure imagery. Two pretrained model variants (Florence-2 base and large) were tested in a zero-shot setting, using prompt-conditioned sequence-to-sequence processing and generating three captions per image as prompts for downstream tasks. Evaluation used average precision (AP) and average recall (AR) for open-vocabulary detection, caption-to-phrase grounding detection, and referring expression segmentation. Computational profiling on an NVIDIA T4 indicates that the base variant requires roughly 2 GB of GPU memory with runtimes from 0.2 to 2.3 seconds, while the large variant requires nearly 4 GB with runtimes from 0.4 to 3.9 seconds and higher CPU and GPU utilization. A complex scene revealed complementary strengths across the variants: the base variant succeeded at bridge segmentation where the large variant failed, while the large variant detected all small instances (e.g., graffiti on concrete) that the base variant missed. This study's prompt engineering, which compared various prompts, found that a geometry-focused prompt can optimize inspection outcomes, providing practical guidance for deploying these models in real-world infrastructure monitoring applications.
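As a rough illustration of the detection metrics named in the abstract, the sketch below computes precision and recall for one image at a single intersection-over-union (IoU) threshold. It is not the authors' evaluation code: the abstract does not state the AP/AR protocol, so the 0.5 threshold, the greedy matching rule, and the names `iou` and `precision_recall` are assumptions made for illustration. Full AP and AR typically average such values over several thresholds and images.

```python
from typing import List, Tuple

# A box is (x1, y1, x2, y2) in pixel coordinates, with x2 > x1 and y2 > y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    preds: List[Box], truths: List[Box], thr: float = 0.5
) -> Tuple[float, float]:
    """Greedy one-to-one matching of predictions to ground truth.

    Assumption: predictions are processed in the order given, and each
    ground-truth box can be matched at most once.
    """
    matched = set()  # indices of ground-truth boxes already claimed
    true_positives = 0
    for p in preds:
        best_j, best_iou = -1, 0.0
        for j, t in enumerate(truths):
            if j in matched:
                continue
            v = iou(p, t)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_j >= 0 and best_iou >= thr:
            matched.add(best_j)
            true_positives += 1
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(truths) if truths else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical boxes, not data from the paper.
    predicted = [(0, 0, 10, 10), (20, 20, 30, 30)]
    ground_truth = [(0, 0, 10, 10), (50, 50, 60, 60), (100, 100, 110, 110)]
    print(precision_recall(predicted, ground_truth))
```

In this hypothetical example, one of two predictions matches one of three ground-truth boxes, which is the kind of gap between precision and recall that a missed-small-instance case (such as undetected graffiti) would produce.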
