Subjects
Deep learning (Machine learning); Natural language processing (Computer science); Computer vision; Machine learning
Abstract
Just a few years ago, most vision-language modeling techniques consisted of individual processing branches for each modality along with a fusion layer to combine them. While the individual branches frequently leveraged unimodal pre-trained models, the fusion component had to be learned from scratch. This was especially problematic in limited data scenarios, where there was not enough data to learn a robust multimodal feature space. The introduction of large-scale language-image pre-training changed this: few-shot, and even zero-shot, settings are now feasible for a number of standard downstream vision-language tasks. However, many tasks are not easily compatible with these foundation models out-of-the-box, whether due to domain shift or a fundamental limitation of the model. In such scenarios, either the model must be fine-tuned, potentially destroying its existing weights and degrading its generalizability, or new modules must be added and trained from scratch, requiring a substantial number of training examples to learn the new parameters.

In this work, we set out to overcome these issues, proposing an alternative paradigm for leveraging foundation models in limited data scenarios that they are ill-suited for out-of-the-box, effectively utilizing their extensive knowledge without destroying their weights or adding new parameters. This enables the impressive few-shot performance of these models to be applied to new tasks.

We begin by exploring the development of multimodal learning techniques, from earlier fusion-based models to more recent foundation models, with a particular emphasis on their ability to leverage limited data. We then set aside multimodality and turn our attention to a unimodal task: ultrasound video classification. We present a framework for learning robust models from just a few dozen training examples by leveraging subject matter expert knowledge, demonstrating the power of more traditional, customized learning approaches over naively applying large pre-trained models to the task. Lastly, we combine these two lines of work by treating foundation models as the "experts" in order to train object detection models on noisy data in limited data settings. Foundation models are not particularly well suited to this task out-of-the-box, yet we demonstrate that applying segmentation and vision-language foundation models under this paradigm obtains strong results even under severe noise.
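To make the older fusion-based architecture in the abstract concrete, below is a minimal PyTorch sketch: two pre-trained unimodal encoders whose outputs are combined by a randomly initialized fusion head. The class name, layer sizes, and encoder interfaces are illustrative assumptions, not details taken from the dissertation.

import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Hypothetical two-branch vision-language model: pre-trained
    unimodal encoders plus a fusion head learned from scratch."""

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 img_dim: int, txt_dim: int, num_classes: int):
        super().__init__()
        self.image_encoder = image_encoder  # e.g. a pre-trained CNN or ViT
        self.text_encoder = text_encoder    # e.g. a pre-trained language model
        # The fusion component has no pre-trained counterpart, so it is
        # randomly initialized; this is the limited-data bottleneck the
        # abstract describes.
        self.fusion = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, image: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        img_feat = self.image_encoder(image)             # (B, img_dim)
        txt_feat = self.text_encoder(text)               # (B, txt_dim)
        fused = torch.cat([img_feat, txt_feat], dim=-1)  # (B, img_dim + txt_dim)
        return self.fusion(fused)                        # (B, num_classes) logits

By contrast, contrastively pre-trained language-image models such as CLIP score an image against text prompts directly, so standard classification requires no newly initialized fusion parameters, which is what makes the zero-shot and few-shot settings mentioned above feasible.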
Details
Title
Leveraging Multiple Modalities and Expert Knowledge for Limited Data Scenarios
Creators
Darryl Hannan
Contributors
Edward Kim (Advisor)
Awarding Institution
Drexel University
Degree Awarded
Doctor of Philosophy (Ph.D.)
Publisher
Drexel University; Philadelphia, Pennsylvania
Number of pages
x, 86 pages
Resource Type
Dissertation
Language
English
Academic Unit
Computer Science (Computing) [Historical]; College of Computing and Informatics (2013-2026); Drexel University