Dissertation
Leveraging multiple modalities and expert knowledge for limited data scenarios
Doctor of Philosophy (Ph.D.), Drexel University
Aug 2024
DOI: https://doi.org/10.17918/00010747
Abstract
Just a few years ago, most vision-language modeling techniques consisted of individual processing branches for each modality along with a fusion layer to combine them. While the individual branches frequently leveraged unimodal pre-trained models, the fusion component had to be learned from scratch. This was especially problematic in limited data scenarios, where there is not enough data to develop a robust multimodal feature space. The introduction of large-scale language-image pre-training techniques changed this: few-shot, or even zero-shot, settings are now feasible for a number of standard downstream vision-language tasks. However, many tasks are not easily compatible with these foundation models out-of-the-box, either due to domain shift or a fundamental limitation of the model. In such scenarios, either the model needs to be fine-tuned, potentially destroying the existing weights and ruining its generalizability, or new modules must be added to the model and trained from scratch, requiring a substantial number of training examples to learn this new set of parameters. In this work, we set out to overcome these issues, proposing an alternative paradigm for leveraging foundation models in limited data scenarios for which they are ill-suited out-of-the-box, effectively utilizing their extensive knowledge without destroying their weights or adding new parameters. This enables the impressive few-shot performance of these models to be applied to new tasks. We begin by exploring the development of multimodal learning techniques, from earlier fusion-based models to more recent foundation models, with a particular emphasis on their ability to leverage limited data. Then, we set aside multimodality and turn our attention to a unimodal task, ultrasound video classification.
We present a framework for learning robust models, using just a few dozen training examples, by leveraging subject matter expert knowledge, demonstrating the power of more traditional customized learning approaches over naively applying large pre-trained models to the task. Lastly, we combine these two lines of work by treating foundation models as the "experts" to facilitate training object detection models on noisy data in limited data settings. This is a unique task that foundation models are not particularly well suited for out-of-the-box. Yet we demonstrate that applying segmentation and vision-language foundation models using this unique paradigm obtains strong results even under severe noise.
Metrics
87 File views/downloads
36 Record Views
Details
- Title: Leveraging multiple modalities and expert knowledge for limited data scenarios
- Creators: Darryl Hannan
- Contributors: Edward Kim (Advisor)
- Awarding Institution: Drexel University
- Degree Awarded: Doctor of Philosophy (Ph.D.)
- Publisher: Drexel University; Philadelphia, Pennsylvania
- Number of pages: x, 86 pages
- Resource Type: Dissertation
- Language: English
- Academic Unit: Computer Science (Computing) (2013-2026); College of Computing and Informatics (2013-2026); Drexel University
- Other Identifier: 991021906712704721