Conference proceeding
Visual Goal-Step Inference using wikiHow
2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), pp. 2167-2179
01 Jan 2021
Abstract
Understanding which sequence of steps is needed to complete a goal can help artificial intelligence systems reason about human activities. Past work in NLP has examined the task of goal-step inference for text. We introduce the visual analogue. We propose the Visual Goal-Step Inference (VGSI) task, where a model is given a textual goal and must choose which of four images represents a plausible step towards that goal. With a new dataset harvested from wikiHow consisting of 772,277 images representing human actions, we show that our task is challenging for state-of-the-art multimodal models. Moreover, the multimodal representation learned from our data can be effectively transferred to other datasets like HowTo100m, increasing the VGSI accuracy by 15-20%. Our task will facilitate multimodal reasoning about procedural events.
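As an illustration of the four-way multiple-choice format described in the abstract, the following is a minimal sketch, not the authors' model. It assumes the goal text and each candidate image have already been mapped to vectors in a shared space by some encoder (not shown), and it uses cosine similarity as a stand-in scoring function. The names `cosine`, `choose_step`, `accuracy`, and the toy `examples` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def choose_step(goal_vec: Vector, image_vecs: Sequence[Vector]) -> int:
    """Return the index of the candidate image that best matches the goal.

    Assumption: embeddings come from some multimodal encoder; scoring by
    cosine similarity is an illustrative choice, not the paper's method.
    """
    if len(image_vecs) != 4:
        raise ValueError("VGSI presents exactly four candidate images")
    scores = [cosine(goal_vec, img) for img in image_vecs]
    return max(range(4), key=scores.__getitem__)


def accuracy(examples: List[Tuple[Vector, Sequence[Vector], int]]) -> float:
    """Fraction of examples where the chosen image is the labeled step."""
    correct = sum(choose_step(goal, imgs) == label for goal, imgs, label in examples)
    return correct / len(examples)


# Toy data: (goal embedding, four image embeddings, index of the true step).
examples = [
    ([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.1, 0.9]], 0),
    ([0.0, 1.0], [[1.0, 0.0], [0.2, 0.8], [0.5, 0.5], [-1.0, 0.0]], 1),
]
print(accuracy(examples))
```

With four candidates per goal, uniform random guessing gives an expected accuracy of 25%, which is the baseline any such scorer is compared against.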
Details
- Title
- Visual Goal-Step Inference using wikiHow
- Creators
- Yue Yang - University of Pennsylvania; Artemis Panagopoulou - University of Pennsylvania; Qing Lyu - University of Pennsylvania; Li Zhang - University of Pennsylvania; Mark Yatskar - University of Pennsylvania; Chris Callison-Burch - University of Pennsylvania
- Publication Details
- 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), pp. 2167-2179
- Publisher
- Association for Computational Linguistics
- Number of pages
- 13
- Grant note
- 2019-19051600004 / IARPA BETTER Program FA8750-19-2-0201 / DARPA LwLL Program FA8750-19-2-1004 / DARPA KAIROS Program; United States Department of Defense
- Resource Type
- Conference proceeding
- Language
- English
- Academic Unit
- Computer Science
- Web of Science ID
- WOS:000855966302024
- Other Identifier
- 991022123344004721
InCites Highlights
Data related to this publication, from the InCites Benchmarking & Analytics tool:
- Web of Science research areas
- Computer Science, Artificial Intelligence
- Computer Science, Interdisciplinary Applications
- Linguistics