Visual Goal-Step Inference using wikiHow

Yue Yang; Artemis Panagopoulou; Qing Lyu; Li Zhang; Mark Yatskar; Chris Callison-Burch

doi:10.18653/v1/2021.emnlp-main.165

Conference proceeding

Open access

2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), pp 2167-2179

01 Jan 2021

Files and links (1)

URL: https://doi.org/10.18653/v1/2021.emnlp-main.165

Published, Version of Record (VoR) Open

Subjects: Computer Science, Artificial Intelligence; Computer Science, Interdisciplinary Applications; Linguistics; Science & Technology; Computer Science; Social Sciences; Technology

Abstract

Understanding what sequence of steps is needed to complete a goal can help artificial intelligence systems reason about human activities. Past work in NLP has examined the task of goal-step inference for text; we introduce its visual analogue. We propose the Visual Goal-Step Inference (VGSI) task, in which a model is given a textual goal and must choose which of four images represents a plausible step towards that goal. Using a new dataset harvested from wikiHow, consisting of 772,277 images representing human actions, we show that our task is challenging for state-of-the-art multimodal models. Moreover, the multimodal representation learned from our data can be effectively transferred to other datasets such as HowTo100M, increasing VGSI accuracy by 15-20%. Our task will facilitate multimodal reasoning about procedural events.
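Because VGSI is posed as a four-way multiple-choice question, its evaluation reduces to "pick one of four images, then count how often the pick is the true step." The sketch below illustrates that scoring loop under stated assumptions: it assumes some model has already produced an embedding for the goal text and for each candidate image, and it chooses the image with the highest cosine similarity to the goal. The names `cosine`, `choose_step`, `vgsi_accuracy`, and the toy 2-D vectors are hypothetical illustrations, not the authors' code or models.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical example type: (goal embedding, four candidate image embeddings, index of the true step)
Example = Tuple[Sequence[float], List[Sequence[float]], int]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def choose_step(goal: Sequence[float], candidates: List[Sequence[float]]) -> int:
    """Pick the candidate image whose embedding is most similar to the goal text."""
    scores = [cosine(goal, c) for c in candidates]
    return max(range(len(scores)), key=lambda i: scores[i])


def vgsi_accuracy(examples: List[Example]) -> float:
    """Fraction of four-way questions where the chosen image is the true step."""
    correct = 0
    for goal, candidates, gold in examples:
        if len(candidates) != 4:
            raise ValueError("VGSI questions have exactly four candidate images")
        correct += int(choose_step(goal, candidates) == gold)
    return correct / len(examples)


# Toy 2-D "embeddings" standing in for real text/image encoder outputs.
examples: List[Example] = [
    ([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0], [0.5, 0.5]], 1),  # picks 1: correct
    ([0.0, 1.0], [[1.0, 0.0], [0.0, -1.0], [0.6, 0.8], [0.1, 0.2]], 2),  # picks 3: wrong
]
print(vgsi_accuracy(examples))  # 0.5
```

With four candidates, a random guesser scores about 25% on average, which is the natural floor against which any model's VGSI accuracy can be read. Cosine similarity is only one plausible scoring choice for a dual-encoder setup; other models could instead score each goal-image pair jointly.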

Metrics

9 Record Views

11 citations in Web of Science

18 citations in Scopus

Details

Title: Visual Goal-Step Inference using wikiHow
Creators: Yue Yang - University of Pennsylvania
Artemis Panagopoulou - University of Pennsylvania
Qing Lyu - University of Pennsylvania
Li Zhang - University of Pennsylvania
Mark Yatskar - University of Pennsylvania
Chris Callison-Burch - University of Pennsylvania
Publication Details: 2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), pp 2167-2179
Publisher: Association for Computational Linguistics
Number of pages: 13
Grant note: 2019-19051600004 / IARPA BETTER Program FA8750-19-2-0201 / DARPA LwLL Program FA8750-19-2-1004 / DARPA KAIROS Program; United States Department of Defense
Resource Type: Conference proceeding
Language: English
Academic Unit: Computer Science
Web of Science ID: WOS:000855966302024
Scopus ID: 2-s2.0-85127279371
Other Identifier: 991022123344004721

InCites Highlights

Data related to this publication, from InCites Benchmarking & Analytics tool:

Web of Science research areas: Computer Science, Artificial Intelligence; Computer Science, Interdisciplinary Applications; Linguistics
