Graduate Thesis Or Dissertation


Multi-Modal Semantic Role Labeling and Its Application
  • Semantic Role Labeling (SRL) has garnered significant interest in the field of Natural Language Processing. It aims to capture the underlying predicate-argument structure of a sentence. Accurate SRL has led to improvements in various natural language processing applications, including question answering, information extraction, summarization, and dialog understanding (Tur et al., 2005; Chen et al., 2013; Wang et al., 2015; Bazrafshan and Gildea, 2013). The advent of neural networks has notably improved the accuracy of automatic SRL.

    Similarly, neural networks have revolutionized computer vision tasks such as object detection and localization, automatic caption generation, and visual question answering (Ren et al., 2015; Zhang et al., 2021a; Tan and Bansal, 2019; Li et al., 2020). The availability of internet-scale data, transformer-based vision-language models, and large language models (LLMs) has played a pivotal role in the success of vision-language tasks (Su et al., 2020; Zhang et al., 2021a; Li et al., 2020, 2023; Alayrac et al., 2022; Zhu et al., 2023). Weak supervision through attention mechanisms and transfer learning has been instrumental in achieving these results. However, amid this recent progress in LLMs and vision-language models, attention to more fundamental topics, such as semantics, has somewhat diminished.

    To gain a comprehensive understanding of an image, it is essential to grasp the roles played by the objects within the scene. In other words, the semantic role of an object can provide richer information about the image. Unfortunately, the potential advantages of incorporating visual semantic role labeling remain largely unexplored. We therefore conduct an in-depth study of SRL in the domain of computer vision. In this thesis, we examine the benefits of using SRL in multimodal tasks, such as image captioning and retrieval. Furthermore, we propose a cross-modal annotation projection-based approach for visual semantic role labeling, leveraging the progress of SRL in the text domain.

Date Issued
  • 2023-07-29
Last Modified
  • 2024-01-18