
TPE-CPL: trainable positional embedding-based contrastive proposal learning for weakly supervised video grounding

Author: Richard Hua, Yantao Wang, Jason Zou, Allen Jiang, Sunny Kim and Jong Wook Lee
Page No: 4347-4348

As videos on the internet become increasingly common, understanding their content is necessary for recognizing important human actions or highlights. Moreover, videos paired with text that describes their key points have encouraged research on video grounding (Gao et al., 2017). Video grounding, an important task with many applications such as video surveillance, aims to find a grounding location: a video segment in a long, untrimmed video that semantically corresponds to a query sentence. Recently, weakly supervised video grounding (Zheng et al., 2022) has drawn increasing attention because of its low annotation cost. In this setting, the ground-truth grounding location is unavailable during training; only matched pairs of video and query sentence are provided. In this paper, we propose Trainable Positional Embedding (TPE)-based contrastive proposal learning for weakly supervised video grounding. The previous contrastive proposal learning method (Zheng et al., 2022) leverages several Gaussian masks as positive proposals for finding grounding locations. However, that method uses a predefined sinusoidal positional embedding, which is ineffective because it cannot adapt to the varying positional information of words in the query sentence. To address this problem, we leverage a trainable positional embedding for contrastive proposal learning. Quantitative experiments verify that the proposed method improves performance, outperforming previous state-of-the-art methods.
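The abstract does not give implementation details; as a rough illustration only, the following PyTorch sketch shows what replacing a fixed sinusoidal table with a trainable positional embedding might look like, alongside a soft Gaussian mask of the kind used as proposals in contrastive proposal learning (Zheng et al., 2022). The module names, shapes, and default sizes here are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn


class TrainablePositionalEmbedding(nn.Module):
    """Learned per-position embeddings (hypothetical sketch).

    Replaces a predefined sinusoidal table with one trainable vector
    per word position, updated by backpropagation; max_len and d_model
    are assumed defaults, not the paper's configuration.
    """

    def __init__(self, max_len: int = 64, d_model: int = 512):
        super().__init__()
        self.pos_embed = nn.Embedding(max_len, d_model)

    def forward(self, word_feats: torch.Tensor) -> torch.Tensor:
        # word_feats: (batch, num_words, d_model) query-sentence features
        positions = torch.arange(word_feats.size(1), device=word_feats.device)
        return word_feats + self.pos_embed(positions)


def gaussian_proposal_mask(center: torch.Tensor,
                           width: torch.Tensor,
                           num_frames: int) -> torch.Tensor:
    """Soft Gaussian mask over frames, a candidate grounding proposal.

    center, width: (batch,) values in (0, 1) predicted by the model.
    Returns a (batch, num_frames) per-frame weighting.
    """
    # Normalized frame positions in [0, 1].
    t = torch.linspace(0, 1, num_frames, device=center.device)
    return torch.exp(-((t.unsqueeze(0) - center.unsqueeze(1)) ** 2)
                     / (2 * width.unsqueeze(1) ** 2 + 1e-8))
```

In this sketch, the trainable table lets gradients reshape the positional signal to reflect which word positions matter in a given query, which is the adaptivity the fixed sinusoidal embedding lacks.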
