· Journal: IET Computer Vision, Vol. 15, No. 1, 2021, https://doi.org/10.1049/cvi2.12013
· Paper title: "DVC-Net: A deep neural network model for dense video captioning"
· Authors: 이수진, 김인철
· Abstract: Dense video captioning (DVC) detects multiple events in an input video and generates a natural-language sentence describing each event. Previous studies predominantly used convolutional neural networks to extract visual features from videos, but failed to employ high-level semantics (people, objects, actions, and places) to effectively explain video content, and exploited only limited context information when generating language. To overcome these deficiencies, DVC-Net is proposed: a new deep neural network model that uses high-level semantics, in addition to visual features, to efficiently represent important events. DVC-Net also uses a bidirectional long short-term memory (LSTM) network, a type of recurrent neural network, to detect events over time, and it applies an attention mechanism and context gating to effectively exploit context information in the caption-generation step. In experiments against state-of-the-art models, DVC-Net achieved absolute gains of 1.72 points in BLEU@1 (from 12.22 to 13.94) on ActivityNet Captions and 3.19 points in CIDEr (from 12.61 to 15.80) on MSR-VTT.
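The "context gating" mentioned in the abstract can be illustrated as a learned sigmoid gate that re-weights a fused context vector element-wise, letting the decoder suppress or emphasize individual feature dimensions. The sketch below is a minimal numpy illustration of that general idea, not the paper's actual implementation; all dimensions, weights, and function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    # Numerically standard logistic function; outputs lie in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def context_gating(features, W, b):
    # Learn an element-wise gate from the features themselves,
    # then scale each feature dimension by its gate value.
    gate = sigmoid(features @ W + b)
    return gate * features

# Toy fused context vector (dimension chosen arbitrarily for the sketch).
rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d)          # fused visual/semantic context features
W = rng.standard_normal((d, d))     # gating weights (learned in practice)
b = np.zeros(d)                     # gating bias

y = context_gating(x, W, b)
print(y.shape)  # same shape as the input features
```

Because each gate value is strictly between 0 and 1, the gated output never exceeds the input magnitude in any dimension, which is the mechanism's intended "soft feature selection" behavior.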