Audio-Visual Event (AVE) Dataset

by Yapeng Tian
Research Only

We introduce a novel problem: audio-visual event localization in unconstrained videos. We define an audio-visual event as an event that is both visible and audible in a video segment. We collect the Audio-Visual Event (AVE) dataset to systematically investigate three temporal localization tasks: supervised audio-visual event localization, weakly-supervised audio-visual event localization, and cross-modality localization. The AVE dataset contains 4143 videos covering 28 event categories, and each video is temporally labeled with audio-visual event boundaries.
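As a rough illustration of what temporal event boundaries mean for localization, the sketch below converts one event's start and end time into per-segment labels. This page does not describe the annotation format, so everything here is an assumption made for illustration: the 10-second clip length, the 1-second segment length, and the helper name `segment_labels` are all hypothetical.

```python
from typing import List


def segment_labels(start: float, end: float,
                   num_segments: int = 10,
                   segment_len: float = 1.0) -> List[int]:
    """Mark each fixed-length segment that overlaps the event [start, end).

    Assumption: clips are split into `num_segments` segments of
    `segment_len` seconds each. These defaults are illustrative only.
    """
    labels = []
    for i in range(num_segments):
        seg_start = i * segment_len
        seg_end = seg_start + segment_len
        # A segment is positive if it overlaps the event interval at all.
        overlaps = seg_start < end and seg_end > start
        labels.append(1 if overlaps else 0)
    return labels


# Hypothetical event that is audible and visible from 2 s to 6 s.
print(segment_labels(2.0, 6.0))  # [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
```

In a supervised setting, a model would predict one such label per segment. A weakly-supervised setting would only have access to the video-level category during training.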

Dataset Attributes

Tasks: Audio-guided Visual Attention
Categories: Sounds, Music, Noise
Sensor: RGB Camera, Audio

Class Labels

Church bell, Man speaking, Dog barking, Airplane, Racing car, Woman speaking, Helicopter, Violin, Flute, Ukulele, Frying food, Truck, Shofar, Motorcycle, Guitar, Train, Clock, Banjo, Goat, Baby crying, Bus, Chainsaw, Cat, Horse, Toilet flush, Rodent, Accordion