by VisualData
License: Unknown
This project presents SLAC (Sparsely Labeled ACtions), a novel video dataset for action recognition and localization. It consists of over 520K untrimmed videos and 1.75M clip annotations spanning 200 action categories. Using our proposed framework, annotating a video clip takes merely 8.8 seconds on average, a saving of over 95% in labeling time compared to the traditional procedure of manually trimming and localizing actions. We show that this large-scale dataset can be used to effectively pretrain action recognition and detection models, significantly improving final metrics after fine-tuning on smaller-scale benchmarks such as HMDB-51, UCF-101, ActivityNet, and Kinetics.