Unsupervised Representation Learning by Sorting Sequences
The success of DNNs is primarily driven by supervised learning on large collections (millions) of manually annotated data such as ImageNet. However, this approach substantially limits scalability to new problem domains because manual annotations are often expensive and in some cases scarce (e.g., labeling medical images). A new paradigm, called self-supervised learning, seeks to leverage the vast amount of freely available unlabeled images and videos to learn powerful, general-purpose representations. In such frameworks, a surrogate supervisory task is defined that exploits the inherent structure of the data, together with a discriminative or reconstruction loss function to train the network. Examples include predicting the relative positions of patches, reconstructing missing pixel values conditioned on the known surroundings, or predicting one subset of the data channels from another (e.g., predicting color channels from a grayscale image) [19, 42, 43]. In this work, we propose a surrogate task for self-supervised learning using a large collection of unlabeled videos. Given a tuple of randomly shuffled frames, we train a neural network to sort the images into chronological order. The learned representations are shown to be competitive with the state of the art on high-level recognition problems such as object detection, classification, and human action recognition.
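The sequence-sorting surrogate task can be sketched as a classification problem: sample a tuple of frames in chronological order, apply a random permutation, and use the index of that permutation as a free label. The sketch below illustrates this idea; the function name, tuple length, and sampling scheme are assumptions for illustration, not the paper's exact procedure.

```python
import itertools
import random

def make_training_example(frames, tuple_len=4):
    """Sample `tuple_len` consecutive frames, then shuffle them.

    Returns the shuffled tuple and the index of the applied
    permutation, which serves as the self-supervised label.
    (Illustrative sketch; not the authors' exact sampling scheme.)
    """
    # All orderings of the tuple form the classifier's label space
    # (4! = 24 classes for tuple_len = 4).
    perms = list(itertools.permutations(range(tuple_len)))
    # Pick a chronologically ordered window of frames.
    start = random.randrange(len(frames) - tuple_len + 1)
    ordered = frames[start:start + tuple_len]
    # Apply a random permutation; its index is the target class.
    label = random.randrange(len(perms))
    shuffled = [ordered[i] for i in perms[label]]
    return shuffled, label

# Usage: a list of integers stands in for 10 video frames.
video = list(range(10))
shuffled, label = make_training_example(video)
# A network would take `shuffled` as input and predict `label`.
```

Since the label encodes the permutation, a network that predicts it correctly has implicitly recovered the chronological order, which requires reasoning about motion and temporal structure rather than low-level appearance alone.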