Self Supervision Does Not Help Natural Language Supervision at Scale

Self supervision and natural language supervision have emerged as two exciting ways to train general purpose image encoders which excel at a variety of downstream tasks. Recent works such as M3AE [31] and SLIP [64] have suggested that these approaches can be effectively combined, but most notably their results use small (100M samples) that is commonly used for these approaches. Here we investigate whether a similar approach can be effective when trained with a much larger amount of data. We find…Apple Machine Learning Research