Machine-aided programming tools such as type predictors and code summarizers are increasingly learning-based. However, most code representation learning approaches rely on supervised learning with task-specific annotated datasets.
We propose Contrastive Code Representation Learning (ContraCode), a self-supervised algorithm for learning task-agnostic semantic representations of programs via contrastive learning. Our approach uses no human-provided labels, relying only on the raw text of programs.
In particular, we design an unsupervised pretext task by generating textually divergent copies of source functions via automated source-to-source compiler transforms that preserve semantics.
We train a neural model to identify variants of an anchor program within a large batch of negatives. To solve this task, the network must extract program features representing the functionality, not form, of the program.
To our knowledge, this is the first application of instance discrimination to code representation learning. We pre-train models over 1.8M unannotated JavaScript methods mined from GitHub. ContraCode pre-training improves code summarization accuracy by 7.9% over supervised approaches and 4.8% over RoBERTa pre-training.
Moreover, our approach is agnostic to model architecture; for a type inference task, contrastive pre-training consistently improves the accuracy of existing baselines.
Finding equivalent programs in a dataset is challenging. In computer vision, random crops of a source image are frequently used as "equivalent" views of it for augmenting training sets or for unsupervised pre-training. However, comparable augmentations are hard to define for natural and programming languages, where small textual edits can change meaning. We propose to use automated source-to-source compiler transformations to generate augmentations of programs that preserve functionality. These transforms include dead code elimination, variable renaming, and constant folding. We also explore lossy transforms, such as code deletion, that preserve only part of the program's semantics.
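As an illustration, the toy sketch below applies a variable-renaming augmentation to a small JavaScript function. It is only a regex-based stand-in for the AST-level compiler passes we actually use, and the `rename_variables` helper and sample function are hypothetical, but it shows the kind of textually divergent yet semantically equivalent variant these transforms produce.

```python
# Toy sketch of a semantics-preserving "variable renaming" augmentation.
# The real pipeline applies AST-level source-to-source compiler passes;
# this regex-based rename is only a simplified stand-in for illustration.
import re

def rename_variables(js_source: str) -> str:
    """Rename locally declared identifiers to fresh placeholder names."""
    # Toy heuristic: identifiers introduced by var/let/const declarations.
    declared = re.findall(r"\b(?:var|let|const)\s+([A-Za-z_$][\w$]*)", js_source)
    renamed = js_source
    for i, name in enumerate(dict.fromkeys(declared)):
        renamed = re.sub(rf"\b{re.escape(name)}\b", f"v{i}", renamed)
    return renamed

anchor = """
function totalPrice(items) {
  let total = 0;
  for (const item of items) { total += item.price; }
  return total;
}
"""
print(rename_variables(anchor))
# The output computes the same result as `anchor` but differs textually:
# `total` becomes `v0` and `item` becomes `v1`.
```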
Contrastive Code Representation Learning (ContraCode) uses these compiler-based augmentations to construct a challenging discriminative pretext task: identify functionally equivalent programs among a large set of distractors.
To solve it, the model must embed the functionality of the code rather than its surface form.
In essence, the domain knowledge captured by our compiler transformations imparts knowledge of program structure to the learned representations.
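For concreteness, here is a minimal in-batch sketch of such an instance-discrimination objective, written as an InfoNCE-style contrastive loss in PyTorch. The encoder interface, batch construction, and hyperparameters (e.g., the temperature) are illustrative assumptions rather than the exact ContraCode training setup, which may use a different negative-sampling scheme.

```python
# Minimal in-batch InfoNCE sketch for the contrastive pretext task.
# `encoder` stands in for any program encoder (e.g., an LSTM or Transformer
# over tokens) that maps a batch of programs to a (batch, dim) embedding.
import torch
import torch.nn.functional as F

def contrastive_loss(encoder, anchors, positives, temperature=0.07):
    """anchors, positives: two augmented views of the same batch of programs.

    Each anchor's positive is the matching row of `positives`; every other
    row in the batch acts as a negative (distractor).
    """
    z_a = F.normalize(encoder(anchors), dim=-1)    # (batch, dim)
    z_p = F.normalize(encoder(positives), dim=-1)  # (batch, dim)
    logits = z_a @ z_p.t() / temperature           # pairwise similarities
    targets = torch.arange(z_a.size(0), device=logits.device)
    # Cross-entropy pulls each anchor toward its own variant and pushes it
    # away from the other programs in the batch.
    return F.cross_entropy(logits, targets)
```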
By learning functionality-based representations, a model pre-trained with ContraCode outperforms baselines that are trained from scratch or pre-trained with reconstruction objectives such as masked language modeling. We demonstrate these improvements by fine-tuning LSTM and Transformer models on type inference and code summarization tasks.
Paras Jain*, Ajay Jain*, Tianjun Zhang, Pieter Abbeel, Joseph E. Gonzalez, Ion Stoica. Contrastive Code Representation Learning. In submission, 2020. * Denotes equal contribution.
@article{jain2020contrastive,
  title={Contrastive Code Representation Learning},
  author={Paras Jain and Ajay Jain and Tianjun Zhang and Pieter Abbeel and Joseph E. Gonzalez and Ion Stoica},
  year={2020},
  journal={arXiv preprint}
}