Contrastive Code Representation Learning

Paras Jain*
UC Berkeley
Ajay Jain*
UC Berkeley
Tianjun Zhang
UC Berkeley
Pieter Abbeel
UC Berkeley
Joseph Gonzalez
UC Berkeley
Ion Stoica
UC Berkeley
Conceptual overview of ContraCode. For many machine-aided programming tasks, programs with the same functionality should have the same underlying representation. ContraCode learns such representations with contrastive learning: the network is trained to find equivalent programs among many distractors.

Summary

  • Developer tools increasingly use machine learning to understand and modify human-written code.
  • For the best code understanding results, we hypothesize that learned code representations should be similar for functionally equivalent programs and dissimilar for non-equivalent programs.
  • We propose ContraCode: a methodology to learn similar representations for functionally equivalent programs through contrastive pre-training.
  • During pre-training, we apply compiler transformations to generate (approximately) equivalent, textually divergent batches of programs.
  • Finetuned models improve automated code summarization and type inference in JavaScript.

Abstract

Machine-aided programming tools such as type predictors and code summarizers are increasingly learning-based. However, most code representation learning approaches rely on supervised learning with task-specific annotated datasets. We propose Contrastive Code Representation Learning (ContraCode), a self-supervised algorithm for learning task-agnostic semantic representations of programs via contrastive learning. Our approach uses no human-provided labels, relying only on the raw text of programs. In particular, we design an unsupervised pretext task by generating textually divergent copies of source functions via automated source-to-source compiler transforms that preserve semantics. We train a neural model to identify variants of an anchor program within a large batch of negatives. To solve this task, the network must extract program features representing the functionality, not form, of the program. This is the first application of instance discrimination to code representation learning to our knowledge. We pre-train models over 1.8m unannotated JavaScript methods mined from GitHub. ContraCode pre-training improves code summarization accuracy by 7.9% over supervised approaches and 4.8% over RoBERTa pre-training. Moreover, our approach is agnostic to model architecture; for a type inference task, contrastive pre-training consistently improves the accuracy of existing baselines.


Compiler transforms for code data augmentation

Finding equivalent programs in a dataset is challenging. In computer vision, random crops of a source image are frequently used as "equivalent" views of it for augmenting training sets or for unsupervised pre-training. However, analogous data augmentations are harder to define for natural and programming languages. We propose to use automated source-to-source compiler transformations to generate augmentations of programs that preserve their functionality. These transforms include dead code elimination, variable renaming, and constant folding. We also explore lossy transforms, like code deletion, that preserve only part of the program's semantics.

An example JavaScript method from the unlabeled GitHub training set and two semantically equivalent programs. The equivalent programs were automatically generated through compiler transformations, serving as "augmentations" or "views" of the original program.
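As a rough illustration of the idea (not the actual pipeline, which applies compiler-level transforms rather than text manipulation), the Python sketch below renames declared variables in a JavaScript snippet, producing a textually different but functionally equivalent view of the program. The regex heuristic and the pool of fresh names are toy choices for illustration only.

import random
import re

def rename_variables(js_source, fresh_names=("a", "b", "c", "tmp")):
    # Collect identifiers introduced by var/let/const declarations (toy heuristic;
    # a real transform would operate on the AST via a JavaScript compiler).
    declared = list(dict.fromkeys(
        re.findall(r"\b(?:var|let|const)\s+([A-Za-z_$][\w$]*)", js_source)))
    # Map each declared identifier to a fresh name (collisions ignored in this toy).
    mapping = dict(zip(declared, random.sample(fresh_names, len(declared))))
    for old, new in mapping.items():
        js_source = re.sub(rf"\b{re.escape(old)}\b", new, js_source)
    return js_source

original = "function sum(xs) { let total = 0; for (const x of xs) total += x; return total; }"
print(rename_variables(original))
# e.g. function sum(xs) { let c = 0; for (const b of xs) c += b; return c; }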

Contrastive pre-training

Contrastive Code Representation Learning (ContraCode) is a pretext representation learning task that uses these code augmentations to construct a challenging discriminative objective: the model must identify equivalent programs among a large set of distractors. To solve this task, it has to embed the functionality, not the form, of the code. In essence, the domain knowledge encoded in our compiler transformations instills knowledge of program structure in the learned representations.

ContraCode extends the Momentum Contrast vision pretraining framework to learn an encoder of programs from a database of unlabeled programs and a suite of semantics-preserving transformations.
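As a concrete sketch of the objective (our reading of the MoCo-style setup, not the authors' exact implementation), the PyTorch snippet below computes an InfoNCE loss: each query embedding should match the embedding of a transformed copy of the same program (its positive key) rather than any entry in a large queue of embeddings of other programs (negatives). The batch size, queue length, embedding dimension, and temperature below are placeholder values.

import torch
import torch.nn.functional as F

def moco_contrastive_loss(q, k, queue, temperature=0.07):
    # q, k: L2-normalized embeddings of two transformed views of the same programs, shape (N, D).
    # queue: embeddings of previously seen (negative) keys, shape (K, D).
    l_pos = torch.einsum("nc,nc->n", q, k).unsqueeze(-1)   # (N, 1) similarity to positive key
    l_neg = torch.einsum("nc,kc->nk", q, queue)            # (N, K) similarity to queued negatives
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)      # positive key sits at index 0
    return F.cross_entropy(logits, labels)

# Toy usage with random embeddings standing in for encoder outputs.
N, K, D = 8, 1024, 128
q = F.normalize(torch.randn(N, D), dim=1)      # queries from the trained encoder
k = F.normalize(torch.randn(N, D), dim=1)      # keys from the momentum encoder
queue = F.normalize(torch.randn(K, D), dim=1)  # dictionary of negative keys
print(moco_contrastive_loss(q, k, queue).item())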

Finetuning on downstream tasks

By learning functionality-based representations, a model pre-trained with ContraCode outperforms baselines that are trained from scratch or pre-trained with reconstruction objectives such as masked language modeling. We demonstrate these improvements by finetuning LSTM and Transformer models on type inference and code summarization tasks.

After finetuning, an LSTM pretrained with ContraCode predicts the argument and return types of an untyped TypeScript method correctly, which can be useful for developers. A finetuned model can also predict the name of a method from its body, a form of code summarization that demonstrates understanding of the code and could be useful for deobfuscation.
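A minimal sketch of the finetuning setup, assuming a pre-trained per-token program encoder whose weights come from contrastive pre-training and a small classification head for type prediction. The ProgramEncoder class, vocabulary size, and number of type labels are illustrative placeholders, not the paper's actual modules.

import torch
import torch.nn as nn

class TypeInferenceModel(nn.Module):
    def __init__(self, encoder, hidden_dim, num_types):
        super().__init__()
        self.encoder = encoder                    # initialized from pre-trained weights
        self.head = nn.Linear(hidden_dim, num_types)

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)          # (batch, seq_len, hidden_dim)
        return self.head(hidden)                  # per-token type logits

class ProgramEncoder(nn.Module):
    # Stand-in encoder: a 2-layer LSTM over token embeddings.
    def __init__(self, vocab_size=8000, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, num_layers=2, batch_first=True)

    def forward(self, token_ids):
        out, _ = self.lstm(self.embed(token_ids))
        return out

model = TypeInferenceModel(ProgramEncoder(), hidden_dim=256, num_types=50)
logits = model(torch.randint(0, 8000, (4, 64)))   # shape (4, 64, 50)
print(logits.shape)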

Citation

Paras Jain*, Ajay Jain*, Tianjun Zhang, Pieter Abbeel, Joseph E. Gonzalez, Ion Stoica. Contrastive Code Representation Learning. In submission, 2020. * Denotes equal contribution.

@article{jain2020contrastive,
  title={Contrastive Code Representation Learning},
  author={Paras Jain and Ajay Jain and Tianjun Zhang
  and Pieter Abbeel and Joseph E. Gonzalez and Ion Stoica},
  year={2020},
  journal={arXiv preprint}
}