Vector Space Representations of Executable Code

Date

2017-01-01

Department

Computer Science and Electrical Engineering

Program

Computer Science

Rights

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Distribution Rights granted to UMBC by the author.

Abstract

Modeling executable code in a way that is amenable to machine learning and automated analysis is important for a variety of problems. Current solutions are frequently ad hoc, with hand-selected features and problem-specific models being the standard. Vector space models, by contrast, are frequently applied across a variety of problem areas. This work demonstrates a way to generate dense vector embeddings of executable functions based on their composition. The resulting vectors can be used to compare functions using standard distance metrics and are easily used in a variety of machine learning tasks. A new data set focused on building general-purpose representations of executable code, MAML, is used to build these models. Evaluating embeddings is currently an open area of research: vector space embeddings are considered good if they work for some specific task, but there are no standard criteria for evaluating general-purpose embeddings. We propose a set of criteria for evaluating generic code models in a standard way. On these evaluations, vector space models perform comparably with current state-of-the-art specialized models without needing specialized model development.
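
As an illustration of comparing functions with standard distance metrics, the following minimal sketch computes cosine similarity and Euclidean distance between dense vectors and looks up the nearest neighbor of a function. It is not taken from the work itself: the function names (func_a, func_b, func_c), the four-dimensional vectors, and the helper nearest are hypothetical placeholders, and real embeddings would come from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line (L2) distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical embeddings, made up for illustration only.
# They are not results from the work described in the abstract.
embeddings = {
    "func_a": [0.9, 0.1, 0.3, 0.0],
    "func_b": [0.8, 0.2, 0.4, 0.1],
    "func_c": [0.0, 0.9, 0.1, 0.7],
}


def nearest(name, table):
    """Return the name of the other entry most cosine-similar to `name`."""
    query = table[name]
    candidates = (other for other in table if other != name)
    return max(candidates, key=lambda other: cosine_similarity(query, table[other]))


if __name__ == "__main__":
    print(nearest("func_a", embeddings))
    print(euclidean_distance(embeddings["func_a"], embeddings["func_b"]))
```

Because the functions are represented as plain fixed-length vectors, the same table could be passed unchanged to a clustering or classification routine, which is the sense in which such vectors are convenient for downstream machine learning tasks.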