Representing Programs with Dependency and Function Call Graphs for
Learning Hierarchical Embeddings
Vitaly Romanov, Vladimir Ivanov and Giancarlo Succi
Innopolis University, Innopolis, Russia
Keywords:
Source Code, Embeddings, Hierarchical Embeddings, Graph, Dataset, Machine Learning, Python, Java.
Abstract:
Any source code can be represented as a graph. This kind of representation allows capturing the interaction
between the elements of a program, such as functions, variables, etc. Modeling these interactions can enable
us to infer the purpose of a code snippet, a function, or even an entire program. Lately, more and more
works have appeared where source code is represented in the form of a graph. One of the difficulties in evaluating the
usefulness of such a representation is the lack of a proper dataset and an evaluation metric. Our contribution is in
preparing a dataset that represents programs written in Python and Java in the form of dependency
and function call graphs. In this dataset, multiple projects are analyzed and united into a single graph. The
nodes of the graph represent the functions, variables, classes, methods, interfaces, etc. Nodes for functions
carry information about how these functions are constructed internally, and where they are called from. Such
graphs enable training hierarchical vector representations for source code. Moreover, some functions come
with textual descriptions (docstrings), which makes it possible to address useful tasks such as API search and generation
of documentation.
1 INTRODUCTION
Given the current rate of software development, it is
desirable to develop instruments that can assist the
development process and help to bring more quality
software to life. This is especially true for development tasks associated with source code documentation. Very often, parts of a software project lack necessary documentation. Combining text and source code analysis can enable search through un-
documented code or even facilitate the process of cre-
ating the documentation.
Unfortunately, modern instruments for software
analysis are limited. The reason for this is simply that
those tools are unable to understand the purpose of a program and, as a result, cannot provide relevant recommendations or feedback.
Another area that lacks proper instruments is the
creation of the documentation. Source code docu-
mentation is an important factor for software quality,
especially in the case of long-term development and
collaboration. Often, it is necessary to understand the
purpose of the program to get any valuable insight.
However, without proper documentation, this is sometimes hard even for experienced programmers.
A tool that can facilitate the interpretation of undocumented code will make this process easier.

Figure 1: Source code graph that captures the definition and usage of a simple Python class. The original source code is given in Listing 1.

Lately, there have been developments in the ap-
plications of statistical learning and deep learning
techniques for solving tasks that are related to pro-
gram interpretation. Such tasks include generation of
software description (Yao et al., 2019) (Matskevich
and Gordon, 2018) (Alon et al., 2018a) (Aghamoham-
madi et al., 2018), API search (Gu et al., 2018) (Hu-
sain et al., 2019), bug detection (Henkel et al., 2018)
(Li et al., 2019) (Pradel and Sen, 2018) and others.
These problems are hard to formalize, and there is a
general consensus that they are better addressed with
the use of deep learning techniques.
One of the possible improvements that can be
done in the area of source code modeling with deep
learning techniques is changing the representation of
the code. Oftentimes, the code is treated as a sequence
of tokens. Such representation loses a lot of useful
structure in the code that is available to a human since
the human knows programming language grammar.
Another, more natural approach to source code rep-
resentation, is with graphs (Alon et al., 2018b). Re-
cently, due to the advancements in the area of graph
modeling with deep learning, approaches that represent the source code in the form of graphs have started to emerge (Allamanis et al., 2017), (Alon et al., 2019),
(Nguyen et al., 2017).
Source code has an inherently hierarchical struc-
ture. A program consists of a sequence of function calls. A function can be implemented with the help of
other functions or simply using the standard library of
a programming language. Thus, any program can be
represented with a hierarchy of function calls. We be-
lieve that understanding how to model this hierarchy
will enable interpreting the program’s purpose.
Unfortunately, it is hard to evaluate how useful
a graph representation can be using only existing
datasets. Most published datasets aim at modeling the abstract syntax tree (AST) of a function without considering how this function is used in the rest of the code. We believe that, to understand the purpose of a function, one should inspect both how the function is used throughout the source code and the body of the function itself. Moreover, when a textual description of a function is added, a statistical model receives as much information as a programmer can receive. Considering these three properties of a
function, in our opinion, should give a full description
of its purpose, and enable learning more expressive
vector representations (embeddings) for source code.
The understanding of the purpose of a function en-
ables solving multiple tasks that can facilitate the soft-
ware development process and reduce the risk of er-
ror. The use of textual descriptions can enable the use
of natural language interfaces in the software devel-
opment process, such as API search. The contribution
of this paper is in preparing datasets for representing
source code for Python and Java as dependency and
function call graphs (available at https://github.com/VitalyRomanov/source-graph-dataset). Besides a more conventional
function call graph, this dataset contains information
about the dependencies of a particular function. The
list of dependencies can include types of variables, references to variables, or class fields. We plan to use
this dataset for learning hierarchical embeddings and
studying the application of such embeddings for solv-
ing different machine learning (ML) tasks for source
code.
The rest of the paper is organized as follows.
Section 2 describes the idea behind the proposed
approach. Section 3 explains how we created the
datasets for Python and Java. Section 4 explores the
related work. Section 5 concludes the paper.
2 METHOD DESCRIPTION
The approach is based on the idea that information
about the function body, how and where the function
is used, and the textual description of a function pro-
vide full information necessary to determine the pur-
pose of a function. We draw inspiration from the area
of natural language processing. Given that the source
code, written in any programming language, can be
treated as a form of language, we can apply existing
NLP techniques to create a language model (Hindle
et al., 2012). However, source code differs from natural language in several important ways.
First, the source code usually describes a sequence
of transformations. Unlike in natural language, where
there is a significant prevalence of nouns, source code
is mostly described by actions, and the intent of a pro-
gram is mostly defined through the composition of
these actions (Hindle et al., 2012).
The second difference is related to the language
vocabulary. Natural languages usually have an evolv-
ing but relatively stable vocabulary with a wide vari-
ety of concepts. In source code, on the other hand, vo-
cabulary can change significantly from program to program due to different naming of variables. Moreover, variables that are essentially the same can have different names in different scopes. In natural language, such situations are handled by co-reference resolution, which is a probabilistic technique. In source code, those references can, in most cases, be resolved in a deterministic way (at least at run time). This creates an oppor-
tunity to track how different variables and objects are
used across the entire program, discern between them,
and identify their purpose.
Despite their long history, many natural language models are limited in that they primarily operate on the token level. Modeling a programming language offers a unique opportunity to utilize the graph-like hierarchical structure of a program. It is well known that programs are often organized in a fashion that allows reusing the source code. In the
graph of an entire program, functions represent sub-
graphs.
Computing embeddings for subgraphs is an open
problem. Source code graphs allow addressing this
problem in an interesting way. The problem of learn-
ing the embeddings for subgraphs can be interpreted
as a problem of learning a composition function that
accepts nodes and edges as inputs and produces an
embedding for the entire subgraph. In the case of the
source code graph, the entire subgraph is addition-
ally represented as a single node (function) that has
connections with the rest of the source code. Being
able to learn the composition function allows find-
ing the representation for new functions that were
not present in the dataset, and, in general, opens up
an opportunity for more intuitive transfer learning on
source code, which is different from more conven-
tional sequence-to-sequence approaches for transfer
learning (Devlin et al., 2017).
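To make the idea of a composition function concrete, here is a minimal sketch, assuming node embeddings are already available and that a function node is linked to the nodes appearing in its body (for example, through Define/Contain edges); the class name, pooling choice, and tensor shapes are illustrative assumptions, not part of the dataset or of a fixed method:

import torch
import torch.nn as nn

class SubgraphComposer(nn.Module):
    """Composes an embedding for a function from the embeddings of
    the nodes that appear inside its body (its subgraph)."""
    def __init__(self, dim):
        super().__init__()
        self.project = nn.Linear(dim, dim)

    def forward(self, member_embeddings):
        # member_embeddings: (num_members, dim) -- embeddings of nodes
        # reachable from the function node through Define/Contain edges.
        pooled = member_embeddings.mean(dim=0)   # permutation-invariant pooling
        return torch.tanh(self.project(pooled))  # (dim,) embedding of the subgraph

Any permutation-invariant aggregator could play the same role; the point is only that the learned composition maps a subgraph to a single vector, so an unseen function can be embedded from its parts.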
In our opinion, the best way to measure how well
a model captures the purpose of a program or function
is by evaluating it on the tasks of annotation generation and API search. Both of these tasks require textual
descriptions attached to nodes (functions, modules) in
the graph. One of the advantages of textual descrip-
tions, often given as docstrings for functions, is that
they enable the use of more advanced and established
NLP techniques for comparing the similarity of text.
Thus, textual descriptions can be used as additional
regularization for learning graph embeddings. Nodes
that have similar textual descriptions should have sim-
ilar graph embeddings.
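A minimal sketch of such regularization, assuming a graph objective is already defined and that docstring similarity has been precomputed (for example, as cosine similarity of TF-IDF vectors); the function name and weighting below are assumptions for illustration:

import torch
import torch.nn.functional as F

def text_regularized_loss(graph_loss, emb_a, emb_b, text_sim, weight=0.1):
    """graph_loss: loss from the graph objective (e.g., link prediction).
    emb_a, emb_b: graph embeddings of two function nodes, shape (dim,).
    text_sim:     precomputed similarity of their docstrings in [0, 1].
    The regularizer pushes graph embeddings of textually similar
    functions to also be similar."""
    emb_sim = F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).squeeze()
    reg = (text_sim - emb_sim) ** 2
    return graph_loss + weight * reg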
With the progress of deep learning, vector representations have become a widely applied tool for solving a variety of ML tasks. The advantage of vector representations is that they automatically encode properties of an object that are useful for solving downstream tasks. Currently, there is a growing list of problems
that can be solved by learning vector representations
for source code. Some of them include:
Auto-completion and API Suggestion
In this task, a program tries to recommend additional API that can be useful for completing the current program. This task can be viewed as link prediction in graphs (a minimal scoring sketch is given after this list). This is also a variant of API search;

# ExampleModule2.py
from ExampleModule1 import ExampleClass

instance = ExampleClass(None)

def main():
    print(instance.method1())

main()


# ExampleModule1.py
class ExampleClass:
    def __init__(self, argument):
        self.field = argument

    def method1(self):
        return self.method2()

    def method2(self):
        variable1 = 2
        variable2 = str(2)
        return variable2

Listing 1: A simple class definition and its use in Python. The source code graph for this code is given in Figure 1.
API search
Usually involves searching for API using a query
in natural language. Embeddings for functions are
used for creating a search index and learning rank-
ing function;
Bug Detection
Involves finding bugs that can be revealed through statistical analysis of the code. Embeddings are used to represent elements of a program;
Generating Annotations
Generating textual descriptions for a function or for separate lines in the code;
Program Translation
Translation between programming languages.
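As a rough illustration of the link-prediction view of API suggestion mentioned in the first item of this list, a dot-product edge scorer trained with negative sampling over Call edges might look as follows; this is a sketch under assumed node-id encoding and hyperparameters, not the method proposed in this paper:

import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeScorer(nn.Module):
    """Scores whether a Call edge between two nodes is plausible."""
    def __init__(self, num_nodes, dim=128):
        super().__init__()
        self.emb = nn.Embedding(num_nodes, dim)

    def forward(self, src, dst):
        # src, dst: LongTensors of node ids; higher score = more plausible edge.
        return (self.emb(src) * self.emb(dst)).sum(dim=-1)

def loss_on_batch(model, src, dst, neg_dst):
    # Positive pairs come from observed call edges, negatives are sampled nodes.
    pos = F.logsigmoid(model(src, dst))
    neg = F.logsigmoid(-model(src, neg_dst))
    return -(pos + neg).mean()

Recommending an API for a partially written function then amounts to ranking candidate nodes by this score.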
By utilizing the hierarchical structure of a program,
we can regularize embeddings of functions to encode
the operations and other methods that are used inside
this function. Some work on modeling the hierarchy of graphs was done in (Ribeiro et al., 2017), (Chen et al., 2018), (Ying et al., 2018). At the same time, the de-
pendency links in the source code graph can be used
to learn how and where a specific function can be
used without even looking at the implementation of
this function.
Based on the ideas described above, we propose
a list of criteria that should be met in order for the
source code graph to be useful:
1. Presence of the Hierarchy. The graph should con-
tain a hierarchy of function calls. Otherwise, it
becomes hard to model how other functions are
used;
2. Clarity. Functions in the graph should mostly have a single purpose;
3. Multiple Usages. Functions should ideally be
reused from different parts of the code. Other-
wise, it becomes hard for a statistical model to
understand the purpose of the function;
4. Connectivity. Functions should be connected with
the rest of the program through input arguments
and return values (this property is not yet fully satisfied in the current version of the dataset).
All of the tasks described above require some addi-
tional level of understanding of the program’s pur-
pose. In our opinion, tasks like API search and code
annotation are the best for evaluating the ability of a
statistical model to capture the program’s purpose. If
a model can correctly generate different textual de-
scriptions for two similar functions, then it is able to
interpret the code. Similarly, one can evaluate how
well a model understands the source code using the
task of API search.
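One simple way to carry out such an API search evaluation, assuming query and function embeddings have already been trained by some model, is to rank functions by cosine similarity to the query and report recall@k; the sketch below makes that assumption explicit and leaves the encoders themselves out of scope:

import numpy as np

def recall_at_k(query_vecs, function_vecs, relevant_ids, k=10):
    """query_vecs:    (num_queries, dim) embeddings of natural-language queries.
    function_vecs: (num_functions, dim) embeddings of functions in the graph.
    relevant_ids:  for each query, the index of the correct function.
    Returns the fraction of queries whose correct function is in the top k."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    f = function_vecs / np.linalg.norm(function_vecs, axis=1, keepdims=True)
    scores = q @ f.T                             # cosine similarities
    top_k = np.argsort(-scores, axis=1)[:, :k]   # best k functions per query
    hits = [rel in top_k[i] for i, rel in enumerate(relevant_ids)]
    return float(np.mean(hits))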
3 DATASET DESCRIPTION
We constructed a source code graph with the help of
the open-source code-indexing tool Sourcetrail (www.sourcetrail.com). This tool is capable of creating a graph for
C, C++, Java, and Python. The main advantage of
Sourcetrail over its competitors is the possibility of
exporting nodes and edges of a source code graph
with minimal effort. Moreover, it provides a unified
interface for creating graphs for several programming
languages at once. This utility can capture different
kinds of relationships between source code units. The
source code units, such as modules, classes, meth-
ods, fields, and variables, are treated as nodes in the
graph. There are several types of relationships. The
list of available relationships depends on the program-
ming language. Although different kinds of relationships are present, our main goal is to capture the function call graph. Other types of connections can still be
useful for modeling the source code in the future. We
applied the source code indexing tool Sourcetrail to
a set of Python packages and their dependencies and
also to a set of open-source Java repositories. The details of the resulting datasets are given below.
Table 1: Node types present in Python source graph.
Node Type Count
Function 221822
Class field 83077
Class 35798
Module 18097
Class method 14953
Non-indexed symbol 853
Table 2: Edge count in Python graph by edge type.
Edge Type Count
Call 614621
Define/Contain 431115
Type Use 239543
Import 121752
Inheritance 26525
An example of a graph constructed in this way can be found in Figure 1.
The dataset construction process included several steps. First, the data was exported from the Sourcetrail format. The resulting graphs were then analyzed for connected components: only the largest component was kept, and the rest were filtered out. Indexing tools can sometimes produce erroneous edges, so the correctness of the graphs was verified manually on a small subsample of nodes and edges. With 99% confidence, the source graph for Python has less than 8% incorrect connections, and the Java graph less than 5%. The estimate was performed on a very small subsample and is expected to go down as more data is manually verified.
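As an illustration of these steps, the exported nodes and edges could be loaded and reduced to the largest connected component roughly as follows; the file names and column names are assumptions about the export format (they may differ in the actual dataset), and connectivity is computed treating edges as undirected:

import pandas as pd
import networkx as nx

# Hypothetical export files; actual names and columns depend on the dataset version.
nodes = pd.read_csv("nodes.csv")   # columns assumed: id, type, serialized_name
edges = pd.read_csv("edges.csv")   # columns assumed: type, source_node_id, target_node_id

g = nx.DiGraph()
g.add_nodes_from(nodes["id"])
g.add_edges_from(zip(edges["source_node_id"], edges["target_node_id"]))

# Keep only the largest weakly connected component, as described above.
largest = max(nx.weakly_connected_components(g), key=len)
g = g.subgraph(largest).copy()
print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")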
Python Dataset.
To create the Python source graph dataset, we cre-
ated a virtual environment and installed a collection
of popular packages, including their dependencies.
The collection of packages includes Bokeh, Django,
Fabric, Flask, Matplotlib, Pandas, Requests, Scrapy,
Sklearn, Spacy, and Tensorflow. After installing these
packages with their dependencies, the Python virtual environment contained 151 packages. All of those packages were indexed using Sourcetrail. The count of different types of nodes is shown in Table 1 (the exact counts may change in future versions of the dataset; this applies to all counts in this paper). The list
of different types of edges and their count is shown in
Table 2.
Java Dataset.
In the case of Java, we analyzed 15 repositories
that are openly available on GitHub. The repositories
include the source code for such projects as Apache
HTTP Client, crawler4j, Deeplearning4j, JHipster, log4j, Mahout, Opennlp, Spring Boot, Spring Framework, Stanford NLP, TableSaw, Thingsboard, Unirest, WebMagic, and Weka. Unlike for Python, we did not analyze the dependencies of these repositories. The count of nodes and edges by type is given in Tables 3 and 4, respectively.
Table 3: Node types present in Java source graph.
Node Type Count
Method 287393
Field 95983
Class 42329
Non-Indexed Symbol 30438
Type Parameter 5885
Enum Constant 4835
Non-Indexed Package 3792
Interface 3366
Annotation 1012
Enum 919
Built-In Type 9
Table 4: Edge count in Java graph by edge type.
Edge Type Count
Type use 989725
Call 817408
Define/Contain 475303
Use 368910
Annotation use 168862
Override 75186
Inheritance 32018
Is type 24105
As with the case
of the Python graph, there are many different types of
nodes and edges. For now, we focus on the edges that
create the function call graph. Other types of edges
and nodes can be used in the future.
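Focusing on Call edges also makes it easy to estimate how much each function is reused, which matters for the Multiple Usages criterion from Section 2; a small sketch, again with assumed file and column names:

import pandas as pd

edges = pd.read_csv("edges.csv")              # columns assumed: type, source_node_id, target_node_id
calls = edges[edges["type"] == "Call"]        # keep only call edges (see Tables 2 and 4)
# Number of distinct callers per called function: a rough measure of reuse.
callers_per_function = calls.groupby("target_node_id")["source_node_id"].nunique()
print(callers_per_function.describe())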
For each dataset, where it was possible, we ex-
tracted docstrings for functions. Thus, some of the
functions in the dataset come with a textual descrip-
tion.
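For the Python part, docstrings can be extracted with the standard ast module; the sketch below shows the general idea (matching extracted docstrings to graph nodes is simplified to plain function names here, which is an assumption):

import ast

def extract_docstrings(path):
    """Returns {function name: docstring} for one Python source file."""
    with open(path, encoding="utf-8") as f:
        tree = ast.parse(f.read())
    docs = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:
                docs[node.name] = doc
    return docs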
The datasets described above represent the source
code of selected Python packages and Java reposito-
ries in the form of graphs. The nodes of the graph,
in this case, are the units of programming languages,
such as functions, methods, classes, interfaces, etc. In
these datasets, some of the edges represent function
calls. Thus, it is possible to use this data for train-
ing hierarchical embeddings for the functions. Some
of the functions, for which docstring was available,
have a textual description attached to them. These de-
scriptions enable solving the tasks that require input
or output in the form of natural language. ML tasks for code such as API search, annotation generation, API recommendation, and others can be tested on the current dataset.
4 RELATED WORK
Existing Datasets.
The idea of representing source code in the form of graphs is not new and has been explored in different studies. However, bringing together a hier-
archical representation of a program and textual de-
scriptions is still rare, and the existing public datasets
do not provide the data in the desired format.
We looked into datasets for source code search
and description generation – the types of datasets that
come with textual descriptions for functions. None of the datasets that we found allowed constructing an unambiguous hierarchical representation of a program.
One of the datasets, code-docstring-corpus, collected by researchers at the University of Edinburgh, contains a parallel corpus of Python functions and their docstrings and enables training models for generating function descriptions from function bodies (Miceli-Barone and Sennrich, 2017). A collection of open-source Python repositories was used to build the dataset. Many functions come from the same li-
brary. However, there seems to be no easy way to con-
struct an unambiguous function call graph from this
data, mainly due to functions having identical names,
and no easy way to identify where the functions were
imported from.
The dataset published by GitHub, CodeSearch-
Net, contains function bodies, links to the source
repository, and, sometimes, function descriptions
(Husain et al., 2019). The data is available for sev-
eral languages. The dataset contains functions from a
very diverse set of repositories. For this reason, many
functions are not really reused, which makes it hard
to construct a hierarchical representation.
In Facebook’s dataset, Neural-Code-Search-
Evaluation-Dataset, a parallel corpus of code snippets
and their textual descriptions is given (Sachdev et al.,
2018). But, once again, it seems hard to construct
a hierarchical representation. For this reason, this
dataset does not satisfy our goals.
Another project that caught our attention is BOA. It aims at mining a large number of
source code repositories (Dyer et al., 2013). They
have implemented a query language to generate and
mine different types of program graphs, includ-
ing control-flow graphs, control-dependence graphs,
data-dependence graphs, and program-dependence
graphs. However, the data is not readily available
and has to be queried. Moreover, they currently analyze only Java 7 and older, according to the project web page.
Techniques for Source Code Embeddings.
One of the goals of creating the source code
graph dataset is learning vector representations for the
source code.
An extensive study of the subject of source code embedding was done in (Chen and Monperrus, 2019). In most cases, creating embeddings requires building some kind of graph. Several methods were reported to build graphs based on token adjacency (Harer et al., ) (Azcona et al., 2019) (Chen and Monperrus, 2018). Such an approach is very similar to classical embedding techniques, like Word2Vec, but seems over-simplified in the case of program source code.
Another approach for learning embeddings is us-
ing a control-flow graph (DeFreez et al., 2018). While
such an approach would resolve many ambiguities
in graph construction, creating the control-flow graph itself is not possible for an arbitrary language. The same problem arises when using precompiled LLVM code to create the graph (Ben-Nun et al., 2018). Moreover, such a representation is no longer interpretable by a programmer. This is one of the reasons we focus on analyzing the sources of a program as they are, without additional preprocessing.
Some other approaches work with the function call graph and have reported learning meaningful embeddings (Lu et al., 2019) (Pradel and Sen, 2018). However, the data that they used is not publicly available.
Several approaches used graph-based source code
representation for fixing bugs in the code (Allama-
nis et al., 2017) (Devlin et al., 2017). Their graph
includes both function call information and the function AST. However, these graphs are not publicly
available.
At the time of writing this paper, we are not aware of any work that utilizes all three at the same time: hierarchical representations, function bodies, and textual descriptions.
5 CONCLUSION
Instruments that can facilitate the software develop-
ment process are needed. Due to the latest advance-
ments in statistical learning, it is now becoming possible to address such problems as source code search and generation of documentation. Lately, numerous works have addressed these problems. In this paper, we pre-
sented our vision for an approach to source code anal-
ysis that utilizes hierarchical graph representations,
function bodies, and textual descriptions for func-
tions. It is possible to use this information to train
a model that captures the program’s purpose. We cre-
ated two datasets for Python and Java projects. Each
individual project is represented as a graph. These
graphs were later united into a single graph (one per programming language). This large graph contains information both about the internal structure of each function and about how this function is used in the rest of the source code. We believe that this information will be beneficial for learning embeddings for functions. Such embeddings can help to solve ML tasks such as API search, generation of documentation, API suggestion, bug detection, and many others.
Both datasets are available on GitHub.
Currently, the datasets are still under develop-
ment, and their improvement is expected within the
next several months. The future work includes in-
corporating AST of programs into the current graph.
The datasets can be used for studying properties of
source code, e.g., the amount of code reuse. However,
the main goal of creating these datasets is to explore
the possibility of creating hierarchical embeddings for
source code.
REFERENCES
Aghamohammadi, A., Heydarnoori, A., and Izadi, M.
(2018). Generating summaries for methods of event-
driven programs: an android case study. arXiv
preprint arXiv:1812.04530.
Allamanis, M., Brockschmidt, M., and Khademi, M.
(2017). Learning to represent programs with graphs.
arXiv preprint arXiv:1711.00740.
Alon, U., Brody, S., Levy, O., and Yahav, E.
(2018a). code2seq: Generating sequences from
structured representations of code. arXiv preprint
arXiv:1808.01400.
Alon, U., Zilberstein, M., Levy, O., and Yahav, E.
(2018b). A general path-based representation for pre-
dicting program properties. ACM SIGPLAN Notices,
53(4):404–419.
Alon, U., Zilberstein, M., Levy, O., and Yahav, E. (2019).
code2vec: Learning distributed representations of
code. Proceedings of the ACM on Programming Lan-
guages, 3(POPL):1–29.
Azcona, D., Arora, P., Hsiao, I.-H., and Smeaton, A. (2019).
user2code2vec: Embeddings for profiling students
based on distributional representations of source code.
In Proceedings of the 9th International Conference on
Learning Analytics & Knowledge, pages 86–95.
Ben-Nun, T., Jakobovits, A. S., and Hoefler, T. (2018). Neu-
ral code comprehension: A learnable representation of
code semantics. In Advances in Neural Information
Processing Systems, pages 3585–3597.
Chen, H., Perozzi, B., Hu, Y., and Skiena, S. (2018). Harp:
Hierarchical representation learning for networks. In
Thirty-Second AAAI Conference on Artificial Intelli-
gence.
Chen, Z. and Monperrus, M. (2018). The remarkable role of
similarity in redundancy-based program repair. arXiv
preprint arXiv:1811.05703.
Chen, Z. and Monperrus, M. (2019). A literature study
of embeddings on source code. arXiv preprint
arXiv:1904.03061.
DeFreez, D., Thakur, A. V., and Rubio-González, C.
(2018). Path-based function embedding and its ap-
plication to specification mining. arXiv preprint
arXiv:1802.07779.
Devlin, J., Uesato, J., Singh, R., and Kohli, P. (2017). Se-
mantic code repair using neuro-symbolic transforma-
tion networks. arXiv preprint arXiv:1710.11054.
Dyer, R., Nguyen, H. A., Rajan, H., and Nguyen, T. N.
(2013). Boa: A language and infrastructure for ana-
lyzing ultra-large-scale software repositories. In 2013
35th International Conference on Software Engineer-
ing (ICSE), pages 422–431. IEEE.
Gu, X., Zhang, H., and Kim, S. (2018). Deep code search.
In 2018 IEEE/ACM 40th International Conference on
Software Engineering (ICSE), pages 933–944. IEEE.
Harer, J. A., Kim, L. Y., Russell, R. L., Ozdemir, O., Kosta,
L. R., Rangamani, A., Hamilton, L. H., Centeno, G. I.,
Key, J. R., Ellingwood, P. M., et al. Automated soft-
ware vulnerability detection with machine learning.
Henkel, J., Lahiri, S. K., Liblit, B., and Reps, T. (2018).
Code vectors: Understanding programs through em-
bedded abstracted symbolic traces. In Proceedings of
the 2018 26th ACM Joint Meeting on European Soft-
ware Engineering Conference and Symposium on the
Foundations of Software Engineering, pages 163–174.
Hindle, A., Barr, E. T., Su, Z., Gabel, M., and Devanbu,
P. (2012). On the naturalness of software. In 2012
34th International Conference on Software Engineer-
ing (ICSE), pages 837–847. IEEE.
Husain, H., Wu, H.-H., Gazit, T., Allamanis, M., and
Brockschmidt, M. (2019). Codesearchnet challenge:
Evaluating the state of semantic code search. arXiv
preprint arXiv:1909.09436.
Li, Y., Wang, S., Nguyen, T. N., and Van Nguyen, S. (2019).
Improving bug detection via context-based code rep-
resentation learning and attention-based neural net-
works. Proceedings of the ACM on Programming
Languages, 3(OOPSLA):1–30.
Lu, M., Liu, Y., Li, H., Tan, D., He, X., Bi, W., and Li, W.
(2019). Hyperbolic function embedding: Learning hi-
erarchical representation for functions of source code
in hyperbolic space. Symmetry, 11(2):254.
Matskevich, S. and Gordon, C. S. (2018). Generating com-
ments from source code with ccgs. In Proceedings
of the 4th ACM SIGSOFT International Workshop on
NLP for Software Engineering, pages 26–29.
Miceli-Barone, A. V. and Sennrich, R. (2017). A parallel
corpus of python functions and documentation strings
for automated code documentation and code genera-
tion. In Proceedings of the Eighth International Joint
Conference on Natural Language Processing (Volume
2: Short Papers), pages 314–319.
Nguyen, T. D., Nguyen, A. T., Phan, H. D., and Nguyen,
T. N. (2017). Exploring api embedding for api us-
ages and applications. In 2017 IEEE/ACM 39th Inter-
national Conference on Software Engineering (ICSE),
pages 438–449. IEEE.
Pradel, M. and Sen, K. (2018). Deepbugs: A learn-
ing approach to name-based bug detection. Pro-
ceedings of the ACM on Programming Languages,
2(OOPSLA):1–25.
Ribeiro, L. F., Saverese, P. H., and Figueiredo, D. R.
(2017). struc2vec: Learning node representations
from structural identity. In Proceedings of the 23rd
ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining, pages 385–394.
Sachdev, S., Li, H., Luan, S., Kim, S., Sen, K., and Chan-
dra, S. (2018). Retrieval on source code: a neural code
search. In Proceedings of the 2nd ACM SIGPLAN In-
ternational Workshop on Machine Learning and Pro-
gramming Languages, pages 31–41.
Yao, Z., Peddamail, J. R., and Sun, H. (2019). Coacor:
code annotation for code retrieval with reinforcement
learning. In The World Wide Web Conference, pages
2203–2214.
Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W., and
Leskovec, J. (2018). Hierarchical graph representation
learning with differentiable pooling. In Advances in
neural information processing systems, pages 4800–
4810.