IBM recently unveiled a new project called "CodeNet", which aims to provide researchers with a data set for experimenting with machine learning techniques to build translators from one programming language to another, as well as code generators and parsers.
CodeNet includes a collection of 14 million code samples that solve 4053 common programming problems. In total, the collection contains around 500 million lines of code and covers 55 programming languages, including modern languages such as C++, Java, Python, and Go, as well as legacy languages such as COBOL, Pascal, and FORTRAN.
"Software is eating the world," wrote famous American businessman Marc Andreessen in 2011. Fast forward to today: software is found in financial services and healthcare, smartphones and smart homes. Even cars now have more than 100 million lines of code.
The project's code is released under the Apache 2.0 license, and the data sets are expected to be released into the public domain.
The examples are annotated and implement identical algorithms in different programming languages. The data set is intended to help train machine learning systems and drive innovation in automatic code translation and analysis, much as the ImageNet database of annotated images aided the development of image recognition and computer vision systems. Programming contests are cited as one of the main sources for the collection.
Project CodeNet can specifically drive algorithmic innovation to extract the context of code with sequence-to-sequence models, much like those already applied to human languages, to make a more significant dent in machine understanding of code as opposed to machine processing of code.
Unlike traditional translators based on translation rules, machine learning systems can capture and take into account the context in which code is used. When converting from one programming language to another, context is just as important as when translating from one human language to another. It is the lack of contextual awareness that makes it hard to convert code from legacy languages like COBOL.
A large base of algorithm implementations in various languages will help create universal machine learning systems that, instead of translating directly between specific pairs of languages, work with a more abstract representation of the code that is independent of any particular programming language.
Such a system can be used as a translator that converts code submitted in any of the supported languages into its internal abstract representation, from which code in many languages can be generated.
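To make the idea of a shared abstract representation concrete, here is a minimal sketch. It is a hand-written, rule-based toy and not a machine learning system or anything taken from CodeNet: the `Assign` node, the parser and emitter functions, and the single statement shape they accept are all assumptions made up for illustration. It only shows the architecture: each language needs one front end and one back end, instead of a dedicated translator for every pair of languages.

```python
from dataclasses import dataclass


# Hypothetical language-neutral node: "target = left <op> right".
@dataclass(frozen=True)
class Assign:
    target: str
    left: str
    op: str
    right: str


def parse_python(line: str) -> Assign:
    # Toy front end: accepts only lines shaped like "x = a + b".
    target, expr = [part.strip() for part in line.split("=", 1)]
    left, op, right = expr.split()
    return Assign(target, left, op, right)


def parse_cobol(line: str) -> Assign:
    # Toy front end: accepts only lines shaped like "COMPUTE X = A + B."
    body = line.strip().rstrip(".")
    prefix = "COMPUTE "
    if not body.upper().startswith(prefix):
        raise ValueError("only COMPUTE statements are supported in this sketch")
    return parse_python(body[len(prefix):])


def emit_python(node: Assign) -> str:
    return f"{node.target} = {node.left} {node.op} {node.right}"


def emit_java(node: Assign) -> str:
    # Assumes integer operands, purely to keep the example short.
    return f"int {node.target} = {node.left} {node.op} {node.right};"


def emit_cobol(node: Assign) -> str:
    return f"COMPUTE {node.target} = {node.left} {node.op} {node.right}."


FRONT_ENDS = {"python": parse_python, "cobol": parse_cobol}
BACK_ENDS = {"python": emit_python, "java": emit_java, "cobol": emit_cobol}


def translate(code: str, source: str, target: str) -> str:
    # Every translation goes through the shared representation.
    node = FRONT_ENDS[source](code)
    return BACK_ENDS[target](node)


if __name__ == "__main__":
    print(translate("COMPUTE TOTAL = PRICE + TAX.", "cobol", "java"))
```

Adding a new language to this toy means writing one parser and one emitter, after which it can be translated to and from every language already supported. A learned system would replace the hand-written rules with models trained on parallel samples.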
Such a system can also perform bidirectional transformations. For example, banks and government agencies continue to use legacy COBOL projects. A machine learning translator can convert COBOL code to a Java representation and, if needed, translate a Java snippet back to COBOL.
In addition to translation between languages, other application areas mentioned for CodeNet include the creation of intelligent code search systems and the automation of clone detection, as well as the development of optimizers and systems for automatic code correction.
In particular, the examples in CodeNet come with metadata describing the results of performance tests, the size of the resulting program, its memory consumption, and a status that distinguishes correct code from erroneous code (examples with errors are deliberately included in the collection and account for 29.5% of it).
A machine learning system can take this metadata into account to generate optimal code or to detect regressions in the code being analyzed (the system can recognize that an algorithm in the submitted code is implemented suboptimally or contains errors).
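As a rough illustration of how such metadata could be used before any model is trained, the sketch below filters a handful of records by status and ranks them by run time. The records are invented for this example, and the field names (`status`, `cpu_time_ms`, `memory_kb`) and status strings are assumptions rather than CodeNet's documented schema.

```python
from typing import Optional

# Invented sample records; field names and values are assumptions.
SUBMISSIONS = [
    {"id": "s1", "language": "Python", "status": "Accepted",
     "cpu_time_ms": 120, "memory_kb": 9000},
    {"id": "s2", "language": "Python", "status": "Wrong Answer",
     "cpu_time_ms": 30, "memory_kb": 8000},
    {"id": "s3", "language": "Python", "status": "Accepted",
     "cpu_time_ms": 45, "memory_kb": 9500},
    {"id": "s4", "language": "Java", "status": "Accepted",
     "cpu_time_ms": 20, "memory_kb": 30000},
]


def fastest_accepted(submissions: list, language: str) -> Optional[dict]:
    # Keep only correct solutions in the requested language,
    # then pick the one with the lowest run time.
    candidates = [
        s for s in submissions
        if s["language"] == language and s["status"] == "Accepted"
    ]
    return min(candidates, key=lambda s: s["cpu_time_ms"], default=None)


def error_share(submissions: list) -> float:
    # Fraction of samples whose status marks them as not accepted.
    if not submissions:
        return 0.0
    wrong = sum(1 for s in submissions if s["status"] != "Accepted")
    return wrong / len(submissions)


if __name__ == "__main__":
    print(fastest_accepted(SUBMISSIONS, "Python"))
    print(error_share(SUBMISSIONS))
```

The same labels could serve as training targets: the status field separates positive from negative examples for error detection, while run time and memory figures give a ranking signal for preferring more efficient implementations.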
Finally, if you are interested in learning more about CodeNet, you can check the details in the following link.