Several of the largest companies working in machine learning recently announced OpenXLA, a project for jointly developing tools that compile and optimize models for machine learning systems.
The project has taken over development of tools that unify the compilation of models built with the TensorFlow, PyTorch and JAX frameworks, so that they can be trained and run efficiently on different GPUs and specialized accelerators. Google, NVIDIA, AMD, Intel, Meta, Apple, Arm, Alibaba and Amazon have joined the project.
The OpenXLA Project provides a state-of-the-art ML compiler that can scale amidst the complexity of the ML infrastructure. Its fundamental pillars are performance, scalability, portability, flexibility and extensibility for users. With OpenXLA, we aspire to unlock the real potential of AI by accelerating its development and delivery.
OpenXLA enables developers to compile and optimize models from all leading ML frameworks for efficient training and serving on a wide variety of hardware. Developers using OpenXLA will see significant improvements in training time, performance, serving latency, and ultimately time to market and compute costs.
The hope is that, by combining the efforts of the main research teams and community representatives, the project will stimulate the development of machine learning systems and solve the problem of infrastructure fragmentation across frameworks and teams.
OpenXLA makes it possible to implement efficient support for a range of hardware regardless of the framework a machine learning model is built on. It is expected to reduce model training time, improve performance, and lower latency, computing overhead and time to market.
OpenXLA consists of three main components, the code of which is distributed under the Apache 2.0 license:
- XLA (Accelerated Linear Algebra) is a compiler that optimizes machine learning models for high-performance execution on different hardware platforms, including GPUs, CPUs, and specialized accelerators from various manufacturers.
- StableHLO is a base specification and implementation of a set of High-Level Operations (HLOs) for use in machine learning models. It acts as a layer between machine learning frameworks and the compilers that transform a model to run on specific hardware. Bridges have been prepared that generate models in StableHLO format from the PyTorch, TensorFlow and JAX frameworks. StableHLO is based on the MHLO operation set, extended with support for serialization and versioning.
- IREE (Intermediate Representation Execution Environment) is a compiler and runtime that converts machine learning models into a universal intermediate representation based on MLIR (Multi-Level Intermediate Representation), a format from the LLVM project. Its highlighted features include ahead-of-time compilation, support for control flow, the ability to use dynamic elements in models, optimization for different CPUs and GPUs, and low overhead.
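To make the relationship between a framework, StableHLO and XLA more concrete, here is a minimal sketch using JAX's ahead-of-time lowering API. It assumes a recent JAX release, where the lowered module is printed in the StableHLO dialect; the `predict` function and the array shapes are hypothetical and chosen only for illustration.

```python
import jax
import jax.numpy as jnp


def predict(w, x):
    # A tiny hypothetical "model": one dense layer followed by tanh.
    return jnp.tanh(x @ w)


w = jnp.ones((4, 2))
x = jnp.ones((3, 4))

# Stage 1: trace the Python function and lower it to a compiler IR module.
lowered = jax.jit(predict).lower(w, x)

# Textual form of the lowered module (StableHLO in recent JAX releases).
stablehlo_text = lowered.as_text()
print(stablehlo_text)

# Stage 2: hand the module to XLA, which compiles it for the local backend.
compiled = lowered.compile()

# The compiled executable is called with arguments of the same shapes.
y = compiled(w, x)
print(y.shape)
```

The same two stages happen implicitly whenever a `jax.jit`-decorated function is called; splitting them out only makes the intermediate representation visible.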
Among the main advantages of OpenXLA, the project cites optimal performance without the need to write device-specific code. Out-of-the-box optimizations include simplification of algebraic expressions, efficient memory allocation, and execution scheduling that takes into account the reduction of peak memory consumption and overhead.
Another advantage is simpler scaling and parallelization of computations. A developer only needs to add annotations for a subset of critical tensors, and the compiler can use them to generate code for parallel computation automatically.
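As an illustration of annotation-driven parallelization, the following sketch uses JAX's sharding API, which is one front end to this XLA capability. It is a minimal example under stated assumptions: the mesh axis name `data`, the `layer` function and the array shapes are hypothetical, and on a machine with a single device the mesh simply contains one device.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a one-dimensional device mesh over whatever devices are available.
devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("data",))

# Annotation: split the first (batch) axis across the "data" mesh axis
# and replicate the second axis.
batch_sharding = NamedSharding(mesh, P("data", None))

n = len(devices)
x = jnp.ones((8 * n, 4))
x = jax.device_put(x, batch_sharding)  # only the input tensor is annotated

w = jnp.ones((4, 2))  # left unannotated


@jax.jit
def layer(x, w):
    # No parallelism-specific code here; the compiler decides how to
    # partition the computation based on the input sharding.
    return jnp.tanh(x @ w)


y = layer(x, w)
print(y.shape, y.sharding)
```

Only `x` carries an explicit annotation; how the matrix product and the output are partitioned is left to the compiler.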
Portability is also highlighted, with support for multiple hardware platforms such as AMD and NVIDIA GPUs, x86 and ARM CPUs, Google TPU ML accelerators, AWS Trainium and Inferentia, Graphcore IPUs, and the Cerebras Wafer-Scale Engine.
Extensibility is provided through support for plugging in extensions that implement additional functionality, such as writing deep learning primitives in CUDA, HIP, SYCL, Triton and other parallel-computing languages, as well as manually tuning bottlenecks in models.
Finally, if you would like to know more, you can consult the details at the following link.