Kedro is the first open source tool released by a division of the consulting firm McKinsey. It was created for data scientists and engineers. It is a code library that can be used to create data and machine learning pipelines, the building blocks of a machine learning project.
McKinsey & Company is an American global management consulting company. It performs qualitative and quantitative analyses to evaluate management decisions in the public and private sectors. Its clients include 80% of the world's largest corporations.
First open source tool
The company had never before released one of its in-house tools under an open source license. In fact, Kedro was born as proprietary software. As a result, once a customer's relationship with the company ended, that customer no longer had access to the program.
The name Kedro derives from the Greek word for center or core. It was chosen because this open source tool provides crucial code for producing advanced analysis projects.
Kedro has two main advantages:
- It enables teams to collaborate more easily by structuring analytical code in a uniform way.
- It allows all components to flow seamlessly through all stages of a project.
The stages of a project include:
- Consolidation of data sources
- Data cleaning
- Feature creation
- Feeding the data into machine learning models for explanatory or predictive analysis
Kedro also helps deliver production-ready code. This makes it really useful for data scientists, who are not usually experts in software engineering.
Why is Kedro useful?
Open source tools like Kedro can reduce the time it takes to turn a prototype into production code by weeks. Analysts can spend less time writing code and more time solving their customers' problems.
Kedro helps teams create data pipelines that are modular, tested, reproducible in any environment and versioned, so users can access previous states of the data. The same code can go from a single developer's laptop to an enterprise-grade project using cloud computing. It can also be used across all industries, models and data sources.
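To make the idea of versioned data more concrete, here is a minimal sketch in plain Python. It does not use Kedro's own API: the `VersionedJSONStore` class, its methods and the timestamp-style version labels are hypothetical names chosen only for illustration. The sketch just shows how keeping every saved state under its own label lets us load a previous state later.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, List, Optional


class VersionedJSONStore:
    """Hypothetical helper: keeps every saved state under its own version label."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, data: Any, version: str) -> None:
        """Write a new version; existing versions are never overwritten."""
        path = self.root / f"{version}.json"
        if path.exists():
            raise FileExistsError(f"version {version!r} already exists")
        path.write_text(json.dumps(data))

    def versions(self) -> List[str]:
        """Return all version labels, oldest first (labels sort as text)."""
        return sorted(p.stem for p in self.root.glob("*.json"))

    def load(self, version: Optional[str] = None) -> Any:
        """Load a specific version, or the most recent one by default."""
        available = self.versions()
        if not available:
            raise FileNotFoundError("nothing has been saved yet")
        chosen = version if version is not None else available[-1]
        return json.loads((self.root / f"{chosen}.json").read_text())


with tempfile.TemporaryDirectory() as tmp:
    store = VersionedJSONStore(Path(tmp) / "customers")
    store.save({"rows": 100}, "2019-06-01T09-00-00")
    store.save({"rows": 120}, "2019-06-02T09-00-00")

    latest = store.load()                          # newest state
    previous = store.load("2019-06-01T09-00-00")   # an earlier state
    all_versions = store.versions()

print(latest, previous, all_versions)
```

Because a save never overwrites an earlier version, any result can be reproduced later from the exact data it was computed on.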
McKinsey has already used Kedro on more than 50 projects to date. According to one executive, customers especially like the visualization of the pipelines. They can immediately see the different stages of transformation and the types of models involved, and they can trace the results back to the raw data source.
McKinsey is not the first company outside the technology sector to publish open source tools. Uber and Airbnb had already done so.
Kedro Features and Installation
Kedro is a workflow development tool for creating robust, scalable, deployable, reproducible and versioned data pipelines.
What are the main characteristics of Kedro?
1. Project template and coding standards
- An easy-to-use, standard project template
- Configuration for credentials, logging, data loading and Jupyter Notebooks / Lab
- Test-driven development using pytest
- Sphinx integration to produce well-documented code
2. Data abstraction and versioning
- Separation of the computing layer from the data management layer, including support for different data formats and storage options.
- Versioning for your data sets and machine learning models
3. Modularity and abstraction of pipelines
- Support for pure Python functions, called nodes, to divide large chunks of code into small independent sections
- Automatic resolution of dependencies between nodes
4. Extensibility of features
- A plugin system that injects commands into Kedro's command line interface (CLI). Two examples are Kedro-Airflow, which makes it easy to prototype your data pipeline in Kedro before deploying it to Airflow, a workflow scheduler, and Kedro-Docker, a tool for packaging and shipping Kedro projects in containers.
- Kedro can be deployed locally, on premises and in the cloud (AWS, Azure, and GCP) or in clusters (EMR, Azure HDInsight, GCP, and Databricks).
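The idea of nodes as pure Python functions with automatic dependency resolution, mentioned in point 3, can be illustrated with a small sketch. This is not Kedro's implementation or API. It uses only the Python standard library (`graphlib`, available from Python 3.9), and the `Node` tuple, the `run` function, the stage functions and the sample data are all assumptions made for illustration. Each node declares the names of its inputs and its output, and the execution order is derived from those names rather than from the order in which the nodes are listed.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, NamedTuple, Tuple


class Node(NamedTuple):
    """Hypothetical node: a pure function plus named inputs and one named output."""
    func: Callable[..., Any]
    inputs: List[str]
    output: str


def merge_sources(orders, customers):
    """Consolidate two raw sources into one list of records."""
    names = {c["id"]: c["name"] for c in customers}
    return [{**o, "customer": names.get(o["customer_id"])} for o in orders]


def clean(merged):
    """Drop records with an unknown customer or a negative amount."""
    return [r for r in merged if r["customer"] is not None and r["amount"] >= 0]


def build_features(cleaned):
    """Create one feature per customer: the total amount spent."""
    totals: Dict[str, float] = {}
    for r in cleaned:
        totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount"]
    return totals


def run(nodes: List[Node], data: Dict[str, Any]) -> Tuple[List[str], Dict[str, Any]]:
    """Resolve the execution order from input/output names, then run each node."""
    producers = {n.output: n for n in nodes}
    # Map each output to the outputs of other nodes it depends on.
    # Inputs that no node produces must already be present in `data`.
    graph = {n.output: {i for i in n.inputs if i in producers} for n in nodes}
    order = list(TopologicalSorter(graph).static_order())
    results = dict(data)
    for name in order:
        node = producers[name]
        results[name] = node.func(*(results[i] for i in node.inputs))
    return order, results


# The nodes are deliberately listed out of order.
nodes = [
    Node(build_features, ["cleaned"], "features"),
    Node(clean, ["merged"], "cleaned"),
    Node(merge_sources, ["orders", "customers"], "merged"),
]
raw = {
    "orders": [
        {"customer_id": 1, "amount": 10.0},
        {"customer_id": 2, "amount": -5.0},
        {"customer_id": 1, "amount": 2.5},
        {"customer_id": 3, "amount": 7.0},
    ],
    "customers": [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bo"}],
}
order, results = run(nodes, raw)
print(order)                # ['merged', 'cleaned', 'features']
print(results["features"])  # {'Ada': 12.5}
```

Since every function only reads its arguments and returns a value, each stage can be tested on its own, and adding a new stage only requires declaring which named data it consumes and produces.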
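On a Debian- or Ubuntu-based Linux distribution, we can install Kedro by first installing pip and then Kedro itself. The last command upgrades an existing installation to the latest release:
sudo apt install python3-pip
pip3 install kedro
pip3 install kedro -U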
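As a quick check after the installation, we can ask Python which version of the package it sees. This is a small sketch using only the standard library (`importlib.metadata`, available from Python 3.8). The helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


kedro_version = installed_version("kedro")
if kedro_version is None:
    print("kedro is not installed in this environment")
else:
    print(f"kedro {kedro_version} is installed")
```

If the package is reported as missing even though pip finished without errors, pip3 and the Python interpreter being used probably belong to different environments.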
We can see the documentation with:
More information can be found on the project page.