GitHub Repo Raider and the Automation of Machine Learning

Since X never, ever marks the spot, this article raids the GitHub repos in search of quality automated machine learning resources. Read on for projects and papers to help understand and implement AutoML.




 

GitHub is a clearinghouse for all sorts of open source projects, including those for machine learning, automated and otherwise.

What's automated machine learning? It's automating the automation of automation, of course! More specifically, automated machine learning is the use of automated techniques, be they learned methods or simple heuristics, for algorithm selection, hyperparameter tuning, architecture design, or any other conceivable portion of a machine learning implementation.

Switching gears, Indiana Jones is one of the greatest characters to ever grace the silver screen. Raiders of the Lost Ark, the first movie in which the character was featured, is a personal favorite and a film adored by millions. The rest of the (current) quadrilogy runs alternately hot and cold, but even the poorest quality Indiana Jones is better than 95% of available cinema.

What does Indiana Jones have to do with machine learning, or GitHub? Not much, really. But we are going to co-opt the concept of his first film — the idea of marauding for something of value (an ancient relic in that case) — and apply it to searching GitHub for open source repositories of value.

Why are we doing this?

 

"Fortune and glory, kid. Fortune and glory."
 
—Dr. Henry (Indiana) Jones, Jr.

 

And so, in a vaguely Indiana Jones-esque article, let's take a look at some of the most popular automated machine learning resources — both project and paper repos — on GitHub.

 

"Jock, start the engine!"
 
—Dr. Henry Jones, Jr., again

 

 

Projects

 
The projects range from those which aid in algorithm selection to those assisting with hyperparameter tuning to architecture design and beyond. They are designed for linear machine learning algorithms, ensemble approaches, and/or neural networks. They are all connected in that they are all Python projects, or at the very least have Python APIs.

 

"Snakes. Why did it have to be snakes?"
 
—Also Dr. Henry Jones, Jr.

 

The projects are loosely arranged in order of increasing complexity, in terms of both their capabilities and the algorithms they support. The ordering isn't perfect, but it's something.

 
auto-sklearn

auto-sklearn is an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator.

auto-sklearn frees a machine learning user from algorithm selection and hyperparameter tuning. It leverages recent advances in Bayesian optimization, meta-learning and ensemble construction. Learn more about the technology behind auto-sklearn by reading the project's paper published at NIPS 2015.

 
TPOT: Tree-based Pipeline Optimization Tool

TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your Data Science Assistant. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.

Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.

TPOT is built on top of scikit-learn, so all of the code it generates should look familiar... if you're familiar with scikit-learn, anyway.
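To get a feel for what "exploring thousands of possible pipelines" means, here is a toy evolutionary search written in plain Python. It is only a sketch and not TPOT's implementation or API: TPOT uses genetic programming over real scikit-learn pipelines, while this version uses mutation and selection only. The search space, the stage names, and the `toy_score` function are all made up for illustration, standing in for the cross-validated score a real tool would compute.

```python
import random

# Hypothetical search space: one choice per pipeline stage.
SEARCH_SPACE = {
    "scaler": ["none", "standard", "minmax"],
    "selector": ["none", "variance", "top_k"],
    "model": ["tree", "forest", "logistic"],
}


def toy_score(pipeline):
    """Made-up stand-in for a cross-validated score (no model is trained)."""
    bonus = {"standard": 0.05, "top_k": 0.03, "forest": 0.10, "logistic": 0.04}
    return 0.70 + sum(bonus.get(choice, 0.0) for choice in pipeline.values())


def random_pipeline(rng):
    """Draw one option for every stage."""
    return {stage: rng.choice(options) for stage, options in SEARCH_SPACE.items()}


def mutate(pipeline, rng):
    """Copy the pipeline and re-draw one randomly chosen stage."""
    child = dict(pipeline)
    stage = rng.choice(list(SEARCH_SPACE))
    child[stage] = rng.choice(SEARCH_SPACE[stage])
    return child


def evolve(generations=20, population_size=10, seed=0):
    """Keep the better half each generation and refill with mutated copies."""
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=toy_score, reverse=True)
        survivors = population[: population_size // 2]
        children = [mutate(rng.choice(survivors), rng) for _ in survivors]
        population = survivors + children
    return max(population, key=toy_score)


best = evolve()
print(best, round(toy_score(best), 2))
```

Because the best pipeline always survives into the next generation, the top score can never get worse as the search runs longer, which is why stopping early (when you get tired of waiting) still leaves you with a usable result.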

 
Featuretools

Featuretools is a framework to perform automated feature engineering. It excels at transforming temporal and relational datasets into feature matrices for machine learning.
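As a rough picture of what turning a relational dataset into a feature matrix involves, the sketch below aggregates a child table into one row per parent record. It is plain Python and does not use the Featuretools API. The `customers` and `transactions` tables, the column names, and the `build_feature_matrix` helper are hypothetical examples.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical parent table (customers) and child table (transactions).
customers = [{"customer_id": 1}, {"customer_id": 2}]
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 2, "amount": 5.0},
]

# Aggregations applied to the chosen child column.
PRIMITIVES = {"count": len, "sum": sum, "mean": mean, "max": max}


def build_feature_matrix(parents, children, key, column):
    """Return one feature row per parent by aggregating its child rows."""
    grouped = defaultdict(list)
    for row in children:
        grouped[row[key]].append(row[column])

    matrix = []
    for parent in parents:
        values = grouped.get(parent[key], [])
        features = {key: parent[key]}
        for name, func in PRIMITIVES.items():
            if values:
                features[f"{name}({column})"] = func(values)
            else:
                # A parent with no child rows gets a count of 0 and None elsewhere.
                features[f"{name}({column})"] = 0 if name == "count" else None
        matrix.append(features)
    return matrix


feature_matrix = build_feature_matrix(customers, transactions, "customer_id", "amount")
for row in feature_matrix:
    print(row)
```

An automated feature engineering tool applies this kind of aggregation across many tables, columns, and time windows at once, which is where the hand-written version stops scaling.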

 
SMAC3: Sequential Model-based Algorithm Configuration

SMAC is a tool for algorithm configuration to optimize the parameters of arbitrary algorithms across a set of instances. This also includes hyperparameter optimization of ML algorithms. The main core consists of Bayesian Optimization in combination with an aggressive racing mechanism to efficiently decide which of two configurations performs better.
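The racing idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not SMAC's code or API: the `cost` function and the instance values are invented, and challengers are drawn at random here, whereas SMAC proposes them with Bayesian optimization. The point is only the early rejection: a challenger is dropped as soon as it falls behind the incumbent on the instances seen so far.

```python
import random


def cost(config, instance):
    """Hypothetical cost: squared distance between a parameter and the instance."""
    return (config["x"] - instance) ** 2


def race(incumbent, challenger, instances):
    """Return (winner, number of instances evaluated).

    Both configurations are run instance by instance. The challenger is
    rejected as soon as its running total cost exceeds the incumbent's,
    and it replaces the incumbent only if it survives every instance.
    """
    incumbent_total = 0.0
    challenger_total = 0.0
    for runs, instance in enumerate(instances, start=1):
        incumbent_total += cost(incumbent, instance)
        challenger_total += cost(challenger, instance)
        if challenger_total > incumbent_total:
            return incumbent, runs  # early rejection saves the remaining runs
    return challenger, len(instances)


def configure(instances, n_challengers=20, seed=0):
    """Race randomly drawn challengers against the current incumbent."""
    rng = random.Random(seed)
    incumbent = {"x": rng.uniform(-5.0, 5.0)}
    for _ in range(n_challengers):
        challenger = {"x": rng.uniform(-5.0, 5.0)}
        incumbent, _ = race(incumbent, challenger, instances)
    return incumbent


INSTANCES = [0.9, 1.0, 1.1]
best = configure(INSTANCES)
print(best)
```

A clearly bad challenger costs only one evaluation, while a promising one earns the full set of instances before it is allowed to take over.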

 
AlphaPy

AlphaPy is a machine learning framework for both speculators and data scientists. It is written in Python with the scikit-learn and pandas libraries, as well as many other helpful libraries for feature engineering and visualization.

 
MLBox

MLBox is a powerful Automated Machine Learning Python library. It provides the following features:

  • Fast reading and distributed data preprocessing/cleaning/formatting
  • Highly robust feature selection and leak detection
  • Accurate hyper-parameter optimization in high-dimensional space
  • State-of-the-art predictive models for classification and regression (Deep Learning, Stacking, LightGBM, ...)
  • Prediction with model interpretation

 
AutoKeras

Auto-Keras is an open source software library for automated machine learning (AutoML). It is developed by DATA Lab at Texas A&M University and community contributors. The ultimate goal of AutoML is to provide easily accessible deep learning tools to domain experts with limited data science or machine learning background. Auto-Keras provides functions to automatically search for architecture and hyperparameters of deep learning models.

 
Keras Tuner

A hyperparameter tuner for Keras, specifically for tf.keras with TensorFlow 2.0.

The project's introductory example performs hyperparameter tuning for a single-layer dense neural network using random search.

The first step is to define a model-building function. It takes an argument hp from which you can sample hyperparameters, such as hp.Int('units', min_value=32, max_value=512, step=32) (an integer from a certain range).
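The pattern of a model-building function that samples from hp can be mimicked without TensorFlow installed. The sketch below is a pure-Python stand-in, not Keras Tuner itself: `ToyHyperParameters`, `toy_objective`, and `random_search` are hypothetical names, the "model" is just a dict, and the score is invented rather than measured by training. Only the `hp.Int('units', min_value=32, max_value=512, step=32)` call is taken from the text above.

```python
import random


class ToyHyperParameters:
    """Hypothetical stand-in that mimics the hp.Int call from the text."""

    def __init__(self, rng):
        self.rng = rng
        self.values = {}

    def Int(self, name, min_value, max_value, step=1):
        # Sample one value from min_value, min_value + step, ..., up to max_value.
        value = self.rng.randrange(min_value, max_value + 1, step)
        self.values[name] = value
        return value


def build_model(hp):
    """Model-building function: returns a plain dict instead of a real network."""
    units = hp.Int('units', min_value=32, max_value=512, step=32)
    return {"layers": [("dense", units), ("dense", 10)]}


def toy_objective(model):
    """Made-up validation score that peaks at 256 units (no training happens)."""
    units = model["layers"][0][1]
    return 1.0 - abs(units - 256) / 512


def random_search(max_trials=10, seed=0):
    """Build a model per trial from freshly sampled values and keep the best."""
    rng = random.Random(seed)
    best_score, best_values = float("-inf"), None
    for _ in range(max_trials):
        hp = ToyHyperParameters(rng)
        score = toy_objective(build_model(hp))
        if score > best_score:
            best_score, best_values = score, dict(hp.values)
    return best_values, best_score


best_values, best_score = random_search()
print(best_values, best_score)
```

The design point worth noticing is that the search space is declared inside the model-building function itself, so the tuner discovers the hyperparameters simply by calling it. With the real library, a Keras model and the library's own tuner would take the place of the dict and the `random_search` loop.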

 
NNI: Neural Network Intelligence

NNI (Neural Network Intelligence) is a toolkit to help users run automated machine learning (AutoML) experiments. The tool dispatches and runs trial jobs generated by tuning algorithms to search for the best neural architecture and/or hyper-parameters in different environments such as a local machine, remote servers, and the cloud.

 
Auto-PyTorch

Automatic architecture search and hyperparameter optimization for PyTorch.

This is a very early pre-alpha version of our upcoming Auto-PyTorch. So far, Auto-PyTorch supports featurized data (classification, regression) and image data (classification).

 
AdaNet

AdaNet is a lightweight TensorFlow-based framework for automatically learning high-quality models with minimal expert intervention. AdaNet builds on recent AutoML efforts to be fast and flexible while providing learning guarantees. Importantly, AdaNet provides a general framework for not only learning a neural network architecture, but also for learning to ensemble to obtain even better models.

 
Ludwig

Ludwig is a toolbox built on top of TensorFlow that allows users to train and test deep learning models without the need to write code.

It can be used by practitioners to quickly train and test deep learning models as well as by researchers to obtain strong baselines to compare against and have an experimentation setting that ensures comparability by performing standard data preprocessing and visualization.

Ludwig provides a set of model architectures that can be combined to create an end-to-end model for a given use case. As an analogy, if deep learning libraries provide the building blocks to make your building, Ludwig provides the buildings to make your city, and you can choose among the available buildings or add your own building to the set of available ones.

 

Papers

 
When it comes to automated machine learning, does this remark describe you?

 

"You’re meddling with powers you can’t possibly comprehend."
 
—Marcus Brody

 

It shouldn't, and you can change that by reading a selection of the most important, influential, and explanatory research papers on the subject.

The following pair of repos can be raided for just such paper selections.

 
Awesome AutoML Papers

Awesome-AutoML-Papers is a curated list of automated machine learning papers, articles, tutorials, slides and projects. Star this repository, and then you can keep abreast of the latest developments of this booming research field. Thanks to all the people who made contributions to this project. Join us and you are welcome to be a contributor.

 
Awesome AutoML

Curating a list of AutoML-related research, tools, projects and other resources.

 
This wraps up our vaguely Indiana Jones-esque overview of automated machine learning GitHub repo raiding. Will there be a follow-up installment? Like the upcoming (???) Indy 5, we will have to wait and see.

 