Scikit-Learn vs mlr for Machine Learning

How does the scikit-learn machine learning library for Python compare to the mlr package for R? Follow along with a machine learning workflow in each, and see whether knowing both frameworks can give you a competitive advantage.



 

Scikit-Learn is known for its easily understandable API for Python users, and MLR became an alternative to the popular caret package, offering a larger suite of available algorithms and an easy way of tuning hyperparameters. The two packages are somewhat in competition because of a long-running debate: many people in analytics turn to Python for machine learning and to R for statistical analysis.

One reason for preferring Python could be that, in R, the machine learning algorithms themselves live in separate packages. Those packages are called through MLR but still have to be installed separately. Even feature selection relies on external libraries, which have further dependencies of their own that need to be satisfied.

Scikit-Learn, by contrast, is billed as a unified API to a number of machine learning algorithms that does not require the user to load any additional libraries.

This by no means discredits R. R is still a major component of the data science world, regardless of what an online poll might say. Anyone with a background in statistics and/or mathematics will know why you would use R (whether or not they use it themselves, they recognize the appeal).

Now we will take a look at how a user would go through a typical machine learning workflow. We will proceed with Logistic Regression in Scikit-Learn and Decision Tree in MLR.

 

Creating Your Training and Test Data

  • Scikit-Learn
    • x_train, x_test, y_train, y_test = train_test_split(x,y,test_size)
      This is the simplest way to partition datasets in Scikit-Learn. test_size determines what fraction of the data goes into the test set. train_test_split creates the train and test sets in one line of code. x is the set of features and y is the target variable.
  • MLR
    • train <- sample(1:nrow(data), 0.8 * nrow(data))
    • test <- setdiff(1:nrow(data), train)
    • MLR does not have a built-in function to subset datasets, so users need to rely on other R functions for this. This example creates an 80/20 train/test split: train holds a random 80% of the row indices, and test holds the remaining 20%.
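
For comparison, here is a minimal, runnable sketch of the Scikit-Learn split described above. The synthetic dataset from make_classification, the 80/20 ratio, the stratify option, and the random_state values are assumptions chosen for illustration; they are not taken from the original workflow.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; the article's dataset is not shown).
x, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out 20% of the rows for testing. stratify=y keeps the class
# proportions similar in both subsets; random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=0
)

print(x_train.shape, x_test.shape)
```

Unlike the R snippet, which produces row indices, train_test_split returns the feature and target subsets themselves.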

Choosing an Algorithm

  • Scikit-Learn
    • LogisticRegression()
      The classifier is chosen and initialized by instantiating a clearly named class, which makes it easy to identify.
  • MLR
    • makeLearner('classif.rpart')
      The algorithm is called a learner, and this function is called to initialize it.
    • makeClassifTask(data=, target=)
      If we are doing classification, we need to make a call to initialize a classification task. This function will take two arguments: your training data and the name of the target variable.
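
As a rough illustration of the Scikit-Learn side, the sketch below instantiates two different classifiers and shows that they expose the same interface. The decision tree and all hyperparameter values here are illustrative assumptions, not settings from the original workflow.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Each algorithm is a class; hyperparameters are passed to the constructor.
# The values below are illustrative only.
logreg = LogisticRegression(C=1.0, max_iter=100)
tree = DecisionTreeClassifier(max_depth=10, min_samples_split=5)

# Every estimator offers the same methods, so swapping algorithms
# means changing only the line that constructs the object.
for estimator in (logreg, tree):
    name = type(estimator).__name__
    print(name, hasattr(estimator, "fit"), hasattr(estimator, "predict"))

# get_params() lists the hyperparameters an estimator accepts.
print(sorted(logreg.get_params()))
```

No separate task object is needed in Scikit-Learn: the features and target are handed directly to fit.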

Hyperparameter Tuning

In either package, there is a process to follow when tuning hyperparameters. You first specify which parameters you want to tune and the search space for each. You then run either a grid search or a random search to find the combination of parameter values that gives the best outcome (i.e., minimizes error or maximizes accuracy).

  • Scikit-Learn
    • penalty = ['l2']
    • C = np.logspace(0, 4, 10)
    • dual= [False]
    • max_iter= [100,110,120,130,140]
    • hyperparameters = dict(C=C, penalty=penalty, dual=dual, max_iter=max_iter)
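    • clf = GridSearchCV(logreg, hyperparameters, cv=5, verbose=0)
    • clf.fit(x_train, y_train)
      Here logreg is the LogisticRegression() instance from the previous step, and clf is the grid search object that gets fitted.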
  • MLR
    • dt_param <- makeParamSet( makeDiscreteParam("minsplit", values=seq(5,10,1)), makeDiscreteParam("minbucket", values=seq(round(5/3,0), round(10/3,0), 1)), makeNumericParam("cp", lower = 0.01, upper = 0.05), makeDiscreteParam("maxcompete", values=6), makeDiscreteParam("usesurrogate", values=0), makeDiscreteParam("maxdepth", values=10) )
    • ctrl = makeTuneControlGrid()
    • rdesc = makeResampleDesc("CV", iters = 3L, stratify=TRUE)
    • dt_tuneparam <- tuneParams(learner=dt_prob, resampling=rdesc, measures=list(tpr,auc, fnr, mmce, tnr, setAggregation(tpr, test.sd)), par.set=dt_param, control=ctrl, task=dt_task, show.info = TRUE)
    • dtree <- setHyperPars(dt_prob, par.vals = dt_tuneparam$x)
      Here dt_prob is the learner and dt_task is the classification task created in the previous step. The names dt_tuneparam and dtree are placeholders for the tuning result and the tuned learner.
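
A self-contained Scikit-Learn version of the grid search above might look like the following sketch. The synthetic data, the reduced grid (only C and max_iter), and the accuracy scoring are assumptions made to keep the example small and quick to run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption, not the article's dataset).
x, y = make_classification(n_samples=300, n_features=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# Search space: every combination of these values is tried (5 x 2 = 10).
hyperparameters = {
    "C": np.logspace(0, 4, 5),
    "max_iter": [100, 200],
}

logreg = LogisticRegression()
# 5-fold cross-validation on the training set for each combination.
clf = GridSearchCV(logreg, hyperparameters, cv=5, scoring="accuracy")
clf.fit(x_train, y_train)

print(clf.best_params_)  # best combination found
print(clf.best_score_)   # mean cross-validated accuracy for that combination

# With the default refit=True, best_estimator_ is already retrained
# on the full training set using the best combination.
best_model = clf.best_estimator_
```

This plays the same role as the tuneParams and setHyperPars pair in MLR, except that the refit happens inside fit.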

Training

Both packages provide one-line code for training a model.

  • Scikit-Learn
    • LogisticRegression().fit(x_train, y_train)
  • MLR
    • train(learner, task)

This is arguably one of the simpler steps in the process. The most arduous steps are tuning hyperparameters and selecting features.

Prediction

Just like training the model, prediction can be done with one line of code.

  • Scikit-Learn
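    • clf.predict(x_test)
      predict must be called on a fitted estimator (here clf, fitted during tuning). A freshly created LogisticRegression() has not been trained and cannot predict.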
  • MLR
    • predict(trained_model, newdata = newdata)

Scikit-Learn returns an array of predicted labels, while MLR returns a prediction object that wraps a data frame of predicted labels.
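
The training and prediction steps can be sketched together in Scikit-Learn as follows. The synthetic data and the variable names model, prediction, and probabilities are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the article's dataset).
x, y = make_classification(n_samples=200, n_features=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# fit() returns the estimator itself, so training fits on one line.
model = LogisticRegression().fit(x_train, y_train)

# predict() returns a NumPy array with one class label per test row.
prediction = model.predict(x_test)

# predict_proba() returns one column of probabilities per class.
probabilities = model.predict_proba(x_test)

print(prediction[:5])
print(probabilities[:5])
```

Keeping the fitted estimator in a variable matters: predictions have to come from the same object that was trained.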

Model Evaluation

The most popular method for evaluating a supervised classifier is a confusion matrix, from which you can obtain accuracy, error, precision, recall, etc.

  • Scikit-Learn
    • confusion_matrix(y_test, prediction) OR
    • classification_report(y_test,prediction)
  • MLR
    • performance(prediction, measures = list(tpr,auc,mmce, acc,tnr)) OR
    • calculateROCMeasures(prediction)

Both packages offer more than one way of obtaining a confusion matrix. However, in its simplest form, the Python output is not as informative as R's. The first Python call returns only a matrix with no labels, so the user has to go back to the documentation to work out which rows and columns correspond to which category (in Scikit-Learn, rows are the true classes and columns are the predicted classes). The second call has a better and more informative output, but it only reports precision, recall, F1 score, and support. These are also the more important performance measures in an imbalanced classification problem.
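
One way to make the Scikit-Learn confusion matrix easier to read is to fix the label order explicitly and unpack the four cells by name. The sketch below uses ten made-up labels purely for illustration; they are not results from any real model.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true and predicted labels, for illustration only.
y_test = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
prediction = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Passing labels fixes the row/column order: rows are true classes,
# columns are predicted classes.
labels = [0, 1]
cm = confusion_matrix(y_test, prediction, labels=labels)

print("rows = true class, columns = predicted class")
for label, row in zip(labels, cm):
    print(f"true {label}: {row.tolist()}")

# For a binary problem with labels [0, 1], ravel() yields the cells in
# the order: true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cm.ravel()
print(f"tn={tn} fp={fp} fn={fn} tp={tp}")

# Precision, recall, F1 score, and support per class.
print(classification_report(y_test, prediction))
```

From the named cells, measures such as the true positive rate (tp / (tp + fn)) can be computed directly, which is roughly what the MLR performance call reports.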

Decision Thresholding (i.e., Changing the Classification threshold)

A threshold in a classification problem is the probability cutoff used to assign each instance to a predicted category. The default threshold is typically 0.5 (i.e., 50%). This is a major point of difference when conducting machine learning in Python and R. R offers a one-line solution for manipulating the threshold to account for class imbalance. Python, at the time of writing, does not have a built-in function for this, so it is up to the user to manipulate the threshold programmatically with custom scripts or functions.

A pair of graphs showing decision thresholds.

  • Scikit-Learn
    • There is no one standard way of thresholding in Scikit-Learn. Check out this article for one way that you can implement it yourself: Fine-tuning a Classifier in Scikit-Learn
  • MLR
    • setThreshold(prediction, threshold)
      This one line of code in MLR changes the threshold and returns a new prediction, which can be passed to performance() to calculate your new performance metrics (e.g., the confusion matrix).
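
A minimal sketch of custom thresholding in Scikit-Learn is shown below. It is one possible approach, not a standard API: the helper predict_with_threshold is hypothetical, the imbalanced synthetic data is an assumption, and the code assumes a binary problem in which the positive class is labeled 1.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data, roughly 90% class 0 (an assumption).
x, y = make_classification(
    n_samples=500, n_features=5, weights=[0.9, 0.1], random_state=0
)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression().fit(x_train, y_train)


def predict_with_threshold(model, features, threshold=0.5):
    """Hypothetical helper: label a row 1 when P(class 1) >= threshold."""
    # Column 1 of predict_proba holds the probability of class 1,
    # because the classes here are ordered [0, 1].
    positive_proba = model.predict_proba(features)[:, 1]
    return (positive_proba >= threshold).astype(int)


default_pred = predict_with_threshold(model, x_test, threshold=0.5)
lowered_pred = predict_with_threshold(model, x_test, threshold=0.2)

# Lowering the threshold can only keep or increase the number of
# rows predicted as the positive class.
print(default_pred.sum(), lowered_pred.sum())
```

The thresholded labels can then be passed to confusion_matrix or classification_report in place of the output of predict.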

 

Conclusion

In the end, both MLR and Scikit-Learn have their pros and cons when dealing with machine learning. Our comparison focused on using either one for machine learning and does not serve as a reason to use one instead of the other. Knowing both is what can give a true competitive advantage to someone in the field. The conceptual understanding of the process will make it easier to use either tool.

 

Original. Reposted with permission.
