Variants of Gradient Boosting

LightGBM

Catboost

XGBoost

Approaches to finding the best split

Think about the Big-O complexity of each approach!

Approach 1 - Pre-sorted Algorithm

Cost of the split search: O(n_features * n_samples)

This is the traditional implementation found in scikit-learn, and it is also used in XGBoost.
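
A minimal sketch of the pre-sorted (exact greedy) search, assuming squared-error impurity; the function name and structure below are illustrative, not taken from any library:

import numpy as np

def best_split_presorted(X, y):
    """Exact greedy split search: for every feature, walk the pre-sorted
    values and score every possible threshold.
    Cost per node is roughly O(n_features * n_samples)."""
    n_samples, n_features = X.shape
    best = {"gain": -np.inf, "feature": None, "threshold": None}
    parent_sse = np.sum((y - y.mean()) ** 2)              # impurity before the split
    for j in range(n_features):
        order = np.argsort(X[:, j])                       # pre-sort this feature
        x_sorted, y_sorted = X[order, j], y[order]
        for i in range(1, n_samples):                     # every candidate threshold
            if x_sorted[i] == x_sorted[i - 1]:
                continue
            left, right = y_sorted[:i], y_sorted[i:]
            child_sse = (np.sum((left - left.mean()) ** 2)
                         + np.sum((right - right.mean()) ** 2))
            gain = parent_sse - child_sse
            if gain > best["gain"]:
                best = {"gain": gain, "feature": j,
                        "threshold": (x_sorted[i] + x_sorted[i - 1]) / 2}
    return best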

How can we reduce this time complexity when working with big datasets?


Approach 2 - Histogram-Based Algorithms

IDEA 1

Use histograms to bucket each feature into bins.

Use these bins to find the best split. (This works because splitting on binned values instead of raw values makes very little difference in accuracy.)

Note: using bins may also help prevent overfitting.

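A minimal sketch of the histogram idea under the same assumptions as above (illustrative only, not LightGBM's actual implementation): bin each feature once, then scan only the bin boundaries instead of every raw value.

import numpy as np

def best_split_histogram(X, y, n_bins=32):
    """Bucket each feature into n_bins quantile bins, then evaluate splits
    only at bin boundaries. After the one-off binning pass, the search per
    node costs roughly O(n_features * n_bins) instead of O(n_features * n_samples)."""
    n_samples, n_features = X.shape
    best = {"gain": -np.inf, "feature": None, "threshold": None}
    parent_sse = np.sum((y - y.mean()) ** 2)
    for j in range(n_features):
        edges = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1))
        bins = np.digitize(X[:, j], edges[1:-1])           # bin index for every sample
        for b in range(n_bins - 1):                        # candidate = a bin boundary
            mask = bins <= b
            n_left = mask.sum()
            if n_left == 0 or n_left == n_samples:
                continue
            left, right = y[mask], y[~mask]
            child_sse = (np.sum((left - left.mean()) ** 2)
                         + np.sum((right - right.mean()) ** 2))
            gain = parent_sse - child_sse
            if gain > best["gain"]:
                best = {"gain": gain, "feature": j, "threshold": edges[b + 1]}
    return best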

Approach 3 - Gradient-Based Strategy

IDEA 2

Use gradients to find the best split!

But how?

What do large and small gradients with respect to the loss function tell you? A large gradient means the model still fits that sample poorly, so it carries a lot of information about where to split; a small gradient means the sample is already well fit and contributes little.

Gradient-based One-Side Sampling (GOSS)
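
A rough sketch of the GOSS sampling step from the LightGBM paper: always keep the rows with the largest absolute gradients, randomly sample a fraction of the rest, and up-weight the sampled rows so the estimated information gain stays roughly unbiased (the rates and names below are illustrative):

import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Gradient-based One-Side Sampling.
    Rows with large |gradient| are still poorly fit and are always kept;
    rows with small |gradient| are down-sampled and re-weighted."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    order = np.argsort(-np.abs(gradients))        # descending by |gradient|
    top_idx = order[:n_top]                       # always keep the large-gradient rows
    rest_idx = order[n_top:]
    sampled_idx = rng.choice(rest_idx, size=n_other, replace=False)

    keep_idx = np.concatenate([top_idx, sampled_idx])
    weights = np.ones(len(keep_idx))
    # amplify the sampled small-gradient rows to compensate for the dropped ones
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return keep_idx, weights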

Taking advantage of the sparsity of the data

Ignoring sparse inputs (XGBoost and LightGBM)

XGBoost proposes to ignore the zero (missing) entries when enumerating candidate splits, and then to send all the data with missing values to whichever side of the split reduces the loss more (the "default direction"). This reduces the number of samples that have to be scanned when evaluating each split, speeding up training.
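
A minimal sketch of this sparsity-aware idea, assuming squared-error impurity and NaN-encoded missing values (function and variable names are illustrative, not XGBoost's API):

import numpy as np

def split_with_default_direction(x, y, threshold):
    """Score one candidate split on a single feature, trying both default
    directions for the missing (NaN) entries and keeping the better one."""
    missing = np.isnan(x)
    go_left = (x < threshold) & ~missing

    def sse(mask):
        left, right = y[mask], y[~mask]
        total = 0.0
        if len(left):
            total += np.sum((left - left.mean()) ** 2)
        if len(right):
            total += np.sum((right - right.mean()) ** 2)
        return total

    sse_missing_left = sse(go_left | missing)     # option A: missing rows default left
    sse_missing_right = sse(go_left)              # option B: missing rows default right
    if sse_missing_left <= sse_missing_right:
        return "left", sse_missing_left
    return "right", sse_missing_right

x = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
y = np.array([10.0, 12.0, 20.0, 22.0, 30.0])
print(split_with_default_direction(x, y, threshold=4.0))   # -> ('left', 104.0)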


Approach 4 - Exclusive Feature Bundling (LightGBM)

* Some features are never non-zero together (e.g. one-hot encoded columns), so they can be bundled into a single feature without losing information; see the sketch below.
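
A toy illustration of the bundling idea: two features that are never non-zero on the same row can be merged into one column by offsetting the second feature's values, so a single histogram covers both (a heavy simplification of LightGBM's EFB, which also tolerates occasional conflicts):

import numpy as np

def bundle_two_exclusive_features(f1, f2):
    """Merge two mutually exclusive features into one: keep f1's range as-is
    and shift f2's non-zero values past it so the two ranges never overlap."""
    assert not np.any((f1 != 0) & (f2 != 0)), "features are not mutually exclusive"
    offset = f1.max() + 1.0                       # shift f2 beyond f1's range
    bundled = f1.copy()
    bundled[f2 != 0] = f2[f2 != 0] + offset
    return bundled

# e.g. one-hot style columns are never non-zero on the same row
f1 = np.array([0.0, 2.0, 0.0, 1.0])
f2 = np.array([3.0, 0.0, 4.0, 0.0])
print(bundle_two_exclusive_features(f1, f2))      # [6. 2. 7. 1.]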

In [3]:
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.datasets import load_boston, load_wine
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn import metrics
import math
import numpy as np

def rmse(x, y):
    # root mean squared error between two arrays
    return math.sqrt(((x - y) ** 2).mean())

def print_score(m):
    # train/test RMSE and R^2 (expects X_train, X_test, y_train, y_test in scope)
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_test), y_test),
           m.score(X_train, y_train), m.score(X_test, y_test)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

# Boston housing dataset (note: load_boston has been removed from recent scikit-learn versions)
house_price = load_boston(return_X_y=False)
#house_price['data']
#house_price['feature_names']
#house_price['target']
X_df = pd.DataFrame(data=house_price['data'], columns=house_price['feature_names'])
y = house_price['target']
X_df.head(10)
Out[3]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33
5 0.02985 0.0 2.18 0.0 0.458 6.430 58.7 6.0622 3.0 222.0 18.7 394.12 5.21
6 0.08829 12.5 7.87 0.0 0.524 6.012 66.6 5.5605 5.0 311.0 15.2 395.60 12.43
7 0.14455 12.5 7.87 0.0 0.524 6.172 96.1 5.9505 5.0 311.0 15.2 396.90 19.15
8 0.21124 12.5 7.87 0.0 0.524 5.631 100.0 6.0821 5.0 311.0 15.2 386.63 29.93
9 0.17004 12.5 7.87 0.0 0.524 6.004 85.9 6.5921 5.0 311.0 15.2 386.71 17.10
In [12]:
X_df.nunique()
Out[12]:
CRIM       504
ZN          26
INDUS       76
CHAS         2
NOX         81
RM         446
AGE        356
DIS        412
RAD          9
TAX         66
PTRATIO     46
B          357
LSTAT      455
dtype: int64

Catboost DEMO
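
The cells that actually train the models are not shown in this export; the plots below only reference preds1 and preds. A plausible reconstruction is sketched here purely as an assumption (not the original code): preds1 is taken to come from CatBoost and preds from sklearn's GradientBoostingRegressor, with placeholder hyperparameters.

from catboost import CatBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# hypothetical split and models; the original notebook cells are missing
X_train, X_test, y_train, y_test = train_test_split(X_df, y, test_size=0.2, random_state=42)

cb = CatBoostRegressor(iterations=500, learning_rate=0.05, depth=6, verbose=0)
cb.fit(X_train, y_train)
preds1 = cb.predict(X_test)        # assumed to be the CatBoost predictions plotted below

gbr = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05, max_depth=3)
gbr.fit(X_train, y_train)
preds = gbr.predict(X_test)        # assumed to be the sklearn predictions plotted below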

In [45]:
#pred=m.predict(X_test)
plt.plot(y_test,label='orig')
plt.plot(preds1,label='pred')
plt.legend()
plt.show()
rmse(y_test,preds1)
Out[45]:
4.501594069582915
In [28]:
#pred=m.predict(X_test)
plt.plot(y_test,label='orig')
plt.plot(preds,label='pred')
plt.legend()
plt.show()
rmse(y_test,preds)
Out[28]:
5.893656062165955