Introduction to Data Analysis, IAD-1

HSE University, 2018-19 academic year

Homework 4. Do-it-yourself gradient boosting

Completed by: Подчезерцев Алексей

General information

Date issued: 27.04.2019

Deadline: 23:59 12.05.2019

Grading and penalties

A late submission is penalized by 1 point per day off the final grade for the assignment, but the grade cannot go below zero.

Attention! The homework must be done independently. "Similar" solutions are considered plagiarism, and none of the students involved (including those whose work was copied) can receive more than 0 points for it.

Submission format

Do not delete the task statements!

Solution files are uploaded to the Anytask system.

File name format: homework_04_Подчезерцев_Алексей.ipynb

Task 1. (0.5 points)

We will use data from the Home Credit Default Risk competition.

Load the application_train.csv table;
Store the target column in Y;
Remove unneeded columns (use the description for this);
Determine the column types and fill in the missing values (any strategy is acceptable);
Split the sample 70:30 with random_state=0.

Because the classes are heavily imbalanced, we will use the area under the precision-recall curve as the quality metric throughout.
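
As a rough illustration of this metric, here is a minimal sketch on made-up labels and scores. The toy arrays and the helper name `pr_auc_sketch` are assumptions for illustration only. It builds the precision-recall curve with scikit-learn and integrates it with the trapezoidal rule.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

def pr_auc_sketch(y_true, y_scores):
    """Area under the precision-recall curve (trapezoidal rule)."""
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    return auc(recall, precision)

# Toy, imbalanced labels: one positive out of four objects (made-up data).
y_toy = np.array([0, 0, 0, 1])
perfect_scores = np.array([0.1, 0.2, 0.3, 0.9])  # the positive gets the top score
worst_scores = np.array([0.9, 0.8, 0.7, 0.1])    # the positive gets the lowest score

perfect = pr_auc_sketch(y_toy, perfect_scores)
worst = pr_auc_sketch(y_toy, worst_scores)
print(perfect, worst)
```

A ranking that puts the rare positive class first scores higher than one that puts it last, which is the behaviour we want from a metric on imbalanced data.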

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # used by plot_corr below
import seaborn as sns
%matplotlib inline

%%time
X_orig = pd.read_csv('application_train.csv', index_col=0)

Wall time: 16.7 s

%%time
X = X_orig.copy()

Wall time: 171 ms

Convert the binary features to numeric form

%%time
X["CODE_GENDER"] = X.apply(lambda x: 1 if x["CODE_GENDER"] == 'M' else 0, axis=1)
X["FLAG_OWN_CAR"] = X.apply(lambda x: 1 if x["FLAG_OWN_CAR"] == 'Y' else 0, axis=1)
X["FLAG_OWN_REALTY"] = X.apply(lambda x: 1 if x["FLAG_OWN_REALTY"] == 'Y' else 0, axis=1)
X["NAME_CONTRACT_TYPE"] = X.apply(lambda x: 1 if x["NAME_CONTRACT_TYPE"] == 'Revolving loans' else 0, axis=1)
#X = X.drop("NAME_TYPE_SUITE", axis=1)

Wall time: 23.2 s
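
The row-wise `apply` above is what makes this cell slow. A possible vectorized alternative is sketched below, assuming the same column names; the helper `encode_binary` and the toy frame are hypothetical. It compares the whole column at once and casts the boolean result to integers, which should give the same 0/1 values.

```python
import pandas as pd

def encode_binary(df, column, positive_value):
    """Return 1 where df[column] equals positive_value, else 0 (vectorized)."""
    return (df[column] == positive_value).astype(int)

# Made-up rows that mimic two of the binary columns.
toy = pd.DataFrame({
    "CODE_GENDER": ["M", "F", "XNA", "M"],
    "FLAG_OWN_CAR": ["Y", "N", "N", "Y"],
})
toy["CODE_GENDER"] = encode_binary(toy, "CODE_GENDER", "M")
toy["FLAG_OWN_CAR"] = encode_binary(toy, "FLAG_OWN_CAR", "Y")
print(toy)
```

As in the original lambda, any value other than the positive one (including unexpected ones such as "XNA" in this toy frame) maps to 0.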

def plot_corr(D, size):
    corr = D.corr()
    corr = np.abs(corr)
    f, ax = plt.subplots(figsize=(size, size))
    cmap = plt.cm.Oranges
    sns.heatmap(corr, cmap=cmap,
            xticklabels=corr.columns,
            yticklabels=corr.columns)

%%time
plot_corr(X, 60)

Wall time: 20.2 s

The housing metrics are strongly correlated with each other, so we drop them

%%time
#
X.drop("APARTMENTS_AVG", axis=1, inplace=True)
X.drop("BASEMENTAREA_AVG", axis=1, inplace=True)
# X.drop("YEARS_BEGINEXPLUATATION_AVG", axis=1, inplace=True)
X.drop("YEARS_BUILD_AVG", axis=1, inplace=True)
X.drop("COMMONAREA_AVG", axis=1, inplace=True)
X.drop("ELEVATORS_AVG", axis=1, inplace=True)
X.drop("ENTRANCES_AVG", axis=1, inplace=True)
X.drop("FLOORSMAX_AVG", axis=1, inplace=True)
X.drop("FLOORSMIN_AVG", axis=1, inplace=True)
X.drop("LANDAREA_AVG", axis=1, inplace=True)
X.drop("LIVINGAPARTMENTS_AVG", axis=1, inplace=True)
# X.drop("LIVINGAREA_AVG", axis=1, inplace=True)
X.drop("NONLIVINGAPARTMENTS_AVG", axis=1, inplace=True)
X.drop("NONLIVINGAREA_AVG", axis=1, inplace=True)
X.drop("APARTMENTS_MODE", axis=1, inplace=True)
X.drop("BASEMENTAREA_MODE", axis=1, inplace=True)
X.drop("YEARS_BEGINEXPLUATATION_MODE", axis=1, inplace=True)
X.drop("YEARS_BUILD_MODE", axis=1, inplace=True)
X.drop("COMMONAREA_MODE", axis=1, inplace=True)
X.drop("ELEVATORS_MODE", axis=1, inplace=True)
X.drop("ENTRANCES_MODE", axis=1, inplace=True)
X.drop("FLOORSMAX_MODE", axis=1, inplace=True)
X.drop("FLOORSMIN_MODE", axis=1, inplace=True)
X.drop("LANDAREA_MODE", axis=1, inplace=True)
X.drop("LIVINGAPARTMENTS_MODE", axis=1, inplace=True)
X.drop("LIVINGAREA_MODE", axis=1, inplace=True)
X.drop("NONLIVINGAPARTMENTS_MODE", axis=1, inplace=True)
X.drop("NONLIVINGAREA_MODE", axis=1, inplace=True)
X.drop("APARTMENTS_MEDI", axis=1, inplace=True)
X.drop("BASEMENTAREA_MEDI", axis=1, inplace=True)
X.drop("YEARS_BEGINEXPLUATATION_MEDI", axis=1, inplace=True)
X.drop("YEARS_BUILD_MEDI", axis=1, inplace=True)
X.drop("COMMONAREA_MEDI", axis=1, inplace=True)
X.drop("ELEVATORS_MEDI", axis=1, inplace=True)
X.drop("ENTRANCES_MEDI", axis=1, inplace=True)
X.drop("FLOORSMAX_MEDI", axis=1, inplace=True)
X.drop("FLOORSMIN_MEDI", axis=1, inplace=True)
X.drop("LANDAREA_MEDI", axis=1, inplace=True)
X.drop("LIVINGAPARTMENTS_MEDI", axis=1, inplace=True)
X.drop("LIVINGAREA_MEDI", axis=1, inplace=True)
X.drop("NONLIVINGAPARTMENTS_MEDI", axis=1, inplace=True)
X.drop("NONLIVINGAREA_MEDI", axis=1, inplace=True)
X.drop("FONDKAPREMONT_MODE", axis=1, inplace=True)
X.drop("HOUSETYPE_MODE", axis=1, inplace=True)
X.drop("TOTALAREA_MODE", axis=1, inplace=True)
X.drop("WALLSMATERIAL_MODE", axis=1, inplace=True)
X.drop("EMERGENCYSTATE_MODE", axis=1, inplace=True)

Wall time: 8.33 s

We also drop some features that are strongly correlated with other features

X.drop("AMT_ANNUITY", axis=1, inplace=True)
X.drop("AMT_GOODS_PRICE", axis=1, inplace=True)
X.drop("DAYS_EMPLOYED", axis=1, inplace=True)
X.drop("REGION_RATING_CLIENT_W_CITY", axis=1, inplace=True)
X.drop("LIVE_CITY_NOT_WORK_CITY", axis=1, inplace=True)
X.drop("LIVE_REGION_NOT_WORK_REGION", axis=1, inplace=True)
X.drop("OBS_60_CNT_SOCIAL_CIRCLE", axis=1, inplace=True)
X.drop("DEF_60_CNT_SOCIAL_CIRCLE", axis=1, inplace=True)

%%time
plot_corr(X, 40)

Wall time: 4.43 s
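
The columns above were picked by eye from the heatmap. A small sketch of how such pairs could be listed programmatically is given below; the function name `correlated_pairs`, the threshold value, and the toy data are assumptions, not what was used in this notebook. It keeps only the upper triangle of the absolute correlation matrix so that each pair appears once.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.9):
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep the strict upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Made-up data: "b" is an exact multiple of "a", "c" is only weakly related.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})
pairs = correlated_pairs(toy)
print(pairs)
```

One column from each reported pair is then a candidate for removal; which one to keep remains a judgment call.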

X.head()

_x_null=X.isnull().sum(axis=0)
_x_null = _x_null[_x_null != 0]
_x_null.shape

(18,)

_x_null[_x_null != 0]

NAME_TYPE_SUITE                  1292
OWN_CAR_AGE                    202929
OCCUPATION_TYPE                 96391
CNT_FAM_MEMBERS                     2
EXT_SOURCE_1                   173378
EXT_SOURCE_2                      660
EXT_SOURCE_3                    60965
YEARS_BEGINEXPLUATATION_AVG    150007
LIVINGAREA_AVG                 154350
OBS_30_CNT_SOCIAL_CIRCLE         1021
DEF_30_CNT_SOCIAL_CIRCLE         1021
DAYS_LAST_PHONE_CHANGE              1
AMT_REQ_CREDIT_BUREAU_HOUR      41519
AMT_REQ_CREDIT_BUREAU_DAY       41519
AMT_REQ_CREDIT_BUREAU_WEEK      41519
AMT_REQ_CREDIT_BUREAU_MON       41519
AMT_REQ_CREDIT_BUREAU_QRT       41519
AMT_REQ_CREDIT_BUREAU_YEAR      41519
dtype: int64

X["OWN_CAR_AGE"].fillna(0, inplace=True)
X["OCCUPATION_TYPE"].fillna('', inplace=True)
X["EXT_SOURCE_1"].fillna(0.5, inplace=True)
X["EXT_SOURCE_2"].fillna(0.5, inplace=True)
X["EXT_SOURCE_3"].fillna(0.5, inplace=True)
X["YEARS_BEGINEXPLUATATION_AVG"].fillna(0.5, inplace=True)
X["LIVINGAREA_AVG"].fillna(0.5, inplace=True)

X["AMT_REQ_CREDIT_BUREAU_HOUR"].fillna(0, inplace=True)
X["AMT_REQ_CREDIT_BUREAU_DAY"].fillna(0, inplace=True)
X["AMT_REQ_CREDIT_BUREAU_WEEK"].fillna(0, inplace=True)
X["AMT_REQ_CREDIT_BUREAU_MON"].fillna(0, inplace=True)
X["AMT_REQ_CREDIT_BUREAU_QRT"].fillna(0, inplace=True)
X["AMT_REQ_CREDIT_BUREAU_YEAR"].fillna(0, inplace=True)
X.head()
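
The same fill strategy can also be written as a single `fillna` call with a column-to-value dictionary. This is only a sketch on a made-up frame (the rows are invented; the fill values mirror the ones chosen above), and it returns a new frame instead of modifying columns in place.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps in three of the columns filled above.
toy = pd.DataFrame({
    "OWN_CAR_AGE": [3.0, np.nan, 10.0],
    "OCCUPATION_TYPE": ["Laborers", None, "Drivers"],
    "EXT_SOURCE_1": [0.2, np.nan, 0.7],
})

# One constant per column, same choices as in the cell above.
fill_values = {"OWN_CAR_AGE": 0, "OCCUPATION_TYPE": "", "EXT_SOURCE_1": 0.5}
toy_filled = toy.fillna(value=fill_values)

remaining = int(toy_filled.isnull().sum().sum())
print(toy_filled, remaining)
```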

_x_null=X.isnull().sum(axis=0)
_x_null = _x_null[_x_null != 0]
_x_null.shape

(5,)

_x_null[_x_null != 0]

NAME_TYPE_SUITE             1292
CNT_FAM_MEMBERS                2
OBS_30_CNT_SOCIAL_CIRCLE    1021
DEF_30_CNT_SOCIAL_CIRCLE    1021
DAYS_LAST_PHONE_CHANGE         1
dtype: int64

X.dropna(axis = 0, inplace=True)

X.shape

(305197, 68)

X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 305197 entries, 100002 to 456255
Data columns (total 68 columns):
TARGET                         305197 non-null int64
NAME_CONTRACT_TYPE             305197 non-null int64
CODE_GENDER                    305197 non-null int64
FLAG_OWN_CAR                   305197 non-null int64
FLAG_OWN_REALTY                305197 non-null int64
CNT_CHILDREN                   305197 non-null int64
AMT_INCOME_TOTAL               305197 non-null float64
AMT_CREDIT                     305197 non-null float64
NAME_TYPE_SUITE                305197 non-null object
NAME_INCOME_TYPE               305197 non-null object
NAME_EDUCATION_TYPE            305197 non-null object
NAME_FAMILY_STATUS             305197 non-null object
NAME_HOUSING_TYPE              305197 non-null object
REGION_POPULATION_RELATIVE     305197 non-null float64
DAYS_BIRTH                     305197 non-null int64
DAYS_REGISTRATION              305197 non-null float64
DAYS_ID_PUBLISH                305197 non-null int64
OWN_CAR_AGE                    305197 non-null float64
FLAG_MOBIL                     305197 non-null int64
FLAG_EMP_PHONE                 305197 non-null int64
FLAG_WORK_PHONE                305197 non-null int64
FLAG_CONT_MOBILE               305197 non-null int64
FLAG_PHONE                     305197 non-null int64
FLAG_EMAIL                     305197 non-null int64
OCCUPATION_TYPE                305197 non-null object
CNT_FAM_MEMBERS                305197 non-null float64
REGION_RATING_CLIENT           305197 non-null int64
WEEKDAY_APPR_PROCESS_START     305197 non-null object
HOUR_APPR_PROCESS_START        305197 non-null int64
REG_REGION_NOT_LIVE_REGION     305197 non-null int64
REG_REGION_NOT_WORK_REGION     305197 non-null int64
REG_CITY_NOT_LIVE_CITY         305197 non-null int64
REG_CITY_NOT_WORK_CITY         305197 non-null int64
ORGANIZATION_TYPE              305197 non-null object
EXT_SOURCE_1                   305197 non-null float64
EXT_SOURCE_2                   305197 non-null float64
EXT_SOURCE_3                   305197 non-null float64
YEARS_BEGINEXPLUATATION_AVG    305197 non-null float64
LIVINGAREA_AVG                 305197 non-null float64
OBS_30_CNT_SOCIAL_CIRCLE       305197 non-null float64
DEF_30_CNT_SOCIAL_CIRCLE       305197 non-null float64
DAYS_LAST_PHONE_CHANGE         305197 non-null float64
FLAG_DOCUMENT_2                305197 non-null int64
FLAG_DOCUMENT_3                305197 non-null int64
FLAG_DOCUMENT_4                305197 non-null int64
FLAG_DOCUMENT_5                305197 non-null int64
FLAG_DOCUMENT_6                305197 non-null int64
FLAG_DOCUMENT_7                305197 non-null int64
FLAG_DOCUMENT_8                305197 non-null int64
FLAG_DOCUMENT_9                305197 non-null int64
FLAG_DOCUMENT_10               305197 non-null int64
FLAG_DOCUMENT_11               305197 non-null int64
FLAG_DOCUMENT_12               305197 non-null int64
FLAG_DOCUMENT_13               305197 non-null int64
FLAG_DOCUMENT_14               305197 non-null int64
FLAG_DOCUMENT_15               305197 non-null int64
FLAG_DOCUMENT_16               305197 non-null int64
FLAG_DOCUMENT_17               305197 non-null int64
FLAG_DOCUMENT_18               305197 non-null int64
FLAG_DOCUMENT_19               305197 non-null int64
FLAG_DOCUMENT_20               305197 non-null int64
FLAG_DOCUMENT_21               305197 non-null int64
AMT_REQ_CREDIT_BUREAU_HOUR     305197 non-null float64
AMT_REQ_CREDIT_BUREAU_DAY      305197 non-null float64
AMT_REQ_CREDIT_BUREAU_WEEK     305197 non-null float64
AMT_REQ_CREDIT_BUREAU_MON      305197 non-null float64
AMT_REQ_CREDIT_BUREAU_QRT      305197 non-null float64
AMT_REQ_CREDIT_BUREAU_YEAR     305197 non-null float64
dtypes: float64(20), int64(40), object(8)
memory usage: 160.7+ MB

X = X.reset_index(drop=True)

y = X["TARGET"]
X = X.drop("TARGET", axis=1)

X.head()

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

Task 2. (1.5 points)

Train the LightGBM and CatBoost implementations of gradient boosting on the numeric features without tuning parameters. Why is there a noticeable difference in quality?

In this and all subsequent experiments, the model training time must be measured.

from catboost import CatBoostClassifier 
import lightgbm as lgb
from sklearn.metrics import precision_recall_curve, auc

def pr_auc(y_true, y_scores):
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    return auc(recall, precision)

X_train_2 = X_train.select_dtypes(include=["int64", "float64"])
X_test_2 = X_test.select_dtypes(include=["int64", "float64"])

%%time
yandex_clf = CatBoostClassifier(logging_level='Silent')
yandex_clf.fit(X_train_2, y_train)

Wall time: 2min 35s

y_predict_2 = yandex_clf.predict_proba(X_test_2)[:,1]
print("PR-AUC for default yandex boost", pr_auc(y_test, y_predict_2))

PR-AUC for default yandex boost 0.2219287676589921

%%time
l_clf = lgb.LGBMClassifier()
l_clf.fit(X_train_2, y_train)

Wall time: 5.04 s

y_predict_2 = l_clf.predict_proba(X_test_2)[:,1]
print("PR-AUC for default lgb", pr_auc(y_test, y_predict_2))

PR-AUC for default lgb 0.22241316543186374

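CatBoost and LightGBM use different default parameters (for example, CatBoost builds 1000 trees by default, while LightGBM builds 100), so their quality and training time differ. In this run the PR-AUC values are close (0.2219 vs 0.2224), but CatBoost took 2 min 35 s to train versus 5.04 s for LightGBM.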

Task 3. (2 points)

Using CV=3, find the optimal parameters of the algorithms by varying:

the tree depth;
the number of trees;
the learning rate;
the optimized objective.

Analyze the relationship between tree depth and the number of trees depending on the algorithm.

from sklearn.model_selection import KFold
from time import time
CV = 3
kf = KFold(n_splits=CV,shuffle=True, random_state=0)

X_3 = X.select_dtypes(include=["int64", "float64"])

def calc_CV_classifier(kf, x, y, classifier):
    _auc = []
    _times = []
    for train_index, test_index in kf.split(x):
        begin = time()
        classifier.fit(x.values[train_index], y[train_index])
        _times.append(time() - begin)
        y_predict = classifier.predict_proba(x.values[test_index])[:,1]
        _auc.append(pr_auc(y[test_index], y_predict))
    d = {}
    d["auc_mean"] = [sum(_auc)/len(_auc)]
    d["time_mean"] = [sum(_times)/len(_times)] 
    for i in range(len(_auc)):
        d["auc_" + str(i + 1)] = [_auc[i]]
    for i in range(len(_times)):   
        d["time_" + str(i + 1)] = [_times[i] ]
    return pd.DataFrame(data=d)
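
With a strongly imbalanced target, a plain `KFold` split can give folds with somewhat different shares of the positive class. Below is a hedged sketch of the same cross-validation loop with `StratifiedKFold`. It is an alternative, not what this notebook ran: the synthetic data and scikit-learn's `GradientBoostingClassifier` are stand-ins chosen only to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_curve, auc
from sklearn.model_selection import StratifiedKFold

def pr_auc(y_true, y_scores):
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    return auc(recall, precision)

# Synthetic imbalanced data (about 10% positives), a stand-in for the real table.
X_toy, y_toy = make_classification(n_samples=600, n_features=8,
                                   weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
fold_scores, fold_pos_rates = [], []
for train_idx, test_idx in skf.split(X_toy, y_toy):
    # A fresh model per fold keeps the folds independent.
    clf = GradientBoostingClassifier(n_estimators=20, max_depth=2, random_state=0)
    clf.fit(X_toy[train_idx], y_toy[train_idx])
    proba = clf.predict_proba(X_toy[test_idx])[:, 1]
    fold_scores.append(pr_auc(y_toy[test_idx], proba))
    fold_pos_rates.append(y_toy[test_idx].mean())

print(np.mean(fold_scores), fold_pos_rates)
```

Stratification keeps the positive-class share nearly identical across folds, which tends to make per-fold PR-AUC values easier to compare.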

%%time
max_trees = 100
learning_rate = 0.1
result_depth = pd.DataFrame()
for depth in range(1, 10, 2):
    ya_clf = CatBoostClassifier(logging_level='Silent', depth=depth, iterations=max_trees, learning_rate=learning_rate)
    ya_result = calc_CV_classifier(kf, X_3, y, ya_clf)
    ya_result.insert(0, "name", ["ya_depth_" + str(depth)])
    result_depth = result_depth.append(ya_result, ignore_index=True)
  
    l_clf = lgb.LGBMClassifier(max_depth=depth, n_estimators=max_trees, learning_rate=learning_rate)
    l_result = calc_CV_classifier(kf, X_3, y, l_clf)
    l_result.insert(0, "name", ["lgbm_depth_" + str(depth)])
    result_depth = result_depth.append(l_result, ignore_index=True)   

result_depth.set_index('name')

Wall time: 8min 24s

result_depth.sort_values("auc_mean")[::-1]

def log_int_iterator(start, end, step):
    i = start
    while i <= end:
        yield int(i)
        i *= step

%%time
depth = 7
learning_rate = 0.1
result_trees = pd.DataFrame()
for max_trees in log_int_iterator(80, 160, 1.3):
    ya_clf = CatBoostClassifier(logging_level='Silent', depth=depth, iterations=max_trees, learning_rate=learning_rate)
    ya_result = calc_CV_classifier(kf, X_3, y, ya_clf)
    ya_result.insert(0, "name", ["ya_trees_" + str(max_trees)])
    result_trees = result_trees.append(ya_result, ignore_index=True)
for max_trees in log_int_iterator(160, 320, 1.3):  
    l_clf = lgb.LGBMClassifier(max_depth=depth, n_estimators=max_trees, learning_rate=learning_rate)
    l_result = calc_CV_classifier(kf, X_3, y, l_clf)
    l_result.insert(0, "name", ["lgbm_trees_" + str(max_trees)])
    result_trees = result_trees.append(l_result, ignore_index=True)   

result_trees.set_index('name')

Wall time: 6min

result_trees.sort_values("auc_mean")[::-1]

def log_iterator(start, end, step):
    i = start
    while i <= end:
        yield i
        i *= step
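
Both helpers produce geometric grids: each value is the previous one multiplied by `step`. To make the searched values explicit, the sketch below repeats the two generators and lists the grids used in the cells of this task (the variable names are illustrative).

```python
def log_int_iterator(start, end, step):
    """Yield the integer part of start * step**k while the value is <= end."""
    i = start
    while i <= end:
        yield int(i)
        i *= step

def log_iterator(start, end, step):
    """Yield start * step**k while the value is <= end."""
    i = start
    while i <= end:
        yield i
        i *= step

catboost_trees = list(log_int_iterator(80, 160, 1.3))
lgbm_trees = list(log_int_iterator(160, 320, 1.3))
catboost_rates = [round(r, 4) for r in log_iterator(0.1, 0.2, 1.1)]

print(catboost_trees)  # tree counts tried for CatBoost
print(lgbm_trees)      # tree counts tried for LightGBM
print(catboost_rates)  # learning rates tried for CatBoost
```

The grids are short: three tree counts per library and eight CatBoost learning rates, so the search covers only a narrow neighbourhood of the starting values.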

%%time
depth = 7
max_trees = 100
result_rate = pd.DataFrame()
for learning_rate in log_iterator(0.1, 0.2, 1.1):
    ya_clf = CatBoostClassifier(logging_level='Silent', depth=depth, iterations=max_trees, learning_rate=learning_rate)
    ya_result = calc_CV_classifier(kf, X_3, y, ya_clf)
    ya_result.insert(0, "name", ["ya_rate_" + str(learning_rate)])
    result_rate = result_rate.append(ya_result, ignore_index=True)
for learning_rate in log_iterator(0.01, 0.1, 1.2):  
    l_clf = lgb.LGBMClassifier(max_depth=depth, n_estimators=max_trees, learning_rate=learning_rate)
    l_result = calc_CV_classifier(kf, X_3, y, l_clf)
    l_result.insert(0, "name", ["lgbm_rate_" + str(learning_rate)])
    result_rate = result_rate.append(l_result, ignore_index=True)   

result_rate.set_index('name')

Wall time: 11min 1s

result_rate.sort_values("auc_mean")[::-1]

%%time
depth = 7
max_trees = 100
learning_rate=0.1
result_obj = pd.DataFrame()
for loss_function in ["CrossEntropy", "Logloss", None]:
    ya_clf = CatBoostClassifier(logging_level='Silent', depth=depth, iterations=max_trees, learning_rate=learning_rate, loss_function=loss_function)
    ya_result = calc_CV_classifier(kf, X_3, y, ya_clf)
    ya_result.insert(0, "name", ["ya_loss_function_" + str(loss_function)])
    result_obj = result_obj.append(ya_result, ignore_index=True)

Wall time: 2min 31s

%%time
result_obj_2 = result_obj.copy()

l_clf = lgb.LGBMClassifier(max_depth=depth, n_estimators=max_trees, learning_rate=learning_rate)
l_result = calc_CV_classifier(kf, X_3, y, l_clf)
l_result.insert(0, "name", ["lgbm_objective_None"])
result_obj_2 = result_obj_2.append(l_result, ignore_index=True)   

l_clf = lgb.LGBMClassifier(max_depth=depth, n_estimators=max_trees, learning_rate=learning_rate, is_unbalance=True)
l_result = calc_CV_classifier(kf, X_3, y, l_clf)
l_result.insert(0, "name", ["lgbm_objective_is_unbalance=True"])
result_obj_2 = result_obj_2.append(l_result, ignore_index=True)   

result_obj_2.set_index('name')

Wall time: 32.7 s

%%time
result_obj_3 = result_obj_2.copy()
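# Note: in LightGBM `metric` only sets the evaluation metric; the optimized loss is set by `objective`.
for metric in ["binary_logloss", "binary_error", "cross_entropy"]:  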
    l_clf = lgb.LGBMClassifier(max_depth=depth, n_estimators=max_trees, learning_rate=learning_rate, is_unbalance=True, metric=metric)
    l_result = calc_CV_classifier(kf, X_3, y, l_clf)
    l_result.insert(0, "name", ["lgbm_objective_" + str(metric) + "_is_unbalance=True"])
    result_obj_3 = result_obj_3.append(l_result, ignore_index=True)

Wall time: 35.3 s

result_obj_3.sort_values("auc_mean")[::-1]

The more trees there are and the deeper they are, the longer training takes. Model quality improves up to a certain point and then starts to decline.

Task 4. (3.5 points)

Add the categorical features to the numeric ones in the following ways:

as OHE features;
as counters with smoothing.

Loops are not allowed when computing the counters.

On the resulting datasets, tune the parameters of each algorithm. How does the time needed to train the model change depending on the encoding method? Compare the results with the built-in methods for handling categorical features.

def calc_classifier(x_train, y_train, x_test, y_test, classifier):
    begin = time()
    classifier.fit(x_train, y_train)
    _times = time() - begin
    y_predict = classifier.predict_proba(x_test)[:,1]
    _auc = pr_auc(y_test, y_predict)
    d = {}
    d["auc"] = [_auc]
    d["time"] = [_times]
    return pd.DataFrame(data=d)

X_train_OHE = X_train.copy()
X_test_OHE = X_test.copy()
y_train_OHE = y_train.copy()
y_test_OHE = y_test.copy()

%%time
for i in X_train_OHE:
    if X_train_OHE[i].dtype == 'object':
        X_train_OHE = pd.concat([X_train_OHE, pd.get_dummies(X_train_OHE[i], prefix = i)], axis=1)       
        X_test_OHE = pd.concat([X_test_OHE, pd.get_dummies(X_test_OHE[i], prefix = i)], axis=1) 
X_train_OHE.head()

Wall time: 3.21 s

X_train_OHE = X_train_OHE.select_dtypes(include=["int64", "float64", "uint8"])
X_test_OHE = X_test_OHE.select_dtypes(include=["int64", "float64", "uint8"])
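
Calling `pd.get_dummies` separately on the train and test parts can produce different column sets when a category occurs in only one of them. One possible way to guard against this is sketched below on made-up rows: the test frame is reindexed to the training columns. The explicit `dtype="uint8"` is an assumption made so that the `select_dtypes` filter above would still pick up the dummy columns in pandas versions that return booleans by default.

```python
import pandas as pd

# Made-up rows: "Rented" appears only in train, "With parents" only in test.
train = pd.DataFrame({"NAME_HOUSING_TYPE": ["House", "Rented", "House"],
                      "CNT_CHILDREN": [0, 1, 2]})
test = pd.DataFrame({"NAME_HOUSING_TYPE": ["House", "With parents"],
                     "CNT_CHILDREN": [1, 0]})

train_ohe = pd.get_dummies(train, columns=["NAME_HOUSING_TYPE"], dtype="uint8")
test_ohe = pd.get_dummies(test, columns=["NAME_HOUSING_TYPE"], dtype="uint8")

# Align test to the training layout: missing columns become 0, extras are dropped.
test_ohe = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)
print(test_ohe)
```

After the reindex both frames have the same columns in the same order, which is what the classifiers expect at prediction time.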

%%time
max_trees = 100
learning_rate = 0.1
result_depth_OHE = pd.DataFrame()
for depth in range(5, 10, 2):
    ya_result = calc_classifier(X_train_OHE, y_train_OHE, X_test_OHE, y_test_OHE, CatBoostClassifier(logging_level='Silent', depth=depth, iterations=max_trees, learning_rate=learning_rate))
    ya_result.insert(0, "name", ["ya_depth_" + str(depth)])
    result_depth_OHE = result_depth_OHE.append(ya_result, ignore_index=True)
for depth in range(11, 20, 2):  
    l_result = calc_classifier(X_train_OHE, y_train_OHE, X_test_OHE, y_test_OHE,  lgb.LGBMClassifier(max_depth=depth, n_estimators=max_trees, learning_rate=learning_rate))
    l_result.insert(0, "name", ["lgbm_depth_" + str(depth)])
    result_depth_OHE = result_depth_OHE.append(l_result, ignore_index=True)   

result_depth_OHE.set_index('name')

Wall time: 1min 47s

result_depth_OHE.sort_values("auc")[::-1]

%%time
learning_rate = 0.1
result_trees_OHE = pd.DataFrame()
for max_trees in log_int_iterator(80, 160, 1.3):
    ya_result = calc_classifier(X_train_OHE, y_train_OHE, X_test_OHE, y_test_OHE,  
                                CatBoostClassifier(logging_level='Silent', depth=7, iterations=max_trees, learning_rate=learning_rate))
    ya_result.insert(0, "name", ["ya_trees_" + str(max_trees)])
    result_trees_OHE = result_trees_OHE.append(ya_result, ignore_index=True)
for max_trees in log_int_iterator(160, 320, 1.3):  
    l_result = calc_classifier(X_train_OHE, y_train_OHE, X_test_OHE, y_test_OHE,  
                               lgb.LGBMClassifier(max_depth=17, n_estimators=max_trees, learning_rate=learning_rate))
    l_result.insert(0, "name", ["lgbm_trees_" + str(max_trees)])
    result_trees_OHE = result_trees_OHE.append(l_result, ignore_index=True)   

result_trees_OHE.set_index('name')

Wall time: 1min 53s

result_trees_OHE.sort_values("auc")[::-1]

%%time
result_rate_OHE = pd.DataFrame()
for learning_rate in log_iterator(0.1, 0.2, 1.1):
    ya_result =calc_classifier(X_train_OHE, y_train_OHE, X_test_OHE, y_test_OHE,  
                               CatBoostClassifier(logging_level='Silent', depth=7, iterations=150, learning_rate=learning_rate))
    ya_result.insert(0, "name", ["ya_rate_" + str(learning_rate)])
    result_rate_OHE = result_rate_OHE.append(ya_result, ignore_index=True)
for learning_rate in log_iterator(0.01, 0.1, 1.2):  
    l_result = calc_classifier(X_train_OHE, y_train_OHE, X_test_OHE, y_test_OHE, 
                               lgb.LGBMClassifier(max_depth=17, n_estimators=150, learning_rate=learning_rate))
    l_result.insert(0, "name", ["lgbm_rate_" + str(learning_rate)])
    result_rate_OHE = result_rate_OHE.append(l_result, ignore_index=True)   

result_rate_OHE.set_index('name')

Wall time: 9min 4s

result_rate_OHE.sort_values("auc")[::-1]

X_train_SC = X_train.copy()
X_test_SC = X_test.copy()
y_train_SC = y_train.copy()
y_test_SC = y_test.copy()

%%time
y_mean = y_train_SC.mean()
for i in X_train_SC:
    if X_train_SC[i].dtype == 'object':
        # Fit
        uniq = pd.DataFrame(data=X_train_SC[i].unique(), columns=["unique"])
        val = uniq.apply(lambda x: 1 if y_train_SC[X_train_SC[X_train_SC[i] == x["unique"]].index].mean() > y_mean else 0, axis=1)
        uniq.insert(loc=1, column='value', value=val)
        uniq = uniq.set_index("unique")
        # Transform
        X_train_SC[i] = X_train_SC.apply(lambda x: uniq.loc[x[i],"value"], axis=1)
        X_test_SC[i] = X_test_SC.apply(lambda x: uniq.loc[x[i],"value"], axis=1)

Wall time: 1min 23s

X_train_SC = X_train_SC.select_dtypes(include=["int64", "float64", "uint8"])
X_test_SC = X_test_SC.select_dtypes(include=["int64", "float64", "uint8"])
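
The cell above replaces each category with a 0/1 flag showing whether its target mean exceeds the global mean. For comparison, here is a sketch of a commonly used smoothed counter (mean target encoding pulled towards the global mean), written without explicit loops over rows. It is a different technique from the one in the cell above; the function name `smoothed_counter`, the smoothing strength `alpha`, and the toy data are all assumptions.

```python
import pandas as pd

def smoothed_counter(train_cat, train_target, alpha=10.0):
    """Per-category (sum + alpha * prior) / (count + alpha), fitted on train only."""
    prior = train_target.mean()
    stats = train_target.groupby(train_cat).agg(["sum", "count"])
    encoding = (stats["sum"] + alpha * prior) / (stats["count"] + alpha)
    return encoding, prior

# Made-up categorical feature and binary target.
train_cat = pd.Series(["a", "a", "a", "b", "b", "c"])
train_y = pd.Series([1, 0, 0, 1, 1, 0])
test_cat = pd.Series(["a", "c", "d"])  # "d" never occurs in train

encoding, prior = smoothed_counter(train_cat, train_y, alpha=2.0)
train_encoded = train_cat.map(encoding)
# Categories unseen in train fall back to the global mean.
test_encoded = test_cat.map(encoding).fillna(prior)
print(encoding, test_encoded)
```

Rare categories are shrunk towards the global mean, and unseen test categories get the global mean instead of raising an error. Encoding the training rows with their own labels can still leak the target, so computing the counters out-of-fold is often preferred.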

%%time
max_trees = 100
learning_rate = 0.1
result_depth_SC = pd.DataFrame()
for depth in range(8, 16, 2):
    ya_result =calc_classifier(X_train_SC, y_train_SC, X_test_SC, y_test_SC,  
                               CatBoostClassifier(logging_level='Silent', depth=depth, iterations=max_trees, learning_rate=learning_rate))
    ya_result.insert(0, "name", ["ya_depth_" + str(depth)])
    result_depth_SC = result_depth_SC.append(ya_result, ignore_index=True)
for depth in range(5, 10, 2):  
    l_result = calc_classifier(X_train_SC, y_train_SC, X_test_SC, y_test_SC,   
                               lgb.LGBMClassifier(max_depth=depth, n_estimators=max_trees, learning_rate=learning_rate))
    l_result.insert(0, "name", ["lgbm_depth_" + str(depth)])
    result_depth_SC = result_depth_SC.append(l_result, ignore_index=True)   

result_depth_SC.set_index('name')

Wall time: 5min 31s

result_depth_SC.sort_values("auc")[::-1]

%%time
learning_rate = 0.1
result_trees_SC = pd.DataFrame()
for max_trees in log_int_iterator(80, 300, 1.3):
    ya_result = calc_classifier(X_train_SC, y_train_SC, X_test_SC, y_test_SC,  
                                CatBoostClassifier(logging_level='Silent', depth=8, iterations=max_trees, learning_rate=learning_rate))
    ya_result.insert(0, "name", ["ya_trees_" + str(max_trees)])
    result_trees_SC = result_trees_SC.append(ya_result, ignore_index=True)
for max_trees in log_int_iterator(160, 320, 1.3):  
    l_result = calc_classifier(X_train_SC, y_train_SC, X_test_SC, y_test_SC,  
                               lgb.LGBMClassifier(max_depth=7, n_estimators=max_trees, learning_rate=learning_rate))
    l_result.insert(0, "name", ["lgbm_trees_" + str(max_trees)])
    result_trees_SC = result_trees_SC.append(l_result, ignore_index=True)   

result_trees_SC.set_index('name')

Wall time: 3min 53s

result_trees_SC.sort_values("auc")[::-1]

%%time
result_rate_SC = pd.DataFrame()
for learning_rate in log_iterator(0.1, 0.2, 1.1):
    ya_result = calc_classifier(X_train_SC, y_train_SC, X_test_SC, y_test_SC,  
                                CatBoostClassifier(logging_level='Silent', depth=8, iterations=300, learning_rate=learning_rate))
    ya_result.insert(0, "name", ["ya_rate_" + str(learning_rate)])
    result_rate_SC = result_rate_SC.append(ya_result, ignore_index=True)
for learning_rate in log_iterator(0.01, 0.1, 1.2):  
    l_result = calc_classifier(X_train_SC, y_train_SC, X_test_SC, y_test_SC,  
                               lgb.LGBMClassifier(max_depth=7, n_estimators=160, learning_rate=learning_rate))
    l_result.insert(0, "name", ["lgbm_rate_" + str(learning_rate)])
    result_rate_SC = result_rate_SC.append(l_result, ignore_index=True)   

result_rate_SC.set_index('name')

Wall time: 11min

result_rate_SC.sort_values("auc")[::-1]

CatBoost categorical

from catboost import Pool, CatBoostClassifier

categorical = np.array(X_train.select_dtypes(include=["object"]).columns)
categorical

array(['NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
       'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE'], dtype=object)

train_dataset = Pool(data=X_train,
                     label=y_train,
                     cat_features=categorical)

eval_dataset = Pool(data=X_test,
                    label=y_test,
                    cat_features=categorical)

model = CatBoostClassifier(iterations=135,
                           learning_rate=0.146,
                           depth=7)

model.fit(train_dataset)

0:	learn: 0.5498153	total: 558ms	remaining: 1m 14s
1:	learn: 0.4577457	total: 932ms	remaining: 1m 1s
2:	learn: 0.3933832	total: 1.43s	remaining: 1m 2s
3:	learn: 0.3515783	total: 1.92s	remaining: 1m 2s
4:	learn: 0.3221710	total: 2.36s	remaining: 1m 1s
5:	learn: 0.3031934	total: 2.81s	remaining: 1m
6:	learn: 0.2900960	total: 3.29s	remaining: 1m
7:	learn: 0.2802120	total: 3.75s	remaining: 59.6s
8:	learn: 0.2739610	total: 4.21s	remaining: 58.9s
9:	learn: 0.2686102	total: 4.66s	remaining: 58.3s
10:	learn: 0.2651570	total: 5.11s	remaining: 57.6s
11:	learn: 0.2620787	total: 5.58s	remaining: 57.2s
12:	learn: 0.2600087	total: 6.22s	remaining: 58.4s
13:	learn: 0.2583669	total: 6.83s	remaining: 59s
14:	learn: 0.2566506	total: 7.33s	remaining: 58.7s
15:	learn: 0.2554154	total: 7.82s	remaining: 58.1s
16:	learn: 0.2546885	total: 8.36s	remaining: 58s
17:	learn: 0.2540173	total: 8.91s	remaining: 57.9s
18:	learn: 0.2531824	total: 9.38s	remaining: 57.3s
19:	learn: 0.2527277	total: 10s	remaining: 57.7s
20:	learn: 0.2523273	total: 10.8s	remaining: 58.6s
21:	learn: 0.2519660	total: 11.4s	remaining: 58.5s
22:	learn: 0.2515779	total: 12s	remaining: 58.3s
23:	learn: 0.2511539	total: 12.5s	remaining: 57.6s
24:	learn: 0.2508874	total: 13s	remaining: 57.1s
25:	learn: 0.2504573	total: 13.5s	remaining: 56.5s
26:	learn: 0.2501976	total: 14s	remaining: 55.9s
27:	learn: 0.2499544	total: 14.5s	remaining: 55.4s
28:	learn: 0.2497565	total: 15s	remaining: 54.8s
29:	learn: 0.2496000	total: 15.5s	remaining: 54.2s
30:	learn: 0.2492999	total: 16s	remaining: 53.8s
31:	learn: 0.2491192	total: 16.5s	remaining: 53.2s
32:	learn: 0.2488717	total: 17.2s	remaining: 53.2s
33:	learn: 0.2487257	total: 17.7s	remaining: 52.7s
34:	learn: 0.2485779	total: 18.2s	remaining: 52s
35:	learn: 0.2485027	total: 18.6s	remaining: 51.2s
36:	learn: 0.2483549	total: 19.1s	remaining: 50.5s
37:	learn: 0.2482492	total: 19.5s	remaining: 49.9s
38:	learn: 0.2481184	total: 20s	remaining: 49.2s
39:	learn: 0.2479598	total: 20.5s	remaining: 48.6s
40:	learn: 0.2477822	total: 20.9s	remaining: 48s
41:	learn: 0.2476596	total: 21.4s	remaining: 47.3s
42:	learn: 0.2476343	total: 21.9s	remaining: 46.8s
43:	learn: 0.2475439	total: 22.5s	remaining: 46.5s
44:	learn: 0.2474827	total: 22.9s	remaining: 45.8s
45:	learn: 0.2473764	total: 23.4s	remaining: 45.3s
46:	learn: 0.2472550	total: 23.9s	remaining: 44.7s
47:	learn: 0.2471388	total: 24.4s	remaining: 44.2s
48:	learn: 0.2470800	total: 24.8s	remaining: 43.5s
49:	learn: 0.2469901	total: 25.3s	remaining: 43s
50:	learn: 0.2469083	total: 25.8s	remaining: 42.4s
51:	learn: 0.2467997	total: 26.2s	remaining: 41.9s
52:	learn: 0.2466692	total: 26.7s	remaining: 41.3s
53:	learn: 0.2465628	total: 27.1s	remaining: 40.7s
54:	learn: 0.2464672	total: 27.6s	remaining: 40.1s
55:	learn: 0.2463405	total: 28s	remaining: 39.6s
56:	learn: 0.2462651	total: 28.5s	remaining: 38.9s
57:	learn: 0.2461808	total: 28.9s	remaining: 38.4s
58:	learn: 0.2461140	total: 29.4s	remaining: 37.8s
59:	learn: 0.2460520	total: 29.8s	remaining: 37.3s
60:	learn: 0.2459615	total: 30.3s	remaining: 36.8s
61:	learn: 0.2458719	total: 30.8s	remaining: 36.3s
62:	learn: 0.2457610	total: 31.4s	remaining: 35.9s
63:	learn: 0.2456988	total: 31.9s	remaining: 35.4s
64:	learn: 0.2456547	total: 32.4s	remaining: 34.9s
65:	learn: 0.2455689	total: 32.8s	remaining: 34.3s
66:	learn: 0.2455301	total: 33.3s	remaining: 33.8s
67:	learn: 0.2454726	total: 33.7s	remaining: 33.2s
68:	learn: 0.2453989	total: 34.2s	remaining: 32.7s
69:	learn: 0.2453339	total: 34.7s	remaining: 32.3s
70:	learn: 0.2452627	total: 35.3s	remaining: 31.8s
71:	learn: 0.2451527	total: 35.7s	remaining: 31.3s
72:	learn: 0.2450547	total: 36.3s	remaining: 30.9s
73:	learn: 0.2449666	total: 36.9s	remaining: 30.4s
74:	learn: 0.2448939	total: 37.3s	remaining: 29.9s
75:	learn: 0.2448265	total: 37.8s	remaining: 29.3s
76:	learn: 0.2447641	total: 38.2s	remaining: 28.8s
77:	learn: 0.2446385	total: 38.7s	remaining: 28.2s
78:	learn: 0.2445640	total: 39.1s	remaining: 27.7s
79:	learn: 0.2444399	total: 39.5s	remaining: 27.2s
80:	learn: 0.2443790	total: 40s	remaining: 26.7s
81:	learn: 0.2443560	total: 40.4s	remaining: 26.1s
82:	learn: 0.2442477	total: 40.9s	remaining: 25.6s
83:	learn: 0.2442240	total: 41.3s	remaining: 25.1s
84:	learn: 0.2441174	total: 41.8s	remaining: 24.6s
85:	learn: 0.2440652	total: 42.2s	remaining: 24s
86:	learn: 0.2440192	total: 42.6s	remaining: 23.5s
87:	learn: 0.2439386	total: 43.1s	remaining: 23s
88:	learn: 0.2438737	total: 43.6s	remaining: 22.5s
89:	learn: 0.2437947	total: 44s	remaining: 22s
90:	learn: 0.2437083	total: 44.4s	remaining: 21.5s
91:	learn: 0.2436437	total: 44.8s	remaining: 20.9s
92:	learn: 0.2435485	total: 45.2s	remaining: 20.4s
93:	learn: 0.2434764	total: 45.7s	remaining: 19.9s
94:	learn: 0.2433881	total: 46.3s	remaining: 19.5s
95:	learn: 0.2433211	total: 46.8s	remaining: 19s
96:	learn: 0.2432510	total: 47.2s	remaining: 18.5s
97:	learn: 0.2431556	total: 47.7s	remaining: 18s
98:	learn: 0.2430725	total: 48.1s	remaining: 17.5s
99:	learn: 0.2429764	total: 48.7s	remaining: 17.1s
100:	learn: 0.2429229	total: 49.3s	remaining: 16.6s
101:	learn: 0.2428362	total: 49.7s	remaining: 16.1s
102:	learn: 0.2427608	total: 50.2s	remaining: 15.6s
103:	learn: 0.2426981	total: 50.6s	remaining: 15.1s
104:	learn: 0.2426320	total: 51.2s	remaining: 14.6s
105:	learn: 0.2425525	total: 51.7s	remaining: 14.1s
106:	learn: 0.2424648	total: 52.2s	remaining: 13.7s
107:	learn: 0.2423618	total: 52.7s	remaining: 13.2s
108:	learn: 0.2422521	total: 53.1s	remaining: 12.7s
109:	learn: 0.2422300	total: 53.5s	remaining: 12.2s
110:	learn: 0.2421200	total: 54s	remaining: 11.7s
111:	learn: 0.2420304	total: 54.4s	remaining: 11.2s
112:	learn: 0.2419500	total: 54.9s	remaining: 10.7s
113:	learn: 0.2418972	total: 55.4s	remaining: 10.2s
114:	learn: 0.2418163	total: 55.8s	remaining: 9.71s
115:	learn: 0.2417474	total: 56.3s	remaining: 9.22s
116:	learn: 0.2417198	total: 56.8s	remaining: 8.73s
117:	learn: 0.2416341	total: 57.2s	remaining: 8.24s
118:	learn: 0.2415379	total: 57.7s	remaining: 7.76s
119:	learn: 0.2414737	total: 58.2s	remaining: 7.27s
120:	learn: 0.2414033	total: 58.7s	remaining: 6.79s
121:	learn: 0.2413545	total: 59.1s	remaining: 6.3s
122:	learn: 0.2412670	total: 59.6s	remaining: 5.82s
123:	learn: 0.2411448	total: 1m	remaining: 5.33s
124:	learn: 0.2410740	total: 1m	remaining: 4.84s
125:	learn: 0.2409829	total: 1m	remaining: 4.35s
126:	learn: 0.2409350	total: 1m 1s	remaining: 3.87s
127:	learn: 0.2408700	total: 1m 1s	remaining: 3.38s
128:	learn: 0.2407545	total: 1m 2s	remaining: 2.9s
129:	learn: 0.2407068	total: 1m 2s	remaining: 2.42s
130:	learn: 0.2406661	total: 1m 3s	remaining: 1.94s
131:	learn: 0.2406052	total: 1m 4s	remaining: 1.46s
132:	learn: 0.2405590	total: 1m 4s	remaining: 971ms
133:	learn: 0.2405088	total: 1m 5s	remaining: 486ms
134:	learn: 0.2404435	total: 1m 5s	remaining: 0us

<catboost.core.CatBoostClassifier at 0x17e15c71240>

preds_proba = model.predict_proba(eval_dataset)[:,1]
print("Catboost pr auc", pr_auc(y_test, preds_proba))

Catboost pr auc 0.22378657110847214
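
The pr_auc helper used above is defined earlier in the notebook and is not shown in this section. As a rough illustration of what such a metric measures, here is a minimal pure-Python sketch of average precision, a step-wise estimate of the area under the precision-recall curve. The name average_precision and the toy data are assumptions made for this example; the notebook's own helper may integrate the curve differently (for instance with the trapezoidal rule), and tied scores get no special treatment here.

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("y_true must contain at least one positive label")
    # Rank objects from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_positives = 0
    area = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            # Recall grows by 1 / n_pos here; precision at this cut is tp / rank.
            area += (true_positives / rank) / n_pos
    return area


# Toy data (hypothetical): two positives among four objects.
y_toy = [1, 0, 1, 0]
s_toy = [0.9, 0.8, 0.7, 0.1]
ap_toy = average_precision(y_toy, s_toy)

# The metric depends only on the ordering of the scores, not on their scale.
ap_scaled = average_precision(y_toy, [2 * s for s in s_toy])
```

With strong class imbalance a random ranking scores roughly the share of positives, so PR AUC values should be read against that base rate rather than against 0.5.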

AUC 0.223777, time 59.692731 for SC

AUC 0.225587, time 31.980033 for OHE

CatBoost's built-in handling of categorical features is roughly on par with SC in both time and quality.

X_train_lgbm = X_train.copy()
X_test_lgbm = X_test.copy()

l_cat = "name:" + ','.join(categorical)

for i in categorical:
    X_train_lgbm[i] = X_train_lgbm[i].astype('category')
    X_test_lgbm[i] = X_test_lgbm[i].astype('category')

lgb_train = lgb.Dataset(X_train_lgbm, y_train)

%%time
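lgb_params = {
    'objective': 'binary',
    'learning_rate'    : 0.074,
    'max_depth'        : 5,
    # `n_estimators` is an alias of `num_boost_round` and takes precedence
    # (see the warning below), so 160 rounds are trained, not 100.
    'n_estimators'     : 160}

gbm = lgb.train(lgb_params,
                lgb_train,
                num_boost_round=100)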

C:\Users\alex1\Anaconda3\lib\site-packages\lightgbm\engine.py:118: UserWarning: Found `n_estimators` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))

Wall time: 5.05 s

print("Lgbm pr auc", pr_auc(y_test,  gbm.predict(X_test_lgbm)))

Lgbm pr auc 0.2251404634039964

AUC 0.228050, time 9.101934 for SC

AUC 0.228120, time 11.166834 for OHE

Manual encoding gave slightly better results but took more time. A possible reason is the difference in hyperparameters, which were tuned separately for the samples without categorical features and for the samples with encoded ones.

Conclusion:

With LightGBM, smoothed counters (SC) train somewhat faster than OHE (9.1 s against 11.2 s), although with CatBoost the SC run was the slower one (59.7 s against 32.0 s). The difference may come from the different number of features in the samples after encoding. At the same time, the optimal hyperparameters found for the two encodings differ, which suggests that the resulting data differ in nature.

LightGBM and CatBoost can both work with categorical features directly. For LightGBM it is enough to convert the columns to the category dtype and pass the data as is; CatBoost takes a list of categorical columns and handles them itself.
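
To make the comparison of the two manual encodings concrete, here is a minimal pure-Python sketch. The smoothing formula (category target sum plus alpha pseudo-observations of the global mean) is one common variant and is an assumption; the notebook's own SC implementation from the previous task is not shown in this section. The function names, alpha, and the toy values are hypothetical; only the category labels come from the dataset.

```python
from collections import defaultdict


def smoothed_counter(categories, targets, alpha=1.0):
    """Replace each category by a smoothed mean of the target.

    encoded(c) = (sum of targets in c + alpha * global_mean) / (count of c + alpha)
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    table = {c: (sums[c] + alpha * global_mean) / (counts[c] + alpha) for c in counts}
    return [table[c] for c in categories], table


def one_hot(categories):
    """Return the sorted category levels and one 0/1 row per object."""
    levels = sorted(set(categories))
    rows = [[1 if c == level else 0 for level in levels] for c in categories]
    return levels, rows


# Hypothetical toy column and target.
income_type = ["Working", "Working", "State servant"]
target = [1, 0, 1]

encoded, sc_table = smoothed_counter(income_type, target, alpha=1.0)
levels, ohe_rows = one_hot(income_type)
# SC yields one numeric column; OHE yields one column per category level.
```

In practice the counter table would be fitted on the training part only (or out of fold) to limit target leakage.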

Task 5. (1 point)

Implement blending (obtain the predictions of several models and combine them with weights, which must be fitted on the training set) for the models tuned in the previous task, and compare the quality.

models = {
    "cat_OHE" : {"model":CatBoostClassifier(logging_level='Silent', depth=7, iterations=150, learning_rate=0.12)},
    "lgbm_OHE" : {"model":lgb.LGBMClassifier(max_depth=17, n_estimators=200, learning_rate=0.074)},
    "cat_SC" : {"model":CatBoostClassifier(logging_level='Silent', depth=8, iterations=300, learning_rate=0.1)},
    "lgbm_SC" : {"model":lgb.LGBMClassifier(max_depth=7, n_estimators=160, learning_rate=0.062)},
}

kf = KFold(n_splits=2,shuffle=True, random_state=0)
_y_train = y_train_OHE
_y_train = _y_train.reset_index(drop=True)
for k,v in models.items():
    _x_train = None
    if k.endswith("_OHE"):
        _x_train = X_train_OHE
    else:
        _x_train = X_train_SC
    _x_train = _x_train.reset_index(drop=True)

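    # Only the first of the two folds is used (note the break): fit on one half
    # of the training set and predict on the other half, the blending hold-out.
    for train_index, test_index in kf.split(_x_train):
        begin = time()
        v["model"].fit(_x_train.values[train_index], _y_train[train_index])
        print(k, "fit time: ", time() - begin)
        v["proba"] = v["model"].predict_proba(_x_train.values[test_index])[:,1]
        v["auc"] = pr_auc(_y_train[test_index], v["proba"])
        break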

cat_OHE fit time:  20.19306516647339
lgbm_OHE fit time:  4.79600191116333
cat_SC fit time:  44.660996437072754
lgbm_SC fit time:  4.1190056800842285

from sklearn.linear_model import LinearRegression

d_blend = pd.DataFrame()
for k,v in models.items():
    d_blend[k] = pd.Series(v['proba'])
d_blend.head()

lr_blend = LinearRegression()
lr_blend.fit(d_blend, _y_train[test_index])
lr_blend.coef_

array([0.2876268 , 0.12899986, 0.3083336 , 0.26534119])

out = pd.DataFrame()
for k,v in models.items(): 
    _x_test = None
    if k.endswith("_OHE"):
        _x_test = X_test_OHE
    else:
        _x_test = X_test_SC
    out[k] = v["model"].predict_proba(_x_test)[:,1]
out.head()

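# Equal weights (0.5 each): the coefficients fitted in lr_blend are not applied.
# PR AUC depends only on ranking, so this scores like a plain average.
result_predict = np.sum(out.values * 0.5, axis=1)
result_predict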

array([0.37159091, 0.37848636, 0.1152988 , ..., 0.06779153, 0.05902516,
       0.10185661])

_auc = pr_auc(y_test_OHE,result_predict)
_auc

0.22509396174519794

rows = [{"name": k, "auc": v["auc"]} for k, v in models.items()]
rows.append({"name": "blend", "auc": _auc})
# DataFrame.append was removed in pandas 2.0, so the frame is built in one go
result_blend = pd.DataFrame(rows)

result_blend.sort_values("auc")[::-1]

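Blending did not improve quality. Note that the blend is scored on the test set with equal weights, while the individual models are scored on the held-out half of the training set, so the comparison is only approximate.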
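
One way to use weights that are actually fitted on held-out training data is sketched below. It swaps the LinearRegression used above for a plain grid search over a single weight for two models, scored by a step-wise PR AUC. All function names and the toy predictions are hypothetical, and this is an illustration of the idea rather than the notebook's method.

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve."""
    n_pos = sum(y_true)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_positives, area = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            area += (true_positives / rank) / n_pos
    return area


def blend(columns, weights):
    """Weighted sum of per-model predictions, object by object."""
    n_objects = len(columns[0])
    return [sum(w * col[i] for w, col in zip(weights, columns))
            for i in range(n_objects)]


def fit_blend_weight(y_holdout, pred_a, pred_b, steps=10):
    """Pick w on a grid over [0, 1] that maximizes the metric of w*a + (1-w)*b."""
    best_w, best_score = 0.0, float("-inf")
    for k in range(steps + 1):
        w = k / steps
        score = average_precision(y_holdout, blend([pred_a, pred_b], [w, 1.0 - w]))
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score


# Hypothetical hold-out labels and predictions of two models.
y_holdout = [1, 0, 1, 0, 0, 1]
holdout_a = [0.9, 0.6, 0.3, 0.2, 0.1, 0.8]
holdout_b = [0.4, 0.1, 0.9, 0.5, 0.2, 0.7]
w_best, holdout_score = fit_blend_weight(y_holdout, holdout_a, holdout_b)

# The weight fitted on the hold-out is then applied to the test predictions.
test_a = [0.2, 0.7]
test_b = [0.4, 0.5]
test_blend = blend([test_a, test_b], [w_best, 1.0 - w_best])
```

Because the grid includes w = 0 and w = 1, the blend can never score below either single model on the hold-out; whether the gain carries over to the test set is a separate question.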

Task 6. (1.5 points)

In task 3 you tuned the hyperparameters of LightGBM and CatBoost on real-valued features. Visualize the feature importances computed by these algorithms as a horizontal bar plot (sort the features by decreasing importance and label the feature names on the y axis).

For each of the two algorithms, remove the unimportant features (the bar plot usually shows clearly the importance threshold at which the tail of unimportant features begins) and train the same model on the resulting data. Does the quality drop much when the features the model considers unimportant are removed?

import matplotlib.pyplot as plt
%matplotlib inline

_x_train = X_train.select_dtypes(include=["int64", "float64"])
_x_test = X_test.select_dtypes(include=["int64", "float64"])

%%time
clf = lgb.LGBMClassifier(max_depth=5, n_estimators=160, learning_rate=0.074)
clf.fit(_x_train, y_train)
feature_imp = pd.DataFrame(sorted(zip(clf.feature_importances_,_x_train.columns)), columns=['Value','Feature'])
print("pr-auc", pr_auc(y_test, clf.predict_proba(_x_test)[:,1]))

pr-auc 0.22302033721552292
Wall time: 6.4 s

objects = feature_imp['Feature']
y_pos = np.arange(len(objects))
performance = feature_imp['Value']
plt.figure(figsize=(20,20))
plt.grid(True)
plt.title('Feature importances for LGBM')
plt.barh(y_pos, performance, align='center', alpha=0.5)
plt.yticks(y_pos, objects)
plt.show()

select = feature_imp.sort_values("Value")[::-1]["Feature"][:41].values
_x_train_lgb = _x_train[select]
_x_test_lgb = _x_test[select]
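
The cut-off of 41 features above is presumably read off the bar plot. An equivalent selection by an explicit importance threshold could be sketched as follows; the function name, the threshold, and the importance values are hypothetical (only the column names come from the dataset).

```python
def select_features(importances, threshold):
    """Return feature names with importance >= threshold, most important first."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, value in ranked if value >= threshold]


# Hypothetical importance values, for illustration only.
toy_importances = {
    "AMT_CREDIT": 310,
    "AMT_INCOME_TOTAL": 120,
    "CNT_CHILDREN": 25,
    "FLAG_DOCUMENT_18": 0,
    "FLAG_DOCUMENT_19": 0,
}
kept = select_features(toy_importances, threshold=20)
dropped = [name for name in toy_importances if name not in kept]
```

The resulting list can be used in the same way as select above to index the training and test frames.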

%%time
clf = lgb.LGBMClassifier(max_depth=5, n_estimators=160, learning_rate=0.074)
clf.fit(_x_train_lgb, y_train)
feature_imp = pd.DataFrame(sorted(zip(clf.feature_importances_,_x_train_lgb.columns)), columns=['Value','Feature'])
print("pr-auc", pr_auc(y_test, clf.predict_proba(_x_test_lgb)[:,1]))

pr-auc 0.22301141951869569
Wall time: 5.15 s

%%time
clf = CatBoostClassifier(max_depth=7, n_estimators=135, learning_rate=0.146)
clf.fit(_x_train, y_train)
feature_imp = pd.DataFrame(sorted(zip(clf.feature_importances_,_x_train.columns)), columns=['Value','Feature'])
print("pr-auc", pr_auc(y_test, clf.predict_proba(_x_test)[:,1]))

0:	learn: 0.5466391	total: 164ms	remaining: 22s
1:	learn: 0.4562286	total: 321ms	remaining: 21.3s
2:	learn: 0.3935334	total: 481ms	remaining: 21.2s
3:	learn: 0.3543266	total: 638ms	remaining: 20.9s
4:	learn: 0.3248818	total: 790ms	remaining: 20.5s
5:	learn: 0.3053074	total: 945ms	remaining: 20.3s
6:	learn: 0.2903538	total: 1.14s	remaining: 20.8s
7:	learn: 0.2813296	total: 1.31s	remaining: 20.7s
8:	learn: 0.2747577	total: 1.47s	remaining: 20.6s
9:	learn: 0.2693616	total: 1.63s	remaining: 20.4s
10:	learn: 0.2647987	total: 1.84s	remaining: 20.7s
11:	learn: 0.2621068	total: 2.05s	remaining: 21s
12:	learn: 0.2598813	total: 2.29s	remaining: 21.4s
13:	learn: 0.2583308	total: 2.51s	remaining: 21.7s
14:	learn: 0.2571862	total: 2.7s	remaining: 21.6s
15:	learn: 0.2559531	total: 2.88s	remaining: 21.4s
16:	learn: 0.2550236	total: 3.12s	remaining: 21.7s
17:	learn: 0.2543727	total: 3.31s	remaining: 21.5s
18:	learn: 0.2536770	total: 3.49s	remaining: 21.3s
19:	learn: 0.2532385	total: 3.65s	remaining: 21s
20:	learn: 0.2528858	total: 3.82s	remaining: 20.7s
21:	learn: 0.2525923	total: 3.98s	remaining: 20.4s
22:	learn: 0.2521166	total: 4.15s	remaining: 20.2s
23:	learn: 0.2517843	total: 4.32s	remaining: 20s
24:	learn: 0.2514605	total: 4.49s	remaining: 19.8s
25:	learn: 0.2511555	total: 4.66s	remaining: 19.6s
26:	learn: 0.2508952	total: 4.93s	remaining: 19.7s
27:	learn: 0.2507467	total: 5.17s	remaining: 19.8s
28:	learn: 0.2505553	total: 5.37s	remaining: 19.6s
29:	learn: 0.2504023	total: 5.55s	remaining: 19.4s
30:	learn: 0.2502301	total: 5.74s	remaining: 19.2s
31:	learn: 0.2499631	total: 5.93s	remaining: 19.1s
32:	learn: 0.2497816	total: 6.09s	remaining: 18.8s
33:	learn: 0.2496380	total: 6.28s	remaining: 18.6s
34:	learn: 0.2494756	total: 6.47s	remaining: 18.5s
35:	learn: 0.2492645	total: 6.66s	remaining: 18.3s
36:	learn: 0.2490400	total: 6.85s	remaining: 18.1s
37:	learn: 0.2489032	total: 7.05s	remaining: 18s
38:	learn: 0.2487495	total: 7.24s	remaining: 17.8s
39:	learn: 0.2485909	total: 7.43s	remaining: 17.6s
40:	learn: 0.2484669	total: 7.61s	remaining: 17.4s
41:	learn: 0.2483230	total: 7.81s	remaining: 17.3s
42:	learn: 0.2482206	total: 7.97s	remaining: 17.1s
43:	learn: 0.2481188	total: 8.17s	remaining: 16.9s
44:	learn: 0.2479314	total: 8.36s	remaining: 16.7s
45:	learn: 0.2478285	total: 8.53s	remaining: 16.5s
46:	learn: 0.2477250	total: 8.7s	remaining: 16.3s
47:	learn: 0.2476175	total: 8.87s	remaining: 16.1s
48:	learn: 0.2475045	total: 9.06s	remaining: 15.9s
49:	learn: 0.2474546	total: 9.23s	remaining: 15.7s
50:	learn: 0.2473639	total: 9.43s	remaining: 15.5s
51:	learn: 0.2472873	total: 9.59s	remaining: 15.3s
52:	learn: 0.2471622	total: 9.77s	remaining: 15.1s
53:	learn: 0.2470383	total: 9.93s	remaining: 14.9s
54:	learn: 0.2469383	total: 10.1s	remaining: 14.7s
55:	learn: 0.2467582	total: 10.3s	remaining: 14.5s
56:	learn: 0.2467027	total: 10.5s	remaining: 14.3s
57:	learn: 0.2466219	total: 10.7s	remaining: 14.1s
58:	learn: 0.2465000	total: 10.8s	remaining: 14s
59:	learn: 0.2464183	total: 11s	remaining: 13.8s
60:	learn: 0.2463628	total: 11.2s	remaining: 13.6s
61:	learn: 0.2462682	total: 11.4s	remaining: 13.4s
62:	learn: 0.2461664	total: 11.5s	remaining: 13.2s
63:	learn: 0.2460555	total: 11.6s	remaining: 12.9s
64:	learn: 0.2459491	total: 11.8s	remaining: 12.7s
65:	learn: 0.2459013	total: 12s	remaining: 12.5s
66:	learn: 0.2458418	total: 12.1s	remaining: 12.3s
67:	learn: 0.2457953	total: 12.2s	remaining: 12s
68:	learn: 0.2457364	total: 12.3s	remaining: 11.8s
69:	learn: 0.2456418	total: 12.5s	remaining: 11.6s
70:	learn: 0.2455152	total: 12.6s	remaining: 11.4s
71:	learn: 0.2454336	total: 12.7s	remaining: 11.1s
72:	learn: 0.2453576	total: 12.9s	remaining: 10.9s
73:	learn: 0.2452786	total: 13s	remaining: 10.7s
74:	learn: 0.2451672	total: 13.2s	remaining: 10.6s
75:	learn: 0.2451019	total: 13.4s	remaining: 10.4s
76:	learn: 0.2450247	total: 13.5s	remaining: 10.2s
77:	learn: 0.2449193	total: 13.6s	remaining: 9.96s
78:	learn: 0.2448351	total: 13.8s	remaining: 9.81s
79:	learn: 0.2447962	total: 14s	remaining: 9.62s
80:	learn: 0.2447685	total: 14.1s	remaining: 9.41s
81:	learn: 0.2446774	total: 14.2s	remaining: 9.21s
82:	learn: 0.2446143	total: 14.4s	remaining: 9.01s
83:	learn: 0.2445683	total: 14.5s	remaining: 8.83s
84:	learn: 0.2445222	total: 14.7s	remaining: 8.67s
85:	learn: 0.2444601	total: 14.9s	remaining: 8.47s
86:	learn: 0.2443848	total: 15s	remaining: 8.29s
87:	learn: 0.2442999	total: 15.1s	remaining: 8.09s
88:	learn: 0.2442105	total: 15.3s	remaining: 7.9s
89:	learn: 0.2441195	total: 15.4s	remaining: 7.71s
90:	learn: 0.2440422	total: 15.5s	remaining: 7.52s
91:	learn: 0.2439909	total: 15.7s	remaining: 7.32s
92:	learn: 0.2439323	total: 15.8s	remaining: 7.14s
93:	learn: 0.2438433	total: 15.9s	remaining: 6.96s
94:	learn: 0.2437950	total: 16.1s	remaining: 6.77s
95:	learn: 0.2437059	total: 16.2s	remaining: 6.58s
96:	learn: 0.2436633	total: 16.3s	remaining: 6.4s
97:	learn: 0.2435801	total: 16.5s	remaining: 6.22s
98:	learn: 0.2435162	total: 16.6s	remaining: 6.05s
99:	learn: 0.2434214	total: 16.8s	remaining: 5.88s
100:	learn: 0.2433601	total: 16.9s	remaining: 5.71s
101:	learn: 0.2433146	total: 17.1s	remaining: 5.53s
102:	learn: 0.2432440	total: 17.3s	remaining: 5.37s
103:	learn: 0.2431750	total: 17.5s	remaining: 5.21s
104:	learn: 0.2431027	total: 17.7s	remaining: 5.04s
105:	learn: 0.2429946	total: 17.8s	remaining: 4.87s
106:	learn: 0.2429409	total: 17.9s	remaining: 4.69s
107:	learn: 0.2428753	total: 18.1s	remaining: 4.51s
108:	learn: 0.2427856	total: 18.2s	remaining: 4.34s
109:	learn: 0.2427089	total: 18.3s	remaining: 4.16s
110:	learn: 0.2426494	total: 18.4s	remaining: 3.99s
111:	learn: 0.2425567	total: 18.6s	remaining: 3.82s
112:	learn: 0.2424810	total: 18.8s	remaining: 3.65s
113:	learn: 0.2423885	total: 18.9s	remaining: 3.49s
114:	learn: 0.2423468	total: 19.1s	remaining: 3.33s
115:	learn: 0.2423182	total: 19.3s	remaining: 3.16s
116:	learn: 0.2422620	total: 19.4s	remaining: 2.99s
117:	learn: 0.2421555	total: 19.5s	remaining: 2.81s
118:	learn: 0.2420774	total: 19.7s	remaining: 2.65s
119:	learn: 0.2420280	total: 19.8s	remaining: 2.48s
120:	learn: 0.2419559	total: 20s	remaining: 2.31s
121:	learn: 0.2419029	total: 20.1s	remaining: 2.15s
122:	learn: 0.2418020	total: 20.3s	remaining: 1.98s
123:	learn: 0.2416871	total: 20.4s	remaining: 1.81s
124:	learn: 0.2416060	total: 20.5s	remaining: 1.64s
125:	learn: 0.2415634	total: 20.7s	remaining: 1.48s
126:	learn: 0.2415027	total: 20.9s	remaining: 1.31s
127:	learn: 0.2414403	total: 21s	remaining: 1.15s
128:	learn: 0.2413465	total: 21.1s	remaining: 983ms
129:	learn: 0.2412980	total: 21.3s	remaining: 818ms
130:	learn: 0.2412372	total: 21.4s	remaining: 653ms
131:	learn: 0.2411406	total: 21.5s	remaining: 489ms
132:	learn: 0.2410715	total: 21.7s	remaining: 326ms
133:	learn: 0.2409998	total: 21.8s	remaining: 163ms
134:	learn: 0.2409475	total: 21.9s	remaining: 0us
pr-auc 0.21880762137363477
Wall time: 28.5 s

objects = feature_imp['Feature']
y_pos = np.arange(len(objects))
performance = feature_imp['Value']
plt.figure(figsize=(20,20))
plt.grid(True)
plt.title('Feature importances for CatBoost')
plt.barh(y_pos, performance, align='center', alpha=0.5)
plt.yticks(y_pos, objects)
plt.show()

select = feature_imp.sort_values("Value")[::-1]["Feature"][:30].values
_x_train_lgb = _x_train[select]
_x_test_lgb = _x_test[select]

%%time
# NOTE: this refits LightGBM, not CatBoost, on the 30 features ranked highest by CatBoost.
clf = lgb.LGBMClassifier(max_depth=5, n_estimators=160, learning_rate=0.074)
clf.fit(_x_train_lgb, y_train)
feature_imp = pd.DataFrame(sorted(zip(clf.feature_importances_,_x_train_lgb.columns)), columns=['Value','Feature'])
print("pr-auc", pr_auc(y_test, clf.predict_proba(_x_test_lgb)[:,1]))

pr-auc 0.22260249357719117
Wall time: 4.55 s

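After selection, the quality of LightGBM is practically unchanged (PR AUC from 0.22302 to 0.22301). For CatBoost the score appears to rise slightly (from 0.21881 to 0.22260), but the second figure comes from an LGBMClassifier refitted on the CatBoost-selected features, so it does not measure CatBoost itself.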
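
To put "practically unchanged" into numbers, the relative change of the metric can be computed directly from the values printed above. This is a small worked example; the helper name relative_change is an assumption made here.

```python
def relative_change(before, after):
    """Signed relative change of a metric, as a fraction of the original value."""
    return (after - before) / before


# PR AUC values printed by the LightGBM cells above.
lgbm_all_features = 0.22302033721552292
lgbm_top_41 = 0.22301141951869569

lgbm_change = relative_change(lgbm_all_features, lgbm_top_41)
print(f"LightGBM, all features -> top 41: {lgbm_change:+.4%}")
```

A change this small is well within the spread between folds seen in the tuning tables, so it should not be read as a real difference.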

	name	auc_mean	time_mean	auc_1	auc_2	auc_3	time_1	time_2	time_3
5	lgbm_depth_5	0.230291	5.593335	0.223429	0.239176	0.228270	8.566999	4.053004	4.160001
7	lgbm_depth_7	0.230158	7.535334	0.223212	0.238963	0.228299	7.185001	5.077002	10.343999
9	lgbm_depth_9	0.229801	6.132000	0.222157	0.238988	0.228258	4.259003	3.853998	10.283000
6	ya_depth_7	0.226959	29.459334	0.220238	0.235859	0.224781	28.780998	32.040005	27.556998
3	lgbm_depth_3	0.225972	3.838332	0.219705	0.233778	0.224432	5.414995	3.076999	3.023002
8	ya_depth_9	0.225970	39.175508	0.218561	0.234640	0.224711	33.615525	42.603997	41.307001
4	ya_depth_5	0.225545	23.423001	0.218210	0.233063	0.225363	26.113002	23.535000	20.621000
2	ya_depth_3	0.221001	22.313004	0.214489	0.227418	0.221094	23.228005	25.003004	18.708003
1	lgbm_depth_1	0.209676	2.289332	0.203954	0.214938	0.210136	2.458002	2.238996	2.170998
0	ya_depth_1	0.208214	14.256666	0.202143	0.213642	0.208857	11.011999	11.923000	19.834999

	name	auc_mean	time_mean	auc_1	auc_2	auc_3	time_1	time_2	time_3
3	lgbm_trees_160	0.229816	9.093333	0.223193	0.238417	0.227838	8.981001	7.939996	10.359002
4	lgbm_trees_208	0.228892	14.178001	0.222513	0.237266	0.226896	17.672997	7.059005	17.802001
2	ya_trees_135	0.228842	28.218336	0.221562	0.238308	0.226655	28.609002	28.146004	27.900002
5	lgbm_trees_270	0.227802	12.322666	0.221931	0.235679	0.225797	8.879999	19.197000	8.890999
1	ya_trees_104	0.227117	24.395667	0.220216	0.236105	0.225030	25.467998	23.500001	24.219000
0	ya_trees_80	0.225866	21.336667	0.219667	0.234109	0.223822	18.848001	19.586998	25.575003

	name	auc_mean	time_mean	auc_1	auc_2	auc_3	time_1	time_2	time_3
19	lgbm_rate_0.07430083706879999	0.230610	6.343334	0.223719	0.238573	0.229538	8.356007	5.977113	4.696882
18	lgbm_rate_0.06191736422399999	0.229380	3.762666	0.222366	0.236994	0.228779	3.863997	3.548001	3.876001
20	lgbm_rate_0.08916100448255998	0.229258	4.138690	0.222003	0.237898	0.227873	5.296533	3.645535	3.474001
4	ya_rate_0.14641000000000004	0.228098	18.970668	0.221503	0.236878	0.225914	16.927000	22.295013	17.689990
17	lgbm_rate_0.05159780351999999	0.227895	4.158113	0.221117	0.235593	0.226975	4.400130	3.935571	4.138638
2	ya_rate_0.12100000000000002	0.227721	18.315660	0.221285	0.234925	0.226953	18.229000	19.092980	17.624999
3	ya_rate_0.13310000000000002	0.227720	16.135347	0.221592	0.235467	0.226100	16.002001	15.514999	16.889040
1	ya_rate_0.11000000000000001	0.227653	17.608335	0.220584	0.236274	0.226101	17.459003	17.710998	17.655004
6	ya_rate_0.17715610000000007	0.227413	15.298333	0.220997	0.234458	0.226783	15.132999	15.297000	15.465001
16	lgbm_rate_0.0429981696	0.227347	3.868698	0.220067	0.234728	0.227248	4.092998	3.639997	3.873099
5	ya_rate_0.16105100000000006	0.227260	15.276347	0.220639	0.235530	0.225611	15.016040	15.704001	15.109000
0	ya_rate_0.1	0.226959	17.759002	0.220238	0.235859	0.224781	19.750005	16.958997	16.568003
7	ya_rate_0.1948717100000001	0.225389	16.137670	0.216041	0.235298	0.224828	15.439003	15.184999	17.789006
15	lgbm_rate_0.035831808	0.224815	4.458728	0.217962	0.231457	0.225026	3.899990	4.599999	4.876196
14	lgbm_rate_0.02985984	0.222554	7.380854	0.216032	0.228817	0.222811	5.888035	7.944530	8.309998
13	lgbm_rate_0.0248832	0.220647	4.616668	0.214776	0.226311	0.220854	5.398001	4.208001	4.244002
12	lgbm_rate_0.020736	0.218367	4.615000	0.212683	0.223832	0.218585	4.768999	4.459004	4.616998
11	lgbm_rate_0.01728	0.216139	5.270002	0.210808	0.220951	0.216658	4.735011	5.180998	5.893996
10	lgbm_rate_0.0144	0.213921	5.904333	0.209288	0.218457	0.214017	7.004999	5.497999	5.210001
9	lgbm_rate_0.012	0.211729	5.848511	0.206909	0.216211	0.212066	4.890003	7.043994	5.611537
8	lgbm_rate_0.01	0.209891	5.171331	0.205540	0.214099	0.210033	4.892997	5.100001	5.520996

	name	auc_mean	time_mean	auc_1	auc_2	auc_3	time_1	time_2	time_3
7	lgbm_objective_cross_entropy_is_unbalance=True	0.231029	3.276335	0.225010	0.239833	0.228245	3.272002	3.303002	3.254000
6	lgbm_objective_binary_error_is_unbalance=True	0.231029	3.336332	0.225010	0.239833	0.228245	3.349998	3.188997	3.470000
5	lgbm_objective_binary_logloss_is_unbalance=True	0.231029	3.257665	0.225010	0.239833	0.228245	3.384996	3.169000	3.218999
4	lgbm_objective_is_unbalance=True	0.231029	5.420999	0.225010	0.239833	0.228245	6.117998	5.380999	4.764000
3	lgbm_objective_None	0.230158	3.901331	0.223212	0.238963	0.228299	3.951995	3.689000	4.062997
1	ya_loss_function_Logloss	0.226959	17.127992	0.220238	0.235859	0.224781	17.185999	16.607998	17.589980
0	ya_loss_function_CrossEntropy	0.226959	17.556313	0.220238	0.235859	0.224781	16.852995	17.666000	18.149945
2	ya_loss_function_None	0.225346	11.816336	0.218689	0.232687	0.224663	11.424005	11.717007	12.307996

	name	auc	time
6	lgbm_depth_17	0.226655	4.973000
7	lgbm_depth_19	0.226366	5.048001
5	lgbm_depth_15	0.225366	5.063996
4	lgbm_depth_13	0.225254	5.848997
3	lgbm_depth_11	0.224522	5.578999
1	ya_depth_7	0.222873	21.397999
2	ya_depth_9	0.222511	27.170990
0	ya_depth_5	0.221202	21.255000

	TARGET	NAME_CONTRACT_TYPE	CODE_GENDER	FLAG_OWN_CAR	FLAG_OWN_REALTY	CNT_CHILDREN	AMT_INCOME_TOTAL	AMT_CREDIT	NAME_TYPE_SUITE	NAME_INCOME_TYPE	...	FLAG_DOCUMENT_18	FLAG_DOCUMENT_19	FLAG_DOCUMENT_20	FLAG_DOCUMENT_21	AMT_REQ_CREDIT_BUREAU_HOUR	AMT_REQ_CREDIT_BUREAU_DAY	AMT_REQ_CREDIT_BUREAU_WEEK	AMT_REQ_CREDIT_BUREAU_MON	AMT_REQ_CREDIT_BUREAU_QRT	AMT_REQ_CREDIT_BUREAU_YEAR
SK_ID_CURR
100002	1	0	1	0	1	0	202500.0	406597.5	Unaccompanied	Working	...	0	0	0	0	0.0	0.0	0.0	0.0	0.0	1.0
100003	0	0	0	0	0	0	270000.0	1293502.5	Family	State servant	...	0	0	0	0	0.0	0.0	0.0	0.0	0.0	0.0
100004	0	1	1	1	1	0	67500.0	135000.0	Unaccompanied	Working	...	0	0	0	0	0.0	0.0	0.0	0.0	0.0	0.0
100006	0	0	0	0	1	0	135000.0	312682.5	Unaccompanied	Working	...	0	0	0	0	NaN	NaN	NaN	NaN	NaN	NaN
100007	0	0	1	0	1	0	121500.0	513000.0	Unaccompanied	Working	...	0	0	0	0	0.0	0.0	0.0	0.0	0.0	0.0

	name	auc	time
3	lgbm_trees_160	0.226276	6.691996
4	lgbm_trees_208	0.225808	8.567997
5	lgbm_trees_270	0.225065	10.282003
2	ya_trees_135	0.224759	28.804001
1	ya_trees_104	0.223038	25.267998
0	ya_trees_80	0.221178	22.034606

	name	auc	time
19	lgbm_rate_0.07430083706879999	0.228120	11.166834
18	lgbm_rate_0.06191736422399999	0.228070	62.689053
17	lgbm_rate_0.05159780351999999	0.226134	20.864165
16	lgbm_rate_0.0429981696	0.225639	18.839656
2	ya_rate_0.12100000000000002	0.225587	31.980033
20	lgbm_rate_0.08916100448255998	0.225213	11.447431
15	lgbm_rate_0.035831808	0.224927	13.206527
0	ya_rate_0.1	0.224669	31.208994
3	ya_rate_0.13310000000000002	0.224383	33.362998
1	ya_rate_0.11000000000000001	0.223548	31.609998
4	ya_rate_0.14641000000000004	0.223369	33.700764
5	ya_rate_0.16105100000000006	0.223367	35.113729
7	ya_rate_0.1948717100000001	0.222927	38.494523
6	ya_rate_0.17715610000000007	0.222734	34.553633
14	lgbm_rate_0.02985984	0.222442	13.770530
13	lgbm_rate_0.0248832	0.221124	11.955902
12	lgbm_rate_0.020736	0.219033	14.907905
11	lgbm_rate_0.01728	0.216863	15.059998
10	lgbm_rate_0.0144	0.214668	14.947731
9	lgbm_rate_0.012	0.212405	11.160001
8	lgbm_rate_0.01	0.210092	12.293635

	name	auc	time
5	lgbm_depth_7	0.226819	4.406003
4	lgbm_depth_5	0.225525	4.298995
6	lgbm_depth_9	0.224344	4.514000
0	ya_depth_8	0.222469	19.968033
1	ya_depth_10	0.221058	34.812993
2	ya_depth_12	0.213479	69.627002
3	ya_depth_14	0.203703	187.957524

	name	auc	time
6	lgbm_trees_160	0.226953	7.666000
7	lgbm_trees_208	0.226483	10.101635
8	lgbm_trees_270	0.225873	8.554997
5	ya_trees_297	0.223792	58.516913
4	ya_trees_228	0.223316	42.901617
3	ya_trees_175	0.223121	32.979997
2	ya_trees_135	0.223056	25.090998
1	ya_trees_104	0.222755	20.033993
0	ya_trees_80	0.220863	17.984996

	name	auc	time
18	lgbm_rate_0.06191736422399999	0.228050	9.101934
19	lgbm_rate_0.07430083706879999	0.227807	6.801994
17	lgbm_rate_0.05159780351999999	0.227005	7.220998
16	lgbm_rate_0.0429981696	0.225853	7.941535
20	lgbm_rate_0.08916100448255998	0.225570	6.334002
15	lgbm_rate_0.035831808	0.225211	7.306051
0	ya_rate_0.1	0.223777	59.692731
3	ya_rate_0.13310000000000002	0.223437	65.876862
14	lgbm_rate_0.02985984	0.223329	7.364002
1	ya_rate_0.11000000000000001	0.222403	56.800084
13	lgbm_rate_0.0248832	0.222137	9.043053
2	ya_rate_0.12100000000000002	0.220917	66.717576
12	lgbm_rate_0.020736	0.219990	8.970075
4	ya_rate_0.14641000000000004	0.218410	62.318781
11	lgbm_rate_0.01728	0.218278	20.330832
10	lgbm_rate_0.0144	0.216346	17.754997
6	ya_rate_0.17715610000000007	0.215640	59.077204
5	ya_rate_0.16105100000000006	0.215003	55.292202
9	lgbm_rate_0.012	0.214070	8.715777
7	ya_rate_0.1948717100000001	0.212816	85.972667
8	lgbm_rate_0.01	0.211875	10.011747

	cat_OHE	lgbm_OHE	cat_SC	lgbm_SC
0	0.068491	0.048327	0.057223	0.055038
1	0.018775	0.026063	0.016626	0.028834
2	0.041735	0.042472	0.042970	0.034474
3	0.024000	0.027915	0.016231	0.026459
4	0.039678	0.040463	0.030689	0.070291

	cat_OHE	lgbm_OHE	cat_SC	lgbm_SC
0	0.168339	0.229026	0.146101	0.199716
1	0.158058	0.167274	0.285554	0.146087
2	0.056792	0.051607	0.074889	0.047308
3	0.152083	0.163874	0.198038	0.183191
4	0.080180	0.077358	0.059342	0.053366

	name	auc
3	lgbm_SC	0.230242
2	cat_SC	0.230041
0	cat_OHE	0.228619
1	lgbm_OHE	0.227083
4	blend	0.224872