Введение в анализ данных, ИАД-1

НИУ ВШЭ, 2018-19 учебный год

Домашнее задание №4. Градиентный бустинг своими руками

Задание выполнил(а): Подчезерцев Алексей

Общая информация

Дата выдачи: 27.04.2019

Дедлайн: 23:59 12.05.2019

Оценивание и штрафы

За сдачу задания позже срока на итоговую оценку за задание накладывается штраф в размере 1 балл в день, но получить отрицательную оценку нельзя.

Внимание! Домашнее задание выполняется самостоятельно. «Похожие» решения считаются плагиатом и все задействованные студенты (в том числе те, у кого списали) не могут получить за него больше 0 баллов.

Формат сдачи

Стирать условия нельзя!

Загрузка файлов с решениями происходит в системе Anytask.

Формат названия файла: homework_04_Подчезерцев_Алексей.ipynb

Задание 1. (0.5 балла)

Мы будем использовать данные соревнования Home Credit Default Risk.

  • Загрузите таблицу application_train.csv;
  • Запишите в Y столбец с целевой переменной;
  • Удалите ненужные столбцы (для этого воспользуйтесь описанием);
  • Определите тип столбцов и заполните пропуски - стратегия произвольная;
  • Разбейте выборку в соотношении 70:30 с random_state=0.

Так как в данных значительный дисбаланс классов, в качестве метрики качества везде будем использовать площадь под precision-recall кривой.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
In [2]:
%%time
X_orig = pd.read_csv('application_train.csv', index_col=0)
Wall time: 16.7 s
In [3]:
%%time
X = X_orig.copy()
Wall time: 171 ms

Приведем в числовой вид бинарные признаки

In [4]:
%%time
X["CODE_GENDER"] = X.apply(lambda x: 1 if x["CODE_GENDER"] == 'M' else 0, axis=1)
X["FLAG_OWN_CAR"] = X.apply(lambda x: 1 if x["FLAG_OWN_CAR"] == 'Y' else 0, axis=1)
X["FLAG_OWN_REALTY"] = X.apply(lambda x: 1 if x["FLAG_OWN_REALTY"] == 'Y' else 0, axis=1)
X["NAME_CONTRACT_TYPE"] = X.apply(lambda x: 1 if x["NAME_CONTRACT_TYPE"] == 'Revolving loans' else 0, axis=1)
#X = X.drop("NAME_TYPE_SUITE", axis=1)
Wall time: 23.2 s
In [9]:
def plot_corr(D, size):
    corr = D.corr()
    corr = np.abs(corr)
    f, ax = plt.subplots(figsize=(size, size))
    cmap = plt.cm.Oranges
    sns.heatmap(corr, cmap=cmap,
            xticklabels=corr.columns,
            yticklabels=corr.columns)
In [157]:
%%time
plot_corr(X, 60)
Wall time: 20.2 s

Метрики дома сильно коррелируют друг с другом, избавимся от них

In [5]:
%%time
#
X.drop("APARTMENTS_AVG", axis=1, inplace=True)
X.drop("BASEMENTAREA_AVG", axis=1, inplace=True)
# X.drop("YEARS_BEGINEXPLUATATION_AVG", axis=1, inplace=True)
X.drop("YEARS_BUILD_AVG", axis=1, inplace=True)
X.drop("COMMONAREA_AVG", axis=1, inplace=True)
X.drop("ELEVATORS_AVG", axis=1, inplace=True)
X.drop("ENTRANCES_AVG", axis=1, inplace=True)
X.drop("FLOORSMAX_AVG", axis=1, inplace=True)
X.drop("FLOORSMIN_AVG", axis=1, inplace=True)
X.drop("LANDAREA_AVG", axis=1, inplace=True)
X.drop("LIVINGAPARTMENTS_AVG", axis=1, inplace=True)
# X.drop("LIVINGAREA_AVG", axis=1, inplace=True)
X.drop("NONLIVINGAPARTMENTS_AVG", axis=1, inplace=True)
X.drop("NONLIVINGAREA_AVG", axis=1, inplace=True)
X.drop("APARTMENTS_MODE", axis=1, inplace=True)
X.drop("BASEMENTAREA_MODE", axis=1, inplace=True)
X.drop("YEARS_BEGINEXPLUATATION_MODE", axis=1, inplace=True)
X.drop("YEARS_BUILD_MODE", axis=1, inplace=True)
X.drop("COMMONAREA_MODE", axis=1, inplace=True)
X.drop("ELEVATORS_MODE", axis=1, inplace=True)
X.drop("ENTRANCES_MODE", axis=1, inplace=True)
X.drop("FLOORSMAX_MODE", axis=1, inplace=True)
X.drop("FLOORSMIN_MODE", axis=1, inplace=True)
X.drop("LANDAREA_MODE", axis=1, inplace=True)
X.drop("LIVINGAPARTMENTS_MODE", axis=1, inplace=True)
X.drop("LIVINGAREA_MODE", axis=1, inplace=True)
X.drop("NONLIVINGAPARTMENTS_MODE", axis=1, inplace=True)
X.drop("NONLIVINGAREA_MODE", axis=1, inplace=True)
X.drop("APARTMENTS_MEDI", axis=1, inplace=True)
X.drop("BASEMENTAREA_MEDI", axis=1, inplace=True)
X.drop("YEARS_BEGINEXPLUATATION_MEDI", axis=1, inplace=True)
X.drop("YEARS_BUILD_MEDI", axis=1, inplace=True)
X.drop("COMMONAREA_MEDI", axis=1, inplace=True)
X.drop("ELEVATORS_MEDI", axis=1, inplace=True)
X.drop("ENTRANCES_MEDI", axis=1, inplace=True)
X.drop("FLOORSMAX_MEDI", axis=1, inplace=True)
X.drop("FLOORSMIN_MEDI", axis=1, inplace=True)
X.drop("LANDAREA_MEDI", axis=1, inplace=True)
X.drop("LIVINGAPARTMENTS_MEDI", axis=1, inplace=True)
X.drop("LIVINGAREA_MEDI", axis=1, inplace=True)
X.drop("NONLIVINGAPARTMENTS_MEDI", axis=1, inplace=True)
X.drop("NONLIVINGAREA_MEDI", axis=1, inplace=True)
X.drop("FONDKAPREMONT_MODE", axis=1, inplace=True)
X.drop("HOUSETYPE_MODE", axis=1, inplace=True)
X.drop("TOTALAREA_MODE", axis=1, inplace=True)
X.drop("WALLSMATERIAL_MODE", axis=1, inplace=True)
X.drop("EMERGENCYSTATE_MODE", axis=1, inplace=True)
Wall time: 8.33 s

Дропнем некоторые хорошо взаимокоррелируемые значения

In [6]:
X.drop("AMT_ANNUITY", axis=1, inplace=True)
X.drop("AMT_GOODS_PRICE", axis=1, inplace=True)
X.drop("DAYS_EMPLOYED", axis=1, inplace=True)
X.drop("REGION_RATING_CLIENT_W_CITY", axis=1, inplace=True)
X.drop("LIVE_CITY_NOT_WORK_CITY", axis=1, inplace=True)
X.drop("LIVE_REGION_NOT_WORK_REGION", axis=1, inplace=True)
X.drop("OBS_60_CNT_SOCIAL_CIRCLE", axis=1, inplace=True)
X.drop("DEF_60_CNT_SOCIAL_CIRCLE", axis=1, inplace=True)
In [163]:
%%time
plot_corr(X, 40)
Wall time: 4.43 s
In [164]:
X.head()
Out[164]:
TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT NAME_TYPE_SUITE NAME_INCOME_TYPE ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
SK_ID_CURR
100002 1 0 1 0 1 0 202500.0 406597.5 Unaccompanied Working ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
100003 0 0 0 0 0 0 270000.0 1293502.5 Family State servant ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
100004 0 1 1 1 1 0 67500.0 135000.0 Unaccompanied Working ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
100006 0 0 0 0 1 0 135000.0 312682.5 Unaccompanied Working ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
100007 0 0 1 0 1 0 121500.0 513000.0 Unaccompanied Working ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 68 columns

In [165]:
_x_null=X.isnull().sum(axis=0)
_x_null = _x_null[_x_null != 0]
_x_null.shape
Out[165]:
(18,)
In [166]:
_x_null[_x_null != 0]
Out[166]:
NAME_TYPE_SUITE                  1292
OWN_CAR_AGE                    202929
OCCUPATION_TYPE                 96391
CNT_FAM_MEMBERS                     2
EXT_SOURCE_1                   173378
EXT_SOURCE_2                      660
EXT_SOURCE_3                    60965
YEARS_BEGINEXPLUATATION_AVG    150007
LIVINGAREA_AVG                 154350
OBS_30_CNT_SOCIAL_CIRCLE         1021
DEF_30_CNT_SOCIAL_CIRCLE         1021
DAYS_LAST_PHONE_CHANGE              1
AMT_REQ_CREDIT_BUREAU_HOUR      41519
AMT_REQ_CREDIT_BUREAU_DAY       41519
AMT_REQ_CREDIT_BUREAU_WEEK      41519
AMT_REQ_CREDIT_BUREAU_MON       41519
AMT_REQ_CREDIT_BUREAU_QRT       41519
AMT_REQ_CREDIT_BUREAU_YEAR      41519
dtype: int64
In [7]:
X["OWN_CAR_AGE"].fillna(0, inplace=True)
X["OCCUPATION_TYPE"].fillna('', inplace=True)
X["EXT_SOURCE_1"].fillna(0.5, inplace=True)
X["EXT_SOURCE_2"].fillna(0.5, inplace=True)
X["EXT_SOURCE_3"].fillna(0.5, inplace=True)
X["YEARS_BEGINEXPLUATATION_AVG"].fillna(0.5, inplace=True)
X["LIVINGAREA_AVG"].fillna(0.5, inplace=True)

X["AMT_REQ_CREDIT_BUREAU_HOUR"].fillna(0, inplace=True)
X["AMT_REQ_CREDIT_BUREAU_DAY"].fillna(0, inplace=True)
X["AMT_REQ_CREDIT_BUREAU_WEEK"].fillna(0, inplace=True)
X["AMT_REQ_CREDIT_BUREAU_MON"].fillna(0, inplace=True)
X["AMT_REQ_CREDIT_BUREAU_QRT"].fillna(0, inplace=True)
X["AMT_REQ_CREDIT_BUREAU_YEAR"].fillna(0, inplace=True)
X.head()
Out[7]:
TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT NAME_TYPE_SUITE NAME_INCOME_TYPE ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
SK_ID_CURR
100002 1 0 1 0 1 0 202500.0 406597.5 Unaccompanied Working ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
100003 0 0 0 0 0 0 270000.0 1293502.5 Family State servant ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
100004 0 1 1 1 1 0 67500.0 135000.0 Unaccompanied Working ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
100006 0 0 0 0 1 0 135000.0 312682.5 Unaccompanied Working ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
100007 0 0 1 0 1 0 121500.0 513000.0 Unaccompanied Working ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 68 columns

In [253]:
_x_null=X.isnull().sum(axis=0)
_x_null = _x_null[_x_null != 0]
_x_null.shape
Out[253]:
(5,)
In [254]:
_x_null[_x_null != 0]
Out[254]:
NAME_TYPE_SUITE             1292
CNT_FAM_MEMBERS                2
OBS_30_CNT_SOCIAL_CIRCLE    1021
DEF_30_CNT_SOCIAL_CIRCLE    1021
DAYS_LAST_PHONE_CHANGE         1
dtype: int64
In [8]:
X.dropna(axis = 0, inplace=True)
In [9]:
X.shape
Out[9]:
(305197, 68)
In [10]:
X.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 305197 entries, 100002 to 456255
Data columns (total 68 columns):
TARGET                         305197 non-null int64
NAME_CONTRACT_TYPE             305197 non-null int64
CODE_GENDER                    305197 non-null int64
FLAG_OWN_CAR                   305197 non-null int64
FLAG_OWN_REALTY                305197 non-null int64
CNT_CHILDREN                   305197 non-null int64
AMT_INCOME_TOTAL               305197 non-null float64
AMT_CREDIT                     305197 non-null float64
NAME_TYPE_SUITE                305197 non-null object
NAME_INCOME_TYPE               305197 non-null object
NAME_EDUCATION_TYPE            305197 non-null object
NAME_FAMILY_STATUS             305197 non-null object
NAME_HOUSING_TYPE              305197 non-null object
REGION_POPULATION_RELATIVE     305197 non-null float64
DAYS_BIRTH                     305197 non-null int64
DAYS_REGISTRATION              305197 non-null float64
DAYS_ID_PUBLISH                305197 non-null int64
OWN_CAR_AGE                    305197 non-null float64
FLAG_MOBIL                     305197 non-null int64
FLAG_EMP_PHONE                 305197 non-null int64
FLAG_WORK_PHONE                305197 non-null int64
FLAG_CONT_MOBILE               305197 non-null int64
FLAG_PHONE                     305197 non-null int64
FLAG_EMAIL                     305197 non-null int64
OCCUPATION_TYPE                305197 non-null object
CNT_FAM_MEMBERS                305197 non-null float64
REGION_RATING_CLIENT           305197 non-null int64
WEEKDAY_APPR_PROCESS_START     305197 non-null object
HOUR_APPR_PROCESS_START        305197 non-null int64
REG_REGION_NOT_LIVE_REGION     305197 non-null int64
REG_REGION_NOT_WORK_REGION     305197 non-null int64
REG_CITY_NOT_LIVE_CITY         305197 non-null int64
REG_CITY_NOT_WORK_CITY         305197 non-null int64
ORGANIZATION_TYPE              305197 non-null object
EXT_SOURCE_1                   305197 non-null float64
EXT_SOURCE_2                   305197 non-null float64
EXT_SOURCE_3                   305197 non-null float64
YEARS_BEGINEXPLUATATION_AVG    305197 non-null float64
LIVINGAREA_AVG                 305197 non-null float64
OBS_30_CNT_SOCIAL_CIRCLE       305197 non-null float64
DEF_30_CNT_SOCIAL_CIRCLE       305197 non-null float64
DAYS_LAST_PHONE_CHANGE         305197 non-null float64
FLAG_DOCUMENT_2                305197 non-null int64
FLAG_DOCUMENT_3                305197 non-null int64
FLAG_DOCUMENT_4                305197 non-null int64
FLAG_DOCUMENT_5                305197 non-null int64
FLAG_DOCUMENT_6                305197 non-null int64
FLAG_DOCUMENT_7                305197 non-null int64
FLAG_DOCUMENT_8                305197 non-null int64
FLAG_DOCUMENT_9                305197 non-null int64
FLAG_DOCUMENT_10               305197 non-null int64
FLAG_DOCUMENT_11               305197 non-null int64
FLAG_DOCUMENT_12               305197 non-null int64
FLAG_DOCUMENT_13               305197 non-null int64
FLAG_DOCUMENT_14               305197 non-null int64
FLAG_DOCUMENT_15               305197 non-null int64
FLAG_DOCUMENT_16               305197 non-null int64
FLAG_DOCUMENT_17               305197 non-null int64
FLAG_DOCUMENT_18               305197 non-null int64
FLAG_DOCUMENT_19               305197 non-null int64
FLAG_DOCUMENT_20               305197 non-null int64
FLAG_DOCUMENT_21               305197 non-null int64
AMT_REQ_CREDIT_BUREAU_HOUR     305197 non-null float64
AMT_REQ_CREDIT_BUREAU_DAY      305197 non-null float64
AMT_REQ_CREDIT_BUREAU_WEEK     305197 non-null float64
AMT_REQ_CREDIT_BUREAU_MON      305197 non-null float64
AMT_REQ_CREDIT_BUREAU_QRT      305197 non-null float64
AMT_REQ_CREDIT_BUREAU_YEAR     305197 non-null float64
dtypes: float64(20), int64(40), object(8)
memory usage: 160.7+ MB
In [11]:
X = X.reset_index(drop=True)
In [12]:
y = X["TARGET"]
X = X.drop("TARGET", axis=1)
In [13]:
X.head()
Out[13]:
NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 0 1 0 1 0 202500.0 406597.5 Unaccompanied Working Secondary / secondary special ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 0 0 0 0 0 270000.0 1293502.5 Family State servant Higher education ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 1 1 1 1 0 67500.0 135000.0 Unaccompanied Working Secondary / secondary special ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 0 0 0 1 0 135000.0 312682.5 Unaccompanied Working Secondary / secondary special ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
4 0 1 0 1 0 121500.0 513000.0 Unaccompanied Working Secondary / secondary special ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 67 columns

In [14]:
from sklearn.model_selection import train_test_split
In [94]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

Задание 2. (1.5 балла)

Обучите реализации градиентного бустинга LightGBM и Catboost на вещественных признаках без подбора параметров. Почему получилась заметная разница в качестве?

В этом и последующих экспериментах необходимо измерять время обучения моделей.

In [16]:
from catboost import CatBoostClassifier 
import lightgbm as lgb
from sklearn.metrics import precision_recall_curve, auc
In [17]:
def pr_auc(y_true, y_scores):
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    return auc(recall, precision)
In [581]:
X_train_2 = X_train.select_dtypes(include=["int64", "float64"])
X_test_2 = X_test.select_dtypes(include=["int64", "float64"])
In [571]:
%%time
yandex_clf = CatBoostClassifier(logging_level='Silent')
yandex_clf.fit(X_train_2, y_train)
Wall time: 2min 35s
In [572]:
y_predict_2 = yandex_clf.predict_proba(X_test_2)[:,1]
print("PR-AUC for default yandex boost", pr_auc(y_test, y_predict_2))
PR-AUC for default yandex boost 0.2219287676589921
In [582]:
%%time
l_clf = lgb.LGBMClassifier()
l_clf.fit(X_train_2, y_train)
Wall time: 5.04 s
In [583]:
y_predict_2 = l_clf.predict_proba(X_test_2)[:,1]
print("PR-AUC for default lgb", pr_auc(y_test, y_predict_2))
PR-AUC for default lgb 0.22241316543186374

Алгоритмы catboost и lightgbm имеют разные дефолтные параметры, поэтому отличаются метрики качества и время работы.

Задание 3. (2 балла)

Подберите с CV=3 оптимальные параметры алгоритмов, изменяя:

  • глубину деревьев;
  • количество деревьев;
  • темп обучения;
  • оптимизируемый функционал.

Проанализируйте соотношения глубины и количества деревьев в зависимости от алгоритма.

In [18]:
from sklearn.model_selection import KFold
from time import time
CV = 3
kf = KFold(n_splits=CV,shuffle=True, random_state=0)
In [25]:
X_3 = X.select_dtypes(include=["int64", "float64"])
In [23]:
def calc_CV_classifier(kf, x, y, classifier):
    _auc = []
    _times = []
    for train_index, test_index in kf.split(x):
        begin = time()
        classifier.fit(x.values[train_index], y[train_index])
        _times.append(time() - begin)
        y_predict = classifier.predict_proba(x.values[test_index])[:,1]
        _auc.append(pr_auc(y[test_index], y_predict))
    d = {}
    d["auc_mean"] = [sum(_auc)/len(_auc)]
    d["time_mean"] = [sum(_times)/len(_times)] 
    for i in range(len(_auc)):
        d["auc_" + str(i + 1)] = [_auc[i]]
    for i in range(len(_times)):   
        d["time_" + str(i + 1)] = [_times[i] ]
    return pd.DataFrame(data=d)
In [519]:
%%time
max_trees = 100
learning_rate = 0.1
result_depth = pd.DataFrame()
for depth in range(1, 10, 2):
    ya_clf = CatBoostClassifier(logging_level='Silent', depth=depth, iterations=max_trees, learning_rate=learning_rate)
    ya_result = calc_CV_classifier(kf, X_3, y, ya_clf)
    ya_result.insert(0, "name", ["ya_depth_" + str(depth)])
    result_depth = result_depth.append(ya_result, ignore_index=True)
  
    l_clf = lgb.LGBMClassifier(max_depth=depth, n_estimators=max_trees, learning_rate=learning_rate)
    l_result = calc_CV_classifier(kf, X_3, y, l_clf)
    l_result.insert(0, "name", ["lgbm_depth_" + str(depth)])
    result_depth = result_depth.append(l_result, ignore_index=True)   

result_depth.set_index('name')
Wall time: 8min 24s
In [520]:
result_depth.sort_values("auc_mean")[::-1]
Out[520]:
name auc_mean time_mean auc_1 auc_2 auc_3 time_1 time_2 time_3
5 lgbm_depth_5 0.230291 5.593335 0.223429 0.239176 0.228270 8.566999 4.053004 4.160001
7 lgbm_depth_7 0.230158 7.535334 0.223212 0.238963 0.228299 7.185001 5.077002 10.343999
9 lgbm_depth_9 0.229801 6.132000 0.222157 0.238988 0.228258 4.259003 3.853998 10.283000
6 ya_depth_7 0.226959 29.459334 0.220238 0.235859 0.224781 28.780998 32.040005 27.556998
3 lgbm_depth_3 0.225972 3.838332 0.219705 0.233778 0.224432 5.414995 3.076999 3.023002
8 ya_depth_9 0.225970 39.175508 0.218561 0.234640 0.224711 33.615525 42.603997 41.307001
4 ya_depth_5 0.225545 23.423001 0.218210 0.233063 0.225363 26.113002 23.535000 20.621000
2 ya_depth_3 0.221001 22.313004 0.214489 0.227418 0.221094 23.228005 25.003004 18.708003
1 lgbm_depth_1 0.209676 2.289332 0.203954 0.214938 0.210136 2.458002 2.238996 2.170998
0 ya_depth_1 0.208214 14.256666 0.202143 0.213642 0.208857 11.011999 11.923000 19.834999
In [19]:
def log_int_iterator(start, end, step):
    i = start
    while i <= end:
        yield int(i)
        i *= step
In [522]:
%%time
depth = 7
learning_rate = 0.1
result_trees = pd.DataFrame()
for max_trees in log_int_iterator(80, 160, 1.3):
    ya_clf = CatBoostClassifier(logging_level='Silent', depth=depth, iterations=max_trees, learning_rate=learning_rate)
    ya_result = calc_CV_classifier(kf, X_3, y, ya_clf)
    ya_result.insert(0, "name", ["ya_trees_" + str(max_trees)])
    result_trees = result_trees.append(ya_result, ignore_index=True)
for max_trees in log_int_iterator(160, 320, 1.3):  
    l_clf = lgb.LGBMClassifier(max_depth=depth, n_estimators=max_trees, learning_rate=learning_rate)
    l_result = calc_CV_classifier(kf, X_3, y, l_clf)
    l_result.insert(0, "name", ["lgbm_trees_" + str(max_trees)])
    result_trees = result_trees.append(l_result, ignore_index=True)   

result_trees.set_index('name')
Wall time: 6min
In [523]:
result_trees.sort_values("auc_mean")[::-1]
Out[523]:
name auc_mean time_mean auc_1 auc_2 auc_3 time_1 time_2 time_3
3 lgbm_trees_160 0.229816 9.093333 0.223193 0.238417 0.227838 8.981001 7.939996 10.359002
4 lgbm_trees_208 0.228892 14.178001 0.222513 0.237266 0.226896 17.672997 7.059005 17.802001
2 ya_trees_135 0.228842 28.218336 0.221562 0.238308 0.226655 28.609002 28.146004 27.900002
5 lgbm_trees_270 0.227802 12.322666 0.221931 0.235679 0.225797 8.879999 19.197000 8.890999
1 ya_trees_104 0.227117 24.395667 0.220216 0.236105 0.225030 25.467998 23.500001 24.219000
0 ya_trees_80 0.225866 21.336667 0.219667 0.234109 0.223822 18.848001 19.586998 25.575003
In [20]:
def log_iterator(start, end, step):
    i = start
    while i <= end:
        yield i
        i *= step
In [26]:
%%time
depth = 7
max_trees = 100
result_rate = pd.DataFrame()
for learning_rate in log_iterator(0.1, 0.2, 1.1):
    ya_clf = CatBoostClassifier(logging_level='Silent', depth=depth, iterations=max_trees, learning_rate=learning_rate)
    ya_result = calc_CV_classifier(kf, X_3, y, ya_clf)
    ya_result.insert(0, "name", ["ya_rate_" + str(learning_rate)])
    result_rate = result_rate.append(ya_result, ignore_index=True)
for learning_rate in log_iterator(0.01, 0.1, 1.2):  
    l_clf = lgb.LGBMClassifier(max_depth=depth, n_estimators=max_trees, learning_rate=learning_rate)
    l_result = calc_CV_classifier(kf, X_3, y, l_clf)
    l_result.insert(0, "name", ["lgbm_rate_" + str(learning_rate)])
    result_rate = result_rate.append(l_result, ignore_index=True)   

result_rate.set_index('name')
Wall time: 11min 1s
In [27]:
result_rate.sort_values("auc_mean")[::-1]
Out[27]:
name auc_mean time_mean auc_1 auc_2 auc_3 time_1 time_2 time_3
19 lgbm_rate_0.07430083706879999 0.230610 6.343334 0.223719 0.238573 0.229538 8.356007 5.977113 4.696882
18 lgbm_rate_0.06191736422399999 0.229380 3.762666 0.222366 0.236994 0.228779 3.863997 3.548001 3.876001
20 lgbm_rate_0.08916100448255998 0.229258 4.138690 0.222003 0.237898 0.227873 5.296533 3.645535 3.474001
4 ya_rate_0.14641000000000004 0.228098 18.970668 0.221503 0.236878 0.225914 16.927000 22.295013 17.689990
17 lgbm_rate_0.05159780351999999 0.227895 4.158113 0.221117 0.235593 0.226975 4.400130 3.935571 4.138638
2 ya_rate_0.12100000000000002 0.227721 18.315660 0.221285 0.234925 0.226953 18.229000 19.092980 17.624999
3 ya_rate_0.13310000000000002 0.227720 16.135347 0.221592 0.235467 0.226100 16.002001 15.514999 16.889040
1 ya_rate_0.11000000000000001 0.227653 17.608335 0.220584 0.236274 0.226101 17.459003 17.710998 17.655004
6 ya_rate_0.17715610000000007 0.227413 15.298333 0.220997 0.234458 0.226783 15.132999 15.297000 15.465001
16 lgbm_rate_0.0429981696 0.227347 3.868698 0.220067 0.234728 0.227248 4.092998 3.639997 3.873099
5 ya_rate_0.16105100000000006 0.227260 15.276347 0.220639 0.235530 0.225611 15.016040 15.704001 15.109000
0 ya_rate_0.1 0.226959 17.759002 0.220238 0.235859 0.224781 19.750005 16.958997 16.568003
7 ya_rate_0.1948717100000001 0.225389 16.137670 0.216041 0.235298 0.224828 15.439003 15.184999 17.789006
15 lgbm_rate_0.035831808 0.224815 4.458728 0.217962 0.231457 0.225026 3.899990 4.599999 4.876196
14 lgbm_rate_0.02985984 0.222554 7.380854 0.216032 0.228817 0.222811 5.888035 7.944530 8.309998
13 lgbm_rate_0.0248832 0.220647 4.616668 0.214776 0.226311 0.220854 5.398001 4.208001 4.244002
12 lgbm_rate_0.020736 0.218367 4.615000 0.212683 0.223832 0.218585 4.768999 4.459004 4.616998
11 lgbm_rate_0.01728 0.216139 5.270002 0.210808 0.220951 0.216658 4.735011 5.180998 5.893996
10 lgbm_rate_0.0144 0.213921 5.904333 0.209288 0.218457 0.214017 7.004999 5.497999 5.210001
9 lgbm_rate_0.012 0.211729 5.848511 0.206909 0.216211 0.212066 4.890003 7.043994 5.611537
8 lgbm_rate_0.01 0.209891 5.171331 0.205540 0.214099 0.210033 4.892997 5.100001 5.520996
In [560]:
%%time
depth = 7
max_trees = 100
learning_rate=0.1
result_obj = pd.DataFrame()
for loss_function in ["CrossEntropy", "Logloss", None]:
    ya_clf = CatBoostClassifier(logging_level='Silent', depth=depth, iterations=max_trees, learning_rate=learning_rate, loss_function=loss_function)
    ya_result = calc_CV_classifier(kf, X_3, y, ya_clf)
    ya_result.insert(0, "name", ["ya_loss_function_" + str(loss_function)])
    result_obj = result_obj.append(ya_result, ignore_index=True)
Wall time: 2min 31s
In [561]:
%%time
result_obj_2 = result_obj.copy()

l_clf = lgb.LGBMClassifier(max_depth=depth, n_estimators=max_trees, learning_rate=learning_rate)
l_result = calc_CV_classifier(kf, X_3, y, l_clf)
l_result.insert(0, "name", ["lgbm_objective_None"])
result_obj_2 = result_obj_2.append(l_result, ignore_index=True)   

l_clf = lgb.LGBMClassifier(max_depth=depth, n_estimators=max_trees, learning_rate=learning_rate, is_unbalance=True)
l_result = calc_CV_classifier(kf, X_3, y, l_clf)
l_result.insert(0, "name", ["lgbm_objective_is_unbalance=True"])
result_obj_2 = result_obj_2.append(l_result, ignore_index=True)   

result_obj_2.set_index('name')
Wall time: 32.7 s
In [565]:
%%time
result_obj_3 = result_obj_2.copy()
for metric in ["binary_logloss", "binary_error", "cross_entropy"]:  
    l_clf = lgb.LGBMClassifier(max_depth=depth, n_estimators=max_trees, learning_rate=learning_rate, is_unbalance=True, metric=metric)
    l_result = calc_CV_classifier(kf, X_3, y, l_clf)
    l_result.insert(0, "name", ["lgbm_objective_" + str(metric) + "_is_unbalance=True"])
    result_obj_3 = result_obj_3.append(l_result, ignore_index=True)   
Wall time: 35.3 s
In [566]:
result_obj_3.sort_values("auc_mean")[::-1]
Out[566]:
name auc_mean time_mean auc_1 auc_2 auc_3 time_1 time_2 time_3
7 lgbm_objective_cross_entropy_is_unbalance=True 0.231029 3.276335 0.225010 0.239833 0.228245 3.272002 3.303002 3.254000
6 lgbm_objective_binary_error_is_unbalance=True 0.231029 3.336332 0.225010 0.239833 0.228245 3.349998 3.188997 3.470000
5 lgbm_objective_binary_logloss_is_unbalance=True 0.231029 3.257665 0.225010 0.239833 0.228245 3.384996 3.169000 3.218999
4 lgbm_objective_is_unbalance=True 0.231029 5.420999 0.225010 0.239833 0.228245 6.117998 5.380999 4.764000
3 lgbm_objective_None 0.230158 3.901331 0.223212 0.238963 0.228299 3.951995 3.689000 4.062997
1 ya_loss_function_Logloss 0.226959 17.127992 0.220238 0.235859 0.224781 17.185999 16.607998 17.589980
0 ya_loss_function_CrossEntropy 0.226959 17.556313 0.220238 0.235859 0.224781 16.852995 17.666000 18.149945
2 ya_loss_function_None 0.225346 11.816336 0.218689 0.232687 0.224663 11.424005 11.717007 12.307996

Чем больше количество деревьев и глубина построения, тем дольше работает алгоритм. До некоторого числа растет качество моделей, затем оно начинает падать

Задание 4. (3.5 балла)

Добавьте категориальные признаки к вещественным следующими способами:

  • как OHE признаки;
  • как счетчики со сглаживанием.

При подсчете счетчиков запрещается использование циклов.

На получившихся датасетах подберите параметры у каждого из алгоритмов. Как меняется время, необходимое для обучения модели в зависимости от способа кодирования? Сравните полученные результаты с встроенными методами обработки категориальных признаков.

In [21]:
 
In [54]:
def calc_classifier(x_train, y_train, x_test, y_test, classifier):
    begin = time()
    classifier.fit(x_train, y_train)
    _times = time() - begin
    y_predict = classifier.predict_proba(x_test)[:,1]
    _auc = pr_auc(y_test, y_predict)
    d = {}
    d["auc"] = [_auc]
    d["time"] = [_times]
    return pd.DataFrame(data=d)
In [182]:
X_train_OHE = X_train.copy()
X_test_OHE = X_test.copy()
y_train_OHE = y_train.copy()
y_test_OHE = y_test.copy()
In [183]:
%%time
for i in X_train_OHE:
    if X_train_OHE[i].dtype == 'object':
        X_train_OHE = pd.concat([X_train_OHE, pd.get_dummies(X_train_OHE[i], prefix = i)], axis=1)       
        X_test_OHE = pd.concat([X_test_OHE, pd.get_dummies(X_test_OHE[i], prefix = i)], axis=1) 
X_train_OHE.head()
Wall time: 3.21 s
In [184]:
X_train_OHE = X_train_OHE.select_dtypes(include=["int64", "float64", "uint8"])
X_test_OHE = X_test_OHE.select_dtypes(include=["int64", "float64", "uint8"])
In [67]:
%%time
max_trees = 100
learning_rate = 0.1
result_depth_OHE = pd.DataFrame()
for depth in range(5, 10, 2):
    ya_result = calc_classifier(X_train_OHE, y_train_OHE, X_test_OHE, y_test_OHE, CatBoostClassifier(logging_level='Silent', depth=depth, iterations=max_trees, learning_rate=learning_rate))
    ya_result.insert(0, "name", ["ya_depth_" + str(depth)])
    result_depth_OHE = result_depth_OHE.append(ya_result, ignore_index=True)
for depth in range(11, 20, 2):  
    l_result = calc_classifier(X_train_OHE, y_train_OHE, X_test_OHE, y_test_OHE,  lgb.LGBMClassifier(max_depth=depth, n_estimators=max_trees, learning_rate=learning_rate))
    l_result.insert(0, "name", ["lgbm_depth_" + str(depth)])
    result_depth_OHE = result_depth_OHE.append(l_result, ignore_index=True)   

result_depth_OHE.set_index('name')
Wall time: 1min 47s
In [68]:
result_depth_OHE.sort_values("auc")[::-1]
Out[68]:
name auc time
6 lgbm_depth_17 0.226655 4.973000
7 lgbm_depth_19 0.226366 5.048001
5 lgbm_depth_15 0.225366 5.063996
4 lgbm_depth_13 0.225254 5.848997
3 lgbm_depth_11 0.224522 5.578999
1 ya_depth_7 0.222873 21.397999
2 ya_depth_9 0.222511 27.170990
0 ya_depth_5 0.221202 21.255000
In [100]:
%%time
learning_rate = 0.1
result_trees_OHE = pd.DataFrame()
for max_trees in log_int_iterator(80, 160, 1.3):
    ya_result = calc_classifier(X_train_OHE, y_train_OHE, X_test_OHE, y_test_OHE,  
                                CatBoostClassifier(logging_level='Silent', depth=7, iterations=max_trees, learning_rate=learning_rate))
    ya_result.insert(0, "name", ["ya_trees_" + str(max_trees)])
    result_trees_OHE = result_trees_OHE.append(ya_result, ignore_index=True)
for max_trees in log_int_iterator(160, 320, 1.3):  
    l_result = calc_classifier(X_train_OHE, y_train_OHE, X_test_OHE, y_test_OHE,  
                               lgb.LGBMClassifier(max_depth=17, n_estimators=max_trees, learning_rate=learning_rate))
    l_result.insert(0, "name", ["lgbm_trees_" + str(max_trees)])
    result_trees_OHE = result_trees_OHE.append(l_result, ignore_index=True)   

result_trees_OHE.set_index('name')
Wall time: 1min 53s
In [101]:
result_trees_OHE.sort_values("auc")[::-1]
Out[101]:
name auc time
3 lgbm_trees_160 0.226276 6.691996
4 lgbm_trees_208 0.225808 8.567997
5 lgbm_trees_270 0.225065 10.282003
2 ya_trees_135 0.224759 28.804001
1 ya_trees_104 0.223038 25.267998
0 ya_trees_80 0.221178 22.034606
In [104]:
%%time
result_rate_OHE = pd.DataFrame()
for learning_rate in log_iterator(0.1, 0.2, 1.1):
    ya_result =calc_classifier(X_train_OHE, y_train_OHE, X_test_OHE, y_test_OHE,  
                               CatBoostClassifier(logging_level='Silent', depth=7, iterations=150, learning_rate=learning_rate))
    ya_result.insert(0, "name", ["ya_rate_" + str(learning_rate)])
    result_rate_OHE = result_rate_OHE.append(ya_result, ignore_index=True)
for learning_rate in log_iterator(0.01, 0.1, 1.2):  
    l_result = calc_classifier(X_train_OHE, y_train_OHE, X_test_OHE, y_test_OHE, 
                               lgb.LGBMClassifier(max_depth=17, n_estimators=150, learning_rate=learning_rate))
    l_result.insert(0, "name", ["lgbm_rate_" + str(learning_rate)])
    result_rate_OHE = result_rate_OHE.append(l_result, ignore_index=True)   

result_rate_OHE.set_index('name')
Wall time: 9min 4s
In [105]:
result_rate_OHE.sort_values("auc")[::-1]
Out[105]:
name auc time
19 lgbm_rate_0.07430083706879999 0.228120 11.166834
18 lgbm_rate_0.06191736422399999 0.228070 62.689053
17 lgbm_rate_0.05159780351999999 0.226134 20.864165
16 lgbm_rate_0.0429981696 0.225639 18.839656
2 ya_rate_0.12100000000000002 0.225587 31.980033
20 lgbm_rate_0.08916100448255998 0.225213 11.447431
15 lgbm_rate_0.035831808 0.224927 13.206527
0 ya_rate_0.1 0.224669 31.208994
3 ya_rate_0.13310000000000002 0.224383 33.362998
1 ya_rate_0.11000000000000001 0.223548 31.609998
4 ya_rate_0.14641000000000004 0.223369 33.700764
5 ya_rate_0.16105100000000006 0.223367 35.113729
7 ya_rate_0.1948717100000001 0.222927 38.494523
6 ya_rate_0.17715610000000007 0.222734 34.553633
14 lgbm_rate_0.02985984 0.222442 13.770530
13 lgbm_rate_0.0248832 0.221124 11.955902
12 lgbm_rate_0.020736 0.219033 14.907905
11 lgbm_rate_0.01728 0.216863 15.059998
10 lgbm_rate_0.0144 0.214668 14.947731
9 lgbm_rate_0.012 0.212405 11.160001
8 lgbm_rate_0.01 0.210092 12.293635
In [186]:
X_train_SC = X_train.copy()
X_test_SC = X_test.copy()
y_train_SC = y_train.copy()
y_test_SC = y_test.copy()
In [187]:
%%time
y_mean = y_train_SC.mean()
for i in X_train_SC:
    if X_train_SC[i].dtype == 'object':
        # Fit
        uniq = pd.DataFrame(data=X_train_SC[i].unique(), columns=["unique"])
        val = uniq.apply(lambda x: 1 if y_train_SC[X_train_SC[X_train_SC[i] == x["unique"]].index].mean() > y_mean else 0, axis=1)
        uniq.insert(loc=1, column='value', value=val)
        uniq = uniq.set_index("unique")
        # Transform
        X_train_SC[i] = X_train_SC.apply(lambda x: uniq.loc[x[i],"value"], axis=1)
        X_test_SC[i] = X_test_SC.apply(lambda x: uniq.loc[x[i],"value"], axis=1)
Wall time: 1min 23s
In [188]:
X_train_SC = X_train_SC.select_dtypes(include=["int64", "float64", "uint8"])
X_test_SC = X_test_SC.select_dtypes(include=["int64", "float64", "uint8"])
In [92]:
%%time
max_trees = 100
learning_rate = 0.1
result_depth_SC = pd.DataFrame()
for depth in range(8, 16, 2):
    ya_result =calc_classifier(X_train_SC, y_train_SC, X_test_SC, y_test_SC,  
                               CatBoostClassifier(logging_level='Silent', depth=depth, iterations=max_trees, learning_rate=learning_rate))
    ya_result.insert(0, "name", ["ya_depth_" + str(depth)])
    result_depth_SC = result_depth_SC.append(ya_result, ignore_index=True)
for depth in range(5, 10, 2):  
    l_result = calc_classifier(X_train_SC, y_train_SC, X_test_SC, y_test_SC,   
                               lgb.LGBMClassifier(max_depth=depth, n_estimators=max_trees, learning_rate=learning_rate))
    l_result.insert(0, "name", ["lgbm_depth_" + str(depth)])
    result_depth_SC = result_depth_SC.append(l_result, ignore_index=True)   

result_depth_SC.set_index('name')
Wall time: 5min 31s
In [93]:
result_depth_SC.sort_values("auc")[::-1]
Out[93]:
name auc time
5 lgbm_depth_7 0.226819 4.406003
4 lgbm_depth_5 0.225525 4.298995
6 lgbm_depth_9 0.224344 4.514000
0 ya_depth_8 0.222469 19.968033
1 ya_depth_10 0.221058 34.812993
2 ya_depth_12 0.213479 69.627002
3 ya_depth_14 0.203703 187.957524
In [102]:
%%time
learning_rate = 0.1
result_trees_SC = pd.DataFrame()
for max_trees in log_int_iterator(80, 300, 1.3):
    ya_result = calc_classifier(X_train_SC, y_train_SC, X_test_SC, y_test_SC,  
                                CatBoostClassifier(logging_level='Silent', depth=8, iterations=max_trees, learning_rate=learning_rate))
    ya_result.insert(0, "name", ["ya_trees_" + str(max_trees)])
    result_trees_SC = result_trees_SC.append(ya_result, ignore_index=True)
for max_trees in log_int_iterator(160, 320, 1.3):  
    l_result = calc_classifier(X_train_SC, y_train_SC, X_test_SC, y_test_SC,  
                               lgb.LGBMClassifier(max_depth=7, n_estimators=max_trees, learning_rate=learning_rate))
    l_result.insert(0, "name", ["lgbm_trees_" + str(max_trees)])
    result_trees_SC = result_trees_SC.append(l_result, ignore_index=True)   

result_trees_SC.set_index('name')
Wall time: 3min 53s
In [103]:
result_trees_SC.sort_values("auc")[::-1]
Out[103]:
name auc time
6 lgbm_trees_160 0.226953 7.666000
7 lgbm_trees_208 0.226483 10.101635
8 lgbm_trees_270 0.225873 8.554997
5 ya_trees_297 0.223792 58.516913
4 ya_trees_228 0.223316 42.901617
3 ya_trees_175 0.223121 32.979997
2 ya_trees_135 0.223056 25.090998
1 ya_trees_104 0.222755 20.033993
0 ya_trees_80 0.220863 17.984996
In [106]:
%%time
result_rate_SC = pd.DataFrame()
for learning_rate in log_iterator(0.1, 0.2, 1.1):
    ya_result = calc_classifier(X_train_SC, y_train_SC, X_test_SC, y_test_SC,  
                                CatBoostClassifier(logging_level='Silent', depth=8, iterations=300, learning_rate=learning_rate))
    ya_result.insert(0, "name", ["ya_rate_" + str(learning_rate)])
    result_rate_SC = result_rate_SC.append(ya_result, ignore_index=True)
for learning_rate in log_iterator(0.01, 0.1, 1.2):  
    l_result = calc_classifier(X_train_SC, y_train_SC, X_test_SC, y_test_SC,  
                               lgb.LGBMClassifier(max_depth=7, n_estimators=160, learning_rate=learning_rate))
    l_result.insert(0, "name", ["lgbm_rate_" + str(learning_rate)])
    result_rate_SC = result_rate_SC.append(l_result, ignore_index=True)   

result_rate_SC.set_index('name')
Wall time: 11min
In [107]:
result_rate_SC.sort_values("auc")[::-1]
Out[107]:
name auc time
18 lgbm_rate_0.06191736422399999 0.228050 9.101934
19 lgbm_rate_0.07430083706879999 0.227807 6.801994
17 lgbm_rate_0.05159780351999999 0.227005 7.220998
16 lgbm_rate_0.0429981696 0.225853 7.941535
20 lgbm_rate_0.08916100448255998 0.225570 6.334002
15 lgbm_rate_0.035831808 0.225211 7.306051
0 ya_rate_0.1 0.223777 59.692731
3 ya_rate_0.13310000000000002 0.223437 65.876862
14 lgbm_rate_0.02985984 0.223329 7.364002
1 ya_rate_0.11000000000000001 0.222403 56.800084
13 lgbm_rate_0.0248832 0.222137 9.043053
2 ya_rate_0.12100000000000002 0.220917 66.717576
12 lgbm_rate_0.020736 0.219990 8.970075
4 ya_rate_0.14641000000000004 0.218410 62.318781
11 lgbm_rate_0.01728 0.218278 20.330832
10 lgbm_rate_0.0144 0.216346 17.754997
6 ya_rate_0.17715610000000007 0.215640 59.077204
5 ya_rate_0.16105100000000006 0.215003 55.292202
9 lgbm_rate_0.012 0.214070 8.715777
7 ya_rate_0.1948717100000001 0.212816 85.972667
8 lgbm_rate_0.01 0.211875 10.011747

CatBoost categorical

In [118]:
from catboost import Pool, CatBoostClassifier
In [73]:
categorical = np.array(X_train.select_dtypes(include=["object"]).columns)
categorical
Out[73]:
array(['NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
       'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE'], dtype=object)
In [119]:
train_dataset = Pool(data=X_train,
                     label=y_train,
                     cat_features=categorical)

eval_dataset = Pool(data=X_test,
                    label=y_test,
                    cat_features=categorical)
In [120]:
model = CatBoostClassifier(iterations=135,
                           learning_rate=0.146,
                           depth=7)
In [121]:
model.fit(train_dataset)
0:	learn: 0.5498153	total: 558ms	remaining: 1m 14s
1:	learn: 0.4577457	total: 932ms	remaining: 1m 1s
2:	learn: 0.3933832	total: 1.43s	remaining: 1m 2s
3:	learn: 0.3515783	total: 1.92s	remaining: 1m 2s
4:	learn: 0.3221710	total: 2.36s	remaining: 1m 1s
5:	learn: 0.3031934	total: 2.81s	remaining: 1m
6:	learn: 0.2900960	total: 3.29s	remaining: 1m
7:	learn: 0.2802120	total: 3.75s	remaining: 59.6s
8:	learn: 0.2739610	total: 4.21s	remaining: 58.9s
9:	learn: 0.2686102	total: 4.66s	remaining: 58.3s
10:	learn: 0.2651570	total: 5.11s	remaining: 57.6s
11:	learn: 0.2620787	total: 5.58s	remaining: 57.2s
12:	learn: 0.2600087	total: 6.22s	remaining: 58.4s
13:	learn: 0.2583669	total: 6.83s	remaining: 59s
14:	learn: 0.2566506	total: 7.33s	remaining: 58.7s
15:	learn: 0.2554154	total: 7.82s	remaining: 58.1s
16:	learn: 0.2546885	total: 8.36s	remaining: 58s
17:	learn: 0.2540173	total: 8.91s	remaining: 57.9s
18:	learn: 0.2531824	total: 9.38s	remaining: 57.3s
19:	learn: 0.2527277	total: 10s	remaining: 57.7s
20:	learn: 0.2523273	total: 10.8s	remaining: 58.6s
21:	learn: 0.2519660	total: 11.4s	remaining: 58.5s
22:	learn: 0.2515779	total: 12s	remaining: 58.3s
23:	learn: 0.2511539	total: 12.5s	remaining: 57.6s
24:	learn: 0.2508874	total: 13s	remaining: 57.1s
25:	learn: 0.2504573	total: 13.5s	remaining: 56.5s
26:	learn: 0.2501976	total: 14s	remaining: 55.9s
27:	learn: 0.2499544	total: 14.5s	remaining: 55.4s
28:	learn: 0.2497565	total: 15s	remaining: 54.8s
29:	learn: 0.2496000	total: 15.5s	remaining: 54.2s
30:	learn: 0.2492999	total: 16s	remaining: 53.8s
31:	learn: 0.2491192	total: 16.5s	remaining: 53.2s
32:	learn: 0.2488717	total: 17.2s	remaining: 53.2s
33:	learn: 0.2487257	total: 17.7s	remaining: 52.7s
34:	learn: 0.2485779	total: 18.2s	remaining: 52s
35:	learn: 0.2485027	total: 18.6s	remaining: 51.2s
36:	learn: 0.2483549	total: 19.1s	remaining: 50.5s
37:	learn: 0.2482492	total: 19.5s	remaining: 49.9s
38:	learn: 0.2481184	total: 20s	remaining: 49.2s
39:	learn: 0.2479598	total: 20.5s	remaining: 48.6s
40:	learn: 0.2477822	total: 20.9s	remaining: 48s
41:	learn: 0.2476596	total: 21.4s	remaining: 47.3s
42:	learn: 0.2476343	total: 21.9s	remaining: 46.8s
43:	learn: 0.2475439	total: 22.5s	remaining: 46.5s
44:	learn: 0.2474827	total: 22.9s	remaining: 45.8s
45:	learn: 0.2473764	total: 23.4s	remaining: 45.3s
46:	learn: 0.2472550	total: 23.9s	remaining: 44.7s
47:	learn: 0.2471388	total: 24.4s	remaining: 44.2s
48:	learn: 0.2470800	total: 24.8s	remaining: 43.5s
49:	learn: 0.2469901	total: 25.3s	remaining: 43s
50:	learn: 0.2469083	total: 25.8s	remaining: 42.4s
51:	learn: 0.2467997	total: 26.2s	remaining: 41.9s
52:	learn: 0.2466692	total: 26.7s	remaining: 41.3s
53:	learn: 0.2465628	total: 27.1s	remaining: 40.7s
54:	learn: 0.2464672	total: 27.6s	remaining: 40.1s
55:	learn: 0.2463405	total: 28s	remaining: 39.6s
56:	learn: 0.2462651	total: 28.5s	remaining: 38.9s
57:	learn: 0.2461808	total: 28.9s	remaining: 38.4s
58:	learn: 0.2461140	total: 29.4s	remaining: 37.8s
59:	learn: 0.2460520	total: 29.8s	remaining: 37.3s
60:	learn: 0.2459615	total: 30.3s	remaining: 36.8s
61:	learn: 0.2458719	total: 30.8s	remaining: 36.3s
62:	learn: 0.2457610	total: 31.4s	remaining: 35.9s
63:	learn: 0.2456988	total: 31.9s	remaining: 35.4s
64:	learn: 0.2456547	total: 32.4s	remaining: 34.9s
65:	learn: 0.2455689	total: 32.8s	remaining: 34.3s
66:	learn: 0.2455301	total: 33.3s	remaining: 33.8s
67:	learn: 0.2454726	total: 33.7s	remaining: 33.2s
68:	learn: 0.2453989	total: 34.2s	remaining: 32.7s
69:	learn: 0.2453339	total: 34.7s	remaining: 32.3s
70:	learn: 0.2452627	total: 35.3s	remaining: 31.8s
71:	learn: 0.2451527	total: 35.7s	remaining: 31.3s
72:	learn: 0.2450547	total: 36.3s	remaining: 30.9s
73:	learn: 0.2449666	total: 36.9s	remaining: 30.4s
74:	learn: 0.2448939	total: 37.3s	remaining: 29.9s
75:	learn: 0.2448265	total: 37.8s	remaining: 29.3s
76:	learn: 0.2447641	total: 38.2s	remaining: 28.8s
77:	learn: 0.2446385	total: 38.7s	remaining: 28.2s
78:	learn: 0.2445640	total: 39.1s	remaining: 27.7s
79:	learn: 0.2444399	total: 39.5s	remaining: 27.2s
80:	learn: 0.2443790	total: 40s	remaining: 26.7s
81:	learn: 0.2443560	total: 40.4s	remaining: 26.1s
82:	learn: 0.2442477	total: 40.9s	remaining: 25.6s
83:	learn: 0.2442240	total: 41.3s	remaining: 25.1s
84:	learn: 0.2441174	total: 41.8s	remaining: 24.6s
85:	learn: 0.2440652	total: 42.2s	remaining: 24s
86:	learn: 0.2440192	total: 42.6s	remaining: 23.5s
87:	learn: 0.2439386	total: 43.1s	remaining: 23s
88:	learn: 0.2438737	total: 43.6s	remaining: 22.5s
89:	learn: 0.2437947	total: 44s	remaining: 22s
90:	learn: 0.2437083	total: 44.4s	remaining: 21.5s
91:	learn: 0.2436437	total: 44.8s	remaining: 20.9s
92:	learn: 0.2435485	total: 45.2s	remaining: 20.4s
93:	learn: 0.2434764	total: 45.7s	remaining: 19.9s
94:	learn: 0.2433881	total: 46.3s	remaining: 19.5s
95:	learn: 0.2433211	total: 46.8s	remaining: 19s
96:	learn: 0.2432510	total: 47.2s	remaining: 18.5s
97:	learn: 0.2431556	total: 47.7s	remaining: 18s
98:	learn: 0.2430725	total: 48.1s	remaining: 17.5s
99:	learn: 0.2429764	total: 48.7s	remaining: 17.1s
100:	learn: 0.2429229	total: 49.3s	remaining: 16.6s
101:	learn: 0.2428362	total: 49.7s	remaining: 16.1s
102:	learn: 0.2427608	total: 50.2s	remaining: 15.6s
103:	learn: 0.2426981	total: 50.6s	remaining: 15.1s
104:	learn: 0.2426320	total: 51.2s	remaining: 14.6s
105:	learn: 0.2425525	total: 51.7s	remaining: 14.1s
106:	learn: 0.2424648	total: 52.2s	remaining: 13.7s
107:	learn: 0.2423618	total: 52.7s	remaining: 13.2s
108:	learn: 0.2422521	total: 53.1s	remaining: 12.7s
109:	learn: 0.2422300	total: 53.5s	remaining: 12.2s
110:	learn: 0.2421200	total: 54s	remaining: 11.7s
111:	learn: 0.2420304	total: 54.4s	remaining: 11.2s
112:	learn: 0.2419500	total: 54.9s	remaining: 10.7s
113:	learn: 0.2418972	total: 55.4s	remaining: 10.2s
114:	learn: 0.2418163	total: 55.8s	remaining: 9.71s
115:	learn: 0.2417474	total: 56.3s	remaining: 9.22s
116:	learn: 0.2417198	total: 56.8s	remaining: 8.73s
117:	learn: 0.2416341	total: 57.2s	remaining: 8.24s
118:	learn: 0.2415379	total: 57.7s	remaining: 7.76s
119:	learn: 0.2414737	total: 58.2s	remaining: 7.27s
120:	learn: 0.2414033	total: 58.7s	remaining: 6.79s
121:	learn: 0.2413545	total: 59.1s	remaining: 6.3s
122:	learn: 0.2412670	total: 59.6s	remaining: 5.82s
123:	learn: 0.2411448	total: 1m	remaining: 5.33s
124:	learn: 0.2410740	total: 1m	remaining: 4.84s
125:	learn: 0.2409829	total: 1m	remaining: 4.35s
126:	learn: 0.2409350	total: 1m 1s	remaining: 3.87s
127:	learn: 0.2408700	total: 1m 1s	remaining: 3.38s
128:	learn: 0.2407545	total: 1m 2s	remaining: 2.9s
129:	learn: 0.2407068	total: 1m 2s	remaining: 2.42s
130:	learn: 0.2406661	total: 1m 3s	remaining: 1.94s
131:	learn: 0.2406052	total: 1m 4s	remaining: 1.46s
132:	learn: 0.2405590	total: 1m 4s	remaining: 971ms
133:	learn: 0.2405088	total: 1m 5s	remaining: 486ms
134:	learn: 0.2404435	total: 1m 5s	remaining: 0us
Out[121]:
<catboost.core.CatBoostClassifier at 0x17e15c71240>
In [146]:
preds_proba = model.predict_proba(eval_dataset)[:,1]
print("Catboost pr auc", pr_auc(y_test, preds_proba))
Catboost pr auc 0.22378657110847214

AUC 0.223777 time 59.692731 для SC

AUC 0.225587 time 31.980033 для OHE

Стандартный метод обработки примерно похож по времени и качеству на SC

In [171]:
X_train_lgbm = X_train.copy()
X_test_lgbm = X_test.copy()
In [172]:
l_cat = "name:" + ','.join(categorical)
In [173]:
for i in categorical:
    X_train_lgbm[i] = X_train_lgbm[i].astype('category')
    X_test_lgbm[i] = X_test_lgbm[i].astype('category')
In [174]:
lgb_train = lgb.Dataset(X_train_lgbm, y_train)
In [201]:
%%time
lgb_params = {
    'objective': 'binary',
    'learning_rate'    : 0.074,
    'max_depth'        : 5,
    'n_estimators'     : 160}

gbm = lgb.train(lgb_params,
                lgb_train,
                num_boost_round=100)
C:\Users\alex1\Anaconda3\lib\site-packages\lightgbm\engine.py:118: UserWarning: Found `n_estimators` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
Wall time: 5.05 s
In [202]:
print("Lgbm pr auc", pr_auc(y_test,  gbm.predict(X_test_lgbm)))
Lgbm pr auc 0.2251404634039964

AUC 0.228050 time 9.101934 для SC

AUC 0.228120 time 11.166834 для OHE

Ручная обработка дала немного лучшие результаты, однако затрачивала больше времени. Возможная причина - различные гиперпараметры, подобранные для выборок без категориальных признаках и с их обработкой.

Conclusion:

Smoth counter счетчики обрабатываются несколько быстрее чем OHE метод. Возможно это обусловлено разным количеством признаков в выборках после обработки. В то же время, оптимальные гиперпараметры подбираются разные, что говорит о разной природе данных

LightGBM и CatBoost позволяют работать с категориальными признаками, для lgbm необходимо конвертировать тип столбцов и просто передать, далее он сам разберется. Catboost принимает список категориальных столбцов и с ними работает.

Задание 5. (1 балл)

Реализуйте блендинг (получение ответов нескольких моделей и взятие их с весами (их нужно подбирать на обучающей выборке)) подобранных в предыдущем задании моделей и сравните качество.

In [194]:
models = {
    "cat_OHE" : {"model":CatBoostClassifier(logging_level='Silent', depth=7, iterations=150, learning_rate=0.12)},
    "lgbm_OHE" : {"model":lgb.LGBMClassifier(max_depth=17, n_estimators=200, learning_rate=0.074)},
    "cat_SC" : {"model":CatBoostClassifier(logging_level='Silent', depth=8, iterations=300, learning_rate=0.1)},
    "lgbm_SC" : {"model":lgb.LGBMClassifier(max_depth=7, n_estimators=160, learning_rate=0.062)},
}
In [277]:
kf = KFold(n_splits=2,shuffle=True, random_state=0)
_y_train = y_train_OHE
_y_train = _y_train.reset_index(drop=True)
for k,v in models.items():
    _x_train = None
    if k.endswith("_OHE"):
        _x_train = X_train_OHE
    else:
        _x_train = X_train_SC
    _x_train = _x_train.reset_index(drop=True)

    for train_index, test_index in kf.split(_x_train):
        begin = time()
        v["model"].fit(_x_train.values[train_index], _y_train[train_index])
        print(k, "fit time: ", time() - begin)
        v["proba"] = v["model"].predict_proba(_x_train.values[test_index])[:,1]
        v["auc"] = pr_auc(_y_train[test_index], v["proba"])
        break
cat_OHE fit time:  20.19306516647339
lgbm_OHE fit time:  4.79600191116333
cat_SC fit time:  44.660996437072754
lgbm_SC fit time:  4.1190056800842285
In [278]:
from sklearn.linear_model import LinearRegression
In [279]:
d_blend = pd.DataFrame()
for k,v in models.items():
    d_blend[k] = pd.Series(v['proba'])
d_blend.head()
Out[279]:
cat_OHE lgbm_OHE cat_SC lgbm_SC
0 0.068491 0.048327 0.057223 0.055038
1 0.018775 0.026063 0.016626 0.028834
2 0.041735 0.042472 0.042970 0.034474
3 0.024000 0.027915 0.016231 0.026459
4 0.039678 0.040463 0.030689 0.070291
In [280]:
lr_blend = LinearRegression()
lr_blend.fit(d_blend, _y_train[test_index])
lr_blend.coef_
Out[280]:
array([0.2876268 , 0.12899986, 0.3083336 , 0.26534119])
In [281]:
out = pd.DataFrame()
for k,v in models.items(): 
    _x_test = None
    if k.endswith("_OHE"):
        _x_test = X_test_OHE
    else:
        _x_test = X_test_SC
    out[k] = v["model"].predict_proba(_x_test)[:,1]
out.head()
Out[281]:
cat_OHE lgbm_OHE cat_SC lgbm_SC
0 0.168339 0.229026 0.146101 0.199716
1 0.158058 0.167274 0.285554 0.146087
2 0.056792 0.051607 0.074889 0.047308
3 0.152083 0.163874 0.198038 0.183191
4 0.080180 0.077358 0.059342 0.053366
In [286]:
result_predict = np.sum(out.values * 0.5, axis=1)
result_predict
Out[286]:
array([0.37159091, 0.37848636, 0.1152988 , ..., 0.06779153, 0.05902516,
       0.10185661])
In [287]:
_auc = pr_auc(y_test_OHE,result_predict)
_auc
Out[287]:
0.22509396174519794
In [288]:
result_blend = pd.DataFrame()
for k,v in models.items(): 
    result_blend = result_blend.append(pd.DataFrame(data={"name":[k], "auc":v["auc"]}), ignore_index=True)
result_blend = result_blend.append(pd.DataFrame(data={"name":["blend"], "auc":[_auc]}), ignore_index=True)
In [285]:
result_blend.sort_values("auc")[::-1]
Out[285]:
name auc
3 lgbm_SC 0.230242
2 cat_SC 0.230041
0 cat_OHE 0.228619
1 lgbm_OHE 0.227083
4 blend 0.224872

Блендинг не повысил качество

Задание 6. (1.5 балла)

В задании 3 вы подобрали гиперпараметры для LightGBM и CatBoost на вещественных признаках. Визуализируйте важности признаков, посчитанные этими алгоритмами, в виде горизонтального bar-plot (отсортируйте признаки по убыванию важности, подпишите названия признаков по оси y).

Для каждого из двух алгоритмов удалите неважные признаки (обычно по bar-plot хорошо видно порог на важность, с которого начинается хвост неважных признаков) и обучите ту же модель на получившихся данных. Сильно ли упало качество при удалении признаков, которые модель считает неважными?

In [28]:
import matplotlib.pyplot as plt
%matplotlib inline
In [31]:
_x_train = X_train.select_dtypes(include=["int64", "float64"])
_x_test = X_test.select_dtypes(include=["int64", "float64"])
In [49]:
%%time
clf = lgb.LGBMClassifier(max_depth=5, n_estimators=160, learning_rate=0.074)
clf.fit(_x_train, y_train)
feature_imp = pd.DataFrame(sorted(zip(clf.feature_importances_,_x_train.columns)), columns=['Value','Feature'])
print("pr-auc", pr_auc(y_test, clf.predict_proba(_x_test)[:,1]))
pr-auc 0.22302033721552292
Wall time: 6.4 s
In [38]:
objects = feature_imp['Feature']
y_pos = np.arange(len(objects))
performance = feature_imp['Value']
plt.figure(figsize=(20,20))
plt.grid(True)
plt.title('Важность признаков для LGBM')
plt.barh(y_pos, performance, align='center', alpha=0.5)
plt.yticks(y_pos, objects)
plt.show()
In [53]:
select = feature_imp.sort_values("Value")[::-1]["Feature"][:41].values
_x_train_lgb = _x_train[select]
_x_test_lgb = _x_test[select]
In [54]:
%%time
clf = lgb.LGBMClassifier(max_depth=5, n_estimators=160, learning_rate=0.074)
clf.fit(_x_train_lgb, y_train)
feature_imp = pd.DataFrame(sorted(zip(clf.feature_importances_,_x_train_lgb.columns)), columns=['Value','Feature'])
print("pr-auc", pr_auc(y_test, clf.predict_proba(_x_test_lgb)[:,1]))
pr-auc 0.22301141951869569
Wall time: 5.15 s
In [51]:
%%time
clf = CatBoostClassifier(max_depth=7, n_estimators=135, learning_rate=0.146)
clf.fit(_x_train, y_train)
feature_imp = pd.DataFrame(sorted(zip(clf.feature_importances_,_x_train.columns)), columns=['Value','Feature'])
print("pr-auc", pr_auc(y_test, clf.predict_proba(_x_test)[:,1]))
0:	learn: 0.5466391	total: 164ms	remaining: 22s
1:	learn: 0.4562286	total: 321ms	remaining: 21.3s
2:	learn: 0.3935334	total: 481ms	remaining: 21.2s
3:	learn: 0.3543266	total: 638ms	remaining: 20.9s
4:	learn: 0.3248818	total: 790ms	remaining: 20.5s
5:	learn: 0.3053074	total: 945ms	remaining: 20.3s
6:	learn: 0.2903538	total: 1.14s	remaining: 20.8s
7:	learn: 0.2813296	total: 1.31s	remaining: 20.7s
8:	learn: 0.2747577	total: 1.47s	remaining: 20.6s
9:	learn: 0.2693616	total: 1.63s	remaining: 20.4s
10:	learn: 0.2647987	total: 1.84s	remaining: 20.7s
11:	learn: 0.2621068	total: 2.05s	remaining: 21s
12:	learn: 0.2598813	total: 2.29s	remaining: 21.4s
13:	learn: 0.2583308	total: 2.51s	remaining: 21.7s
14:	learn: 0.2571862	total: 2.7s	remaining: 21.6s
15:	learn: 0.2559531	total: 2.88s	remaining: 21.4s
16:	learn: 0.2550236	total: 3.12s	remaining: 21.7s
17:	learn: 0.2543727	total: 3.31s	remaining: 21.5s
18:	learn: 0.2536770	total: 3.49s	remaining: 21.3s
19:	learn: 0.2532385	total: 3.65s	remaining: 21s
20:	learn: 0.2528858	total: 3.82s	remaining: 20.7s
21:	learn: 0.2525923	total: 3.98s	remaining: 20.4s
22:	learn: 0.2521166	total: 4.15s	remaining: 20.2s
23:	learn: 0.2517843	total: 4.32s	remaining: 20s
24:	learn: 0.2514605	total: 4.49s	remaining: 19.8s
25:	learn: 0.2511555	total: 4.66s	remaining: 19.6s
26:	learn: 0.2508952	total: 4.93s	remaining: 19.7s
27:	learn: 0.2507467	total: 5.17s	remaining: 19.8s
28:	learn: 0.2505553	total: 5.37s	remaining: 19.6s
29:	learn: 0.2504023	total: 5.55s	remaining: 19.4s
30:	learn: 0.2502301	total: 5.74s	remaining: 19.2s
31:	learn: 0.2499631	total: 5.93s	remaining: 19.1s
32:	learn: 0.2497816	total: 6.09s	remaining: 18.8s
33:	learn: 0.2496380	total: 6.28s	remaining: 18.6s
34:	learn: 0.2494756	total: 6.47s	remaining: 18.5s
35:	learn: 0.2492645	total: 6.66s	remaining: 18.3s
36:	learn: 0.2490400	total: 6.85s	remaining: 18.1s
37:	learn: 0.2489032	total: 7.05s	remaining: 18s
38:	learn: 0.2487495	total: 7.24s	remaining: 17.8s
39:	learn: 0.2485909	total: 7.43s	remaining: 17.6s
40:	learn: 0.2484669	total: 7.61s	remaining: 17.4s
41:	learn: 0.2483230	total: 7.81s	remaining: 17.3s
42:	learn: 0.2482206	total: 7.97s	remaining: 17.1s
43:	learn: 0.2481188	total: 8.17s	remaining: 16.9s
44:	learn: 0.2479314	total: 8.36s	remaining: 16.7s
45:	learn: 0.2478285	total: 8.53s	remaining: 16.5s
46:	learn: 0.2477250	total: 8.7s	remaining: 16.3s
47:	learn: 0.2476175	total: 8.87s	remaining: 16.1s
48:	learn: 0.2475045	total: 9.06s	remaining: 15.9s
49:	learn: 0.2474546	total: 9.23s	remaining: 15.7s
50:	learn: 0.2473639	total: 9.43s	remaining: 15.5s
51:	learn: 0.2472873	total: 9.59s	remaining: 15.3s
52:	learn: 0.2471622	total: 9.77s	remaining: 15.1s
53:	learn: 0.2470383	total: 9.93s	remaining: 14.9s
54:	learn: 0.2469383	total: 10.1s	remaining: 14.7s
55:	learn: 0.2467582	total: 10.3s	remaining: 14.5s
56:	learn: 0.2467027	total: 10.5s	remaining: 14.3s
57:	learn: 0.2466219	total: 10.7s	remaining: 14.1s
58:	learn: 0.2465000	total: 10.8s	remaining: 14s
59:	learn: 0.2464183	total: 11s	remaining: 13.8s
60:	learn: 0.2463628	total: 11.2s	remaining: 13.6s
61:	learn: 0.2462682	total: 11.4s	remaining: 13.4s
62:	learn: 0.2461664	total: 11.5s	remaining: 13.2s
63:	learn: 0.2460555	total: 11.6s	remaining: 12.9s
64:	learn: 0.2459491	total: 11.8s	remaining: 12.7s
65:	learn: 0.2459013	total: 12s	remaining: 12.5s
66:	learn: 0.2458418	total: 12.1s	remaining: 12.3s
67:	learn: 0.2457953	total: 12.2s	remaining: 12s
68:	learn: 0.2457364	total: 12.3s	remaining: 11.8s
69:	learn: 0.2456418	total: 12.5s	remaining: 11.6s
70:	learn: 0.2455152	total: 12.6s	remaining: 11.4s
71:	learn: 0.2454336	total: 12.7s	remaining: 11.1s
72:	learn: 0.2453576	total: 12.9s	remaining: 10.9s
73:	learn: 0.2452786	total: 13s	remaining: 10.7s
74:	learn: 0.2451672	total: 13.2s	remaining: 10.6s
75:	learn: 0.2451019	total: 13.4s	remaining: 10.4s
76:	learn: 0.2450247	total: 13.5s	remaining: 10.2s
77:	learn: 0.2449193	total: 13.6s	remaining: 9.96s
78:	learn: 0.2448351	total: 13.8s	remaining: 9.81s
79:	learn: 0.2447962	total: 14s	remaining: 9.62s
80:	learn: 0.2447685	total: 14.1s	remaining: 9.41s
81:	learn: 0.2446774	total: 14.2s	remaining: 9.21s
82:	learn: 0.2446143	total: 14.4s	remaining: 9.01s
83:	learn: 0.2445683	total: 14.5s	remaining: 8.83s
84:	learn: 0.2445222	total: 14.7s	remaining: 8.67s
85:	learn: 0.2444601	total: 14.9s	remaining: 8.47s
86:	learn: 0.2443848	total: 15s	remaining: 8.29s
87:	learn: 0.2442999	total: 15.1s	remaining: 8.09s
88:	learn: 0.2442105	total: 15.3s	remaining: 7.9s
89:	learn: 0.2441195	total: 15.4s	remaining: 7.71s
90:	learn: 0.2440422	total: 15.5s	remaining: 7.52s
91:	learn: 0.2439909	total: 15.7s	remaining: 7.32s
92:	learn: 0.2439323	total: 15.8s	remaining: 7.14s
93:	learn: 0.2438433	total: 15.9s	remaining: 6.96s
94:	learn: 0.2437950	total: 16.1s	remaining: 6.77s
95:	learn: 0.2437059	total: 16.2s	remaining: 6.58s
96:	learn: 0.2436633	total: 16.3s	remaining: 6.4s
97:	learn: 0.2435801	total: 16.5s	remaining: 6.22s
98:	learn: 0.2435162	total: 16.6s	remaining: 6.05s
99:	learn: 0.2434214	total: 16.8s	remaining: 5.88s
100:	learn: 0.2433601	total: 16.9s	remaining: 5.71s
101:	learn: 0.2433146	total: 17.1s	remaining: 5.53s
102:	learn: 0.2432440	total: 17.3s	remaining: 5.37s
103:	learn: 0.2431750	total: 17.5s	remaining: 5.21s
104:	learn: 0.2431027	total: 17.7s	remaining: 5.04s
105:	learn: 0.2429946	total: 17.8s	remaining: 4.87s
106:	learn: 0.2429409	total: 17.9s	remaining: 4.69s
107:	learn: 0.2428753	total: 18.1s	remaining: 4.51s
108:	learn: 0.2427856	total: 18.2s	remaining: 4.34s
109:	learn: 0.2427089	total: 18.3s	remaining: 4.16s
110:	learn: 0.2426494	total: 18.4s	remaining: 3.99s
111:	learn: 0.2425567	total: 18.6s	remaining: 3.82s
112:	learn: 0.2424810	total: 18.8s	remaining: 3.65s
113:	learn: 0.2423885	total: 18.9s	remaining: 3.49s
114:	learn: 0.2423468	total: 19.1s	remaining: 3.33s
115:	learn: 0.2423182	total: 19.3s	remaining: 3.16s
116:	learn: 0.2422620	total: 19.4s	remaining: 2.99s
117:	learn: 0.2421555	total: 19.5s	remaining: 2.81s
118:	learn: 0.2420774	total: 19.7s	remaining: 2.65s
119:	learn: 0.2420280	total: 19.8s	remaining: 2.48s
120:	learn: 0.2419559	total: 20s	remaining: 2.31s
121:	learn: 0.2419029	total: 20.1s	remaining: 2.15s
122:	learn: 0.2418020	total: 20.3s	remaining: 1.98s
123:	learn: 0.2416871	total: 20.4s	remaining: 1.81s
124:	learn: 0.2416060	total: 20.5s	remaining: 1.64s
125:	learn: 0.2415634	total: 20.7s	remaining: 1.48s
126:	learn: 0.2415027	total: 20.9s	remaining: 1.31s
127:	learn: 0.2414403	total: 21s	remaining: 1.15s
128:	learn: 0.2413465	total: 21.1s	remaining: 983ms
129:	learn: 0.2412980	total: 21.3s	remaining: 818ms
130:	learn: 0.2412372	total: 21.4s	remaining: 653ms
131:	learn: 0.2411406	total: 21.5s	remaining: 489ms
132:	learn: 0.2410715	total: 21.7s	remaining: 326ms
133:	learn: 0.2409998	total: 21.8s	remaining: 163ms
134:	learn: 0.2409475	total: 21.9s	remaining: 0us
pr-auc 0.21880762137363477
Wall time: 28.5 s
In [52]:
objects = feature_imp['Feature']
y_pos = np.arange(len(objects))
performance = feature_imp['Value']
plt.figure(figsize=(20,20))
plt.grid(True)
plt.title('Важность признаков для CatBoost')
plt.barh(y_pos, performance, align='center', alpha=0.5)
plt.yticks(y_pos, objects)
plt.show()
In [55]:
select = feature_imp.sort_values("Value")[::-1]["Feature"][:30].values
_x_train_lgb = _x_train[select]
_x_test_lgb = _x_test[select]
In [56]:
%%time
clf = lgb.LGBMClassifier(max_depth=5, n_estimators=160, learning_rate=0.074)
clf.fit(_x_train_lgb, y_train)
feature_imp = pd.DataFrame(sorted(zip(clf.feature_importances_,_x_train_lgb.columns)), columns=['Value','Feature'])
print("pr-auc", pr_auc(y_test, clf.predict_proba(_x_test_lgb)[:,1]))
pr-auc 0.22260249357719117
Wall time: 4.55 s

После селеции качество lgbm практически не изменилось, качество же catboost'a даже несколько повысилось