Learn how to approach and win competitions on Kaggle.

Kaggle competitions process

Competitions overview

Explore train data

# Import pandas
import pandas as pd

# Read train data
train = pd.read_csv('train.csv')

# Look at the shape of the data
print('Train shape:', train.shape)

# Look at the head() of the data
print(train.head())
<script.py> output:
    Train shape: (15500, 5)
           id        date  store  item  sales
    0  100000  2017-12-01      1     1     19
    1  100001  2017-12-02      1     1     16
    2  100002  2017-12-03      1     1     31
    3  100003  2017-12-04      1     1      7
    4  100004  2017-12-05      1     1     20

Explore test data

import pandas as pd

# Read the test data
test = pd.read_csv('test.csv')
# Print train and test columns
print('Train columns:', train.columns.tolist())
print('Test columns:', test.columns.tolist())

# Read the sample submission file
sample_submission = pd.read_csv('sample_submission.csv')

# Look at the head() of the sample submission
print(sample_submission.head())
<script.py> output:
    Train columns: ['id', 'date', 'store', 'item', 'sales']
    Test columns: ['id', 'date', 'store', 'item']
       id  sales
    0   0     52
    1   1     52
    2   2     52
    3   3     52
    4   4     52
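The sample submission shows the exact format Kaggle expects: one row per test `id` plus the predicted target column. A minimal sketch of assembling such a file; the `make_submission` helper and the toy ids/predictions are hypothetical, not part of the course code:

```python
import pandas as pd

# Hypothetical helper: pair test ids with model predictions
# in the two-column format of sample_submission.csv
def make_submission(ids, preds):
    return pd.DataFrame({'id': ids, 'sales': preds})

# Toy usage: constant predictions, like the 52s in the sample submission
submission = make_submission(ids=[0, 1, 2], preds=[52, 52, 52])
submission.to_csv('submission.csv', index=False)
```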

Public vs Private leaderboard

Train XGBoost models

import xgboost as xgb

# Create DMatrix on train data
dtrain = xgb.DMatrix(data=train[['store', 'item']],
                     label=train['sales'])

# Define xgboost parameters
params = {'objective': 'reg:squarederror',  # 'reg:linear' is deprecated
          'max_depth': 15,
          'verbosity': 0}

# Train xgboost model
xg_depth_15 = xgb.train(params=params, dtrain=dtrain)

Explore overfitting XGBoost

from sklearn.metrics import mean_squared_error

dtrain = xgb.DMatrix(data=train[['store', 'item']])
dtest = xgb.DMatrix(data=test[['store', 'item']])

# For each of 3 trained models
for model in [xg_depth_2, xg_depth_8, xg_depth_15]:
    # Make predictions
    train_pred = model.predict(dtrain)     
    test_pred = model.predict(dtest)          
    
    # Calculate metrics
    mse_train = mean_squared_error(train['sales'], train_pred)                  
    mse_test = mean_squared_error(test['sales'], test_pred)
    print('MSE Train: {:.3f}. MSE Test: {:.3f}'.format(mse_train, mse_test))
<script.py> output:
    MSE Train: 631.275. MSE Test: 558.522
    MSE Train: 183.771. MSE Test: 337.337
    MSE Train: 134.984. MSE Test: 355.534

Dive into the Competition

Understand the problem

Define a competition metric

  • Mean Squared Error (MSE) for the regression problem:
MSE = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2
import numpy as np

# Import MSE from sklearn
from sklearn.metrics import mean_squared_error

# Define your own MSE function
def own_mse(y_true, y_pred):
    # Raise differences to the power of 2
    squares = np.power(y_true - y_pred, 2)
    # Find mean over all observations
    err = np.mean(squares)
    return err

print('Sklearn MSE: {:.5f}. '.format(mean_squared_error(y_regression_true, y_regression_pred)))
print('Your MSE: {:.5f}. '.format(own_mse(y_regression_true, y_regression_pred)))
<script.py> output:
    Sklearn MSE: 0.15418. 
    Your MSE: 0.15418. 
  • Logarithmic Loss (LogLoss) for the binary classification problem:
LogLoss = -\frac{1}{N}\sum_{i=1}^{N}\left(y_i \ln p_i + (1 - y_i)\ln(1 - p_i)\right)
import numpy as np

# Import log_loss from sklearn
from sklearn.metrics import log_loss

# Define your own LogLoss function
# prob_pred: probability predicted
def own_logloss(y_true, prob_pred):
    # Find loss for each observation
    terms = y_true * np.log(prob_pred) + (1 - y_true) * np.log(1 - prob_pred)
    # Find mean over all observations
    err = np.mean(terms)
    return -err

print('Sklearn LogLoss: {:.5f}'.format(log_loss(y_classification_true, y_classification_pred)))
print('Your LogLoss: {:.5f}'.format(own_logloss(y_classification_true, y_classification_pred)))
<script.py> output:
    Sklearn LogLoss: 1.10801
    Your LogLoss: 1.10801

Initial EDA

EDA statistics

# Shapes of train and test data
print('Train shape:', train.shape)
print('Test shape:', test.shape)

# Train head()
print(train.head())
<script.py> output:
    Train shape: (20000, 8)
    Test shape: (9914, 7)
       id  fare_amount          pickup_datetime  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_count
    0   0          4.5  2009-06-15 17:26:21 UTC           -73.844           40.721            -73.842            40.712                1
    1   1         16.9  2010-01-05 16:52:16 UTC           -74.016           40.711            -73.979            40.782                1
    2   2          5.7  2011-08-18 00:35:00 UTC           -73.983           40.761            -73.991            40.751                2
    3   3          7.7  2012-04-21 04:30:42 UTC           -73.987           40.733            -73.992            40.758                1
    4   4          5.3  2010-03-09 07:51:00 UTC           -73.968           40.768            -73.957            40.784                1
# Shapes of train and test data
print('Train shape:', train.shape)
print('Test shape:', test.shape)

# Train head()
print(train.head())

# Describe the target variable
print(train.fare_amount.describe())

# Train distribution of passengers within rides
print(train.passenger_count.value_counts())
<script.py> output:
    count    20000.000
    mean        11.303
    std          9.542
    min         -3.000
    25%          6.000
    50%          8.500
    75%         12.500
    max        180.000
    Name: fare_amount, dtype: float64
    1    13999
    2     2912
    5     1327
    3      860
    4      420
    6      407
    0       75
    Name: passenger_count, dtype: int64

EDA plots I

# Calculate the ride distance
train['distance_km'] = haversine_distance(train)

# Draw a scatterplot
plt.scatter(x=train['fare_amount'], y=train['distance_km'], alpha=0.5)
plt.xlabel('Fare amount')
plt.ylabel('Distance, km')
plt.title('Fare amount based on the distance')

# Limit on the distance
plt.ylim(0, 50)
plt.show()
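`haversine_distance()` is supplied by the course environment. A plausible implementation over the taxi pickup/dropoff coordinates looks like this; the function body is an assumption based on the column names shown earlier:

```python
import numpy as np

def haversine_distance(df, R=6371):
    """Great-circle distance in km between pickup and dropoff points.

    Assumes the taxi columns pickup_latitude, pickup_longitude,
    dropoff_latitude and dropoff_longitude, all in degrees.
    """
    lat1 = np.radians(df['pickup_latitude'])
    lon1 = np.radians(df['pickup_longitude'])
    lat2 = np.radians(df['dropoff_latitude'])
    lon2 = np.radians(df['dropoff_longitude'])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    # Haversine formula
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))
```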

EDA plots II

# Create hour feature
train['pickup_datetime'] = pd.to_datetime(train.pickup_datetime)
train['hour'] = train.pickup_datetime.dt.hour

# Find median fare_amount for each hour
hour_price = train.groupby('hour', as_index=False)['fare_amount'].median()

# Plot the line plot
plt.plot(hour_price['hour'], hour_price['fare_amount'], marker='o')
plt.xlabel('Hour of the day')
plt.ylabel('Median fare amount')
plt.title('Fare amount based on day time')
plt.xticks(range(24))
plt.show()

Local validation

K-fold cross-validation

# Import KFold
from sklearn.model_selection import KFold

# Create a KFold object
kf = KFold(n_splits=3, shuffle=True, random_state=123)

# Loop through each split
fold = 0
for train_index, test_index in kf.split(train):
    # Obtain training and testing folds
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    print('Fold: {}'.format(fold))
    print('CV train shape: {}'.format(cv_train.shape))
    print('Medium interest listings in CV train: {}\n'.format(sum(cv_train.interest_level == 'medium')))
    fold += 1
<script.py> output:
    Fold: 0
    CV train shape: (666, 9)
    Medium interest listings in CV train: 175
    
    Fold: 1
    CV train shape: (667, 9)
    Medium interest listings in CV train: 165
    
    Fold: 2
    CV train shape: (667, 9)
    Medium interest listings in CV train: 162

Stratified K-fold

# Import StratifiedKFold
from sklearn.model_selection import StratifiedKFold

# Create a StratifiedKFold object
str_kf = StratifiedKFold(n_splits=3, shuffle=True, random_state=123)

# Loop through each split
fold = 0
for train_index, test_index in str_kf.split(train, train['interest_level']):
    # Obtain training and testing folds
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    print('Fold: {}'.format(fold))
    print('CV train shape: {}'.format(cv_train.shape))
    print('Medium interest listings in CV train: {}\n'.format(sum(cv_train.interest_level == 'medium')))
    fold += 1
<script.py> output:
    Fold: 0
    CV train shape: (666, 9)
    Medium interest listings in CV train: 167
    
    Fold: 1
    CV train shape: (667, 9)
    Medium interest listings in CV train: 167
    
    Fold: 2
    CV train shape: (667, 9)
    Medium interest listings in CV train: 168

Validation usage

Time K-fold

from sklearn.model_selection import TimeSeriesSplit

# Create TimeSeriesSplit object
time_kfold = TimeSeriesSplit(n_splits=3)

# Sort train data by date
train = train.sort_values("date")

# Iterate through each split
fold = 0
for train_index, test_index in time_kfold.split(train):
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    
    print('Fold :', fold)
    print('Train date range: from {} to {}'.format(cv_train.date.min(), cv_train.date.max()))
    print('Test date range: from {} to {}\n'.format(cv_test.date.min(), cv_test.date.max()))
    fold += 1
<script.py> output:
    Fold : 0
    Train date range: from 2017-12-01 to 2017-12-08
    Test date range: from 2017-12-08 to 2017-12-16
    
    Fold : 1
    Train date range: from 2017-12-01 to 2017-12-16
    Test date range: from 2017-12-16 to 2017-12-24
    
    Fold : 2
    Train date range: from 2017-12-01 to 2017-12-24
    Test date range: from 2017-12-24 to 2017-12-31

Overall validation score

from sklearn.model_selection import TimeSeriesSplit
import numpy as np

# Sort train data by date
train = train.sort_values('date')

# Initialize 3-fold time cross-validation
kf = TimeSeriesSplit(n_splits=3)

# Get MSE scores for each cross-validation split
mse_scores = get_fold_mse(train, kf)

print('Mean validation MSE: {:.5f}'.format(np.mean(mse_scores)))
print('MSE by fold: {}'.format(mse_scores))
print('Overall validation MSE: {:.5f}'.format(np.mean(mse_scores) + np.std(mse_scores)))
<script.py> output:
    Mean validation MSE: 955.49186
    MSE by fold: [890.30336, 961.65797, 1014.51424]
    Overall validation MSE: 1006.38784
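`get_fold_mse()` is a helper provided by the exercise. One way it could be implemented, assuming a simple model on the 'store' and 'item' features from the demand data above:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

def get_fold_mse(train, kf):
    """Fit a simple model on each time split and collect the test-fold MSE."""
    mse_scores = []
    for train_index, test_index in kf.split(train):
        cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
        # Any regressor could be used here; LinearRegression keeps it simple
        model = LinearRegression()
        model.fit(cv_train[['store', 'item']], cv_train['sales'])
        preds = model.predict(cv_test[['store', 'item']])
        mse_scores.append(round(mean_squared_error(cv_test['sales'], preds), 5))
    return mse_scores
```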

Feature Engineering

Feature engineering

Arithmetical features

# Look at the initial RMSE
print('RMSE before feature engineering:', get_kfold_rmse(train))

# Find the total area of the house
train['TotalArea'] = train['TotalBsmtSF'] + train['FirstFlrSF'] + train['SecondFlrSF']
print('RMSE with total area:', get_kfold_rmse(train))

# Find the area of the garden
train['GardenArea'] = train['LotArea'] - train['FirstFlrSF']
print('RMSE with garden area:', get_kfold_rmse(train))

# Find total number of bathrooms
train['TotalBath'] = train['FullBath'] + train['HalfBath']
print('RMSE with number of bathrooms:', get_kfold_rmse(train))
<script.py> output:
    RMSE before feature engineering: 36029.39
    RMSE with total area: 35073.2
    RMSE with garden area: 34413.55
    RMSE with number of bathrooms: 34506.78

Date features

# Concatenate train and test together
taxi = pd.concat([train, test])

# Convert pickup date to datetime object
taxi['pickup_datetime'] = pd.to_datetime(taxi['pickup_datetime'])

# Create a day of week feature
taxi['dayofweek'] = taxi['pickup_datetime'].dt.dayofweek

# Create an hour feature
taxi['hour'] = taxi['pickup_datetime'].dt.hour

# Split back into train and test
new_train = taxi[taxi['id'].isin(train['id'])]
new_test = taxi[taxi['id'].isin(test['id'])]

Categorical features

Label encoding

# Concatenate train and test together
houses = pd.concat([train, test])

# Label encoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# Create new features
houses['RoofStyle_enc'] = le.fit_transform(houses['RoofStyle'])
houses['CentralAir_enc'] = le.fit_transform(houses['CentralAir'])

# Look at new features
print(houses[['RoofStyle', 'RoofStyle_enc', 'CentralAir', 'CentralAir_enc']].head())
<script.py> output:
      RoofStyle  RoofStyle_enc CentralAir  CentralAir_enc
    0     Gable              1          Y               1
    1     Gable              1          Y               1
    2     Gable              1          Y               1
    3     Gable              1          Y               1
    4     Gable              1          Y               1

One-Hot encoding

The problem with label encoding is that it implicitly assumes an ordinal relationship (a ranking) between the categories.

# Concatenate train and test together
houses = pd.concat([train, test])

# Label encode binary 'CentralAir' feature
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
houses['CentralAir_enc'] = le.fit_transform(houses['CentralAir'])

# Create One-Hot encoded features
ohe = pd.get_dummies(houses['RoofStyle'], prefix='RoofStyle')

# Concatenate OHE features to houses
houses = pd.concat([houses, ohe], axis=1)

# Look at OHE features
print(houses[[col for col in houses.columns if 'RoofStyle' in col]].head(3))
<script.py> output:
      RoofStyle  RoofStyle_Flat  RoofStyle_Gable  RoofStyle_Gambrel  RoofStyle_Hip  RoofStyle_Mansard  RoofStyle_Shed
    0     Gable               0                1                  0              0                  0               0
    1     Gable               0                1                  0              0                  0               0
    2     Gable               0                1                  0              0                  0               0

Target encoding

Typically applied to features with 10 or more categories.

Mean target encoding

First of all, you will create a function that implements mean target encoding. Remember that you need to implement the following two steps:

  • Calculate the mean on the train, apply to the test
  • Split train into K folds. Calculate the out-of-fold mean for each fold, apply to this particular fold

Each of these steps will be implemented in a separate function: test_mean_target_encoding() and train_mean_target_encoding(), respectively.

The final function mean_target_encoding() takes as arguments the train and test DataFrames, the name of the categorical column to be encoded, the name of the target column, and a smoothing parameter alpha (alpha usually ranges from 5 to 10). It returns two values: a new feature for the train and test DataFrames, respectively.
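The smoothed statistic computed inside test_mean_target_encoding() can be written as:

\hat{y}_c = \frac{\sum_{i \in c} y_i + \alpha \cdot \mu_{\text{global}}}{n_c + \alpha}

where n_c is the number of train rows in category c and \mu_{\text{global}} is the global target mean. As alpha grows, the encoding for rare categories shrinks toward the global mean, which regularizes the feature.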

import pandas as pd
from sklearn.model_selection import KFold

def test_mean_target_encoding(train, test, target, categorical, alpha=5):
    # Calculate global mean on the train data
    global_mean = train[target].mean()
    
    # Group by the categorical feature and calculate its properties
    train_groups = train.groupby(categorical)
    category_sum = train_groups[target].sum()
    category_size = train_groups.size()
    
    # Calculate smoothed mean target statistics
    train_statistics = (category_sum + global_mean * alpha) / (category_size + alpha)
    
    # Apply statistics to the test data and fill new categories
    test_feature = test[categorical].map(train_statistics).fillna(global_mean)
    return test_feature.values

def train_mean_target_encoding(train, target, categorical, alpha=5):
    # Create 5-fold cross-validation
    kf = KFold(n_splits=5, random_state=123, shuffle=True)
    train_feature = pd.Series(index=train.index, dtype='float64')
    
    # For each folds split
    for train_index, test_index in kf.split(train):
        cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
      
        # Calculate out-of-fold statistics and apply to cv_test
        cv_test_feature = test_mean_target_encoding(cv_train, cv_test, target, categorical, alpha)
        
        # Save new feature for this particular fold
        train_feature.iloc[test_index] = cv_test_feature       
    return train_feature.values

def mean_target_encoding(train, test, target, categorical, alpha=5):
  
    # Get the train feature
    train_feature = train_mean_target_encoding(train, target, categorical, alpha)
  
    # Get the test feature
    test_feature = test_mean_target_encoding(train, test, target, categorical, alpha)
    
    # Return new features to add to the model
    return train_feature, test_feature

K-fold cross-validation

You will work with a binary classification problem on a subsample from a Kaggle playground competition. The objective of this competition is to predict whether the famous basketball player Kobe Bryant scored or missed a particular shot.

Train data is available in your workspace as the bryant_shots DataFrame. It contains data on 10,000 shots together with their properties and a target variable shot_made_flag – whether the shot was scored or not.

One of the features in the data is game_id – the particular game in which the shot was taken. There are 541 distinct games, so you are dealing with a high-cardinality categorical feature. Let’s encode it using the target mean!

Suppose you’re using 5-fold cross-validation and want to evaluate a mean target encoded feature on the local validation.

# Create 5-fold cross-validation
kf = KFold(n_splits=5, random_state=123, shuffle=True)

# For each folds split
for train_index, test_index in kf.split(bryant_shots):
    cv_train, cv_test = bryant_shots.iloc[train_index], bryant_shots.iloc[test_index]

    # Create mean target encoded feature
    cv_train['game_id_enc'], cv_test['game_id_enc'] = mean_target_encoding(train=cv_train,
                                                                           test=cv_test,
                                                                           target='shot_made_flag',
                                                                           categorical='game_id',
                                                                           alpha=5)
    # Look at the encoding
    print(cv_train[['game_id', 'shot_made_flag', 'game_id_enc']].sample(n=1))
<script.py> output:
           game_id  shot_made_flag  game_id_enc
    7106  20500532             0.0        0.362
           game_id  shot_made_flag  game_id_enc
    5084  20301100             0.0        0.568
           game_id  shot_made_flag  game_id_enc
    6687  20500228             0.0        0.481
           game_id  shot_made_flag  game_id_enc
    5046  20301075             0.0        0.252
           game_id  shot_made_flag  game_id_enc
    4662  20300515             1.0        0.453

Beyond binary classification

Of course, binary classification is just a single special case. Target encoding can be applied to any target variable type:

  • For binary classification, mean target encoding is usually used
  • For regression, the mean can be replaced with the median, quartiles, etc.
  • For multi-class classification with N classes, we create N features with the target mean for each category in a one-vs-all fashion

The mean_target_encoding() function you’ve created can be used for any target type listed above. Let’s apply it to the regression problem in the House Prices Kaggle competition.

Your goal is to encode a categorical feature “RoofStyle” using mean target encoding. The train and test DataFrames are already available in your workspace.

# Create mean target encoded feature
train['RoofStyle_enc'], test['RoofStyle_enc'] = mean_target_encoding(train=train,
                                                                     test=test,
                                                                     target='SalePrice',
                                                                     categorical='RoofStyle',
                                                                     alpha=10)

# Look at the encoding
print(test[['RoofStyle', 'RoofStyle_enc']].drop_duplicates())
<script.py> output:
         RoofStyle  RoofStyle_enc
    0        Gable     171565.948
    1          Hip     217594.645
    98     Gambrel     164152.950
    133       Flat     188703.563
    362    Mansard     180775.939
    1053      Shed     188267.663

Missing data

Find missing data

# Read DataFrame
twosigma = pd.read_csv('twosigma_train.csv')

# Find the number of missing values in each column
print(twosigma.isnull().sum())
<script.py> output:
    id                 0
    bathrooms          0
    bedrooms           0
    building_id       13
    latitude           0
    longitude          0
    manager_id         0
    price             32
    interest_level     0
    dtype: int64
# Look at the columns with the missing values
print(twosigma[['building_id', 'price']].head())
                         building_id   price
    0  53a5b119ba8f7b61d4e010512e0dfc85  3000.0
    1  c5c8a357cba207596b04d1afd1e4f130  5465.0
    2  c3ba40552e2120b0acfc3cb5730bb2aa  2850.0
    3  28d9ad350afeaab8027513a3e52ac8d5  3275.0
    4                               NaN  3350.0

Impute missing data

You’ve found that the “price” and “building_id” columns have missing values in the Rental Listing Inquiries dataset. So, before passing the data to the models, you need to impute these values.

The numerical feature “price” will be imputed with the mean of the non-missing prices.

Imputing the categorical feature “building_id” with the most frequent category is a bad idea, because it would imply that all apartments with a missing “building_id” are located in the most popular building. A better approach is to impute it with a new category.

# Import SimpleImputer
from sklearn.impute import SimpleImputer

# Create mean imputer
mean_imputer = SimpleImputer(strategy='mean')

# Price imputation
rental_listings[['price']] = mean_imputer.fit_transform(rental_listings[['price']])
# Import SimpleImputer
from sklearn.impute import SimpleImputer

# Create constant imputer
constant_imputer = SimpleImputer(strategy='constant', fill_value='MISSING')

# building_id imputation
rental_listings[['building_id']] = constant_imputer.fit_transform(rental_listings[['building_id']])

Modeling

Baseline model

Hyperparameter tuning

Model ensembling

Final tips


NOTE:

Choosing the right algorithm for a Kaggle competition is an important factor in achieving good results. Below are some algorithms commonly used as baselines, along with algorithms worth trying to improve your score:

Commonly used baseline algorithms

  • Linear Regression: a simple, easy-to-understand algorithm, commonly used as a baseline for regression problems.
  • Logistic Regression: similar to Linear Regression, but used for binary classification problems.
  • Decision Tree: a simple algorithm that is easy to visualize, but prone to overfitting if not well controlled.
  • Random Forest: a powerful algorithm, often used as a baseline for classification problems. A Random Forest is an ensemble of decision trees, which reduces overfitting and improves accuracy.

Algorithms worth trying to improve results

  • Gradient Boosting Machines (GBM): a family of powerful boosting algorithms that includes XGBoost, LightGBM and CatBoost. GBMs often produce very strong results in Kaggle competitions.
  • Neural Networks: neural networks can learn complex representations of the data and are often used for hard problems such as image processing and natural language processing.
  • Support Vector Machines (SVM): a powerful algorithm for finding the optimal decision boundary between classes.
  • k-Nearest Neighbors (k-NN): a simple algorithm based on the distance between data points. k-NN can be effective in certain cases.

Other factors to consider

  • Data characteristics: the type of data (numerical, text, images, …) and its structure affect the choice of algorithm.
  • Data size: on large datasets, complex algorithms such as Neural Networks can be more effective, but they require more computational resources.
  • Time and resources: some algorithms need more training time and computational resources than others.

Advice

  • Start with a baseline: try a few simple algorithms first to establish a baseline.
  • Experiment and compare: try several different algorithms and compare their results on the validation set to pick the best one.
  • Tune hyperparameters: adjust each algorithm's hyperparameters to get the best results.
  • Ensemble learning: combine the predictions of several different models to improve accuracy.
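The advice above (start with a baseline, then compare candidates on validation) can be sketched as follows; the synthetic data from make_classification and the particular model list are illustrative, not from the course:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data as a stand-in for competition data
X, y = make_classification(n_samples=500, n_features=10, random_state=123)

# Score several baselines with the same cross-validation scheme
baselines = {
    'logreg': LogisticRegression(max_iter=1000),
    'tree': DecisionTreeClassifier(random_state=123),
    'forest': RandomForestClassifier(n_estimators=100, random_state=123),
}
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in baselines.items()}

# Rank the baselines by mean validation accuracy
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print('{}: {:.3f}'.format(name, score))
```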