What Should I Do to Finish a Kaggle Competition?

This post collects useful ideas I picked up from reading other people's posts on Kaggle. If you are new to data science and Kaggle, why not take a look?

A basic framework for analyzing a data problem, drawn from Spaceship. It mainly uses the models included in LazyClassifier.

0 Data loading
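As a minimal sketch of the loading step, assuming the data arrives as a CSV file and pandas is available: the column names and the inline CSV text below are made up for illustration, and on Kaggle you would pass the competition's own file path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Stand-in for a competition file. The columns here are hypothetical;
# in practice pass the real path, e.g. pd.read_csv("train.csv").
csv_text = """id,age,cabin,target
0001_01,39.0,B/0/P,False
0002_01,24.0,F/0/S,True
0003_01,,A/0/S,False
"""

train = pd.read_csv(io.StringIO(csv_text))

# A first look at size, column types, and the first rows.
print(train.shape)
print(train.dtypes)
print(train.head())
```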

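1 Initial data analysis

Check whether each feature has null values. (Consider how to handle them later.)

Check the ratio of each feature's number of unique values to the total number of rows. (Some features may be too varied to be useful.)

Check each feature's distribution (percentages, numeric types, enumerable types) and visualize it. (Consider whether some features are too concentrated or too flat.)

Check the ratio of each feature's outlier count to the total number of rows. (If there are some outliers, consider clipping the extreme values.)

Check the correlations between features.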
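The checks in this step can be sketched with pandas roughly as follows. The toy DataFrame and its column names are invented for illustration, the 1.5 × IQR rule is only one common way to flag outliers (an assumption, not something the original notebooks necessarily use), and plotting is left out to keep the example short.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical columns.
df = pd.DataFrame({
    "age": [22.0, 35.0, np.nan, 41.0, 29.0, 95.0],
    "spend": [0.0, 120.0, 80.0, np.nan, 60.0, 5000.0],
    "planet": ["Earth", "Mars", "Earth", "Europa", None, "Earth"],
})

# Null values per feature.
null_counts = df.isnull().sum()

# Number of unique values relative to the number of rows.
unique_ratio = df.nunique() / len(df)

# Distributions: summary statistics for numeric features,
# percentages for enumerable (categorical) features.
print(df.describe())
print(df["planet"].value_counts(normalize=True))

# Share of outliers per numeric feature, using the 1.5 * IQR rule.
numeric = df.select_dtypes(include="number")
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
is_outlier = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
outlier_ratio = is_outlier.mean()

# Pairwise correlation between numeric features.
corr = numeric.corr()

print(null_counts, unique_ratio, outlier_ratio, corr, sep="\n\n")
```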

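2 Data preprocessing

Drop irrelevant features.

Split or merge certain features.

Handle null values in features (drop the affected rows, replace with the mean, or replace with the mode).

Convert non-numeric features to one-hot encoding.

Augment the data by building extra features from existing ones.

Clean up column names and unify the data format.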
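A rough sketch of these preprocessing steps, again on invented data: every column name is hypothetical, and the choices made here (mean for numeric nulls, mode for categorical nulls, a summed spending feature) are just one possible combination of the options listed above.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical columns.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "cabin": ["B/0/P", "F/1/S", None, "A/2/S"],
    "age": [22.0, np.nan, 35.0, 41.0],
    "planet": ["Earth", "Mars", None, "Earth"],
    "food": [10.0, 0.0, 5.0, np.nan],
    "spa": [0.0, 20.0, np.nan, 1.0],
})

# Drop an irrelevant feature.
df = df.drop(columns=["name"])

# Split one feature into several, then drop the original.
parts = df["cabin"].str.split("/", expand=True)
df["deck"], df["side"] = parts[0], parts[2]
df = df.drop(columns=["cabin"])

# Fill nulls: mean for numeric features, mode for categorical ones.
num_cols = ["age", "food", "spa"]
cat_cols = ["planet", "deck", "side"]
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Build an extra feature from existing ones.
df["total_spend"] = df["food"] + df["spa"]

# One-hot encode the non-numeric features.
clean = pd.get_dummies(df, columns=cat_cols)

# Normalize column names to one format.
clean.columns = clean.columns.str.lower()

print(clean.head())
```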

3 Model training

Split the data into training and test sets (train_test_split)

from sklearn.model_selection import train_test_split

Import a model package and train

from lazypredict.Supervised import LazyClassifier

Use the trained model to predict on the data

Build a loss function to compute the score
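The training steps above might look like the sketch below. To keep it dependent on scikit-learn only, two ordinary scikit-learn classifiers stand in for the many models LazyClassifier tries, the data is synthetic (`make_classification`), and accuracy and log loss are assumed as the metrics; a real competition defines its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed competition data.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Hold out 20% of the rows for validation.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Two hand-picked models instead of LazyClassifier's full sweep.
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_valid)
    proba = model.predict_proba(X_valid)
    scores[name] = {
        "accuracy": accuracy_score(y_valid, pred),
        "log_loss": log_loss(y_valid, proba),
    }

for name, metrics in scores.items():
    print(name, metrics)
```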

4 Generate the predictions and submit
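A submission is usually a CSV with one row per test example. The sketch below assumes an `id` column and a `target` column, both hypothetical names; each competition specifies its own in the sample submission file. It writes to an in-memory buffer so it runs anywhere, where a real run would pass a file path to `to_csv`.

```python
import io

import pandas as pd

# Hypothetical test ids and model predictions.
test_ids = ["0001", "0002", "0003"]
predictions = [True, False, True]

submission = pd.DataFrame({"id": test_ids, "target": predictions})

# index=False keeps the row index out of the file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```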

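Model optimization tips drawn from introductory tutorials

1 Cross-validation with cross_val_score

2 Use a blender to combine the results of several models, or use VotingClassifier to balance several models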
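Both tips can be combined in a few lines. This is a sketch on synthetic data with three arbitrarily chosen base models; soft voting averages the predicted probabilities, which is close to what a simple blender does, though blending setups in real notebooks are often more elaborate.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the training data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Soft voting averages the class probabilities of the base models.
voter = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
)

# 5-fold cross-validation gives one accuracy score per fold.
cv_scores = cross_val_score(voter, X, y, cv=5, scoring="accuracy")
print(cv_scores, cv_scores.mean())
```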

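Models worth considering, drawn from model summaries

1 The three tree-model workhorses: XGBoost, LightGBM, CatBoost

2 SVR (support vector regression), MLP + Embedding, TabNet

3 Transformer

PyCaret can automatically compare models to find the best one; you can also build a model class and add models to it by hand.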
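For the hand-built route, one possible shape is a small class that stores named models and ranks them by cross-validated score. `ModelComparer` is a hypothetical name invented for this sketch, and scikit-learn models stand in for XGBoost, LightGBM, and the others so that nothing beyond scikit-learn is needed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


class ModelComparer:
    """Hypothetical helper: register models, then rank them by CV score."""

    def __init__(self, cv=5, scoring="accuracy"):
        self.cv = cv
        self.scoring = scoring
        self.models = {}

    def add(self, name, model):
        self.models[name] = model
        return self  # allows chained add() calls

    def compare(self, X, y):
        results = {
            name: cross_val_score(
                model, X, y, cv=self.cv, scoring=self.scoring
            ).mean()
            for name, model in self.models.items()
        }
        # Best mean score first.
        return sorted(results.items(), key=lambda kv: kv[1], reverse=True)


X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

comparer = (
    ModelComparer(cv=3)
    .add("gboost", GradientBoostingClassifier(random_state=0))
    .add("svc", SVC())
    .add("knn", KNeighborsClassifier())
)
ranking = comparer.compare(X, y)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```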

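Thoughts after looking at some models:

1 Find as many different features as possible

2 Split the dataset into several smaller sets and train on each

3 Try multiple sets of model parameters and pick the best one at the end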

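How to tune parameters to optimize a model:

1 Use grid search to find the best parameter combination: GridSearchCV

2 Based on the important features reported after training, adjust how much weight different features get, or add new features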
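Both tuning ideas can be sketched together: a grid search over a small, arbitrarily chosen parameter grid, followed by a look at the fitted model's feature importances. The random forest, the grid values, and the synthetic data are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

# Every combination in the grid is evaluated with 3-fold CV.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)

# Rank features by importance in the best model; low-ranked features
# are candidates for removal, high-ranked ones for further engineering.
importances = search.best_estimator_.feature_importances_
ranked = sorted(enumerate(importances), key=lambda p: p[1], reverse=True)
for index, importance in ranked:
    print(f"feature {index}: {importance:.3f}")
```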

Yilin Blog
