Stability Analysis of Correlation Coefficients for Time-Series Features (with Code)

The following article is from 宅码 Author Ai

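In time-series problems, features may be time-sensitive. For example, under some market conditions stock returns depend more on a company's price-to-earnings ratio, while under other conditions they depend more on the turnover rate. In essence, this shows up as an unstable correlation between a feature and the target variable over time. We can therefore run a correlation analysis to find the features that are unstable over time, remove them, and make the model more robust.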
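
To see why a single full-sample correlation can hide this effect, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the column names `regime`, `feat` and `target`, the sample size, and the noise level are hypothetical and do not come from the dataset used in the rest of this article. The feature drives the target with a positive sign in one regime and a negative sign in the other.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000  # rows per regime (hypothetical size)

# Hypothetical feature whose effect on the target flips sign between regimes.
feat = rng.normal(size=2 * n)
noise = rng.normal(scale=0.5, size=2 * n)
sign = np.r_[np.ones(n), -np.ones(n)]  # +1 in regime A, -1 in regime B

toy = pd.DataFrame({
    "regime": np.repeat(["A", "B"], n),
    "feat": feat,
    "target": sign * feat + noise,
})

# Correlation over the whole sample vs. correlation inside each regime.
overall_corr = toy["feat"].corr(toy["target"])
regime_corr = {
    name: group["feat"].corr(group["target"])
    for name, group in toy.groupby("regime")
}

print(f"overall: {overall_corr:.2f}")
print({name: round(value, 2) for name, value in regime_corr.items()})
```

The two per-regime correlations are strong but of opposite sign, so they largely cancel in the full sample. Splitting the data by period, as done below by month, is what makes this kind of instability visible.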




Here is an example, step by step:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data
train_df = pd.read_csv('train.csv')
train_df.head()

First, on the training set, compute the correlation between each feature and target month by month:
# Build a year-month key for each row
train_df['Date'] = pd.to_datetime(train_df['Date'])
train_df['Date_year'] = train_df['Date'].dt.year
train_df['Date_month'] = train_df['Date'].dt.month
train_df['Date_ym'] = train_df['Date'].dt.strftime('%Y-%m')

# For each month, compute the correlation between every feature and target
date_yms = train_df['Date_ym'].unique()
corr_df = []
for date_ym in date_yms:
    curr_df = train_df[train_df['Date_ym'] == date_ym]
    # numeric_only=True skips the datetime and string columns, which corr() cannot handle
    curr_corr_df = curr_df.corr(numeric_only=True)
    curr_corr_df = curr_corr_df['target'].reset_index()
    curr_corr_df.rename(columns={'index': 'feature', 'target': 'corr'}, inplace=True)
    curr_corr_df['Date_ym'] = str(date_ym)
    corr_df.append(curr_corr_df)

# One long table with columns: feature, corr, Date_ym
corr_df = pd.concat(corr_df, axis=0).reset_index(drop=True)
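
The same monthly table can also be built without an explicit loop. The sketch below is one possible alternative, not the author's code: it uses `groupby` with `DataFrame.corrwith` on a small synthetic frame, and the feature names `f_stable` and `f_flip`, the dates, and the noise level are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two months, two features, one target.
rng = np.random.default_rng(1)
dates = pd.to_datetime(["2021-01-15"] * 200 + ["2021-02-15"] * 200)
f_stable = rng.normal(size=400)
f_flip = rng.normal(size=400)
flip = np.where(dates.month == 1, 1.0, -1.0)  # f_flip changes sign in February

toy = pd.DataFrame({
    "Date": dates,
    "f_stable": f_stable,
    "f_flip": f_flip,
    "target": f_stable + flip * f_flip + rng.normal(scale=0.5, size=400),
})

feature_cols = ["f_stable", "f_flip"]
# One label per calendar month, used as the grouping key.
month = toy["Date"].dt.to_period("M").rename("Date_ym")

# Wide table: one row per month, one column per feature.
monthly_corr = (
    toy.groupby(month)[feature_cols + ["target"]]
       .apply(lambda g: g[feature_cols].corrwith(g["target"]))
)

# Long table with the same columns as corr_df above: Date_ym, feature, corr.
corr_long = monthly_corr.reset_index().melt(
    id_vars="Date_ym", var_name="feature", value_name="corr"
)

print(monthly_corr.round(2))
```

The wide layout is convenient for eyeballing sign flips, while the long layout is what `sns.lineplot` and `sns.boxplot` expect.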

Next, look at how each feature's monthly correlation changes over time:
# Columns used for training (everything except the date helpers and target)
USE_COLS = [f for f in corr_df['feature'].unique()
            if f not in ['Date', 'Date_year', 'Date_month', 'target']]
corr_df = corr_df[corr_df['feature'].isin(USE_COLS)]

plt.figure(figsize=(25, 4))
sns.lineplot(data=corr_df, x='Date_ym', y='corr', hue='feature')
plt.xticks(rotation=90)  # rotate the month labels so they stay readable
plt.title('the correlation along the time axis')
plt.show()
We can also look at how widely each feature's monthly correlation varies, shown here as a box plot:
plt.figure(figsize=(25, 4))
sns.boxplot(data=corr_df.sort_values('feature'), x='feature', y='corr')
plt.xticks(rotation=90)
plt.title('The BoxPlot of each feature')
plt.show()

Let's print the features with the largest standard deviation of monthly correlation:
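top_corrStd_fnum = 10  # how many features to list, ranked by the std of their monthly correlation
# Select only 'corr' before std(): the Date_ym column is a string and has no standard deviation
top_corrStd_feats = (corr_df.groupby('feature')['corr'].std()
                     .sort_values(ascending=False)
                     .index[:top_corrStd_fnum].to_list())
print(top_corrStd_feats)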

Putting the steps above together into one function:
def get_unstable_feats(df, top_fnum=10, corr_thresh=0.15):
    """Compute the monthly correlation between each feature and target on the
    training set, then pick the most unstable features by the standard
    deviation of that correlation, for use in later feature selection.

    Args:
        df (pd.DataFrame): training set
        top_fnum (int): number of top unstable features to return
        corr_thresh (float): threshold on the std of the monthly correlation
        Note: to select by corr_thresh instead of top_fnum, set top_fnum to None.
    Returns:
        unstable_feats (list): the unstable features
    """
    df = df.copy()  # work on a copy so the caller's DataFrame is not modified

    # Build a year-month key for each row
    df['Date'] = pd.to_datetime(df['Date'])
    df['Date_year'] = df['Date'].dt.year
    df['Date_month'] = df['Date'].dt.month
    df['Date_ym'] = df['Date'].dt.strftime('%Y-%m')

    # For each month, compute the correlation between every feature and target
    date_yms = df['Date_ym'].unique()
    corr_df = []
    for date_ym in date_yms:
        curr_df = df[df['Date_ym'] == date_ym]
        curr_corr_df = curr_df.corr(numeric_only=True)
        curr_corr_df = curr_corr_df['target'].reset_index()
        curr_corr_df.rename(columns={'index': 'feature', 'target': 'corr'}, inplace=True)
        curr_corr_df['Date_ym'] = str(date_ym)
        corr_df.append(curr_corr_df)
    corr_df = pd.concat(corr_df, axis=0).reset_index(drop=True)

    # Drop the columns that are not training features
    USE_COLS = [f for f in corr_df['feature'].unique()
                if f not in ['Date', 'Date_year', 'Date_month', 'target']]
    corr_df = corr_df[corr_df['feature'].isin(USE_COLS)]

    # Std of the monthly correlation per feature, largest first
    corr_std = corr_df.groupby('feature')['corr'].std().sort_values(ascending=False)
    if top_fnum is not None:
        top_corrStd_feats = corr_std.index[:top_fnum].to_list()
    elif corr_thresh is not None:
        top_corrStd_feats = corr_std[corr_std >= corr_thresh].index.to_list()
    else:
        raise ValueError('Set either top_fnum or corr_thresh')
    print('Features with Unstable Correlation:', top_corrStd_feats)
    return top_corrStd_feats

top_corrStd_feats = get_unstable_feats(train_df, top_fnum=10, corr_thresh=None)

In practice, this method did improve scores on datasets that contain redundant features and show clearly unstable correlations.
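Extension: besides correlation analysis, adversarial validation, a common Kaggle technique, can also be used to screen out unstable features of this kind.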
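
As a rough illustration of the idea, the sketch below trains a classifier to tell an "early" period from a "late" period using only the features. If it separates the two periods well, the features that contribute most are the ones whose distribution has shifted. This is a minimal sketch under stated assumptions, not the procedure from the referenced discussion: the data is synthetic, the names `f_stable` and `f_drift` are hypothetical, and scikit-learn's `RandomForestClassifier` is just a stand-in for whatever model you prefer. Note that it detects drift in the feature distributions themselves, which is related to, but not the same as, an unstable correlation with target.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 500  # rows per period (hypothetical size)

# Hypothetical data: f_drift shifts its mean in the late period, f_stable does not.
early = pd.DataFrame({"f_stable": rng.normal(size=n),
                      "f_drift": rng.normal(size=n)})
late = pd.DataFrame({"f_stable": rng.normal(size=n),
                     "f_drift": rng.normal(loc=1.5, size=n)})

X = pd.concat([early, late], ignore_index=True)
is_late = np.r_[np.zeros(n, dtype=int), np.ones(n, dtype=int)]  # adversarial label

# Out-of-fold probabilities, so the AUC is not inflated by overfitting.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
proba = cross_val_predict(clf, X, is_late, cv=5, method="predict_proba")[:, 1]
auc = roc_auc_score(is_late, proba)

# Fit once on all rows to rank features by how much they help separate the periods.
clf.fit(X, is_late)
importance = pd.Series(clf.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)

print(f"adversarial AUC: {auc:.2f}")
print(importance)
```

An AUC close to 0.5 would mean the two periods are indistinguishable. The further it rises above that, the stronger the case for inspecting or dropping the top-ranked features.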
Reference: https://www.kaggle.com/competitions/ubiquant-market-prediction/discussion/312398