Sklearn Cross-Validation: Using Stratified and Group Iterators
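Some datasets have an uneven class distribution, so a plain random split can leave the training and test folds with very different class proportions. In that case, use a stratified cross-validation iterator. It is essentially stratified sampling: each fold keeps roughly the same class ratio as the full dataset.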
The stratified and group iterators are introduced below.
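Notice: 《Python系列教程》 (Python Tutorial Series) is an official original work of this site, "54笨鸟", and is protected by certificates issued by national agencies and local copyright offices.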
1. Stratified cross-validation
# Import the required module
In [1]: from sklearn.model_selection import StratifiedKFold

# Prepare the sample data
In [2]: X = ["a", "b", "c", "d", "e", "f"]

# Prepare the class labels
In [3]: y = [0, 0, 1, 1, 1, 1]

# Create the stratified K-fold object
In [4]: skf = StratifiedKFold(n_splits=2)

# Inspect the splits (training indices-test indices)
In [5]: for train, test in skf.split(X, y):
   ...:     print("%s-%s" % (train, test))
[1 4 5]-[0 2 3]
[0 2 3]-[1 4 5]
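To make the effect of stratification visible, the following minimal sketch counts the classes in each test fold for the same data. The variable name `fold_class_counts` and the use of NumPy arrays are choices made for this illustration, not part of the original example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Same toy data as above, stored as NumPy arrays so they can be indexed by fold
X = np.array(["a", "b", "c", "d", "e", "f"])
y = np.array([0, 0, 1, 1, 1, 1])

skf = StratifiedKFold(n_splits=2)

# Count how many samples of each class land in every test fold
fold_class_counts = []
for train_idx, test_idx in skf.split(X, y):
    counts = np.bincount(y[test_idx], minlength=2)
    fold_class_counts.append(counts.tolist())
    print("test indices:", test_idx, "class counts:", counts)
```

Each test fold receives one sample of class 0 and two of class 1, which matches the 2:4 class ratio of the full dataset.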
2. Group cross-validation
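Sometimes the data are collected in groups: samples within a group are not independent of each other, while the groups themselves are. To keep this dependence from distorting the validation result, the groups that appear in the test set must not appear in the training set of the same split. The group iterators are as follows.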
1) Group K-fold
# Import the required module
In [1]: from sklearn.model_selection import GroupKFold

# Prepare the sample data
In [2]: X = ["a", "b", "c", "d", "e", "f"]

# Prepare the class labels
In [3]: y = [0, 0, 1, 1, 1, 1]

# Prepare the group labels (one per sample)
In [4]: groups = [1, 1, 2, 2, 2, 2]

# Create the group K-fold object
In [5]: gkf = GroupKFold(n_splits=2)

# Inspect the splits (training indices-test indices)
In [6]: for train, test in gkf.split(X, y, groups=groups):
   ...:     print("%s-%s" % (train, test))
[0 1]-[2 3 4 5]
[2 3 4 5]-[0 1]
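The requirement that no group is shared between training and test sets can be checked directly. The sketch below reuses the data above; the `overlaps` list is a hypothetical helper introduced only for this check.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.array(["a", "b", "c", "d", "e", "f"])
y = np.array([0, 0, 1, 1, 1, 1])
groups = np.array([1, 1, 2, 2, 2, 2])

gkf = GroupKFold(n_splits=2)

# For every split, collect the group labels present on both sides
overlaps = []
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    shared = set(groups[train_idx]) & set(groups[test_idx])
    overlaps.append(shared)
    print("train groups:", set(groups[train_idx]),
          "test groups:", set(groups[test_idx]))
```

Every entry of `overlaps` should be an empty set, because each group is placed entirely in one fold.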
2) Leave-one-group-out cross-validation
# Import the required module
In [1]: from sklearn.model_selection import LeaveOneGroupOut

# Prepare the sample data
In [2]: X = ["a", "b", "c", "d", "e", "f"]

# Prepare the class labels
In [3]: y = [0, 0, 1, 1, 1, 1]

# Prepare the group labels (one per sample)
In [4]: groups = [1, 2, 2, 2, 2, 2]

# Create the leave-one-group-out object
In [5]: logo = LeaveOneGroupOut()

# Inspect the splits (training indices-test indices)
In [6]: for train, test in logo.split(X, y, groups=groups):
   ...:     print("%s-%s" % (train, test))
[1 2 3 4 5]-[0]
[0]-[1 2 3 4 5]
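With leave-one-group-out, the number of splits equals the number of distinct groups, and each split holds out exactly one of them. The sketch below illustrates this with three groups; these group labels are chosen for the illustration and differ from the example above.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.array(["a", "b", "c", "d", "e", "f"])
y = np.array([0, 0, 1, 1, 1, 1])
groups = np.array([1, 1, 2, 2, 3, 3])  # illustrative labels: three groups

logo = LeaveOneGroupOut()

# The iterator yields one split per distinct group
n_splits = logo.get_n_splits(groups=groups)

# Record which group is held out as the test set in each split
held_out = []
for train_idx, test_idx in logo.split(X, y, groups=groups):
    held_out.append(np.unique(groups[test_idx]).tolist())

print(n_splits, held_out)
```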
3) Leave-P-groups-out cross-validation
# Import the required module
In [1]: from sklearn.model_selection import LeavePGroupsOut

# Prepare the sample data
In [2]: X = ["a", "b", "c", "d", "e", "f"]

# Prepare the class labels
In [3]: y = [0, 0, 1, 1, 1, 1]

# Prepare the group labels (one per sample)
In [4]: groups = [1, 1, 2, 2, 3, 3]

# Create the leave-P-groups-out object, holding out 2 groups per split
In [5]: lpgo = LeavePGroupsOut(n_groups=2)

# Inspect the splits (training indices-test indices)
In [6]: for train, test in lpgo.split(X, y, groups=groups):
   ...:     print("%s-%s" % (train, test))
[4 5]-[0 1 2 3]
[2 3]-[0 1 4 5]
[0 1]-[2 3 4 5]
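Leave-P-groups-out enumerates every combination of P groups as a test set, so the number of splits grows combinatorially with the number of groups. The sketch below assumes four illustrative groups (not the three used above) and compares the iterator's split count with the binomial coefficient.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePGroupsOut

# Illustrative data: four groups of two samples each
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])
X = np.arange(len(groups)).reshape(-1, 1)
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])

lpgo = LeavePGroupsOut(n_groups=2)

# Number of splits reported by the iterator
n_splits = lpgo.get_n_splits(groups=groups)

# Number of ways to choose 2 groups out of the distinct groups
expected = comb(len(np.unique(groups)), 2)

# The pair of groups held out in each split
test_group_sets = [
    tuple(np.unique(groups[test_idx]).tolist())
    for _, test_idx in lpgo.split(X, y, groups=groups)
]

print(n_splits, expected, test_group_sets)
```

Because the count grows quickly, this iterator is usually practical only when the number of groups is small.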
4) Group shuffle split cross-validation
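# Import the required module
In [1]: from sklearn.model_selection import GroupShuffleSplit

# Prepare the sample data
In [2]: X = ["a", "b", "c", "d", "e", "f"]

# Prepare the class labels
In [3]: y = [0, 0, 1, 1, 1, 1]

# Prepare the group labels (one per sample)
In [4]: groups = [1, 1, 2, 2, 3, 3]

# Create the group shuffle split object
# (no random_state is set, so the exact splits vary from run to run)
In [5]: gss = GroupShuffleSplit(n_splits=3, test_size=0.5)

# Inspect the splits (training indices-test indices)
In [6]: for train, test in gss.split(X, y, groups=groups):
   ...:     print("%s-%s" % (train, test))
[4 5]-[0 1 2 3]
[0 1]-[2 3 4 5]
[4 5]-[0 1 2 3]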
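Because the groups are shuffled randomly, the splits can differ between runs. A minimal sketch of making them reproducible is to pass `random_state`; the helper `collect_splits` and the seed value 42 are assumptions made for this illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.array(["a", "b", "c", "d", "e", "f"])
y = np.array([0, 0, 1, 1, 1, 1])
groups = np.array([1, 1, 2, 2, 3, 3])


def collect_splits(seed):
    """Return all (train, test) index lists produced with the given seed."""
    gss = GroupShuffleSplit(n_splits=3, test_size=0.5, random_state=seed)
    return [
        (train.tolist(), test.tolist())
        for train, test in gss.split(X, y, groups=groups)
    ]


# Two runs with the same seed produce identical splits
first_run = collect_splits(seed=42)
second_run = collect_splits(seed=42)
print(first_run == second_run)

# Groups still never straddle the training and test sets
groups_disjoint = all(
    not (set(groups[train]) & set(groups[test])) for train, test in first_run
)
print(groups_disjoint)
```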