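Whether in a competition or in a real project, once the data has been processed you will never test with just one algorithm; comparing several is unavoidable. autoclf is a simple framework I hacked together in two days during the Alibaba payment risk competition.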
The directory structure is:

├── [4.0K] clf
│   │
│   ├── [4.0K] nn
│
├── [4.0K] data
│
├── [4.0K] pipe
│
└── [4.0K] saved
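The clf directory holds common or custom algorithms, while the nn directory holds custom deep learning algorithms that are bound to the sklearn interface. Custom algorithms are written as classes and only need to implement fit, score, and predict. If your own fit delegates training to sklearn, however, you do not need to define score and predict yourself. For example, see clf/isvc.py.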
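As a minimal sketch of the class interface described above, a custom algorithm only has to expose the three methods fit, score, and predict. The class below is a hypothetical illustration in pure Python, not the contents of clf/isvc.py; the name NearestCentroidClf and the toy data are assumptions made for the example.

```python
class NearestCentroidClf:
    """Hypothetical custom classifier exposing fit / predict / score."""

    def fit(self, X, y):
        # Accumulate per-class feature sums and counts, then average them.
        sums, counts = {}, {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, value in enumerate(row):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids_ = {
            label: [s / counts[label] for s in acc]
            for label, acc in sums.items()
        }
        return self

    def _squared_distance(self, row, label):
        centroid = self.centroids_[label]
        return sum((a - b) ** 2 for a, b in zip(row, centroid))

    def predict(self, X):
        # Assign each row to the class whose centroid is closest.
        return [
            min(self.centroids_, key=lambda label: self._squared_distance(row, label))
            for row in X
        ]

    def score(self, X, y):
        # Plain accuracy: fraction of rows predicted correctly.
        preds = self.predict(X)
        return sum(p == t for p, t in zip(preds, y)) / len(y)


# Toy data, for illustration only.
X_train = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
y_train = [0, 0, 1, 1]
clf = NearestCentroidClf().fit(X_train, y_train)
```

Any object with these three methods can be treated uniformly by a training script, which is what makes swapping algorithms cheap.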
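Then simply call it from train.py. The project is still being improved, and I plan to add a command-line interface. The most important parts of training, though, are undoubtedly data preprocessing and feature selection, and the data passed into train.py comes from the preprocessing functions defined in the pipe folder.

train.py

import os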
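Only the first line of train.py is shown here, so the following is a rough sketch of the idea rather than the actual script: fit every candidate on the same split and compare their scores. The function compare_classifiers and the two toy classes are hypothetical names introduced for this example, and the data is made up.

```python
def compare_classifiers(classifiers, x_train, y_train, x_test, y_test):
    """Fit every candidate on the same split; return (scores, best_name)."""
    scores = {}
    for name, clf in classifiers.items():
        clf.fit(x_train, y_train)
        scores[name] = clf.score(x_test, y_test)
    best_name = max(scores, key=scores.get)
    return scores, best_name


class AccuracyMixin:
    """Shared accuracy-based score so each toy model only defines fit/predict."""

    def score(self, X, y):
        preds = self.predict(X)
        return sum(p == t for p, t in zip(preds, y)) / len(y)


class MajorityClf(AccuracyMixin):
    """Toy baseline: always predicts the most frequent training label."""

    def fit(self, X, y):
        labels = list(y)
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class ThresholdClf(AccuracyMixin):
    """Toy model: predicts 1 when the first feature exceeds the training mean."""

    def fit(self, X, y):
        self.cut_ = sum(row[0] for row in X) / len(X)
        return self

    def predict(self, X):
        return [int(row[0] > self.cut_) for row in X]


# Made-up data standing in for the output of a pipe function.
x_train = [[0.1], [0.2], [0.3], [0.9], [0.8]]
y_train = [0, 0, 0, 1, 1]
x_test = [[0.15], [0.85], [0.95]]
y_test = [0, 1, 1]

scores, best_name = compare_classifiers(
    {"majority": MajorityClf(), "threshold": ThresholdClf()},
    x_train, y_train, x_test, y_test,
)
```

In the real project the four arrays would presumably come from a function in the pipe folder, and the best model would then be written to the saved directory.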
pipe/iload_aliaetc.py
import os
import sys

import pandas as pd
from sklearn.model_selection import train_test_split

train_data_path = 'data/atec_anti_fraud_train.csv'
predict_data_path = 'data/atec_anti_fraud_test_a.csv'
# Columns that are not features and are dropped before training.
DROPCOLUMS = ["id", "label", "date"]


# Labels: 0 is safe, 1 is not safe, -1 is unlabeled.
def iload_aliatec_pipe():
    """Load the training CSV and return x_train, y_train, x_test, y_test."""
    if os.path.isfile(train_data_path) and os.path.isfile(predict_data_path):
        print("[√] Path checked, files exist")
    else:
        print("[X] Please make sure your datasets exist")
        sys.exit(1)
    data = pd.read_csv(train_data_path)
    data = data.fillna(0)
    # Rows labeled -1 are set aside; only labeled rows are used below.
    unlabeled = data[data['label'] == -1]
    labeled = data[data['label'] != -1]
    train, test = train_test_split(labeled, test_size=0.2, random_state=42)
    cols = [c for c in DROPCOLUMS if c in train.columns]
    x_train = train.drop(cols, axis=1)
    cols = [c for c in DROPCOLUMS if c in test.columns]
    x_test = test.drop(cols, axis=1)
    y_train = train['label']
    y_test = test['label']
    return x_train, y_train, x_test, y_test


def iload_predict_data():
    """Load the submission CSV and return its ids and feature columns."""
    upload_test = pd.read_csv(predict_data_path)
    upload_test = upload_test.fillna(0)
    upload_id = upload_test['id']
    cols = [c for c in DROPCOLUMS if c in upload_test.columns]
    upload_test = upload_test.drop(cols, axis=1)
    return upload_id, upload_test


def isave_predict_data(data_id, predict, filename):
    """Write the ids and predicted scores side by side to a CSV file."""
    p = pd.DataFrame(predict, columns=["score"])
    res = pd.concat([data_id, p], axis=1)
    res.to_csv(filename, index=False)
    print("[+] Saved predict result to {} successfully".format(filename))
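Prediction lives in a separate predict.py file, which automatically loads the model produced by train.py, runs batch prediction, and saves the results to the corresponding folder.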
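predict.py itself is not shown, so here is a hedged sketch of how the load-then-batch-predict step could work. It assumes models are pickled into the saved directory with a .pkl extension; that file format and the helper names latest_model_path, load_model, and predict_in_batches are assumptions for illustration, not details taken from the project.

```python
import glob
import os
import pickle


def latest_model_path(saved_dir="saved", pattern="*.pkl"):
    """Return the most recently modified model file, or None if there is none."""
    paths = glob.glob(os.path.join(saved_dir, pattern))
    return max(paths, key=os.path.getmtime) if paths else None


def load_model(path):
    """Unpickle a model. Only load pickle files you created yourself."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def predict_in_batches(model, rows, batch_size=1000):
    """Call model.predict on fixed-size slices and concatenate the results."""
    out = []
    for start in range(0, len(rows), batch_size):
        out.extend(model.predict(rows[start:start + batch_size]))
    return out
```

Predicting in slices keeps memory use bounded on a large test file; the combined predictions can then be handed to isave_predict_data together with the ids.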
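P.S. 1,332 people have entered, yet so far only 36 have submitted results. The experts all seem to submit late, and I really am still a rookie. What puzzles me is that this competition runs its evaluation only once a day. In the earlier Tencent advertising competition every submission was scored immediately, which made tuning easy; this one does not work that way, and I do not understand why. Still, I already have a new idea for tackling it. The rows labeled -1 may not be the key, but they are certainly a point of attack.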