这里写自定义目录标题
任务一:比较两个sql给出的数据,检查是否有不一致
使用pandas进行处理与分析
数据特性
表1:353140
表2:356040
- 初步想法:
- 读取两个表,对比每一行是否一致
// An highlighted block
from pandas import DataFrame
import pandas as pd
df1=pd.read_csv('./Desktop/021_0210.csv',encoding='gb18030')
df2=pd.read_csv('./Desktop/fenxichouxuan.csv',encoding='gb18030')
#新增唯一列
df1['new_add1']=df1['interface_list_sn'].astype('str').str.cat(df1['classname'],sep='-')
df2['new_add1']=df2['interface_list_sn'].astype('str').str.cat(df2['classname'],sep='-')
dt1_name=df1['new_add1'].values.tolist()
dt2_name=df2['new_add1'].values.tolist()
count=0
for i in dt2_name:
if i in dt1_name:
dt1_row=df1.loc[df1['new_add1']==i]
dt2_row=df2.loc[df2['new_add1']==i]
dt1_row_=dt1_row.loc[:,dt1_row.columns.difference(['optime','pidflag','doc_type_ai_probability','rule_type','reasoncode','rule_reason'])]
dt2_row_=dt2_row.loc[:,dt2_row.columns.difference(['optime','pidflag','doc_type_ai_probability','rule_type','reasoncode','rule_reason'])]
if dt2_row_.equals(dt1_row_):
print('i=',i)
pass
else:
count +=1
dt1_row.to_csv(r'test.csv',index=False,mode='a',header=None)
dt2_row.to_csv(r'test.csv',index=False,mode='a',header=None)
else:
print("匹配失败的序列:",i)
print(count)
结果显示基本所有行都是不一致。检查原数据,发现以下问题。
- 表1中无数据以\N填充;表2中无数据以空字符串显示;
// An highlighted block
#将表2中的空字符串填充为\N
df2=df2.fillna(r'\N')
- 表1中‘optime’时间是datetime数据类型;表2中为object数据类型;
- 部分列表1和表2的内容一致,但文字顺序不一致,比较时显示不一致
- 表1的长数值以科学计数法展示,表2以正常整形展示,比较时不一致
#中间步骤
from pandas import DataFrame
import pandas as pd
df1=pd.read_csv('./Desktop/021_0210.csv',encoding='gb18030')
df2=pd.read_csv('./Desktop/fenxichouxuan.csv',encoding='gb18030')
#表2的空字符串填充为\N
df2=df2.fillna(r'\N')
#时间暂不参与比较
del df1['optime']
del df2['optime']
#新增唯一列
df1['new_add1']=df1['interface_list_sn'].astype('str').str.cat(df1['classname'],sep='-')
df2['new_add1']=df2['interface_list_sn'].astype('str').str.cat(df2['classname'],sep='-')
dt1_name=df1['new_add1'].values.tolist()
dt2_name=df2['new_add1'].values.tolist()
count=0
for i in dt2_name:
if i in dt1_name:
dt1_row=df1.loc[df1['new_add1']==i]
dt2_row=df2.loc[df2['new_add1']==i]
dt1_row_=dt1_row.loc[:,dt1_row.columns.difference(['optime','pidflag','doc_type_ai_probability','rule_type','reasoncode','rule_reason'])]
dt2_row_=dt2_row.loc[:,dt2_row.columns.difference(['optime','pidflag','doc_type_ai_probability','rule_type','reasoncode','rule_reason'])]
if dt2_row_.equals(dt1_row_):
print('i=',i)
pass
else:
count +=1
dt1_row.to_csv(r'test.csv',index=False,mode='a',header=None)
dt2_row.to_csv(r'test.csv',index=False,mode='a',header=None)
else:
print("匹配失败的序列:",i)
print(count)
未完…待更新
学习(数据清洗)
查看数据
查看数据是否的结构
df.dtypes
df.shape
判断是否有重复行
df.
处理数据
pandas的数据类型转换
pandas astype 将科学计数法转化成整形int
abc.astype(np.int64)
pandas:填充缺失值
pandas的多列拼接成一列函数.str.cat()
查看dataframe的重复数据
duplicated()方法判断
drop_duplicats()方法去重
声明:本文转自网络文章,转载此文章仅为分享知识,如有侵权,请联系博主进行删除。