任务一:比较两个sql给出的数据,检查是否有不一致

这里写自定义目录标题

任务一:比较两个sql给出的数据,检查是否有不一致

使用pandas进行处理与分析

数据特性

表1:353140
表2:3560
40

  1. 初步想法:
  • 读取两个表,对比每一行是否一致
// An highlighted block
from pandas import DataFrame
import pandas as pd
df1=pd.read_csv('./Desktop/021_0210.csv',encoding='gb18030')
df2=pd.read_csv('./Desktop/fenxichouxuan.csv',encoding='gb18030')
#新增唯一列
df1['new_add1']=df1['interface_list_sn'].astype('str').str.cat(df1['classname'],sep='-')
df2['new_add1']=df2['interface_list_sn'].astype('str').str.cat(df2['classname'],sep='-')
dt1_name=df1['new_add1'].values.tolist()
dt2_name=df2['new_add1'].values.tolist()
count=0
for i in dt2_name:
 if i in dt1_name:
     dt1_row=df1.loc[df1['new_add1']==i]
     dt2_row=df2.loc[df2['new_add1']==i]
     dt1_row_=dt1_row.loc[:,dt1_row.columns.difference(['optime','pidflag','doc_type_ai_probability','rule_type','reasoncode','rule_reason'])]
     dt2_row_=dt2_row.loc[:,dt2_row.columns.difference(['optime','pidflag','doc_type_ai_probability','rule_type','reasoncode','rule_reason'])]
     if dt2_row_.equals(dt1_row_):
         print('i=',i)
         pass
     else: 
         count +=1   
 dt1_row.to_csv(r'test.csv',index=False,mode='a',header=None)           
 dt2_row.to_csv(r'test.csv',index=False,mode='a',header=None)            
 else:
     print("匹配失败的序列:",i)
print(count)

结果显示基本所有行都是不一致。检查原数据,发现以下问题。

  1. 表1中无数据以\N填充;表2中无数据以空字符串显示;
// An highlighted block
#将表2中的空字符串填充为\N
df2=df2.fillna(r'\N')
  1. 表1中‘optime’时间是datetime数据类型;表2中为object数据类型;
  2. 部分列表1和表2的内容一致,但文字顺序不一致,比较时显示不一致
  3. 表1的长数值以科学计数法展示,表2以正常整形展示,比较时不一致
#中间步骤
from pandas import DataFrame
import pandas as pd
df1=pd.read_csv('./Desktop/021_0210.csv',encoding='gb18030')
df2=pd.read_csv('./Desktop/fenxichouxuan.csv',encoding='gb18030')
#表2的空字符串填充为\N
df2=df2.fillna(r'\N')
#时间暂不参与比较
del df1['optime']
del df2['optime']
#新增唯一列
df1['new_add1']=df1['interface_list_sn'].astype('str').str.cat(df1['classname'],sep='-')
df2['new_add1']=df2['interface_list_sn'].astype('str').str.cat(df2['classname'],sep='-')
dt1_name=df1['new_add1'].values.tolist()
dt2_name=df2['new_add1'].values.tolist()
count=0
for i in dt2_name:
 if i in dt1_name:
     dt1_row=df1.loc[df1['new_add1']==i]
     dt2_row=df2.loc[df2['new_add1']==i]
     dt1_row_=dt1_row.loc[:,dt1_row.columns.difference(['optime','pidflag','doc_type_ai_probability','rule_type','reasoncode','rule_reason'])]
     dt2_row_=dt2_row.loc[:,dt2_row.columns.difference(['optime','pidflag','doc_type_ai_probability','rule_type','reasoncode','rule_reason'])]
     if dt2_row_.equals(dt1_row_):
         print('i=',i)
         pass
     else: 
         count +=1   
 dt1_row.to_csv(r'test.csv',index=False,mode='a',header=None)           
 dt2_row.to_csv(r'test.csv',index=False,mode='a',header=None)            
 else:
     print("匹配失败的序列:",i)
print(count)

未完…待更新

学习(数据清洗)

查看数据

查看数据是否的结构
df.dtypes
df.shape
判断是否有重复行
df.

处理数据

pandas的数据类型转换
pandas astype 将科学计数法转化成整形int
abc.astype(np.int64)
pandas:填充缺失值
pandas的多列拼接成一列函数.str.cat()
查看dataframe的重复数据
duplicated()方法判断
drop_duplicats()方法去重

    原文作者:执着的我2222
    原文地址: https://blog.csdn.net/mnwlh_/article/details/123117558
    本文转自网络文章,转载此文章仅为分享知识,如有侵权,请联系博主进行删除。
点赞