So, I have two CSV files that I am trying to compare in order to find the items they have in common. The first file, hosts.csv, looks like this:
Path Filename Size Signature
C:\ a.txt 14kb 012345
D:\ b.txt 99kb 678910
C:\ c.txt 44kb 111213
The second file, masterlist.csv, looks like this:
Filename Signature
b.txt 678910
x.txt 111213
b.txt 777777
c.txt 999999
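As you can see, the rows do not line up, and masterlist.csv is always larger than hosts.csv. The only part I want to search on is the Signature column. I know the comparison would look something like this: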
hosts[3] == masterlist[1]
I am looking for a solution that gives me the following (basically the hosts.csv file with a new RESULTS column):
Path Filename Size Signature RESULTS
C:\ a.txt 14kb 012345 NOT FOUND in masterlist
D:\ b.txt 99kb 678910 FOUND in masterlist (row 1)
C:\ c.txt 44kb 111213 FOUND in masterlist (row 2)
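srgerg's answer is very inefficient, because it runs in quadratic time. Here is a linear-time solution instead, using syntax compatible with Python 2.6:

    import csv

    # Map each signature (second column of masterlist.csv) to its row number.
    with open('masterlist.csv', 'rb') as master:
        master_indices = dict((r[1], i) for i, r in enumerate(csv.reader(master)))

    with open('hosts.csv', 'rb') as hosts:
        with open('results.csv', 'wb') as results:
            reader = csv.reader(hosts)
            writer = csv.writer(results)

            # Copy the header row and append the new column name.
            writer.writerow(next(reader, []) + ['RESULTS'])

            for row in reader:
                index = master_indices.get(row[3])
                if index is not None:
                    # '{0}' rather than '{}': auto-numbered fields need Python 2.7+.
                    message = 'FOUND in master list (row {0})'.format(index)
                else:
                    message = 'NOT FOUND in master list'
                writer.writerow(row + [message])

This builds a dictionary that maps the signatures in masterlist.csv to row numbers. A dictionary lookup takes constant time, so the cost of the second loop over the rows of hosts.csv does not depend on the number of rows in masterlist.csv. Not to mention that the code is a lot simpler.

For those using Python 3, the above only needs the open() calls adjusted to open the files in text mode (remove the b from the file mode), and you will want to add newline='' so that the CSV reader can control the line separators. You may also want to state the encoding explicitly rather than rely on the system default (use encoding=...). The master_indices mapping can then be built with a dict comprehension: {r[1]: i for i, r in enumerate(csv.reader(master))}.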
Python's csv and collections modules (OrderedDict in particular) are really helpful here. You want to use an OrderedDict to preserve the order of the keys. You don't have to, but it is useful!

    import csv
    from collections import OrderedDict

    # Keyed by signature, in the order the rows appear in hosts.csv.
    signature_row_map = OrderedDict()

    with open('hosts.csv') as file_object:
        for line in csv.DictReader(file_object, delimiter='\t'):
            signature_row_map[line['Signature']] = {'line': line, 'found_at': None}

    with open('masterlist.csv') as file_object:
        # Number the data rows starting at 1.
        for i, line in enumerate(csv.DictReader(file_object, delimiter='\t'), 1):
            if line['Signature'] in signature_row_map:
                signature_row_map[line['Signature']]['found_at'] = i

    with open('newhosts.csv', 'w') as file_object:
        fieldnames = ['Path', 'Filename', 'Size', 'Signature', 'RESULTS']
        writer = csv.DictWriter(file_object, fieldnames, delimiter='\t')
        writer.writeheader()
        # values() works on both Python 2 and 3 (itervalues() is Python 2 only).
        for signature_info in signature_row_map.values():
            result = '{0} FOUND in masterlist {1}'
            # Explicit check for the None sentinel.
            if signature_info['found_at'] is not None:
                result = result.format('', '(row %s)' % signature_info['found_at'])
            else:
                result = result.format('NOT', '')
            payload = signature_info['line']
            # strip() removes the space left by the empty placeholder.
            payload['RESULTS'] = result.strip()
            writer.writerow(payload)

Here is the output using your test CSV files:

Path Filename Size Signature RESULTS
C:\ a.txt 14kb 012345 NOT FOUND in masterlist
D:\ b.txt 99kb 678910 FOUND in masterlist (row 1)
C:\ c.txt 44kb 111213 FOUND in masterlist (row 2)

Excuse the misalignment; the output is tab-separated :)
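The csv module is very useful for parsing CSV files. But just for fun, I am simply splitting the input on whitespace to get the data. Just parse the data and build a dict for the data in masterlist.csv, with the signature as the key and the line number as the value. Then, for each row of hosts.csv, we can query the dict to find out whether a corresponding entry exists in masterlist.csv and, if so, on which line.

    #!/usr/bin/env python

    def read_data(filename):
        # Skip the header line, then split every remaining line on whitespace.
        with open(filename, 'r') as input_source:
            input_source.readline()
            return [line.split() for line in input_source]

    if __name__ == '__main__':
        hosts = read_data('hosts.csv')
        masterlist = read_data('masterlist.csv')

        # Signature (last field) -> 1-based row number in masterlist.csv.
        master = dict()
        for index, data in enumerate(masterlist):
            master[data[-1]] = index + 1

        for row in hosts:
            try:
                found = "FOUND in masterlist (row %s)" % master[row[-1]]
            except KeyError:
                found = "NOT FOUND in masterlist"
            line = row + [found]
            # Parenthesised form runs on both Python 2 and Python 3.
            print("%s %s %s %s %s" % tuple(line))

I just fixed a small thing in Martijn Pieters' code so that it runs on Python 3. In this code I am trying to match the first-column element (row[0]) in file1 against the first-column element (row[0]) in file2.

    import csv

    # Map the first column of file1.csv to its row number.
    with open('file1.csv', 'rt', newline='', encoding='utf-8') as master:
        master_indices = dict((r[0], i) for i, r in enumerate(csv.reader(master)))

    with open('file2.csv', 'rt', newline='', encoding='utf-8') as hosts:
        with open('result.csv', 'w', newline='', encoding='utf-8') as results:
            reader = csv.reader(hosts)
            writer = csv.writer(results)

            # Copy the header row and append the new column name.
            writer.writerow(next(reader, []) + ['RESULTS'])

            for row in reader:
                index = master_indices.get(row[0])
                if index is not None:
                    message = 'FOUND in master list (row {})'.format(index)
                else:
                    message = 'NOT FOUND in master list'
                writer.writerow(row + [message])
            # No explicit close() needed: the with block closes the file.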
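Edit: While my solution works correctly, see Martijn's answer for a more efficient solution.
You can find the documentation for Python's csv module in the standard library reference.
What you are looking for is something like this: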
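The code that followed is not preserved in this copy. As a stand-in only, and not the original answer's code, here is a minimal sketch of a straightforward nested scan, the kind of quadratic-time approach the thread contrasts with the dictionary lookup. The function name `annotate_hosts_quadratic` and the in-memory row lists are assumptions made for illustration; the column positions follow the comparison `hosts[3] == masterlist[1]` from the question.

```python
def annotate_hosts_quadratic(host_rows, master_rows):
    """Return copies of host_rows with a RESULTS entry appended to each.

    Both arguments are lists of rows (lists of strings) without header rows.
    """
    annotated = []
    for host in host_rows:
        message = "NOT FOUND in masterlist"
        # Scan the master list from the top for every host row; in the worst
        # case this costs len(host_rows) * len(master_rows) comparisons.
        for row_number, master in enumerate(master_rows, 1):
            if host[3] == master[1]:
                message = "FOUND in masterlist (row {})".format(row_number)
                break  # stop at the first matching row
        annotated.append(host + [message])
    return annotated


# Sample data taken from the question.
hosts = [
    ["C:\\", "a.txt", "14kb", "012345"],
    ["D:\\", "b.txt", "99kb", "678910"],
    ["C:\\", "c.txt", "44kb", "111213"],
]
masterlist = [
    ["b.txt", "678910"],
    ["x.txt", "111213"],
    ["b.txt", "777777"],
    ["c.txt", "999999"],
]

for row in annotate_hosts_quadratic(hosts, masterlist):
    print(" ".join(row))
```

For the sample data this reproduces the RESULTS column requested in the question. Unlike the dictionary version, it reports the first matching row when a signature appears more than once.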