
So, I have two CSV files that I am trying to compare in order to get the results for the matching items. The first file, hosts.csv, looks like this:

Path    Filename    Size    Signature
C:\     a.txt       14kb    012345
D:\     b.txt       99kb    678910
C:\     c.txt       44kb    111213

The second file, masterlist.csv, looks like this:

Filename    Signature
b.txt       678910
x.txt       111213
b.txt       777777
c.txt       999999

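As you can see, the rows do not line up, and masterlist.csv is always larger than hosts.csv. The only part I want to search on is the Signature column. I know this would look something like: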

hosts[3] == masterlist[1]

I am looking for a solution that gives me something like the following (basically the hosts.csv file with a new RESULTS column):

Path    Filename    Size    Signature    RESULTS
C:\     a.txt       14kb    012345       NOT FOUND in masterlist
D:\     b.txt       99kb    678910       FOUND in masterlist (row 1)
C:\     c.txt       44kb    111213       FOUND in masterlist (row 2)

I have searched the posts and found something similar to 200

python

csv

compare

Latest answers
  • 8 months ago
    1 #

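    Edit: While my solution works correctly, have a look at Martijn's answer below for a more efficient solution.

    See the documentation for the Python csv module.

    What you are looking for is something like this: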

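    import csv

    # open() works on both Python 2 and 3; file() exists only in Python 2
    f1 = open('hosts.csv', 'r')
    f2 = open('masterlist.csv', 'r')
    f3 = open('results.csv', 'w')
    c1 = csv.reader(f1)
    c2 = csv.reader(f2)
    c3 = csv.writer(f3)
    # keep the master list in memory so it can be rescanned for every hosts row
    masterlist = list(c2)
    for hosts_row in c1:
        row = 1
        found = False
        # assigned before the inner loop so it exists even if masterlist is empty
        results_row = hosts_row
        for master_row in masterlist:
            # compare the hosts signature (column 3) with the master signature (column 1)
            if hosts_row[3] == master_row[1]:
                results_row.append('FOUND in master list (row ' + str(row) + ')')
                found = True
                break
            row = row + 1
        if not found:
            results_row.append('NOT FOUND in master list')
        c3.writerow(results_row)
    f1.close()
    f2.close()
    f3.close()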
    

  • 8 months ago
    2 #

    srgerg's answer is terribly inefficient, as it runs in quadratic time; here is a linear-time solution instead, using syntax compatible with Python 2.6:

    import csv
    with open('masterlist.csv', 'rb') as master:
        master_indices = dict((r[1], i) for i, r in enumerate(csv.reader(master)))
    with open('hosts.csv', 'rb') as hosts:
        with open('results.csv', 'wb') as results:    
            reader = csv.reader(hosts)
            writer = csv.writer(results)
            writer.writerow(next(reader, []) + ['RESULTS'])
            for row in reader:
                index = master_indices.get(row[3])
                if index is not None:
                    message = 'FOUND in master list (row {})'.format(index)
                else:
                    message = 'NOT FOUND in master list'
                writer.writerow(row + [message])
    

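    This produces a dictionary mapping the signatures from masterlist.csv to a row number. Lookups in a dictionary take constant time, so the second loop, over the hosts.csv rows, is independent of the number of rows in masterlist.csv. Not to mention that the code is a lot simpler.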
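To make the mapping concrete, here is a minimal in-memory sketch of the index described above. The names `MASTER_ROWS`, `build_index` and `build_first_match_index` are hypothetical and not part of the answer; the rows are copied from the masterlist.csv sample in the question. It also illustrates one consequence of using a dict: when a signature appears more than once, the last row number wins unless the first one is kept explicitly.

```python
# Hypothetical sample rows, copied from the masterlist.csv shown in the question.
MASTER_ROWS = [
    ["Filename", "Signature"],
    ["b.txt", "678910"],
    ["x.txt", "111213"],
    ["b.txt", "777777"],
    ["c.txt", "999999"],
]


def build_index(rows, key_column=1):
    """Map each signature to its row number; the header row counts as row 0.

    If a signature occurs more than once, the last occurrence wins,
    because later keys overwrite earlier ones in a dict.
    """
    return {row[key_column]: i for i, row in enumerate(rows)}


def build_first_match_index(rows, key_column=1):
    """Same mapping, but keep the first row number seen for each signature."""
    index = {}
    for i, row in enumerate(rows):
        # setdefault only stores the value if the key is not present yet
        index.setdefault(row[key_column], i)
    return index


master_indices = build_index(MASTER_ROWS)
print(master_indices.get("678910"))  # row number of the first data row
print(master_indices.get("012345"))  # None: signature is not in the master list
```

Each `get()` call is a single hash lookup, which is why the total work grows with the sum of the two file sizes rather than their product.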

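    For those using Python 3, the above only needs the open() calls adjusted to open the files in text mode (remove the b from the file mode), and you will want to add newline='' so that the CSV reader has control over the line separators. You may want to state the encoding explicitly rather than rely on your system default (use encoding=...). The master_indices mapping can be built with a dict comprehension ({r[1]: i for i, r in enumerate(csv.reader(master))}).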
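Put together, the Python 3 adjustments might look like the sketch below. The function name `annotate_hosts` and its parameters are hypothetical, and the sketch assumes comma-separated files with a header row, with the signature in column 3 of the hosts file and column 1 of the master list, as in the code above.

```python
import csv


def annotate_hosts(hosts_path, master_path, results_path, encoding="utf-8"):
    """Copy the hosts file to results_path, adding a RESULTS column."""
    # Text mode with newline='' lets the csv module handle line endings itself.
    with open(master_path, newline="", encoding=encoding) as master:
        # Skip rows too short to hold a signature (for example blank lines).
        master_indices = {
            r[1]: i for i, r in enumerate(csv.reader(master)) if len(r) > 1
        }

    with open(hosts_path, newline="", encoding=encoding) as hosts, \
            open(results_path, "w", newline="", encoding=encoding) as results:
        reader = csv.reader(hosts)
        writer = csv.writer(results)
        # Copy the header row and append the new column name.
        writer.writerow(next(reader, []) + ["RESULTS"])
        for row in reader:
            index = master_indices.get(row[3])
            if index is not None:
                message = "FOUND in master list (row {})".format(index)
            else:
                message = "NOT FOUND in master list"
            writer.writerow(row + [message])
```

A call such as `annotate_hosts('hosts.csv', 'masterlist.csv', 'results.csv')` would then produce the annotated file. If the real files are tab-separated, passing `delimiter='\t'` to both `csv.reader` and `csv.writer` should be the only further change needed.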

  • 8 months ago
    3 #

    Python's csv and collections modules (specifically OrderedDict) are really helpful here. You want to use OrderedDict to preserve the order of the keys and so on. You don't have to, but it's useful!

    import csv
    from collections import OrderedDict

    # signature -> {'line': hosts row, 'found_at': masterlist row number or None}
    signature_row_map = OrderedDict()

    with open('hosts.csv') as file_object:
        for line in csv.DictReader(file_object, delimiter='\t'):
            signature_row_map[line['Signature']] = {'line': line, 'found_at': None}

    with open('masterlist.csv') as file_object:
        # count from 1 so that the first data row is reported as row 1
        for i, line in enumerate(csv.DictReader(file_object, delimiter='\t'), 1):
            if line['Signature'] in signature_row_map:
                signature_row_map[line['Signature']]['found_at'] = i

    with open('newhosts.csv', 'w') as file_object:
        fieldnames = ['Path', 'Filename', 'Size', 'Signature', 'RESULTS']
        writer = csv.DictWriter(file_object, fieldnames, delimiter='\t')
        writer.writeheader()
        # values() works on Python 2 and 3; itervalues() is Python 2 only
        for signature_info in signature_row_map.values():
            # explicit check for sentinel
            if signature_info['found_at'] is not None:
                result = 'FOUND in masterlist (row %s)' % signature_info['found_at']
            else:
                result = 'NOT FOUND in masterlist'
            payload = signature_info['line']
            payload['RESULTS'] = result
            writer.writerow(payload)

    Here is the output using your test CSV files:

    Path    Filename    Size    Signature    RESULTS
    C:\     a.txt       14kb    012345       NOT FOUND in masterlist
    D:\     b.txt       99kb    678910       FOUND in masterlist (row 1)
    C:\     c.txt       44kb    111213       FOUND in masterlist (row 2)

    Please forgive the misalignment, it is tab-separated :)

  • 8 months ago
    4 #

    The csv module is very useful when parsing csv files. But just for fun, I am simply splitting the input on whitespace to get the data.

    Just parse the data and build a dict for the data in masterlist.csv, with the signature as the key and the row number as the value. Now, for each row of hosts.csv, we can simply query the dict to find out whether a corresponding entry exists in masterlist.csv and, if so, on which row.

    #! /usr/bin/env python
    def read_data(filename):
        # skip the header line, then split every remaining line on whitespace
        with open(filename, 'r') as input_source:
            input_source.readline()
            return [line.split() for line in input_source]

    if __name__ == '__main__':
        hosts = read_data('hosts.csv')
        masterlist = read_data('masterlist.csv')
        # signature (last column) -> 1-based row number in masterlist.csv
        master = dict()
        for index, data in enumerate(masterlist):
            master[data[-1]] = index + 1
        for row in hosts:
            try:
                found = "FOUND in masterlist (row %s)" % master[row[-1]]
            except KeyError:
                found = "NOT FOUND in masterlist"
            line = row + [found]
            # parenthesised form runs on both Python 2 and 3
            print("%s %s %s %s %s" % tuple(line))

  • 8 months ago
    5 #

    I just fixed a small issue in Martijn Pieters' code so that it runs in Python 3. In this code I am trying to match the first-column elements of file1 (row[0]) against the first-column elements of file2 (row[0]).

    import csv
    with open('file1.csv', 'rt', newline='', encoding='utf-8') as master:
        # first-column value -> row number in file1.csv (header is row 0)
        master_indices = dict((r[0], i) for i, r in enumerate(csv.reader(master)))
    with open('file2.csv', 'rt', newline='', encoding='utf-8') as hosts:
        with open('result.csv', 'w', newline='') as results:
            reader = csv.reader(hosts)
            writer = csv.writer(results)
            writer.writerow(next(reader, []) + ['RESULTS'])
            for row in reader:
                index = master_indices.get(row[0])
                if index is not None:
                    message = 'FOUND in master list (row {})'.format(index)
                else:
                    message = 'NOT FOUND in master list'
                writer.writerow(row + [message])
    # no explicit close() is needed: the with blocks close the files
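Since the answers differ mainly in which columns they compare (column 3 against column 1 in the question, column 0 against column 0 here), the lookup can be sketched with the columns as parameters. The function `annotate_rows` and its `host_col` and `master_col` parameters are hypothetical names for illustration. The sketch works on rows already read into lists and, unlike the code above, leaves the master header row out of the index.

```python
def annotate_rows(host_rows, master_rows, host_col=0, master_col=0):
    """Return the host rows with a trailing FOUND / NOT FOUND message.

    Both inputs are lists of rows whose first row is a header.
    Row numbers count data rows of master_rows starting at 1.
    """
    if not host_rows:
        return []
    # Skip the header (i == 0) so that a column title can never match.
    index = {row[master_col]: i for i, row in enumerate(master_rows) if i > 0}
    annotated = [host_rows[0] + ["RESULTS"]]
    for row in host_rows[1:]:
        found_at = index.get(row[host_col])
        if found_at is not None:
            message = "FOUND in master list (row {})".format(found_at)
        else:
            message = "NOT FOUND in master list"
        annotated.append(row + [message])
    return annotated
```

With the sample data from the question, `annotate_rows(hosts, master, host_col=3, master_col=1)` compares the Signature columns, while the defaults reproduce the first-column comparison described in this answer.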
