Double-counting the documents containing an item #10

@jianle4github


If an item, for example "Bourgogne", appears multiple times in a given document, such as "Coche-Dury Bourgogne Chardonay 2005, Bourgogne, France", your algorithm appends the same document to the item's IrIndex.index and IrIndex.tf lists once per occurrence. These duplicate entries inflate the number of documents counted as containing the item, which distorts the following calculation:

idf = log( float( len(self.documents) ) / float( len(self.tf[term]) ) )
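To make the distortion concrete, here is a minimal sketch with a hypothetical corpus of three documents in which only one document contains the item, twice. The corpus size and the list contents are assumptions chosen for illustration; only the idf expression comes from the code above.

```python
from math import log

# Hypothetical corpus: 3 documents, only one of which contains the item (twice).
num_documents = 3

# With one append per occurrence, the item's tf list gets two entries
# for the same document, so that document is counted twice.
tf_with_duplicates = [2, 2]

# With one entry per (item, document) pair, the list has the intended length.
tf_deduplicated = [2]

idf_distorted = log(float(num_documents) / float(len(tf_with_duplicates)))
idf_expected = log(float(num_documents) / float(len(tf_deduplicated)))

print(idf_distorted)  # log(3 / 2)
print(idf_expected)   # log(3 / 1)
```

The duplicated entry makes the item look more common than it is, so its idf comes out lower than it should.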

I changed the code from:

    for term in terms:
        if term not in self.index:
            self.index[term] = []
            self.tf[term] = []

        self.index[term].append(document_pos)
        self.tf[term].append(terms.count(term))

to:

    for term in terms:
        if term not in self.index:
            self.index[term] = []
            self.tf[term] = []

        if document_pos not in self.index[term]:
            self.index[term].append(document_pos)
            self.tf[term].append(terms.count(term))

The added check skips the subsequent append operations when an item and its containing document are already recorded inside the IrIndex object, so each document is counted once per item.
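A possible alternative, sketched below, is to iterate over the distinct terms of a document instead of testing list membership on every occurrence. `MiniIndex`, `add_document`, `idf`, and the whitespace tokenization are hypothetical stand-ins of mine, not the project's actual API. Only the `documents`, `index`, and `tf` fields and the idf expression mirror the code quoted in this issue.

```python
from collections import Counter
from math import log


class MiniIndex:
    """Hypothetical stand-in for IrIndex, limited to the fields named in this issue."""

    def __init__(self):
        self.documents = []
        self.index = {}  # term -> list of document positions
        self.tf = {}     # term -> list of counts, parallel to index[term]

    def add_document(self, text):
        document_pos = len(self.documents)
        self.documents.append(text)
        # Assumed tokenization: split on whitespace, strip commas, lowercase.
        terms = [t.strip(",").lower() for t in text.split()]
        # Counter yields each distinct term once, together with its count,
        # so every (term, document) pair is appended exactly once.
        for term, count in Counter(terms).items():
            self.index.setdefault(term, []).append(document_pos)
            self.tf.setdefault(term, []).append(count)

    def idf(self, term):
        return log(float(len(self.documents)) / float(len(self.tf[term])))


idx = MiniIndex()
idx.add_document("Coche-Dury Bourgogne Chardonay 2005, Bourgogne, France")
idx.add_document("Some other wine, Italy")

print(idx.index["bourgogne"])  # one entry for document 0
print(idx.tf["bourgogne"])     # one count (2) for that document
```

Counting once per document also avoids calling `terms.count(term)` for every occurrence, though whether that matters depends on document length.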
