Deduplication is the process of identifying duplicate documents and accounting for them when the data is analyzed. Multiple copies of a particular posting are often scraped from various sources on the internet. So that these duplicates do not artificially inflate the posting count, job postings data is deduplicated before it is presented for analysis.
Two postings that are duplicates are usually not exactly identical. The deduplication process uses a statistical classifier that has been trained to detect duplicates based on comparison of a number of fields in the postings such as location, job title, company name, and similarity of posting text.
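As a rough illustration of the kind of field comparison such a classifier might consume, the sketch below scores how closely three fields of two postings match. The field names, the sample postings, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions made for this example; the actual features and the trained classifier are not described here.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def comparison_features(p1: dict, p2: dict) -> dict:
    """Build one similarity score per compared field (hypothetical field names)."""
    fields = ("location", "job_title", "company_name")
    return {f: field_similarity(p1[f], p2[f]) for f in fields}


# Made-up postings that differ only in capitalization and a title abbreviation.
first = {"location": "Austin, TX", "job_title": "Senior Data Engineer", "company_name": "Acme Corp"}
second = {"location": "austin, tx", "job_title": "Sr. Data Engineer", "company_name": "Acme Corp"}

features = comparison_features(first, second)
print(features)
```

Scores like these, together with a text-similarity score, could be passed to a classifier as inputs; deciding how to weigh them is what the training step would determine.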
Similarity of posting text is detected using shingling, a technique that breaks a block of text into overlapping sequences of consecutive words (shingles) and measures how many of those sequences two texts share. For instance, given the sentence “the quick brown fox jumped over the lazy dog,” the following four-word shingles, among others, might be evaluated:
- the quick brown fox
- quick brown fox jumped
- brown fox jumped over
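The shingles listed above can be generated with a short sliding-window function. This is a minimal sketch: the function name `word_shingles`, the four-word window, and the lowercasing and punctuation stripping are assumptions for illustration, not a description of the production tokenizer.

```python
def word_shingles(text: str, size: int = 4) -> list[str]:
    """Return every run of `size` consecutive words in `text`, in order."""
    # Lowercase and strip surrounding punctuation so "dog," and "dog" match.
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    # Slide a window of `size` words across the text, one word at a time.
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


sentence = "the quick brown fox jumped over the lazy dog"
shingles = word_shingles(sentence)
for s in shingles:
    print(s)
```

For the nine-word example sentence this yields six shingles, the first three of which are the ones listed above.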
The shingles of potential duplicate postings are compared, and a similarity index is assigned based on how closely the two sets of shingles match. Whether that index meets a textual shingling threshold, taken together with the comparison of the other fields checked by the statistical classifier, gives a reliable indicator of whether two postings duplicate each other.
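One common way to turn shingle overlap into a single index is the Jaccard coefficient: the number of shingles two texts share divided by the number of distinct shingles across both. The sketch below assumes that measure, a hypothetical threshold of 0.5, and made-up posting text; the index and threshold actually used are not specified here.

```python
def shingle_set(text: str, size: int = 4) -> set[str]:
    """Set of all runs of `size` consecutive lowercase words in `text`."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared shingles divided by all distinct shingles (0.0 when both sets are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


SHINGLE_THRESHOLD = 0.5  # hypothetical value, chosen only for this example

posting_a = "We are hiring a senior data engineer to build pipelines in Austin"
posting_b = "We are hiring a senior data engineer to build pipelines in Austin TX"
posting_c = "Seeking a part time barista for our downtown coffee shop"

score_ab = jaccard(shingle_set(posting_a), shingle_set(posting_b))
score_ac = jaccard(shingle_set(posting_a), shingle_set(posting_c))

print(score_ab, score_ab >= SHINGLE_THRESHOLD)
print(score_ac, score_ac >= SHINGLE_THRESHOLD)
```

Here the first pair shares nine of ten distinct shingles and clears the threshold, while the second pair shares none. A text score alone would not settle the question; it would be one input alongside the field comparisons.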
Duplicate postings are stored and tracked as such along with original postings, ensuring that both total and unique (deduplicated) posting counts are available.
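Keeping duplicates in the data set, each marked with the original it duplicates, lets both counts be derived from the same records. The following sketch assumes a hypothetical `duplicate_of` field that is empty for an original posting and holds the original's identifier for a duplicate; the real storage schema may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Posting:
    posting_id: str
    # Hypothetical link to the original posting; None means this is an original.
    duplicate_of: Optional[str] = None


postings = [
    Posting("p1"),
    Posting("p2", duplicate_of="p1"),
    Posting("p3"),
    Posting("p4", duplicate_of="p1"),
]

# Total count includes every stored posting, duplicates and originals alike.
total_count = len(postings)
# Unique (deduplicated) count includes only postings not marked as duplicates.
unique_count = sum(1 for p in postings if p.duplicate_of is None)

print(total_count, unique_count)
```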