This article outlines the creation of Emsi’s job postings data, from the collection of postings to enrichment of the data.
It is important to note that job postings are not necessarily the same as job vacancies; there is a correlation, but many recruitment practices make it an imperfect relationship. Job postings are a measure of recruitment marketing by employers purportedly looking to fill job vacancies.
Emsi’s Canadian job postings data is gathered by scraping over 15,000 websites, including company career sites, national and local job boards, and job posting aggregators. Postings for roughly 290,000 companies are scraped.
Users often ask about the absence of postings from LinkedIn and Indeed in Emsi’s job postings. Both sources have asked that their sites not be scraped for postings; therefore Emsi does not collect or display postings from either source.
Job postings are checked for likely duplicates, which are collapsed into a single posting when enough data is present to identify them. Deduplication is the process of identifying duplicate job postings that are connected to the same vacancy. Multiple copies of a particular posting are often scraped from various sources on the internet. Rather than allowing these duplicates to artificially inflate the posting count, Emsi deduplicates the data before presenting it for analysis.
The deduplication process uses a statistical classifier, a machine learning model trained to determine whether two job postings are duplicates. Duplicate postings are usually not exactly identical, so the classifier compares a number of fields in the postings, including location, job title, similarity of posting text, contact information in the posting, and company name. Duplicate job postings posted in separate cities are not deduplicated and appear as multiple postings.
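As a rough illustration of the field-by-field comparison described above, the following Python sketch builds comparison features for a pair of postings. It is not Emsi's classifier: the `Posting` fields, the use of `difflib` for text similarity, the equal-weight average, and the 0.8 threshold are all assumptions chosen for the example. A trained model would learn its own weighting from labeled pairs.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Posting:
    # Hypothetical record layout; the real schema is not described in this article.
    title: str
    company: str
    city: str
    contact: str
    text: str


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1], using difflib as a stand-in."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_features(a: Posting, b: Posting) -> dict[str, float]:
    """Compare the fields the article lists and return one feature per field."""
    return {
        "same_city": float(a.city.lower() == b.city.lower()),
        "same_company": float(a.company.lower() == b.company.lower()),
        "same_contact": float(a.contact.lower() == b.contact.lower()),
        "title_similarity": similarity(a.title, b.title),
        "text_similarity": similarity(a.text, b.text),
    }


def is_probable_duplicate(a: Posting, b: Posting, threshold: float = 0.8) -> bool:
    """Hand-set scoring rule standing in for the trained classifier (assumption)."""
    f = pair_features(a, b)
    # Postings in separate cities are never merged, per the article.
    if not f["same_city"]:
        return False
    score = (
        f["same_company"]
        + f["same_contact"]
        + f["title_similarity"]
        + f["text_similarity"]
    ) / 4
    return score >= threshold
```

Treating the city match as a hard gate rather than one more feature mirrors the rule that postings in separate cities remain separate.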
Duplicate postings are stored and tracked along with original postings, ensuring that both total and unique (deduplicated) posting counts are available.
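The difference between the two counts can be sketched in a few lines of Python. The example assumes each scraped posting already carries an identifier for the vacancy it was matched to; the name `vacancy_ids` and the return format are hypothetical.

```python
from collections import Counter


def posting_counts(vacancy_ids: list[str]) -> dict[str, int]:
    """Return total and unique (deduplicated) posting counts.

    Each element of vacancy_ids is the vacancy one scraped posting was
    matched to, so duplicates share the same identifier.
    """
    per_vacancy = Counter(vacancy_ids)
    return {
        "total": sum(per_vacancy.values()),  # every scraped copy
        "unique": len(per_vacancy),          # one per vacancy
    }
```

For example, three copies of one posting plus a single posting for another vacancy give a total count of 4 and a unique count of 2.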
Deduplication Over Time
In addition to the deduplication process described above, job postings are deduplicated over time to account for new postings appearing for the same vacancy after the other postings for the vacancy expired.
A vacancy is considered expired or closed when there are zero active postings for it among all of its duplicate postings. For instance, a vacancy with three total postings is considered expired when all three associated postings are no longer active. However, a new posting can appear for a vacancy after it has expired. In these cases, if the new posting appears within six weeks of the vacancy’s expiration, Emsi revives the vacancy and counts the new posting as another duplicate. Job postings more than six weeks apart will not be considered potential duplicates if all prior postings have expired.
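The expiry and revival rules above can be expressed as a short Python sketch. The function names, the use of `None` to mark a posting that is still active, and the inclusive treatment of the six-week boundary are assumptions made for illustration; the article does not specify how the boundary day is handled.

```python
from datetime import date, timedelta
from typing import Optional

# Six-week window taken from the article.
REVIVAL_WINDOW = timedelta(weeks=6)


def vacancy_expiry(posting_end_dates: list[Optional[date]]) -> Optional[date]:
    """Return the date a vacancy expired, or None if it is still open.

    Each entry is the date one duplicate posting went inactive, with None
    meaning the posting is still active. Assumes at least one posting.
    """
    if any(end is None for end in posting_end_dates):
        return None  # at least one active posting, so the vacancy is open
    return max(posting_end_dates)  # expired when the last posting closed


def should_revive(expired_on: date, new_posted_on: date) -> bool:
    """True if a new posting falls within six weeks after expiry."""
    gap = new_posted_on - expired_on
    return timedelta(0) <= gap <= REVIVAL_WINDOW
```

Under these assumptions, a vacancy whose three postings closed on 10, 15, and 20 January expires on 20 January. A matching posting on 20 February would revive it, while one on 10 March would be treated as a new vacancy.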
Once the postings data is scraped and deduplicated, it undergoes further enrichment and cleaning.
Company Normalization and Metadata
Each job posting is assigned a company (advertiser) based on the text of the posting. The company record includes the normalized company name, NAICS (industry) code, company size, company location, whether the company is a staffing company, and other information. All subsidiary entities are reported as the top-level corporate enterprise.
Emsi assigns an education level to each posting using a machine learning model to detect the presence of required or preferred education levels. If more than one education level is mentioned, the posting will be tagged with all levels mentioned. Potential values include Unspecified, High School/GED, Associate’s Degree, Bachelor’s Degree, Master’s Degree, or Ph.D./Professional Degree.
Postings are tagged as full-time (more than 32 hours) or part-time (32 hours or less). If the posting does not specify, full-time is assumed.
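The tagging rule above reduces to a small function. This Python sketch assumes the hours have already been extracted from the posting as a number, with `None` standing for a posting that does not specify them. The article does not state the period the 32 hours refer to; the function simply applies the threshold to whatever figure is supplied.

```python
from typing import Optional

# Threshold taken from the article: more than 32 hours is full-time.
FULL_TIME_THRESHOLD_HOURS = 32


def classify_hours(hours: Optional[float]) -> str:
    """Tag a posting as 'full-time' or 'part-time' from its stated hours."""
    if hours is None:
        return "full-time"  # unspecified postings default to full-time
    return "full-time" if hours > FULL_TIME_THRESHOLD_HOURS else "part-time"
```

A posting stating exactly 32 hours is tagged part-time, since full-time requires more than 32.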
The years of experience required for the position are captured where available.
City information is usually present in the postings and is easily retrieved during the collection process. This location represents the location of the posting and may not represent the location of the job vacancy. It is not uncommon for companies to post a job in other markets to attract talent.
Emsi also maps postings to traditional CMAs using a mapping that links CMAs to the cities listed in job postings. A similar process is used to map city-states to Census Divisions and Census Subdivisions.
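A mapping of this kind can be pictured as a lookup table keyed on the city and province found in the posting. The Python sketch below is illustrative only: the table name, the key format, and the four sample rows are assumptions, and a real mapping would cover far more places and handle alternate spellings.

```python
from typing import Optional

# Illustrative sample rows only; keys are (lowercased city, province code).
CITY_TO_CMA = {
    ("toronto", "ON"): "Toronto",
    ("mississauga", "ON"): "Toronto",
    ("surrey", "BC"): "Vancouver",
    ("laval", "QC"): "Montréal",
}


def cma_for(city: str, province: str) -> Optional[str]:
    """Return the CMA for a posting's city, or None if it is not mapped."""
    key = (city.strip().lower(), province.strip().upper())
    return CITY_TO_CMA.get(key)
```

Normalizing case and whitespace before the lookup keeps minor formatting differences in scraped city names from causing missed matches. Cities outside the table return `None` rather than a guess.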
Read more about Emsi skills here.