This article outlines the creation of Lightcast’s job postings data, from the collection of postings to enrichment of the data.
It is important to note that job postings are not necessarily the same as job vacancies; there is a correlation, but many recruitment practices make it an imperfect relationship. Job postings are a measure of recruitment marketing by employers purportedly looking to fill job vacancies.
The methodology used to obtain job advertisements from online job boards and company websites is based on Lightcast's advanced scraping technology. Once Lightcast identifies an online site as a valid source of employment opportunities, a dedicated spider is programmed, tested, and activated. The spider visits the site regularly and pulls job information for all jobs posted; the information is then stored in a database. The sites with the newest jobs or with the highest frequency of change in postings are visited most frequently. Lightcast currently scrapes more than 50,000 sites worldwide.
Lightcast’s database is a comprehensive reflection of job listings posted across the Internet; as such, robust processes are required to identify and remove duplicate listings. Lightcast applies a two-step approach to deduplication that removes up to 80% of the jobs collected as duplicates. The initial deduplication screen operates at the source level: intelligence built into the spiders themselves identifies, and refrains from collecting, records that have previously been aggregated. However, because duplicates can also occur across sources, the next phase is a thorough, ongoing analysis of the full database of aggregated content. This deduplication analysis is possible because advanced parsing engines extract and normalize a number of data elements from each job listing (e.g. job title, job ID, source, posting date, employer name, location), each of which can function as a duplicate screen individually or in concert with other variables.
For deduplication, Lightcast uses a 60-day rule to identify duplicates. For example, if a job for a Marketing Specialist at Google is posted for the first time on March 1st, Lightcast considers this the ‘original posting’ and removes all possible duplicates for the next 60 days. If Google continues to actively post the Marketing Specialist ad after 60 days, around May 1st, Lightcast counts the ad as a new posting and begins tracking a new 60-day window. In theory, if Google posts the same ad every day for an entire year, Lightcast will count it 6 times.
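The 60-day rule described above can be sketched as a simple windowing pass over the dates on which the same job key was seen. This is an illustrative stand-in, not Lightcast's production deduplication logic; the function name and the strict "more than 60 days elapsed" threshold are assumptions for the sketch.

```python
from datetime import date, timedelta

def count_new_postings(seen_dates, window_days=60):
    """Count distinct postings under a 60-day rule: the first sighting of a
    job starts a window; repeats inside the window are duplicates, and the
    first sighting more than window_days later counts as a new posting.
    (Hypothetical sketch, not Lightcast's implementation.)"""
    count = 0
    window_start = None
    for d in sorted(seen_dates):
        if window_start is None or (d - window_start).days > window_days:
            count += 1          # original posting: start a new window
            window_start = d
    return count

# The same ad re-posted every day for a full year counts 6 times,
# matching the Marketing Specialist example above.
daily = [date(2024, 1, 1) + timedelta(days=i) for i in range(365)]
```

Under this sketch, all re-posts inside a window collapse into the original posting, so a posting repeated a few times in March and April still counts once.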
Once postings are collected, Lightcast technologies parse, extract, and code dozens of data elements, including the following: Lightcast job title, Occupation, Company, detailed data about the specific skills, educational credentials, certifications, experience levels, and work activities required for a specific job, as well as data about salary, number of openings, and job type. This high level of detail enables users to look beyond summary statistics to discover specific skills in demand and skills that job seekers can identify and acquire if needed.
Job postings are considered inactive when they reach either an age-based or a liveliness-based expiration: a posting expires at 60 days, or earlier if it has been removed from the original source. Even if a posting has not been removed from the original source at 60 days, it is considered expired and becomes inactive. For this reason, a posting may reappear as Newly Posted within the data if it is picked up at aggregation again after the 60-day expiration.
Newly Posted measures all postings that were posted in that month.
Active measures how many postings were live during that month (even if originally posted in a previous month but left active by the employer).
Active postings give a good view of the total open demand present in a given month, while Newly Posted gives a better view of the behavior of the market in a given month and over time.
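The distinction between the two measures can be sketched as a pair of date-range checks over (posted, expired) pairs. The function and its arguments are hypothetical illustrations, not Lightcast's actual computation.

```python
from datetime import date

def monthly_metrics(postings, year, month):
    """Given (posted, expired) date pairs, return (newly_posted, active) for
    a month: Newly Posted counts postings posted in that month; Active counts
    postings live at any point during that month. (Illustrative sketch.)"""
    month_start = date(year, month, 1)
    next_month = date(year + (month == 12), month % 12 + 1, 1)
    newly = sum(1 for posted, _ in postings
                if month_start <= posted < next_month)
    active = sum(1 for posted, expired in postings
                 if posted < next_month and expired >= month_start)
    return newly, active
```

Note that every Newly Posted posting is also Active in its posting month, so Active is always at least as large as Newly Posted.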
Starting with raw company names, we normalize these names using a set of proprietary criteria. This strips from the name information that is irrelevant to identifying the company correctly (e.g. LLC, Inc.), leaving a normalized name with all the ingredients needed to classify it to our Companies Taxonomy.
After normalization, we match the clean name to the best fit in our Companies Taxonomy. Each company has associated metadata, including Tradestyle, NAICS codes, and staffing labels. If a company is a subsidiary or establishment of another company, we generally roll it up into the main company when the establishment or subsidiary has the parent in its name. For example, “Walmart Canada” would be classified as “Walmart.”
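The two steps above (normalization, then roll-up to a parent in the taxonomy) can be sketched as follows. The suffix list and the one-entry taxonomy are toy assumptions; Lightcast's actual normalization criteria are proprietary and its Companies Taxonomy is far larger.

```python
import re

# Assumed legal suffixes irrelevant to identifying the company.
LEGAL_SUFFIXES = r"\b(llc|inc|ltd|corp|plc|gmbh)\.?$"

# Toy parent lookup standing in for the Companies Taxonomy.
TAXONOMY = {"walmart"}

def normalize_company(raw_name):
    """Lowercase the name and strip trailing punctuation and legal suffixes."""
    name = raw_name.strip().lower()
    name = re.sub(r"[.,]+$", "", name)           # trailing punctuation
    name = re.sub(LEGAL_SUFFIXES, "", name)      # legal suffix, if any
    return name.strip().rstrip(",.")

def roll_up(normalized_name):
    """Roll a subsidiary up to its parent when the parent's name appears in
    the subsidiary's name (e.g. 'walmart canada' -> 'walmart')."""
    for parent in TAXONOMY:
        if parent in normalized_name.split():
            return parent
    return normalized_name
```

Applied to the example in the text, “Walmart Canada, Inc.” normalizes to “walmart canada” and rolls up to “walmart”.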
Lightcast assigns an education level to each posting using a machine learning model that detects the presence of required or preferred education levels. If more than one education level is mentioned, the posting is tagged with all levels mentioned. Potential values include High School/GED, Associate’s Degree, Bachelor’s Degree, Master’s Degree, or Ph.D./Professional Degree. If the posting does not contain any educational requirements, it is tagged as Unspecified.
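A keyword-based stand-in for that tagging behavior might look like the sketch below. The real detector is a machine learning model; these regex patterns are assumptions chosen only to illustrate the multi-label and Unspecified behavior described above.

```python
import re

# Assumed surface patterns; Lightcast's actual model is far more robust.
EDUCATION_PATTERNS = {
    "High School/GED": r"high school|\bged\b",
    "Associate's Degree": r"associate'?s? degree",
    "Bachelor's Degree": r"bachelor'?s?|\bb\.s\.|\bb\.a\.",
    "Master's Degree": r"master'?s?|\bm\.s\.|\bmba\b",
    "Ph.D./Professional Degree": r"ph\.?d|doctorate|\bj\.d\.|\bm\.d\.",
}

def tag_education(posting_text):
    """Return every education level mentioned, or ['Unspecified']."""
    lowered = posting_text.lower()
    levels = [level for level, pattern in EDUCATION_PATTERNS.items()
              if re.search(pattern, lowered)]
    return levels or ["Unspecified"]
```

As in the text, a posting mentioning several levels receives every matching tag, and a posting with no educational language falls through to Unspecified.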
Postings are tagged as full-time (more than 32 hours), part-time (32 hours or less), flexible hours (if the posting mentions both full and part time, or a range of working hours that span both categories), or intern. If the posting does not specify, full-time is assumed.
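The hours-based rules above can be expressed as a small classifier. The function signature is hypothetical; it assumes the posting's weekly hours (or range) have already been extracted.

```python
def classify_job_type(min_hours, max_hours=None):
    """Map weekly hours to a job-type tag per the rules above: more than 32
    hours is full-time, 32 or fewer is part-time, and a range spanning both
    categories is flexible hours. Unspecified defaults to full-time.
    (Illustrative sketch; intern detection is text-based and omitted here.)"""
    if min_hours is None:
        return "Full-time"          # assumed when the posting is silent
    high = max_hours if max_hours is not None else min_hours
    if min_hours <= 32 < high:
        return "Flexible hours"     # range spans part- and full-time
    return "Full-time" if min_hours > 32 else "Part-time"
```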
Years of experience required for the position are captured where available. Not all postings include an experience level; these unspecified postings are not displayed when the Minimum Experience Required filter is applied.
Country, city, and state information is captured during the scraping process when present. Lightcast also maps postings to traditional MSAs using a mapping based on Google geo-coding that links MSAs to the city-state combinations found in job postings. A similar process is used to map city-states to counties.
Skills data are extracted from the text of the posting: Lightcast looks for sequences of words that indicate skills, distinguishing between specialized skills, common skills, software skills, and qualifications. Specialized Skills, also known as technical skills or hard skills, are skills that are primarily required within a subset of occupations or equip one to perform a specific task (e.g. “NumPy” or “Hotel Management”). Common Skills, also known as soft skills, human skills, or competencies, are skills that are prevalent across many different occupations and industries, including both personal attributes and learned skills (e.g. “Communication”). Software Skills are any software tool or programming component used to help with a job (e.g. Python, Workday, AutoCAD, Microsoft Excel, React.Js, Accounting Software, and 3D Modeling Software would all be considered “Software Skills”). Certifications are recognizable qualification standards assigned by industry or education bodies (e.g. “Cosmetology License” or “Certified Cytotechnologist”).
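Looking for "sequences of words that indicate skills" can be sketched as phrase matching against a skills lexicon. The tiny lexicon below is an assumption built from the examples above; Lightcast's real taxonomy is far larger and its matching more sophisticated.

```python
# Toy lexicon mapping surface phrases to (skill, category); assumed entries.
SKILL_LEXICON = {
    "numpy": ("NumPy", "Specialized Skill"),
    "hotel management": ("Hotel Management", "Specialized Skill"),
    "communication": ("Communication", "Common Skill"),
    "python": ("Python", "Software Skill"),
    "cosmetology license": ("Cosmetology License", "Certification"),
}

def extract_skills(posting_text):
    """Scan posting text for known skill phrases, checking longer phrases
    first so multi-word skills win over any single-word substrings."""
    lowered = posting_text.lower()
    return [SKILL_LEXICON[phrase]
            for phrase in sorted(SKILL_LEXICON, key=len, reverse=True)
            if phrase in lowered]
```

A production matcher would also need word-boundary handling and disambiguation (e.g. "python" the language vs. the snake), which this sketch omits.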
Some job postings include the salary or salary range of the vacancy. Lightcast extracts and cleans this information and includes it in the dataset when it is a likely and reasonable reflection of the position.
All job postings are scanned for the presence of language indicating that the advertised position can be filled by a remote or partially remote worker. This involves analyzing the text of each posting’s title and body for job location language. Many words and phrases are used to indicate a remote or hybrid position, including “remote”, “position can be located anywhere”, “work from home”, “telecommute”, “partially remote”, and others. Postings containing language indicative of Job Location are tagged as Remote, Hybrid, or Non-Remote. It should be noted that the definition of Remote is broad enough to include postings that require a person to live in a particular region even though coming into an office is not required. Postings that do not contain any indication of Job Location are tagged as Unknown.
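A minimal sketch of that scan, using only the indicative phrases quoted above (the real detection uses many more patterns, and this sketch omits the Non-Remote tag, which requires on-site location language):

```python
import re

# Indicative phrases drawn from the text above; assumed to be lowercase-matched.
REMOTE_PATTERNS = [r"\bremote\b", r"work from home", r"telecommute",
                   r"position can be located anywhere"]
HYBRID_PATTERNS = [r"\bhybrid\b", r"partially remote"]

def tag_job_location(title, body):
    """Tag a posting as Hybrid, Remote, or Unknown from its title and body.
    Hybrid is checked first because 'partially remote' also contains 'remote'."""
    text = f"{title} {body}".lower()
    if any(re.search(p, text) for p in HYBRID_PATTERNS):
        return "Hybrid"
    if any(re.search(p, text) for p in REMOTE_PATTERNS):
        return "Remote"
    return "Unknown"   # no job location language detected
```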
Raw titles are collected at aggregation and are then cleaned and normalized to our Lightcast Titles taxonomy. For example, a posting for Facebook might have a raw job title of “Data Science Manager, Messenger”; after the posting is run through the tagging system, the title would be normalized to “Data Science Manager.”
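One simple way to approximate that cleaning step is to strip team or department qualifiers and match against a canonical title list. Both the splitting rule and the three-entry title list below are assumptions for illustration; the Lightcast Titles taxonomy and its tagging system work quite differently in practice.

```python
import re

# Toy canonical list standing in for the Lightcast Titles taxonomy.
CANONICAL_TITLES = ["Data Science Manager", "Data Scientist",
                    "Marketing Specialist"]

def normalize_title(raw_title):
    """Drop qualifiers after a comma, parenthesis, pipe, or hyphen, then
    match the remainder against canonical titles. (Hypothetical sketch.)"""
    base = re.split(r"[,(|-]", raw_title)[0].strip()
    for title in CANONICAL_TITLES:
        if title.lower() in base.lower():
            return title
    return base   # fall back to the cleaned raw title
```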
Lightcast uses machine learning models and rules to code occupations from the raw title and job description of the job posting.
Let us know what specific questions we can help you with (we may even add your question to our knowledge base).