Data collection and processing

Published by Rowan Hotham-Gough

Selecting UK universities and university colleges

In June 2011, I made a record of every university and university college in the UK. I started with the UCAS list of universities and colleges. I removed records from the UCAS list that were not for universities or university colleges. I did this by checking Wikipedia’s list of universities in the United Kingdom (accessed 16 June 2011). Where there was doubt, or if the institution was not listed, I searched Wikipedia for the institution. I only included the institution if it was clear from its Wikipedia entry that it was a university or university college.

Downloading job details

I downloaded the job details for each UK university and university college. There were some exceptions. I did not download the job details where:

  • The job was internal, honorary, or voluntary;
  • The job was a studentship;
  • The job was a non-university job;
  • The job was advertised solely in the Welsh language.

I excluded these jobs because I was concerned that they could distort the results of any analyses. Internal, honorary, voluntary, and non-university jobs could be advertised differently to regular jobs. Rather than spend time analysing these types of jobs to see whether there were any real differences, I decided to save time and exclude them.

I decided that some people would question whether a studentship is really a job. I also anticipated that the criteria for getting a studentship could be different to those for getting a regular job. So, I excluded these job adverts.

I wanted to include the jobs that were advertised exclusively in Welsh. But I would have had to translate them into English so that I could include them in my analyses, and I would have had to choose how to translate certain words and phrases. Because it is not possible to produce a single definitive translation of a job advert, the translated adverts could have skewed any results. At best, I would have made the dataset more ‘noisy’. I decided to exclude these adverts.

I did not keep a record of how many job adverts I did not download. With hindsight, I wish I had. However, I am confident that most of them were internal jobs.

For the job details I did download, I followed a set procedure. I accessed the jobs webpages for each university and university college. For each job listed, I:

  1. Saved the webpage (if present) as Web Page, HTML Only;
  2. Saved all documents (PDF, Word) directly associated with the job. I did not save documents that were not (and did not contain) a job description or person specification. The one exception was a single file for job 21 at Cardiff University, which I did not download because it was too large.

I downloaded the job details on the following dates:

  • 17 June 2011: Jobs at institutions with UCAS codes ranging from A20 (The University of Aberdeen) to C99 (University of Cumbria);
  • 20 June 2011: D26 (De Montfort University) – L39 (University of Lincoln);
  • 21 June 2011: L41 (The University of Liverpool) – R54 (Royal Agricultural College);
  • 22 June 2011: R72 (Royal Holloway, University of London) – Y75 (York St John University).
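The batches above can be read as a lookup from UCAS code to download date. The following is a minimal sketch only. It assumes every UCAS code is one letter followed by two digits (as in the codes quoted above), so that plain string comparison orders them correctly. The names `BATCHES` and `download_date` are hypothetical.

```python
from datetime import date

# (first code, last code, download date) for each batch listed above.
BATCHES = [
    ("A20", "C99", date(2011, 6, 17)),
    ("D26", "L39", date(2011, 6, 20)),
    ("L41", "R54", date(2011, 6, 21)),
    ("R72", "Y75", date(2011, 6, 22)),
]

def download_date(ucas_code):
    """Return the download date for an institution's UCAS code, or None."""
    code = ucas_code.upper()
    for first, last, day in BATCHES:
        # Assumption: letter-plus-two-digit codes sort correctly as strings.
        if first <= code <= last:
            return day
    return None
```

A code that falls between two batches (for example, a hypothetical `L40`) returns `None` rather than being assigned to a neighbouring date.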

In total, I downloaded the details for 2,417 jobs.

Findings from a random sample

I randomly selected 242 (10%) of the 2,417 jobs I had downloaded. For each randomly selected job, I examined:

  1. The title of the person specification section;
  2. How the criteria listed in the person specification were separated into essential and desirable criteria;
  3. The sub-headings used in the person specification;
  4. How the criteria listed in the person specification were to be assessed;
  5. The structure of the rest of the job advert.
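The text does not say how the random selection was made. As one possible approach, the sketch below draws 242 distinct job numbers from 1 to 2,417 using Python's `random.sample`. The numbering of jobs, the function name `draw_sample`, and the seed are all assumptions, chosen only so that the example is repeatable.

```python
import random

TOTAL_JOBS = 2417   # jobs downloaded
SAMPLE_SIZE = 242   # roughly 10% of the total

def draw_sample(seed=2011):
    """Pick SAMPLE_SIZE distinct job numbers from 1..TOTAL_JOBS."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    return sorted(rng.sample(range(1, TOTAL_JOBS + 1), SAMPLE_SIZE))

sample = draw_sample()
```

Sampling without replacement guarantees that no job is examined twice.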

1. I discovered that not all of the randomly selected job adverts contained a person specification. Those job adverts that did contain a person specification titled that section: Person specification; Criteria; Person Profile; Candidate Requirements; or Employee Specification. It became apparent that I would need to look for all sections in job adverts that functioned as a person specification, rather than only look for sections that were titled ‘person specification’.
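As a rough illustration, a first pass at spotting these sections could compare each section title against the titles seen in the sample. This is a sketch under assumptions: the function name `is_person_spec_title` is hypothetical, and a fixed list cannot recognise a section that functions as a person specification under some other title, so it would not replace reading the advert.

```python
# Section titles seen in the random sample (taken from the text above).
KNOWN_TITLES = {
    "person specification",
    "criteria",
    "person profile",
    "candidate requirements",
    "employee specification",
}

def is_person_spec_title(title):
    """True if a section title matches one of the titles seen in the sample."""
    # Ignore case, repeated whitespace, and a trailing colon.
    normalised = " ".join(title.casefold().split()).rstrip(": ")
    return normalised in KNOWN_TITLES
```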

2. The person specifications tended to separate essential and desirable criteria into two lists headed ‘Essential’ and ‘Desirable’. However, some job adverts used headings or text like ‘Required Skills’, ‘Candidates must demonstrate the following’, ‘The post holder must have’, ‘Applicants should have’, or ‘The person appointed will’. This variety of headings for the essential and desirable criteria raised important questions: What is the difference between ‘essential’, ‘required’, ‘must’, ‘should’, and ‘will’? Do recruiters and job applicants consistently interpret these words in the same way? I decided that I would look at whether the person specifications had a regular or irregular structure. I would only include person specifications that clearly specified which criteria were essential and which were desirable.

3. There was a lot of variety in the sub-headings used in the person specifications. In these randomly selected job adverts alone, there were many ways that recruiters headed criteria just to do with education: Educational philosophy; Education & qualifications; Education/Qualifications; Education/Professional qualifications; Educational and professional qualifications; Education, qualifications, training; Educational qualifications and training; Education, qualifications and training; Qualifications/Education training; Education and training; Education/Training; Education and experience; Educational experience; Education, experience & achievements; and Experience/Education/Qualifications.

Some of the randomly selected job adverts did not use any sub-headings within their person specification. Those that did sometimes included a note with the sub-heading. Despite all of this variation, I decided that sub-headings contained valuable data. For each person specification, I would record whether the person specification contained any sub-headings or not. If it did, I would record the sub-heading, any note it had with it, and which criteria it was a heading for.

4. In comparison to the rest of the person specification, the assessment part tended to be straightforward. Some person specifications stated how the listed criteria were to be assessed. In the random sample, I found the following assessment types: Application form; Application; Interview; Testing; Test; Assessment; Test/Exercise; References; Reference; Certificate; Certificates; Copy of certificates; Presentation; Documentary evidence; CV; Supporting statement; Evidence; Publications; Past record; Qualifications; Evidence of published papers; Funding received and roles held; and Occupational assessment.

It was interesting to see how recruiters would assess whether applicants had met the criteria listed in the person specification. However, this data was not directly relevant to my research. So I decided that I would record whether this data was present or not in the person specification, but I would not record the data itself.

5. I thought it was important to look at the person specifications in context, so I looked at the rest of the job advert. I discovered that recruiters sometimes gave more than one job title in the advert. I decided that I would record just one title per job (the one positioned closest to the person specification), and note that other job titles were also used.
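The "closest title" rule can be sketched as follows. The representation is an assumption: each title is paired with a hypothetical character offset within the advert, and the person specification has an offset of its own. The function name `closest_title` is also hypothetical.

```python
def closest_title(titles, spec_position):
    """Pick the job title positioned nearest the person specification.

    titles: list of (position, text) pairs, where position is a
    hypothetical character offset within the advert.
    Returns (chosen_text, other_titles_present).
    """
    if not titles:
        return None, False
    _, chosen = min(titles, key=lambda item: abs(item[0] - spec_position))
    # Note whether any differently worded title also appears in the advert.
    others = any(text != chosen for _, text in titles)
    return chosen, others
```

For example, with titles at offsets 10 and 900 and a person specification at offset 1000, the title at offset 900 is chosen and the second return value is `True`.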

The randomly selected job adverts contained a lot of other data. For example, the job reference number, job type (‘senior management’, ‘technical’, ‘academic’, ‘clinical’…), the university campus, the department/faculty/school/centre, contract type (‘permanent’, ‘temporary’, ‘fixed term’, ‘8 months’, ‘ongoing’…), position type (‘full-time’, ‘part-time’, ‘18.25 hours a week’, ‘variable hours’, ‘freelance’…), days worked (‘Monday to Friday’), the date the job advert was posted, application closing date, the start date, the grade/pay band, the salary/salary range, the interview date, the duties and responsibilities of the post… Although all of this data was rich and interesting, I decided to exclude most of it to save time. The only data I did decide to record was the job reference number as it could be used to identify the job advert.

Some job adverts listed something called ‘competencies’ in a separate section to the person specification. It was not always clear whether these ‘competencies’ were a continuation of the person specification, or something different. I decided to exclude job adverts where it was unclear what belonged to the person specification. Where ‘competencies’ were clearly separate from the person specification, I would ignore the ‘competencies’ and just record the contents of the person specification.

Some job adverts contained multiple person specifications. Sometimes this was the same person specification duplicated in different parts of the job advert. Sometimes these were different person specifications advertised together (usually for different grades of the same position). I decided that it would complicate the results of my research if multiple person specifications were linked to a single job. So I decided to exclude all job adverts where there were multiple different person specifications. When a job advert had exact duplicates of a single person specification, I would record the contents of one of the duplicates.

Processing the data

For each of the 2,417 job adverts, I recorded:

  • The job title;
  • Whether there were other job titles present in the job advert;
  • The job reference number;
  • Whether the job advert contained a person specification.

For every job advert that contained a person specification, I recorded:

  • The location of the file containing the person specification;
  • Whether the recruiter had specified how the person specification’s criteria would be assessed;
  • Whether there were any desirable criteria (not just essential criteria);
  • The label used as a heading for the desirable criteria (if present);
  • Whether there were any sub-headings in the person specification.
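The fields listed above map naturally onto a single record per job advert. The sketch below is one possible layout, not the format of the actual dataset. The class name, field names, and the example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class JobAdvertRecord:
    # Recorded for every job advert.
    job_title: str
    other_titles_present: bool
    reference_number: Optional[str]
    has_person_spec: bool
    # Recorded only when a person specification is present.
    spec_file_location: Optional[str] = None
    assessment_stated: Optional[bool] = None
    has_desirable_criteria: Optional[bool] = None
    desirable_label: Optional[str] = None
    has_sub_headings: Optional[bool] = None

# Hypothetical example values, for illustration only.
record = JobAdvertRecord(
    job_title="Lecturer in History",
    other_titles_present=False,
    reference_number="ABC123",
    has_person_spec=True,
    spec_file_location="jobs/A20/job01.pdf",
    assessment_stated=True,
    has_desirable_criteria=True,
    desirable_label="Desirable",
    has_sub_headings=False,
)
```

Leaving the person-specification fields as `None` distinguishes "not applicable" from a recorded `False`.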

I also recorded any issues that I found with the job adverts, particularly if:

  • A downloaded file did not have any content;
  • Multiple posts were available for the advertised role;
  • The person specification had an irregular format or used irregular labels;
  • There were multiple person specifications (listed separately or combined into one) for the advertised role;
  • The person specification was only advertised in the Welsh language;
  • The advertised role was voluntary, honorary, a studentship, only available to internal applicants, or being advertised on behalf of an external organisation;
  • A file could not be downloaded because it was too large;
  • The headings for the person specification were not visible (but were present);
  • The advertised role was a duplicate of another job advert;
  • There were multiple jobs advertised together in a single job advert;
  • There were alternate roles advertised together (applicants had to choose one of the optional roles).

Some of the issues in the above list were problematic: they would have made it difficult to record or to analyse the related data, and they carried the risk of distorting the results of my analyses. I decided to exclude the job adverts with these problematic issues from the next stage of data processing. In total, I excluded 567 job adverts that did have a person specification, but had 1 or more of the following problems:

  • The format of the person specification was irregular, or it used irregular labels (460 jobs);
  • The advert contained more than 1 person specification (123 jobs);
  • The person specification was only available in Welsh (2 jobs);
  • The job was internal, honorary, or voluntary, or it was a studentship, or a non-university job (20 jobs);
  • The job advert was a duplicate of another job advert (8 jobs).

I also excluded 191 job adverts that did not have a person specification. This left 1,662 jobs that I could gather further data from (69% of the total 2,417).
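The exclusion rule can be sketched as a simple filter. The issue labels below are hypothetical short names for the five problems listed above, and `keep_for_next_stage` is a hypothetical function name. The sketch also shows why the per-problem counts add up to more than 567: an advert with two problems is counted once under each.

```python
# Hypothetical short names for the five problems listed above.
PROBLEMATIC = {
    "irregular_format",
    "multiple_person_specs",
    "welsh_only",
    "not_regular_job",
    "duplicate_advert",
}

def keep_for_next_stage(has_person_spec, issues):
    """True if an advert has a person specification and no problematic issue."""
    return has_person_spec and not (set(issues) & PROBLEMATIC)

# Counts per problem, as listed above.
category_counts = [460, 123, 2, 20, 8]
# Extra counts arising from adverts that had more than one problem.
extra_counts = sum(category_counts) - 567
```

Issues outside the `PROBLEMATIC` set (for instance, a hypothetical `multiple_posts` label) are recorded but do not exclude the advert.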

For every job advert that contained a person specification and that did not have any problematic issues (1,662 jobs), I recorded the criteria listed in the person specification. For each criterion, I recorded:

  • Whether it was an ‘essential’ or ‘desirable’ criterion;
  • The sub-heading it was listed under (if present);
  • The note explaining the sub-heading (if present);
  • Whether the criterion contained a nested list of sub-criteria;
  • Whether the criterion contained conjunctions (e.g. ‘AND’) that were highlighted or set apart.
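One possible shape for a single criterion entry is sketched below. The class and field names are hypothetical, and the example values are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CriterionRecord:
    text: str
    essential: bool                       # False means 'desirable'
    sub_heading: Optional[str] = None     # None if no sub-heading was used
    sub_heading_note: Optional[str] = None
    has_nested_list: bool = False
    has_highlighted_conjunction: bool = False

# Hypothetical example, for illustration only.
criterion = CriterionRecord(
    text="A degree in a relevant subject",
    essential=True,
    sub_heading="Education/Qualifications",
)
```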

Because I wanted to record each criterion separately (so that I could distinguish between them in my analysis), I first needed to define what I would count to be a criterion. I defined a criterion as a string of text separated from its sub-heading (and the sub-heading’s note) and from other criteria by:

  • Being placed in its own table cell;
  • Being preceded by a bullet-point or list number (except for nested lists);
  • Or being placed on a new line (except where the text was on a new line due to word-wrapping or poor formatting).
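For text that has already been copied out as plain lines, the bullet and new-line rules could be approximated as below. This is a sketch under stated assumptions, not the procedure that was followed. Table cells and nested lists are not handled, and the test for word-wrapping (an unmarked line that starts with a lowercase letter is treated as a continuation) is a heuristic introduced here. The names `MARKER` and `split_criteria` are hypothetical.

```python
import re

# Assumed markers: a bullet character, or a list number such as "1." or "2)".
MARKER = re.compile(r"^\s*(?:[•\-\*]|\d+[.)])\s+")

def split_criteria(text):
    """Split the plain text of a person specification into criteria."""
    criteria = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = MARKER.match(raw)
        if match:
            # A bullet or list number starts a new criterion.
            criteria.append(raw[match.end():].strip())
        elif criteria and line[0].islower():
            # Heuristic (an assumption): treat as a word-wrapped continuation.
            criteria[-1] += " " + line
        else:
            # Otherwise a new line starts a new criterion.
            criteria.append(line)
    return criteria
```

A heuristic like this would still need manual checking, because poor formatting can make a wrapped line look like a new criterion.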

Using this definition, I copied 28,187 criteria into my dataset.

Some of the criteria were copied containing corrupt text (e.g. containing the word ‘communica8on’ instead of ‘communication’). This corruption was due to some word-processing applications incorrectly converting ligature glyphs when converting documents to PDFs. So the text in the PDF appears correct, but the text string that is copied is corrupt. Also, a few of the person specifications contained typos or spelling errors made by the original author. To reduce these errors, I spell-checked all of the criteria. Where the copied text was corrupt, I replaced the corrupted string with a corrected version. Where I thought there was a typo or spelling error, I made a backup copy of the original text string, then corrected the error in the text string.
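The correction step can be sketched as a lookup of known corrupt strings, keeping the original text alongside the corrected version as the paragraph above describes for typo corrections. The lookup table and the function name `repair` are hypothetical. In practice the table would be built by hand during spell-checking.

```python
import re

# Hypothetical lookup of corrupted strings and their corrections.
CORRECTIONS = {"communica8on": "communication"}

def repair(text, corrections=CORRECTIONS):
    """Return (corrected_text, original_text) so the original is kept as a backup."""
    corrected = text
    for bad, good in corrections.items():
        # Replace whole words only, so longer strings are left untouched.
        corrected = re.sub(r"\b" + re.escape(bad) + r"\b", good, corrected)
    return corrected, text
```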