
Modern Assessment in the Age of Generative AI

A new Korn Ferry report shows minimal impact of GenAI on Korn Ferry assessment results.

Generative AI (GenAI) tools like ChatGPT, Copilot, and Gemini are transforming how we approach everyday tasks, from document summarization to complex coding. These AI platforms promise to make knowledge workers more productive, potentially boosting efficiency by 30% to 60%. But their impact on talent management, especially in recruiting and hiring, has sparked worries about candidates using AI to complete assessments.

Reports of GenAI's strong performance on standardized university admissions tests—with GPT-4 scoring at or above the 90th percentile—have raised concerns that assessment results could be skewed. This raises the question: Can AI tools give applicants an unfair advantage in unproctored assessments?

To explore this issue, the Korn Ferry Institute systematically investigated how GPT-3.5 Turbo and GPT-4 perform on Korn Ferry’s assessments, including self-assessments of competencies, traits, and drivers (CTD), cognitive assessments, and situational judgment tests. The results show that while generative AI may be “an exciting and helpful resource for some tasks,” the AI models “did not shine at completing our assessments,” says Sarah Hezlett, Vice President of Assessment Science at the Korn Ferry Institute.

Here are five key takeaways from the investigation:

1. GenAI tools do not effectively take role-relevant information into account when completing Korn Ferry self-assessments.

Decisions about people at work, such as selecting candidates for roles or planning their development, should be based on precise and context-specific assessments. Korn Ferry uses detailed Success Profiles to compare individual assessment results to what is needed for success in a job, considering factors like job demands, role level, and organizational culture. The study found that while AI did vary its responses across different roles, it was not effective at presenting itself as a strong candidate for those roles.

2. Korn Ferry’s forced-choice item response theory (FC-IRT) approach mitigates “faking good” for both humans and AI.

Assessment responses can be collected and scored in different ways. Likert-type scales, which ask people to choose from options like “Strongly Disagree” to “Strongly Agree,” can be influenced by how people want to be perceived or think they should be perceived. Forced-choice formats, which ask people to rank items, are less susceptible to impression management. In the study, the GenAI models did not perform as well on Korn Ferry’s FC-IRT assessments as they did in other studies that used different types of forced-choice formats. “Our best-in-class Success Profiles and our assessment methodology continue to differentiate the value offered by our portfolio,” says Hezlett.

3. Prompt engineering makes a difference in assessment output, but less than expected.

Large language models (LLMs) are trained to understand the context of a query by analyzing the words and their relationships, then predicting the next words. However, natural language can be imprecise, and LLMs work best with short, focused prompts. The study found that while skilled prompt engineering can slightly improve AI performance on assessments, a human candidate with the right mix of competencies, traits, drivers, and cognitive abilities will tend to perform better than GenAI, even with well-crafted prompts.  

4. Using GenAI to respond to assessments is a cumbersome process that many users dislike.

Applicants’ experiences during the hiring process can shape their views of a company and their decision to pursue a job, especially when the hiring website is easy to use and the selection process is clear. However, using AI tools like GPT-4 to complete assessments can be slow and frustrating. It often takes more time than completing the assessments directly, because of misunderstandings and the extraneous information the models produce.

5. GenAI does not excel at all cognitive activities.

When GPT-4 was released, its strong performance on some standardized tests led many to believe it could excel in the cognitive ability tests used in hiring. However, the research showed that the LLMs did not perform consistently well across cognitive ability assessments, which means generative AI will not help every job applicant on every cognitive test.

Conclusion

Generative AI can boost productivity and transform work processes, but its use requires careful oversight and regular review. Korn Ferry's assessment components, including Success Profiles and FC-IRT, seem to minimize manipulation by humans and GenAI. As AI evolves, so will assessment practices. “Generative AI will continue to evolve and advance,” says Andrea Deege, Senior Director of Assessment Science and Scoring at the Korn Ferry Institute. “We will continue to stay abreast of these developments as part of our ongoing commitment to monitoring trends related to assessment science in the workplace.”
