Beyond personality tests

Evaluating AI-driven personality assessments against traditional self-reported Big Five measures

Artificial Intelligence (AI) and Machine Learning (ML) have steadily infused various domains of organisational research and application, one such being occupational assessment. A recent article in the Journal of Applied Psychology by Fan et al. (2023) investigates the credibility of measuring personality indirectly through an AI chatbot, compared with traditional self-report instruments such as the NEO-PI-R or OPTO.

In the study by Fan et al. (2023), a chatbot collected users' free-text responses during a 20–30 minute online conversation and then used machine learning algorithms to infer personality scores from that text (hereafter "AI personality scores"). These scores were then compared with Big Five scores derived from a self-report questionnaire (hereafter "self-reported Big5 scores").

The idea of inferring one's personality from bits of digital footprints might seem enchanting; however, it is crucial to understand the psychometric nuances before choosing between AI-based and traditional self-report measures. Based on the findings reported in the article, the main issues with AI personality scores are:

  1. Discriminant Validity: While the AI personality scores demonstrated good convergent validity, their discriminant validity was relatively poor. The good convergent validity indicates that the AI personality scores were in line with self-reported Big5 scores for the same traits. However, the lower discriminant validity implies that the AI personality scores were not distinct enough between different traits. This is a concern, as it suggests the AI may have difficulty distinguishing between different personality factors.
  2. Incremental Validity: The AI personality scores only exhibited incremental validity over self-reported Big5 scores in "some" analyses. This inconsistency implies that the AI personality scores may not consistently provide additional predictive value beyond what is already captured by traditional self-report measures.
  3. Criterion-related Validity: A major concern is that the AI personality scores showed low criterion-related validity in the study. This indicates that these scores may not be effective predictors of external criteria (e.g., job performance), which heavily reduces their practical applicability compared to self-reported Big Five measurements. The limits of the AI scores' predictive power should be taken seriously.

While these findings highlight some of the weaknesses of chatbot-based AI personality assessment, AI personality scores also have strengths, such as overall acceptable reliability, a comparable factor structure, and evidence for cross-sample generalizability. Nevertheless, the issues with discriminant validity, inconsistent incremental validity, and criterion-related validity are significant concerns for researchers and practitioners aiming to apply these AI-driven methods in real-world contexts.


Psychometrical studies of AI Personality Scores

Reliability of AI Personality Scores

Reliability, a cornerstone of psychometric research, measures the consistency of scores across multiple test administrations. When comparing scores assessed the traditional way by self-reporting with scores derived from an AI chatbot, we first need to be aware of the limitations of the comparison method. Researchers typically quantify the reliability of a tool via internal consistency (Cronbach's α) or split-half reliability. These methods are not directly applicable to the AI method, however, as there are no items to run the analysis on, so researchers must devise alternative ways of measuring reliability.

This study showed mixed results on the reliability of AI-driven personality assessments:

  • At the facet level, split-half reliabilities of AI personality scores were overall acceptable, but still somewhat lower than those of self-reported Big5 scores.
  • The test-retest reliabilities of AI personality scores were comparable to split-half reliabilities, and therefore also lower than those of self-reported Big5 scores.
  • AI personality scores demonstrated somewhat higher internal consistency reliabilities than self-reported Big5 domain scores when facet scores were treated as “items”.
  • Self-reported Big5 scores showed good internal consistency and comparable split-half reliabilities.
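To make the two classical reliability estimates mentioned above concrete, here is a minimal numpy sketch (not the authors' code) that computes Cronbach's α and a Spearman-Brown-corrected split-half coefficient on simulated item responses driven by a single latent trait:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def split_half_reliability(items, seed=None):
    """Correlate sums of two random item halves, Spearman-Brown corrected."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(items.shape[1])
    half_a = items[:, idx[: len(idx) // 2]].sum(axis=1)
    half_b = items[:, idx[len(idx) // 2 :]].sum(axis=1)
    r = np.corrcoef(half_a, half_b)[0, 1]
    return 2 * r / (1 + r)  # Spearman-Brown prophecy formula

# Simulated responses: 200 people, 12 items loading on one latent trait
rng = np.random.default_rng(0)
trait = rng.normal(size=(200, 1))
items = trait + rng.normal(scale=1.0, size=(200, 12))

print("alpha:", round(cronbach_alpha(items), 2))
print("split-half:", round(split_half_reliability(items, seed=1), 2))
```

For AI personality scores, where no items exist, the same split-half logic can instead be applied to halves of the text input, which is one of the workarounds the reliability results above rest on.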

Validity of AI Personality Scores

Validity concerns whether scores actually measure what they are intended to measure, and it can be assessed in several ways. This study focused on factorial validity, convergent and discriminant validity, and criterion-related validity.

The validity results in the study:

  • Most values of the model fit indicators did not meet the commonly used rules of thumb. Nevertheless, the authors still considered them adequate given the complexity of personality structure.
  • They also concluded that, in general, the AI personality scores largely replicated the patterns and structure observed in the self-reported Big5 scores.
  • Factor loading patterns and magnitudes were similar across the two measurement approaches.
  • AI personality domain scores demonstrated excellent convergent validity but somewhat weaker discriminant validity. The poor discriminant validity in the test sample is particularly concerning.
  • AI personality domain scores showed initial evidence of low criterion-related validity.
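The convergent/discriminant contrast in these results can be illustrated with a small simulation (invented data, not the study's): AI scores that track the right trait but also "bleed" into another trait show high same-trait correlations across methods (convergent validity) alongside inflated cross-trait correlations (weak discriminant validity):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Two uncorrelated latent traits (e.g. Extraversion, Conscientiousness)
latent = rng.normal(size=(n, 2))

# Self-report scores: each trait measured with modest noise
self_report = latent + rng.normal(scale=0.5, size=(n, 2))

# Hypothetical AI scores: each tracks its trait but also bleeds into the
# other one, mimicking the weaker discriminant validity reported above
ai = latent + 0.6 * latent[:, ::-1] + rng.normal(scale=0.5, size=(n, 2))

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

convergent = [corr(ai[:, t], self_report[:, t]) for t in range(2)]       # same trait
discriminant = [corr(ai[:, t], self_report[:, 1 - t]) for t in range(2)]  # cross-trait
print("convergent:", [round(c, 2) for c in convergent])
print("discriminant:", [round(c, 2) for c in discriminant])
```

In a well-behaved measure, the cross-trait (heterotrait-heteromethod) correlations should be much smaller than the same-trait ones; the smaller that gap, the worse the discriminant validity.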

Although AI personality scores displayed low criterion-related validity in predicting performance, they were surprisingly comparable to traditional self-report measures in this respect. Even so, the differences suggest that, psychometrically speaking, AI chatbots are not yet on the same level as traditional questionnaire-derived Big Five measurement. Nevertheless, the authors concluded overall that their results show improvements in differentiating among traits compared with other existing machine learning applications.

Psychometrical conclusions

Fan et al. (2023) conclude that the AI personality scores in their study:

  1. Had overall acceptable reliability at both domain and facet levels.
  2. Produced a comparable factor structure to self-reported Big5 scores.
  3. Displayed good convergent validity but relatively poor discriminant validity.
  4. Showed low criterion-related validity.
  5. Exhibited incremental validity in some analyses.

In addition, they found strong evidence for cross-sample generalizability of various aspects of psychometric properties of their AI Personality scores.


Superior Psychometric Properties in some AI studies

Notably, even though this study showed that traditional self-report methods performed better than the AI chatbot, studies of other AI methods have produced even less favourable results; the present study fared better psychometrically than earlier work by other researchers. Data leakage (a pitfall in which models are inadvertently exposed to test data during training) was diligently avoided. Three factors might explain the psychometric edge of the Fan et al. (2023) study:

  • Sample Size: The study was conducted using a larger sample size. This offers a more refined detection of subtle trait-relevant features, enhancing the model's accuracy.
  • Data Collection Method: The AI chatbot system's interactive capabilities possibly garnered higher-quality data.
  • Natural Language Processing (NLP) Method: The use of sophisticated deep-learning-based NLP methods, such as the Universal Sentence Encoder (USE), surpasses traditional count-based methods by retaining richer contextual information.
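The overall pipeline (free text in, trait scores out) can be sketched roughly as follows. This is a toy illustration, not Fan et al.'s system: a deterministic hashed bag-of-words vector stands in for a deep encoder like the Universal Sentence Encoder, the transcripts and Extraversion scores are invented, and a closed-form ridge regression stands in for their machine learning models:

```python
import numpy as np
import zlib

def embed(text, dim=64):
    """Toy stand-in for a sentence encoder: hashed bag-of-words vector.
    A real pipeline would use a deep encoder such as USE instead."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[zlib.crc32(token.encode()) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X'X + lam*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Hypothetical training data: chat transcripts paired with self-reported
# Extraversion scores (invented for illustration)
texts = [
    "i love meeting new people at big parties",
    "i prefer quiet evenings reading alone at home",
    "talking to strangers energises me every time",
    "crowds exhaust me and i avoid social events",
]
scores = np.array([4.5, 1.8, 4.2, 1.5])

X = np.vstack([embed(t) for t in texts])
w = ridge_fit(X, scores)

new_text = "i really enjoy big social gatherings with people"
predicted = float(embed(new_text) @ w)
print("predicted trait score:", round(predicted, 2))
```

The richer the embedding, the more trait-relevant signal the downstream model can exploit, which is why an encoder like USE outperforming count-based features plausibly translates into better psychometric properties.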

These results are well worth following up and investigating further. Even so, the method cannot yet be considered strong enough to surpass traditional self-report methods.


Practical concerns

While AI and ML have pioneered innovative approaches to personality assessment, the traditional self-report models hold their ground in terms of psychometric robustness. The article by Fan et al. (2023) underscores the importance of understanding these nuances for HR professionals.

Beyond the attraction of AI and its empirical findings, several pragmatic concerns warrant attention before the broad-scale implementation of AI-driven personality assessments:

  1. Tailor-made Conversation Agenda: While the AI chatbot allows users to design conversation topics (apt for role-specific interview questions), there is scant evidence that different question sets yield consistent machine scores for identical participants. In line with Hickman et al. (2022), the authors advocate that vendors transparently share training data specifics with prospective clients. Fan et al. (2023) highlight: "…vendors need to provide potential client organizations with enough information about the training data such as sample demographics and the list of interview questions the models are built on." (p. 47)
  2. Resistance to Applicant Faking: Currently, there is no empirical support for the presumption that AI chatbot-based scores are resistant to candidate manipulation.
  3. Adverse Impact: Whereas self-reported personality scores have demonstrated negligible adverse impact, AI personality scores via chatbots have not established immunity. Some language patterns might inadvertently correlate with group identities, necessitating their exclusion from predictive algorithms, which was not possible in the present study.
  4. Criterion-related Validity: The modest criterion-related validity showcased in academic settings warrants extensive validation in diverse organisations.
  5. Robustness Concerning Input Volume: The present study's findings on the volume of chat input participants provide are encouraging. However, determining a minimum-input threshold is essential to retain the psychometric integrity of AI personality scores.
  6. Incorporation of Other-reported Scores: Given recent research emphasizing the predictive potential of other-reported personality scores, it would be beneficial to enhance AI models with other-report data, ensuring comprehensive personality insights for candidates.
  7. Ethical Considerations: AI-driven chatbot assessments raise ethical concerns. For instance, is it justifiable to subject candidates to chatbot interviews without disclosing the intent of text data mining? The repurposing of textual data from in-person interviews for talent management accentuates the need for ethically sound AI applications.


In summary, while AI chatbots might soon gain traction in preliminary recruitment phases, it is imperative for HR professionals to weigh the methodological soundness and ethical implications of these tools against the established reliability of traditional self-report instruments. As technological advances continue to push boundaries, so should our commitment to uphold the principles of fair, ethical, and robust assessment methodologies.



Fan, J., Sun, T., Liu, J., Zhao, T., Zhang, B., Chen, Z., Glorioso, M., & Hack, E. (2023). How well can an AI chatbot infer personality? Examining psychometric properties of machine-inferred personality scores. Journal of Applied Psychology. https://doi.org/10.31234/osf.io/pk2b7

Category: Recruitment, Data Driven
Tags: personality, OPTO, AI

Date: 06.11.2023