With the continued development and increased implementation of Artificial Intelligence tools in HR and workplace contexts, it will become increasingly important for HR leaders and HR technologists to be able to assess the effectiveness of these technologies. A good example of this challenge is AI-powered HR and workplace chatbot technology, one of the more common expressions of AI in HR. There are now numerous examples of chatbots developed by HR tech solution providers for deployment in a growing range of applications.
We have probably all seen, and possibly interacted with or deployed, chatbot technology in a recruiting context – where chatbots can answer candidate questions, perform basic screening, provide information about jobs and the company, drive candidates to complete a full application, and schedule follow-up interviews. We've also seen AI commonly applied to the classic HR service center or help desk, allowing employees to find information about company programs and policies in a simple, conversational manner. And then there are the more advanced applications that move beyond simple chatbots into the category better labeled "digital assistants" – tools that can direct and orchestrate processes and serve as true productivity enhancers. But no matter the specific application of these tools, or what their developers call them, the HR leaders who ultimately deploy them to their workforces need a way to monitor and assess their performance and their true value to the organization. The question then is how best to make these evaluations.
The first and simplest way to assess these tools is through usage and adoption rates. How many times are employees or candidates using these technologies? How are usage rates trending? If their implementation is meant to reduce usage of other tools or activities (say, phone calls or emails to HR support staff), how are those activities trending? Are calls to the HR help desk actually decreasing as expected? These measures are simple and valuable, but they provide only baseline information, and not much insight into the real impact of these new technologies.
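To make the idea concrete, here is a minimal sketch of tracking those two trends side by side – chatbot adoption going up while the help-desk volume it is meant to reduce goes down. All of the numbers and names here are made up for illustration:

```python
# Hypothetical sketch: comparing chatbot adoption against the
# help-desk activity it is intended to displace. Data is invented.

def pct_change(series: list[int]) -> float:
    """Percent change from the first period to the last."""
    if len(series) < 2 or series[0] == 0:
        raise ValueError("need at least two periods with a nonzero start")
    return (series[-1] - series[0]) / series[0] * 100

chatbot_sessions = [120, 180, 260, 340]  # weekly chatbot sessions (made up)
helpdesk_calls = [400, 360, 310, 280]    # weekly HR help desk calls (made up)

print(f"chatbot usage: {pct_change(chatbot_sessions):+.1f}%")   # +183.3%
print(f"help desk calls: {pct_change(helpdesk_calls):+.1f}%")   # -30.0%
```

Even a simple report like this answers the baseline questions above: usage is climbing, and the activity it was meant to replace is falling – though, as noted, it says nothing yet about quality.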
The next common assessment approach is to evaluate user satisfaction. We can simply ask or survey the users of these technologies: How do they feel about them? Are they proving helpful or useful? Are they an improvement on the prior tools and processes? And more importantly, do these new technologies actually help users accomplish what they are trying to accomplish, in an efficient and user-friendly manner? There are plenty of questions we can ask to help evaluate effectiveness; the challenge is determining the right questions to ask, how many to ask, how often to ask them, and how to easily collect and analyze the responses.
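The "collect and analyze" part need not be complicated. As a hedged sketch – with an invented 1-to-5 scale and made-up responses – a short in-chat pulse survey can be summarized in a few lines:

```python
# Hypothetical sketch: summarizing a short in-chat satisfaction
# survey. Responses use a 1-5 Likert scale (an assumption, not a
# standard from any particular vendor); we report the mean score
# and the share of "satisfied" users (those rating 4 or 5).

def summarize(responses: list[int]) -> tuple[float, float]:
    """Return (mean score, percent of users rating 4 or 5)."""
    if not responses:
        raise ValueError("no responses to summarize")
    mean = sum(responses) / len(responses)
    satisfied = sum(1 for r in responses if r >= 4) / len(responses) * 100
    return mean, satisfied

responses = [5, 4, 3, 5, 2, 4, 4]  # made-up survey results
mean, satisfied = summarize(responses)
print(f"mean: {mean:.2f}, satisfied: {satisfied:.0f}%")  # mean: 3.86, satisfied: 71%
```

Reporting both the mean and a "top-box" satisfied percentage is a common design choice, since a decent average can hide a sizable unhappy minority.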
Finally, there are some emerging frameworks and approaches for evaluating the technologies themselves, ones that can hopefully be refined and applied more generally to these kinds of AI tools. Recently, researchers at Google released findings and insights from their internal chatbot development efforts and shared ideas on how these tools can be assessed. Google has developed an evaluation metric called the Sensibleness and Specificity Average (SSA), which captures two key elements of a chatbot's responses to human questions. First, does the chatbot's response make sense? Generally speaking, we can use common sense to judge whether a response is reasonable given the context in which the chatbot is engaged. If anything in the response seems off – confusing, illogical, out of context, or factually wrong – it should be rated "does not make sense." Second, Google recommends that chatbot responses be evaluated on their specificity. A simple example in a recruiting chatbot context: if a job seeker says "I am interested in marketing roles in the USA" and the chatbot replies "That's nice," the response would be scored "not specific." That response, while not "wrong," could apply in almost any context. But if the chatbot responded "Great, let's learn more about your marketing background to find roles that fit," the response would be scored "specific," as it was tailored to the particular conversation.
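Putting those two dimensions together, the metric amounts to averaging the sensibleness rate and the specificity rate across rated responses. Here is a minimal sketch of that calculation, assuming human raters label each response on both dimensions and that a response judged not sensible is also counted as not specific (the data and structure below are illustrative, not Google's actual evaluation code):

```python
# Hypothetical sketch of computing a Sensibleness and Specificity
# Average (SSA) from human ratings of chatbot responses.

from dataclasses import dataclass

@dataclass
class Rating:
    sensible: bool  # does the response make sense in context?
    specific: bool  # is it tailored to this particular conversation?

def ssa(ratings: list[Rating]) -> float:
    """Average the sensibleness and specificity rates across responses."""
    if not ratings:
        raise ValueError("no ratings to average")
    sensibleness = sum(r.sensible for r in ratings) / len(ratings)
    # Assumption: a response only counts as specific if it is also sensible.
    specificity = sum(r.sensible and r.specific for r in ratings) / len(ratings)
    return (sensibleness + specificity) / 2

# Example: three rated chatbot replies
ratings = [
    Rating(sensible=True, specific=True),    # tailored, on-topic reply
    Rating(sensible=True, specific=False),   # "That's nice" -- generic
    Rating(sensible=False, specific=False),  # confusing or off-topic
]
print(ssa(ratings))  # 0.5  (sensibleness 2/3, specificity 1/3)
```

The appeal of a single averaged score is that it penalizes both failure modes at once: a bot that is always safe but generic scores no better than one that is specific but frequently nonsensical.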
There is plenty more detail about the SSA metric – and about Google's efforts both to design better, more helpful chatbot technology and to automate its assessment – on the Google AI blog (linked above). In particular, the researchers note that other attributes such as personality, safety, and bias also need to be assessed. But the main idea I hope HR and HR tech leaders take from this is to think about how, how often, and by what criteria they will evaluate these new AI tools that are becoming increasingly prevalent in HR and workplace technology. Careful and accurate evaluation will be essential if these technologies are to deliver on their promise as enablers of increased efficiency, accuracy, and improved workplace experiences.