Kirkpatrick's Four-Level Evaluation Model
Perhaps the best-known methodology for evaluating training is Donald Kirkpatrick's Four-Level Evaluation Model, which was first published in a series of articles in 1959 in the Journal of American Society of Training Directors (now known as T+D Magazine). The series was later compiled and published in a book, "Evaluating Training Programs," in 1975. While Kirkpatrick has written a number of books on the subject, his best-known work is the 1994 edition of "Evaluating Training Programs." Kirkpatrick is now Professor Emeritus at the University of Wisconsin and is associated with Kirkpatrick Partners, LLC.
The four levels of evaluation are (Kirkpatrick, 1994):
- Reaction - how the learners react to the learning process
- Learning - the extent to which the learners gain knowledge and skills
- Behavior - capability to perform the learned skills while on the job
- Results - includes such items as monetary, efficiency, morale, etc.
Note that some use the term "transfer" in lieu of "behavior" to identify the transfer of learning to the workplace; however, "performance" is now often the preferred word for "behavior." As Gilbert noted, performance has two aspects: behavior being the means and its consequence being the end. In addition, "impact" is often used for "results," such as impact on the business unit.
The chart below shows how the evaluation process fits together:

Level One - Reaction
As the word implies, evaluation at this level measures how the learners react to the training. This level is often measured with attitude questionnaires that are passed out after most training classes. This level measures one thing: the learner's perception (reaction) of the course. Learners are often keenly aware of what they need to know to accomplish a task. If the training program fails to satisfy their needs, a determination should be made as to whether it's the fault of the program design or delivery.
This level is not indicative of the training's performance potential, as it does not measure what new skills the learners have acquired or what they have learned that will transfer back to the working environment. This has caused some evaluators to downplay its value. However, the interest, attention, and motivation of the participants are often critical to the success of any training process -- people often learn better when they react positively to the learning environment by seeing the importance of it.
When a learning package is first presented, whether it be e-learning, classroom training, CBT, etc., the learner has to make a decision as to whether he or she will pay attention to it. If the goal or task is judged as important and doable, then the learner is normally motivated to engage in it (Markus & Ruvulo, 1990). However, if the task is presented as low-relevance or there is a low probability of success, then a negative effect is generated and motivation for task engagement is low.
This differs somewhat from Kirkpatrick (1996). He writes, "Reaction may best be considered as how well the trainees liked a particular training program." However, the less relevant the learning package is to a learner, the more effort has to be put into the design and presentation of the learning package. That is, if it is not relevant to the learner, then the learning package has to "hook" the learner through slick design, humor, games, etc. This is not to say that design, humor, or games are unimportant; however, their use in a learning package should be to promote or aid the "learning process" rather than the "learning package" itself. And if a learning package is built of sound purpose and design, then it should support the learners in bridging a performance gap. Hence, they should be motivated to learn! If not, something went dreadfully wrong during the planning and building processes! So if you find yourself having to hook the learners through slick design, then you probably need to reevaluate the purpose of the learning program.
For more information on reaction, see Self-System.
Level Two - Learning
This is the extent to which participants change attitudes, improve knowledge, and increase skill as a result of participating in the learning process. It addresses the question: Did the participants learn anything? The learning evaluation requires some type of post-testing to ascertain what skills were learned during the training. In addition, the post-testing is only valid when combined with pre-testing, so that you can differentiate between what they already knew prior to training and what they actually learned during the training program.
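The pre-test/post-test comparison can be sketched in a few lines of Python. This is only an illustration, assuming each learner takes the same test before and after training; the learner names, the scores, and the `learning_gain` helper are hypothetical and are not part of Kirkpatrick's model.

```python
def learning_gain(pre_scores, post_scores):
    """Return each learner's post-test score minus their pre-test score."""
    gains = {}
    for learner, pre in pre_scores.items():
        # Only learners who took both tests can be compared.
        if learner in post_scores:
            gains[learner] = post_scores[learner] - pre
    return gains


# Hypothetical scores (percent correct), for illustration only.
pre = {"Ana": 40, "Ben": 85, "Cy": 55}
post = {"Ana": 80, "Ben": 90, "Cy": 75}

gains = learning_gain(pre, post)
average_gain = sum(gains.values()) / len(gains)
print(gains)
print(average_gain)
```

In this made-up data, Ben has the highest post-test score but the smallest gain, because most of what he scored was already known before training. A post-test alone would hide that.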
Measuring the learning that takes place in a training program is important in order to validate the learning objectives. Evaluating the learning that has taken place typically focuses on such questions as:
- What knowledge was acquired?
- What skills were developed or enhanced?
- What attitudes were changed?
Learner assessments are created to allow a judgment to be made about the learner's capability for performance. There are two parts to this process: the gathering of information or evidence (testing the learner) and the judging of the information (what does the data represent?). This assessment should not be confused with evaluation. Assessment is about the progress and achievements of the individual learners, while evaluation is about the learning program as a whole (Tovey, 1997, p. 88).
Evaluation in this process comes through the learner assessment that was built in the design phase. Note that the assessment instrument normally has more benefits to the designer than to the learner. Why? For the designer, the building of the assessment helps to define what the learning must produce. For the learner, assessments are statistical instruments that often poorly correlate with the realities of performance on the job and they rate learners low on the "assumed" correlatives of the job requirements (Gilbert, 1998). Thus, the next level, performance, is the preferred method of assuring that the learning transfers to the job, but sadly, it is quite rarely performed.
Level Three - Performance (behavior)
This evaluation involves testing the students' capabilities to perform learned skills while on the job, rather than in the classroom. Level three evaluations can be performed formally (testing) or informally (observation). It determines if the correct performance is now occurring by answering the question, "Do people use their newly acquired learnings on the job?"
In Kirkpatrick's original four levels of evaluation, he names this level "behavior." However, behavior is the action that is performed, while the final result of the behavior is the performance. Gilbert said that performance has two aspects: behavior being the means and its consequence being the end (1998). If we were only worried about the behavioral aspect, then this could be done in the training environment. However, the consequence of the behavior (performance) is what we are really after: can the learner now perform and produce the needed results in the working environment?
It is important to measure performance because the primary purpose of training is to improve results by having the students learn new skills and knowledge and then actually apply them to the job. Learning new skills and knowledge is no good to an organization unless the participants actually use them in their work activities. Since level three measurements must take place after the learners have returned to their jobs, the actual measurements will typically involve someone closely involved with the learner, such as a supervisor.
Although it takes a greater effort to collect this data than it does to collect data during training, its value is important to the training department and organization as the data provides insight into the transfer of learning from the classroom to the work environment and the barriers encountered when attempting to implement the new techniques learned in the program.
Level Four - Results
These are the final results that occur. This level measures the training program's effectiveness, that is, "What impact has the training achieved?" These impacts can include such items as monetary, efficiency, morale, teamwork, etc.
As we move from level one to level four, the evaluation process becomes more difficult and time-consuming; however, the higher levels provide information that is of increasingly significant value. Perhaps the most frequently used type of measurement is level one because it is the easiest to measure, yet it provides the least valuable data. Measuring results that affect the organization is considerably more difficult, so it is conducted less frequently, although it yields the most valuable information.
The first three levels of Kirkpatrick's evaluation (Reaction, Learning, and Performance) are largely "soft" measurements; however, decision-makers who approve such training programs prefer results (returns or impacts). That does not mean the first three are useless; indeed, their use is in tracking problems within the learning package:
- Reaction informs you how relevant the training is to the work the learners perform (it measures how well the training requirement analysis processes worked).
- Learning informs you of the degree to which the training package worked to transfer KSAs from the training material to the learners (it measures how well the design and development processes worked).
- The performance level informs you of the degree to which the learning can actually be applied to the learner's job (it measures how well the performance analysis process worked).
- Impact informs you of the "return" the organization receives from the training. Decision-makers prefer this harder "result," although not necessarily in dollars and cents. For example, a recent study of financial and information technology executives found that they consider both hard and soft "returns" when it comes to customer-centric technologies, but give more weight to non-financial metrics (soft), such as customer satisfaction and loyalty (Hayes, 2003).
Note the difference between "information" and "returns." That is, the first three levels give you "information" for improving the learning package, while the fourth level gives you the "returns" for investing in the learning process. A hard result is generally given in dollars and cents, while soft results are more informational in nature. There are exceptions. For example, if the organizational vision is to provide learning opportunities (perhaps to increase retention), then a level-two or level-three evaluation could be used to provide a soft return.
Jack Phillips (1996), who probably knows Kirkpatrick's four-levels better than anyone, writes that the value of information becomes greater as we go up these levels of information (from reaction to results/impacts). For example, the evaluation of results has the highest value of information to the organization, while reaction provides the least information (although like any information, it can be useful). And like most levels of information, the ones that provide the best value are often more difficult to obtain. Thus we readily do the easy ones (levels one and two) and obtain a little information about our training efforts, while bypassing the more difficult ones (three and four) that would provide the most valuable information for the organization.
This final measurement of the training program might be met with a more "balanced" approach or a "balanced scorecard" (Kaplan & Norton, 2001), which looks at the impact or return from four perspectives:
- Financial: A measurement, such as an ROI, that shows a monetary return, or the impact itself, such as how the output is affected. Financial can be either soft or hard results.
- Customer: Improving an area in which the organization differentiates itself from competitors to attract, retain, and deepen relationships with its targeted customers.
- Internal: Achieve excellence by improving such processes as supply-chain management, production process, or support process.
- Innovation and Learning: Ensuring the learning package supports a climate for organizational change, innovation, and the growth of individuals.
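As a rough illustration of the financial perspective, ROI is commonly expressed as net benefits divided by costs, as a percentage. The sketch below assumes that convention; the `roi_percent` function and the dollar figures are made up for the example and do not come from Kaplan and Norton.

```python
def roi_percent(benefits, costs):
    """Return ROI as a percentage: net benefits divided by costs."""
    if costs <= 0:
        raise ValueError("costs must be positive")
    return (benefits - costs) * 100 / costs


# Hypothetical figures: a program that costs 50,000 and yields 80,000 in benefits.
print(roi_percent(80_000, 50_000))
```

With these invented figures the program returns 60 percent on its cost. The hard part in practice is not the arithmetic but deciding which benefits can credibly be attributed to the training.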
Criticisms
Kirkpatrick's four levels treat evaluation as an end-of-process activity, whereas the objective should be to treat evaluation as an ongoing activity that begins during the pre-training phase.
Actually, this criticism is inaccurate. For example, "The ASTD Training & Development Handbook" (1996), edited by Robert Craig, includes a chapter by Kirkpatrick with the simple title of "Evaluation." In the chapter, Kirkpatrick discusses control groups and before-and-after approaches (such as pre- and post-tests). He goes on to say that level four should also include a post-training appraisal three or more months after the training to ensure the learners put into practice what they have learned. Kirkpatrick further notes that he believes evaluations should be included throughout the training by getting evaluations not only during each session or module, but also after each subject or topic.
The four levels of evaluation mean very little to the other business units
One of the best training and development books out is "The Six Disciplines of Breakthrough Learning" by Wick, Pollock, Jefferson, and Flanagan (2006). They offer perhaps the best criticism that I have seen: "Unfortunately, it is not a construct widely shared by business leaders, who are principally concerned with learning's business impact. Thus, when learning leaders write and speak in terms of 'levels' of evaluation to their business colleagues, it reflects a learning-centric perspective that tends to confuse rather than clarify issues and contribute to the lack of understanding between business and learning functions."
So it might turn out that the best criticism is not leveled at the four levels themselves, but rather at the way we use them when speaking to other business leaders. We tell the business units that the level-one evaluation shows the learners were happy and that the level-two evaluation shows they all passed the test with flying colors, and so on up the line. Yet according to the surveys that I have seen, level four (impact) is rarely used. While the lower levels of evaluation can be quite useful within the training function (they help us to discuss what type of evaluation we are speaking of), outside of training and development they fall flat. For the most part, the business units' main concern is the IMPACT -- did the resources spent on the learning process contribute to the overall health and prosperity of the enterprise?
According to Alliger and Janak (1989), the Kirkpatrick model rests on three problematic assumptions: 1) the levels are arranged in ascending order of information provided, 2) the levels are causally linked, and 3) the levels are positively inter-correlated.
The main problem with the paper is that it puts no limits on what is training and what is not training. For example, they include spirit-building, inculcation of company history or philosophy, and individual growth programs as "training." Therefore, according to the authors, "not all training in organizations is meant to effect change at all four levels." However, the mistake the authors made is placing all "learning" programs, such as education and development, under the heading of "training." If you are going to include every formal learning program as "training" as they have, then of course the four levels are not "arranged in ascending order of information provided." Although there are a variety of definitions for training, it is generally considered an HRD intervention or process for fixing a performance problem through some type of learning program. Hence, there is going to be some type of "impact" or "result." On the other hand, development or education programs are more concerned with the growth of the individual; hence, there might not be an immediate impact or result. The use of learning examples, rather than training examples, indicates that they have fallen into the first trap of meta-analysis -- comparing apples to oranges. That is, they seem to be assuming that all learning processes within an organization are considered training. They have failed to fully identify all constructs underlying the phenomena of interest, thus there is no way we can validate their work.
The only part of Kirkpatrick's Four Levels that has failed to hold up to scrutiny over time is the first level - reaction. For example, a Century 21 trainer with some of the lowest level-one scores was responsible for the highest performance outcomes post-training (level four), as measured by his graduates' productivity. This is not just an isolated incident -- in study after study the evidence shows very little correlation between level-one evaluations and how well people actually perform when they return to their job (Boehle, 2006).
Rather than measuring reaction, what we are now discovering is that we should be preframing the learners by having their managers discuss the importance of attending a training process (on-ramping) and then following up with them after they return to ensure they are using their new skills (Wick et al., 2006).
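One way to check this relationship in your own data is to compute a correlation coefficient between reaction ratings and a later performance measure. The sketch below is a minimal Pearson correlation in Python; the `pearson_r` helper and every number in it are invented for illustration and are not taken from Boehle or any study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either list has no variation.
    return cov / sqrt(var_x * var_y)


# Made-up numbers: average level-one rating (1-5) for five classes and a
# post-training productivity measure for the same classes.
reaction = [4.8, 4.5, 3.9, 3.2, 4.1]
productivity = [102, 98, 105, 110, 99]

r = pearson_r(reaction, productivity)
print(round(r, 2))
```

A value near zero would suggest that level-one scores say little about later performance, while values near 1 or -1 would indicate a strong positive or negative relationship. With only a handful of classes, any such figure should be treated with caution.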
Improving the Model
Because of its age and with all the new technology advances, the four-level evaluation model is often criticized nowadays for being too old and simple. Yet, almost five decades after its introduction, there has not been a viable option to replace it. And I think the reason why is that Kirkpatrick basically nailed it, but presented it wrong. Rather than being just an evaluation tool, it should have been presented as both a planning and evaluation tool. To do this, it needs one simple adjustment... flip it upside-down! (Clark, 2008) That is, rearrange the steps into a "backwards planning" tool by starting with the end in mind:
Thus, planning and analysis need to work backward by identifying:
- the desired impact (outcome or result) that will improve the performance of the business
- the level of performance the learners must be able to do to create the impact
- the knowledge and skills they need to learn in order to perform
- what they need to perceive in order to learn (the need to learn)
Planning it backwards will help to ensure there is a circular causality:
The learners' perception of the need to learn should motivate them to learn, which in turn causes the desired performance that drives the impact desired by our customer (client). This causality should continue in a circular fashion in that the results achieved should now drive the performers' perceptions of the need to learn more and perform better in order to achieve even better results. Of course, this assumes not only that the customer understands the level of impact achieved, but also that the performers/learners perceive how close they came to achieving the desired result.
References
Alliger, G. M., & Janak, E. A. (1989). Kirkpatrick's levels of training criteria: Thirty years later. Personnel Psychology, 42 (2), 331-342.
Boehle, S. (2006). Are You Too Nice to Train? Training Magazine. Retrieved from web Feb. 8, 2009: http://www.trainingmag.com/msg/content_display/training/e3iwtqVX4kKzJL%2BEcpyFJFrFA%3D%3D?imw=Y
Clark, D. (2008). Flipping Kirkpatrick. bdld.blogspot.com. Dec. 17, 2008. Retrieved from web April 27, 2009: http://bdld.blogspot.com/2008/12/flipping-kirkpatrick.html
Craig, R. L. (1996). The ASTD Training & Development Handbook. New York: McGraw-Hill.
Gilbert, T. (1998). A Leisurely Look at Worthy Performance. The 1998 ASTD Training and Performance Yearbook. Woods, J. & Cortada, J. (editors). New York: McGraw-Hill.
Hayes, M. (2003, Feb 3). Just Who's Talking ROI? Information Week. p. 18.
Kaplan, R. S., & Norton, D. P. (2001). The Strategy-Focused Organization: How Balanced Scorecard Companies Thrive in the New Business Environment. Boston, MA: Harvard Business School Press.
Kirkpatrick, D. L. (1959). Techniques for evaluating training programs. Journal of American Society of Training Directors, 13 (3), 21-26.
Kirkpatrick, D. L. (1975). Techniques for evaluating training programs. Evaluating Training Programs. D. L. Kirkpatrick (ed.). Alexandria, VA: ASTD.
Kirkpatrick, D. L. (1994). Evaluating Training Programs. San Francisco: Berrett-Koehler Publishers, Inc.
Markus, H., & Ruvulo, A. (1990). Possible selves: Personalized representations of goals. Goal Concepts in Psychology. Pervin, L. (ed.). Hillsdale, NJ: Lawrence Erlbaum. pp. 211-241.
Phillips, J. (1996). Measuring the Results of Training. The ASTD Training & Development Handbook. Craig, R. (ed.). New York: McGraw-Hill.
Tovey, M. (1997). Training in Australia. Sydney: Prentice Hall Australia. (Note: this is perhaps one of the best books on the ISD (ADDIE) process.)
Wick, C. W., Pollock, R. V. H., Jefferson, A. K., & Flanagan, R. D. (2006). The Six Disciplines of Breakthrough Learning. San Francisco, CA: Pfeiffer.