Chances are, your organization is doing some form of Incident Management. And by “chances,” I mean there is a 100% chance, unless of course you’ve stumbled into some sort of utopia where issues don’t occur (psst, let me know so I can join you!).
So, is your SCSM Incident Management process working for you? Let’s see if any of these sound familiar…
- Are you a Service Desk member and unsure of troubleshooting steps for specific types of issues? Do you know when to draw the line, and when/to whom to escalate?
- Are you a higher tier support team member who has issues casually punted your way with the necessary info missing? Do issues outside your domain reach your queue, just for you to send them somewhere else?
- Are you someone who needs to run reports, and see 90% of your incidents classified as “Other”? Or perhaps some incidents are classified as “iPhone Issue”, some as “Android Issue,” and another bunch classified as “Mobile Device Issue”… well, how many iPhone issues do you have versus Android? Who knows!?
Surely in all the above cases, those other Analysts just didn’t do a good job, right?! Maybe if it is a one-off case, some training can mitigate, but if this type of thing occurs systemically, I’d suggest taking a hard look at the process. If you’ve dabbled in ITIL-based Incident Management processes before, you’ll be familiar with a few flavors of Incident Management process charts. These charts mostly boil down to the following steps, with arrows flying in a few competing directions:
- Identification and Logging
- Categorization
- Prioritization
- Investigation and Diagnosis
- Escalation
- Resolution, Recovery, and Closure
Let’s look at each step here, and identify some areas where folks tend to go wrong. Knowing is half the battle! Once we identify where we stumble, we can alter our broken processes and fix issues at their core. Okay, okay, I also know you’re a busy person, so I’ll put a “too long, didn’t read!” (TLDR) in each case. Here we go…
Identification and Logging
First on our list is the Identification and Logging step. Tickets are raised by users in all sorts of ways: email, phone, self-service, walk-ins, and maybe even social media tools like SnapFace or InstaChat. Regardless of identification method, that item needs to be logged within SCSM Incident Management, right? Wrong! Well, partially wrong. Yes, it should be logged in SCSM, but there’s a good chance some of those items are not actually incidents. Some of those items are likely Service Requests, which is an entirely separate module in both SCSM and the ITIL world.
If these items are all summed up as Incidents, you aren’t truly getting the value out of SCSM Incident Management, which is purely about restoring services as quickly as possible, often via workaround. If you are handling password resets, computer deployments, or even just general questions via SCSM Incident Management, you’re using the module for something it is not. Reporting data becomes saturated, process for the process-heavy-service-request-module goes out the window, and prioritization gets all mixed up. Quite a snowball effect.
TLDR: Train your Service Desk on the difference between an Incident and a Service Request, and ensure requests are logged as their appropriate type.
Categorization
Next on our list is the Categorization step. Assuming we have a true Incident now logged as an incident, and the proper information collected (foreshadowing alert!!), we need to categorize it. One of the most common mistakes I see on new SCSM Incident Management implementations is with the category list. The mindset is typically “we will have a really thorough list of categories, and that will allow us to get really detailed reports!” Sounds great on paper, but in reality: analysts are busy. They are not going to weed through a list of 50+ categories to find the best one– they have other things to get done! Chances are, if they can’t make the decision in 5 seconds or less, they are going to pick the closest-value-that-almost-maybe-makes-sense and move on.
The path of least resistance here is an “Other” value. If there is an “Other” value, you can bet you are going to have that value selected for anything not screamingly obvious. The solution here? Keep your lists simple. There are more ways to report on types of incidents than just the category field: related configuration items and resolution category come to mind as great options.
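One rough way to check whether your category list is holding up is to count how often each value gets picked and see how much lands in the catch-all. Here is a minimal Python sketch of that idea. It is not an SCSM API: the `category_report` helper, the sample data, and the 20% threshold are all assumptions made up for illustration.

```python
from collections import Counter


def category_report(categories, catch_all="Other", max_share=0.20):
    """Count category usage and flag an overused catch-all value.

    `catch_all` and `max_share` are illustrative assumptions, not SCSM settings.
    """
    counts = Counter(categories)
    total = sum(counts.values())
    share = counts[catch_all] / total if total else 0.0
    return counts, share > max_share


# Hypothetical export: the category chosen on each logged incident.
incident_categories = [
    "Other", "Email", "Other", "Hardware",
    "Other", "Network", "Other", "Email",
]

counts, other_overused = category_report(incident_categories)
print(counts.most_common(3))
print("Catch-all overused:", other_overused)
```

If the catch-all dominates, that is usually a hint the list is too long or too vague, not that analysts are being lazy.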
TLDR: Keep your category list simple: 7-12 values and, ideally, only one level of hierarchy. Also, try to avoid using an “Other” option.
Prioritization
Priority in SCSM Incident Management is calculated based on a matrix, composed of “Impact” on one side and “Urgency” on the other. For example, an Impact of “High” and an Urgency of “High” might be a Priority 1, whereas an Impact of “Low” and Urgency of “Low” might be a Priority 4.
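If it helps to picture the matrix, it is really just a lookup table keyed on the two values. The Python sketch below is illustrative only. High/High = 1 and Low/Low = 4 come from the example above; the “Medium” level and every other cell are assumptions, and your own matrix may look quite different.

```python
# Impact/urgency -> priority lookup.
# Only ("High", "High") -> 1 and ("Low", "Low") -> 4 come from the article's
# example; the "Medium" level and all other cells are assumed placeholders.
PRIORITY_MATRIX = {
    ("High", "High"): 1,
    ("High", "Medium"): 2,
    ("Medium", "High"): 2,
    ("High", "Low"): 3,
    ("Medium", "Medium"): 3,
    ("Low", "High"): 3,
    ("Medium", "Low"): 4,
    ("Low", "Medium"): 4,
    ("Low", "Low"): 4,
}


def priority(impact, urgency):
    """Look up the priority for an (impact, urgency) pair."""
    return PRIORITY_MATRIX[(impact, urgency)]


print(priority("High", "High"))
print(priority("Low", "Low"))
```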
So now that we’ve cleared that up– and we have an Incident logged and categorized properly– we need to prioritize it. Well, the customer sounded really upset when they called, and they said their tablet wasn’t working and it was really important to them. That’s got to be high impact and high urgency, right? Sure, and when someone calls in and says, “I have a minor issue, and you can deal with it in a few months if you want, no rush,” that can be your one-instance-ever of a low/low incident. Reality is, this is too subjective. High and low are meaningless terms, and we need to define them better.
How can we do better for impact? Let’s say we have an issue that affects one person, versus an issue that affects a department, versus an issue that affects the entire company. I think we could all agree that those cases, in order, go from least impactful to most impactful. So, let’s take the subjective-ness out of it and just call them:
- Single User
- Group of Users (5-10)
- Company Wide
Setting impact in that way is pretty darned sure to be objective, and consistency is key when prioritizing one incident versus another.
How about urgency then? Urgency has a few more ways it can be sliced, but it’s important to keep in mind that this urgency should come from the business perspective, not the user perspective. Perhaps urgency gets divided into “Business Critical” versus “Not Business Critical.” That would rely on your business identifying systems that are considered critical and communicating that, so feasibility will vary.
Beyond that for urgency, there are also varying levels of breakage. A slow computer is distinctly less urgent than a computer that won’t turn on at all. Defining urgency in tiers like this, with more tangible terms, helps to get consistent values. If you are getting consistent impact and consistent urgency values, you have achieved a consistent priority value.
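To make “objective” a little more concrete, here is one way the questions above could be turned into rules that always give the same answer for the same facts. This is a sketch in Python under stated assumptions: the function names, the `company_size` cutoff, and the Low/Medium/High urgency labels are hypothetical. Only the three impact labels and the “business critical” and “level of breakage” ideas come from the discussion above.

```python
def impact_for(affected_users, company_size):
    """Derive impact from how many people are affected (cutoffs are assumptions)."""
    if affected_users <= 1:
        return "Single User"
    if affected_users >= company_size:
        return "Company Wide"
    return "Group of Users"


def urgency_for(business_critical, fully_down):
    """Derive urgency from two yes/no facts instead of how upset the caller sounds.

    The Low/Medium/High labels and the scoring are illustrative assumptions.
    """
    score = int(business_critical) + int(fully_down)
    return ["Low", "Medium", "High"][score]


# A slow laptop for one person vs. a business-critical system down for everyone.
print(impact_for(1, 500), "/", urgency_for(False, False))
print(impact_for(500, 500), "/", urgency_for(True, True))
```

Two analysts answering the same questions get the same impact and urgency, which is the whole point.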
TLDR: Use tangible and objective terms for impact and urgency values to yield consistent prioritization.
Investigation and Diagnosis
We now have an Incident that is logged, categorized, and prioritized properly. Let the troubleshooting commence! But… this particular Incident relates to a system I’ve never heard of before, and I have no clue where to begin troubleshooting. I guess my only option is to escalate, sans troubleshooting. Higher tier support gets the incident, and rolls their eyes: “it’s so simple, how could they not know how to fix this?!”… Or maybe… “it’s only fixable if I have information about xyz”… Or maybe “I’m not the group that handles that!”
A few things have gone wrong here, but it all comes down to communication and training. If higher tier support expects the Service Desk to collect certain information or to troubleshoot certain areas before escalation, higher tier support needs to empower the Service Desk to do so. Some of the information gathering can be handled via the intake form (e.g. Cireson Advanced Request Offering), but more might be needed beyond that; after all, we don’t really know what’s going on in all cases. Any area that the Service Desk is expected to handle should be identified, documented, and cross-trained on. Moreover, this should be encouraged and consistently expanded upon! By improving the knowledge and toolkits of the Service Desk, you grow their boundaries of support and empower them to be efficient and effective incident-crushers; end users get their problems resolved more efficiently, and higher tier groups can turn their attention towards other items.
TLDR: Patterns of poor issue diagnosis are a challenge for higher tier support to identify and address. Higher tier support must collaborate with the Service Desk to communicate the necessary technical knowledge and expectations.
Escalation
The previous step crosses over with escalation quite a bit. Let’s assume that all expected investigation and diagnosis is completed within reason, and there is truly a complex issue for higher tier support to address. How does the Service Desk know where to send this incident next? Cross-training with the Service Desk might be part of the solution, perhaps a simple routing table could be documented, or perhaps there could be some complex automation tied into the impacted Configuration Items. Either way, make sure that this information is out there. Put yourself in the shoes of a brand-new Service Desk employee and ask if they would know where an incident should route next.
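The “simple routing table” option can be as plain as a documented mapping from category to support group, with a fallback for anything unlisted. A minimal Python sketch is below. The categories, group names, and default are invented for illustration and are not SCSM values.

```python
# Hypothetical routing table: incident category -> support group.
ROUTING_TABLE = {
    "Email": "Messaging Team",
    "Network": "Network Operations",
    "Hardware": "Desktop Support",
}

# Assumed fallback so nothing gets routed on a guess.
DEFAULT_GROUP = "Service Desk Lead"


def next_group(category):
    """Return the group an incident should be escalated to."""
    return ROUTING_TABLE.get(category, DEFAULT_GROUP)


print(next_group("Network"))
print(next_group("Printer"))
```

The explicit fallback matters: a brand-new analyst with an unlisted category has somewhere to send it other than “whoever seems closest.”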
Assuming the Incident goes to the right place, higher tier support then does what they do, fixes the issue, and closes out the incident. Or maybe they are busy, and they’ll get to it next week, even though your service-level objectives say otherwise. The Service Desk representative might not be aware, and they promised the user they would be in touch with a swift resolution. Accountability and responsibility are distinctly different when it comes to SCSM Incident Management. Accountability is typically at the Service Desk level, meaning a way to track escalated incidents is needed so they can continue to serve as the user-facing point-of-contact. This is exactly the purpose of the “Primary Owner” field on an incident: it serves to keep the original Service Desk analyst connected to the incident. If troubleshooting falls behind and the user becomes unhappy, it is up to the Service Desk to ensure that expected levels of service are being delivered. The Service Desk should keep open communication with the user, should be in touch with higher tier support if resolution stalls, and is ultimately accountable for reaching the expected level of service. The Service Desk might not be fixing the problem directly anymore (that is the responsible party’s job), but they need to be involved.
TLDR: Provide the Service Desk with the knowledge to re-assign incidents when tier 1 troubleshooting hits its limits. Routing tables, integration with Configuration Items, and automation can all help to make this a smooth process. Further, when escalated, ensure that the Service Desk maintains visibility so they can continue to serve as the voice of the customer.
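As a rough illustration of what “maintaining visibility” could look like, the Python sketch below lists the escalated incidents a given primary owner should chase because they have sat unresolved past a service-level objective. Everything here is hypothetical: the dictionary fields, the incident IDs, and the 8-hour objective stand in for whatever your own data and targets are.

```python
from datetime import datetime, timedelta


def stalled_escalations(incidents, primary_owner, now, slo=timedelta(hours=8)):
    """Return IDs of unresolved incidents for this owner that have exceeded the SLO.

    The 8-hour default is an assumed objective, not an SCSM setting.
    """
    return [
        inc["id"]
        for inc in incidents
        if inc["primary_owner"] == primary_owner
        and inc["status"] != "Resolved"
        and now - inc["escalated_at"] > slo
    ]


# Hypothetical sample data.
now = datetime(2024, 1, 10, 17, 0)
incidents = [
    {"id": "IR1001", "primary_owner": "alex", "status": "Active",
     "escalated_at": datetime(2024, 1, 10, 8, 0)},
    {"id": "IR1002", "primary_owner": "alex", "status": "Active",
     "escalated_at": datetime(2024, 1, 10, 15, 0)},
    {"id": "IR1003", "primary_owner": "alex", "status": "Resolved",
     "escalated_at": datetime(2024, 1, 8, 8, 0)},
    {"id": "IR1004", "primary_owner": "sam", "status": "Active",
     "escalated_at": datetime(2024, 1, 9, 8, 0)},
]

print(stalled_escalations(incidents, "alex", now))
```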
Resolution, Recovery, and Closure
Woohoo! We’ve reached the end. The Service Desk was involved post-escalation, so they are in the loop. As such, the user has been looped in, and everyone is happy now that the incident is behind us.
Fast forward a few days. A different user has the same issue. Escalate and resolve all over again.
Fast forward a few days. A different user has the same issue. Escalate and resolve all over again.
Fast forward a few days. A different user has the same issue. Escalate and resolve all over again.
This is madness; it seems we are stuck in some sort of vicious loop. Is this issue something the Service Desk can help with directly? Is there something deeper going on here? The analyst who resolved the original incident has an important task of documenting that resolution and/or workaround. This information can be used by the Service Desk analyst in their initial diagnosis, in a process ITIL refers to as “Incident Matching.” If an issue is repeating like this, there is probably an underlying problem that needs to be addressed. That’s where the Problem Management process comes in, to investigate and resolve the true root cause of these incidents and prevent them in the future. While an incident might close once a workaround is applied, a problem is a longer-term, deeper dive into the root cause. Remember that incidents are not intended for root cause analysis; their intent is to get a user back to a functional state as soon as possible, thus minimizing the impact to the business. If you have an incident that stays open for weeks on end, you are probably dealing with a problem and it should be treated as such.
TLDR: Document fixes so that future instances can apply the same fix, and bring in the Problem Management process as more information/patterns are detected.
There are, of course, many more nuances to Incident Management than we can cover in a single blog post. The important takeaway is this: make sure your fundamentals are sound. Some of this might seem obvious, but going through sanity checks on the above can really help strengthen your foundation, get you to a performing state, and pave the way for future growth.
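Here is a small Python sketch of the matching idea: look through previously resolved incidents for the same configuration item and symptom, reuse the documented fix, and flag a possible problem once the same thing keeps coming back. The `ResolvedIncident` record, the keyword match, the sample history, and the threshold of three are all assumptions for illustration, not how SCSM or ITIL defines it.

```python
from dataclasses import dataclass


@dataclass
class ResolvedIncident:
    title: str
    config_item: str
    resolution: str


def match_incidents(history, config_item, keyword):
    """Find past incidents on the same configuration item with a matching symptom."""
    keyword = keyword.lower()
    return [
        inc for inc in history
        if inc.config_item == config_item and keyword in inc.title.lower()
    ]


def is_problem_candidate(matches, threshold=3):
    """Repeated matches suggest an underlying problem (threshold is an assumption)."""
    return len(matches) >= threshold


# Hypothetical resolution history.
history = [
    ResolvedIncident("VPN timeout on login", "VPN Gateway", "Restarted VPN service"),
    ResolvedIncident("Timeout connecting to VPN", "VPN Gateway", "Restarted VPN service"),
    ResolvedIncident("VPN timeout after update", "VPN Gateway", "Restarted VPN service"),
    ResolvedIncident("Printer offline", "Printer 3F", "Power cycled printer"),
]

matches = match_incidents(history, "VPN Gateway", "timeout")
if matches:
    print("Known workaround:", matches[0].resolution)
print("Raise a problem record?", is_problem_candidate(matches))
```

The first check lets the Service Desk apply a known workaround without escalating. The second is the nudge to hand the pattern over to Problem Management.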