What Are the Best Practices for Triaging Software Bugs

Written by bugsnag | Published 2022/03/22
Tech Story Tags: bugsnag | software-development | bugs | bug-triaging | bug-triaging-principles | bug-triage | software-testing | good-company

TL;DR: This video is the first in a three-part series on how to approach software bug triage. Bug triage comes down to the question: Is your software working in the hands of your users? If your app, your game, or your web service isn't working perfectly for every interaction, then it's time to think about bug triaging. For many systems, some software bugs are tolerable if they happen infrequently, while others are critical and need immediate action even if they happen only once.

We're talking about triaging bugs in software systems. At a very simple level, this topic comes down to the question: Is your software working in the hands of your users? If your app, your game, or your web service isn't working perfectly for every interaction, then it's time to think about bug triaging. For many systems, some software bugs are tolerable if they happen infrequently, while others are critical and need immediate action even if they happen only once. Bug triage is all about making those judgment calls quickly and accurately. This video is the first in a three-part series on how to approach software bug triage.

Transcript

0:15 okay so let's set the stage here
0:17 we're talking about triaging bugs in
0:19 software systems
0:20 so at a very simple level this topic
0:23 comes down to the question
0:24 is your software working in the hands of
0:27 your users
0:28 if your app or your game or your web
0:31 service
0:31 isn't working perfectly for every
0:33 interaction then it's time to think
0:35 about bug triaging
0:37 for many systems there are going to be
0:39 some software bugs that are tolerable if
0:41 they happen infrequently
0:43 while other bugs if they happen even
0:44 once are critical
0:46 and need immediate action so bug triage
0:49 is all about making those judgment calls
0:51 quickly and accurately
0:55 so it's useful to break bugs down into
0:57 three categories for our discussion
0:59 today
1:00 there are those bugs that need immediate
1:01 action these are things that require
1:03 some developer intervention maybe
1:05 rolling back a recent release or
1:07 flipping a feature flag to get the bug
1:09 out of the hands of users
1:10 and get back onto a stable version of
1:12 the software there are those bugs that
1:15 do not require immediate action now but
1:18 may require action in the future
1:20 if they become more impactful let's say
1:22 if they happen more frequently or affect
1:24 more users
1:26 and third there are those bugs that
1:28 regardless of their frequency
1:30 are safe to ignore of these three
1:33 categories it's really the first two
1:35 that are most interesting for our
1:36 purposes today
1:37 and those are the ones we're going to
1:38 focus on
1:42 so when it comes to determining which
1:44 category a given bug falls into
1:46 there are really two main workflows
1:49 there's what we call reactive triage
1:51 and what we call periodic triage
1:53 reactive triage
1:55 is the scenario where a bug occurs or
1:58 something changes with a bug's frequency
2:00 and it requires someone on your team to
2:02 drop what they're doing and go
2:03 investigate it immediately
2:05 so these tend to be bugs that are high
2:08 impact
2:09 or affect a critical area of your system
2:12 so some examples here might be a new bug
2:15 that the system has never seen before
2:17 it might be an issue involving a bug
2:19 that was previously occurring at some
2:22 safe steady state frequency but now
2:24 Bugsnag has detected an anomalous spike in
2:26 the frequency of that bug
2:29 could be a bug in a critical area of
2:31 your system a bug that you previously
2:33 fixed
2:33 that has come back now in a future
2:35 release of the software
2:37 or it could be something to do with a
2:39 stability score
2:40 being off target for your project these
2:43 are all concepts we'll talk about more
2:45 but the crucial point here is that a key
2:47 category of bug triaging
2:49 is all about reactively jumping into
2:53 Bugsnag
2:53 figuring out what's going on and making
2:55 sure that
2:57 some immediate action isn't needed to
3:00 get a bug
3:00 away from your users
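
The reactive-triage triggers listed above (a new bug, a frequency spike, a bug in a critical area, a regression, a stability score below target) can be sketched as a simple check. This is only an illustration under assumptions: the `ErrorSnapshot` fields, the `reactive_triggers` function, and the `spike_factor` threshold are hypothetical names invented here, not part of Bugsnag or its API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ErrorSnapshot:
    """Hypothetical summary of one error (not a Bugsnag data structure)."""
    is_new: bool               # never seen before by the monitoring system
    events_last_hour: int      # current frequency
    baseline_per_hour: float   # the "safe steady state" frequency
    in_critical_area: bool     # e.g. login or checkout code
    regressed_after_fix: bool  # marked fixed earlier, seen again in a later release


def reactive_triggers(err: ErrorSnapshot, stability: float,
                      stability_target: float,
                      spike_factor: float = 3.0) -> List[str]:
    """Return the reasons (if any) this error deserves immediate attention."""
    reasons = []
    if err.is_new:
        reasons.append("new error")
    elif err.events_last_hour > spike_factor * err.baseline_per_hour:
        # Assumed spike rule: more than spike_factor times the usual rate.
        reasons.append("frequency spike")
    if err.in_critical_area:
        reasons.append("critical area")
    if err.regressed_after_fix:
        reasons.append("regression")
    if stability < stability_target:
        reasons.append("stability below target")
    return reasons
```

An empty list would mean the error can wait for periodic triage; any non-empty list is a candidate for an alert.
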
3:04 so when you're thinking about reactive
3:05 triage in
3:07 your project and within your team it's
3:10 really important to think about
3:12 which subset of your errors are going to
3:14 rise to the level of importance that you
3:16 want someone on your team
3:18 to effectively drop what they're doing
3:19 and go triage that bug immediately
3:23 once you've made that determination you
3:24 can configure Bugsnag via
3:26 the alerting and workflow engine to
3:28 notify your team via
3:30 team chat or via on-call alerting system
3:32 whenever one of the bugs that meets your
3:34 custom defined criteria
3:36 occurs so it's worth pointing out that
3:39 the
3:40 alerting and workflow engine is
3:42 highly configurable
3:43 you decide when Bugsnag notifies you
3:45 and through what means
3:47 some examples of how you can use this to
3:50 your team's advantage
3:52 let's say you have a spike in errors
3:54 affecting your VIP customers
3:56 where you define what it means for a
3:58 customer to be a VIP
4:00 Bugsnag can detect that and
4:02 automatically open a PagerDuty incident
4:04 for you
4:05 fitting into your team's existing
4:06 on-call rotation
4:08 and bug remediation process
4:11 or let's say you work in a monolithic
4:13 code base where
4:15 each team works out of a different Slack
4:17 channel but ultimately you share the
4:18 same code
4:20 you can configure Bugsnag to notify
4:22 your Slack channel
4:24 about bugs in your team's part of the
4:26 monolith
4:27 and the possibilities are really
4:30 infinite from there
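
The two routing examples above (paging on-call for a spike affecting VIP customers, and notifying each team's chat channel about its own part of a monolith) can be pictured as a small rule table. This is a sketch under assumptions, not how Bugsnag's alerting and workflow engine is actually configured: the `ErrorEvent` fields, the rule names, the file paths, and the destination labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ErrorEvent:
    """Hypothetical error event; field names are assumptions."""
    file_path: str   # where in the monolith the error was raised
    user_tier: str   # "vip" or "standard"; you define what VIP means
    is_spike: bool   # whether the error is currently spiking


@dataclass
class Rule:
    name: str
    matches: Callable[[ErrorEvent], bool]
    destination: str  # placeholder label, not a real integration


RULES = [
    Rule("vip spike", lambda e: e.is_spike and e.user_tier == "vip",
         "on-call: pagerduty"),
    Rule("payments code", lambda e: e.file_path.startswith("app/payments/"),
         "chat: #team-payments"),
    Rule("search code", lambda e: e.file_path.startswith("app/search/"),
         "chat: #team-search"),
]


def route(event: ErrorEvent, rules: List[Rule] = RULES) -> List[str]:
    """Return every destination whose rule matches the event."""
    return [rule.destination for rule in rules if rule.matches(event)]
```

A spiking error in the payments code that affects a VIP customer would match two rules here, so it would both page on-call and post to the payments team's channel.
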
4:33 so most teams aren't going to triage
4:35 every single error
4:36 using a purely reactive workflow there are
4:40 going to be those bugs that
4:42 aren't critical enough to require
4:44 people to drop what they're doing
4:46 and go triage them immediately of course
4:49 this varies from team to team but this
4:51 is generally true
4:53 all bugs affecting your system need to
4:55 be reviewed and prioritized regularly
4:56 though
4:57 so an initial target that we recommend
5:00 is to have your team
5:01 triage your for review errors once per
5:04 day
5:05 this is especially important to do first
5:07 thing in the workday or
5:08 after lunch any time where there may
5:10 have been a lapse in coverage
5:13 and new bugs may have crept in or
5:16 previously triaged bugs may have come
5:18 back into the for review state
5:20 and we'll talk about all that in greater
5:22 detail in a moment
5:25 let's quickly review the workflow
5:28 actions available
5:29 in Bugsnag when a bug is first detected
5:32 by Bugsnag
5:33 it goes into the open and for review
5:36 workflow states
5:37 and we'll talk more about the for
5:39 review workflow state because that's
5:41 really
5:41 key to triaging so when an error
5:45 is in an open error state there are some
5:47 key workflow actions you can perform on
5:49 the error
5:50 and these map back to those three
5:52 categories of errors we talked about at
5:53 the beginning right
5:54 things you want to fix immediately
5:56 things you may want to fix in the future
5:58 and things you're safe to ignore
6:00 so starting from
6:05 top left to right here snoozing an error
6:08 is something you can do to conditionally
6:09 reopen an error in the future and this
6:11 is something you would do
6:13 if an error is in that category where
6:15 you want to
6:17 keep an eye on it but you're not going
6:18 to fix it right now and you're only
6:19 going to address it
6:20 if it becomes more impactful you can
6:24 create an issue
6:25 to track the work related to an imminent
6:27 fix of a bug
6:28 so for example if you're using Jira
6:32 this would be equivalent to clicking a
6:34 button in Bugsnag which will create a
6:35 Jira ticket
6:37 which will then be used in your sprint
6:39 or other
6:40 work planning process
6:43 to track the work of actually going in
6:45 and making the necessary code or
6:47 infrastructure changes to remove the bug
6:51 you can mark a bug as fixed and this is
6:53 typically what you would do
6:54 for those category one bugs that you've
6:57 decided to fix right now
6:59 when you've taken some action to
7:00 remediate the bug
7:02 and when you mark a bug as fixed it will
7:05 only return to the
7:06 for review state if it's seen again in a
7:08 future version of the code
7:10 and lastly you can ignore an open error
7:14 which will signify that you're not
7:16 planning to take any action on it
7:18 regardless of how frequently it may
7:20 occur in the future
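
The mapping described above, from the three bug categories to the four workflow actions, can be written down as a small lookup. This is a minimal sketch: the `Category` and `Action` enums and the `suggested_action` function are hypothetical names for illustration and do not correspond to a Bugsnag API.

```python
from enum import Enum


class Category(Enum):
    FIX_NOW = 1         # needs immediate action
    MAYBE_LATER = 2     # act only if it becomes more impactful
    SAFE_TO_IGNORE = 3  # no action regardless of frequency


class Action(Enum):
    SNOOZE = "snooze"
    CREATE_ISSUE = "create issue"
    MARK_FIXED = "mark as fixed"
    IGNORE = "ignore"


def suggested_action(category: Category, fix_shipped: bool = False) -> Action:
    """Pick the workflow action that matches a triage decision."""
    if category is Category.FIX_NOW:
        # Track the imminent fix as an issue until it ships, then mark fixed.
        return Action.MARK_FIXED if fix_shipped else Action.CREATE_ISSUE
    if category is Category.MAYBE_LATER:
        # Snoozing reopens the error later if it crosses a threshold.
        return Action.SNOOZE
    return Action.IGNORE
```

For example, `suggested_action(Category.FIX_NOW, fix_shipped=True)` yields `Action.MARK_FIXED`, matching the case where a remediation has already gone out.
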
7:22 during error triage you're typically
7:24 going to be taking these workflow
7:26 actions
7:26 from a specific error details view
7:29 inside Bugsnag
7:30 you can also take these workflow actions
7:33 from the inbox view in Bugsnag
7:35 which also gives you the ability to take
7:37 workflow actions on
7:38 more than one error at once we'll look
7:41 at some examples of doing this in the
7:43 product in just a moment
7:46 a key tip for error triaging in
7:49 Bugsnag
7:49 is to start your triaging workflow with the
7:52 for review filter
7:53 in the Bugsnag inbox so if we look at
7:56 the screenshot below you'll see that
7:58 we're viewing the Bugsnag inbox
8:00 and that it's currently filtered to for
8:02 review errors and you can see this in
8:04 two key places
8:05 in the filter bar it says status for
8:07 review and in the left hand
8:10 column it says for review with 18 in
8:13 parentheses and that has
8:15 an active ui state and this signifies
8:17 that we're currently filtering for for
8:19 review errors there are 18
8:20 errors that need to be reviewed and the
8:22 tooltip there is giving us a hint it
8:23 says open errors that are awaiting
8:25 triage
8:26 so what we need to do if we imagine that
8:29 we're on this
8:30 team that's responsible for the software
8:32 that's being monitored by this Bugsnag
8:34 project here what we need to do is look
8:36 at every one of these errors currently
8:38 affecting our users
8:39 and determine its impact and then we
8:41 need to determine which of these
8:43 workflow actions that we just discussed
8:45 fixing snoozing creating an issue
8:47 ignoring etc.
8:48 is most appropriate given the current
8:50 impact of the bug
8:52 and given the current work that is on
8:54 our team's plate
8:57 so let's take a look at a project in
8:59 Bugsnag
9:00 so we can see some of these things in
9:01 action
9:03 if we go to this photosnap android
9:06 project
9:07 and have a look at the inbox we can see
9:10 that this project has quite a few
9:12 open errors and notice we're filtered to
9:15 errors that have occurred only in the
9:16 past 30 days
9:18 so it's likely that there are even more
9:19 than the 38 open errors currently affecting
9:21 this
9:22 project but let's as we said have a look
9:25 at the for review errors so we can see
9:27 in the last 30 days there are 24
9:29 errors that are for review so we might
9:31 start our triaging here
9:33 and again what we're going to do is
9:35 we're going to look at every one of
9:36 these errors
9:38 in the for review set and we're going to
9:40 figure out
9:41 what the appropriate next step is for
9:43 each of these errors
9:46 one thing you might want to consider
9:47 doing at this point is sorting
9:49 the inbox either by total number of
9:51 events per error
9:52 so you can see this is the error that
9:55 had
9:56 the most events in the past 30 days or
9:59 you could sort by users affected as well
10:01 and you can see this one affected 56
10:03 users
10:05 this happens to be an application not
10:07 responding error
10:09 which is pretty severe so let's go and
10:12 take a look at that
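
Sorting the inbox by event count versus by users affected can surface different errors first. The sketch below assumes a hypothetical `InboxRow` record; only the 80 events and 56 users come from the walkthrough, while the error titles and the other rows are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InboxRow:
    """Hypothetical inbox entry; not a Bugsnag data structure."""
    title: str
    events: int  # total events in the time window
    users: int   # distinct users affected in the time window


rows = [
    InboxRow("Application not responding", 80, 56),  # counts from the walkthrough
    InboxRow("NullPointerException", 120, 9),        # invented row
    InboxRow("Network timeout", 15, 12),             # invented row
]


def sort_inbox(rows: List[InboxRow], by: str = "events") -> List[InboxRow]:
    """Return rows sorted descending by 'events' or 'users'."""
    return sorted(rows, key=lambda row: getattr(row, by), reverse=True)
```

With this made-up data, sorting by events puts the noisy but narrow error first, while sorting by users puts the application-not-responding error first, which is why it can be worth checking both orderings during triage.
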
10:15 so here we are on the error details
10:17 screen
10:18 this gives us an overview of all of the
10:22 specific information to do with this one
10:24 particular
10:26 defect in the application so we can see
10:29 again this affected
10:31 56 users it happened a total of 80 times
10:33 in the past 30 days
10:34 we can go between these tabs and see
10:37 more information about
10:39 how those 80 occurrences are distributed
10:41 across
10:42 specific users we can see which releases
10:45 of the software the bug has occurred in
10:47 os versions of end user devices and so
10:50 on
10:50 and if you're new to bug snag it's worth
10:54 pointing out that
10:55 all of these we call these pivots all
10:58 the information in these pivots can be
10:59 used to
11:00 filter down the view of this error even
11:03 more
11:04 so if we're only looking at os version
11:07 7.1.1 it goes down to 28
11:11 events and 24 users affected the point
11:13 is
11:14 you can use all of this information that
11:16 Bugsnag gives you about the
11:18 frequency of the error the specific
11:22 device context in which this error has
11:25 been seen
11:26 to determine the impact and to determine
11:28 the next step
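
Filtering an error by a pivot, such as OS version, amounts to narrowing the set of events and recounting events and distinct users. The following is a minimal sketch with a hypothetical `Event` record and made-up sample data; it does not use Bugsnag's API and does not reproduce the counts from the demo project.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical single occurrence of an error."""
    user_id: str
    os_version: str
    release: str


def summarize(events: List[Event], **pivots: str) -> Tuple[int, int]:
    """Return (event count, distinct users) for events matching all pivots."""
    selected = [
        event for event in events
        if all(getattr(event, name) == value for name, value in pivots.items())
    ]
    return len(selected), len({event.user_id for event in selected})


# Made-up sample data for illustration only.
events = [
    Event("u1", "7.1.1", "2.3.0"),
    Event("u1", "7.1.1", "2.3.0"),
    Event("u2", "7.1.1", "2.4.0"),
    Event("u3", "8.0.0", "2.4.0"),
]
```

Calling `summarize(events)` counts everything, while `summarize(events, os_version="7.1.1")` narrows the view to one OS version in the same way the pivot filter does in the walkthrough.
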
11:29 once you've determined what makes sense
11:32 to do for this you'd come up here
11:34 these are the error actions that we
11:35 talked about so this is where you would
11:37 create an issue
11:38 this is where you could mark it as fixed
11:40 so where you would snooze it
11:42 ignore it so let's say that we've just
11:45 shipped a fix for this and we don't
11:47 expect to see it in a future version of
11:49 the software anymore
11:50 then the next step would be to mark this
11:53 as fixed
11:55 here it's prompting us to add a comment
11:57 about why we think this has been
11:59 fixed and we can say something to the
12:01 effect of
12:03 fixed in last release mark as fixed
12:07 there you go now you can see that it's
12:08 fixed
12:10 and if we go to the comment and activity
12:12 view we can see
12:14 that this was fixed and here's my
12:16 comment explaining why
12:19 so you start your triaging workflow with
12:21 your for review errors
12:23 now your goal should be to get that
12:25 total number of for review errors down
12:27 to zero
12:28 on a regular basis and what it means
12:30 when you do that when you achieve
12:32 Bugsnag inbox zero
12:33 it means that all of your critical bugs
12:35 have been addressed and for any lower
12:37 priority
12:38 bugs you've determined the criteria at
12:40 which point you will take further action
12:42 on them in the future
12:46 a common question we get at this point
12:48 is when will bugs
12:50 ever go back into the for review state
12:52 and there are a few
12:53 key situations where this will happen
12:56 the obvious one is
12:57 newly introduced bugs bugs that
12:59 Bugsnag has never seen before
13:01 will continue to go in the for review
13:02 state for you to triage
13:04 but also any previously snoozed
13:07 bugs that have exceeded their previous
13:09 snooze thresholds will also go back into
13:11 the for review state
13:13 and any bugs that you've marked as fixed
13:14 that have happened in a new version of
13:16 your software
13:17 will also return to the for review state
13:20 and the reason for this is that
13:21 even though you've looked at these bugs
13:23 in the past now
13:24 their context has changed they've begun
13:27 to happen more frequently
13:29 or they've happened in a version of your
13:31 software where you're not expecting them
13:32 to happen and so in all of these cases
13:35 these are things that you want to be
13:36 looking at to be determining their
13:37 current impact
13:38 and whether you need to take some new
13:40 action based on this new information
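
The three situations above in which an error lands in the for review state (a new error, a snoozed error exceeding its threshold, a fixed error seen in a newer version) can be summarized in one function. This is a sketch under assumptions: the `TrackedError` fields, the status strings, and the tuple-based version comparison are hypothetical and are not taken from Bugsnag's implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Version = Tuple[int, ...]  # e.g. (2, 4, 0); an assumed version encoding


@dataclass
class TrackedError:
    """Hypothetical triage record for one error."""
    status: str                           # "new", "snoozed", "fixed" or "ignored"
    events_since_snooze: int = 0
    snooze_threshold: Optional[int] = None
    version_when_fixed: Optional[Version] = None  # newest version seen when marked fixed


def returns_to_for_review(err: TrackedError, event_version: Version) -> bool:
    """Decide whether a fresh event puts this error back in front of the team."""
    if err.status == "new":
        return True
    if err.status == "snoozed":
        return (err.snooze_threshold is not None
                and err.events_since_snooze > err.snooze_threshold)
    if err.status == "fixed":
        return (err.version_when_fixed is not None
                and event_version > err.version_when_fixed)
    return False  # ignored errors stay out of the for review set
```

In each `True` case the error's context has changed since it was last looked at, which is the reason given above for reviewing it again.
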
13:43 so why aim for Bugsnag inbox zero well
13:46 first and foremost if you're regularly
13:48 getting your inbox down to
13:50 around zero errors for review it means
13:52 that when new errors do come in
13:54 to be reviewed your team can be more
13:56 efficient with their attention
13:58 because if you consider the case where
14:00 you're not getting anywhere close to
14:01 inbox
14:02 zero when someone comes in to do
14:03 periodic review they may have to sift
14:05 through
14:06 several errors that have been given
14:08 varying degrees of review
14:10 already but that's not necessarily clear
14:13 because a workflow action hasn't been
14:14 taken on those errors appropriately
14:17 so if you are getting to inbox zero
14:18 regularly it means that
14:20 the errors that your team looks at
14:21 during the triaging workflow are only
14:23 those errors that need to be considered
14:25 in their current context
14:28 the other thing about getting close to
14:30 inbox zero or hitting inbox zero on a
14:32 daily basis
14:33 is that it increases the likelihood that
14:34 your team is going to be engaged
14:36 with the periodic triaging process
14:38 because the lower that number is that's
14:40 for review the closer to zero
14:42 the more likely people are to want to
14:44 get that down to zero you know you
14:46 consider the case of 1000 errors to
14:49 review versus five errors to review
14:52 one is much more inviting than the other
14:54 as far as
14:55 you know someone on the team wanting to
14:56 go in and do the necessary work to get
14:58 those errors triaged
14:59 so try to hit inbox zero every day it's
15:02 going to make your team more engaged
15:03 it's going to allow them to spend time
15:05 in Bugsnag
15:10 efficiently

Written by bugsnag | The leading application stability management solution trusted by over 6,000 engineering teams worldwide.
Published by HackerNoon on 2022/03/22