What Are the Best Practices for Triaging Software Bugs

We're talking about triaging bugs in software systems. At a very simple level, this topic comes down to the question: Is your software working in the hands of your users? If your app, your game, or your web service isn't working perfectly for every interaction, then it's time to think about bug triaging. For many systems, some software bugs are tolerable if they happen infrequently, while other bugs, if they happen even once, are critical and need immediate action. Bug triage is all about making those judgment calls quickly and accurately. This video is the first in a three-part series on how to approach software bug triage.

Part 1: Bug Triaging Principles
Part 2: Tips for Effective + Efficient Bug Triage
Part 3: Developing a Bug Triage Process with Your Team

Transcript

0:15 okay so let's set the stage here

0:17 we're talking about triaging bugs in

0:19 software systems

0:20 so at a very simple level this topic

0:23 comes down to the question

0:24 is your software working in the hands of

0:27 your users

0:28 if your app or your game or your web

0:31 service

0:31 isn't working perfectly for every

0:33 interaction then it's time to think

0:35 about bug triaging

0:37 for many systems there are going to be

0:39 some software bugs that are tolerable if

0:41 they happen infrequently

0:43 while other bugs if they happen even

0:44 once are critical

0:46 and need immediate action so bug triage

0:49 is all about making those judgment calls

0:51 quickly and accurately

0:55 so it's useful to break bugs down into

0:57 three categories for our discussion

0:59 today

1:00 there are those bugs that need immediate

1:01 action these are things that require

1:03 some developer intervention maybe

1:05 rolling back a recent release or

1:07 flipping a feature flag to get the bug

1:09 out of the hands of users

1:10 and get back onto a stable version of

1:12 the software there are those bugs that

1:15 do not require immediate action now but

1:18 may require action in the future

1:20 if they become more impactful let's say

1:22 if they happen more frequently or affect

1:24 more users

1:26 and third there are those bugs that

1:28 regardless of their frequency

1:30 are safe to ignore of these three

1:33 categories it's really the first two

1:35 that are most interesting for our

1:36 purposes today

1:37 and those are the ones we're going to

1:38 focus on

1:42 so when it comes to determining which

1:44 category a given bug falls into

1:46 there are really two main workflows

1:49 there's what we call reactive triage

1:51 and what we call periodic triage

1:53 reactive triage

1:55 is the scenario where a bug occurs or

1:58 something changes with a bug's frequency

2:00 and it requires someone on your team to

2:02 drop what they're doing and go

2:03 investigate it immediately

2:05 so these tend to be bugs that are high

2:08 impact

2:09 or affect a critical area of your system

2:12 so some examples here might be a new bug

2:15 that the system has never seen before

2:17 it might be an issue involving a bug

2:19 that was previously occurring at some

2:22 safe steady-state frequency but now Bugsnag

2:24 has detected an anomalous spike in

2:26 the frequency of that bug

2:29 could be a bug in a critical area of

2:31 your system a bug that you previously

2:33 fixed

2:33 that has come back now in a future

2:35 release of the software

2:37 or it could be something to do with a

2:39 stability score

2:40 being off target for your project these

2:43 are all concepts we'll talk about more

2:45 but the crucial point here is that a key

2:47 category of bug triaging

2:49 is all about reactively jumping into

2:53 Bugsnag

2:53 figuring out what's going on and making

2:55 sure that

2:57 some immediate action isn't needed to

3:00 get a bug

3:00 away from your users

3:04 so when you're thinking about reactive

3:05 triage in

3:07 your project and within your team it's

3:10 really important to think about

3:12 which subset of your errors are going to

3:14 rise to the level of importance that you

3:16 want someone on your team

3:18 to effectively drop what they're doing

3:19 and go triage that bug immediately

3:23 once you've made that determination you

3:24 can configure Bugsnag via

3:26 the alerting and workflow engine to

3:28 notify your team via

3:30 team chat or via on-call alerting system

3:32 whenever one of the bugs that meets your

3:34 custom defined criteria

3:36 occurs so it's worth pointing out that

3:39 the

3:40 alerting and workflow engine is

3:42 highly configurable

3:43 you decide when Bugsnag notifies you

3:45 and through what means

3:47 some examples of how you can use this to

3:50 your team's advantage

3:52 let's say you have a spike in errors

3:54 affecting your VIP customers

3:56 where you define what it means for a

3:58 customer to be a VIP

4:00 Bugsnag can detect that and

4:02 automatically open a PagerDuty incident

4:04 for you

4:05 fitting into your team's existing

4:06 on-call rotation

4:08 and bug remediation process

4:11 or let's say you work in a monolithic

4:13 code base where

4:15 each team works out of a different Slack

4:17 channel but ultimately you share the

4:18 same code

4:20 you can configure Bugsnag to notify

4:22 your Slack channel

4:24 about bugs in your team's part of the

4:26 monolith

4:27 and the possibilities are really

4:30 infinite from there
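
The two routing examples above (a spike affecting VIP customers going to on-call, and each team hearing about its own part of a shared monolith) can be sketched as a small rule table. This is only an illustration of the idea, not Bugsnag's alerting and workflow engine or its API: the `ErrorEvent` fields, the `route` function, the team names, and the channel names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ErrorEvent:
    """Hypothetical summary of an error; not a Bugsnag data type."""
    owning_team: str   # team that owns the affected part of the monolith
    affects_vip: bool  # "VIP" is whatever your team defines it to be
    is_spike: bool     # True when the error's frequency has spiked


# Assumed mapping from team name to that team's chat channel.
TEAM_CHANNELS = {"payments": "#payments-bugs", "search": "#search-bugs"}


def route(event: ErrorEvent) -> list[str]:
    """Return the notification targets for an event, most urgent first."""
    targets = []
    if event.affects_vip and event.is_spike:
        # For example, open an incident for the existing on-call rotation.
        targets.append("on-call")
    channel = TEAM_CHANNELS.get(event.owning_team)
    if channel is not None:
        targets.append(channel)
    return targets
```

Keeping the rules in one place like this makes it easier for a team to agree on which subset of errors is allowed to interrupt someone.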

4:33 so most teams aren't going to triage

4:35 every single error

4:36 using a purely reactive workflow there are

4:40 going to be those bugs that

4:42 aren't critical enough to require

4:44 people to drop what they're doing

4:46 and go triage them immediately of course

4:49 this varies from team to team but this

4:51 is generally true

4:53 all bugs affecting your system need to

4:55 be reviewed and prioritized regularly

4:56 though

4:57 so an initial target that we recommend

5:00 is to have your team

5:01 triage your for review errors once per

5:04 day

5:05 this is especially important to do first

5:07 thing in the workday or

5:08 after lunch any time where there may

5:10 have been a lapse in coverage

5:13 and new bugs may have crept in or

5:16 previously triaged bugs may have come

5:18 back into the for review state

5:20 and we'll talk about all that in greater

5:22 detail in a moment

5:25 let's quickly review the workflow

5:28 actions available

5:29 in Bugsnag when a bug is first detected

5:32 by Bugsnag

5:33 it goes into the open and for review

5:36 workflow states

5:37 and we'll talk more about the for

5:39 review workflow state because that's

5:41 really

5:41 key to triaging so when an error

5:45 is in an open error state there are some

5:47 key workflow actions you can perform on

5:49 the error

5:50 and these map back to those three

5:52 categories of errors we talked about at

5:53 the beginning right

5:54 things you want to fix immediately

5:56 things you may want to fix in the future

5:58 and things you're safe to ignore

6:00 so starting from

6:05 top left to right here snoozing an error

6:08 is something you can do to conditionally

6:09 reopen an error in the future and this

6:11 is something you would do

6:13 if an error is in that category where

6:15 you want to

6:17 keep an eye on it but you're not going

6:18 to fix it right now and you're only

6:19 going to address it

6:20 if it becomes more impactful you can

6:24 create an issue

6:25 to track the work related to an imminent

6:27 fix of a bug

6:28 so for example if you're using Jira

6:32 this would be equivalent to clicking a

6:34 button in Bugsnag which will create a

6:35 Jira ticket

6:37 which will then be used in your sprint

6:39 or other work

6:40 planning process

6:43 to track the work of actually going in

6:45 and making the necessary code or

6:47 infrastructure changes to remove the bug

6:51 you can mark a bug as fixed and this is

6:53 typically what you would do

6:54 for those category one bugs that you've

6:57 decided to fix right now

6:59 when you've taken some action to

7:00 remediate the bug

7:02 and when you mark a bug as fixed it will

7:05 only return to the

7:06 for review state if it's seen again in a

7:08 future version of the code

7:10 and lastly you can ignore an open error

7:14 which will signify that you're not

7:16 planning to take any action on it

7:18 regardless of how frequently it may

7:20 occur in the future
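
The mapping from the three bug categories to these four workflow actions can be sketched as a small function. This is an illustration under stated assumptions rather than anything Bugsnag provides: the `Category` and `Action` enums, the `choose_action` function, and the `fix_shipped` flag are hypothetical names.

```python
from enum import Enum


class Category(Enum):
    FIX_NOW = 1         # needs immediate action
    WATCH = 2           # no action now, maybe later if it becomes more impactful
    SAFE_TO_IGNORE = 3  # no action regardless of frequency


class Action(Enum):
    CREATE_ISSUE = "create issue"
    MARK_FIXED = "mark as fixed"
    SNOOZE = "snooze"
    IGNORE = "ignore"


def choose_action(category: Category, fix_shipped: bool = False) -> Action:
    """Pick a workflow action for an open error.

    fix_shipped is an assumed flag meaning the team has already
    taken some action to remediate the bug.
    """
    if category is Category.FIX_NOW:
        # Track the imminent fix, or record that it has already gone out.
        return Action.MARK_FIXED if fix_shipped else Action.CREATE_ISSUE
    if category is Category.WATCH:
        return Action.SNOOZE
    return Action.IGNORE
```

For example, `choose_action(Category.WATCH)` returns `Action.SNOOZE`, matching the case where you keep an eye on an error without fixing it right now.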

7:22 during error triage you're typically

7:24 going to be taking these workflow

7:26 actions

7:26 from a specific error details view

7:29 inside Bugsnag

7:30 you can also take these workflow actions

7:33 from the inbox view in Bugsnag

7:35 which also gives you the ability to take

7:37 workflow actions on

7:38 more than one error at once we'll look

7:41 at some examples of doing this in the

7:43 product in just a moment

7:46 a key tip for error triaging in

7:49 Bugsnag

7:49 is to start your triaging workflow with

7:52 the for review filter

7:53 in the Bugsnag inbox so if we look at

7:56 the screenshot below you'll see that

7:58 we're viewing the Bugsnag inbox

8:00 and that it's currently filtered to for

8:02 review errors and you can see this in

8:04 two key places

8:05 in the filter bar it says status for

8:07 review and in the left hand

8:10 column it says for review with 18 in

8:13 parentheses and that has

8:15 an active ui state and this signifies

8:17 that we're currently filtering for for

8:19 review errors there are 18

8:20 errors that need to be reviewed and the

8:22 tooltip there is giving us a hint it

8:23 says open errors that are awaiting

8:25 triage

8:26 so what we need to do if we imagine that

8:29 we're on this

8:30 team that's responsible for the software

8:32 that's being monitored by this Bugsnag

8:34 project here what we need to do is look

8:36 at every one of these errors currently

8:38 affecting our users

8:39 and determine its impact and then we

8:41 need to determine which of these

8:43 workflow actions that we just discussed

8:45 fixing snoozing creating an issue

8:47 ignoring etc.

8:48 is most appropriate given the current

8:50 impact of the bug

8:52 and given the current work that is on

8:54 our team's plate

8:57 so let's take a look at a project in

8:59 Bugsnag

9:00 so we can see some of these things in

9:01 action

9:03 if we go to this photosnap android

9:06 project

9:07 and have a look at the inbox we can see

9:10 that this project has quite a few

9:12 open errors and notice we're filtered to

9:15 errors that have occurred only in the

9:16 past 30 days

9:18 so it's likely that there are even more

9:19 than the 38 open errors currently affecting

9:21 this

9:22 project but let's as we said have a look

9:25 at the for review errors so we can see

9:27 in the last 30 days there are 24

9:29 errors that are for review so we might

9:31 start our triaging here

9:33 and again what we're going to do is

9:35 we're going to look at every one of

9:36 these errors

9:38 in the for review set and we're going to

9:40 figure out

9:41 what the appropriate next step is for

9:43 each of these errors

9:46 one thing you might want to consider

9:47 doing at this point is sorting

9:49 the inbox either by total number of

9:51 events per error

9:52 so you can see this is the error that

9:55 had

9:56 the most events in the past 30 days or

9:59 you could sort by users affected as well

10:01 and you can see this one affected 56

10:03 users

10:05 this happens to be an application not

10:07 responding error

10:09 which is pretty severe so let's go and

10:12 take a look at that
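
Sorting the for review set by impact, as described above, can be sketched in a few lines. This is a minimal illustration, not Bugsnag's inbox: `ErrorSummary` and `sort_inbox` are hypothetical names, and in the usage note only the 80 events and 56 users for the application not responding error come from the walkthrough, while the other rows are invented.

```python
from dataclasses import dataclass


@dataclass
class ErrorSummary:
    """Hypothetical per-error totals for the selected time window."""
    name: str
    events: int
    users_affected: int


def sort_inbox(errors, by="events"):
    """Return errors ordered from most to least impactful.

    by="events" sorts on total events, by="users" on users affected.
    """
    keys = {
        "events": lambda e: e.events,
        "users": lambda e: e.users_affected,
    }
    return sorted(errors, key=keys[by], reverse=True)
```

Given `ErrorSummary("ANR", 80, 56)`, `ErrorSummary("NullPointerException", 120, 9)`, and `ErrorSummary("Timeout", 15, 14)`, sorting by events puts the invented `NullPointerException` row first, while sorting by users puts the ANR first. The two orderings can disagree, which is why it is worth checking both.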

10:15 so here we are on the error details

10:17 screen

10:18 this gives us an overview of all of the

10:22 specific information to do with this one

10:24 particular

10:26 defect in the application so we can see

10:29 again this affected

10:31 56 users it happened a total of 80 times

10:33 in the past 30 days

10:34 we can go between these tabs and see

10:37 more information about

10:39 how those 80 occurrences are distributed

10:41 across

10:42 specific users we can see which releases

10:45 of the software the bug has occurred in

10:47 os versions of end user devices and so

10:50 on

10:50 and if you're new to bug snag it's worth

10:54 pointing out that

10:55 all of these we call these pivots all

10:58 the information in these pivots can be

10:59 used to

11:00 filter down the view of this error even

11:03 more

11:04 so if we're only looking at OS version 7.1.1

11:07 it goes down to 28

11:11 events and 24 users affected the point

11:13 is

11:14 you can use all of this information that

11:16 Bugsnag gives you about the

11:18 frequency of the error the specific

11:22 device context in which this error has

11:25 been seen

11:26 to determine the impact and to determine

11:28 the next step

11:29 once you've determined what makes sense

11:32 to do for this you'd come up here

11:34 these are the error actions that we

11:35 talked about so this is where you would

11:37 create an issue

11:38 this is where you could mark it as fixed

11:40 or where you would snooze it

11:42 ignore it so let's say that we've just

11:45 shipped a fix for this and we don't

11:47 expect to see it in a future version of

11:49 the software anymore

11:50 then the next step would be to mark this

11:53 as fixed

11:55 here it's prompting us to add a comment

11:57 about why we think this has been

11:59 fixed and we can say something to the

12:01 effect of

12:03 fixed in last release mark as fixed

12:07 there you go now you can see that it's

12:08 fixed

12:10 and if we go to the comment and activity

12:12 view we can see

12:14 that this was fixed and here's my

12:16 comment explaining why

12:19 so you start your triaging workflow with

12:21 your for review errors

12:23 now your goal should be to get that

12:25 total number of for review errors down

12:27 to zero

12:28 on a regular basis and what it means

12:30 when you do that when you achieve

12:32 Bugsnag inbox zero

12:33 it means that all of your critical bugs

12:35 have been addressed and for any lower

12:37 priority

12:38 bugs you've determined the criteria at

12:40 which point you will take further action

12:42 on them in the future

12:46 a common question we get at this point

12:48 is when will bugs

12:50 ever go back into the for review state

12:52 and there are a few

12:53 key situations where this will happen

12:56 the obvious one is

12:57 newly introduced bugs bugs that

12:59 Bugsnag has never seen before

13:01 will continue to go in the for review

13:02 state for you to triage

13:04 but also any previously snoozed

13:07 bugs that have exceeded their previous

13:09 snooze thresholds will also go back into

13:11 the for review state

13:13 and any bugs that you've marked as fixed

13:14 that have happened in a new version of

13:16 your software

13:17 will also return to the for review state

13:20 and the reason for this is that

13:21 even though you've looked at these bugs

13:23 in the past now

13:24 their context has changed they've begun

13:27 to happen more frequently

13:29 or they've happened in a version of your

13:31 software where you're not expecting them

13:32 to happen and so in all of these cases

13:35 these are things that you want to be

13:36 looking at to be determining their

13:37 current impact

13:38 and whether you need to take some new

13:40 action based on this new information
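
The three situations that send a bug back to the for review state can be sketched as one decision function. This is a simplified model under stated assumptions, not how Bugsnag implements it: `TrackedError`, `on_new_event`, and `parse_version` are hypothetical, a snooze threshold is assumed to be an event count, and a fixed error is assumed to reopen only in a version strictly newer than the one current when it was marked as fixed.

```python
from dataclasses import dataclass
from typing import Optional


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '2.10.1' into (2, 10, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


@dataclass
class TrackedError:
    status: str                                    # "snoozed", "fixed" or "ignored"
    snooze_event_limit: Optional[int] = None       # assumed form of a snooze threshold
    events_since_snooze: int = 0
    marked_fixed_at_version: Optional[str] = None  # assumed: newest release when marked fixed


def on_new_event(error: Optional[TrackedError], app_version: str) -> bool:
    """Return True if this occurrence should put the error in 'for review'."""
    if error is None:
        return True  # newly introduced bug, never seen before
    if error.status == "snoozed":
        error.events_since_snooze += 1
        return (error.snooze_event_limit is not None
                and error.events_since_snooze > error.snooze_event_limit)
    if error.status == "fixed":
        return (error.marked_fixed_at_version is not None
                and parse_version(app_version) > parse_version(error.marked_fixed_at_version))
    return False  # ignored errors stay out of the for review set
```

Comparing versions as tuples of integers rather than strings is a deliberate choice here, so that 2.10.0 counts as newer than 2.9.0.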

13:43 so why aim for Bugsnag inbox zero well

13:46 first and foremost if you're regularly

13:48 getting your inbox down to

13:50 around zero errors for review it means

13:52 that when new errors do come in

13:54 to be reviewed your team can be more

13:56 efficient with their attention

13:58 because if you consider the case where

14:00 you're not getting anywhere close to

14:01 inbox

14:02 zero when someone comes in to do

14:03 periodic review they may have to sift

14:05 through

14:06 several errors that have been given

14:08 varying degrees of review

14:10 already but that's not necessarily clear

14:13 because a workflow action hasn't been

14:14 taken on those errors appropriately

14:17 so if you are getting to inbox zero

14:18 regularly it means that

14:20 the errors that your team looks at

14:21 during the triaging workflow are only

14:23 those errors that need to be considered

14:25 in their current context

14:28 the other thing about getting close to

14:30 inbox zero or hitting inbox zero on a

14:32 daily basis

14:33 is that it increases the likelihood that

14:34 your team is going to be engaged

14:36 with the periodic triaging process

14:38 because the lower that number is that's

14:40 for review the closer to zero

14:42 the more likely people are to want to

14:44 get that down to zero you know you

14:46 consider the case of 1000 errors to

14:49 review versus five errors to review

14:52 one is much more inviting than the other

14:54 as far as

14:55 you know someone on the team wanting to

14:56 go in and do the necessary work to get

14:58 those errors triaged

14:59 so try to hit inbox zero every day it's

15:02 going to make your team more engaged

15:03 it's going to allow them to spend time

15:05 in Bugsnag

15:10 efficiently
