Troubleshooting a problem

Troubleshooting is a systematic approach to solving a problem. The goal of troubleshooting is to determine why something does not work as expected and explain how to resolve the problem.

The first step in the troubleshooting process is to describe the problem completely. Problem descriptions help you and the IBM® Support person know where to start to find the cause of the problem. This step involves gathering the necessary information to answer the following questions:

  • What are the symptoms of the problem?
  • Where does the problem occur?
  • When does the problem occur?
  • Under which conditions does the problem occur?
  • Can the problem be reproduced?

Knowing the answers to these questions helps you create a good description of the problem for IBM Support. These answers can result in a faster resolution of the problem.

What are the symptoms of the problem?

To describe a problem, first break the problem into several focused questions that create a more descriptive picture of the problem. The following questions are examples:

  • Who, or what, is reporting the problem?
  • What are the error codes and messages?
  • How does the system fail? For example, is it a loop, hang, crash, performance degradation, or incorrect result?
  • What is the business impact of the problem?

Where does the problem occur?

Determining where the problem originates is one of the most important steps in resolving a problem. Many layers of technology can exist between the reporting and failing components. Networks, disks, and drivers are only a few of the components to consider when you are investigating problems.

The following questions help you to focus on where the problem occurs to isolate the problem layer:

  • Is the problem specific to one platform or operating system, or is it common across multiple platforms or operating systems?
  • Are the current environment and configuration supported?

Remember that if one layer reports the problem, the problem does not necessarily originate in that layer. Part of identifying where a problem originates is understanding the environment in which it exists. Take some time to completely describe the problem environment, including the operating system and version, all corresponding software and versions, and hardware information. Confirm that you are running within an environment that is a supported configuration. Many problems can be traced back to incompatible levels of software that are not intended to run together or have not been fully tested together.
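
As a starting point for describing the problem environment, the following minimal sketch collects basic operating system and hardware facts by using only the Python standard library. The function name describe_environment and the dictionary keys are illustrative assumptions, not part of any IBM tool. Versions of the affected product and its prerequisite software are not covered here and must be gathered with the tools that the product itself provides.

```python
import json
import platform
import sys


def describe_environment():
    """Collect basic facts about the current host into a dictionary.

    The key names are illustrative; adapt them to your own report format.
    """
    return {
        "os": platform.system(),            # for example "Linux" or "Windows"
        "os_release": platform.release(),   # kernel or OS release string
        "os_version": platform.version(),   # detailed build information
        "machine": platform.machine(),      # hardware architecture
        "python_version": platform.python_version(),
        "executable": sys.executable,       # interpreter that ran this script
    }


if __name__ == "__main__":
    # Print the snapshot as JSON so that it can be attached to a problem report.
    print(json.dumps(describe_environment(), indent=2))
```

Saving this output at the time of the failure makes it easier to compare the environment later against the list of supported configurations.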

When does the problem occur?

Develop a detailed timeline of events that lead to a failure, especially for those cases that are one-time occurrences. The timeline is easier to develop if you work backward: start at the time that the error was reported (as precisely as possible, even to the millisecond), and then work backward through the available logs and information. Typically, the first suspicious event in a diagnostic log is the best starting point. However, this approach is not always easy to follow. Knowing when to stop looking is especially difficult when multiple layers of technology are involved, and when each has its own diagnostic information.

To develop a detailed timeline of events, answer these questions as a guideline for knowing whether the appropriate information is gathered:

  • Does the problem happen only at a certain time of day or night?
  • How often does the problem happen?
  • What sequence of events leads up to the time that the problem is reported?
  • Does the problem happen after an environment change, such as upgrading or installing software or hardware?
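
The backward walk through the logs of several layers can be sketched as follows. This is a minimal illustration, assuming that each layer's log has already been parsed into (timestamp, message) pairs. The function name build_timeline, the layer names, and the sample events are all hypothetical.

```python
from datetime import datetime, timedelta


def build_timeline(sources, error_time, window):
    """Return events within `window` before `error_time`, newest first.

    `sources` maps a layer name to a list of (timestamp, message) pairs.
    """
    start = error_time - window
    events = [
        (ts, layer, message)
        for layer, entries in sources.items()
        for ts, message in entries
        if start <= ts <= error_time
    ]
    # Sort newest first so the walk starts at the error and moves backward.
    events.sort(key=lambda event: event[0], reverse=True)
    return events


if __name__ == "__main__":
    # Hypothetical sample data; real entries would be parsed from log files.
    error_time = datetime(2024, 1, 15, 10, 30, 0, 250000)
    sources = {
        "application": [
            (datetime(2024, 1, 15, 10, 30, 0, 250000), "request failed"),
            (datetime(2024, 1, 15, 10, 29, 58), "retrying connection"),
        ],
        "network": [
            (datetime(2024, 1, 15, 10, 29, 57, 500000), "link down"),
            (datetime(2024, 1, 15, 9, 0, 0), "interface up"),
        ],
    }
    timeline = build_timeline(sources, error_time, timedelta(minutes=5))
    for ts, layer, message in timeline:
        print(ts.isoformat(timespec="milliseconds"), layer, message)
```

The fixed time window is one simple way to decide when to stop looking. Widening the window step by step is an option when the first pass shows nothing suspicious.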

Under which conditions does the problem occur?

Knowing which systems and applications are running at the time that a problem occurs is an important part of troubleshooting. The following questions about the environment can help in identifying the root cause of the problem:

  • Does the problem always occur when the same task is being performed?
  • Does a certain sequence of events have to occur for the problem to surface?
  • Do any other applications fail at the same time?

Answering these types of questions can help in explaining the environment in which the problem occurs and in correlating any dependencies. Remember that problems that occur around the same time are not necessarily related.

Can the problem be reproduced?

Problems that can be reproduced are often easier to debug and solve. Typically, if a problem can be reproduced, more tools and procedures are available to help with investigation into the cause of the problem. If possible, re-create the problem in a test or development environment, which typically offers you more flexibility and control during your investigation. Consider the following questions:

  • Can the problem be reproduced on a test system?
  • Are multiple users or applications encountering the same type of problem?
  • Can the problem be re-created by running a single command, a set of commands, a particular application, or a stand-alone application?
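
When a problem can be re-created by running a single command, a small harness can show how reliably it fails. The following sketch is one possible approach, assuming that a nonzero exit code indicates the failure. The function name count_failures is hypothetical, and the sample command is a stand-in that always exits with status 3. Run such a harness only on a test or development system.

```python
import subprocess
import sys


def count_failures(command, attempts):
    """Run `command` `attempts` times; return the list of nonzero exit codes."""
    failures = []
    for _ in range(attempts):
        # Capture output so that repeated runs do not flood the terminal.
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(result.returncode)
    return failures


if __name__ == "__main__":
    # Hypothetical stand-in command that always exits with status 3.
    command = [sys.executable, "-c", "import sys; sys.exit(3)"]
    attempts = 5
    failures = count_failures(command, attempts)
    print(f"{len(failures)} of {attempts} runs failed; exit codes: {failures}")
```

A failure count lower than the number of attempts suggests an intermittent problem, which is worth noting in the problem description.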