Troubleshooting methodology
There are two fundamental reasons why you might be doing packet analysis:
- Troubleshooting a connectivity or functionality problem (a user can't connect, an application doesn't work, or doesn't work right), which we'll just call troubleshooting
- Analyzing a performance problem (the application works but is slow), which we'll call performance analysis
A third gray area is an application that basically works but is slow and occasionally times out. This could involve an underlying functional problem that causes the performance issue, or it could simply be a case of very poor performance.
Troubleshooting a connectivity or functional issue is largely a matter of comparing what happens when things work normally with what is happening in the case you're working on.
A performance problem, on the other hand, requires determining where the majority of the time for a particular transaction to complete is being spent, measuring the delay, and comparing that delay to what is normal or acceptable. The source and type of excessive delay usually point to the next area to investigate further or resolve.
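As a rough illustration of that comparison, the following Python sketch ranks the phases of one transaction by how much of the total time each one consumes and how far each exceeds its baseline. The phase names and all of the timing values are hypothetical; in practice they would come from your own capture and from what you know to be normal for the application.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    measured_s: float  # delay observed for this phase, in seconds
    baseline_s: float  # delay considered normal or acceptable, in seconds


def rank_excess_delay(phases):
    """Return (name, share_of_total, excess_s) tuples, largest measured delay first."""
    total = sum(p.measured_s for p in phases)
    if total <= 0:
        return []
    ranked = sorted(phases, key=lambda p: p.measured_s, reverse=True)
    return [(p.name, p.measured_s / total, p.measured_s - p.baseline_s) for p in ranked]


# Hypothetical measurements for a single slow transaction
phases = [
    Phase("network round trips", 0.40, 0.30),
    Phase("server processing", 3.20, 0.50),
    Phase("client processing", 0.15, 0.10),
]

for name, share, excess in rank_excess_delay(phases):
    print(f"{name:22s} {share:6.1%} of total, {excess:+.2f} s vs. baseline")
```

With these made-up numbers, server processing dominates the transaction time, so the server side would be the next area to investigate.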
In any case, you need to gather the information that allows you to determine whether this is a connectivity, functional, or performance issue and approach the problem according to its nature.
Gathering the right information
The most important thing you can do when approaching a problem is to determine what the real problem is so you can work on the right problem or the right aspect of the problem. In order to determine what the real problem is, or at least get close, you'll need to ask questions and interpret the answers. These questions could include the appropriate selections (depending on the complaint) from the following list:
- Define the problem:
  - What were you trying to do (connect to a server, log in, send/receive e-mails, use the application in general, upload/download a file, or perform specific transactions or functions)?
  - Is nothing working, or is this just a problem with a specific application or multiple applications?
  - What website/server/application were you trying to connect to? Do you know the hostname, URL, and/or IP address and port used to access the application?
  - What is the symptom/nature of the problem? Has this application or function/feature worked before, or is this the first time you've ever tried to use it?
  - Did you receive any error messages or other indications of a problem?
  - Is the issue consistent or intermittent? If it depends on something, on what?
  - How long has this been happening?
  - Was there some recent change that did or could have had an impact?
  - What has been identified or suspected so far? What has been done to address this? Has it helped or changed anything?
  - Are there any other pertinent factors, symptoms, or recent changes to the user environment that should be considered?
- Determine the scope of the issue:
  - Is this problem occurring for a single user or a group of users?
  - Is this problem occurring within a specific office, region, or across the whole company?
  - Is this problem affecting different types of users differently?
- Collect system, application, and path information. For a more in-depth analysis (beyond single user or small group issues), the applicable information from the following list might also need to be gathered and analyzed, as appropriate to the complaint (some of it may have to be obtained from network or application support groups):
  - What is the browser type and version on the client (for web apps)? Is this different from clients that are working properly?
  - What is the operating system type and version of the client(s) and server?
  - What is the proper (vendor) application name and version? Are there any known issues with the application that match these symptoms (check the vendor's bug reports)?
  - What is the database type and server environment behind the application server?
  - Are there other backend-supporting data sources such as an online data service or Documentum and SharePoint servers involved?
  - What is the network path between the client and server? Are there firewalls, proxy servers, load balancers, and/or WAN accelerators in the path? Are they configured and working properly?
  - Can you confirm the expected network path (and any WAN links involved) with a traceroute and verify the bandwidth availability?
  - Can you measure the round trip time (RTT) path latency from the user to the application server with pings or TCP handshake completion times?
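When ICMP pings are blocked, the time taken to complete a TCP connection gives a reasonable estimate of the RTT, because the client's connect call returns once the SYN/ACK arrives. The following is a minimal Python sketch of that idea using only the standard library. The function names are made up for this illustration, and the result includes some local operating system overhead, so treat it as an approximation rather than a precise measurement.

```python
import socket
import time


def tcp_connect_time(host, port, timeout=3.0):
    """Return the seconds taken to complete a TCP connection to (host, port).

    Pass an IP address rather than a hostname so that name resolution
    time is not included in the result.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        return time.perf_counter() - start


def median_connect_time(host, port, samples=5):
    """Take several samples and return the median to smooth out outliers."""
    times = sorted(tcp_connect_time(host, port) for _ in range(samples))
    return times[len(times) // 2]


# Example call (192.0.2.10 is a documentation address, used here as a placeholder):
# print(f"{median_connect_time('192.0.2.10', 443) * 1000:.1f} ms")
```

Running this from the user's workstation against the application server's address and port measures the same path the application uses, which a ping to a different address might not.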
Establishing the general nature of the problem
At this point, you should be able to classify the general nature of the problem as one of the following three basic types:
- Determine whether this is a connectivity problem:
  - User(s) cannot connect to anything
  - User(s) cannot connect to a specific server/application
- Determine whether this is a functionality or configuration problem:
  - User(s) can connect (gets a login screen or other response from the application server) but cannot log in (or get the expected response)
  - User(s) can connect and log in but some or all functions are failing (for example, cannot send/receive e-mails)
- Determine whether this is a performance problem:
  - User(s) can connect, log in, and use the application normally; but it's slow
  - The application works normally but sometimes it stalls and/or times out
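The order of these checks matters: connectivity is ruled out first, then functionality, and only then performance. A small Python sketch of that decision order follows. The function and parameter names are invented for this illustration, and the answers would come from the questions you asked earlier rather than from any automated test.

```python
def classify_problem(can_connect, can_log_in, functions_work, slow_or_stalling):
    """Classify a complaint into one of the three basic problem types."""
    if not can_connect:
        return "connectivity"
    if not can_log_in or not functions_work:
        return "functionality or configuration"
    if slow_or_stalling:
        return "performance"
    return "problem not reproduced"


# A user who gets a login screen but cannot log in
print(classify_problem(True, False, False, False))
```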
Half-split troubleshooting and other logic
When I was doing component-level repair of electronic equipment early in my career, I learned to use the "half-split" troubleshooting method, which worked very well in almost every single case. Half-split troubleshooting is the process of cutting the problem domain (in my case, a piece of radio gear) in half by injecting or measuring signals roughly midway through the system. The idea is to see which half is working right and which half isn't, then shift focus to the half that doesn't work, analyze it halfway through, and so on. This process is repeated until you narrow the problem down to its source.
In the network and application world, the same half-split troubleshooting approach can be applied as well, in a general sense. If users are complaining that the network is slow, try to confirm or eliminate the network:
- Are users close to the server experiencing similar slowness? How about users in other remote locations?
- If a certain application is slow for a remote user, are other applications slow for that user as well?
- If users can't connect to a given server, can they connect to other servers nearby or at other locations?
By a process of logical examination of what does and doesn't work, you can eliminate a lot of guesswork and narrow your analysis down to just a few plausible possibilities.
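In programming terms, half-split troubleshooting is a binary search over an ordered chain of checkpoints. The Python sketch below assumes a single fault, so that every checkpoint before the fault tests good and every checkpoint from the fault onward tests bad. The path, the checkpoint names, and the probe function are hypothetical; a real probe would be a ping, a test connection, or a capture taken at that point.

```python
def half_split(checkpoints, works_up_to):
    """Return the first checkpoint at which the probe fails, or None.

    checkpoints: ordered list of test points, from client to server.
    works_up_to: function returning True if everything up to and
                 including the given checkpoint behaves correctly.
    """
    lo, hi = 0, len(checkpoints)
    while lo < hi:
        mid = (lo + hi) // 2
        if works_up_to(checkpoints[mid]):
            lo = mid + 1  # this half is good; look further along the path
        else:
            hi = mid      # the fault is at or before this checkpoint
    return checkpoints[lo] if lo < len(checkpoints) else None


# Hypothetical path with a simulated fault at the WAN link
path = ["client NIC", "access switch", "site firewall",
        "WAN link", "load balancer", "app server"]


def probe(checkpoint):
    return path.index(checkpoint) < path.index("WAN link")


print(half_split(path, probe))
```

With six checkpoints, the search needs at most three probes instead of testing every point in turn.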
It's usually much easier to determine the source of a connectivity or functionality problem if you have an environment where everything is working properly to compare with a situation that does not work. A packet capture of a working versus a non-working scenario can be compared to see what is different and if those differences are significant.
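One simple way to make such a comparison is to reduce each capture to a list of one-line packet summaries and diff the two lists. The Python sketch below uses the standard `difflib` module. The summary strings are invented for this illustration, and it is assumed that they have already been exported as text and stripped of values that always differ between sessions, such as timestamps and ephemeral port numbers.

```python
import difflib

# Hypothetical, normalized packet summaries (C = client, S = server)
working = [
    "C->S SYN",
    "S->C SYN,ACK",
    "C->S ACK",
    "C->S TLS ClientHello",
    "S->C TLS ServerHello",
]
failing = [
    "C->S SYN",
    "S->C SYN,ACK",
    "C->S ACK",
    "C->S TLS ClientHello",
    "S->C RST",
]


def capture_diff(good, bad):
    """Return only the lines that differ: '- ' is seen only in the good
    capture, '+ ' is seen only in the bad one."""
    return [line for line in difflib.ndiff(good, bad)
            if line.startswith(("- ", "+ "))]


for line in capture_diff(working, failing):
    print(line)
```

Here the two conversations are identical until the server answers the ClientHello with a reset, which is where the analysis should focus.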
It is important not to make too many assumptions about a problem, even if the issue you're working on looks the same as one that you've fixed before. Always verify both the problem and the resolution: you should be able to apply and remove a fix and see the problem disappear and reappear reliably. Otherwise, you should question whether you've found the true source of the issue or are just affecting the symptoms.
Unless a reported problem is obviously a system-wide or specific server issue, it is better to conduct at least the initial analysis at or as close to the complaining user's workstation as possible. Doing so lets you perform the following actions:
- View and verify the actual problem that the user is reporting
- Measure round-trip times to the target server(s)
- Capture and view the TCP handshake process upon session initiation
- Capture and investigate the login and any other background processes and traffic
- Look for indications of network problems (lost packets and retransmissions) as they are experienced by the user's device
- Measure the apparent network throughput to the user's workstation during data downloads
- Eliminate the need to use a capture filter; the amount of traffic to/from a single workstation should not be excessive
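To make the retransmission and throughput items concrete, the following Python sketch processes a list of data segments from one direction of a single TCP flow. It is a deliberately simplified heuristic and not the method Wireshark uses: any segment that carries no data beyond the highest sequence number already seen is counted as a suspected retransmission (so out-of-order arrivals are counted too), and sequence number wraparound is ignored. The function name and the sample values are hypothetical.

```python
def analyze_flow(segments):
    """Return (suspected_retransmissions, throughput_bps).

    segments: list of (timestamp_s, seq, payload_len) tuples for one
              direction of one TCP flow, in capture order.
    """
    if not segments:
        return 0, 0.0
    highest_end = None  # highest sequence number seen so far, plus one
    retransmissions = 0
    unique_bytes = 0
    for _, seq, length in segments:
        if length == 0:
            continue  # pure ACKs carry no data
        end = seq + length
        if highest_end is not None and end <= highest_end:
            retransmissions += 1  # nothing new in this segment
            continue
        start = seq if highest_end is None else max(seq, highest_end)
        unique_bytes += end - start
        highest_end = end
    duration = segments[-1][0] - segments[0][0]
    throughput_bps = unique_bytes * 8 / duration if duration > 0 else 0.0
    return retransmissions, throughput_bps


# Hypothetical download: the segment starting at 1001 is sent twice
segments = [
    (0.00, 1, 1000),
    (0.01, 1001, 1000),
    (0.02, 2001, 1000),
    (0.25, 1001, 1000),
    (0.26, 3001, 1000),
]
retrans, bps = analyze_flow(segments)
print(f"{retrans} suspected retransmission(s), {bps / 1000:.1f} kbit/s")
```

Only new bytes are counted toward throughput, so the result reflects what the user's application actually received rather than the raw traffic on the wire.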
A capture at a user workstation, server, or other device should be conducted with an aggregating Test Access Point (TAP) rather than a switch SPAN port (as discussed in Chapter 3, Capturing All the Right Packets) or, as a last resort, by installing Wireshark on the user's workstation or server (if authorized).