IBM Support

WinSTAP 10.5 receiving alerts "Correlation Timeout Errors" and "Decryption of Messages Failed" resulting in missing DB User Names for sessions.

Troubleshooting


Problem

WinSTAP 10.5 is not logging DB User Names while receiving alerts for "Correlation Timeout Errors" and "Decryption of Message Failed".

Symptom

The following error may be seen during troubleshooting in the STAP.ctl log:

ERROR:

`E 05/30/2017 16:22:12.607 Wfp: DecryptMessage [xx.xx.xx.116:43150, xx.xx.xx.144:1433] failed: [The handle specified is invalid]
`W 05/30/2017 16:28:37.776 Wfp: Correlation delay timeout [xx.xx.xx.116:43227, xx.xx.xx.144:1433].

MEANING:

This message is an artifact of the new STAP correlation process. The STAP now uses code to decrypt the encrypted SQL Server stream before sending to the appliance. Prior to this (in V9), the encrypted stream was sent to the appliance along with an unencrypted stream and the appliance provided correlation.

The V10 STAP detects that it has received an encrypted packet for a stream and then waits for the driver (either NmpMonitor or WfpMonitor) to provide the encryption key so it can start decrypting the stream. What's happening is that the key is arriving late and the STAP times out waiting for the key and logs the "Correlation Delay Timeout" message. And then decryption of the stream fails because the STAP doesn't have the key so it logs the "Decryption of Message Failed" message. Login packets are always encrypted, so if the STAP doesn't have the key it can't decrypt the login packet and the customer will see missing information (like DB username) in the reports.

Cause

Prominent causes of such messages in S-TAP Event Log report are
- creating TCP/IP connection to database and then starting the S-TAP in middle of the running traffic.
- the client/server traffic is ignored causing delay in traffic processing in terms of correlation and invalidating the decryption context. This cause can be eliminated by ignoring server traffic only i.e by using "IGNORE RESPONSES PER SESSION" rule action in policy.
- traffic spikes can cause "Correlation Delay Timeout". This message can be controlled using CORRELATION_TIMEOUT parameter in guard_tap.ini file where default value of 5 seconds can be increased to 10 seconds.

Environment

Guardium v10.x

Diagnosing The Problem

Collect the full Win S-TAP must gather outputs, which includes driver logs (*.ctl), in order to provide the best suggestion to the customer.

Resolving The Problem

It is recommended to upgrade to Windows S-TAP V10.5.0.182 or newer which enhances Correlation performance and stability via code improvements.

https://www.securitylearningacademy.com/enrol/index.php?id=3784

Additional Recommendations:

Many of theses traffic spike problem can be eliminated by configuring  IGNORE RESPONSES PER SESSION which drops the server to client traffic.  This will ignore the traffic coming from SQL Server and going to the client, but it will still log the traffic coming from the client and going to the server.

Once sniffer receives traffic and IGNORE RESPONSE PER SESSION action is fired, sniffer requests S-TAP to stop sending S2C (server to client) traffic for the session.

S-TAP sends both C2S and S2C from the beginning, but once it receives request, it'll stop sending S2C traffic and keep sending C2S only.

For more in-depth details Ref- Ignore Responses Per Session
https://w3-connections.ibm.com/wikis/home?lang=en#!/wiki/We05c396c214c_4cba_8b2d_2c49ec7407b1/page/Ignore%20Responses%20Per%20Session

If an upgrade to Windows S-TAP V10.5.0.182 or newer  is not immediately possible, there are some workarounds that can be put in place.

The "Correlator Timeout" message is often being caused by traffic spikes which means it takes longer for the driver's to send the encryption keys to the STAP. Examine the .ctl files to determine if setting the correlation timeout in the guard_tap.ini file from the default of 5 seconds to 10 seconds is warranted as the resolution.

Parameter entry:

CORRELATION_TIMEOUT=10

ERROR:

E 05/30/2017 19:30:07.232 Wfp: DecryptMessage [xx.xx.xx.116:63482, xx.xx.xx.144:1433] failed: [The handle specified is invalid]
`W 05/31/2017 00:37:14.163 PDQueue: STAP buffer overflow

MEANING:

Although this doesn't generate the decrypt error message, it indicates that the STAP is seeing occasional traffic spikes and is dropping packets during those spikes due to it reaching its maximum memory allotment.

WORKAROUND:

Update STAP buffer size parameter value in the guard_trap.ini to something like 200 to 300MB for example should alleviate the issue. Follow with a restart of STAP service only after updating the value. Watch the STAP.ctl log for any additional buffer overflow errors and adjust STAP buffer accordingly if needed to avoid any further lost packets during STAP traffic spikes.

Example:

BUFFER_FILE_SIZE=200

NOTE: Buffer size currently defaults to 50MB.

ERROR:

  W 05/30/2017 16:22:04.576 Wfp: Tcp connection established mid-stream: [xx.xx.xx.144:1433, xx.xx.xx.116:43108]
`W 05/30/2017 16:22:12.607 Wfp: Correlation delay timeout [xx.xx.xx.116:43150, xx.xx.xx.144:1433]

MEANING:

This decrypt error message happens when you create a TCP/IP database session and THEN start the STAP process in the middle of the stream. Since the decryption key has already gone by, the STAP has no decryption context and you see the invalid handle error.

WORKAROUND:

Once the key has already passed by, there is not much at that point to prevent this anymore.

[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSMPHH","label":"IBM Security Guardium"},"Component":"sniffer","Platform":[{"code":"PF016","label":"Linux"},{"code":"PF033","label":"Windows"}],"Version":"10.x","Edition":"","Line of Business":{"code":"LOB24","label":"Security Software"}}]

Document Information

Modified date:
29 March 2019

UID

ibm10742457