IBM Support

Kafka is generating a lot of data

Troubleshooting


Problem

The Kafka service of the Cloud Application Performance Management (APM) server is saving a lot of data, which could lead to a file system full condition.

Symptom

The files in the /opt/ibm/kafka/data/kafka-logs/metric.protobuf* directories are responsible for the growth. Each log file in those directories might be up to 500 MB by default. When the problem occurs, the number of log files in those directories keeps growing until there are 20 or more log files taking up 10 GB or more of disk space. At the same time, the Db2 WAREHOUS database also grows, because the data in the metric.protobuf Kafka topic is written to the database.
You might also see the "No data available" message in the Cloud APM dashboard widgets that display real-time data, while the widgets that show data from the last 4 hours still display data.
The file system size of some Windows agent systems might grow significantly too.

Cause

One or more agents installed on Windows send the same data to the Cloud APM server over and over again after connectivity to the server is lost and restored multiple times before the agents can finish sending their cached data. For example, the problem can occur if the Cloud APM server is stopped for many minutes and then restarted multiple times before the agents can send all of their cached data, or if network connectivity between the agents and the Cloud APM server is disrupted multiple times before the agents can send all of their cached data.
The problem does not occur during normal operations of Cloud APM, or if the Cloud APM server is restarted once and is not restarted again for several hours.

Environment

Cloud APM server 8.1.4 with APM agents on Windows

Diagnosing The Problem

Diagnosing the problem from the Cloud APM server
If the file system size for the Cloud APM server is growing, log in to the APM server and run the following commands to determine the current sizes of the Kafka service directories.  
Note: The commands assume the Cloud APM server is installed in the /opt/ibm directory.
1. cd /opt/ibm/kafka/data/kafka-logs
2. du -hsx * | sort -rh | head -25 
Then run the same commands 15 minutes later. If the sizes of the /opt/ibm/kafka/data/kafka-logs/metric.protobuf-0 and metric.protobuf-1 directories have grown significantly, then the problem might be occurring.
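The checks in steps 1 and 2 can be wrapped in a small helper so that two snapshots taken 15 minutes apart are easy to compare. This is a sketch, not an IBM-provided tool: the snapshot_sizes function name is illustrative, and the data directory path is the default install location used in this document.

```shell
#!/bin/sh
# Sketch: report the 25 largest entries under the Kafka data directory.
# Run it once, wait 15 minutes, run it again, and compare the
# metric.protobuf-* lines. Override KAFKA_LOGS if your install differs.
KAFKA_LOGS="${KAFKA_LOGS:-/opt/ibm/kafka/data/kafka-logs}"

snapshot_sizes() {
    # Print "<kilobytes> <dir>" for the largest entries, biggest first
    du -skx "$1"/* 2>/dev/null | sort -rn | head -25
}

snapshot_sizes "$KAFKA_LOGS"
```

Using `du -sk` with `sort -rn` keeps the output numeric and sortable on any Linux system, matching the intent of the `du -hsx * | sort -rh` command above.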
To determine whether your Windows monitoring agents are using a Cloud APM core framework interim fix that causes agents to resend the same cached data over and over again, perform the following steps:
1. Log in to your Cloud APM server as root

2.  curl http://localhost/1.0/monitoring/agent/inventory -H "Accept: text/plain"  -o  /tmp/agent-inventory.log > /dev/null 2>&1

3.  grep -e "C=06.40.00.04" -e "C=06.40.00.05" -e "C=06.40.00.06" -e "C=06.40.00.07" -e "C=06.40.00.08" /tmp/agent-inventory.log | grep WIX > /tmp/windows-agents-to-update.log

The /tmp/windows-agents-to-update.log file lists the monitoring agents that are using the core framework versions that cause the problem. If the file is empty, then a different issue is occurring. If the file is not empty and you have Windows OS agents, then issue these Db2 queries on your Db2 server to determine which Windows OS agents are sending more data than they sent before the problem started:
1. Log in to your Db2 server as the Db2 instance user, which is db2apm by default
2. db2 connect to WAREHOUS user itmuser using your-itmuser-password
   where you need to specify the password for itmuser in place of your-itmuser-password
3. db2 "select ORIGINNODE, COUNT(*) as total from itmuser."LP_NTPROCESS" WHERE WRITETIME LIKE '1yymmd1%' GROUP BY ORIGINNODE order by total desc " 
4. db2 "select ORIGINNODE, COUNT(*) as total from itmuser."LP_NTPROCESS" WHERE WRITETIME LIKE '1yymmd2%' GROUP BY ORIGINNODE order by total desc "
5. db2 "select ORIGINNODE, COUNT(*) as total from itmuser."LP_NTPROCESS" WHERE WRITETIME LIKE '1yymmd3%' GROUP BY ORIGINNODE order by total desc"
6. db2 "select ORIGINNODE, COUNT(*) as total from itmuser."LP_NTPROCESS" WHERE WRITETIME LIKE '1yymmd4%' GROUP BY ORIGINNODE order by total desc" 
where you replace these values in the WRITETIME LIKE string in the db2 select commands above:

         - Replace yy with the two-digit year number. For example, specify 20 for the year 2020.

         - Replace mm with the two-digit month number. For example, specify 04 for April or 10 for October.

         - Replace d1 with a date in the last 7 days when the problem was not occurring. For example, if the APM server file system growth was stable on April 3, then specify 03 in place of d1.

         - Replace d2 with another date in the last 7 days when the problem was not occurring. For example, if the APM server file system growth was stable on April 4, then specify 04 in place of d2.

         - Replace d3 with the date when the APM server file system growth started. For example, if the APM server file system size started growing on April 5, then specify 05 in place of d3.

         - Replace d4 with the current date, or the prior day if you did not specify the current date for d3. (For d4, specify a date when the agents sent data to the Cloud APM server for most of the day.)
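The WRITETIME values in these queries use a timestamp layout that starts with the literal digit 1 followed by yymmdd (the sample agent log later in this document shows WRITETIME 1190920160154000 for September 20, 2019 at 16:01:54). A minimal sketch that builds the LIKE pattern and the query text for a given date; the function names are illustrative:

```shell
#!/bin/sh
# Sketch: build the WRITETIME LIKE pattern ('1' century digit + yymmdd)
# and the matching query text for one day.
writetime_pattern() {
    # $1 = two-digit year, $2 = two-digit month, $3 = two-digit day
    printf '1%s%s%s%%' "$1" "$2" "$3"
}

build_query() {
    pat=$(writetime_pattern "$1" "$2" "$3")
    printf "select ORIGINNODE, COUNT(*) as total from itmuser.LP_NTPROCESS where WRITETIME like '%s' group by ORIGINNODE order by total desc" "$pat"
}

# Example: print the query text for April 3, 2020
build_query 20 04 03
```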

The output of the Db2 queries in steps 3 and 4 shows the number of rows that the Windows OS agents sent to the APM server on the days when the APM server file system size was stable.  (The number of rows is in the TOTAL column of the query output.)  
The output of the Db2 queries in steps 5 and 6 shows the number of rows that the Windows OS agents sent to the APM server on the days when the APM server file system size was growing.  
If the row count for any Windows OS agent in the output of the queries in steps 5 and 6 is 10 times or more the row count for the same agent in the output of the queries in steps 3 and 4, then that agent is causing the APM server's file system growth.
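The 10x comparison can be automated once the query output is saved to files. A sketch, under the assumption that each file holds lines of the form "ORIGINNODE count" copied from the query output; the flag_growth function name is illustrative:

```shell
#!/bin/sh
# Sketch: given a stable-day file and a growing-day file, each containing
# "originnode count" lines, print the nodes whose row count grew by 10x
# or more between the two days.
flag_growth() {
    # $1 = stable-day counts, $2 = growing-day counts
    awk 'NR==FNR { base[$1] = $2; next }
         ($1 in base) && base[$1] > 0 && $2 >= 10 * base[$1] { print $1 }' \
        "$1" "$2"
}
```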

Diagnosing the problem from a Windows agent system
 
If the agent loses its connection to the Cloud APM server, it caches data until connectivity is restored. The cached data is stored in *.csh files in the \IBM\APM\localconfig\pc directory or subdirectories where pc is the product code for the agent type. For example, the product code for a Windows OS agent is nt. If there are multiple *.csh files for an agent type, the size of the *.csh files keeps growing, and the agent is connected to the Cloud APM server, then check the agent's RAS1 log files in the \IBM\APM\logs\TMAITM6_x64\logs directory.
When the agent connects to the Cloud APM server, it begins sending the cached data to the server. On an agent system with the problem, the agent's RAS1 logs had these messages repeating continuously, with KBB_RAS1 set to ERROR (the default value):
(5D84DCB4.0000-7A4:kraasfsc.cpp,196,"printfSelf") *SFCP-INFO: Subscription Capsule Definition -----------------------------------------------------
(5D84DCB4.0001-7A4:kraasfsc.cpp,197,"printfSelf") *SFCP-INFO:   Type ID------------- 7
(5D84DCB4.0002-7A4:kraasfsc.cpp,198,"printfSelf") *SFCP-INFO:   CacheID------------- 1858
(5D84DCB4.0003-7A4:kraasfsc.cpp,199,"printfSelf") *SFCP-INFO:   QueueType----------- 0
(5D84DCB4.0004-7A4:kraasfsc.cpp,200,"printfSelf") *SFCP-INFO:   EnqueueTime--------- 5D84DBD2
(5D84DCB4.0005-7A4:kraasfsc.cpp,201,"printfSelf") *SFCP-INFO:   OnQueue------------- 0
(5D84DCB4.0006-7A4:kraasfsc.cpp,202,"printfSelf") *SFCP-INFO:   Aborted------------- 0
(5D84DCB4.0007-7A4:kraasfsc.cpp,203,"printfSelf") *SFCP-INFO:   FreePayLoad--------- 1
(5D84DCB4.0008-7A4:kraasfsc.cpp,204,"printfSelf") *SFCP-INFO:   PayloadLength------- 1118
(5D84DCB4.0009-7A4:kraasfsc.cpp,207,"printfSelf") *SFCP-INFO:   Payload ------------ <ENVELOPE><SUBSCRIBER>KNT_defaultSubscription</SUBSCRIBER><REPORTDATA><WRITETIME>1190920160154000</WRITETIME><TMZDIFF>-7200</TMZDIFF><SQLTABLE><TABLENAME>NTMEMORY</TABLENAME><COLUMNS><NAME>ORIGINNODE</NAME><NAME>AVAILBTMEM</NAME><NAME>AVAILKB</NAME><NAME>A
(5D84DCB4.000A-7A4:kraasfsc.cpp,210,"printfSelf") *SFCP-INFO: -------------------------------------------------------------------------------------
(5D84DCB4.000B-7A4:kraasfsr.cpp,2092,"loadServerCacheFromFile") Subscription Capsule 399D070 - cache ID 1858 with Caching Duration 3600 Original Enqueue Time = Fri Sep 20 16:01:54 2019
Earlier in the logs, there are indications of problems with the *.csh files:
(5D84DCB3.1B41-7A4:kraasfsr.cpp,2092,"loadServerCacheFromFile") Subscription Capsule 399D8E0 - cache ID 1898 with Caching Duration 3600 Original Enqueue Time = Fri Sep 20 16:05:54 2019
(5D84DCB3.1B42-7A4:kraasfsr.cpp,2521,"readSubCapBuffer") *SFCP-INFO: Read subscription capsule record from cache file C:\IBM\APM\localconfig\nt\nt_asfSavedSubCap.378661Q.csh error detected - status 2 No such file or directory
(5D84DCB3.1B43-7A4:kraasfsr.cpp,2207,"loadServerCacheFromFile") Total 48 Server http://10.0.0.177:80/ccm/asf/request subscription data loaded from file C:\IBM\APM\localconfig\nt\nt_asfSavedSubCap.378661Q.csh
(5D84DCB3.1B44-7A4:kraasfsr.cpp,2265,"loadServerCacheFromFile") Total 0 Server http://10.0.0.177:80/ccm/asf/request subscription data copied from file C:\IBM\APM\localconfig\nt\nt_asfSavedSubCap.378661Q.csh
(5D84DCB3.1B45-7A4:kraasfsr.cpp,2281,"loadServerCacheFromFile") Server http://10.0.0.177:80/ccm/asf/request subscription data file C:\IBM\APM\localconfig\nt\nt_asfSavedSubCap.378661Q.csh delete error detected - 13 Permission denied
(5D84DCB3.1B46-7A4:kraasfsr.cpp,2287,"loadServerCacheFromFile") Server http://10.0.0.177:80/ccm/asf/request subscription data file C:\IBM\APM\localconfig\nt\nt_asfSavedSubCap.378661T.csh deleted
(5D84DCB3.1B47-7A4:kraasfsr.cpp,2328,"loadServerCacheFromFile") Server http://10.0.0.177:80/ccm/asf/request subscription data store in file begin
Specifically, the agent encountered an error when deleting a *.csh file, which prevents the agent from completing the cache upload and resuming normal data collection. The agent starts resending the same cached data, which explains why the Kafka service on the Cloud APM server receives and stores more data than expected.

Resolving The Problem

To resolve the problem, update the agents that can cause the problem with Cloud APM 8.1.4.0 core framework interim fix 12, or the latest interim fix that supports APM, and delete the agents' *.csh files. The /tmp/windows-agents-to-update.log file created by the procedure in Diagnosing the problem from the Cloud APM server contains the list of agents to update. If you have many agents to update, prioritize the Windows OS agents that are sending the most data according to steps 5 and 6 of the Db2 procedure in Diagnosing the problem from the Cloud APM server.
To update the agents, download the latest Cloud APM 8.1.4.0 core framework interim fix for APM from Fix Central and perform the steps below on each host listed in the /tmp/windows-agents-to-update.log file:
1) Log in to the agent system as the user who is running the agent
2) cd <agent-home>\localconfig 
             where <agent-home> is the Cloud APM agent installation directory
3) Search for any *.csh files in the subdirectories of <agent-home>\localconfig
4) If there is a *.csh file in the directory or subdirectory for an agent type, then stop the corresponding agent and delete its *.csh file. The localconfig subdirectory names contain the two-character product code of an agent. For example, the localconfig\nt directory is for the Windows OS agent. To determine the product code for each agent type, see:
5) Install the core framework interim fix on the agent system. The interim fix installation restarts all of the agents.

6) Install the server version of the core framework on the Cloud APM server, and then run the scripts described in this IBM Knowledge Center link to update the agent packages used to install new agents: https://www.ibm.com/support/knowledgecenter/SSHLNR_8.1.4/com.ibm.pm.doc/install/install_agent_preconfig.html
Note: If you must schedule a maintenance window to update the core framework, then steps 1-4 can be performed to delete the cached data so that the agents stop sending so much data to the APM server.   After step 4, start the agents that had *.csh files and install the core framework interim fix later.
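On systems where a POSIX shell is available (for example, Git Bash or Cygwin on Windows), the search in steps 2 and 3 can be sketched as follows; from a plain Windows command prompt, `dir /s *.csh` run in the localconfig directory is the equivalent. The function name and the default path are illustrative assumptions, not part of the product:

```shell
#!/bin/sh
# Sketch: list any cached-data (*.csh) files under the agent localconfig
# directory so you know which agent types to stop before deleting the files.
# AGENT_LOCALCONFIG defaults to a Git Bash-style spelling of the Windows
# path used in this document; override it for your install location.
AGENT_LOCALCONFIG="${AGENT_LOCALCONFIG:-/c/IBM/APM/localconfig}"

list_csh_files() {
    find "$1" -type f -name '*.csh' 2>/dev/null
}

list_csh_files "$AGENT_LOCALCONFIG"
```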
If the Cloud APM server file system is filling up or the Db2 WAREHOUS database on a remote Db2 server is getting large, then you can perform these steps on the Cloud APM server to minimize the file system growth until you update all affected agents. 
Note: The steps assume the Cloud APM server is installed in the /opt/ibm directory.
1) Configure the Kafka service to free up data more quickly.
By default, the Kafka service can cache data for up to 4 hours. The steps below configure the Kafka service to cache data in the metric.protobuf topic for up to 1 hour.
1a) Set JAVA_HOME
    export JAVA_HOME=/opt/ibm/java/jre
1b) Enter this command to reduce the retention time to 1 hour: 
   /opt/ibm/kafka/bin/kafka-topics.sh --zookeeper 127.0.0.1:2181 --topic metric.protobuf --alter --config retention.ms=3600000
2) If your Cloud APM server file system is close to filling up and you cannot add more space, then you can delete the metric.protobuf topic to temporarily free up space.  However, if you delete the topic, then you lose any data that has not already been written to the Db2 WAREHOUS database.   
Note: The metric.protobuf topic is automatically re-created after you delete it and will start growing in size so you might need to delete it again if the /opt/ibm/kafka/data/kafka-logs/metric.protobuf* directories become too large with new data.
 2a) Add the following line to /opt/ibm/kafka/conf/server.properties 
   delete.topic.enable=true
2b) apm restart kafka 
2c) export JAVA_HOME=/opt/ibm/java/jre
2d) /opt/ibm/kafka/bin/kafka-topics.sh --zookeeper 127.0.0.1:2181 --delete --topic metric.protobuf 
Note: If you need to delete the metric.protobuf topic again to free up space, then you only need to perform steps 2c) and 2d) since the Kafka service was already configured to accept delete requests in steps 2a) and 2b).
3)  By default, 8 days of resource monitoring data is saved in the Db2 WAREHOUS database. If the file system for your Db2 WAREHOUS database is also filling up, then you can configure the Cloud APM server to save 2 days of data in the database to free up disk space. 
Note: If your Cloud APM server is configured to use a local Db2 server, the WAREHOUS database data is saved under the /opt/ibm/db2/DB/db2apm/NODE0000/WAREHOUS directory. If you have a remote Db2 server, then the WAREHOUS database is saved on the remote Db2 server in the directory specified by your Db2 administrator when the database was created.  
Perform these steps on the Cloud APM server to reduce the amount of data saved in the Db2 WAREHOUS database: 
3a) cd /opt/ibm/ccm
3b) If you configured custom retention settings for the Db2 WAREHOUS database, then enter the command below to retrieve the current retention settings. You will use the file created by this command to restore the custom settings after the Cloud APM server file system size is stable.
./get_metrics_retention.sh -pw db2Usrpasswd@08 -retention CURRENT
  
       where you need to specify your Db2 instance user password in place of db2Usrpasswd@08 if you are not using the default password 
     
    Then rename and save the /tmp/history_file.cfg file that is created by the get_metrics_retention.sh script.

3c) ./set_metrics_retention.sh -pw db2Usrpasswd@08 -retention 2DAYS
        where you need to specify your Db2 instance user password in place of db2Usrpasswd@08 if you are not using the default password
3d) apm stop ksy
3e) Edit /opt/ibm/sy/config/sy.ini and uncomment KSY_ON_DEMAND=Y by removing # at the beginning of the line with that environment variable
3f) cd /opt/ibm/sy/bin
3g) ./itmcmd agent start sy
3h) Keep checking the latest *sy_java* log file in the /opt/ibm/sy/logs directory.  Once you see messages similar to the messages below at the end of the log file, then the partition cleanup is complete.  
Note: It might take a few minutes for the ksy service to delete partitions so be patient.

== 4750 t=Thread-5 Summarization and pruning agent successfully ended
== 2020-01-03 03.03.02.264 +0800 : Trace paused
3i) ./itmcmd agent stop sy

3j) Edit /opt/ibm/sy/config/sy.ini and comment out KSY_ON_DEMAND=Y by adding # at the beginning of the line with that environment variable

3k) apm start ksy
3l) To force Db2 to clean up data from the database partitions that were dropped, perform the steps below on the computer system where the Db2 server is installed: 
      su - db2apm

           Note: If you are using a custom Db2 instance user, then specify your instance name instead of db2apm
      db2 connect to WAREHOUS 
      db2 "alter tablespace userspace1 reduce"
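The sy.ini edits in steps 3e and 3j can also be scripted rather than done by hand. A sketch, assuming the line appears as KSY_ON_DEMAND=Y with or without a leading #; the function names are illustrative, and the file path (by default /opt/ibm/sy/config/sy.ini per this document) is passed as the argument:

```shell
#!/bin/sh
# Sketch: uncomment (step 3e) or re-comment (step 3j) the KSY_ON_DEMAND=Y
# line in sy.ini. Pass the path to the file as $1.
enable_on_demand() {
    sed -i 's/^#[[:space:]]*\(KSY_ON_DEMAND=Y\)/\1/' "$1"
}

disable_on_demand() {
    sed -i 's/^\(KSY_ON_DEMAND=Y\)/#\1/' "$1"
}
```

`sed -i` edits the file in place, which is available in the GNU sed shipped with the Linux systems that host the Cloud APM server.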

      
Once you update your agents with the core framework interim fix and the Cloud APM server file system size is stable, perform these steps on the Cloud APM server to restore the WAREHOUS database retention time to 8 days or your custom retention settings: 
1) cd <apm-server-home>/ccm

2) If you were using the default 8-day retention setting, then run this command: 
./set_metrics_retention.sh -pw db2Usrpasswd@08 -retention 8DAYS
            where you need to specify your Db2 instance user password in place of db2Usrpasswd@08 if you are not using the default password
3) If you were using custom retention settings, then run this command: 
./set_metrics_retention.sh -pw db2Usrpasswd@08 -history your-history-config-file
            where
                 - you need to specify your Db2 instance user password in place of db2Usrpasswd@08 if you are not using the default password 
                 - you replace your-history-config-file with the path to the history_file.cfg file that you saved before changing the retention time to 2 days.
Note: For more information about the set_metrics_retention.sh script, see: https://www.ibm.com/support/knowledgecenter/SSHLNR_8.1.4/com.ibm.pm.doc/install/admin_history_scripts.html 
 

Document Location

Worldwide

[{"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Product":{"code":"SSVJUL","label":"IBM Application Performance Management"},"Component":"kafka","Platform":[{"code":"PF016","label":"Linux"}],"Version":"8.1.4","Edition":"","Line of Business":{"code":"LOB45","label":"Automation"}}]

Product Synonym

apm;apmv8

Document Information

Modified date:
28 August 2020

UID

ibm11074534