Dormant Data cleanup in TADDM - A Customer Story
You have probably seen several documents about dormant data cleanup in TADDM and hopefully are already taking steps to clean up these old objects. Cleaning up old objects that are no longer discovered ensures that the TADDM data is current, and it can also improve performance by reducing the size of the database. But do you wonder if you are really cleaning up everything that is old? Most of the dormant cleanup articles refer to Computer Systems and change history -- is there more to it than that?

Recently we went through a detailed analysis of dormant data in a customer environment. This analysis reviewed all classes that had a count of more than 10,000 objects with a last modified time older than 45 days. The customer was running TADDM 7.2.1.4 with a DB2 database and approximately 6 million CIs. This blog entry documents the findings and recommendations that resulted from that analysis. While the analysis is 90% done, the work still progresses, so there may be updates to this post as we finish up or learn more from other customers. That said, this is what we found:
USE THE API TO CLEANUP DORMANT CIs
In general, the TADDM API delete command should be used to clean up dormant CI data. In most cases a CI is represented in many database tables, and SQL cleanup is risky because you would need to know every table the CI appears in; otherwise you risk corrupting the database and potentially causing errors during discovery.
The format of the api delete command is;
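In outline: you run api.sh from the TADDM SDK with the delete verb, passing each dormant guid. A hedged shell sketch follows -- the api.sh location and the -u/-p flags are assumptions from a typical install, so check your version's SDK documentation:

```shell
#!/bin/sh
# Hedged sketch: build api delete commands for a list of dormant GUIDs.
# The api.sh path, -u/-p flags, and "delete" verb are assumptions from a
# typical TADDM install; verify against your version's SDK documentation.
API_SH="${COLLATION_HOME:-/opt/IBM/taddm/dist}/sdk/bin/api.sh"
TADDM_USER="${TADDM_USER:-administrator}"
TADDM_PASS="${TADDM_PASS:-password}"
GUIDS="0A1B2C3D4E5F 1F2E3D4C5B6A"   # hypothetical guids for illustration only
CMDS=""
for GUID in $GUIDS; do
  # Collect the commands; print them for a dry run, run them once verified.
  CMDS="${CMDS}${API_SH} -u ${TADDM_USER} -p ${TADDM_PASS} delete ${GUID}
"
done
printf '%s' "$CMDS"
```

In practice you would feed in the guid lists produced by the queries below rather than hard-coded examples, and execute the commands only after reviewing the dry-run output.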
There are a few exceptions to this "use the api" rule which I will discuss later in this post.
CLASSES TO CLEANUP
The following tables list the classes we analyzed and the database query we used to find the dormant guids to pass to the api delete. In all queries listed below, ${DATE} represents the time stamp, in epoch format, of the cutoff date: CIs last modified before this date are candidates for deletion. This ${DATE} is compared to the CI's last modified time (LMT) to determine whether it meets the dormant criteria. I will discuss methods to obtain this date later in this post.
TABLE 1: These classes can be cleaned up directly, based solely on the last modified time (LMT) in the database. In the customer analysis, the LMT cutoff we used was 45 days prior to the current date.
Class Name | Database query to find dormant guids |
ComputerSystem | select GUID_C from BB_COMPUTERSYSTEM40_V where lastModifiedTime_c < ${DATE} |
AppServer | select GUID_C from BB_APPSERVER6_V where lastModifiedTime_c < ${DATE} |
Segment | select GUID_C from BB_SEGMENT21_V where lastModifiedTime_c < ${DATE} |
Service | select GUID_C from BB_SERVICE81_V where lastModifiedTime_c < ${DATE} |
OperatingSystem | select GUID_C from BB_OPERATINGSYSTEM62_V where lastModifiedTime_c < ${DATE} |
CPU | select GUID_C from BB_CPU10_V where lastModifiedTime_c < ${DATE} |
NetworkConnection | select GUID_C from BB_NETWORKCONNECTION38_V where lastModifiedTime_c < ${DATE} |
Relationship | select GUID_C from BB_RELATIONSHIP26_V where lastModifiedTime_c < ${DATE} |
PhysicalPackage | select GUID_C from BB_PHYSICALPACKAGE60_V where lastModifiedTime_c < ${DATE} |
PhysicalConnector | select GUID_C from BB_PHYSICALCONNECTOR3_V where lastModifiedTime_c < ${DATE} |
Vlan | select GUID_C from BB_VLAN89_V where JDOCLASS_C = 'com.collation.topomgr.jdo.topology.net.VlanJdo' AND lastModifiedTime_c < ${DATE} |
SoftwareComponent | select GUID_C from BB_SOFTWARECOMPONENT68_V where lastModifiedTime_c < ${DATE} |
TransportEndpoint | select GUID_C from BB_TRANSPORTENDPOINT84_V where lastModifiedTime_c < ${DATE} |
TABLE 2: This class is cleaned up based on a missing parent rather than on LMT: the query finds L2Interface objects whose parent ComputerSystem no longer exists.
Class Name | Database query to find dormant guids |
L2Interface | SELECT BB_L2INTERFACE41_V.GUID_C FROM BB_L2INTERFACE41_V LEFT OUTER JOIN BB_COMPUTERSYSTEM40_V ON (BB_L2INTERFACE41_V.PK__PARENTL2INTERFACE_C = BB_COMPUTERSYSTEM40_V.PK_C) WHERE (BB_COMPUTERSYSTEM40_V.GUID_C IS NULL) |
TABLE 3: Some classes represent naming attributes that can be used by multiple CIs. For example, an IpAddress is used as part of the BindAddress naming attribute, which is one of the naming rules for AppServers, so a single IpAddress can act as the naming rule for many AppServer objects on a host. If you simply delete the IpAddress based on LMT, you will be removing those AppServer objects as well, and they may not be dormant. This can occur when the LMT on the IpAddress is not properly updated -- for example, because the OS sensor failed -- yet the Application sensor worked and the AppServer CI has a current LMT. For the classes below, you should not only check the LMT of the guid but also check whether it is referenced as a superior guid in the database table superiors. If it is in the superiors table, it is serving as a naming rule attribute for one or more CIs and should not be deleted.
NOTE -- the queries in the table below are DB2 specific. For Oracle, the HEXTORAW function must be used. For example, when Oracle is the back-end database, use this for the LogicalContent query;
select GUID_C from BB_LOGICALCONTENT42_V where lastModifiedTime_c < ${DATE} and HEXTORAW(GUID_C) not in(select sup_supr_guid from superiors)
Class Name | Database query to find dormant guids |
LogicalContent | select GUID_C from BB_LOGICALCONTENT42_V where lastModifiedTime_c < ${DATE} and GUID_C not in(select hex(sup_supr_guid) from superiors) |
Fqdn | select GUID_C from BB_FQDN31_V where lastModifiedTime_c < ${DATE} and GUID_C not in(select hex(sup_supr_guid) from superiors) |
IpAddress | select GUID_C from BB_IPADDRESS73_V where lastModifiedTime_c < ${DATE} and GUID_C not in(select hex(sup_supr_guid) from superiors) |
BindAddress | DB2: select b.GUID_C from BB_BINDADDRESS67_V b inner join PERSOBJ_ALIASES pa on b.GUID_C = pa.GUID_X where b.lastModifiedTime_c < ${DATE} and pa.NRS_GUID_X not in (select sup_supr_guid from superiors); Oracle: select guid_c from BB_BINDADDRESS67_V left outer join superiors on HEXTORAW(GUID_C) = sup_supr_guid where sup_supr_guid is null and lastModifiedTime_c < ${DATE} |
One caveat -- if this is the first time you are doing cleanup, or if your cleanup interval is not at least weekly, you should use SQL that deletes in batches. If you try to delete too many rows at once, your DB2 server can run out of transaction log space, or Oracle can run out of UNDO space. Use "FETCH FIRST" in DB2 or "WHERE ROWNUM <" in Oracle to delete only the number of rows that your database can handle in one transaction. For example, to delete in batches of 50,000 for DB2;
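A hedged DB2 sketch, using DB2's DELETE-over-a-fullselect idiom with FETCH FIRST; the table and column names mirror the Oracle statements below, so verify them against your change history schema before running:

```sql
-- Hedged sketch: delete dormant change history in 50,000-row batches (DB2)
DELETE FROM (SELECT 1 FROM CHANGE_HISTORY_TABLE
             WHERE PERSIST_TIME < ${DATE}
             FETCH FIRST 50000 ROWS ONLY);
-- Then remove orphaned change causes, also in batches
DELETE FROM (SELECT 1 FROM CHANGE_CAUSE_TABLE
             WHERE CAUSE_ID NOT IN (SELECT ID FROM CHANGE_HISTORY_TABLE)
             FETCH FIRST 50000 ROWS ONLY);
```

Run each statement repeatedly until it reports zero rows affected.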
And for Oracle;
"DELETE FROM CHANGE_HISTORY_TABLE WHERE ID IN (SELECT ID FROM CHANGE_HISTORY_TABLE WHERE PERSIST_TIME < ${DATE} and ROWNUM < 50000)"
"DELETE FROM CHANGE_CAUSE_TABLE WHERE CAUSE_ID IN (SELECT CAUSE_ID FROM CHANGE_CAUSE_TABLE WHERE CAUSE_ID NOT IN (SELECT ID FROM CHANGE_HISTORY_TABLE) and ROWNUM < 50000)"
Additionally, you may find the first time you clean up the BindAddress class that there are so many guids to delete that the api is not a reasonable choice. This class generates a lot of dormant records: at the customer we worked with, 60-80K were created each week, and the first time we did cleanup there were millions of guids. With this in mind, the FIRST time you clean up this class, if using TADDM 7.2.1, you can use SQL to do the deletes. Always perform a DB backup prior to deleting anything directly from the database. The tables to clean up are bindaddr, persobj, mssobjlink_rel and aliases. Below is an example snippet of the code used to accomplish this;
-- First find the guids to delete (this query may take a long time if the bindaddr table is large);
...
Please use the api delete after the initial cleanup, as the tables that bindaddr CIs exist in today may change in the future. As noted above, deleting directly from the database is risky; only perform this task if you are comfortable with SQL and with monitoring that the queries completed successfully. Always have a DB backup available in case problems are encountered. Consult your DBA for further assistance if needed.
OBTAINING THE ${DATE} VALUE
LastModifiedTime, otherwise known as 'LMT' in TADDM, is the difference, measured in milliseconds, between the time the CI was last modified and midnight, January 1, 1970 UTC, taken on the StorageServer that is storing the data.
On Unix systems the command;
date '+%s'
returns seconds since 1970-01-01 00:00:00 UTC, so this value is 1000 times smaller than LMT, which is in milliseconds. You can use this command on Unix systems to get the current time stamp, multiply it by 1000, and then subtract (dormant days * 86400000) to get your ${DATE} value; 86400000 is one day in milliseconds. Then substitute the calculated value into the query.
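Putting this together, a minimal sketch that computes ${DATE} for a 45-day dormant window and substitutes it into one of the Table 1 queries:

```shell
#!/bin/sh
# Compute ${DATE}: current time in epoch milliseconds minus the dormant window.
DORMANT_DAYS=45
NOW_MS=$(( $(date '+%s') * 1000 ))              # date '+%s' gives seconds; LMT is milliseconds
DATE=$(( NOW_MS - DORMANT_DAYS * 86400000 ))    # 86400000 ms = one day
# Substitute the value into a Table 1 query
QUERY="select GUID_C from BB_COMPUTERSYSTEM40_V where lastModifiedTime_c < ${DATE}"
echo "$QUERY"
```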
Here are some functions you can create in DB2 if you prefer to use DB2 timestamps in queries:
To use this;
java DateToString 1/11/2008
Date is Sat Nov 01 00:00:00 EST 2008
MillSeconds is 1225515600000
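If GNU date is available, the same millisecond value can be computed without the Java helper (a sketch; the -d option is GNU-specific, and BSD date uses -j/-f instead):

```shell
#!/bin/sh
# Convert a calendar date to epoch milliseconds (GNU date assumed).
# Matches the DateToString output above: Nov 01 2008 midnight EST.
MS=$(( $(date -u -d '2008-11-01 00:00:00 EST' '+%s') * 1000 ))
echo "$MS"
```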
COULD I STILL HAVE DORMANT DATA AFTER THIS PROCESS IS COMPLETED?
Yes, there will probably be some -- perhaps small numbers from before fixes were implemented, or in tables that the customer we analyzed did not have data in. To check your database for other dormant data, most CI-specific tables contain a lastmodifiedtime_x column that you can query. You can get a list of all tables in the database with SQL such as;
DB2: select TABNAME from syscat.tables where tabschema = '<com.collation.db.user>';
ORACLE: select table_name from dba_tables where owner='<com.collation.db.user>' order by table_name;
Replace <com.collation.db.user> with the value of the db user in your collation.properties file. The user name must be in upper case in the query.
Then loop through those tables, querying for a row count where lastmodifiedtime_x is less than your desired dormant date. E.g.;
..
If a table does not have this column, you can ignore it; only class-specific tables will have the column.
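The loop described above can be sketched in shell. This version only generates the per-table count statements for DB2 -- the table list and ${DATE} value are placeholders for illustration; feed the output to your db2 command line, and ignore errors from tables lacking the column:

```shell
#!/bin/sh
# Generate a dormant-row count query per table (sketch; DB2 syntax assumed).
# DATE is the epoch-millisecond cutoff computed earlier; this value is only a
# placeholder, and TABLES would normally come from the syscat.tables query.
DATE=1225515600000
TABLES="BB_COMPUTERSYSTEM40 BB_APPSERVER6"
SQL=""
for TAB in $TABLES; do
  SQL="${SQL}select '${TAB}', count(*) from ${TAB} where lastmodifiedtime_x < ${DATE};
"
done
printf '%s' "$SQL"    # feed this to, e.g., db2 -tvf
```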
If you find large amounts of data not cleared after you implemented the cleanup process for all tables above, open a PMR and Support will help determine next steps.
PROBLEMS WE ENCOUNTERED DURING THE ANALYSIS
While we were analyzing the customer data, we did find some issues. As noted above, we were running TADDM 7.2.1.4; the following APARs were opened during the analysis to correct problems found. These APARs will be shipped in upcoming fix packs (some are in 7.2.1.5, which is already GA);
You can look up the status of these APARs, and any others, on the TADDM Support Portal here;
OTHER RESOURCES
There is an RFE to make this process part of the product. If this topic interests you, please vote for it here;
http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=26783
There is sample code for a dormant CI cleanup process on the wiki here;
And also for change history here;
Please comment below if you have any questions on this document.