Creating a custom data class that finds column similarities

Create a custom column similarity data class that uses a machine learning algorithm to search for similarities between a reference column and a column to be classified.

You must first run a column analysis on the column whose data you consider representative of the domain that you want to classify. This column is the reference column.

Column similarity is a machine learning algorithm that looks for similarities in the characteristics of the data between a reference column and a column to be classified. The main objective of this data class is not to find the same values, but to find the same kind of data. The algorithm takes several aspects of the data into consideration, such as the format, any repeating tokens, combinations of characters that are used often or never, statistical distributions, and so on.
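
To make "the same kind of data" more concrete, the following is a minimal sketch of only one of the aspects listed above, the format. It is not the product's algorithm, whose internals are not described here. The format_mask helper and both sample values are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Illustration only, NOT the product's algorithm.
# Reduce a value to a format mask:
#   A = uppercase letter, a = lowercase letter, 9 = digit,
#   every other character is kept as is.
format_mask() {
  printf '%s\n' "$1" |
    sed -e 's/[[:upper:]]/A/g' -e 's/[[:lower:]]/a/g' -e 's/[[:digit:]]/9/g'
}

# Hypothetical sample values: one from a reference column,
# one from a column to be classified.
ref_mask=$(format_mask "PRD-00417")
new_mask=$(format_mask "XKQ-98230")

if [ "$ref_mask" = "$new_mask" ]; then
  echo "same format: $ref_mask"
else
  echo "different formats: $ref_mask vs $new_mask"
fi
```

The two sample values share no characters in the same position apart from the hyphen, yet both reduce to the mask AAA-99999. This is the sense in which two columns can hold the same kind of data without holding the same values.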

After the column similarity data class is created, you do not need to keep the data set or the workspace that contains the data set. When you create the data class, all deployment information is copied to the data class definition, so the deployed data class does not access the original analysis results of the reference column. If you analyze the reference column again because the data has changed, you must redeploy the data class. After you create the data class, you can export it and import it on another computer that does not have the reference column.

On the services or client tiers, run the following command to create a data class of the type Column Similarity out of the reference column:

IAAdmin -user user_name -password password -url https://host:port
-createColumnSimilarityDataClass -classCode <class_code_of_new_class>
-className <name of new class> [-classDescription <description of new class>]
-projectName <project of reference column> -referenceColumn <name of reference column>
[-confidenceThreshold <confidence threshold of new class>]

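For example, the following command creates a data class with the class code test3 from the reference column PRODUCT_CODE in the Accounts project, with a confidence threshold of 0.6:

/opt/IBM/InformationServer/ASBNode/bin/IAAdmin.sh -user admin -password admin \
-url https://myserver.organization.com:9443 -createColumnSimilarityDataClass -classCode test3 \
-className test3Name -projectName Accounts \
-referenceColumn PRODUCT_DB.PRODUCTS.PRODUCTS_LIST.PRODUCT_CODE \
-confidenceThreshold 0.6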
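
Because the data class must be redeployed whenever the reference column is analyzed again, it can be convenient to keep the call in a small script. The following is a sketch under stated assumptions, not a supported tool. It uses only the options shown in the syntax above. The function names ia_create_class and print_args and the variables IA_PASSWORD and RUN_IAADMIN are hypothetical, and all values are the placeholders from the example command.

```shell
#!/bin/sh
# Hypothetical wrapper around the example command. Adjust all values.
IAADMIN=/opt/IBM/InformationServer/ASBNode/bin/IAAdmin.sh
IA_USER=admin
IA_URL=https://myserver.organization.com:9443
CLASS_CODE=test3
CLASS_NAME=test3Name
PROJECT=Accounts
REF_COLUMN=PRODUCT_DB.PRODUCTS.PRODUCTS_LIST.PRODUCT_CODE
THRESHOLD=0.6

# Print the arguments instead of running anything (used for the dry run).
print_args() {
  printf '%s\n' "$*"
}

# $1: program to invoke (IAAdmin.sh, or print_args for a dry run)
# $2: password value to pass to -password
ia_create_class() {
  "$1" -user "$IA_USER" -password "$2" -url "$IA_URL" \
    -createColumnSimilarityDataClass -classCode "$CLASS_CODE" \
    -className "$CLASS_NAME" -projectName "$PROJECT" \
    -referenceColumn "$REF_COLUMN" -confidenceThreshold "$THRESHOLD"
}

if [ "${RUN_IAADMIN:-no}" = "yes" ]; then
  # Real call: the password is read from the environment (assumed variable).
  ia_create_class "$IAADMIN" "$IA_PASSWORD"
else
  # Dry run (default): show the options with the password masked.
  ia_create_class print_args '********'
fi
```

By default the script only prints the options it would pass, which makes it easy to review the class code and reference column before anything is created. Setting RUN_IAADMIN=yes and IA_PASSWORD in the environment runs the real command, and keeps the password out of the script file.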