Enhancing the query performance through caching
You can configure different layers of caching in watsonx.data to optimize the query performance.
watsonx.data on Red Hat® OpenShift®
About this task
This procedure helps you to add (enable), edit, and delete different layers of caching that watsonx.data supports.
- Data cache: It reduces the need to repeatedly fetch the same data from the storage layer for running similar queries and the overall query processing time.
- Fragment result cache: It stores intermediate results of queries to speed up the repeated or similar queries.
- Metastore versioned cache: It stores the metadata about tables, like structure and schema, to reduce the need for repeated metadata lookups.
- File list cache: It keeps a list of files for a specified table and partition to avoid querying the storage system repeatedly for this information.
- File and stripe footer cache: It stores the footers of files and stripes in ORC/Parquet files, reducing the I/O for reading file metadata.
Metastore versioned cache and File and stripe footer cache are enabled by default with an option to update any of the properties listed in the following procedure:
Procedure
-
Set up the caching on a
wxdengineinstance.- Set the namespace in the console.
oc project ${PROJECT_CPD_INST_OPERANDS} - Determine which Presto engine you want to update.
oc get wxdengine -o custom-columns='DISPLAY NAME:spec.engineDisplayName,ENGINE ID:metadata.labels.engineName'Example:oc get wxdengine -o custom-columns='DISPLAY NAME:spec.engineDisplayName,ENGINE ID:metadata.labels.engineName' DISPLAY NAME ENGINE ID Presto presto750 presto-01 presto-01 - Delete all the
statefulsetsbelonging to that Presto engine by using itsENGINE IDfrom the previous command. This allows each Prestostatefulsetto be re-created with new volume mounts.
Example:oc delete sts -l engineName=<ENGINE_ID>oc delete sts -l engineName=presto750 statefulset.apps "ibm-lh-lakehouse-presto750-coordinator-blue" deleted statefulset.apps "ibm-lh-lakehouse-presto750-prestissimo-worker" deleted statefulset.apps "ibm-lh-lakehouse-presto750-presto-worker" deleted statefulset.apps "ibm-lh-lakehouse-presto750-single-blue" deleted - Add the cache configuration under the
specsection of engine configuration.
Example:oc patch wxdengine/<engine-name> \ --type=merge \ -n ${PROJECT_CPD_INST_OPERANDS} \ -p '{ "spec": { "<property1>": "<value1>", "<property2>": "<value2>" } }'oc patch wxdengine/lakehouse-presto-01 \ --type=merge \ -n ${PROJECT_CPD_INST_OPERANDS} \ -p '{ "spec": { "hive_metastore_cache_partition_versioning_enabled": "true", "hive_metastore_cache_scope": "PARTITION", "hive_metastore_cache_ttl": "3d", "hive_metastore_cache_refresh_interval": 4d, "hive_metastore_cache_maximum_size": "20000000" } }'Note: Add all the configurations that you want to change injsonformat.- Customizable configurations for different levels of cache
-
- Data cache
-
cacheStorageClass: <nfs-client> cacheStorageSize: <10Gi> cache_alluxio_max_cache_size: <8GB>Table 1. Data cache properties Property Type Default value Required/Optional Description cacheStorageClassString <nfs-client>Required The storage class of the PVC to be used for storing the cache. cacheStorageSizeString (number + unit of measure) <10Gi>Required The size of the cache volume. cache_alluxio_max_cache_sizeString (number + unit of measure) <8GB>Optional A default size based on the volume size is calculated if not specified. Should be at most 90% of the volume size. - Fragment result cache
-
fragment_result_cache_enabled: true fragment_result_cache_max_cached_entries: 1000000 fragment_result_cache_ttl: 36h fragment_result_cache_partition_statistics_based_optimization_enabled: true fragmentCacheStorageClass: <your storage class> fragmentCacheStorageSize: <10Gi>Table 2. Fragment result cache properties Property Type Default value Required/Optional Description fragment_result_cache_enabledBoolean trueRequired truemeans enable the cache andfalsemeans disable the cache.fragment_result_cache_max_cached_entriesInteger 1000000Optional Maximum number of result entries to be cached. fragment_result_cache_ttlTime 36hOptional Duration to keep the entries. fragment_result_cache_partition_statistics_based_optimization_enabledBoolean trueOptional Enable partition statistics-based pruning, allowing flexible caching of leaf query fragment and improving cache hit. fragmentCacheStorageClassString <your storage class>Required The storage class of the PVC that the cache volume is based on. fragmentCacheStorageSizeString <10Gi>Required The size of the cache volume. - Metastore versioned cache
-
hive_metastore_cache_partition_versioning_enabled: false hive_metastore_cache_scope: ALL hive_metastore_cache_ttl: 10m hive_metastore_cache_maximum_size: 10000000 hive_metastore_cache_timeout: 3m hive_metastore_cache_refresh_interval: 2mTable 3. Metastore versioned cache properties Property Type Default value Required/Optional Description hive_metastore_cache_partition_versioning_enabledBoolean falseRequired truemeans enable the cache andfalsemeans disable the cache.hive_metastore_cache_scopeString ALLOptional Whether to cover all or just partition. hive_metastore_cache_ttlTime 10mOptional Duration to keep cached metastore data as valid. hive_metastore_cache_maximum_sizeInteger 10000000Optiional Hive metastore cache maximum size. hive_metastore_cache_timeoutTime 3mOptional Timeout for Hive metastore requests. hive_metastore_cache_refresh_intervalTime 2mOptional Asynchronously refresh cached metastore data after access if it is older than this time but is not yet expired. It allows subsequent accesses to see fresh data. Note: The propertyhive_metastore_cache_timeoutis not applicable for watsonx.data 1.1.4 and later. - File list cache
-
file_status_cache_expire_time: 24h file_status_cache_tables: "*" file_status_cache_size: 100000000Table 4. File list cache properties Property Type Default value Required/Optional Description file_status_cache_expire_timeTime 24hOptional Time interval to expire or refresh the cache content. file_status_cache_tablesString "*"Optional Specifies the tables that are cached. *indicates that all tables in all schemas are cached.file_status_cache_sizeInteger 100000000Optional The size of the file cache.
- File and stripe footer cache for ORC or DWRF
-
file_metadata_orc_file_tail_cache_enabled: true file_metadata_orc_file_tail_cache_size: Calculated based on jvm.Xmx file_metadata_orc_file_tail_cache_ttl_since_last_access: 6h file_metadata_orc_stripe_metadata_cache_enabled: true file_metadata_orc_stripe_footer_cache_size: Calculated based on jvm.Xmx file_metadata_orc_stripe_footer_cache_ttl_since_last_access: 6h file_metadata_orc_stripe_stream_cache_size: Calculated based on jvm.Xmx file_metadata_orc_stripe_stream_cache_ttl_since_last_access: 6h file_metadata_orc_use_column_names: trueTable 5. File and stripe footer cache properties for ORC or DWRF Property Type Default value Required/Optional Description file_metadata_orc_file_tail_cache_enabledBoolean trueRequired trueto enable andfalseto disable this cache.file_metadata_orc_file_tail_cche_sizeInteger Calculated based on jvm.Xmx Optional The per-calculated size of cache. file_metadata_orc_file_tail_cache_ttl_since_last_accessTime 6h Optional Time to refresh or expire the cache content. file_metadata_orc_stripe_metadata_cache_enabledBoolean trueRequired Enable or disable this cache. file_metadata_orc_stripe_footer_cache_sizeInteger Calculated based on jvm.Xmx Optional The per-calculated size of cache. file_metadata_orc_stripe_footer_cache_ttl_since_last_accessTime 6h Optional Time to refresh or expire the cache content. file_metadata_orc_stripe_stream_cache_sizeInteger Calculated based on jvm.Xmx Optional The per-calculated size of cache. file_metadata_orc_stripe_stream_cache_ttl_since_last_accessTime 6hOptional Time to refresh or expire the cache content. file_metadata_orc_use_column_namesBoolean trueOptional If true, uses column name. If false, uses index for access. - File and stripe footer cache for Parquet
-
file_metadata_parquet_metadata_cache_enabled: true file_metadata_parquet_metadata_cache_size: Calculated based on jvm.Xmx file_metadata_parquet_metadata_cache_ttl_since_last_access: 6h file_metadata_parquet_metadata_batch_read_optimization_enabled: true file_metadata_parquet_metadata_use_column_names: trueTable 6. File and stripe footer cache properties for ORC or DWRF Property Type Default value Required/Optional Description file_metadata_parquet_metadata_cache_enabledBoolean trueRequired trueto enable andfalseto disable this cache.file_metadata_parquet_metadata_cache_sizeInteger Calculated based on jvm.Xmx Optional Size of cache, per-calculated. file_metadata_parquet_metadata_cache_ttl_since_last_accessTime 6hOptional Time interval to refresh or expire cache content. file_metadata_parquet_metadata_batch_read_optimization_enabledBoolean trueOptional Whether to enable the read optimization. file_metadata_parquet_metadata_use_column_namesBoolean trueOptional trueto use column name.falseto use index for access.
If you want to update the properties for Metastore versioned cache, File list cache, and File and stripe footer cache without deleting the stateful sets, then update thecr.yamlfile and run the following command to track the status of operator reconcile and the config maps.
Thewatch --color "oc get wxdEngine -n ${PROJECT_CPD_INST_OPERANDS} -o custom-columns='NAME:metadata.name,DISPLAY NAME:spec.engineDisplayName,RECONCILE:status.engineStatus,STATUS:status.middleEndStatus'"RECONCILEstatus first appears asRUNNINGand then changes toCOMPLETED. Restart all Presto pods to apply the new configuration. - Set the namespace in the console.
- Optional: Set up Persistent Volume (PV).
This step is for setting storage classes for data cache and fragment result cache. You need two separate PVs and storage classes for data cache and fragment cache. Use the
localStorageProvisioner, and complete the following steps:- Create a
yamlfile.vi pv1.yaml - Copy the content to
pv1.yamlfile and save it.Note: The values forname,storage,storageClassName,path, andnodeAffinityare based on the client requirement.For example, in name: presto-cache-pv1 in the following output, presto-cache-pv1 is customizable.
You must set a path, which has sufficient storage.
apiVersion: v1 kind: PersistentVolume metadata: name: presto-cache-pv1 spec: capacity: storage: 1Gi volumeMode: Filesystem accessModes: - ReadWriteOnce persistentVolumeReclaimPolicy: Delete storageClassName: cache-storage local: path: /tmp/cache nodeAffinity: required: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: - worker0.bicorn.cp.fyre.ibm.com - worker1.bicorn.cp.fyre.ibm.com - To get the values field under
nodeAffinity, runoc get nodesand use the name of the worker nodes that have the disk space available to mount the cache of Presto pods. - Access the debugging session on the selected worker node and create the necessary
directory structure.
oc debug node/<name of node> -- chroot /host mkdir -p <path used under local, /tmp/cache>Note: Repeat steps c and d for all the selected nodes. - Run the following command to apply the PV configurations.
oc apply -f pv1.yaml - Provision more PVs based on the t-shirt sizing. If you have three Presto pods, create
pv2.yaml, andpv3.yaml. Forpv2, use name aspresto-cache-pv2and path aspath /tmp/cache/pv2. Forpv3, use namepresto-cache-pv3and path aspath /tmp/cache/pv3. - Create two or more PVs based on the requirements.
oc apply -f pv2.yaml pv3.yaml
- Create a
- Delete the caching.
- Determine the Presto instance from which you want to remove the
cache.
Example:oc get wxdengine -o custom-columns='DISPLAY NAME:spec.engineDisplayName,ENGINE ID:metadata.labels.engineName'oc get wxdengine -o custom-columns='DISPLAY NAME:spec.engineDisplayName,ENGINE ID:metadata.labels.engineName' DISPLAY NAME ENGINE ID Presto presto750 presto-01 presto-01 - Delete all the
statefulsetsbelonging to that Presto engine by using itsENGINE IDfrom the previous command. This allows each Prestostatefulsetto be re-created with new volume mounts.
Example:oc delete sts -l engineName=<ENGINE_ID>oc delete sts -l engineName=presto750 statefulset.apps "ibm-lh-lakehouse-presto750-coordinator-blue" deleted statefulset.apps "ibm-lh-lakehouse-presto750-prestissimo-worker" deleted statefulset.apps "ibm-lh-lakehouse-presto750-presto-worker" deleted statefulset.apps "ibm-lh-lakehouse-presto750-single-blue" deleted - Remove the configurations of the cache that you want to delete.
oc patch wxdengine/<engine-name> -n ${PROJECT_CPD_INST_OPERANDS} --type='json' -p='[{"op": "remove", "path": "/spec/property1"}, {"op": "remove", "path": "/spec/property2"}]' - Delete the Persistent Volume Claims (PVC) of the cache.
- For data cache
-
- List the PVCs of the data
cache.
$ oc get pvc | grep ibm-lh-cache-mount - Delete the
PVCs.
$ oc delete pvc <PVC's name>
- List the PVCs of the data
cache.
- For fragment-result cache
-
- List the PVCs of the fragment-result
cache.
$ oc get pvc | grep ibm-lh-fragment-cache - Delete the
PVCs.
$ oc delete pvc <PVC's name>
- List the PVCs of the fragment-result
cache.
- Delete the PVs if you have completed step
2.
oc delete pv <pv name>
Data cachecacheStorageClasscacheStorageSizecache_alluxio_max_cache_size
Fragment result cachefragment_result_cache_enabledfragment_result_cache_max_cached_entriesfragment_result_cache_ttlfragment_result_cache_partition_statistics_based_optimization_enabledfragmentCacheStorageClass and fragmentCacheStorageSize
Metastore versioned cachehive_metastore_cache_partition_versioning_enabledhive_metastore_cache_scopehive_metastore_cache_ttlhive_metastore_cache_refresh_intervalhive_metastore_cache_maximum_sizehive_metastore_cache_timeoutNote: The propertyhive_metastore_cache_timeoutis not applicable from version 1.1.4 and later of watsonx.data.
File list cachefile_status_cache_expire_timefile_status_cache_tablesfile_status_cache_size
File and stripe footer cachefile_metadata_parquet_metadata_cache_enabledfile_metadata_parquet_metadata_cache_sizefile_metadata_parquet_metadata_cache_ttl_since_last_accessfile_metadata_parquet_metadata_batch_read_optimization_enabledfile_metadata_parquet_metadata_use_column_namesfile_metadata_orc_file_tail_cache_enabledfile_metadata_orc_file_tail_cache_sizefile_metadata_orc_file_tail_cache_ttl_since_last_accessfile_metadata_orc_stripe_metadata_cache_enabledfile_metadata_orc_stripe_footer_cache_sizefile_metadata_orc_stripe_footer_cache_ttl_since_last_accessfile_metadata_orc_stripe_stream_cache_sizefile_metadata_orc_stripe_stream_cache_ttl_since_last_accessfile_metadata_orc_use_column_names
- Determine the Presto instance from which you want to remove the
cache.