IBM Support

IBM Sterling Order Management - Performance Guide

IBM Sterling Order Management Complete Performance Guide.
Overview
Our IBM® Sterling Support mission is to partner proactively with you to best position you for success when using our IBM® Sterling Order Management System Software and services. We have accumulated years of experience supporting our SaaS, cloud, and on-premises clients, while providing a proactive model for SaaS production application monitoring. Drawing on this experience, we have derived a robust collection of best practices.

Below is a consolidated collection of our most critical recommendations focused on peak workload performance and stability. This document, along with the ongoing webinar series, provides insight into our proven best practices around customization patterns, configurations, testing, and ongoing housekeeping.  As with all recommendations, be sure to test these out in a non-production environment to validate and tune to your specific environments and use cases.

Application Performance
Integration

JMS Performance

    
  • Review the MessageBufferPutTime statistic relative to the ExecuteMessageCreated statistic in the YFS_STATISTICS_DETAIL table to identify slow message puts
  • Use non-persistent queues for internal agent queues
  • Use persistent queues for external integration or integration server processes.
  • Avoid message selectors; use dedicated internal/external queues instead
  • Avoid long-running transactions to prevent the MQRC_BACKED_OUT error
  • Optimize the output template to prevent the MQRC_MSG_TOO_BIG_FOR_Q error when posting a message
    • Use message compression when handling large inbound or outbound messages. Read more →
      • Note: IBM Sterling Order Management on Cloud allows a 4 MB message size.
  • Cache the JMS bindings file (especially if you are serving the file from a PV with slow I/O)
    • yfs.jms.sender.alwayslookupqueue.disabled=true
  • Enable JMS Session Pool
    • yfs.yfs.jms.session.disable.pooling=N
  • Enable multi-threaded PUTs, which can help improve the performance of the RTAM server. Read More →
    • yfs.yfs.jms.sender.multiThreaded=true
  • Use anonymous reuse (requires JMS Session pooling to be enabled)
    • yfs.jms.sender.anonymous.reuse=true
  • Enable JMS connection retries. Read more → 
    • Retry Interval: 100 ms
    • Number of Retries: at least 3
  • Enable agent bulk sender properties to post messages in batches.
    • yfs.agent.bulk.sender.enabled=Y|true
    • yfs.agent.bulk.sender.batch.size=5000
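
One way to act on the first recommendation in this list is to derive an average put time per message from the two statistics. The sketch below is illustrative only, not product code. It assumes that MessageBufferPutTime holds the cumulative put time in milliseconds and that ExecuteMessageCreated holds the number of messages created for the same service and interval. It works on rows already fetched from YFS_STATISTICS_DETAIL as plain dictionaries; the dictionary keys, the sample service names, and the 50 ms threshold are hypothetical.

```python
from collections import defaultdict


def average_put_time_ms(rows):
    """Return {service: average put time per message in ms}.

    `rows` is an iterable of dicts with the hypothetical keys
    'service', 'statistic', and 'value'.
    """
    totals = defaultdict(lambda: {"put_ms": 0.0, "messages": 0.0})
    for row in rows:
        if row["statistic"] == "MessageBufferPutTime":
            totals[row["service"]]["put_ms"] += row["value"]
        elif row["statistic"] == "ExecuteMessageCreated":
            totals[row["service"]]["messages"] += row["value"]
    # Skip services that created no messages to avoid dividing by zero.
    return {
        service: t["put_ms"] / t["messages"]
        for service, t in totals.items()
        if t["messages"] > 0
    }


def slow_services(rows, threshold_ms=50.0):
    """List services whose average put time exceeds the assumed threshold."""
    averages = average_put_time_ms(rows)
    return sorted(s for s, avg in averages.items() if avg > threshold_ms)


sample = [
    {"service": "ScheduleOrder", "statistic": "MessageBufferPutTime", "value": 1200},
    {"service": "ScheduleOrder", "statistic": "ExecuteMessageCreated", "value": 100},
    {"service": "RTAM", "statistic": "MessageBufferPutTime", "value": 9000},
    {"service": "RTAM", "statistic": "ExecuteMessageCreated", "value": 100},
]
print(average_put_time_ms(sample))
print(slow_services(sample))
```

A rising average for one service, while others stay flat, would point at that service's queue or payload size rather than at the JMS provider as a whole.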

External System

  • Use connection pool (cached/persistent connection, keep-alive) with appropriate connect and read timeouts.
  • Cache the authentication token for reuse; regenerate it upon expiry or on a 401 or 403 status code. Read more →
  • Adhere to best practice when invoking Inventory Visibility APIs. Read more for updated best practices →
  • Use 100 item-node pairs per payload when invoking APIs for multiple lines.
  • Implement a polling process to retrieve failed events. Read more →
  • Space out the sync supply and snapshot calls; check with all stakeholders about ad-hoc executions or special requests during peak.
    • Avoid redundant calls to generate snapshot
  • Avoid redundant network availability recomputes. Consider the following situations:
    • The Recompute Network Availability API recomputes availability for existing distribution groups (DGs).
    • The Update DG API recomputes availability for newly created or modified DGs, but not for existing DGs.
    • Bulk updates that turn existing nodes on or off.
      • Note: The requirement to turn off fulfillment by type can be met with a front-end feature flag.
  • Make sure to review release notes and API documents.
    • Refactor logic promptly to avoid the risks of relying on deprecated or discontinued APIs. Read more →
    • Upgrade to V2 APIs for IBM Sterling Intelligent Promising. Read more →
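
The token caching and payload sizing recommendations above can be sketched as follows. This is a minimal illustration under stated assumptions, not an IBM client library: `fetch_token` and `send` are hypothetical caller-supplied functions, the 30-second refresh margin is an arbitrary choice, and no real endpoint is called.

```python
import time


class TokenCache:
    """Caches a bearer token and refreshes it on expiry or on demand."""

    def __init__(self, fetch_token, clock=time.monotonic, skew_seconds=30):
        # fetch_token() is a hypothetical caller-supplied function that
        # returns (token, lifetime_in_seconds) from the identity provider.
        self._fetch_token = fetch_token
        self._clock = clock
        self._skew = skew_seconds
        self._token = None
        self._expires_at = 0.0

    def get(self):
        # Reuse the cached token until shortly before it expires.
        if self._token is None or self._clock() >= self._expires_at - self._skew:
            self._token, lifetime = self._fetch_token()
            self._expires_at = self._clock() + lifetime
        return self._token

    def invalidate(self):
        # Call after a 401 or 403 response so the next get() refetches.
        self._token = None


def call_with_token(cache, send):
    """Call send(token) -> (status, body); retry once with a fresh token on 401/403."""
    status, body = send(cache.get())
    if status in (401, 403):
        cache.invalidate()
        status, body = send(cache.get())
    return status, body


def chunked(pairs, size=100):
    """Split a list of item-node pairs into batches of at most `size`."""
    return [pairs[i:i + size] for i in range(0, len(pairs), size)]
```

Retrying only once after a 401 or 403 keeps a genuinely unauthorized caller from looping against the token service.
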
Database

Database Performance

    
  • Enable the property yfs.yfs.app.identifyconnection=Y to identify the source of a DB query or connection.
  • Enable and optimize entity cache.
  • DB2: Enable stmt_conc (LITERALS)*
  • Oracle: Avoid full table scans with the ConsiderOracleDateTimeAsTimeStamp attribute. Read more →
  • Monitor the frequently invalidated table caches and disable them if needed
  • Avoid blank queries, which commonly occur when non-optimal API input is passed in. Ensure APIs are invoked with key filtering attributes.
  • Make sure SQL statements aren't built with unique literal values at runtime, because this reduces statement cache reuse.
  • Enable the application performance features and properties required for concurrent workloads to avoid DB contention (HOTSku, Capacity, etc.)
  • Watch for YFS_PERSON_INFO query; apply index if needed. Read more →
    • Check for duplicate records in YFS_PERSON_INFO table. 
  • Disable unnecessary transaction (entity) audits (Order Audits, General Audits, etc.)
  • Disable resource-intensive database maintenance during peak periods (such as REORG and RUNSTATS)
  • Tune or avoid ad-hoc queries used for reporting purposes; if possible, run them against standby or replica instances.
  • Monitor the database for long-running queries, queries in lock-wait, and transaction log usage.
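
The point about SQL formed with unique values at runtime applies mainly to custom code that builds its own queries. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for DB2 or Oracle, and the `orders` table is hypothetical. It contrasts statements that embed a different literal on every call with a single parameterized statement that the database can cache and reuse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_no TEXT PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("O1", "Created"), ("O2", "Released"), ("O3", "Shipped")],
)

order_numbers = ["O1", "O2", "O3"]

# Anti-pattern: each call embeds a different literal, so the database sees
# three distinct statement texts and cannot reuse one cached plan.
literal_statements = {
    f"SELECT status FROM orders WHERE order_no = '{n}'" for n in order_numbers
}

# Preferred: one statement text with a parameter marker, reused for every value.
PARAMETERIZED = "SELECT status FROM orders WHERE order_no = ?"
statuses = [conn.execute(PARAMETERIZED, (n,)).fetchone()[0] for n in order_numbers]

print(len(literal_statements))  # number of distinct statement texts
print(statuses)
```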

Database Hygiene

  • Maintaining a healthy database can prevent disruption in production.
  • Reduce the IBM Sterling Order Management database size with entity level compression and enhanced purges. More details →
  • The accumulation of transactional data over long periods of time (and failure to purge where possible) may degrade query performance. Ensure all necessary purges are running to maintain a healthy and lightweight database, which in turn minimizes performance issues.
Platform (Server Profile, Certified Containers)

Server Profile (JVM/Pod specification)

    

There are three server performance profiles for the agent or integration server. 
 
Balanced: Provides moderate memory and computing power for typical Sterling Order Management System Software workload. 
Compute: Provides additional computing power and moderate memory. 
Memory: Provides additional memory and moderate computing power.

For example, the Memory or Compute profile is suitable for servers or processes that work with large datasets (large XMLs) within a single transaction boundary. The RTAM server is a good example, because it can process any number of inventory activities within a single transaction.

There is no definite formula to identify the right profile. However, you can use the following guidelines to select a performance profile for optimal results.
  • Start with a single thread for the server, single instance of the server, and the Balanced profile.
  • Increase the number of threads gradually to arrive at the correct profile and maximum number of threads per server.
  • If CPU or memory allocation does not change significantly with each additional thread, continue with the Balanced performance profile. Servers that spend most of the time calling external services display this kind of resource use pattern.
  • If the JVM heap utilization stays around 80% or increases significantly with each additional thread, change the profile to Memory.
  • With any of the performance profiles, if the CPU allocation stays around 70% or memory allocation stays around 80%, you might scale the server (Pod) rather than increase the number of threads.
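
The guidelines above can be expressed as a small decision helper for use during thread-tuning runs. This is a sketch under assumptions: "around 80%" and "around 70%" are treated as greater-than-or-equal thresholds, a 10-point heap increase per added thread is an arbitrary stand-in for "increases significantly", and the profile check is applied before the scale-out check. None of these choices come from the product documentation.

```python
SIGNIFICANT_HEAP_GROWTH_PCT = 10  # assumed meaning of "increases significantly"


def next_step(profile, heap_pct, cpu_pct, memory_pct, heap_growth_pct):
    """Suggest the next tuning action after adding one thread and measuring.

    profile         -- current profile name: "Balanced", "Compute", or "Memory"
    heap_pct        -- JVM heap utilization in percent
    cpu_pct         -- CPU allocation used in percent
    memory_pct      -- memory allocation used in percent
    heap_growth_pct -- heap utilization increase caused by the last added thread
    """
    # Heap pressure on the Balanced profile: move to the Memory profile.
    if profile == "Balanced" and (
        heap_pct >= 80 or heap_growth_pct >= SIGNIFICANT_HEAP_GROWTH_PCT
    ):
        return "change profile to Memory"
    # Sustained CPU or memory pressure on any profile: scale out Pods.
    if cpu_pct >= 70 or memory_pct >= 80:
        return "scale out Pods instead of adding threads"
    # Resource use barely moved: keep the profile and try another thread.
    return "add a thread and measure again"


print(next_step("Balanced", heap_pct=50, cpu_pct=40, memory_pct=60, heap_growth_pct=2))
```
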
Select the optimal server profile and thread configuration for agent processes and integration services to ensure that each service can scale with your custom logic and configuration. Here is the general approach:
[Figure: step-by-step process]
 
Recommendations:
  • Spawning additional (untuned) agent instances to try to improve throughput can exhaust the available resource allocation, may stress the system, and can have a cascading impact on other transactions.
  • Community article on Sterling OMS Performance Profiles

Certified Containers

With certified containers, the Sterling OMS operator provides the ability to define serverProfiles that group workloads such as the appserver and agent/integration servers based on common compute and memory requirements. Read more on Configuring serverProfiles parameter for certified container →
Best Practices:
  • Review CPU and memory resource requests and limits, define optimal profiles, and agent & integration threads.
                serverProfiles:
                - name: "profile-name"
                  resources:
                    limits:
                      cpu: millicores 
                      memory: bytes
                    requests:
                      cpu: millicores
                      memory: bytes
                ...
                - name: agents-huge
                  profile: ProfileHuge
                  property:
                    customerOverrides: AgentProperties
                    envVars: EnvironmentVariables
                    jvmArgs: BaseJVMArgs
                  replicaCount: 1
                  agentServer:
                    names: [ScheduleLargeServer, ReleaseLargeOrderServer]
  • Watch for CPU throttling. 
  • Adjust the Default Executor threads and the data source pool size according to the serverProfile you have defined for the application server.
    • Separate traffic by context root using
            appServer:
              ingress:
                contextRoots: [smcfs, sbc, sma, wsc, isf]
  • Leverage Pod/Cluster Autoscalers. Read more on Configuring horizontalPodAutoscalers parameter 
  • Monitor network latency to the database and JMS server; keep a network utility available to validate the connections.
  • Understand & Tune Ingress/Egress request/response limits
  • Consume the latest release and keep the environment up to date.
  • Avoid automatic operator updates for production.
  • Review the MustGather and be prepared to capture diagnostics as it describes. IBM Order Management Software Certified Containers: Performance Issues →
  • Check for SSL certificate validity for all internal and external communications.
  • Monitor NFS IOPS for shared mounts used to store certs, catalog index, etc.
  • Reduce redundant RMI calls; verify host ulimits (Open File/Socket)
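
To make "watch for CPU throttling" concrete, the sketch below parses the contents of a container's cgroup `cpu.stat` file and reports the fraction of scheduler periods in which the container was throttled. The `nr_periods` and `nr_throttled` counters exist in both cgroup v1 and v2. The file location (commonly `/sys/fs/cgroup/cpu.stat` inside a cgroup v2 container) and the sample values are assumptions, and in most clusters the platform's monitoring stack exposes the same counters as metrics.

```python
def throttled_ratio(cpu_stat_text):
    """Return the fraction of CFS periods in which the container was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        # No enforcement periods recorded (for example, no CPU limit set).
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Sample content in the cgroup v2 layout; the values are made up.
sample = """usage_usec 5000000
nr_periods 1000
nr_throttled 150
throttled_usec 900000
"""
print(throttled_ratio(sample))
```

A ratio that stays high under normal load suggests the CPU limit in the serverProfile is too low for the configured thread count.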

Recommendations:

  • Spawning additional (untuned) agent instances to try to improve throughput can exhaust the available resource allocation, may stress the system, and can have a cascading impact on other transactions.
  • Do not rely solely on the pod autoscaler; performance test for the expected workload and scale replicas accordingly.
Performance Testing Guidance

Overview

    

To best position yourself for success on the OMS platform, it is important to understand how your application handles various scenarios known to challenge performance or stability. Testing in pre-production with data and workloads representative of production lets you identify and address issues without impact to production business and operations. Performance testing is an art, but a mandatory one! It is imperative to surface issues in advance through pre-production load testing, rather than wait for them to appear as business-critical production issues.

Performance Testing & Optimization

Best Practices

  • Projected peak volumes – Ensure business and IT are in sync on expected peak loads to ensure planned tests are accurate.
  • Representative Combination Tests – Assemble components to reflect real-time data and scenarios, and run them in parallel to ensure adherence to NFRs; stage data for the various components and run them under full load (for example, Create + Schedule + Release + Create Shipment + Confirm Shipment + Inventory Snapshot (IV))
    • Ensure the inventory picture (supply), node capacity (resource pool), distribution groups, and node setup reflect the production state.
    • Do not reuse the same PI data across test transactions, as it can lead to unexpected contention. In production, PI data is unique.
  • Agent and integration servers – Ensure asynchronous batch processing components are tested in isolation and in combination with the broader workload; tune agents and integration servers (profile, threads) to meet expected peak SLAs/NFRs on throughput
  • Test Failure Scenarios – Validate the resiliency of the overall system and operations, ensuring graceful recovery if a front-end channel (web, mobile, Call Center, Store, EDI, JMS), the backend OMS, or an external integration endpoint fails. Include ‘kill switches’ in any components that can be disabled to avoid magnifying an isolated issue into a system-wide one, especially for any synchronous calls.
  • Confirm Peak days and Hours - Share any specific key dates or max burst times with IBM Support, including code freezes, flash sales.
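
For the recommendation not to reuse the same PI data, the sketch below generates unique synthetic records for test orders. It assumes that "PI data" refers to the customer person-info (name and address) data attached to an order; the field names, value formats, and record count are illustrative only and should be mapped to whatever your order-creation input expects.

```python
import uuid


def make_person_info(index):
    """Build one synthetic, unique person-info record for a test order."""
    unique = uuid.uuid4().hex[:12]
    return {
        "FirstName": f"Test{index}",
        "LastName": f"User{unique}",
        "AddressLine1": f"{index + 1} Load Test Street",
        "City": "Testville",
        "ZipCode": f"{10000 + index % 90000:05d}",
        "EMailID": f"loadtest+{unique}@example.com",
    }


# One distinct record per test order, so no two orders share a row.
records = [make_person_info(i) for i in range(1000)]
print(records[0]["AddressLine1"], len(records))
```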

Note: Refer to Knowledge Center for detailed Tuning and performance guidance.

   
Monitoring

Common Alerts

    
Synthetics 
  • Availability Check (Ping, Scripted Browser, API Test)
Application
 
  • Golden signals (Throughput, Latency, Error Rate, Saturation)
  • Agent/Integration Server JVM Health (GC & Heap Utilization) 
  • Application Server JVM Health (GC, Heap, Default Executor Threads)
  • Container workload (Pod) (CPU throttling, availability of desired number of Pods, frequent restarts, etc.)
  • Application Statistics: YFS_STATISTICS_DETAIL
Database
  • Queries in Lock-Wait
  • Long Running Query
  • Transaction Log
  • Tablespace
  • HADR log replay (replication lag monitoring)
 
JMS (MQ)
 
  • Queue depth
  • Message delay or equivalent queueing/dequeuing rate for all JMS queues.
 
Infrastructure (VM, Container Orchestration Platform)
  • Local, NFS Disk Utilization
  • VM (host) not responding
  • CPU, Memory, Disk Usage
  • CPU Steal
  • Open Sockets (ulimit)
 
Network (Proxy, Load Balancer, Firewall)
 
  • Upstream/downstream latency
  • Number of Queued Requests
  • Average Queue Time
  • Active Session
  • Bytes Received/Sent Per Second
Note: IBM Sterling Order Management on Cloud has built-in monitoring for these components. Read More
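
As an illustration of the golden signals listed under Application, the sketch below derives throughput, error rate, and 95th-percentile latency from request samples collected over one window. The sample format and the synthetic data are hypothetical. Saturation is not computed here because it comes from resource metrics such as CPU, heap, and thread-pool usage rather than from request samples.

```python
def golden_signals(requests, window_seconds):
    """Summarize (latency_ms, ok) request samples collected over one window."""
    if not requests:
        return {"throughput_rps": 0.0, "error_rate": 0.0, "p95_latency_ms": None}
    latencies = sorted(latency for latency, _ in requests)
    errors = sum(1 for _, ok in requests if not ok)
    # Nearest-rank 95th percentile, using integer math: ceil(0.95 * n).
    rank = (95 * len(latencies) + 99) // 100
    return {
        "throughput_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p95_latency_ms": latencies[rank - 1],
    }


# 100 synthetic requests over a 10-second window; every 20th request fails.
samples = [(100 + i, i % 20 != 0) for i in range(100)]
print(golden_signals(samples, window_seconds=10))
```
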
     

Note: Please make sure to validate any of the described configuration changes in lower environments before promoting them to production.
 

General IBM Support hints and tips


Document Information

Modified date:
02 August 2024

UID

ibm17112352