
Copyright: 2017, 2023. Last updated: 2023-01-06


Dynamically configuring the Speech to Text service or Speech to Text Adapter

By using the IBM® Voice Gateway API, you can dynamically configure the IBM® Speech to Text service or Speech to Text Adapter during a call. To change the configuration, define the vgwActSetSTTConfig action in the output of a node response in your Watson Assistant dialog tree. For more information about using the API, see Defining action tags and state variables.

Watson Speech to Text service instances that are not part of Premium plans log requests and their results by default to improve the service for future users. To prevent IBM from using your data in this way, see Data collection in the Watson Speech to Text API reference.

Note: Changing the configuration for the Speech to Text service causes the connection from the voice gateway to the Speech to Text service to disconnect and reconnect, which might cause the voice gateway to miss part of an utterance. Typically, the connection is reestablished while audio is streamed to the caller from the Watson Assistant response, which avoids missing any part of an utterance unless the caller barges in quickly.

See the following sections for examples of defining the vgwActSetSTTConfig action:

Speech to Text service

This example shows configuration that you can add to the node response in your Watson Assistant dialog tree. The settings are transparently passed as JSON properties to the Speech to Text service.

{
  "output": {
    "vgwAction": {
      "command": "vgwActSetSTTConfig",
      "parameters": {
        "credentials": {
          "url": "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/{instance_id}",
          "apikey": "{apikey}",
          "tokenServiceProviderUrl": "https://iam.cloud.ibm.com/identity/token"
        },
        "config": {
          "x-watson-learning-opt-out": true,
          "model": "en-US_NarrowbandModel",
          "profanity_filter": true,
          "smart_formatting": true,
          "customization_id": "81d3630-ba58-11e7-aa4b-41bcd3f6f24d",
          "acoustic_customization_id": "e4766090-ba51-11e7-be33-99bd3ac8fa93"
        },
        "confidenceScoreThreshold": 0.7,
        "echoSuppression": true,
        "bargeInResume": true,
        "connectionTimeout": 30,
        "requestTimeout": 15
      }
    }
  }
}

Table 1. JSON properties for the Speech to Text service
JSON property Description
credentials Credentials for the IBM® Speech to Text service. If not defined, the default credentials from the Media Relay configuration are used. You can also reduce call latency times by configuring the tokenAuthEnabled credential to enable token authentication for Version 1.0.0.5a and later. See Enabling user name and password based token authentication for Watson services.
config Parameters for the Watson Speech to Text service when using narrowband audio. For a full list of parameters, see the WebSockets API reference for Watson Speech to Text Service.
broadbandConfig Parameters for the Watson Speech to Text service when broadband audio is enabled. Required only when bandPreference is set to broadband. At minimum, the language model must be defined on the model property. For a full list of parameters, see the WebSockets API reference for Watson Speech to Text Service. Version 1.0.0.4 and later.
bandPreference Defines which audio band to prefer when negotiating audio codecs in the session. Set to broadband to use broadband audio when possible. The default value is narrowband. Version 1.0.0.4 and later.
confidenceScoreThreshold Confidence threshold for messages from the Speech to Text service. Messages with a confidence score under the threshold are not forwarded to Watson Assistant. The default value of 0 means that all responses are used. Recommended values are between 0 and 1.
echoSuppression Indicates whether to suppress results from Speech to Text that might be caused by an echo of Text to Speech synthesis. Version 1.0.0.4c and later.
bargeInResume Set to true to resume playing back audio after barge-in if the confidence score of the final utterance is lower than the threshold specified by the confidenceScoreThreshold property. Version 1.0.0.5 and later.
connectionTimeout Time in seconds that Voice Gateway waits to establish a socket connection with the Watson Speech to Text service. If the time is exceeded, Voice Gateway reattempts to connect with the Watson Speech to Text service. If the service still can't be reached, the call fails. Version 1.0.0.5 and later.
requestTimeout Time in seconds that Voice Gateway waits to establish a speech recognition session with the Watson Speech to Text service. If the time is exceeded, Voice Gateway reattempts to connect with the Watson Speech to Text service. If the service still can't be reached, the call fails. Version 1.0.0.5 and later.
updateMethod Optional. Specifies the update strategy to choose when setting the speech configuration. Possible values:
  • replace
  • replaceOnce
  • merge
  • mergeOnce

See Using updateMethod. Version 1.0.0.7 and later.

The parameters that you can set under the config and broadbandConfig JSON properties reflect the parameters that are made available by the Speech to Text WebSocket interface. The WebSocket API sends two types of parameters: query parameters, which are sent when Voice Gateway connects to the service, and message parameters, which are sent as JSON after the connection is established. For example, model and customization_id are query parameters, and smart_formatting is a WebSocket message parameter. For a full list of parameters, see the WebSockets API reference for Watson Speech to Text Service.

You can define the following query parameters for the Media Relay's connection to the Speech to Text service. Any other parameter that you define under config or broadbandConfig is passed through on the WebSocket message request.

  • model
  • customization_id
  • acoustic_customization_id
  • version - Version 1.0.0.4c and later
  • x-watson-learning-opt-out
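
For example, to prefer broadband audio and supply the required broadband language model, you can combine the bandPreference and broadbandConfig properties. This is a minimal sketch; the model name is illustrative, and any supported broadband model can be used:

{
  "output": {
    "vgwAction": {
      "command": "vgwActSetSTTConfig",
      "parameters": {
        "bandPreference": "broadband",
        "broadbandConfig": {
          "model": "en-US_BroadbandModel",
          "smart_formatting": true
        }
      }
    }
  }
}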

Note: The following parameters from the Speech to Text service can't be modified because they have fixed values that are used by the Media Relay.

  • action
  • content-type
  • interim_results
  • continuous
  • inactivity_timeout

Example: Setting the Speech to Text language model to Spanish (es-ES_NarrowbandModel)

In this example, the language model is switched to Spanish and smart formatting is enabled. Because the credentials property isn't defined, the Media Relay uses the credentials that are defined in the Media Relay configuration (WATSON_STT_URL, WATSON_STT_USERNAME, and WATSON_STT_PASSWORD).

{
  "output": {
    "vgwAction": {
      "command": "vgwActSetSTTConfig",
      "parameters": {
        "config": {
          "model": "es-ES_NarrowbandModel",
          "smart_formatting": true
        }
      }
    }
  }
}

Using updateMethod

You can use the updateMethod property in dynamic configuration to define how changes to the configuration are applied, either by replacing the configuration or by merging in new configuration properties, and whether the change lasts for the duration of the call or for only one conversation turn.

Table 2. Available options to update properties when using updateMethod.
Value Description
replace Replaces the configuration for the duration of the call.
replaceOnce Replaces the configuration once, so the configuration is used for only the following conversation turn. Then, it reverts to the previous configuration.
merge Merges the configuration with the existing configuration for the duration of the call.
mergeOnce Merges the configuration for one turn of the conversation, and then reverts to the previous configuration.
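
For example, to replace the Speech to Text configuration for only the next conversation turn and then revert, you can set updateMethod to replaceOnce. This is a minimal sketch that reuses the vgwActSetSTTConfig action shown earlier:

{
  "output": {
    "vgwAction": {
      "command": "vgwActSetSTTConfig",
      "parameters": {
        "updateMethod": "replaceOnce",
        "config": {
          "model": "es-ES_NarrowbandModel"
        }
      }
    }
  }
}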

Example: Using the mergeOnce method to update the Speech to Text recognizeBody property

The following example shows the grammar for the mergeOnce method in the Watson Assistant dialog. By setting updateMethod to mergeOnce, Watson Assistant uses the vgwActSetSTTConfig action tag to append the recognizeBody property to the Speech to Text configuration in the Voice Gateway JSON configuration file. These properties are used by Voice Gateway only until the next conversation turn.

{
  "output": {
    "text": "I can speak Spanish now",
    "vgwAction": {
      "command": "vgwActSetSTTConfig",
      "parameters": {
        "updateMethod": "mergeOnce",
        "config": {
          "recognizeBody": {
            "contentType": "application/srgs",
            "body": "#ABNF 1.0 ISO-8859-1;\nlanguage en-US;\nmode voice;\nroot $pattern;\n$pattern = $alphanum <1-> ;\n$alphanum = $digit | $letter;\n$digit = zero | one | two | three | four | five | six | seven | eight | nine;\n$letter = \"A.\" | \"B.\" | \"C.\" | \"D.\" | \"E.\" | \"F.\" | \"G.\" | \"H.\" | \"I.\" | \"J.\" | \"K.\" | \"L.\" | \"M.\" | \"N.\" | \"O.\" | \"P.\" | \"Q.\" | \"R.\" | \"S.\" | \"T.\" | \"U.\" | \"V.\" | \"W.\" | \"X.\" | \"Y.\" | \"Z.\" ;\n"
          }
        }
      }
    }
  }
}

Updating fields that are not root level

When you configure dynamically from Watson Assistant, only the root-level fields that you specify, such as config or bargeInResume, are updated. Fields that are omitted from the action keep their original configuration settings. To merge individual config fields with the existing configuration instead of replacing the whole field, use the merge or mergeOnce value of the updateMethod property.
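
For example, a node that sends only the confidenceScoreThreshold root-level field changes that one field and leaves config and every other root-level field unchanged. This is a minimal sketch; the threshold value is illustrative:

{
  "output": {
    "vgwAction": {
      "command": "vgwActSetSTTConfig",
      "parameters": {
        "confidenceScoreThreshold": 0.5
      }
    }
  }
}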

Voice Gateway Speech to Text Adapter

The following example for Cloud Speech API shows configuration that you can add to the node response in your Watson Assistant dialog tree. The settings are transparently passed as JSON properties to the Cloud Speech API.

{
  "output": {
    "vgwAction": {
      "command": "vgwActSetSTTConfig",
      "parameters": {
        "config": {
          "languageCode": "es-ES",
          "profanityFilter": true,
          "maxAlternatives": 2,
          "speechContexts": [{
              "phrases": [ "Si", "Por supuesto", "Claro", "Si por favor"]
          }]
        },
        "thirdPartyCredentials": {
            "type": "service_account",
            "project_id": "my_google_project",
            "private_key_id": "d2f36f96cb0c58309a5eba101cef4af0663d9465",
            "private_key": "-----BEGIN PRIVATE ... \n-----END PRIVATE KEY-----\n",
            "client_email": "developer1@my_google_project.iam.gserviceaccount.com",
            "client_id": "100033083330209022330835",
            "auth_uri": "https://accounts.google.com/o/oauth2/auth",
            "token_uri": "https://accounts.google.com/o/oauth2/token",
            "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
            "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/developer1@my_google_project.iam.gserviceaccount.com"
        }
      }
    }
  }
}

The JSON properties define the values that will change. If a property isn't defined, the existing value is used.

Table 3. JSON properties for the Google Cloud Speech API
JSON property Description
config Parameters for the Google Cloud Speech API RecognitionConfig request. For a full list of parameters, see the RecognitionConfig API documentation.
thirdPartyCredentials Contents of a Google Cloud project service account JSON file. If this property is omitted, the credentials specified on the GOOGLE_APPLICATION_CREDENTIALS environment variable are used.

Note: The following fields for RecognitionConfig in the Cloud Speech API can't be modified because they have fixed values that are used by the Speech To Text Adapter.

  • encoding
  • sample_rate_hertz
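
Because undefined properties keep their existing values, a node that changes only the recognition language for the Speech to Text Adapter can send a minimal config. This is a sketch; the language code is illustrative, and the credentials on the GOOGLE_APPLICATION_CREDENTIALS environment variable are assumed to remain in effect:

{
  "output": {
    "vgwAction": {
      "command": "vgwActSetSTTConfig",
      "parameters": {
        "config": {
          "languageCode": "fr-FR"
        }
      }
    }
  }
}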

Deprecated: Configuring the Speech to Text service by defining state variables

In Version 1.0.0.2, configuring the Watson Speech to Text service by defining state variables was deprecated in favor of the action tags that are described in the previous sections.

Important: Although the state variables continue to function, you can't define these deprecated state variables and the action tags within a node. Your Watson Assistant dialog can contain a mixture of action tags and deprecated state variables, but the JSON definition for each node can contain only one or the other.

{
  "context": {
    "vgwSTTConfigSettings": {
      "credentials": {
        "url": "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/{instance_id}",
        "apikey": "{apikey}",
        "tokenServiceProviderUrl": "https://iam.cloud.ibm.com/identity/token"
      },
      "config": {
        "x-watson-learning-opt-out": true,
        "model": "en-US_NarrowbandModel",
        "profanity_filter": true,
        "smart_formatting": true
      },
      "confidenceScoreThreshold": 0.7
    }
  }
}
Table 4. JSON properties for the Speech to Text service
JSON property Description
credentials Credentials for the Watson Speech to Text service. If not defined, the default credentials from the Media Relay configuration are used.
config Parameters for the Watson Speech to Text service. See the WebSockets API reference for Watson Speech to Text Service.
confidenceScoreThreshold Confidence threshold for messages from the Speech to Text service. Messages with a confidence score under the threshold are not forwarded to Watson Assistant. The default value of 0 means that all responses are used. Recommended values are between 0 and 1.