
About IBM Voice Gateway

IBM® Voice Gateway enables direct voice interactions over a telephone with a cognitive self-service agent. It can also transcribe a phone call between a caller and an agent so that the conversation can be processed with analytics for real-time agent feedback. Voice Gateway orchestrates Watson services and integrates them with a public or private telephone network by using the Session Initiation Protocol (SIP).

Ways to use IBM Voice Gateway

With IBM Voice Gateway, you can set up both self-service agents and agent assistants.

Self-service agents are similar to an Interactive Voice Response (IVR) system, which provides an automated way to communicate with callers by using audio over a telephone call. With Watson, self-service agents communicate in a more conversational manner and can handle complex interactions that are difficult for traditional IVRs.
Agent assistants provide a way to run real-time analytics on a phone call between a caller and a live human agent by converting the voice streams into text. These text utterances can then be processed with services outside of Voice Gateway, such as the Watson Natural Language Classifier service and the Watson Discovery service, to generate useful information that an agent can immediately use to help a caller. Note that integration with specific analytic services isn't covered in this documentation.

The type of implementation that you choose determines how you set up Voice Gateway. Learn more about each of these implementations in the following sections.

Self-service agents

With self-service agents, customers are directed through the voice gateway to interact with Watson services that you train to provide certain responses. You can optionally enable the Watson services to opt out to a call center agent by initiating a call transfer through the API.

The customer call is routed through the voice gateway, which orchestrates Watson services. If configured, the call can be routed to a human agent.

On the back end, a self-service agent is made of the following components, which each fulfill a different role:

Core capabilities: IBM Voice Gateway, which orchestrates and connects all other components
Your agent's voice: Watson services that enable voice interaction with the caller
- Speech to Text: Converts the caller's audio into text
- Watson Assistant: Analyzes the text, maps it to intents or capabilities, and provides a response according to a dialog
- Text to Speech: Converts the response into voice audio
Telephone network connection: A SIP trunk or session border controller, which enables customers to call Voice Gateway over the telephone network
Customization with APIs: An optional service orchestration engine (SOE), which sits between Watson Assistant and Voice Gateway so that you can further customize your environment with your own third-party APIs
Analytics and transcriptions: An optional REST server, which stores reporting events that contain call data for monitoring, logging, and further analysis

Watson service orchestration

The following diagram shows how Voice Gateway orchestrates the various Watson services to enable a self-service agent. Within seconds, utterances flow between the services to result in a natural-sounding conversation with the caller.

Voice Gateway acts as a hub through which the caller and each Watson service communicate.

The caller asks a question.
The question is streamed to the Speech to Text service.
A text utterance is returned.
The text is sent to Watson Assistant as a message request.
A message response is returned.
The response text is sent to the Text to Speech service.
Synthesized audio is returned.
Voice Gateway streams the audio response to the caller.
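
The hub pattern in the steps above can be sketched in a few lines of Python. This is only an illustration of the flow, not how Voice Gateway is implemented: handle_turn and the three fake_* functions are hypothetical stand-ins for the Watson services, and no real SDK or network call is made.

```python
from typing import Callable, Dict, Tuple

# Hypothetical service interfaces (assumptions for illustration only).
SpeechToText = Callable[[bytes], str]
Assistant = Callable[[str, Dict], Tuple[str, Dict]]
TextToSpeech = Callable[[str], bytes]


def handle_turn(caller_audio: bytes,
                stt: SpeechToText,
                assistant: Assistant,
                tts: TextToSpeech,
                context: Dict) -> Tuple[bytes, Dict]:
    """Run one conversational turn through the hub."""
    utterance = stt(caller_audio)                        # audio in, text utterance out
    reply_text, context = assistant(utterance, context)  # message request and response
    reply_audio = tts(reply_text)                        # response text in, audio out
    return reply_audio, context                          # audio goes back to the caller


# Toy stand-ins so that the sketch runs without any network access.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_assistant(text: str, context: Dict) -> Tuple[str, Dict]:
    turns = context.get("turns", 0) + 1
    return f"You said: {text}", {**context, "turns": turns}


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")
```

Calling handle_turn(b"hello", fake_stt, fake_assistant, fake_tts, {}) passes the audio through each stand-in in order and returns the reply audio together with the updated conversation context.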

Conversation flow through a service orchestration engine

For self-service agents, you can optionally add a service orchestration engine (SOE) to your environment, which enables you to add your own layer of customization to the communication between Voice Gateway and the Watson Assistant service. Voice Gateway and Watson Assistant communicate through the Watson Assistant REST API, sending request data using only the MessageRequest method and receiving a corresponding JSON response. The service orchestration engine acts as a proxy for Watson Assistant, intercepting message requests and responses and modifying them by using third-party APIs.

Message requests and responses between Voice Gateway and Watson Assistant flow through a service orchestration engine, which modifies them.

For production deployments of Voice Gateway, you might want to incorporate a service orchestration engine for the following reasons:

To de-identify requests to remove personal information such as PHI, PII, and PCI before it's sent to Watson Assistant
To personalize responses from Watson Assistant, for example by using customer location information to provide a personal weather forecast
To enable telephony features, such as including caller ID or collecting DTMF digits for account numbers
To customize interactions with customers by using APIs
To use Voice Gateway state variables, for example to complete a long transaction
To integrate voice security by using DTMF or biometrics

To learn more about how to implement a service orchestration engine, see Connecting through a service orchestration engine.
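
As a rough illustration of the proxy idea, the following Python sketch wraps a backend function so that requests are de-identified on the way in and responses are personalized on the way out. Everything here is an assumption made for the example: the message is reduced to a dictionary with a single "text" field, the rule that any run of nine or more digits is sensitive is arbitrary, and echo_backend merely stands in for Watson Assistant.

```python
import re
from typing import Callable, Dict

# Hypothetical message shape; the real message request and response carry more fields.
Message = Dict[str, str]
Backend = Callable[[Message], Message]

# Assumption for illustration: any run of 9 or more digits is treated as sensitive.
DIGIT_RUN = re.compile(r"\d{9,}")


def deidentify(request: Message) -> Message:
    """Mask long digit runs before the request leaves the SOE."""
    return {**request, "text": DIGIT_RUN.sub("[REDACTED]", request["text"])}


def personalize(response: Message, caller_city: str) -> Message:
    """Fill a placeholder in the response with caller-specific data."""
    return {**response, "text": response["text"].replace("{city}", caller_city)}


def make_soe(backend: Backend, caller_city: str) -> Backend:
    """Wrap a backend so that the wrapper can be called exactly like the backend."""
    def proxy(request: Message) -> Message:
        response = backend(deidentify(request))
        return personalize(response, caller_city)
    return proxy


def echo_backend(request: Message) -> Message:
    # Stand-in for Watson Assistant: echoes what it received.
    return {"text": f"Heard '{request['text']}' in {{city}}"}
```

Because make_soe returns a function with the same signature as the backend, the caller does not need to know whether it is talking to the backend directly or through the proxy.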

Features for self-service agents

Barge-in: Callers can interrupt Watson if the utterance Watson is sending to the caller isn't relevant to the context of the conversation.
Call transfer: The gateway can be signaled to initiate a transfer from the Watson Assistant service through the use of action tags. To perform the transfer, the gateway uses a SIP REFER request as defined in section 6.1 of RFC 5589.
Call hang-up: The gateway can be signaled to terminate a call from the Watson Assistant service through the use of an action tag.
Music on hold: The gateway can play an audio file that is specified by Watson Assistant for some period of time or until processing in Watson Assistant completes.
SSML tagging: Speech Synthesis Markup Language (SSML) tags are used to control how Text to Speech synthesizes utterances into audio. The gateway supports passing these tags through to Text to Speech when received from Watson Assistant.
Latency auditing: The gateway monitors latency, which is a key indicator of how well Watson is communicating with a caller. Because the gateway orchestrates several Watson services, it's critical to be able to identify when one of these services is slow to respond, which ultimately results in long voice response delays and unnatural conversations with the caller.
Context mapping: When transferring out of Watson, the gateway provides a way for Watson Assistant to specify metadata that gets embedded in the SIP REFER message. The metadata can be used to map context saved during the conversation back to a live agent session.
Audio recording: The gateway can be configured to record audio conversations in the form of 16-bit, single-channel (mono) WAV files. These WAV files are stored on a configured Docker volume and must be retrieved through some external means such as FTP. Typically, this feature is used to gather training data for the Speech to Text service.
DTMF support: The gateway supports RFC 4733, RTP Payload for DTMF Digits, Telephony Tones, and Telephony Signals. See Collecting Dual-tone multi-frequency (DTMF) responses.
Whitelisting: To prevent denial-of-service attacks, the gateway supports the ability to configure a whitelist. This whitelist enables filtering of inbound SIP INVITE requests based on the SIP To URI and From URI.
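
The whitelisting check described above can be pictured with a short Python sketch. It is not the gateway's implementation or configuration syntax: the function names are hypothetical, the normalization is deliberately simplistic, and treating an empty whitelist as "no filtering" is an assumption.

```python
from typing import Set


def normalize(uri: str) -> str:
    """Drop URI parameters such as ';transport=udp' and lowercase the rest.

    Lowercasing everything is a simplification made for this sketch.
    """
    return uri.split(";", 1)[0].strip().lower()


def allow_invite(to_uri: str, from_uri: str,
                 allowed_to: Set[str], allowed_from: Set[str]) -> bool:
    """Accept an INVITE only if each non-empty whitelist contains its URI.

    Assumption: an empty whitelist means that header is not filtered.
    """
    to_ok = not allowed_to or normalize(to_uri) in allowed_to
    from_ok = not allowed_from or normalize(from_uri) in allowed_from
    return to_ok and from_ok
```

For example, with allowed_to containing only the normalized form of sip:+15551230000@gateway.example.com, an INVITE addressed to that URI is accepted regardless of its From URI, and an INVITE addressed to any other URI is rejected.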

Agent assistants

The voice gateway provides the ability to transcribe caller and callee (e.g. contact-center agent) audio from an active phone call in real time using the SIPREC protocol. This capability requires a session border controller (SBC) that supports the ability to fork media out to the voice gateway, which is acting as a SIPREC Session Recording Server (SRS).

For agent assistants, the voice gateway forks the call to Watson services, which transcribe the conversation to provide feedback to a human agent.

Features for agent assistants

Audio recording: The gateway can be configured to record audio conversations in the form of 16-bit, single-channel (mono) WAV files. These WAV files are stored on a configured Docker volume and must be retrieved through some external means such as FTP. Typically, this feature is used to gather training data for the Speech to Text service.
Whitelisting: To prevent denial-of-service attacks, the gateway supports the ability to configure a whitelist. This whitelist enables filtering of inbound SIP INVITE requests based on the SIP To URI and From URI.
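
After recorded WAV files are retrieved from the Docker volume, you might want to confirm that they have the 16-bit mono format described for the audio recording feature before using them as training data. The following sketch uses Python's standard wave module. The function names are hypothetical, and the 8000 Hz sample rate used by the helper is an assumption, not a documented value.

```python
import io
import wave
from typing import BinaryIO, Dict, Union


def describe_recording(source: Union[str, BinaryIO]) -> Dict[str, float]:
    """Return basic properties of a WAV file and check for 16-bit mono audio."""
    with wave.open(source, "rb") as wav:
        channels = wav.getnchannels()
        bits = wav.getsampwidth() * 8
        rate = wav.getframerate()
        frames = wav.getnframes()
    if channels != 1 or bits != 16:
        raise ValueError(f"expected 16-bit mono, got {bits}-bit, {channels} channel(s)")
    return {"sample_rate_hz": rate, "duration_s": frames / rate}


def make_silence(seconds: float, rate: int = 8000) -> io.BytesIO:
    """Build an in-memory 16-bit mono WAV of silence for trying out the check."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16 bits
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf
```

describe_recording accepts either a file path or an open binary file, so the same check works on retrieved recordings and on in-memory test data.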

Architecture

IBM Voice Gateway is one of several components in the overall architecture of self-service agents and agent assistants. The architecture and technologies that are used differ depending on your implementation. For self-service agents, callers can either connect directly to the voice gateway through a SIP trunk or indirectly through a session border controller (SBC).

Voice Gateway architecture

Voice Gateway is composed of two separate microservices, the SIP Orchestrator and the Media Relay. These microservices are delivered in the form of two separate Docker images.

SIP Orchestrator: Orchestrates the Watson Assistant service and the Media Relay
- Runs on WebSphere Application Server Liberty
- Acts as either a SIP User Agent Server (UAS) or a SIPREC Session Recording Server (SRS)
- Built in Java
- Delivered as a WebSphere Liberty User Feature
- Configured through Docker environment variables
Media Relay: Handles all media processing for the voice gateway
- Runs on Node.js
- Processes inbound and outbound RTP audio
- Orchestrates Watson Speech to Text (STT) and Text to Speech (TTS) services
- Built in JavaScript using Node Streams architecture
- Delivered as a Node Module
- Configured through Docker environment variables

The following diagram shows at a high level how these two microservices combine to provide the full functionality of IBM Voice Gateway:

The separate microservices in the voice gateway, the SIP Orchestrator and the Media Relay, communicate using APIs.

Connecting to services using an MRCP server

In addition to using IBM® Speech to Text, IBM® Text to Speech, or the IBM® Voice Gateway Speech to Text Adapter, Voice Gateway also supports Media Resource Control Protocol Version 2 (MRCPv2) connections. You can use a mixture of third-party speech recognition and voice synthesizing services that are coordinated by Voice Gateway. See Configuring services with MRCPv2.

Self-service agent architecture when using a SIP trunk

When connecting to a self-service agent through a SIP trunk, you must configure your SIP trunk to forward INVITE requests to the voice gateway based on its IP address and SIP port.

Calls flow through a SIP trunk to the voice gateway, which communicates with Watson services through the API.

SIP trunks can be used to quickly set up and test the voice gateway by calling your Watson services from the public telephone network. In this case you can simply deploy the voice gateway to a public cloud Docker container service, such as IBM® Cloud Kubernetes Service. On-premises enterprise integration typically requires that you configure a session border controller (SBC), which is discussed in the next section.

Self-service agent architecture when using an SBC

Session border controllers are typically used in cases when you want to enable customers to be transferred to live contact center agents. In a self-service agent where communications flow through a session border controller (SBC), you need to configure the SBC to forward calls to the voice gateway based on its IP address and SIP port. Note that to enable call transfers, the SBC must stay in the call path so that it can handle SIP REFER messages:

Calls flow to an SBC and then to the voice gateway, which communicates with Watson services through the API.

Agent assistant architecture when conferencing calls through an MCU

For agent assistants, media from the call between a customer and a human agent must be shared with Voice Gateway so that it can transcribe the call. One method of routing call media to Voice Gateway is to conference it into the ongoing call. Typically, this conferencing requires a multipoint control unit (MCU) or a participant in the call that can act as an MCU. Voice Gateway sends call audio for speech-to-text processing and then sends returned transcriptions to a configured reporting REST server.

The call is conferenced with the agent and Voice Gateway through a multipoint control unit. Voice Gateway listens in on the call, sends call audio for speech-to-text processing and then sends returned transcriptions to a REST server or other analytics gateway.

Agent assistant architecture when forking calls through an SBC

Another option for agent assistants is to fork calls from a session border controller (SBC) to Voice Gateway, which acts as a SIPREC Session Recording Server (SRS). Voice Gateway sends call audio for speech-to-text processing and then sends returned transcriptions to a REST server or other analytics gateway that supports REST APIs.

The call goes to an SBC, which forks the call to the voice gateway as it forwards the call to the human agent. Voice Gateway sends call audio for speech-to-text processing and then sends returned transcriptions to a REST server or other analytics gateway.
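
To give a feel for the receiving side, here is a minimal sketch of a REST server that accepts JSON events over HTTP POST and keeps them in memory. It is built only on the Python standard library. The event fields, the /events path, and all names are hypothetical; the actual reporting event format is not described in this section.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

received_events = []  # in-memory store; a real server would persist events


class ReportingHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            event = json.loads(self.rfile.read(length))
        except json.JSONDecodeError:
            self.send_response(400)  # body was not valid JSON
            self.end_headers()
            return
        received_events.append(event)
        self.send_response(200)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_server(port: int = 0) -> HTTPServer:
    """Start the server on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), ReportingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_event(port: int, event: dict) -> int:
    """Send one event the way a reporting client might, and return the HTTP status."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request("POST", "/events", body=json.dumps(event).encode("utf-8"),
                 headers={"Content-Type": "application/json"})
    status = conn.getresponse().status
    conn.close()
    return status
```

A quick local trial is to call start_server(), post a sample event to server.server_port with post_event, and then inspect received_events.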

Supported languages

Voice Gateway supports the following languages with Watson speech services:

English (UK)
English (US)
Japanese
Portuguese (Brazilian)
Spanish

The IBM® Voice Gateway Speech to Text Adapter and IBM® Voice Gateway Text to Speech Adapter enable you to use additional languages for self-service agents through the Google Cloud Speech API and Google Cloud Text-to-Speech API. For more information, see Integrating third-party speech services. By using the speech to text adapter and text to speech adapter, you can extend your Voice Gateway deployment to support languages that include the following:

French
German
Italian
Korean

For a language to be supported, it must be supported by all services you integrate with Voice Gateway, including the third-party speech services and the IBM Watson™ Assistant service. For more information, see Supported languages for the Watson Assistant service. You can enable support for additional languages by creating custom speech adapters, which you can use to integrate third-party speech recognition (speech-to-text) and speech synthesis (text-to-speech) services. The speech adapter samples can help you get started with creating the speech adapters.

Note: IBM Voice Gateway does not provide licenses to any external services, including the Watson services or third-party speech services.

Supported protocols

SIP/SIPS: The gateway supports connecting to Watson as if it were a SIP endpoint via a SIP trunk, from an enterprise session border controller (SBC), or from a multipoint control unit (MCU).
SIPREC: Session border controllers can fork media from a phone call between a customer and agent to Watson through the voice gateway using SIPREC to convert the voice streams to text for analytic processing. This protocol is used for agent assistants.
RTP: The Real-time Transport Protocol (RTP) is supported for audio media streams. The voice gateway supports the following audio encodings:
- G.711: G.711 μ-law (PCMU) and A-law (PCMA) pulse-code modulation (PCM) at 64 kbps.
- G.722: G.722 at 64 kbps.
- G.729: G.729 Annex A and B through the use of external codec services.
SRTP: The Secure Real-time Transport Protocol (SRTP) supports encryption of media streams in Version 1.0.4.0 and later.
RTCP: The RTP Control Protocol (RTCP) is supported to provide quality of service (QoS) statistics for RTP media streams in Version 1.0.0.5 and later.
DTMF: Dual-tone multi-frequency (DTMF) signals are converted into single-digit text utterances that are sent to the configured Watson API. IBM Voice Gateway supports the following DTMF protocols:
- RFC 4733, RTP Payload for DTMF Digits, Telephony Tones, and Telephony Signals
- SIP INFO (Version 1.0.4.0 and later)
HTTP/REST: These formats are supported for reporting events, such as call detail records (CDR), Watson Assistant turn events, and transcription events.
MRCPv2: The gateway supports using Media Resource Control Protocol Version 2 (MRCPv2) to connect to speech to text and text to speech services.
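
To make the DTMF entry above more concrete, the following sketch decodes the 4-byte telephone-event payload that RFC 4733 defines (event code, end bit, reserved bit, volume, and duration) and maps the event code to the digit that would become a single-digit text utterance. The class and function names are hypothetical, and this is not the gateway's own code.

```python
import struct
from typing import NamedTuple

# RFC 4733 DTMF event codes 0-15 map to these characters in order.
DTMF_EVENTS = "0123456789*#ABCD"


class TelephoneEvent(NamedTuple):
    digit: str      # the key that was pressed
    end: bool       # True on the packet that marks the end of the event
    volume: int     # power level, expressed in -dBm0 (0 to 63)
    duration: int   # duration so far, in RTP timestamp units


def parse_telephone_event(payload: bytes) -> TelephoneEvent:
    """Decode the first 4 bytes of an RTP telephone-event payload."""
    if len(payload) < 4:
        raise ValueError("telephone-event payload must be at least 4 bytes")
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    if event >= len(DTMF_EVENTS):
        raise ValueError(f"event code {event} is not a DTMF digit")
    return TelephoneEvent(
        digit=DTMF_EVENTS[event],
        end=bool(flags & 0x80),   # E bit is the most significant bit
        volume=flags & 0x3F,      # low 6 bits; the R bit (0x40) is ignored
        duration=duration,
    )
```

For example, the payload bytes 0x05 0x8A 0x03 0x20 decode to digit "5" with the end bit set, a volume of 10, and a duration of 800 timestamp units.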

System requirements

To deploy Voice Gateway in production environments, the following minimum software and hardware levels are required.

Table 1. Supported platforms and operating systems
Platform	Operating system
Linux® 64-bit	Red Hat Enterprise Linux (RHEL) 7.5 and 7.6
	Ubuntu 16.04 LTS

Because IBM Voice Gateway is distributed as a set of Docker images, you can also deploy Voice Gateway on other platforms that support Docker and Kubernetes. For example, you can deploy Voice Gateway on 64-bit Windows environments using Docker for Windows and Docker Machine.

Table 2. Deployment environment requirements
Environment	Minimum version
Docker	Community Edition or Enterprise Edition, Version 1.13 or later (Note: Swarm mode isn't supported.)
Kubernetes	Version 1.7.3 or later
IBM Cloud Kubernetes Service	N/A - Cloud-based service

Table 3. Virtualized hardware requirements
Hardware	Minimum requirements
Virtual machine RAM	8 gigabytes (GB)
Virtual CPUs (vCPUs)	2 vCPUs with x86-64 architecture at 2.4 GHz clock speed (Note: Varies based on expected number of concurrent calls and other factors.)
Storage	50 gigabytes (GB) (Note: Call recording and log storage settings significantly affect storage requirements.)

The exact virtualized hardware that is needed to reach your required level of performance varies greatly depending on several factors, including the expected number of concurrent calls, product configuration, and Watson Assistant dialog. If you need help planning your Voice Gateway environment, contact the product team as described in Getting help.
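
As a back-of-the-envelope illustration of how call recording affects storage, the following sketch estimates the size of uncompressed 16-bit mono audio. The 8000 Hz sample rate is an assumption (it matches narrowband telephony audio such as G.711, but the recording sample rate is not stated here), the small WAV header is ignored, and logs and other data are not counted.

```python
def recording_bytes(call_seconds: float, sample_rate_hz: int = 8000) -> int:
    """Approximate size of a 16-bit mono recording, ignoring the WAV header."""
    bytes_per_sample = 2  # 16-bit samples
    channels = 1          # mono
    return int(call_seconds * sample_rate_hz * bytes_per_sample * channels)


def calls_that_fit(storage_gb: float, avg_call_seconds: float,
                   sample_rate_hz: int = 8000) -> int:
    """How many recordings of the given average length fit in storage_gb (decimal GB)."""
    return int(storage_gb * 10**9 // recording_bytes(avg_call_seconds, sample_rate_hz))
```

Under these assumptions, one hour of recorded audio takes about 57.6 MB, so recordings alone can consume a large share of the 50 GB minimum if they are not moved off the volume regularly.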