May 17, 2017 By Vidyasagar Machupalli 3 min read

Who’s speaking? : Speaker Diarization with Watson Speech-to-Text API

Distinguishing between two speakers in a conversation is difficult, especially when you hear them remotely or for the first time. The same is true when multiple voices interact with AI/cognitive systems, virtual assistants, and home assistants like Alexa or Google Home. To address this, Watson's Speech to Text API has been enhanced to support real-time speaker diarization.

After we built a popular chatbot using Watson services, we received a couple of requests to include the speaker labels setting in our code sample.

So, What is Speaker Diarization?

Speaker diarisation (or diarization) is the process of partitioning an input audio stream into homogeneous segments according to the speaker identity. It can enhance the readability of an automatic speech transcription by structuring the audio stream into speaker turns and, when used together with speaker recognition systems, by providing the speaker’s true identity.

Why Speaker Diarization?

Real-time speaker diarization is a need we’ve heard about from many businesses across the world that rely on transcribing volumes of voice conversations collected every day. Imagine you operate a call center and regularly take action as customer and agent conversations happen — issues can come up like providing product-related help, alerting a supervisor about negative feedback, or flagging calls based on customer promotional activities. Prior to today, calls were typically transcribed and analyzed after they ended. Now, Watson’s speaker diarization capability enables access to that data immediately.

To experience speaker diarization via the Watson speech-to-text API on IBM Bluemix, head to this demo and click to play sample audio 1 or 2. If you check the input JSON below, you'll see that the optional "speaker_labels" parameter is set to true. This is what enables distinguishing between speakers in a conversation.

{
 "continuous": true,
 "timestamps": true,
 "content-type": "audio/wav",
 "interim_results": true,
 "keywords": [
  "IBM",
  "admired",
  "AI",
  "transformations",
  "cognitive",
  "Artificial Intelligence",
  "data",
  "predict",
  "learn"
 ],
 "keywords_threshold": 0.01,
 "word_alternatives_threshold": 0.01,
 "smart_formatting": true,
 "speaker_labels": true,
 "action": "start"
}
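If you are talking to the WebSocket interface programmatically, you can build the same start-action message in code rather than by hand. Here is a minimal sketch in Python that constructs and serializes the message above; the parameter names come from the JSON sample, while the variable names are purely illustrative:

```python
import json

# Build the WebSocket "start" message shown above.
# "speaker_labels": True is the setting that turns on speaker diarization.
start_message = {
    "continuous": True,
    "timestamps": True,
    "content-type": "audio/wav",
    "interim_results": True,
    "keywords": ["IBM", "admired", "AI", "transformations", "cognitive",
                 "Artificial Intelligence", "data", "predict", "learn"],
    "keywords_threshold": 0.01,
    "word_alternatives_threshold": 0.01,
    "smart_formatting": True,
    "speaker_labels": True,  # enable speaker diarization
    "action": "start",
}

# Serialize the message before sending it over the WebSocket connection.
payload = json.dumps(start_message)
```

The serialized `payload` is what you would send as the first text frame on the WebSocket connection, before streaming the audio data.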

A part of output JSON after real-time speech-to-text conversion:

{
 ....
     "confidence": 0.927,
     "transcript": "So thank you very much for coming Dave it's good to have you here. "
    }
   ],
   "final": true,
   "speaker": 0
  }

You can see that a speaker label is assigned to each speaker turn in the conversation.
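To turn the raw response into a readable, per-speaker transcript, you can match each word's start time against the `speaker_labels` entries. The sketch below assumes a response shape with word-level `timestamps` in the results and a top-level `speaker_labels` array (as returned when `speaker_labels` is true); the sample values themselves are hypothetical:

```python
# Hypothetical response fragment in the shape returned with "speaker_labels": true.
response = {
    "results": [
        {
            "alternatives": [
                {
                    "transcript": "so thank you very much for coming Dave",
                    "timestamps": [
                        ["so", 0.0, 0.3], ["thank", 0.3, 0.6], ["you", 0.6, 0.8],
                        ["very", 0.8, 1.0], ["much", 1.0, 1.3], ["for", 1.3, 1.5],
                        ["coming", 1.5, 1.9], ["Dave", 1.9, 2.3],
                    ],
                }
            ],
            "final": True,
        }
    ],
    "speaker_labels": [
        {"from": 0.0, "to": 1.3, "speaker": 0},
        {"from": 1.3, "to": 2.3, "speaker": 1},
    ],
}

def label_words(response):
    """Pair each word with the speaker whose time range covers its start time."""
    labels = response.get("speaker_labels", [])
    pairs = []
    for result in response["results"]:
        for word, start, end in result["alternatives"][0]["timestamps"]:
            speaker = next(
                (l["speaker"] for l in labels if l["from"] <= start < l["to"]),
                None,
            )
            pairs.append((speaker, word))
    return pairs

def to_turns(response):
    """Group consecutive words by speaker into conversational turns."""
    turns = []
    for speaker, word in label_words(response):
        if turns and turns[-1][0] == speaker:
            turns[-1][1].append(word)
        else:
            turns.append((speaker, [word]))
    return [(s, " ".join(ws)) for s, ws in turns]

for speaker, text in to_turns(response):
    print(f"Speaker {speaker}: {text}")
```

With the sample values above, this prints one line per speaker turn ("Speaker 0: so thank you very much", then "Speaker 1: for coming Dave").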

Steps to enable speaker diarization

  • Watson speech-to-text is available as a service on IBM Bluemix, IBM's cloud platform. Create a new service instance for your application to use.

  • If you are taking the REST API approach, don’t forget to include the optional parameter “speaker_labels”: true in your request JSON.

  • Depending on the programming language your application is written in, use any of the easy-to-use SDKs available on Watson Developer Cloud, including Python, Node, Java, and Swift.

Refer to the chatbot-watson-android code sample to get a gist of how to add speaker diarization to an existing Android app. Similarly, you can use the other SDKs to achieve speaker diarization.
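As one example of the SDK route, here is a minimal sketch using the Watson Developer Cloud Python SDK. The package and class names (`watson_developer_cloud`, `SpeechToTextV1`) and the `recognize` keyword arguments reflect the SDK of that era and should be treated as assumptions; check the SDK documentation for your version:

```python
def transcribe_with_speakers(audio_path, username, password):
    """Sketch: call Watson speech-to-text with speaker diarization enabled.

    Package/class names and parameters are assumptions based on the
    Watson Developer Cloud Python SDK; verify against your SDK version.
    """
    # Imported inside the function so the sketch doesn't require the SDK
    # to be installed at import time.
    from watson_developer_cloud import SpeechToTextV1

    stt = SpeechToTextV1(username=username, password=password)
    with open(audio_path, "rb") as audio:
        return stt.recognize(
            audio,
            content_type="audio/wav",
            timestamps=True,
            speaker_labels=True,  # enable speaker diarization
        )
```

The returned response carries the `speaker_labels` array alongside the transcript results, in the same shape as the output JSON shown earlier.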

Note: Speaker labels are not enabled by default in the sample. Check the TODOs in the code for the lines to uncomment.

Use cases

From integration into chatbots to interaction with home assistants like Alexa and Google Home, and from call centers to medical services, the possibilities are endless.

For Bluemix code samples and tutorials, please visit our Bluemix GitHub page.

