Keyword extraction

The e-communication keyword extraction service extracts key words and phrases from text, such as an email or chat. The algorithm parses the text into sentences and removes stop-words, which are the most frequent words that contribute least to meaning. It then applies statistical and frequency-based methods to determine the most significant key words and phrases.

Business problem

Business texts, such as emails, can be long and wordy. A list of key words and phrases lets a reviewer quickly assess the subject, themes, validity, and classification of a text.

Approach to solving the business problem

This service uses an implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm, which is described in the paper "Automatic Keyword Extraction from Individual Documents". RAKE is a domain-independent method for automatically extracting keywords, as sequences of one or more words, from individual documents. It can be applied to a single document and does not need to see the whole corpus, unlike corpus-based measures such as term frequency-inverse document frequency (TF-IDF).

For more information on this algorithm, see https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents.
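The core of RAKE, as described in the paper, can be sketched in a few lines of Python. This is a minimal illustration and not the code that the service runs: the function names, the tiny stop-word list, and the tokenizing rules are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption; the product list is larger).
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "was",
    "were", "with",
}

def candidate_phrases(text, stop_words=STOP_WORDS):
    """Split text at punctuation and stop-words; the runs left over are candidates."""
    phrases = []
    # Punctuation marks phrase boundaries, which is why the input must keep it.
    for fragment in re.split(r'[.!?,;:()\[\]"\n]+', text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9'&/-]+", fragment):
            if word in stop_words:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def rake_keywords(text, top_n=10, stop_words=STOP_WORDS):
    """Rank candidate phrases by the sum of their word scores (degree / frequency)."""
    phrases = candidate_phrases(text, stop_words)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurring words in the phrase, including the word itself.
            degree[word] += len(phrase)
    word_score = {word: degree[word] / freq[word] for word in freq}
    phrase_score = {
        " ".join(phrase): sum(word_score[word] for word in phrase)
        for phrase in phrases
    }
    # Highest score first; ties are broken alphabetically for a stable order.
    ranked = sorted(phrase_score.items(), key=lambda item: (-item[1], item[0]))
    return [phrase for phrase, _ in ranked[:top_n]]
```

Because the score is computed from co-occurrence inside a single text, no corpus statistics are needed, which is the property that distinguishes RAKE from TF-IDF.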

You can configure the algorithm with the following parameters:

  • The maximum word length (in characters) for any word
  • The number of key words or phrases to return for each text. The default value is 10 key words or phrases per text
  • The maximum length of any key phrase
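As an illustration of how these three parameters could shape the output, the following sketch filters and truncates a list of scored phrases. The names `KeywordConfig`, `max_word_chars`, `top_n`, and `max_phrase_words` are hypothetical, as are all defaults other than 10. The sketch also assumes that the key phrase length is measured in words and that a phrase containing an overlong word is dropped; the service might apply the limits differently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KeywordConfig:
    max_word_chars: int = 20    # hypothetical default
    top_n: int = 10             # default stated in this document
    max_phrase_words: int = 4   # hypothetical default; unit (words) is an assumption

def apply_config(scored_phrases, config=KeywordConfig()):
    """Filter (phrase, score) pairs by the configured limits and keep the top N."""
    kept = []
    for phrase, score in scored_phrases:
        words = phrase.split()
        if len(words) > config.max_phrase_words:
            continue  # phrase is too long
        if any(len(word) > config.max_word_chars for word in words):
            continue  # phrase contains an overlong word
        kept.append((phrase, score))
    kept.sort(key=lambda item: item[1], reverse=True)
    return [phrase for phrase, _ in kept[: config.top_n]]
```

For example, `apply_config(scored, KeywordConfig(top_n=5))` would return at most five phrases.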

Assumptions

The algorithm currently works only for English text.

The algorithm uses punctuation and sentence structure to identify keywords, so do not strip punctuation or otherwise clean the text before you submit it.

A list of stop-words is used to remove the most common words that do not act as keywords. This is provided as part of Surveillance Insight, and it may be necessary to amend and supplement this list.
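Amending the list could look something like the following sketch. It assumes a plain-text list with one stop-word per line and "#" comment lines; the format and location of the list shipped with the product are not described here, and both helper names are hypothetical.

```python
def parse_stop_words(lines):
    """Read stop-words from an iterable of lines, skipping blanks and # comments."""
    words = set()
    for line in lines:
        entry = line.strip().lower()
        if entry and not entry.startswith("#"):
            words.add(entry)
    return words

def amend_stop_words(base, add=(), remove=()):
    """Return a new set with domain-specific words added and unwanted ones removed."""
    added = {word.lower() for word in add}
    removed = {word.lower() for word in remove}
    return (set(base) | added) - removed
```

For email text, for instance, words such as "regards" or "thanks" might be worth adding so that sign-offs do not appear as keywords.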

Using the REST service

The service allows users to derive keywords from a text.

Starting the REST service

python3 keywordsRESTAPI.py &

Sample input

Just a short message to apprise every one of the cleaning results performed on the filter separator from the Transwestern M/S. Wipe tests taken of the interior portions of the separator prior to cleaning revealed PCB results of: 360, 500 and 520 microgram. This was not a standard wipe taken of a specific area. These wipes were taken over large areas to determine presence/absence of PCB's within the separator. After the cleaning operation was completed, three wipe samples were taken of the interior portions of the separator. All results showed non-detect at less than 1.0 microgram. I feel we can be fairly certain that the separator is free of residual PCBs. Jeff, would you pass this information on to your counterpart for PG&E?

Sample response

cleaning revealed pcb results, cleaning results performed, results showed, pcb, cleaning operation, wipe tests, wipe samples, standard wipe, specific area, short message

Service details

Table 1. get_keywords service details

Method  URL                                 Input         Output
POST    /analytics/models/v1/get_keywords   JSON payload  JSON response

The following is an example curl command that posts the payload file text.json to the service:

curl -k -H 'Content-Type: application/json' -X POST --data-binary @text.json https://<ip_address>:<port>/analytics/models/v1/get_keywords/

The following code is an example JSON payload:

{"text": "Just a short message to apprise everyone of the cleaning results performed on the filter separator from the Transwestern M/S. Wipe tests taken of the interior portions of the separator prior to cleaning revealed PCB results of: 360, 500 and 520 microgram. This was not a standard wipe taken of a specific area. These wipes were taken over large areas to determine presence/absence of PCB's within the separator. After the cleaning operation was completed, three wipe samples were taken of the interior portions of the separator. All results showed non detect at less than 1.0 microgram. I feel we can be fairly certain that the separator is free of residual PCBs. Jeff, would you pass this information on to your counterpart for PG&E?"}

The following code is an example response:

{"keywords": ["cleaning revealed pcb results", "cleaning results performed", "results showed", "pcb ", "cleaning operation", "wipe tests", "wipe samples", "standard wipe", "specific area", "short message"]}
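The same call can be made from Python with only the standard library. The following is a minimal sketch, not a supplied client: the function names, the `verify_tls` flag, and the timeout are assumptions, and the URL path is taken from Table 1 (without the trailing slash that the curl example uses).

```python
import json
import ssl
import urllib.request

ENDPOINT_PATH = "/analytics/models/v1/get_keywords"

def build_request(base_url, text):
    """Build the POST request with a JSON body of the form {"text": ...}."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + ENDPOINT_PATH,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def get_keywords(base_url, text, verify_tls=True, timeout=30):
    """Send the text to the service and return the list under "keywords"."""
    context = ssl.create_default_context()
    if not verify_tls:
        # Mirrors curl -k: skips certificate checks. Use only on trusted networks.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    request = build_request(base_url, text)
    with urllib.request.urlopen(request, context=context, timeout=timeout) as response:
        return json.load(response)["keywords"]
```

A call such as `get_keywords("https://<ip_address>:<port>", email_body, verify_tls=False)` would then return the keyword list as Python strings. Using `json.dumps` to build the payload also avoids quoting errors when the text itself contains quotation marks.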

Accuracy and limitations

It is difficult to measure the accuracy of this keyword extraction method because there is no objective reference to compare against. Reference keywords are usually either assigned manually by professional curators, who may use a fixed taxonomy, or supplied by the author of the text, in which case they depend on the author's judgment of what is representative.

This is a heuristic method based on punctuation, stop-words, and frequency counts to determine key words and phrases. It does not use an NLP model to determine the meaning of the text.

The algorithm uses a list of English stop-words.

There is no hard limit on the size of the input text. However, the number of keywords returned per text is defined in the configuration as a global constant (for example, 10), so it does not grow with the length of the text. As a result, the algorithm tends to give more representative keywords for shorter texts.