IBM Support

Improving Cloudant Query Index Selection

Troubleshooting


Problem

This article describes the new method Cloudant Query uses to select an index. This addresses a bug in JSON indexes which could lead to incorrect results being returned. For most customers, this will be a transparent change, but customers that require specific JSON indexes to be valid, for example because they specify "use_index" or "sort" fields, should understand the change outlined below and apply the workarounds to their queries. Just to be clear, this change only impacts indexes of "type": "json" - customers using text indexes or other secondary index APIs (views, search indexes, or geo) are not impacted.

Symptom

This is a description of new JSON index selection behavior that is available in Cloudant release 6588. It addresses a long-standing bug in how Cloudant Query selects indexes and changes the algorithm which determines whether an index is valid for a given query.

Cloudant Query allows you to define two different index types: JSON and text. Indexes are defined using a list of fields, for example:


{
   "index": {
       "fields": ["name", "age"]
   },
   "ddoc": "name_age_index",
   "type" : "json"
}

For indexes of type: json, documents are only indexed if they include all of the indexed fields, for instance, if you have two documents:

{
  "_id": "1",
  "name": "Jane",
  "age": 31
}


{
  "_id": "2",
  "name": "Joe"
}

The name_age_index index above includes document 1 but not document 2, because it does not contain an age field.

In previous versions of Cloudant, a JSON index was deemed usable for a query if the first field in the index was referenced by an equality operator: $eq, $gt, $gte, $lt, and $lte (but not $ne). Again, let's look at an example query:

{


  "selector": {
   "name": {
     "$gt": "J"  
   }
 }
}

Using the old approach, this selector would use the name_age_index index we defined because name is the first field in the index. However, that would return an incomplete result set because document 2 (Joe) was not included in that index.

With the new approach, name_age_index index is not deemed valid for the selector and the database falls back to a full database scan (or any valid index that is available).

Let's consider some more complex cases that are impacted by this change.

Sort

Queries with no explicit sort can always fall back to a database scan; it might be slower but will at least return a result without error. Where sorting is required, there must be a valid index available that matches the sort order. To illustrate this, let's add a sort to the query above:

{
  "selector": {
   "name": {
     "$gt": "J"  
   }
 },
"sort": {
  "name": "asc"
 }
}

Again, using the old approach, Cloudant Query used the name_age_index index we defined (with incomplete results). With the new behavior, we cannot fall back to a database scan - the sort requires a valid index. Without a valid index, Cloudant generates an error with a 400 status code that instructs the user to create a valid index.



To fix this, you can select from two options:

1. Change the selector so that it can use the index they already have (desirable if index building takes a long time, for example, the database is large).

To change the selector to use the existing index, we need to explicitly require the indexed fields to exist in the results:

{
 "selector": {
   "name": {
     "$gt": "J"  
   },
   "age": {
     "$exists": true
   }
 }
},
"sort": {
 "name": "asc"
}
}

This will not include document 2 (Joe) but effectively generates the same results as the old behavior without an error.

2. Create a new index to cover the query/sort correctly. In this case, the following index would work:

{
   "index": {
       "fields": ["name"]
   },
   "ddoc": "name_index",
   "type" : "json"
}

use_index field

The use_index field allows users to explicitly specify an index when making a query. The new index selection behavior may cause these indexes to no longer be valid for the queries they are specified for. Previously, if the user specified a value in use_index that was invalid (e.g. the index wasn't valid for the query, or we can't find an index with that name), Cloudant would fail with an error - 400 Bad Request - clearly a disruptive change if it occurs in production.

To reduce the impact of this, _find now falls back to any available, valid index and includes a warning in the response. For example:

{

  "selector": {
   "name": {
     "$gt": "J"  
   }
 }
"use_index": "name_age_index"
}



An example response where _find fell back to an alternative index would be:

{
  "docs":[
    {
      "_id":"1",
      "_rev":"1-6912b4eb9e8782eb9d499ed94ae485f5",
      "name":"Jane",
      "age":31
    },
    {
      "_id":"2",
      "_rev":"1-c092cdfa43ac59dcbf631c0acd7254f2",
      "name":"Joe"
    }
 ],                
  "bookmark":
"g1AAAAAyeJzLYWBgYMpgSmHgKy5JLCrJTq2MT8lPzkzJBYozGoEkOGASEKEsAErJDRs",
  "warning": "_design/name_age_index was not used because it does not contain a valid index for this query."
}

A user can investigate why the index might not be valid and which index was used by using the _explain endpoint. See more details under Related information below. The Cloudant dashboard now has support for Explain in the Query section to make this a little simpler.

[{"Product":{"code":"SSTPQH","label":"IBM Cloudant Data Layer Local Edition"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Component":"--","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"Not Applicable","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

Modified date:
16 June 2018

UID

swg22011923