S3 select content from an object
The select object content API filters the content of an object through the structured query language (SQL).
See the Metadata collected by inventory section in the AWS Systems Manager User Guide for an example of the description of what should reside in the inventory object. In the request, you must specify the data serialization format as, comma-separated values (CSV) of the object to retrieve the specified content. Amazon Web Services (AWS) command-line interface (CLI) select object content uses the CSV format to parse object data into records and returns only the records specified in the query.
s3:GetObject permission for this operation.Syntax
POST /BUCKET/KEY?select&select-type=2 HTTP/1.1\r\n
Example
POST /testbucket/sample1csv?select&select-type=2 HTTP/1.1\r\n
Bucket
- Description
-
The bucket to select object content from.
- Type
-
String
- Required
-
Yes
Key
- Description
-
The object key.
- Length Constraints
-
Minimum length of 1.
- Type
-
String
- Required
-
Yes
SelectObjectContentRequest
- Description
-
Root level tag for the select object content request parameters.
- Type
-
String
- Required
-
Yes
Expression
- Description
-
The expression that is used to query the object.
- Type
-
String
- Required
-
Yes
ExpressionType
- Description
-
The type of the provided expression for example SQL.
- Type
-
String
- Valid Values
-
SQL
- Required
-
Yes
InputSerialization
- Description
-
Describes the format of the data in the object that is being queried.
- Type
-
String
- Required
-
Yes
OutputSerialization
- Description
-
Format of data returned in comma separator and new-line.
- Type
-
String
- Required
-
Yes
Response entities
If the action is successful, the service sends back HTTP 200 response. Data is
returned in XML format by the service:
Payload
- Description
-
Root level tag for the payload parameters.
- Type
-
String
- Required
-
Yes
Records
- Description
-
The records event.
- Type
-
Base64-encoded binary data object
- Required
-
No
Stats
- Description
-
The stats event.
- Type
-
Long
- Required
-
No
The Ceph Object Gateway supports the following response:
Example
{:event-type,records} {:content-type,application/octet-stream} {:message-type,event}
Syntax (for CSV)
aws --endpoint-URL http://localhost:80 s3api select-object-content
--bucket BUCKET NAME
--expression-type SQL
--input-serialization
{"CSV": {"FieldDelimiter": "," , "QuoteCharacter": "\"" , "RecordDelimiter" : "\n" , "QuoteEscapeCharacter" : "\\" , "FileHeaderInfo": "USE" }, "CompressionType": "NONE"}
--output-serialization {"CSV": {}}
--key OBJECT_NAME.csv
--expression "select count(0) from s3object where int(_1)<10;" output.csv
--key OBJECT_NAME do not require
.csv, .json, or .parquet from IBM Storage
Ceph 7 and later.Example (for CSV)
aws --endpoint-url http://localhost:80 s3api select-object-content
--bucket testbucket
--expression-type 'SQL'
--input-serialization
'{"CSV": {"FieldDelimiter": "," , "QuoteCharacter": "\"" , "RecordDelimiter" : "\n" , "QuoteEscapeCharacter" : "\\" , "FileHeaderInfo": "USE" }, "CompressionType": "NONE"}'
--output-serialization '{"CSV": {}}'
--key testobject.csv
--expression "select count(0) from s3object where int(_1)<10;" output.csv
aws --endpoint-url http://localhost:80 s3api select-object-content
--bucket BUCKET NAME
--expression-type 'SQL'
--input-serialization
'{"Parquet": {}, {"CompressionType": "NONE"}'
--output-serialization '{"CSV": {}}}'
--key OBJECT_NAME.parquet
--expression "select count(0) from s3object where int(_1)<10;" output.csv.csv.aws --endpoint-url http://localhost:80 s3api select-object-content
--bucket testbucket
--expression-type 'SQL'
--input-serialization
'{"Parquet":{}, "CompressionType": "NONE"}'
--output-serialization '{"CSV": {}}}'
--key testobject.parquet
--expression "select count(0) from s3object where int(_1)<10;" output.csvaws --endpoint-url http://localhost:80 s3api select-object-content
--bucket BUCKET_NAME
--key OBJECT_NAME.json
--expression-type 'SQL'
--input-serialization
'{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "NONE"}'
--output-serialization '{"CSV": {}}'
--expression "select count(*) from S3Object[*];" /dev/stdoutaws s3api --endpoint-url http://localhost:80
select-object-content --bucket test
--key testobject.json
--expression-type 'SQL'
--input-serialization
'{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "NONE"}'
--output-serialization '{"CSV": {}}'
--expression "select count(*) from S3Object[*];" /dev/stdoutExample (for BOTO3)
import pprint
import boto3
from botocore.exceptions import ClientError
def run_s3select(bucket,key,query,column_delim=",",row_delim="\n",quot_char='"',esc_char='\\',csv_header_info="NONE"):
s3 = boto3.client('s3',
endpoint_url=endpoint,
aws_access_key_id=access_key,
region_name=region_name,
aws_secret_access_key=secret_key)
result = ""
try:
r = s3.select_object_content(
Bucket=bucket,
Key=key,
ExpressionType='SQL',
InputSerialization = {"CSV": {"RecordDelimiter" : row_delim, "FieldDelimiter" : column_delim,"QuoteEscapeCharacter": esc_char, "QuoteCharacter": quot_char, "FileHeaderInfo": csv_header_info}, "CompressionType": "NONE"},
OutputSerialization = {"CSV": {}},
Expression=query,
RequestProgress = {"Enabled": progress})
except ClientError as c:
result += str(c)
return result
for event in r['Payload']:
if 'Records' in event:
result = ""
records = event['Records']['Payload'].decode('utf-8')
result += records
if 'Progress' in event:
print("progress")
pprint.pprint(event['Progress'],width=1)
if 'Stats' in event:
print("Stats")
pprint.pprint(event['Stats'],width=1)
if 'End' in event:
print("End")
pprint.pprint(event['End'],width=1)
return result
run_s3select(
"my_bucket",
"my_csv_object",
"select int(_1) as a1, int(_2) as a2 , (a1+a2) as a3 from s3object where a3>100 and a3<300;")
Supported features
Currently, only part of the AWS s3 select command is supported. For a full list of supproted features, see Table 1.
| Features | Details | Description | Example |
|---|---|---|---|
|
Arithmetic operators |
^ * % / + - ( ) |
select (int(_1)+int(_2))*int(_9) from s3object; |
|
|
Arithmetic operators |
% modulo |
select count(*) from s3object where cast(_1 as int)%2 = 0; |
|
|
Arithmetic operators |
^ power-of |
select cast(2^10 as int) from s3object; |
|
|
Compare operators |
> < >= <= = != |
select _1,_2 from s3object where (int(_1)+int(_3))>int(_5); |
|
|
logical operator |
AND OR NOT |
select count(*) from s3object where not (int(1)>123 and int(_5)<200); |
|
|
logical operator |
is null |
Returns true/false for null indication in expression |
|
|
logical operator and NULL |
is not null |
Returns true/false for null indication in expression |
|
|
logical operator and NULL |
unknown state |
Review null-handle and observe the results of logical operations with NULL. The query returns
|
|
|
Arithmetic operator with NULL |
unknown state |
Review null-handle and observe the results of binary operations with NULL. The query returns
|
|
|
Compare with NULL |
unknown state |
Review null-handle and observe results of compare operations with NULL. The query returns
|
|
|
missing column |
unknown state |
|
|
|
projection column |
Similar to if or then or else |
|
|
|
projection column |
similar to switch/case default |
select case cast(_1 as int) + 1 when 2 then “a” when 3 then “b” else “c” end from
s3object; |
|
|
logical operator |
|
|
|
|
logical operator |
|
|
|
|
logical operator |
|
|
|
|
logical operator |
|
|
|
|
logical operator |
|
|
|
|
logical operator |
{expression} like {match-pattern} escape {char} |
select count(*) from s3object where “jok_ai” like “%#_ai” escape
“#”; |
|
|
casting operator |
|
||
|
casting operator |
|
||
|
casting operator |
|
||
|
casting operator |
|
||
|
non AWS casting operator |
|
||
|
non AWS casting operator |
|
||
|
non AWS casting operator |
|
||
|
Aggregation Function |
sum |
|
|
|
Aggregation Function |
avg |
|
|
|
Aggregation Function |
min |
|
|
|
Aggregation Function |
max |
|
|
|
Aggregation Function |
count |
|
|
|
Timestamp Functions |
extract |
|
|
|
Timestamp Functions |
dateadd |
|
|
|
Timestamp Functions |
datediff |
|
|
|
Timestamp Functions |
utcnow |
|
|
|
Timestamp Functions |
to_string |
select to_string( to_timestamp(“2009-09-17T17:56:06.234567Z”), “yyyyMMdd-H:m:s”) from
s3object; |
|
|
String Functions |
substring |
|
|
substring with from negative number is valid considered as first |
select substring(“123456789” from -4) from s3object; |
||
substring with from zero for out-of-bound number is valid
just as first,last. |
select substring(“123456789” from 0 for 100) from s3object; |
||
|
String Functions |
trim |
|
|
|
String Functions |
trim |
|
|
|
String Functions |
trim |
|
|
|
String Functions |
trim |
|
|
|
String Functions |
lower or upper |
|
|
|
String Functions |
char_length, character_length |
|
|
|
Complex queries |
|
||
|
alias support |
|
Reference
-
See Amazon’s S3 Select Object Content API for more details.