Operator Parse
The Parse operator is similar to the FileSource, TCPSource, and UDPSource operators, in that it transforms input data in a raw form into well-structured SPL tuples. The difference is that unlike source adapters, the Parse operator is not tied to a particular external resource. Instead, it can be used inside an SPL application on data that came from any external source.
Because the Parse operator accepts data in many formats (such as line or bin), the data is passed in as a blob attribute. The operator generates the SPL tuples that correspond to the input format.
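The following is a minimal sketch of that flow, not a definitive application. It assumes the raw bytes come from a FileSource operator that reads in block format; the composite name, stream names, file names, and tuple schema are hypothetical.

```spl
composite ParseCsvSketch {
    graph
        // Read the file as raw chunks. With the block format, the output
        // stream carries exactly one attribute of type blob.
        stream<blob chunk> Raw = FileSource() {
            param
                file      : "readings.csv"; // hypothetical file name
                format    : block;
                blockSize : 1024u;
        }

        // Parse the raw bytes as CSV into typed tuples. The input stream has
        // a single blob attribute, so parseInput is not specified.
        stream<rstring sensor, float64 value> Readings = Parse(Raw) {
            param
                format : csv;
        }

        () as Sink = FileSink(Readings) {
            param
                file : "readings.out"; // hypothetical file name
        }
}
```

Any operator that produces a blob attribute could take the place of the FileSource operator in this sketch.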
Checkpointed data
When the Parse operator is checkpointed in a consistent region, any partially parsed input data and logic state variables (if present) are saved in the checkpoint. When the Parse operator is checkpointed in an autonomous region, only logic state variables (if present) are saved in the checkpoint.
Behavior in a consistent region
The Parse operator can be used in a consistent region, but not as a start operator. When a region is drained, the Parse operator consumes as much of its input as it can turn into complete output tuples. Any residual data that is not sufficient to produce an output tuple is stored in the checkpoint. On reset, the Parse operator clears any input data it holds, reads the residual data from the checkpoint, and places that data at the start of its read buffer. Logic state variables (if present) are also automatically checkpointed and reset.
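As an illustration of this placement, the following sketch puts the Parse operator downstream of the start operator of a consistent region. It assumes that a FileSource operator starts the region with a periodic trigger; the 5-second period, the names, and the schema are hypothetical.

```spl
composite ParseInConsistentRegion {
    graph
        // An application that contains a consistent region needs a
        // JobControlPlane operator.
        () as JCP = spl.control::JobControlPlane() {}

        // Start operator of the region. Parse cannot take this role.
        @consistent(trigger = periodic, period = 5.0)
        stream<blob chunk> Raw = FileSource() {
            param
                file      : "readings.csv"; // hypothetical file name
                format    : block;
                blockSize : 1024u;
        }

        // Parse is reachable from the start operator, so it belongs to the
        // same region. Its residual input data is saved on drain and
        // restored on reset.
        stream<rstring sensor, float64 value> Readings = Parse(Raw) {
            param
                format : csv;
        }

        () as Sink = FileSink(Readings) {
            param
                file : "readings.out"; // hypothetical file name
        }
}
```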
Checkpointing behavior in an autonomous region
When the Parse operator is in an autonomous region and is configured with the config checkpoint : periodic(T) clause, a background thread in the SPL runtime checkpoints the operator every T seconds. This periodic checkpointing is asynchronous to tuple processing. Upon restart, the operator restores its internal state to its initial state, and restores logic state variables (if present) from the last checkpoint.
When the Parse operator is in an autonomous region and is configured with the config checkpoint : operatorDriven clause, no checkpoint is taken at run time. Upon restart, the operator is restored to its initial state.
Such checkpointing behavior is subject to change in the future.
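A periodic checkpoint configuration in an autonomous region might look like the following sketch. The 30-second period, the explicit restartable setting, the names, and the schema are assumptions chosen for illustration.

```spl
composite ParseWithPeriodicCheckpoint {
    graph
        stream<blob chunk> Raw = FileSource() {
            param
                file      : "readings.csv"; // hypothetical file name
                format    : block;
                blockSize : 1024u;
        }

        stream<rstring sensor, float64 value> Readings = Parse(Raw) {
            param
                format : csv;
            config
                // Checkpoint every 30 seconds, asynchronously to tuple
                // processing. Partially parsed input is not part of this
                // checkpoint in an autonomous region.
                checkpoint  : periodic(30.0);
                restartable : true;
        }

        () as Sink = FileSink(Readings) {
            param
                file : "readings.out"; // hypothetical file name
        }
}
```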
Exceptions
If there are errors while extracting tuples from the input data, the Parse operator generates a tracing message or throws an exception. You can use the parsing parameter to control this behavior.
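The following sketch sets the parsing parameter explicitly. It assumes that the permissive value has the same meaning as for the FileSource operator, where a malformed tuple is traced and discarded instead of raising an exception; see the FileSource documentation for the authoritative semantics. The names and schema are hypothetical.

```spl
composite ParsePermissiveSketch {
    graph
        stream<blob chunk> Raw = FileSource() {
            param
                file      : "readings.csv"; // hypothetical file name
                format    : block;
                blockSize : 1024u;
        }

        stream<rstring sensor, float64 value> Readings = Parse(Raw) {
            param
                format  : csv;
                // Assumed behavior: trace and skip malformed input rather
                // than throw an exception.
                parsing : permissive;
        }

        () as Sink = FileSink(Readings) {
            param
                file : "readings.out"; // hypothetical file name
        }
}
```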
Summary
- Ports
- This operator has 1 input port and 1 output port.
- Windowing
- This operator does not accept any windowing configurations.
- Parameters
- This operator supports 11 parameters.
Optional: blockSize, defaultTuple, eolMarker, format, hasDelayField, hasHeaderLine, ignoreExtraCSVValues, parseInput, parsing, readPunctuations, separator
- Metrics
- This operator reports 1 metric.
Properties
- Implementation
- C++
- Threading
- Always - Operator always provides a single threaded execution context.
- Ports (0)
The Parse operator is configurable with a single input port, which ingests tuples that contain data to be parsed into tuples.
- Properties
-
- Optional: false
- ControlPort: false
- TupleMutationAllowed: true
- WindowingMode: NonWindowed
- WindowPunctuationInputMode: Oblivious
- Assignments
- Assignments made to output attributes cannot reference input stream attributes.
- Output Functions
-
- OutputFunctions
-
- int64 TupleNumber()
-
Tuple number generated in this file
- <any T> T AsIs(T)
-
Return the input value
- Ports (0)
-
The Parse operator is configurable with a single output port, which produces tuples that are parsed from the input data.
If the format parameter value is bin and the readPunctuations parameter value is true, then window punctuations and final punctuations are generated based on the input data in the blob. Otherwise, a window punctuation and a final punctuation are generated when a final punctuation is received.
The output stream from the Parse operator must meet all the requirements of the first output stream of the FileSource operator, with respect to the format parameter. For example, if the format is block, then the output stream must have exactly one attribute of type blob that is not set in an output clause.
- Properties
-
- Optional: false
- TupleMutationAllowed: true
- WindowPunctuationOutputMode: Generating
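The TupleNumber output function can be used in an output clause, as in the following sketch. It assumes that, as with the FileSource operator, an attribute that is set in the output clause is not read from the parsed data, so the CSV input here is expected to carry only the sensor and value fields. The names and schema are hypothetical.

```spl
composite ParseWithTupleNumber {
    graph
        stream<blob chunk> Raw = FileSource() {
            param
                file      : "readings.csv"; // hypothetical file name
                format    : block;
                blockSize : 1024u;
        }

        // seq is assigned by the operator; sensor and value come from the data.
        stream<int64 seq, rstring sensor, float64 value> Numbered = Parse(Raw) {
            param
                format : csv;
            output
                Numbered : seq = TupleNumber();
        }

        () as Sink = FileSink(Numbered) {
            param
                file : "numbered.out"; // hypothetical file name
        }
}
```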
Optional: blockSize, defaultTuple, eolMarker, format, hasDelayField, hasHeaderLine, ignoreExtraCSVValues, parseInput, parsing, readPunctuations, separator
- blockSize
Specifies the block size for the block format. For more information, see the blockSize parameter in the spl.adapter::FileSource operator.
- Properties
-
- Type: uint32
- Cardinality: 1
- Optional: true
- ExpressionMode: AttributeFree
- defaultTuple
Specifies the default tuple to use for missing fields. For more information, see the defaultTuple parameter in the spl.adapter::FileSource operator.
- Properties
-
- Cardinality: 1
- Optional: true
- ExpressionMode: AttributeFree
- eolMarker
Specifies the end of line marker. For more information, see the eolMarker parameter in the spl.adapter::FileSource operator.
- Properties
-
- Type: rstring
- Cardinality: 1
- Optional: true
- ExpressionMode: Constant
- format
Specifies the format of the data. For more information, see the format parameter in the spl.adapter::FileSource operator.
- Properties
-
- Type: DataFormat (csv, txt, bin, block, line)
- Cardinality: 1
- Optional: true
- ExpressionMode: CustomLiteral
- hasDelayField
Specifies whether the format contains inter-arrival delays as the first field. For more information, see the hasDelayField parameter in the spl.adapter::FileSource operator.
- Properties
-
- Type: boolean
- Cardinality: 1
- Optional: true
- ExpressionMode: Constant
- hasHeaderLine
Specifies that the first line or lines of the file are ignored in CSV format. For more information, see the hasHeaderLine parameter in the spl.adapter::FileSource operator.
- Properties
-
- Cardinality: 1
- Optional: true
- ExpressionMode: AttributeFree
- ignoreExtraCSVValues
Specifies whether to skip any extra fields before end of line when reading in CSV format. For more information, see the ignoreExtraCSVValues parameter in the spl.adapter::FileSource operator.
- Properties
-
- Type: boolean
- Cardinality: 1
- Optional: true
- ExpressionMode: Constant
- parseInput
Specifies which input attribute is parsed.
If this parameter is not specified, the input stream must contain only one attribute of type blob.
Note: Because this parameter must be of type blob, the data is binary in the sense that it is written as a sequence of bytes to a blob data type. However, that binary data can represent different formats, such as txt, line, or bin.
- Properties
-
- Type: blob
- Cardinality: 1
- Optional: true
- ExpressionMode: Expression
- parsing
Specifies the parsing mode. For more information, see the parsing parameter in the spl.adapter::FileSource operator.
- Properties
-
- Type: ParseOption (strict, permissive, fast)
- Cardinality: 1
- Optional: true
- ExpressionMode: CustomLiteral
- readPunctuations
Specifies whether to read punctuations from bin format input. For more information, see the readPunctuations parameter in the spl.adapter::FileSource operator.
- Properties
-
- Type: boolean
- Cardinality: 1
- Optional: true
- ExpressionMode: Constant
- separator
Specifies the separator character for the csv format. For more information, see the separator parameter in the spl.adapter::FileSource operator.
- Properties
-
- Cardinality: 1
- Optional: true
- ExpressionMode: Constant
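Several of these parameters can be combined in one invocation. The following sketch assumes an input stream with more than one attribute, which is why parseInput names the blob attribute to parse. The Beacon source, the sample data, the semicolon separator, and all names are hypothetical.

```spl
composite ParseInputSketch {
    graph
        // Emit one tuple whose payload attribute holds two CSV lines that
        // use a semicolon as the field separator.
        stream<rstring origin, blob payload> Raw = Beacon() {
            param
                iterations : 1u;
            output
                Raw : origin = "demo",
                      payload = convertToBlob("a;1\nb;2\n");
        }

        // The input has two attributes, so parseInput selects the blob.
        stream<rstring name, int32 count> Parsed = Parse(Raw) {
            param
                format     : csv;
                separator  : ";";
                parseInput : payload;
        }

        () as Sink = FileSink(Parsed) {
            param
                file : "parsed.out"; // hypothetical file name
        }
}
```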
- Parse
-
stream<${streamType}> ${streamName} = Parse(${inputSchema}) {
    param
        format : "${format}";
}
- nInvalidTuples - Counter
-
The number of tuples that failed to read correctly in csv or txt format.
- spl-std-tk-lib
- Library Name: streams-stdtk-runtime
- Include Path: ../../../impl/include