We are excited to announce the availability of Time Series Libraries in Watson Studio Spark Environments starting today (October 8, 2020).
This library, developed by IBM Research, includes a full set of time series functionality that is not available in any competing offering. It joins our IBM Research Assets, Geospatial functionality, Data Skipping, and Parquet Encryption libraries as a fully supported feature of Watson Studio Spark Environments.
The time series library lets users perform key operations on time series data, including construction of a collection of time series, imputation functions (like segmentation), transformers, reducers, joins, and machine learning functions (such as forecasting, clustering, and discriminatory sequence mining). The library supports several time series types, including numeric, categorical, and array.
Examples of time series data include the following:
- Stock share prices and trading volumes
- Clickstream data
- Electrocardiogram (ECG) data
- Temperature or seismographic data
- Network performance measurements
- Network logs
- Electricity usage as recorded by a smart meter and reported via an Internet of Things data feed
Key features of the Time Series Libraries in Watson Studio Spark Environments
I. Data model
- A core data model for univariate and multivariate time series
- Time Reference Systems for handling different timestamp representations
- Support for aperiodic, duplicate, and out-of-order timestamps
- Spark RDD and DataFrame extensions for time series
- Numeric and categorical time series
- Lossless and lossy compression
II. Transformation and segmentation functions
- Math: Mean, variance, skew, correlations, PAA, SAX, covariance matrix, Graphical Gaussian Model, etc.
- Statistical tests: Augmented Dickey-Fuller, Ljung-Box, Granger causality
- Distance metrics: Dynamic Time Warping, Damerau-Levenshtein, Longest Common Subsequence, Jaro-Winkler
- Time series reconciliation: Hungarian algorithm, Earth mover's distance
- Change point detection: CUSUM, Bayesian, Gaussian
- Segmentation: Window, Record-based, Burst-based, Anchor, Regression
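To make two of the items above concrete, here is a minimal sketch of window segmentation and Dynamic Time Warping in plain Python. It does not use the Time Series Library's API; the function names `sliding_windows` and `dtw_distance` are hypothetical and chosen only for illustration, and the library's own implementations may differ.

```python
from math import inf


def sliding_windows(values, size, step=1):
    """Split a sequence into fixed-size windows, advancing by `step` each time."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    return [values[i:i + size] for i in range(0, len(values) - size + 1, step)]


def dtw_distance(a, b):
    """Textbook dynamic time warping with an absolute-difference point cost."""
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


series = [1.0, 2.0, 3.0, 4.0, 5.0]
windows = sliding_windows(series, size=3, step=2)  # [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])  # 0.0: same shape, different pacing
```

The second call shows why DTW is useful for time series: a series that repeats a value is treated as identical in shape to the original, whereas a point-by-point comparison could not even align the two.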
III. Forecasting functions
- ARIMA
- Holt-Winters
- BATS
- Vector auto-regression
- Anomaly detection
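As a rough illustration of what a forecasting function does, the sketch below implements simple exponential smoothing in plain Python. This is a much simpler relative of Holt-Winters (it has no trend or seasonal component) and it is not the library's implementation; the name `ses_forecast` and the parameter `alpha` are assumptions made for this example.

```python
def ses_forecast(values, alpha):
    """One-step-ahead forecast using simple exponential smoothing.

    alpha in (0, 1] controls how strongly recent observations are weighted.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = values[0]  # initialise the smoothed level with the first observation
    for x in values[1:]:
        # Blend the new observation with the previous smoothed level.
        level = alpha * x + (1 - alpha) * level
    return level


next_value = ses_forecast([10.0, 12.0, 14.0], alpha=0.5)  # 12.5
```

Models such as Holt-Winters extend this idea by maintaining additional smoothed estimates for trend and seasonality.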
IV. Joins
- A complete suite of temporal joins, including inner, outer, left-outer, right-outer, left-inner, and right-inner
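The idea behind a temporal join can be sketched in plain Python as follows. This is an illustration only, not the library's API: the function `temporal_join` is hypothetical, it assumes each series is a list of (timestamp, value) pairs with unique timestamps, and it matches on exact timestamps. The left-inner and right-inner variants are omitted because their precise semantics are defined in the library documentation.

```python
def temporal_join(left, right, how="inner"):
    """Join two series of (timestamp, value) pairs on matching timestamps.

    Returns (timestamp, left_value, right_value) tuples ordered by timestamp;
    a side with no observation at a timestamp is filled with None.
    """
    l, r = dict(left), dict(right)
    if how == "inner":
        keys = l.keys() & r.keys()
    elif how == "left-outer":
        keys = l.keys()
    elif how == "right-outer":
        keys = r.keys()
    elif how == "outer":
        keys = l.keys() | r.keys()
    else:
        raise ValueError(f"unsupported join type: {how}")
    return [(t, l.get(t), r.get(t)) for t in sorted(keys)]


temperature = [(1, 20.5), (2, 21.0), (4, 22.1)]
humidity = [(2, 0.41), (3, 0.43), (4, 0.40)]

inner = temporal_join(temperature, humidity)                     # timestamps 2 and 4
left_outer = temporal_join(temperature, humidity, "left-outer")  # timestamps 1, 2, 4
full_outer = temporal_join(temperature, humidity, "outer")       # timestamps 1, 2, 3, 4
```

In practice, sensor feeds rarely share exact timestamps, so a production join would typically also need an alignment or interpolation step before matching.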
V. SQL extensions
- A rich set of time series SQL extensions using Spark SQL
VI. Spark machine learning
- Sequence mining
- Time series clustering: K-means, K-shape, Motif-based, Cluster drift detection
- Data connectors for feature engineering that provide Spark DataFrame iterators to TensorFlow and scikit-learn
For the full list of functions and how to get started, please refer to the documentation.
Learn more about data lakes in the IBM Cloud
- What is a Data Lake?
- Big Data Explained
- For Jupyter notebook users, we also have an in-depth tutorial on using this functionality for data science.
If you would like to know more about time series use cases on IBM Cloud, please reach out to Kiran Guduguntla or Josh Rosenkranz.