IBM Cloud Pak® for Data Version 4.8 will reach end of support (EOS) on 31 July, 2025. For more information, see the Discontinuance of service announcement for IBM Cloud Pak for Data Version 4.X.
Upgrade to IBM Software Hub Version 5.1 before IBM Cloud Pak for Data Version 4.8 reaches end of support. For more information, see Upgrading from IBM Cloud Pak for Data Version 4.8 to IBM Software Hub Version 5.1.
Create the data handler
Each party in a Federated Learning experiment must get a data handler to process their data. You or a data scientist must create the data handler. A data handler is a Python class that loads and transforms data so that all data for the experiment
is in a consistent format.
About the data handler class
The data handler performs the following functions:
- Accesses the data that is required to train the model. For example, reads data from a CSV file into a Pandas data frame.
- Pre-processes the data so data is in a consistent format across all parties. Some example cases are as follows:
- The Date column might be stored as a time epoch or timestamp.
- The Country column might be encoded or abbreviated.
- The data handler ensures that the data formatting is in agreement.
- Optional: feature engineer as needed.
The following illustration shows how a data handler is used to process data and make it consumable by the experiment:
Data handler template
A general data handler template is as follows:
# your import statements
from ibmfl.data.data_handler import DataHandler
class MyDataHandler(DataHandler):
"""
Data handler for your dataset.
"""
def __init__(self, data_config=None):
super().__init__()
self.file_name = None
if data_config is not None:
# This can be any string field.
# For example, if your data set is in `csv` format,
# <your_data_file_type> can be "CSV", ".csv", "csv", "csv_file" and more.
if '<your_data_file_type>' in data_config:
self.file_name = data_config['<your_data_file_type>']
# extract other additional parameters from `info` if any.
# load and preprocess the training and testing data
self.load_and_preprocess_data()
"""
# Example:
# (self.x_train, self.y_train), (self.x_test, self.y_test) = self.load_dataset()
"""
def load_and_preprocess_data(self):
"""
Loads and pre-processeses local datasets,
and updates self.x_train, self.y_train, self.x_test, self.y_test.
# Example:
# return (self.x_train, self.y_train), (self.x_test, self.y_test)
"""
pass
def get_data(self):
"""
Gets the prepared training and testing data.
:return: ((x_train, y_train), (x_test, y_test)) # most build-in training modules expect data is returned in this format
:rtype: `tuple`
This function should be as brief as possible. Any pre-processing operations should be performed in a separate function and not inside get_data(), especially computationally expensive ones.
# Example:
# X, y = load_somedata()
# x_train, x_test, y_train, y_test = \
# train_test_split(X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE)
# return (x_train, y_train), (x_test, y_test)
"""
pass
def preprocess(self, X, y):
pass
Parameters
your_data_file_type: This can be any string field. For example, if your data set is incsvformat,your_data_file_typecan be "CSV", ".csv", "csv", "csv_file" and more.
Return a data generator defined by Keras or Tensorflow
The following is a code example that needs to be included as part of the get_data function to return a data generator defined by Keras or Tensorflow:
train_gen = ImageDataGenerator(rotation_range=8,
width_sht_range=0.08,
shear_range=0.3,
height_shift_range=0.08,
zoom_range=0.08)
train_datagenerator = train_gen.flow(
x_train, y_train, batch_size=64)
return train_datagenerator
Data handler examples
Parent topic: Creating a Federated Learning experiment