
Custom R Components - Random Split (Random Sample)

Update: The functionality provided by this custom component is now available as standard functionality through the "Partition" node. It should no longer be necessary to add this component.

When classifying records, you need to split your historic dataset into two parts, one for training and one for testing. This article describes a Custom R Component that splits the dataset randomly to avoid bias.

Background

The historic dataset already contains a classification, and you want to find a model that describes this classification. For instance, you might have launched a new product in a small test region. It was promoted to your existing customers in this region, and you know exactly which customers did or did not buy the new product. So your historic dataset is your customer base from the test region, together with the information whether each customer did or did not purchase the new product. The column with the information about a purchase (or non-purchase) is your classification. Now you want to find a model that describes the customers that did purchase. You can then run this model on your remaining customer base in other regions to identify further prospects for the new product.

So there are three steps. First you train the model. Then you test the model and finally you can predict. Typically you will run through a number of training and testing iterations to find the best model before predicting.

But how do you test the model? This is where splitting the historic data comes in.

  • You take for instance two thirds of the historic data to train the model.
  • Then you run this model on the remaining third of your historic data. At this stage you pretend you do not know the classification (here: whether the customer did or did not purchase).
  • Once the model has returned its classification results, you compare them with the actual classification.
  • That comparison tells you how many records were classified correctly, and therefore how "good" your model is.
  • Then you may want to go back to reconfigure the model and you run a new training and testing round.
  • Compare the test results from each iteration.
  • Pick your favourite model and now use it to predict the classification on new data (here: which customers to target in the other regions).
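
The comparison step above can be illustrated with a minimal Python sketch. It is not part of the component, which is written in R. The labels and the two sets of model results (`model_a`, `model_b`) are made-up values for illustration, and the share of correctly classified records is only one possible measure of how "good" a model is.

```python
def accuracy(predicted, actual):
    """Return the share of records whose predicted class matches the known class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot score an empty test set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical test set: the known classification (did the customer purchase?).
actual = ["yes", "no", "no", "yes", "no", "yes"]

# Hypothetical results of two training and testing iterations.
model_a = ["yes", "no", "yes", "yes", "no", "no"]
model_b = ["yes", "no", "no", "yes", "no", "no"]

# Compare the test results from each iteration and pick the best one.
scores = {
    "model_a": accuracy(model_a, actual),
    "model_b": accuracy(model_b, actual),
}
best = max(scores, key=scores.get)
print(scores, best)
```

Here `model_b` classifies five of the six test records correctly, while `model_a` gets four, so `model_b` would be the one to carry forward to prediction.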

When splitting the historic data, you need to make sure that both datasets are as representative as possible. A very common approach is to split the data randomly, which helps to get evenly distributed datasets. For instance, it eliminates problems caused by any sorting that may exist in the data. Imagine the data was sorted by customer revenue. Without the random element, your training set might include only your larger customers, and the testing set only your smallest customers. The model might work well for the larger customers but fail badly on the smaller ones. Random selection is important to avoid such a bias. Similarly, you want to be sure that no record that was used to train the model is also used for testing. That is another bias you can avoid with this component.
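
The sorting problem can be made concrete with a small Python sketch. It is only an illustration of the idea and not the component's R code. The revenue figures, the two-thirds cut, and the seed are assumptions chosen for the example.

```python
import random

# Hypothetical data: 30 customers, sorted by revenue with the largest first.
revenues = sorted(range(100, 3100, 100), reverse=True)
cut = len(revenues) * 2 // 3  # two thirds go to training

# Positional split: training gets only the larger customers,
# testing gets only the smallest ones.
train_sorted, test_sorted = revenues[:cut], revenues[cut:]

# Random split: shuffle a copy first, then cut at the same position.
rng = random.Random(42)  # fixed seed only to make the example repeatable
shuffled = revenues[:]
rng.shuffle(shuffled)
train_random, test_random = shuffled[:cut], shuffled[cut:]

print("positional:", min(train_sorted), "is the smallest training revenue,",
      max(test_sorted), "is the largest testing revenue")
print("random: training spans", min(train_random), "to", max(train_random))
print("random: testing spans", min(test_random), "to", max(test_random))
```

In the positional split every training customer has a higher revenue than every testing customer. After shuffling, both sets normally cover a similar revenue range, and because each record lands in exactly one of the two slices, no training record is reused for testing.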

Usage

Load the historic data into SAP Predictive Analysis and add the Random Split component. Further below you will find the details for adding this logic to your own installation. Configure the component by setting the percentage of records that should be assigned to the first dataset. Then declare the labels that will identify whether a record belongs to the first or the second dataset.

Run the model and the component will assign each row randomly to one of the two subsets of data. It shows this membership in a new column called 'Split Label'.
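
The following Python sketch shows one way such a labelling step could work. It is not the component's actual R source. The function name `add_split_label`, its parameters, the labels "Train" and "Test", and the choice to assign an exact share of rows (rather than deciding row by row) are all assumptions made for illustration. Only the column name 'Split Label' comes from the component.

```python
import random


def add_split_label(rows, percent_first, first_label="Train",
                    second_label="Test", seed=None):
    """Return copies of the rows with a random 'Split Label' column added.

    percent_first is the percentage of rows that receive first_label.
    """
    if not 0 <= percent_first <= 100:
        raise ValueError("percent_first must be between 0 and 100")
    rng = random.Random(seed)
    n_first = round(len(rows) * percent_first / 100)
    # Draw the row positions of the first dataset without replacement.
    first_positions = set(rng.sample(range(len(rows)), n_first))
    return [
        {**row, "Split Label": first_label if i in first_positions else second_label}
        for i, row in enumerate(rows)
    ]


# Hypothetical historic data with a known classification.
customers = [{"customer_id": i, "purchased": i % 3 == 0} for i in range(1, 10)]
labelled = add_split_label(customers, percent_first=67, seed=1)

for row in labelled:
    print(row)
```

With nine rows and 67 percent, six rows receive the first label and three the second. Passing a seed is optional and only makes the assignment repeatable.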

Now add two Filter components to separate the data flow.

Configure the filters so that they restrict the data down to the two subsets. You only need to set the Row Filter for the 'Split Label' column.
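
Expressed in Python, the effect of the two filters is roughly the following. The rows, the labels "Train" and "Test", and the helper `filter_rows` are hypothetical. In the tool itself this is done through the Filter component's Row Filter setting.

```python
# Hypothetical rows as they might look after the Random Split component has run.
labelled = [
    {"customer_id": 1, "purchased": True, "Split Label": "Train"},
    {"customer_id": 2, "purchased": False, "Split Label": "Test"},
    {"customer_id": 3, "purchased": False, "Split Label": "Train"},
    {"customer_id": 4, "purchased": True, "Split Label": "Train"},
    {"customer_id": 5, "purchased": False, "Split Label": "Test"},
]


def filter_rows(rows, column, value):
    """Keep only the rows whose column equals the given value."""
    return [row for row in rows if row[column] == value]


# One filter per branch of the data flow.
training = filter_rows(labelled, "Split Label", "Train")
testing = filter_rows(labelled, "Split Label", "Test")

print(len(training), "training rows,", len(testing), "testing rows")
```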

Now that the data is split into a training set and a testing set, you can apply your algorithms. This example uses the R-CNR Tree.

Finally, you can load the dataset you want to classify. Just apply the trained model and you have your classification.

How to Implement

The component can be downloaded as a .spar file from GitHub. Then deploy it as described here. You just need to import it through the option "Import/Model Component", which you will find by clicking on the plus sign at the bottom of the list of available algorithms.

Disclaimer

Please note that this component is not an official release by SAP and that it is provided as-is without any guarantee or support. Please test the component to ensure it works for your purposes.
