
GMM prediction function

former_member186543
Active Contributor
0 Kudos

Hi team,

I am trying to build a GMM (Gaussian mixture model) to estimate the probability distribution of the input features and then mark data points as anomalous or not.

I have run the PAL GMM function on my training set and got back the probabilities and the model table. Which PAL function should I use to predict the probability for each new data point from the trained GMM model?

SAP documentation has no information about it.

Ex: The K-means PAL function is used to train a k-means model, and the CREATEDT and PREDICTWITHDT functions are used to predict the k-means classification for new values. I am not sure whether we can do the same for GMM, since here we need the probability value and not only the cluster.

My understanding is that maybe we can use PREDICTWITHDT directly on the model returned by GMM. However, we need the probability of each new input for each cluster in Gaussian space, similar to the output we receive from GMM. Please advise!


Update: I passed the JSON model from GMM to the PREDICTWITHDT function and now get the error message below:


Could not execute 'CALL "HRAFIQ".PAL_DT_SCORING_PROC(PAL_DT_SCORING_DATA_TBL, #PAL_CONTROL_TBL, ZPREDICTED_MODEL, ...' in 651 ms 411 µs .

SAP DBTech JDBC: [423]: AFL error: search table error: _SYS_AFL.AFLPAL:PREDICTWITHDT: [423] (range 3) AFL error exception: exception 73001060: PAL error[73001060]:Internal error. Check trace for details.

Thanks,

Hasan

Accepted Solutions (1)


Former Member
0 Kudos

Hi Hasan,

GMM is a clustering algorithm and is usually regarded as an unsupervised learning algorithm. A decision tree is a supervised learning algorithm, and PREDICTWITHDT is meant for models trained with C4.5, CHAID, or CART. We cannot pass a GMM clustering result to a decision tree prediction function. Even among supervised algorithms, each trained model must be scored with its corresponding scoring function.

As to your question, I understand that you want to apply new data points to the model to estimate the probability that each one belongs to each cluster. Because this is unsupervised learning, there is usually no such cluster assignment function. The reason is that there is no guarantee that the new data come from the same distribution as the original data. This is especially true for outlier detection: if a new type of outlier appears, the assignment will place it in one of the existing clusters, which may be a mis-clustering.

PAL does have a cluster assignment function, which assigns new data points to existing clusters generated by a clustering algorithm, under the assumption that the user knows the new data come from the same distribution. Unfortunately, GMM is not yet supported by it. In your case, if the data set is not huge, you can re-run GMM with the new data included and get the new clusters and probabilities.

Best regards,

Xingtian

former_member186543
Active Contributor
0 Kudos

Hi Xingtian,

Thanks a lot for your response !

I tried re-running GMM on the existing training set plus 1 row of new data. We suspected that the mean and variance values would change with the new row, and that is what happened.

Because of this, the probability values were incorrect compared to what we expected (the outlier was simply given a probability of 1, which is wrong).
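To illustrate the shift described above, here is a minimal sketch in plain Python. It is not PAL code, and it assumes a single Gaussian fitted to the first attribute only (the ATTRIBUTE1 values of rows 0 to 11 and of row 99 from this post). It only shows how strongly one extreme row moves the estimated mean and variance when it is included in the fit.

```python
from statistics import fmean, pvariance

# First-attribute values of training rows 0-11, copied from this thread.
train = [0.10, 0.11, 0.10, 0.11, 0.12, 0.11,
         0.12, 0.12, 0.10, 0.11, 0.10, 0.11]
outlier = 20.01  # first attribute of row 99

# Parameters estimated on the normal rows only.
mean_before = fmean(train)
var_before = pvariance(train)

# Parameters estimated after the outlier is appended to the training set.
mean_after = fmean(train + [outlier])
var_after = pvariance(train + [outlier])

print(f"before: mean={mean_before:.4f} variance={var_before:.6f}")
print(f"after:  mean={mean_after:.4f} variance={var_after:.6f}")
```

A model refitted this way no longer describes the "normal" region, which is why scoring a new point against parameters learned without it is usually the preferred approach.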


INSERT INTO PAL_GMM_DATA_TBL VALUES(0,0.10,0.10,'A');
INSERT INTO PAL_GMM_DATA_TBL VALUES(1,0.11,0.10,'A');
INSERT INTO PAL_GMM_DATA_TBL VALUES(2,0.10,0.11,'A');
INSERT INTO PAL_GMM_DATA_TBL VALUES(3,0.11,0.11,'A');
INSERT INTO PAL_GMM_DATA_TBL VALUES(4,0.12,0.11,'A');
INSERT INTO PAL_GMM_DATA_TBL VALUES(5,0.11,0.12,'A');
INSERT INTO PAL_GMM_DATA_TBL VALUES(6,0.12,0.12,'A');
INSERT INTO PAL_GMM_DATA_TBL VALUES(7,0.12,0.13,'A');
INSERT INTO PAL_GMM_DATA_TBL VALUES(8,0.10,0.10,'A');
INSERT INTO PAL_GMM_DATA_TBL VALUES(9,0.11,0.10,'A');
INSERT INTO PAL_GMM_DATA_TBL VALUES(10,0.10,0.11,'A');
INSERT INTO PAL_GMM_DATA_TBL VALUES(11,0.11,0.11,'A');
INSERT INTO PAL_GMM_DATA_TBL VALUES(99,20.01,20.01,'A');

Point 99 is the new data point and is also an outlier, yet after retraining the result for it is probability = 1.
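One likely explanation for the probability of 1, sketched in plain Python rather than PAL: the per-cluster probabilities a GMM reports are posterior cluster memberships, which always sum to 1 for every point. If the refit places a component on the outlier itself, that component owns the point completely. The component weights, means, and variances below are hypothetical values chosen for illustration, not the output of the actual PAL run.

```python
import math

def gauss_pdf(x, mean, var):
    """Density of a spherical 2-D Gaussian (same variance on both axes)."""
    d2 = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return math.exp(-d2 / (2 * var)) / (2 * math.pi * var)

def responsibilities(x, components):
    """Posterior probability of each component given x; sums to 1."""
    weighted = [w * gauss_pdf(x, m, v) for w, m, v in components]
    total = sum(weighted)
    return [p / total for p in weighted]

# Hypothetical (weight, mean, variance) per component: one tight cluster
# for the normal rows, and one component sitting on the outlier.
components = [
    (12 / 13, (0.11, 0.11), 1e-4),
    (1 / 13, (20.01, 20.01), 1e-4),
]

# Membership of point 99 in each component.
print(responsibilities((20.01, 20.01), components))
```

The second component receives essentially all of the probability, so a membership of 1 says nothing about whether the point is typical of the training data.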


Also, the concern that GMM might assign a new outlier to an existing cluster does not seem to be a problem in this case.

We are not interested in which cluster the point belongs to (we ignore GMM's cluster assignment output). We only want to know the probability that the new point falls within the region covered by the trained data.

Training strategy:

We have trained our GMM model only on positive, non-anomalous examples, as we would for a normal distribution model or a Gaussian mixture model following the standard ML approach in Python or Octave.


Prediction for new examples:

Ex: If our positive training set has 3 clusters, then for a new example we simply predict the probability for each of the three clusters, pick the highest one, and mark the point as anomalous if that value is below a threshold "X".

We believe this is a very common case for probability-based anomaly detection. Does SAP PA currently have any algorithm to help us with this scenario? Or how can we use GMM for this purpose?
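A minimal sketch of this prediction step in plain Python (not a PAL function). One change from the description above, stated plainly: it thresholds the mixture density of the new point rather than the highest cluster membership, because memberships always sum to 1 and stay high even for far-away points. The two components, their parameters, the helper names, and the threshold value are all hypothetical; in practice the parameters would come from the trained model and the threshold would be tuned on held-out normal data.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
        for xi, m, v in zip(x, mean, var)
    )

def log_density(x, components):
    """Log of sum_k w_k * N(x | mean_k, var_k), computed with log-sum-exp."""
    terms = [math.log(w) + log_gauss(x, m, v) for w, m, v in components]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def is_anomalous(x, components, log_threshold):
    """Flag x when its log density under the mixture is below the threshold."""
    return log_density(x, components) < log_threshold

# Hypothetical (weight, mean, per-axis variance) for two normal clusters.
components = [
    (0.5, (0.105, 0.105), (1e-4, 1e-4)),
    (0.5, (0.115, 0.115), (1e-4, 1e-4)),
]
LOG_THRESHOLD = -10.0  # assumption, to be tuned on normal data

print(is_anomalous((0.10, 0.10), components, LOG_THRESHOLD))
print(is_anomalous((20.01, 20.01), components, LOG_THRESHOLD))
```

Working in log space avoids the underflow that a raw density would hit for points far from every cluster.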

We can't use PAL K-means or PAL Anomaly Detection, as you already know from the other thread where we discussed this.

As a last fallback we might have to use logistic regression or SVM, but we would rather not, because those require a good number of both positive and negative examples, which we cannot obtain.

Thanks,

Hasan

Message was edited by: Hasan Rafiq

Answers (1)


achab
Product and Topic Expert
0 Kudos

Hi Hasan, I am looping in

Best regards

Antoine

former_member186543
Active Contributor
0 Kudos

Hi Antoine,

Thanks a lot for looping Xingtian in on this thread.

We are still stuck on this issue and can't proceed further. Can you point us to any documentation, or loop in more experts, as Xingtian could be busy with priority tasks?

Thanks,

Hasan

achab
Product and Topic Expert
0 Kudos

Hi Hasan, I am sure you went through the PAL documentation already? http://help.sap.com/hana/SAP_HANA_Predictive_Analysis_Library_PAL_en.pdf

Have you searched on Google/SCN if people faced a similar issue in the past?

The error message points to a specific trace - have you checked for the details there?

Check trace for details.

Please update on these questions, I'll see what I can do next - as you probably realized by now I am not a PAL specialist 😉

Thanks & regards,


Antoine





former_member186543
Active Contributor
0 Kudos

Hi Antoine,

You are definitely a help to us

I searched Google/SCN, but it seems that nobody has used GMM in custom scenarios so far. As you know from my previous thread and discussion, we are trying to model an anomaly detection scenario, but we are stuck on how to predict probabilities for new values. Maybe my friends: , can point out the error.

Structure of my model table (ZPREDICTED_MODEL), produced by the GMM run and passed to PREDICTWITHDT:

Prediction code:


DROP TYPE PAL_DT_SCORING_DATA_T;
CREATE TYPE PAL_DT_SCORING_DATA_T AS TABLE(
  ID INTEGER,
  ATTRIBUTE1 DOUBLE,
  ATTRIBUTE2 DOUBLE,
  ATTRIBUTE3 varchar(100)
);

DROP TABLE ZPREDICTED_MODEL;
CREATE COLUMN TABLE ZPREDICTED_MODEL(
  ID integer not null primary key generated by default as IDENTITY,
  MODEL varchar(5000)
);

DROP TYPE PAL_DT_SCORING_TREEMODEL_T;
CREATE TYPE PAL_DT_SCORING_TREEMODEL_T AS TABLE(
  "ID" INTEGER,
  "MODEL" VARCHAR(5000)
);

DROP TYPE PAL_CONTROL_T;
CREATE TYPE PAL_CONTROL_T AS TABLE(
  "NAME" VARCHAR(100),
  "INTARGS" INTEGER,
  "DOUBLEARGS" DOUBLE,
  "STRINGARGS" VARCHAR(100)
);

DROP TYPE PAL_DT_SCORING_RESULT_T;
CREATE TYPE PAL_DT_SCORING_RESULT_T AS TABLE("ID" INTEGER, "SCORING" VARCHAR(50), "PROB" DOUBLE);

DROP TABLE PAL_DT_SCORING_PDATA_TBL;
CREATE COLUMN TABLE PAL_DT_SCORING_PDATA_TBL (
  "POSITION" INT,
  "SCHEMA_NAME" NVARCHAR(256),
  "TYPE_NAME" NVARCHAR(256),
  "PARAMETER_TYPE" VARCHAR(7)
);

INSERT INTO PAL_DT_SCORING_PDATA_TBL VALUES (1, 'HRAFIQ', 'PAL_DT_SCORING_DATA_T', 'IN');
INSERT INTO PAL_DT_SCORING_PDATA_TBL VALUES (2, 'HRAFIQ', 'PAL_CONTROL_T', 'IN');
INSERT INTO PAL_DT_SCORING_PDATA_TBL VALUES (3, 'HRAFIQ', 'PAL_DT_SCORING_TREEMODEL_T', 'IN');
INSERT INTO PAL_DT_SCORING_PDATA_TBL VALUES (4, 'HRAFIQ', 'PAL_DT_SCORING_RESULT_T', 'OUT');

CALL "SYS".AFLLANG_WRAPPER_PROCEDURE_DROP('HRAFIQ', 'PAL_DT_SCORING_PROC');
CALL "SYS".AFLLANG_WRAPPER_PROCEDURE_CREATE('AFLPAL', 'PREDICTWITHDT', 'HRAFIQ', 'PAL_DT_SCORING_PROC', PAL_DT_SCORING_PDATA_TBL);

DROP TABLE  #PAL_CONTROL_TBL;
CREATE LOCAL TEMPORARY COLUMN TABLE #PAL_CONTROL_TBL (
  "NAME" VARCHAR(100),
  "INTARGS" INTEGER,
  "DOUBLEARGS" DOUBLE,
  "STRINGARGS" VARCHAR(100)
);

INSERT INTO #PAL_CONTROL_TBL VALUES ('THREAD_NUMBER', 2, null, null);
INSERT INTO #PAL_CONTROL_TBL VALUES ('IS_OUTPUT_PROBABILITY', 1, null, null);
INSERT INTO #PAL_CONTROL_TBL VALUES ('MODEL_FORMAT', 0, null, null);

DROP TABLE PAL_DT_SCORING_DATA_TBL;
CREATE COLUMN TABLE PAL_DT_SCORING_DATA_TBL LIKE PAL_DT_SCORING_DATA_T;
INSERT INTO PAL_DT_SCORING_DATA_TBL VALUES (1,0.10,0.10,'A');

INSERT INTO ZPREDICTED_MODEL(model) select models from PAL_GMM_RESULTSMODEL_TBL;

DROP TABLE PAL_DT_SCORING_RESULT_TBL;
CREATE COLUMN TABLE PAL_DT_SCORING_RESULT_TBL LIKE PAL_DT_SCORING_RESULT_T;

CALL "HRAFIQ".PAL_DT_SCORING_PROC(PAL_DT_SCORING_DATA_TBL, #PAL_CONTROL_TBL, ZPREDICTED_MODEL, PAL_DT_SCORING_RESULT_TBL) with OVERVIEW;

SELECT * FROM PAL_DT_SCORING_RESULT_TBL;

I wish the SAP help documentation had, under each function's explanation, a separate section stating which function should be used to predict values with the trained model.

Thanks,

Hasan

Former Member
0 Kudos

If there is a prediction function, it will follow after the training function in the PAL manual.

Best regards,

Xingtian