Mega-Hubs cleansing in Social Network Analysis
A recurring issue in Social Network Analysis is mega-hubs: nodes that are extremely connected compared to the average connectivity of the other nodes.
SAP InfiniteInsight Social provides features to manage mega-hubs automatically. Even so, it is good practice to analyze and cleanse mega-hubs manually at the beginning of a project, both to get a good understanding of the situation and to set up SAP InfiniteInsight Social efficiently.
Impact of the mega-hubs
The impact of mega-hubs on SNA is two-fold. First, they dramatically increase the runtime, above all in the community detection phase. Second, they disturb the community detection process by masking all other possible interactions, which makes the communities less relevant.
In addition, these top nodes are meaningless from the standpoint of social analysis. They often represent companies that receive or make many calls (e.g. call centers, taxis…) or machines that automatically send SMS (booking confirmations, bank account alerts…), not real people building relationships and having influence. Because of their high degree, such nodes are likely to create artificial clusters in the network, making it harder to detect communities and hiding the real structure of the graph.
Given this negative impact, mega-hubs have to be dealt with in the same way as outliers in predictive modeling. In this case, the best choice is to remove them from the links data set.
Deleting mega-hubs has proved effective at dramatically reducing runtime while excluding only a small number of nodes. The detected communities have also proved more relevant, since the abnormally big communities organized around a single node are eliminated.
How to detect mega-hubs?
Mega-hubs are detected as top nodes, that is, the nodes with the highest degree.
Note that mega-hub is a social network definition, whereas top node is a statistical definition. The distinction may sound artificial or subtle, but customers may get confused when we use both terms interchangeably, and they often ask whether there is a difference.
As a measure of degree, one may choose the absolute degree (number of distinct contacts) or a weighted version (by number of calls, by duration...). It is also possible to differentiate by direction or call type.
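The degree variants described above can be sketched as follows. This is only an illustration: the record layout (caller, callee, call type, duration), the function name and the sample rows are assumptions made for the example, not the structure of an actual CDR table.

```python
from collections import defaultdict

# Hypothetical CDR rows: (caller, callee, call_type, duration_seconds)
cdr = [
    ("A", "B", "VOICE", 60),
    ("A", "B", "VOICE", 30),
    ("A", "C", "SMS", 0),
    ("B", "C", "VOICE", 120),
    ("D", "A", "SMS", 0),
]

def out_degrees(records, call_type=None):
    """Per-caller OUT degree: distinct contacts, number of events, total duration.

    If call_type is given, only records of that type are counted.
    """
    contacts = defaultdict(set)
    events = defaultdict(int)
    duration = defaultdict(int)
    for caller, callee, ctype, secs in records:
        if call_type is not None and ctype != call_type:
            continue
        contacts[caller].add(callee)   # absolute degree: distinct contacts
        events[caller] += 1            # weighted by number of calls/SMS
        duration[caller] += secs       # weighted by duration
    return {
        node: {
            "distinct": len(contacts[node]),
            "events": events[node],
            "duration": duration[node],
        }
        for node in contacts
    }

out_all = out_degrees(cdr)                   # all call types
out_sms = out_degrees(cdr, call_type="SMS")  # SMS OUT only
```

The IN degree can be obtained the same way by swapping the roles of caller and callee.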
Once the degree is calculated for each node, the usual best practice is to exclude all nodes whose degree exceeds the threshold "mu + 4 sigma", where mu is the average of the degrees and sigma is their standard deviation.
Detailed statistical analysis on projects has nevertheless shown that this 4-sigma rule may detect top nodes poorly, being either too harsh or too permissive.
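As a minimal sketch of the "mu + 4 sigma" rule, assuming the degrees are already available as a node-to-degree mapping: the function name, the toy data and the choice of the population standard deviation (rather than the sample one) are assumptions of this example.

```python
from statistics import mean, pstdev

def top_nodes(degree_by_node, k=4.0):
    """Return (threshold, flagged nodes) for the rule degree > mu + k * sigma."""
    values = list(degree_by_node.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    threshold = mu + k * sigma
    flagged = {node for node, d in degree_by_node.items() if d > threshold}
    return threshold, flagged

# Toy data: 99 ordinary nodes with degree 10 and one very connected node
degree_by_node = {f"user{i}": 10 for i in range(99)}
degree_by_node["callcenter"] = 5000

threshold, flagged = top_nodes(degree_by_node)
print(round(threshold, 1), sorted(flagged))
```

In this toy case the single extreme node is the only one above the threshold.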
This may be caused by extreme values of sigma:
- A low standard deviation, compared to the mean, may result in filtering that is too harsh.
- A high standard deviation will result in almost no filtering. This may happen when considering the degree for SMS OUT, because the population is often split into heavy SMS senders and subscribers who send almost no SMS.
This was the case for a project with a telecom company in the Middle East. The standard deviations were comparable to the means, so mu + 4 sigma produced values that were high but still realistic, and that could correspond to the behavior of a real user. For the SMS OUT degree the opposite occurred: the standard deviation was so high that almost no record was filtered out.
As a consequence, we recommend testing several multiples of sigma, and evaluating the number of mobile numbers and calls that would be filtered out, before deciding on a threshold.
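One possible way to run this comparison is sketched below. It assumes the links are available as (source, target) pairs and uses the number of distinct contacts, in either direction, as the degree. The function name, the candidate multipliers and the toy graph are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev

def sweep_thresholds(links, multipliers=(2, 4, 6, 8)):
    """For each multiplier k, report what mu + k * sigma would filter out."""
    neighbours = defaultdict(set)
    for src, dst in links:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    degree = {node: len(c) for node, c in neighbours.items()}
    values = list(degree.values())
    mu, sigma = mean(values), pstdev(values)

    rows = []
    for k in multipliers:
        threshold = mu + k * sigma
        hubs = {node for node, d in degree.items() if d > threshold}
        # A link is dropped as soon as one of its ends is a hub
        dropped = sum(1 for s, t in links if s in hubs or t in hubs)
        rows.append({
            "k": k,
            "threshold": threshold,
            "nodes_excluded": len(hubs),
            "node_share": len(hubs) / len(degree),
            "links_excluded": dropped,
            "link_share": dropped / len(links),
        })
    return rows

# Toy graph: one hub "H" linked to 50 users, plus a chain between the users
links = [("H", f"u{i}") for i in range(50)]
links += [(f"u{i}", f"u{i + 1}") for i in range(49)]

rows = sweep_thresholds(links)
for row in rows:
    print(row["k"], round(row["threshold"], 1),
          row["nodes_excluded"], row["links_excluded"])
```

Comparing the rows side by side shows how many nodes and links each candidate threshold would remove, which is the information needed before fixing the multiplier.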
Examples of results
On the project with the telecom company in the Middle East, only 16 877 nodes were excluded (0.03% of total nodes), but this filtered out 16 750 000 links, that is 4.2% of total links and a ratio of nearly one thousand links per excluded node!
For a project with another telecom company, 568 000 nodes were identified as mega-hubs out of 102M nodes (0.56%). As a result, 25% of links were filtered out (330M out of 1 300M links, a ratio of 580 links/node)! Another major finding is that the resulting CDR table contains far fewer distinct nodes than expected: 83M, that is a reduction of 19%. This is because most of the subscribers who disappeared had only received SMS sent by the identified mega-hubs.
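The reduction in distinct nodes can be illustrated with a small sketch. The node names and links below are invented for the example and are not project data. Removing the links of a mega-hub also removes every node whose only links were with that hub.

```python
def remove_hub_links(links, hubs):
    """Drop every link that has a mega-hub at either end."""
    return [(s, t) for s, t in links if s not in hubs and t not in hubs]

def distinct_nodes(links):
    """Set of nodes that still appear in at least one link."""
    return {node for link in links for node in link}

# Hypothetical links: an SMS alert sender and three subscribers
links = [
    ("BANK_ALERTS", "u1"),
    ("BANK_ALERTS", "u2"),
    ("BANK_ALERTS", "u3"),
    ("u1", "u2"),
]
hubs = {"BANK_ALERTS"}

kept = remove_hub_links(links, hubs)
before = distinct_nodes(links)
after = distinct_nodes(kept)

# Non-hub nodes that vanish because all their links involved a hub
lost_non_hubs = before - after - hubs
print(len(before), len(after), sorted(lost_non_hubs))
```

Here "u3" disappears from the links data set even though it is not a mega-hub, because its only link was an SMS received from the hub.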
You will find attached to this page an example of SQL code used in a project to calculate the degrees and analyze the sigmas.