A synthetic data generator for online social network graphs

9/25/2023

In online social networks, users volunteer what is perhaps considered more personal information to their selected circles. But each person has personal preferences about what they consider public and what they consider private. The problem is that public information may be used to disclose information that users expect to remain confidential.

Using the web for communication, purchases, searching for information and/or socializing generates data about ourselves, our connections and our activities, and this data is easily collected. Internet users in general, and on-line social network users in particular, are becoming more savvy about masking data they consider private. However, some of this masked data may be inferable from other data the user has not masked. Furthermore, even if a user masks all of their data, it may still be inferable from the unmasked data of their friends, due to affinities in likes and personal attributes. In this paper we propose a novel rule induction approach (that incorporates predictive metrics) which enables a user to evaluate the potential risk incurred by unmasked attributes, friends' attributes and also the risk of befriending new users. In contrast to the conventional data mining approach, in which a model is built for all users, we build a rule set which is individualized for each user. We find that all of these risks are quantifiable and that a risk ranking of attributes and friends/potential friends can be individualized for each user. We give examples and use cases and confirm the effectiveness of the approach, using sophisticated synthetic OSN data to define risk attribute and user combinations which coincide with the risk ranking produced by our algorithm.

We consider the re-identification of users of on-line social networks when they participate in several different on-line social networks, potentially using several different accounts. The re-identification of users serves several purposes: (i) commercial use, so as to avoid redundant mailing to the same user; (ii) enhancement of the information available about these users by unifying information from different sources; (iii) consolidation of accounts by on-line social network providers; (iv) identification of potentially malicious users and/or bots. We highlight that all this should occur within the bounds of data protection and privacy laws, as well as the users' expectations on such matters, to avoid backlash. In this paper, we explore this situation first by a formalization using the SAN model to conceptually structure information as a graph, which includes user and attribute type nodes.

This formalization enables us to reason on two issues. First, how to identify that two or more user accounts belong to the same user. Second, what gains in predictability are obtained after re-identification. For the first issue, we show that a set-difference approach is remarkably effective. For the second issue, we explore the impact of re-identification on predictability using two different machine learning algorithms: C4.5 (decision tree induction) and SVM-SMO (a Support Vector Machine trained with Sequential Minimal Optimization). Our results show that as predictability improves, in some cases different SAN metrics emerge as predictors.
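To make the idea of comparing accounts by set difference more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method from the paper: the text above does not define the exact measure, so the sketch assumes each account (a user node) is described by the set of attribute nodes it links to, and scores a pair of accounts by the size of the symmetric difference between those sets. The account names, attribute values and the helper functions `difference_size` and `rank_candidate_pairs` are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy SAN: each account (user node) maps to the set of
# attribute nodes it links to. All names and values are invented.
san = {
    "net_a:alice01": {"city:Barcelona", "likes:jazz", "likes:chess", "job:engineer"},
    "net_b:a.smith": {"city:Barcelona", "likes:jazz", "likes:chess", "job:engineer",
                      "likes:hiking"},
    "net_b:bob_77": {"city:Sydney", "likes:football", "job:teacher"},
}


def difference_size(attrs_a, attrs_b):
    """Count attribute nodes linked to exactly one of the two accounts.

    Assumption: a small symmetric difference suggests the same user.
    """
    return len(attrs_a ^ attrs_b)


def rank_candidate_pairs(graph):
    """Return (score, account, account) tuples, smallest difference first."""
    pairs = [
        (difference_size(graph[a], graph[b]), a, b)
        for a, b in combinations(sorted(graph), 2)
    ]
    return sorted(pairs)


ranked = rank_candidate_pairs(san)
for score, a, b in ranked:
    print(score, a, b)
```

In this toy data the two accounts that share most attribute nodes come out at the top of the ranking. A real setting would need a threshold or a normalization by set size, and would have to cope with masked attributes; both are left out of the sketch.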
