A Systematic Review on the Suspicious Profiles Detection on Online Social Media Data

Escalating crime in the digital sphere compels law enforcement bodies to keep watch over online activities, which involve massive amounts of data. This raises the need to detect suspicious activities in social media data available online by optimizing investigations with data mining tools. This paper reviews the data mining techniques designed and developed for closely examining social media data for suspicious activities and profiles in different domains. Additionally, the study categorizes the techniques into groups, highlighting their important features, challenges and application realms.
Article history: Received: 14 July 2017; Accepted: 08 August 2017

publish illegal and suspicious content (images, videos, texts ...) in order to exchange data online and share ideas that could affect the security of countries or institutions. The advent of these interactive networks has led to an increase in crime, as they ease criminal conversations and the transfer of data (suspicious messages). For example, media sharing websites such as YouTube allow users to publish videos on "how to create a bomb". The social network Facebook and the microblog Twitter also help criminals to coordinate and manage suspect actions online. A recent record 16 reports the crime percentage of seven big Indian cities, as shown in Figure 1.
A survey (Gladwell and Shirky 2011) shows that the popular social networking websites Facebook and Twitter have over 1 billion and over 240 million active users, respectively 1 . People share news articles and comment on the posts of other people, making the news experience more participatory than before. According to a 2010 report by CNN, 75% of news was forwarded to other people through email and 37% of news items were shared on Facebook and Twitter. The network of social media users is becoming the fastest and most effective way to disperse news and to comment, review and communicate throughout the world. These websites offer freely available APIs, such as the Twitter API, which makes datasets readily available.
Thus, numerous data mining methods have been adopted to date for detecting suspicious discussions in social media datasets. With these methods, suspicious activities can be uncovered by analyzing the interests of all users. The main hurdle researchers face in doing so is the lack of information retrieval and data analysis tools for real-time data from social media websites. The resulting database is huge, so an intelligent data mining algorithm is required to extract the desired knowledge from the large search space of social data. The basic process of suspicious activity identification is shown in Figure 2.

Challenges in Social Media Data
Presently, individuals are able to share and articulate their opinions and viewpoints on various aspects of life on a single message board, namely social websites. This data represents a massive virtual space where anyone can hold discussions in the form of posted messages. User preferences are generally captured by analyzing the attitudes and behaviors expressed in their comments on the websites. A user's loyalty is measured, and their sentiments towards any topic tracked, by monitoring the suspicious activities and discussions in their posts on social media websites. The main hurdle researchers face in doing so is the lack of information retrieval and data analysis tools for real-time data. The resulting database is huge, so an intelligent data mining algorithm is required to extract the desired knowledge from the large search space of online social media data. Moreover, the massive number of parameters in the search space makes exhaustive search infeasible. Consequently, efficient search approaches are of great importance. Thus, the study and analysis of textual data from online social websites involves numerous issues and challenges. The available data is not in a ready-to-use format. Some of the major problems are discussed below:

Grammar and Spellings
Most users make many semantic or even spelling mistakes when they post something on the web. In applications that use such datasets, these mistakes are handled during the pre-processing phase.

Trustworthiness
The number of user views on different subjects signifies the importance of data on the web. Unfortunately, numerous fake accounts are created to inflate the view counts of posts, and fake reviews are given to either promote or demote a post or an entity on the web platform.

Format
Every online website has its own format for posting data, and different users have different posting styles. For example, a hashtag (#) is used to tag subjects and @ is used to refer to other users. Hence, each website needs to be studied separately.
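As an illustration of such site-specific formats, the following minimal Python sketch separates hashtags and user mentions from a post. The function name `extract_tags_and_mentions` and the sample post are hypothetical, and the patterns assume that tags and user names consist only of letters, digits and underscores, which may not hold for every website.

```python
import re

def extract_tags_and_mentions(post):
    """Return (hashtags, mentions) found in a post, without the # or @ prefix."""
    # \w+ matches letters, digits and underscores (an assumed naming rule).
    hashtags = re.findall(r"#(\w+)", post)
    mentions = re.findall(r"@(\w+)", post)
    return hashtags, mentions

# Hypothetical sample post.
tags, users = extract_tags_and_mentions("Meeting @alice and @bob_99 tonight #plan #city2017")
```

A real pipeline would need one such parser per website, since each platform defines its own markers.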

Language
Online websites also offer the option to post views or data in different languages. A translation option is also available for understanding posts written in another language.

Categories of Techniques Employed
Online suspicious activity detection techniques can be categorized under different heads depending on the way they handle the data. Researchers have tried to group the techniques developed for monitoring suspicious discussions based on different criteria. However, the grouping proposed by Murugesan, Devi et al. 2016 2 has gained acceptance. These categories are presented below with their specifications and related work.

Brute Force Algorithms
In the brute force strategy, the stemmer's lookup table contains the relations between inflected and root forms. A word is stemmed by querying the table for a matching inflection. When a matching inflection is found, the root form associated with it is returned.
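The lookup described above can be sketched in a few lines of Python. The table contents and the function name `brute_force_stem` are hypothetical, and returning the word unchanged when no entry matches is an assumption of this sketch; the text does not say what happens on a miss.

```python
# Hypothetical lookup table: inflected form -> root form.
LOOKUP_TABLE = {
    "running": "run",
    "ran": "run",
    "bombs": "bomb",
    "attacked": "attack",
}

def brute_force_stem(word, table=LOOKUP_TABLE):
    """Return the root form stored for an inflected word."""
    key = word.lower()
    # Assumption: a word missing from the table is returned as-is (lowercased).
    return table.get(key, key)
```

The accuracy of such a stemmer depends entirely on how complete the table is, which is why the table tends to grow large.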

Matching Algorithms
Matching algorithms use a stem database (for example, a document set containing stem words). The algorithm searches the stem database for a match of the word that needs to be stemmed. Different constraints, such as the relative length of the stem, are applied to the match. For example, the stem "inter" of the words "international" or "interpersonal" should not be considered the stem of the word "interest".
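One possible reading of this matching step is sketched below in Python. The stem database, the function name `match_stem`, the `min_ratio` threshold and the rule of preferring the longest matching stem are all assumptions made for illustration; the text names the relative-length constraint but does not specify how it is computed.

```python
# Hypothetical stem database.
STEM_DATABASE = {"inter", "interest", "nation", "be"}

def match_stem(word, stems=STEM_DATABASE, min_ratio=0.3):
    """Return the longest database stem that prefixes the word, or None.

    A stem is accepted only if it covers at least min_ratio of the word,
    which is one way to express a relative-length constraint.
    """
    word = word.lower()
    candidates = [
        s for s in stems
        if word.startswith(s) and len(s) / len(word) >= min_ratio
    ]
    # Assumption: among the accepted stems, the longest one wins.
    return max(candidates, key=len) if candidates else None
```

With this database, "interest" resolves to its own stem rather than to "inter", while "international" still resolves to "inter". Raising `min_ratio` to 0.5 keeps "be" as the stem of "been" but rejects it for "beside".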

Emotional Algorithms
An emotional algorithm is used to detect human emotions via video, audio, text and so on. On online websites, users post comments or share their thoughts mainly in text format. The following methods are used to detect emotions in text: the keyword spotting technique, learning-based methods and hybrid methods.

Keyword Spotting Technique
The keyword spotting technique, or pattern matching, involves finding occurrences of keywords from a given set of substrings. The problem has been studied in the past and various algorithms have been suggested for it.
In terms of emotion detection, this technique relies on pre-defined keywords. These words are classified into categories such as happy, fear, disgust, disgusted, exclaimed, dull etc. The input to the technique is text, which is first tokenized. Words in the text that are related to the emotion keywords are then identified and analysed. Finally, the sentence is checked for the presence of negation and the identified emotion class is delivered as output.
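The steps above (tokenization, keyword lookup, negation check, output of the emotion class) can be sketched as follows. The lexicon, the negation word list, the two-token negation window and the "neutral" fallback are hypothetical choices made for this sketch, not part of any surveyed system.

```python
import re

# Hypothetical emotion lexicon: keyword -> emotion class.
EMOTION_KEYWORDS = {
    "happy": "happy",
    "glad": "happy",
    "afraid": "fear",
    "scared": "fear",
    "disgusted": "disgust",
    "bored": "dull",
}
# Hypothetical negation words.
NEGATIONS = {"not", "no", "never"}

def spot_emotion(sentence):
    """Return the emotion class of the first emotion keyword in the sentence."""
    # Tokenization: lowercase words, keeping apostrophes.
    tokens = re.findall(r"[a-z']+", sentence.lower())
    for i, token in enumerate(tokens):
        emotion = EMOTION_KEYWORDS.get(token)
        if emotion is None:
            continue
        # Negation check: look at the two tokens before the keyword (assumed window).
        negated = any(t in NEGATIONS for t in tokens[max(0, i - 2):i])
        return "not " + emotion if negated else emotion
    # Assumption: text without any emotion keyword is labelled neutral.
    return "neutral"
```

The sketch also shows the main weakness of keyword spotting: a sentence that expresses an emotion without using a listed keyword is not detected.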

Learning-Based Methods
Learning-based methods evaluate the problem differently. Initially, the problem was to identify emotions in the given input data, but nowadays it is framed as classifying input text into different emotion classes. Unlike keyword-based methods, learning-based methods use a previously trained classifier for emotion identification. They make use of machine learning classifiers such as Naïve Bayes and Support Vector Machine to identify the emotion present in textual data.
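To make the idea of a previously trained classifier concrete, the following is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing, written in plain Python. The training sentences, labels and function names are hypothetical; a practical system would train on a large labelled corpus and would typically rely on an established machine learning library.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in samples:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def predict(model, text):
    """Return the class with the highest log-posterior for the text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one smoothing avoids zero probabilities.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data.
TRAINING_DATA = [
    ("i am so happy and glad", "happy"),
    ("what a happy joyful day", "happy"),
    ("i am scared and afraid", "fear"),
    ("this is a scary frightening night", "fear"),
]
model = train_naive_bayes(TRAINING_DATA)
```

Unlike keyword spotting, the classifier weighs every known word in the input, so the quality of its output depends on the training data rather than on a hand-built keyword list.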

Hybrid Methods
The results obtained by using only the keyword-based technique or only the learning-based technique are not satisfactory, so some systems use a hybrid approach. This approach combines the features of both techniques, thus improving accuracy and output results. The hybrid system proposed by Wu, Chuang and Lin is one of the most significant systems. It uses a rule-based hybrid approach to extract the semantics related to specific emotions. Attributes are obtained from a Chinese lexicon ontology. These attributes and semantics carry emotions. Hence, emotion keywords are replaced by rules, which serve as features for training the classifier. Unfortunately, the emotion categories found by this approach are limited.

Soft Computing Techniques
These methods have been applied to text document clustering, treated as an optimization problem. A soft clustering algorithm such as fuzzy c-means has been applied to high-dimensional text document clustering. It allows a data object to belong to two or more clusters with different degrees of membership. Further, it was combined with harmony search to improve the efficiency of document clustering. Nature-inspired methods, an innovative field, have also been employed, such as PSO (Karol and Mangat 2013) 3 , GA (Song and Park 2006) 4 , ACO (Azaryuon and Fakhar 2013) 5 and the bees algorithm (AbdelHamid, Halim et al. 2013) 6 . To develop more efficient hybrid approaches, these methods were further combined and successfully applied to text mining (Premalatha and Natarajan 2010) 7 .
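The notion of a data object belonging to several clusters with different memberships can be illustrated with the standard fuzzy c-means membership formula, u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)), where d_i is the distance from the object to cluster center i and m > 1 is the fuzzifier. The sketch below computes only this membership step for fixed centers; the function name, the sample points and the choice m = 2 are assumptions, and a full clustering run would alternate this step with center updates.

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Return the membership of a point in each cluster (values sum to 1)."""
    distances = [math.dist(point, center) for center in centers]
    # A point lying exactly on one or more centers belongs fully to them.
    on_center = [d == 0.0 for d in distances]
    if any(on_center):
        share = 1.0 / sum(on_center)
        return [share if hit else 0.0 for hit in on_center]
    exponent = 2.0 / (m - 1.0)  # requires m > 1
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]

# Hypothetical 2-D document vectors: one point, two cluster centers.
memberships = fuzzy_memberships((0.0, 0.0), [(1.0, 0.0), (3.0, 0.0)])
```

Here the point is three times closer to the first center, so it receives a much larger membership in the first cluster while still belonging partly to the second.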

Existing Work
The scarcity of publicly available real data for experiments and the lack of published, well-researched methods and techniques are the two most frequently raised criticisms of research on suspicious activity detection based on data mining.

Study Gap
In recent years, much research has focused on understanding how individuals express their opinions online and on exploring its use as an alternative data collection modality for surveys and as a fundamental way to gather information, predict opinion, identify knowledge and support related tasks. However, finding accurate information in social big data is becoming a challenge for public and private research organizations. Moreover, the evolution of the internet has led to the growth of innumerable cybercrimes. Criminals use social networking websites, cell phones and messenger applications to send suspicious messages, which makes dynamically tracing their activities more difficult. Table II gives the particulars of the work done by various authors over time and the corresponding study gap.

Fig. 1: Crime Percentage of Seven Big Indian Cities

Fig. 3: Text Mining Process

Recently, static Facebook messages have been scanned to identify criminals' behavior. Also, in 2015, the authors of (Siguenza-Guzman, Saquicela et al. 2015) 14 presented a literature review of data mining applications in academic libraries, in which they identified various techniques for monitoring the special categories of words required for a specific journal or library. In 2016, the authors of (Diaz, Gamon et al., 2016) 15 discussed online and social media data as an imperfect continuous panel survey. One more research article (Tayal, Jain et al., 2015) 16 identifies various data mining techniques for crime detection and criminal identification, specifically in India.
Consequently, efficient search approaches are of great importance. Numerous papers have been published in this domain to date. However, no review paper is available so far that consolidates the current research.

Table 1: Various applications developed to date using online or social media data