Research and Investigation of Spam User from Twitter Dataset
A Research and Development Project
YASHVANT KUMAR PATEL (N9790608)
PROJECT COORDINATOR: Dr. VENKAT VENKATACHALAM
SUPERVISOR NAME: DARSHIKA KOGGALAHEWA
In today's era, social media has become extremely popular: statistics suggest that most people (around 70%) use social media every day and spend on average 90% of their Internet time on it. Facebook, Instagram, LinkedIn and Twitter are the most popular social network sites and apps of the digital era. These online social networks (OSNs) continuously add new features as user engagement grows, so data privacy and data leakage have become major issues for these websites.
Furthermore, people and groups discuss many topics on social media such as Twitter, and these platforms have also become promotion and advertising channels for industries such as entertainment and e-commerce. For example, the movie business promotes upcoming movies and documentaries on Twitter or Facebook. Because of competition, many directors and producers pay spam users to promote their products, so genuine users do not get a proper review of the product. In short, these paid accounts create fraudulent reviews, and the creators of such reviews are known as spam users. Moreover, Twitter's features make it easy for spammers to spread unwanted messages and advertisements, and for malicious users to target Twitter users with phishing and other attacks such as viruses. Spam users may also be automated, for example machine learning algorithms or bot software. Our main concern is therefore to identify spam users and their fake accounts.
Many techniques and analysis methods have been proposed to identify spam users, and a few have been implemented, but none is efficient enough. Some are slow, taking a couple of years to detect a few fake users in a dataset, whereas a movie, the product of the movie industry, usually completes its business within a year.
We conducted a literature review of 17 highly ranked papers and investigated existing and new methods. Our main goal is to identify and develop a technique for detecting spam users in a Twitter dataset (Twitter is a microblogging platform for sending short messages of up to 140 characters) using the patterns of spam users and legitimate users. To that end, we studied recent papers to evaluate and understand the existing techniques and tools, and their features, for detecting fraudulent users on social network sites, especially Twitter. These features and a dataset were collected to build a new, efficient and workable model for spam or fraudulent user detection. The spam dataset was tested using well-known, trusted classifiers and was also compared against a dataset of trusted, legitimate users. We compared the output for the two datasets to identify the more accurate model for filtering spammers from Twitter data. Several techniques were used. Average follower count and followee count play a prime role in detecting spam users in social network data. We applied this strategy to a large dataset, tested it on datasets of millions of records, and then compared the graphs for legitimate and spam users, which indicate the trends for fake and genuine users.
Table of Contents:
1.1 Motivation
1.2 Purpose
1.3 Significance
1.4 Objectives
2. Project Methodology
2.1 Stages of Literature Review
2.2 DSDM Approach
2.3 Sprint Table
3. Literature Review
3.1 Online Social Network Twitter
3.3 Traditional Approaches
3.4 Existing research
3.5 Summary of Quality Work
4. Development Plan
4.1 Methodology
4.2 Collection of Data
4.3 Identifying properties of spam user
5. Output and Its Comparison
7. Future Development
Appendix-I Detail summary of Each Literature Review
Appendix-II Example of tweet
Appendix-III Source code of System
Appendix-IV Log sheet of Week Meetings
In today's world, social media sites such as Facebook, Instagram, LinkedIn and Twitter are more popular than any other websites on the Internet. Facebook had around 1.51 billion monthly users according to statistics released in 2016. Spam users use online social networks to spread fake messages or reviews every day. A spammer is a person or group whose intention is to post fake reviews, news and messages, often using techniques such as machine learning algorithms, bots and automation software, and usually through fraudulent accounts. Their main intentions are cyber-attacks such as phishing and drive-by downloads, as well as advertising and fake product reviews. The problem is that a constant stream of fraudulent activity may drive legitimate users to spend less time on social media and reduce their interest in it. It may also leak users' private information and sometimes lead to economic loss. Correctly identifying spam users on online social networks would therefore make social websites more user friendly and increase their use. Security is also a prime concern for online social networks: because they are so popular today, they collect more details from users, such as credit card details, and in some cases link to payment gateways.
The first and foremost challenge is that spammers change their strategies day by day. In other words, they constantly adapt to existing detection methods and technologies so that they resemble legitimate users, using similar formats, techniques and profiles (Yang, 2013). They also create networks of spam users across regions; for example, a spam user from Queensland may connect with spam users from Victoria and other cities and states. In this situation, spam users become harder to detect and identify because clustering algorithms also become ineffective.
Our intention is to detect spammers on Twitter, a microblogging website based on followers and followees. The people who follow a user are known as that user's followers, and the people whom a user follows are known as followees. Twitter lets a user send a short message of at most 140 characters that is visible to all of the user's followers. By default, Twitter profiles are public, which means that anybody can follow anyone unless the default setting has been changed to protected. If an account is protected, permission from the user is required to follow them.
For Twitter and its datasets, existing methods do not identify spammers efficiently and effectively, because Twitter has a very large user base, including famous singers, actors, actresses, politicians, cricketers and other players, and contains billions of tweets, which are simply short messages. In 2008, Twitter officially stated that fake accounts followed by large numbers of users were creating problems for the whole system. At that time a message-based filtering approach was used to detect spam users, but it was not investigated properly and did not produce accurate output, because spammers continuously change their URLs and message patterns and also insert images in messages.
Since then, many researchers have proposed solutions, but none has overcome this issue with high accuracy. Researchers continue to look for more accurate solutions by mapping the relationships between the factors, properties and features of Twitter, such as URLs, follower count, followee count, number of tweets, tweet length, and the relationships and networks between followers and mutual followers or followees. Even so, spam users have not been detected reliably by feature-linking or distance algorithms.
In the highly rated papers we reviewed, researchers use the KNN algorithm to measure the distance from one user to another. KNN stands for K-nearest neighbours, where K is a constant chosen in advance. The PageRank algorithm is the second choice of researchers, but it has the limitation that it needs data about the social network structure. As the name suggests, it ranks nodes (here, user accounts) and uses those ranks to detect spam or fake users. All in all, we need a system that works well on chunks of a dataset and runs in real time.
The above diagram represents the various theoretical methods for detecting spam users on social networks, based on the literature review we conducted. Among them, we use one method and its strategy to refine our logic.
Therefore, we develop a method that measures the distance between followers and followees and, based on that, identifies spam users. We then investigate the patterns of spam users and legitimate users.
The above figure represents supervised and unsupervised learning with a training dataset.
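As a minimal illustration of the supervised setting, the K-nearest-neighbour idea mentioned earlier can be sketched as follows. The choice of features (follower and followee counts), the log scaling, the toy training rows and the function name `knn_predict` are all assumptions made for illustration; they are not the project's dataset or implementation.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    train: list of ((followers, followees), label) pairs
    query: (followers, followees) tuple
    """
    # Log-scale the counts so very large accounts do not dominate the distance.
    def scale(point):
        return tuple(math.log1p(v) for v in point)

    q = scale(query)
    # Sort training rows by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda row: math.dist(scale(row[0]), q))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled accounts: (followers, followees) -> label
TRAIN = [
    ((15, 1900), "spam"),
    ((8, 2500), "spam"),
    ((30, 4000), "spam"),
    ((350, 300), "legitimate"),
    ((1200, 800), "legitimate"),
    ((90, 120), "legitimate"),
]

print(knn_predict(TRAIN, (20, 3000)))
print(knn_predict(TRAIN, (400, 350)))
```

With these toy rows the first query sits among the accounts that follow many users but have few followers, and the second sits among the balanced accounts. On real data the value of K and the feature scaling would need to be chosen by validation.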
This report contains the highly ranked research papers, their outcomes at a glance, and their methods for social network sites.
Firstly, we collect research papers and conduct a literature review, producing a report that reviews each paper's method, problem context and outcome individually.
Secondly, we collect and identify an appropriate dataset and social network and apply a method or strategy to detect spam users in that dataset. We then apply the same strategy to a legitimate-user dataset as well.
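Applying the same strategy to both datasets can be pictured with a short sketch that computes the average follower count, the average followee count and their ratio for each set. The field names `followers` and `followees`, the helper `follow_profile` and the sample values are hypothetical stand-ins, not the collected data.

```python
from statistics import mean

def follow_profile(accounts):
    """Return average follower count, followee count and their ratio.

    accounts: list of dicts with 'followers' and 'followees' keys.
    """
    avg_followers = mean(a["followers"] for a in accounts)
    avg_followees = mean(a["followees"] for a in accounts)
    # Guard against division by zero when nobody in the set follows anyone.
    ratio = avg_followers / avg_followees if avg_followees else float("inf")
    return {
        "avg_followers": avg_followers,
        "avg_followees": avg_followees,
        "ratio": ratio,
    }

# Hypothetical toy samples standing in for the two datasets.
spam_accounts = [
    {"followers": 10, "followees": 2000},
    {"followers": 30, "followees": 1000},
]
legit_accounts = [
    {"followers": 400, "followees": 200},
    {"followers": 600, "followees": 300},
]

spam = follow_profile(spam_accounts)
legit = follow_profile(legit_accounts)
print(spam)
print(legit)
```

Comparing the two summaries side by side is the numerical counterpart of comparing the two graphs: a much lower follower-to-followee ratio in one set is the kind of trend the comparison is meant to expose.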
• The primary aim of the project is to identify an existing problem of social networks and find the closest solution.
• Conduct a literature review on social network mining.
• Determine how to use the findings of the literature review (LR) in a theoretical model.
• Perform a preliminary evaluation of the basic model.
• Expand the basic or theoretical model.
• Develop a method that can identify spam users in real time at low cost and in little time.
As social network sites (SNS) have become more popular, they have also attracted spamming. This has happened only in the last few years, so many researchers have begun studying spam and its activity, each typically addressing one part of this large, ongoing problem. This report therefore identifies and collects the latest research papers covering the whole scenario, studies each paper and its methodology to understand the actual problem, and summarises every high-quality paper in detail. Furthermore, it analyses and verifies the proposed techniques and puts forward a new method that is efficient and stable, so that future development can build on it and achieve more accurate results in practice.
In this way, the literature review helps scholars understand the flow of spam user activity and the methodologies proposed and developed over the last few years in one place. A generic method for measuring the distance for each factor in a Twitter dataset is also explained in detail, and it can be applied to other datasets as well.
The objectives of the project are listed below.
a) To research all factors of social networking relevant to achieving very high accuracy, close to 100%;
b) To discuss the results and outcomes of the method;
c) To divide or categorise similar methodologies into groups;
d) To provide future study directions and the scope of this kind of study and field.
The main goal of this report is to produce a proficient literature review summary and analysis, and to identify one methodology whose efficiency can be shown theoretically and practically, so that the model can be considered theoretically proven. It will also be beneficial for other scholars and researchers.
2. Project Methodology
2.1 Stages of Literature Review
Phase 1: Search Highly qualified and High Rank Research Papers
This stage requires identifying and collecting articles relevant to the motivation, objectives and contribution of the project. A literature search through the QUT library was used as the starting point for accessing online databases. A variety of queries were run against the QUT library database, which returned quality papers held in academic databases such as ProQuest, Emerald Insight and Wiley Online. In line with the scope of the project, material relating to SaaS design was examined with publication dates from 1995 onwards. Theories concerning employee practice and user motivation within organisational settings were considered with publication dates from 1985 onwards. The literature was not restricted by geographic or political region. Where a publication had been translated from another language, the findings and claims of the selected document were cross-checked against and supported by other English-language sources. This reduced the risk of errors in the data and findings and supported an objective and accurate understanding of the literature.
Phase 2: Process Literature
In this phase, we evaluated and analysed the collected papers. This confirms that each downloaded paper is closely relevant to the objective, which in our case means social network mining and spam user detection. Each paper should also be in a proper format so that it can be understood easily. We also ranked the papers by abstract to decide the order of review: generic papers first, then papers on more specific tasks and methodology. The main intention of this phase is to extract good learning outcomes for any researcher. Combining theoretical knowledge leads to accurate, result-driven methods.
Phase 3 : Sequence of Literature
In this stage, we review the literature in sequence and identify the problem addressed by each paper, followed by its methodology, analysis and outcome or result. We then made a table containing an at-a-glance review of all papers. This lets us combine and compare papers and learn, for example, which algorithms or methods are most popular in recent research and which properties are used to detect spam users. The table also contains a summary of each paper with its outcome and conclusion.
Column-wise, the first column is the index of the research paper (a sequence number), the second is the name or title of the paper, followed by the methodology of the paper and the analysis or conclusion of that paper.
Phase 4 : Summarise Literature Review
This phase is about gathering information and valuable knowledge. Here we identify the differences between the methodologies of the articles and the factors each method uses. This reveals the similar and different functionality of each paper and yields clear, meaningful summaries. It also leads to the motivation, purpose and significance of the project with respect to detecting spam users on social networks and the attributes of malicious or fake users.
2.2 DSDM approach
The agile DSDM (Dynamic Systems Development Method) technique is applied as the project management methodology. The benefits of and reasons for choosing agile as the project method are as follows.
DSDM never compromises on quality and focuses on the project or business needs. In this project, I needed to focus on the quality of the research and on the project requirements, for example the literature review, analysis and findings.
DSDM communicates continuously and clearly. Here, I needed to communicate with my supervisor continuously and clearly. This is a vital part of the project because any question, problem or obstacle is resolved with the supervisor's help. In this project I encountered several problems. In addition, I was new to spam user detection and unfamiliar with the test results, but with my supervisor's help I was able to carry out the research for this project. My supervisor helped at every stage whenever I needed it, so clear communication with the supervisor is essential.
Agile always delivers on time. As I had a tight time limit, it made sense to use agile DSDM, because by applying this method I was able to finish the project on time. The method also uses MoSCoW prioritisation, which I needed in order to understand what was in scope and out of scope for the project.
2.3 Sprint Table
Sprint Number | Work or Task | Category | Number of days | Initial date | Finalise Date
1 | Identify project and literature topic for project | Reviewing projects | 6 | 26th Feb 2018 | 3rd March 2018
2 | Project plan and Agreement | Literature Review | 4 | 5th March 2018 | 9th March 2018
3 | Collecting High ranked literature paper | LR | 8 | 10th March 2018 | 17th March 2018
4 | Presentation of Plan | Presentation | 2 | 18th March 2018 | 19th March 2018
5 | Investigating the deceptive information in Twitter spam | LR | 1 | 21st March 2018 | 22nd March 2018
6 | Fake profile detection techniques in large-scale online social networks: A comprehensive review | LR | 2 | 23rd March 2018 | 24th March 2018
7 | Twitter spam detection: Survey of innovative approaches and comparative study | LR | 1 | 25th March 2018 | 25th March 2018
8 | Twitter spammer detection using data stream clustering | LR | 2 | 27th March 2018 | 28th March 2018
9 | Follow spam detection based on cascaded social information | LR | 1 | 28th March 2018 | 28th March 2018
10 | Discovering social spammers from multiple views | LR | 1 | 28th March 2018 | 29th March 2018
11 | Discovering social spammers from multiple views | LR | 1 | 30th March 2018 | 30th March 2018
12 | Combating the evolving spammers in online social networks | LR | 2 | 1st April 2018 | 1st April 2018
13 | Fame for sale: Efficient detection of fake Twitter followers | LR | 1 | 2nd April 2018 | 2nd April 2018
14 | Recent developments in social spam detection and combating techniques: A survey | LR | 1 | 3rd April 2018 | 4th April 2018
15 | Real-time event detection from the Twitter data stream using the Twitter News+ Framework | LR | 1 | 5th April 2018 | 5th April 2018
16 | Detecting spammers on social networks | LR | 1 | 7th April 2018 | 8th April 2018
17 | Co-detecting social spammers and spam messages in microblogging via exploiting social contexts | LR | 2 | 8th April 2018 | 9th April 2018
18 | Constrained NMF-based semi-supervised learning for social media spammer detection | LR | 1 | 9th April 2018 | 9th April 2018
19 | A Domain Transferable Lexicon Set for Twitter Sentiment Analysis Using a Supervised Machine Learning Approach | LR | 2 | 9th April 2018 | 11th April 2018
20 | Malicious accounts: Dark of the social networks | LR | 2 | 11th April 2018 | 13th April 2018
21 | Automated detection of human users in Twitter | LR | 1 | 14th April 2018 | 14th April 2018
22 | Combination report of all literature | Writing | 2 | 14th April 2018 | 15th April 2018
23 | Summarise each Paper | Summary | 11 | 23rd April 2018 | 12th May 2018
24 | Collecting Dataset and Cleaning Data set for work | Technical | 2 | 20th April 2018 | 22nd April 2018
25 | Identifying Spam user detection technique or algorithm | Tech | 7 | 25th April 2018 | 4th May 2018
26 | Applying Method and generating Result | Tech | 5 | 5th May 2018 | 10th May 2018
27 | Editing factor and plotting graph appropriate | Tech | 4 | 11th May 2018 | 14th May 2018
28 | Comparing result with legitimate user data set | Tech | 4 | 16th May 2018 | 20th May 2018
29 | Preparing final Presentation of project | Summary | 3 | 20th May 2018 | 22nd May 2018
30 | Writing Final Report and Sent to Supervisor for Review | Writing | 5 | 25th May 2018 | 31st May 2018
3. Literature Review
According to [3], the number of Sina Weibo users has reached over 500 million. Statistics show that Weibo has consistently been among the top 25 most frequently visited websites during the past few years [7]. As one of the largest social networks in China, Weibo attracts millions of users online every day. The Weibo application is similar to Twitter: users post messages, interact with friends, talk about news and share interesting topics via social network services. It is designed as a microblogging website where users post short messages of no more than 140 characters. The posted messages are delivered to followers immediately. Each user is identified by a unique username and can start following another user in order to receive that user's latest messages on the homepage. The user who is followed can either accept the request and follow back, or simply reject it. Fig. 1 describes a simple following graph, in which user A is following user B, and user B and user C are following each other. There are a number of expressions in Weibo that allow users to interact with others in a better way, including mention, repost and hashtag.
Mention: a Weibo message containing a keyword of the form @username, meaning that the message sender is willing to share something with the user mentioned. As a consequence, Weibo automatically notifies the mentioned user with the message on his/her homepage.
During the 2014 World Cup, there were more than 672 million tweets related to the World Cup (Rogers, 2014).
3.1 Online Social Network Twitter
Online social networks have become popular platforms for spammers to spread malicious content and links. Existing state-of-the-art optimization methods mainly use one kind of user-generated information (i.e., single view) to learn a classification model for identifying spammers. Due to the diversity and variability of spammers' strategies, spammers' behaviour may not be completely characterized by a single view's information. To tackle this challenge, we first statistically analyse the importance of considering multiple view information for the spammer detection task on a large real-world Twitter dataset. Twitter is a leading and highly successful social media platform, with 271 million active users who send around 500 million short messages per day. In addition, spam activity on the platform continues to grow.
Furthermore, there is a very fine line between tweeting to promote a service and appearing to be a spam account. Medium-scale business owners using Twitter are often tempted to use every opportunity to drive traffic to their website by tweeting links to it. That includes tweeting about their business using hashtags. A large number of Twitter users are already aware of these strategies followed by businesses and easily ignore those tweets. The following are some features that spam accounts on Twitter typically exhibit.
Low reputation score:
Reputation is defined as the ratio between the number of followers and the number of friends (reputation = followers / friends). If the reputation is close to zero, the user is following many more accounts than follow it. Spam accounts tend to follow more people in the hope of building up their own follower base.
Duplicate tweets:
Spam accounts on Twitter post the same content many times, changing only a small part of the tweet. For example, they aim to send identical content to many users by including different @usernames in their duplicate tweets.
Tweets with URLs:
The probability that a user clicks a link in a tweet increases if the user is following the spammer. Spam users target their followers by posting links to malicious websites in most of their messages. URL shortening services hide the target URL, so spam accounts tend to use such services to post links to their websites.
Unrelated hashtags:
Spam accounts try to lure legitimate users into reading their tweets by posting multiple unrelated items with hashtags. These accounts hope to reach more users quickly, so they include trending and popular hashtags in their tweets.
High tweet activity:
Spam accounts post a high number of tweets every day. More than 50 tweets per day is considered high activity, because it means a person has to tweet at least once every half hour throughout the day. Some accounts on Twitter go to the extent of posting very large numbers of tweets every day.
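The indicators listed above can be turned into per-account numbers with a short sketch such as the one below. It is only an illustration: the function names, the regular expressions, the way duplicates are detected (stripping @mentions and URLs before comparing) and the sample tweets are assumptions, and only the 50-tweets-per-day threshold comes from the text.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
HIGH_ACTIVITY_TWEETS_PER_DAY = 50  # threshold quoted in the text above

def normalise(text):
    """Strip @mentions and URLs so near-duplicate tweets compare equal."""
    text = MENTION_RE.sub("", text)
    text = URL_RE.sub("", text)
    return " ".join(text.lower().split())

def account_features(followers, friends, tweets, days_observed):
    """Compute the indicators for one account.

    tweets: list of tweet texts posted during `days_observed` days.
    """
    n = len(tweets)
    distinct = len({normalise(t) for t in tweets})
    with_url = sum(1 for t in tweets if URL_RE.search(t))
    hashtags = sum(len(HASHTAG_RE.findall(t)) for t in tweets)
    per_day = n / days_observed
    return {
        # reputation = followers / friends; infinite if the account follows nobody
        "reputation": followers / friends if friends else float("inf"),
        # share of tweets that repeat an already-seen normalised text
        "duplicate_ratio": (n - distinct) / n if n else 0.0,
        "url_ratio": with_url / n if n else 0.0,
        "hashtags_per_tweet": hashtags / n if n else 0.0,
        "tweets_per_day": per_day,
        "high_activity": per_day > HIGH_ACTIVITY_TWEETS_PER_DAY,
    }

# Hypothetical sample: three near-identical promotional tweets and one normal one.
sample_tweets = [
    "@alice win a free phone http://spam.example/a #win",
    "@bob win a free phone http://spam.example/a #win",
    "@carol win a free phone http://spam.example/a #win",
    "hello world",
]
features = account_features(5, 500, sample_tweets, days_observed=1)
print(features)
```

None of these numbers is a verdict on its own; they are candidate inputs for a classifier or for the graph comparison between spam and legitimate accounts.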
Accordingly, we propose a generalized social spammer detection framework by jointly integrating multiple view information and a novel social regularization term into a classification model. To keep the completeness of the original dataset and detect more spammers by the proposed method, we introduce a simple strategy to fill the missing data for each view. Experimental results on a real-world Twitter dataset show that the proposed method outperforms the existing methods significantly.
The primary challenge of detecting spammers is that they upgrade their spamming strategies rapidly to keep pace with the development of detection systems (Yang et al., 2013). For methods that use common features based on user profiles and message content, such as Chu et al. (2012), Egele et al. (2013) and Gao et al. (2012), spammers can avoid detection by purchasing followers or by using tools to automatically post messages with the same meaning but different words (Yang et al., 2013). Yang et al. (2012) found that spammers tend to be interconnected, forming account communities, which renders certain advanced features for detecting spammers, such as the clustering coefficient, ineffective. The basic assumption of the PageRank-based approach is that there are a limited number of edges holding reciprocal social relationships between spammers and legitimate users, yet evidence has been found that legitimate users follow spammers more than expected. Ghosh et al. (2012) found that a small fraction of users, called social capitalists, follow back everyone who follows them in order to increase their popularity. Yang et al. (2012) also discovered supporter accounts that help spammers avoid detection by increasing their followers, allowing them to prey on more victims.
As traditional detection methods cannot address the new strategies adopted by spammers, researchers have proposed some new approaches to meet these challenges. For example, Yang et al. (2013) used features that are more sophisticated than the previous ones to improve the efficiency of machine learning classifiers. Boshmaf et al. (2016), based on information about victims, who are benign social network users with mutual connections to spammers, made the PageRank-based approach more robust. These studies provide deeper insight into the difference between spammers and legitimate users and improve detection accuracy. However, whether an individual user is a spammer is inferred by these methods from user characteristics at a single instant of time. Their real-world data about users are collected at a single point of time, and the experiments are conducted and evaluated from the perspective of a static social network. In fact, social networks are constantly changing, and spammers may be able to improve the effectiveness of an attack through constant effort (Liu et al., 2015). Therefore, as spammers evolve their strategies to evade detection, the ability of these approaches to detect them effectively becomes doubtful.
In this study, we introduce temporal factors into the detection of spammers by examining the activities of users over an extended period of time and provide a detection framework to identify the spammers that evade detection by changing their strategies. Intuitively, although many spammers can make their accounts look like legitimate user accounts at some static time points to avoid being detected, it is impossible for them to control the dynamic changing process of features over an extended period of time because of the high cost (Yang et al., 2013).
To achieve our research objectives, we collect the profiles of a large number of social network users and track their activities over a series of points of time. A window-based dynamic metric is used to evaluate the temporal evolution patterns of users and find a clear difference between legitimate users and spammers regarding different aspects of the temporal patterns. Based on the dynamic metric, new temporal user features are designed to detect spammers. Instead of using these features to detect spammers directly, we investigate the similarity in the temporal patterns of different spammers, and conduct a clustering algorithm (Maulik and Bandyopadhyay, 2000) on users by abstracting their dynamic metrics into feature vectors. The results indicate that it is relatively easier to group spammers into the same cluster. We combine the new features with the clustering results to build a machine learning classifier for accurate detection of spammers. Finally, we evaluate our approach using the real-world dataset and demonstrate the effectiveness of our approach by comparing it with two traditional spammer detection methods.
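The text does not define the window-based dynamic metric, so the sketch below shows only one plausible form of such a metric, under stated assumptions: a feature (here, follower count) is sampled at equal intervals, averaged over non-overlapping windows, and the relative change between consecutive windows is summarised as a volatility score. The function names and sample histories are hypothetical.

```python
def window_changes(series, window):
    """Relative change of a feature across consecutive, non-overlapping windows.

    series: feature values (e.g. follower counts) sampled at equal intervals.
    window: number of samples per window.
    """
    # Average the samples inside each full window.
    means = [
        sum(series[i:i + window]) / window
        for i in range(0, len(series) - window + 1, window)
    ]
    # Relative change from one window mean to the next.
    return [
        (cur - prev) / prev if prev else 0.0
        for prev, cur in zip(means, means[1:])
    ]

def volatility(series, window):
    """Mean absolute window-to-window change; one candidate temporal feature."""
    changes = window_changes(series, window)
    return sum(abs(c) for c in changes) / len(changes) if changes else 0.0

# Hypothetical follower-count histories sampled once per day.
steady_user = [100, 100, 102, 102, 104, 104]
bursty_user = [100, 100, 300, 300, 150, 150]

print(window_changes(bursty_user, 2))
print(volatility(steady_user, 2), volatility(bursty_user, 2))
```

The per-window changes form a vector per user, which is the kind of feature vector that could then be passed to a clustering algorithm as the passage describes.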
Twitter spam has long been a critical but difficult problem to be addressed. So far, researchers have proposed many detection and defence methods in order to protect Twitter users from spamming activities. Particularly in the last three years, many innovative methods have been developed, which have greatly improved the detection accuracy and efficiency compared to those which were proposed three years ago. Therefore, we are motivated to work out a new survey about Twitter spam detection techniques. This survey includes three parts: 1) A literature review on the state of the art: this part provides detailed analysis (e.g. taxonomies and biases on feature selection) and discussion; 2) A comparative study: this part compares methods on a universal testbed (i.e. same datasets and ground truths) to provide a quantitative understanding of current methods; 3) Open issues: the final part summarises the unsolved challenges in current Twitter spam detection techniques. Solutions to these open issues are of great significance to both academia and industry. Readers of this survey may include those who do or do not have expertise in this area and those who are looking for a deep understanding of this field in order to develop new methods.
Online Social Networks (OSNs) are popular collaboration and communication tools for millions of Internet users. As a major social networking platform, Twitter attracts users by providing free microblogging services that let them broadcast or discover messages within 140 characters, follow other users and so on, through different devices such as mobile phones and desktops. Every day, millions of Twitter users share their moments or post their discoveries, such as breaking news, to their followers. However, the openness and convenience of the Twitter platform also attract criminal accounts (spammers), who attack the platform in order to make money illegitimately. These attacks include spam, scams and phishing. As there is a restriction on the length of tweets, it is common for spammers to broadcast unsolicited spam tweets, which can redirect users to external malicious websites (Lee and Kim, 2013). Compared to traditional spam spread through email, Twitter spam is more dangerous and sophisticated in luring Internet users into being deceived (Thomas et al., 2011). According to a recent report (Grier et al., 2010), the click-through rate of Twitter spam reaches 0.13%, while email spam achieves only 0.0003% ~ 0.0006%.
3.3 Traditional Approaches
Traditional approaches for combating spammers mainly focus on analysing and extracting users' features, and then applying existing classification methods to detect spammers or spam campaigns [3, 9–11]. As spamming strategies evolve, methods that rely only on these features cannot effectively detect spammers who use new strategies. Ranking schemes that use social network information are also employed in some anti-spam measures, which can decrease the spammers' impact on legitimate users [12, 13]. However, it is hard for these ranking methods to distinguish legitimate users from spammers when they depend only on network information.
Social media and the wide adoption of the Web introduce user feedback, reviews, and comments as consumable web content. Organizations can benefit from analysing these user inputs to offer better services, refine product designs, improve user experience, and manage the organization's overall performance. User input is often presented online and, in the case of Twitter, opinions are expressed in real or near real time with the potential of reaching a very wide audience in a matter of seconds. Overall, as stated by Peoria et al. (2016), “the opportunity to capture the opinion of general public … has raised increasing interest of both scientific community and the business world.” To analyse user input, organizations use sentiment analysis or opinion mining tools. Sentiment analysis is defined as “the task of finding the opinions of authors about specific entities” (Feldman, 2013). Sentiment analysis can be based on and assessed at the document, sentence, or word level. For Twitter, we use the whole tweet as the basis for our analysis and assume that the whole tweet contains an opinion or a sentiment. To take advantage of Twitter data, however, firms must collect, store, and analyse the immense amount of data that Twitter produces each day. In 2016, there were more than 319 million active users sending more than 500 million tweets per day (http://statista.com/statistics/282087/numberof-monthly-tive-twitter-users/). Some of the most prolific accounts on Twitter receive hundreds of thousands of Twitter messages a day (e.g. Xbox Support has more than 400,000 followers and receives more than 1.5 million tweets daily; Justin Bieber receives more than 300,000 tweets daily). During the 2014 Related works.
3.4 Existing research
In the past ten years, email spam detection and filtering mechanisms have been widely implemented. The main work can be summarized in two categories: the content-based model and the identity-based model. In the first model, a series of machine learning approaches [8, 9] are used to parse content according to keywords and patterns that indicate potential spam. In the identity-based model, the most commonly used approach is for each user to maintain a whitelist and a blacklist of email addresses that should and should not be blocked by the anti-spam mechanism [10, 11]. More recent work brings social networks into email spam identification using Bayesian probability [12]. The concept is to use the social relationship between sender and receiver to decide closeness and trust values, and then to increase or decrease the Bayesian probability according to these values.
With the rapid development of social networks, social spam has attracted a lot of attention from both industry and academia. In industry, Facebook proposes the EdgeRank algorithm [13], which assigns each post a score generated from a few features (e.g., number of likes, number of comments, number of reposts, etc.). The higher the EdgeRank score, the lower the likelihood that the poster is a spammer. The disadvantage of this approach is that spammers can join one another's networks and continuously like and comment on each other's posts in order to achieve a high EdgeRank score. Two further approaches consider features (followed-to-follower, URL ratio, message similarity, message sent, friend number, etc.) with potential for spammer detection. However, although both approaches introduce convincing frameworks for spammer detection, they lack detailed specification and prototype evaluation. Wang [16] proposes a naïve Bayesian spammer classification algorithm to distinguish suspicious behaviour from normal behaviour on Twitter, with a precision result (F-measure value) of 89%. Gao et al. [17] adopt a set of novel features for effectively reconstructing spam messages into campaigns rather than examining them individually (with a precision value over 80%).
The disadvantage of these two approaches is that they are not precise enough. Benevenuto et al. [18] collect a large dataset from Twitter and identify 62 features related to tweet content and user social behaviour. These characteristics are used as attributes in a machine learning process that classifies users as either spammers or non-spammers. Zhu et al. [19] propose a matrix factorization-based spam classification model that collaboratively induces a succinct set of latent features (over 1000 items), learned through social relationships, for each user on the Renren site (www.renren.com). However, both approaches rely on a large number of selected features, which might consume heavy computing capacity and require a long time for model training.
In the Sina Weibo field, the literature [20] investigates three types of spammer behaviour (aggressive advertisement, duplicate reposting and aggressive following) and extracts three separate sets of features. Unlike the main approach, in which one spammer classifier uses all features, this proposal is based on a group of classifiers that use the three generated feature sets and work jointly as a spammer classifier. Combining several spamming classifiers is expected to improve detection performance. However, because each separate feature set might not contain enough feature items (8 at most), the result might be inaccurate (the precision rate reaches only 82.06%). Generally, this paper follows a similar concept to previous works, but with a few distinguishing points:
1. Our proposed SVM-based classification model considers only 18 feature items and achieves the best performance, with an F-measure value of over 99%. This is the best result achieved so far (although datasets collected with different contents might cause some deviation in the computed result, the improvement is still comparable and significant).
2. The importance of each selected feature is studied and verified using Weka [21], a Java-based data mining tool. The combined use of these features also explains why the proposed approach is able to achieve a much higher precision rate than other existing works.
3. Instead of a pure experiment on a specific dataset, prototype software is specifically developed and opened for public use, helping any user to identify spammers on the Sina Weibo network.
The above diagram shows URL filtering used to remove spam users from a social network. A URL filter is a system or technique that, based on the length or other properties of a URL, blocks spam users and allows legitimate users. Firewalls also use URL filtering to stop attacks from unknown or spam users.
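The idea of filtering on URL length or other URL properties can be sketched in a few lines of Python. This is only an illustration and not the filter shown in the diagram: the length threshold `MAX_URL_LENGTH`, the blocklist `BLOCKED_HOSTS` and both function names are assumptions made for this example.

```python
from urllib.parse import urlparse

# Hypothetical threshold and blocklist, chosen only for illustration.
MAX_URL_LENGTH = 100
BLOCKED_HOSTS = {"spam.example", "malware.example"}


def is_url_allowed(url: str) -> bool:
    """Return True if the URL passes the illustrative length and host checks."""
    if len(url) > MAX_URL_LENGTH:
        return False
    host = urlparse(url).hostname or ""
    return host.lower() not in BLOCKED_HOSTS


def filter_messages(messages: list[str]) -> list[str]:
    """Keep only messages whose http(s) links all pass is_url_allowed."""
    kept = []
    for text in messages:
        urls = [w for w in text.split() if w.startswith(("http://", "https://"))]
        if all(is_url_allowed(u) for u in urls):
            kept.append(text)
    return kept
```

A production filter would normally also resolve shortened links before checking them, since spam tweets often hide the final destination behind a URL shortener.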
3.5 Summary of Quality Work
The table below summarises the reviewed literature; details are attached in Appendix 2.
Index | Literature Name | Method or Technique | Result
1 | Investigating the deceptive information in Twitter spam | Graphical clustering technique | Most clicks are on malware
2 | Fake profile detection techniques in large-scale online social networks: A comprehensive review | Sybil detection strategies, along with their traits | Efficiency is calculated based on a model's computational requirements
3 | Follow spam detection based on cascaded social information | Page base algorithm | Recommends three novel strategies: TSP-Filtering, SS-Filtering and Cascaded-Filtering
4 | Discovering social spammers from multiple views | Neurocomputing | The framework can identify more spammers than baseline approaches
5 | Addressing the class imbalance problem in Twitter spam detection using ensemble learning | Machine learning based schemes | The original imbalanced data yield the best results
6 | Combating the evolving spammers in online social networks | Window-based dynamic metric | Detects spammers using a supervised classification approach on the study's dataset
7 | Fame for sale: Efficient detection of fake Twitter followers | Machine learning-based classifiers | Exploits features able to achieve detection rates comparable with the best-of-breed classifiers, whereas the latter require overhead-demanding features
8 | Recent developments in social spam detection and combating techniques: A survey | Different techniques that have been devised or proposed to combat a variety of email spam | Gives a road map for how newer techniques may be devised in future to tackle the menace
9 | Real-time event detection from the Twitter data stream using the Twitter News+ Framework | Detects events based on term co-occurrences in real time | Evaluated using a publicly available corpus, which allows researchers to compare different systems fairly against the proposed system
10 | Detecting spammers on social networks | A naïve Bayesian spammer classification algorithm to distinguish suspicious behaviour from normal behaviour on Twitter | SVM-based classification model considers only 18 feature items and achieves the best performance, with F-measure value reaching over 99%
11 | Co-detecting social spammers and spam messages in microblogging via exploiting social contexts | A constrained nonnegative matrix factorization | The proposed model performs significantly better than the conventionally applied supervised classifiers used for spammer detection
12 | Constrained NMF-based semi-supervised learning for social media spammer detection | Most sentiment algorithms | Large corpora that are manually classified to provide large training and testing datasets
13 | A Domain Transferable Lexicon Set for Twitter Sentiment Analysis Using a Supervised Machine Learning Approach | Sentiment lexicon | Offers the possibility of employing a transferable lexicon set across all domains with high accuracy, as presented in the study
14 | Malicious accounts: Dark of the social networks | Adjacency matrix or set of features | Proposes different features and techniques to identify malicious users and their behaviours
15 | Automated detection of human users in Twitter | Algorithm with standard metrics measuring behavioural characteristics | Classification and clustering approaches perform with equal accuracy in separating human and non-human users
4. Development Plan
Because of the structure of this project, the methodology is divided into two parts: one for the literature review and one for development. The development part covers data collection and analysis, implementation of the detection logic or strategy, and application of that strategy to the dataset, each of which is discussed in detail below.
4.2 Collection of Data
We searched many Twitter user datasets and compared their features with one another, and finally decided to use the Social Network Profile Dataset from Kaggle, as it provides almost all the features required for our logic or strategy. It contains profile data for more than 17,000 users, including both spam and legitimate users.
4.3 Identifying properties of spam user
Properties of a spam user:
Spam users are usually more active than other users, but because of the nature of their posts, people do not follow them much. For example, if you follow a spam user who always posts about a single product or about political content that you dislike, you are likely to unfollow that user; meanwhile, they continue to follow you, because their intention is to promote or sell something to you. From this behaviour we can infer that one property of spam users is that they have few followers while following many users.
There is also a clear relationship between follower and followee counts for most users; the exceptions are a few people, such as celebrities, who have a high number of followers but follow few users.
The above figure explains the algorithm used to identify spam users in a Twitter dataset.
First, we try to understand the features of the Twitter dataset.
• URL ratio (U)
• Message Similarity (S)
• Messages Sent (M)
• Follower Count
Followee: A person or group of people whom a user follows is known as a followee.
URL: URL stands for Uniform Resource Locator, a string that contains the path to a particular user or to their messages or tweets.
Message Similarity: Similar messages in the already mined data are collected and grouped together.
Follower: A person or group of people who follow a Twitter user are followers of that user.
Follower Count: The number of people who follow a user.
Messages Sent: The number of tweets generated by a single user.
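The features defined above could be computed per user roughly as follows. This is a minimal sketch under stated assumptions rather than the method used in this project: URL ratio is taken as the fraction of tweets containing a link, message similarity as the mean word-level Jaccard similarity over all pairs of tweets, and all function names are hypothetical.

```python
from itertools import combinations


def url_ratio(tweets: list[str]) -> float:
    """Fraction of tweets that contain at least one http(s) link."""
    if not tweets:
        return 0.0
    with_url = sum(1 for t in tweets if "http://" in t or "https://" in t)
    return with_url / len(tweets)


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two messages."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def message_similarity(tweets: list[str]) -> float:
    """Mean pairwise Jaccard similarity over all tweet pairs (assumed definition)."""
    pairs = list(combinations(tweets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def profile_features(tweets: list[str], follower_count: int, followee_count: int) -> dict:
    """Collect the per-user features listed in this section into one record."""
    return {
        "url_ratio": url_ratio(tweets),
        "message_similarity": message_similarity(tweets),
        "messages_sent": len(tweets),
        "follower_count": follower_count,
        "followee_count": followee_count,
    }
```

Pairwise comparison grows quadratically with the number of tweets, so for very active accounts a sample of recent tweets would probably be used instead of the full history.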
We have two datasets. The first is the Social Network Profile dataset from Kaggle, which contains data for more than 17,000 Twitter users, including follower count, followee count, profile name, number of tweets and so on.
The second is the honeypot dataset, which contains legitimate users only, so we can test our logic separately and verify whether it works.
We first consider follower count and followee count, because for genuine users there is usually not much difference between the two: legitimate users tend to have nearly equal follower and followee counts.
First, we collected the dataset and cleaned it in R, a language used for data mining, so that we obtain more accurate results. In the data cleaning process we remove empty tuples, unexpected values within tuples and so on.
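The cleaning step was carried out in R; the sketch below shows the same idea in Python purely for illustration. The column names `followers_count` and `friends_count` and the function name `clean_profiles` are assumptions, and "unexpected value" is interpreted here as a missing, non-numeric or negative count.

```python
def clean_profiles(rows: list[dict]) -> list[dict]:
    """Drop rows whose follower/followee counts are missing or invalid."""
    cleaned = []
    for row in rows:
        try:
            followers = int(row["followers_count"])
            followees = int(row["friends_count"])
        except (KeyError, TypeError, ValueError):
            continue  # empty tuple or non-numeric field
        if followers < 0 or followees < 0:
            continue  # unexpected negative count
        # Keep the row, with the two counts converted to integers.
        cleaned.append({**row, "followers_count": followers, "friends_count": followees})
    return cleaned
```

Rows are dropped rather than repaired, which keeps the later mean and centroid calculations free of guessed values at the cost of a slightly smaller dataset.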
Secondly, we calculate the number of followers and the number of followees for each user, as shown in the first block of the figure. Then, using R column functions, we calculate the difference between follower count and followee count.
After that, we build the feature dataset from the sum of differences. Then we calculate the centroid of the vector.
Finally, we calculate the sparse vector from the centroid to the follower count.
We then plot the difference between follower count and followee count against the centroid difference.
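One possible reading of these steps is sketched below in Python. The text does not fully specify how the centroid and the vector from it are defined, so this example assumes that the centroid is the mean (follower count, followee count) point and that each user's distance from it is the Euclidean distance; the function name `follower_gap_analysis` is hypothetical.

```python
import math
from statistics import mean


def follower_gap_analysis(profiles: list[tuple[int, int]]):
    """profiles holds one (follower_count, followee_count) pair per user."""
    # Step 1: per-user difference between follower and followee counts.
    diffs = [followers - followees for followers, followees in profiles]
    # Step 2: centroid of all users, assumed to be the coordinate-wise mean.
    centroid = (mean(f for f, _ in profiles), mean(e for _, e in profiles))
    # Step 3: Euclidean distance of each user from the centroid.
    distances = [math.dist(p, centroid) for p in profiles]
    return diffs, centroid, distances
```

For three users with counts (100, 100), (200, 200) and (0, 900), the third user has both the largest difference and the largest distance from the centroid, which is the kind of isolated point the plots are meant to expose.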
Furthermore, we apply the same algorithm to the legitimate users and compare the output for both datasets. Legitimate users show a particular pattern that the Kaggle dataset output does not, so we can identify spam users fairly closely. However, this model uses only two factors, so it might not be 100% accurate; it can be extended in future.
5. Output and Its Comparison
As discussed, we have two datasets: the Social Network Profile dataset from Kaggle and the honeypot dataset of legitimate users.
The above picture shows the data summary and all the factors required to produce the graph for the Kaggle dataset, which contains both spam and legitimate users.
The first column represents follower count, the second followee count and the third the difference between them. The mean difference between follower and followee counts is 4118.38, which is quite high because of the spam users in the data.
The above picture shows the data summary and all the factors required to produce the graph for the honeypot dataset, which contains only legitimate users.
Again, the first column represents follower count, the second followee count and the third the difference between them. Here the mean difference between follower and followee counts is just 547.0248, roughly 7.5 times smaller than in the previous table. This indicates that spam users have a large difference between follower and followee counts.
The above diagram shows the first 6 rows of the honeypot dataset.
The above diagram plots follower count and friends count against the mean averages for the Kaggle dataset. The isolated blue dots represent spam users, while the groups of blue dots on the red line are legitimate users; in other words, users completely covered by the red line are legitimate.
The above diagram plots follower count and friends count against the mean averages for the legitimate dataset. Here the isolated blue dots are still directly connected to other users, so they may be celebrity-like accounts that have more followers.
The output and results show that there is a correlation between follower count and followee count, so spam users can first be identified using this strategy. Nowadays, however, spammers also change their attributes and spamming behaviour, so this one feature cannot catch all spam users; as noted earlier, they try to look like genuine users, and because of that nobody identifies them. In future it will therefore be very important to map the relationships among all the other factors. In addition, this paper provides an at-a-glance summary of previous quality work, so that readers can understand a specific task within a short period of time. Except for celebrities, there is a mutual relation between a user's followers and other users, which would be valuable to explore in future Twitter mining work. Our system is also an iterative model, so we can add new features, test them and continuously increase the efficiency of the model.
Furthermore, message length and tweet similarity can also help to identify spam activity: affiliate spammers simply post URLs in their tweets, in the same way that a movie reviewer posts the movie name, or a link to the promo of the movie or entertainment product, in every tweet.
The same happens in politics, where someone can manipulate a whole election by misusing social media. For example, in India, political parties hire social network workers or spammers to spread the messages they want, and because of that people never get to know the actual situation of the country or of a specific region.
7. Future Development
As discussed above, we considered two factors: follower count and followee count. There are still other factors to mine that could give more accuracy, such as tweet or message length and URLs.
Regarding message length, spammers usually send similar kinds of tweets frequently, and sometimes they post one message with a URL; in that case we have to focus on URL filtering, which is explained above. Why was that part not done in this paper? Because of limitations of data and time, we could not include all properties of the dataset, but it can be done in future.
For example, suppose user A has 300 followers and user B has 250 followers. For genuine users it frequently happens that some of them are mutual followers or followees, so we could even map the distance from one user to another and obtain highly accurate results. This, too, can be done in future.
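As a rough illustration of how such a distance between two users might be mapped, the sketch below measures the overlap of their follower sets with the Jaccard index. This is an assumption made for the example, not a method evaluated in this project, and the function names are hypothetical.

```python
def mutual_overlap(followers_a: set[str], followers_b: set[str]) -> float:
    """Jaccard overlap of two follower sets (0 = disjoint, 1 = identical)."""
    union = followers_a | followers_b
    if not union:
        return 0.0
    return len(followers_a & followers_b) / len(union)


def overlap_distance(followers_a: set[str], followers_b: set[str]) -> float:
    """Distance between two users: small when they share many followers."""
    return 1.0 - mutual_overlap(followers_a, followers_b)
```

Under this measure, two genuine users who share many followers would sit close together, while an account whose followers overlap with almost nobody would stand out at a distance near 1.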
Furthermore, this paper contains 17 paper summaries, so it will be easy for new researchers to identify new methods that might give better results.
Thus, both theoretically and practically, many things can happen in the near future.
Previous studies of spam users on social network sites have laid the groundwork for ongoing research. The main aim of this study is to review the related papers and their methodologies in order to identify or create a new feature or model for detecting spam users, and to make this paper helpful for future researchers who want to learn the ideas and techniques of current research on spam detection in social networks, which is a vast topic in today's digital era. The summary of the literature and the strategy developed from it clearly show the connection between the literature review and the development part. Our model is a theoretical model, so any researcher can investigate it further. Scholars can also apply trial-and-error methods to verify other properties of this social network and of other social sites. In addition, further analysis suggests that the language of a message is a prime factor in identifying spamming activity on a social network.
Furthermore, the literature review part follows a standard methodology applied by other researchers and is based on their hypotheses and results, especially in selecting the dataset and the particular social network to mine. In addition, this paper has given an opportunity to refocus on all the methodologies and existing tools for the spam filtration process, and on which social networks face more spamming issues than others. Although this type of study might not be repeated in future, it should motivate other researchers to investigate this ongoing issue further, both in general and in specific ways.
All in all, the main aim of this paper is to understand the main aspects and characteristics of spam users. Future studies might find new properties of spam users and their behaviour on social network sites, although the purpose of such users will remain the same. At the same time, they can introduce other ways to detect spam users on social networks or Twitter, based on the summary of the literature review and the development methodology.
[1] C. Grier, K. Thomas, V. Paxson, M. Zhang, @spam: the underground on 140 characters or less, in: Proceedings of the 17th ACM Conference on Computer and Communications Security, CCS'10, ACM, New York, NY, USA, 2010, pp. 27–37.
[2] Z. Miller, B. Dickinson, W. Deitrick, W. Hu, A.H. Wang, Twitter spammer detection using data stream clustering, Inform. Sci. 260 (2014) 64–73.
[3] K. Thomas, C. Grier, V. Paxson, Adapting social spam infrastructure for political censorship, in: Proceedings of the 5th USENIX Conference on Large-Scale Exploits and Emergent Threats, LEET'12, USENIX Association, Berkeley, CA, USA, 2012, pp. 13–13.
[4] A. Bifet, G. Holmes, R. Kirkby, B. Pfahringer, MOA: massive online analysis, Journal of Machine Learning Research 11 (2010) 1601–1604.
[5] C. Bohm, K. Kailing, H. Kriegel, P. Kroger, Density connected clustering with local subspace preferences, in: ICDM, 2004, pp. 27–34.
[6] F. Cao, M. Ester, W. Qian, A. Zhou, Density-based clustering over an evolving data stream with noise, in: SIAM Conference on Data Mining, Bethesda, 2006.
[7] M. Ester, H. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: International Conference on Knowledge Discovery in Databases and Data Mining (KDD-96), Portland, August 1996, pp. 226–231.
[8] Facebook's US User Growth Slows but Twitter Sees Double-Digit Gains, eMarketer (2012): n. pag. Web. 27 June 2012.