Recommending Whom to Follow in Twitter Using Genetic Algorithm
By Ajay Karri, Rajiv Neal, Harsheel Saraiya
Mentored by Prof. Lixin Gao
Introduction

A social network is composed of individuals and organizations that are connected by some common interest. The importance of social networks is increasing with the popularity of websites like Facebook, Twitter, and Orkut, and this is influencing human social behavior. There is considerable interest in understanding the complexity of the graph representing a network and in trying to predict the formation of links between its nodes.

Motivation

We suggest the formation of new links between user nodes in the network. The motivation behind this idea is that a recommendation system might help a user reconnect with an old friend with whom he or she has lost contact. The same idea applies to e-commerce: with the increase in popularity of e-commerce websites, our system can recommend items that buyers might consider buying, based on the products they already bought or browsed earlier. This is likely to increase sales for the company. This approach is popular among websites like Amazon, Wal-Mart, and Target.

How is our recommendation system different from Twitter's recommendation system?

Twitter recommends mainly famous personalities and gives less importance to the people who follow us, even though they may have good connectivity and reputation. In our project we plan to give equal importance to the people who follow us and to famous personalities, taking into account the other factors discussed in Section 4.

Project Overview

In our project we implement a friend recommendation system for Twitter, a popular social networking website. In essence, we recommend nodes that a given user should follow. Various algorithms exist that provide this functionality. Most of them fall under the ‘k-nearest neighbor’ approach or the ‘Collaborative Filtering’ approach.

We use a Genetic Algorithm for the ‘whom to follow in Twitter’ Recommendation System.
It is based on the topology and structure of the network surrounding the central user. Unlike other topology-based approaches, we use a different concept of clustering indexes and a new calibration technique, the Fitness Function.

We developed an algorithm that analyzes the sub-graph composed of a user and all other connected people within three degrees of separation. However, only users at two degrees of separation are candidates to be recommended to the user. The algorithm uses the patterns defined by the connections between the nodes to find those users who behave similarly to the root user.
In Section 4 we explain the parameters used in our recommendation system. In Section 5 we explain the genetic algorithm, and in Section 6 we propose our recommendation system.

Recommendation System Steps/Mechanism

The recommendation procedure uses graph topology to filter and order a set of nodes that have some important properties in relation to a given user node vi. The nodes of the resulting set are recommendations of new edges that should be connected to node vi.

The process of recommendation is broadly divided into two main steps: Filtering and Ordering.

[4.1] Filtering Step

The Filtering step selects the nodes that have a high probability of being recommended to the given user node. This step is based on the concept of the Clustering Coefficient, which is characteristic of small-world networks.

Fig 4.1: A visual example of a sub-network showing the links between a single user, his followers, and followers-of-followers in a social network.

As shown in Figure 4.1, region A represents the central user and the people whom he is following. Region B represents the followers of followers of the main user. We recommend only those nodes that are two hops away from the central node. These nodes fall in the shaded region between the circles defined by B and A.
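
The two-hop filtering described above can be sketched in Python. This is a minimal illustration rather than the project's Java implementation: the `following` dictionary (user to the users they follow), the function name `candidate_nodes`, and the sample user names are all assumptions made for the example.

```python
def candidate_nodes(following, central):
    """Return the nodes exactly two hops from `central` along follow links."""
    first_hop = set(following.get(central, ()))
    second_hop = set()
    for friend in first_hop:
        second_hop.update(following.get(friend, ()))
    # Drop the central user and anyone the central user already follows.
    return second_hop - first_hop - {central}


# Hypothetical follow lists: "u" is the central user.
following = {
    "u": ["a", "b"],
    "a": ["b", "c"],
    "b": ["d", "u"],
    "c": ["e"],
}
print(sorted(candidate_nodes(following, "u")))  # ['c', 'd']
```

Node "b" is reachable in two hops through "a" but is excluded because the central user already follows it, and "e" is excluded because it is three hops away.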
[4.2] Ordering Step

The Ordering step is as important as the Filtering step. The procedure is based on the calculation of three independent indexes (explained later) and their normalization. Finally, a Fitness Function (explained later) merges these three indexes into one value. Based on this value we put the most relevant nodes at the top of the resulting list that we obtain after each iteration of the Genetic Algorithm.

The indexes measure specific properties of the sub-graph composed of the nodes being analyzed. Other indexes could be used; the ones we use are based on the Friends-of-Friends approach. Motivated by the idea of Clustering Coefficients, we define and use a parameter called the Adjacent Density.

[4.2.1] Adjacent Density

Consider a network that has many nodes. Let C represent the set of all nodes in the given network. Then the Adjacent Density $D_C$ among the nodes in the given network is given by

$$D_C = \frac{\sum_{i \in C} \left( \sum_{j \in C} M_{ij} \right)}{\left( |C| \cdot (|C| - 1) \right) / 2}$$

M represents the adjacency matrix: $M_{ij}$ is one if there is a link between nodes i and j, and zero otherwise. The denominator is a normalizing factor, where |C| represents the number of nodes in set C.
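
As a worked illustration of the Adjacent Density, take C = {a, b, c} with links a-b and b-c. The three-node set and its links are hypothetical, chosen only to make the arithmetic visible, and each link is assumed to contribute a single 1 to the double sum. The report does not state whether M stores a link once or symmetrically; with a symmetric M the numerator would double.

```latex
D_C = \frac{M_{ab} + M_{bc}}{\left( 3 \cdot (3 - 1) \right) / 2} = \frac{1 + 1}{3} = \frac{2}{3} \approx 0.667
```

Under the same assumption, a fully connected set of three nodes would give 3/3 = 1.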
[4.3] Concept of Indexes

[4.3.1] First Index

The First Index measures the number of nodes common to the central node i and the node j that is being ordered by the recommendation system. It is given by

$$I_{1_{ij}} = |C_i \cap C_j|$$

Here $C_i$ and $C_j$ are the sets of nodes followed by i and j, so the index returns the number of nodes that are followed on Twitter by both the central user and the node being recommended to the central user.

[4.3.2] Second Index

The Second Index measures the cohesion level inside the group formed by the common nodes followed by person i and person j. It is given by

$$I_{2_{ij}} = D_{C_i \cap C_j}$$

The intuition behind this index is that we want to know how densely connected the nodes in the common region are. This directly influences the recommendation. If the common region is very dense, many entries in matrix M have the value 1, resulting in a high second index value. If the common region does not have many links between its nodes, the second index value is low.

[4.3.3] Third Index

The Third Index is a variation of the second index. It measures the cohesion level in the region formed by the group of nodes that are followed by node i or node j. It is given by

$$I_{3_{ij}} = D_{C_i \cup C_j}$$

The intuition here is that a high second index value does not necessarily lead to a high third index value. The third index is independent of the second index.
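
The three indexes can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' Java code: `following` maps a user to the set of users they follow, `links` is a set of node pairs in which each pair stands for one non-zero entry of M, and the function names and sample data are hypothetical. Returning a density of 0 for sets with fewer than two nodes is also an assumption, since the formula is undefined there.

```python
def adjacent_density(nodes, links):
    """Adjacent Density: links inside `nodes` divided by |C| * (|C| - 1) / 2."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0  # assumption: the formula is undefined for fewer than two nodes
    inside = sum(1 for (a, b) in links if a in nodes and b in nodes and a != b)
    return inside / (n * (n - 1) / 2)


def three_indexes(following, links, i, j):
    """Return (I1, I2, I3) for central node i and candidate node j."""
    c_i, c_j = set(following[i]), set(following[j])
    common = c_i & c_j
    i1 = len(common)                          # nodes followed by both
    i2 = adjacent_density(common, links)      # cohesion of the common region
    i3 = adjacent_density(c_i | c_j, links)   # cohesion of the combined region
    return i1, i2, i3


# Hypothetical data: "u" is the central user, "v" the candidate.
following = {"u": {"a", "b", "c"}, "v": {"b", "c", "d"}}
links = {("b", "c"), ("a", "b"), ("c", "d")}
print(three_indexes(following, links, "u", "v"))  # (2, 1.0, 0.5)
```

In this example the common region {b, c} is fully connected, so the second index is 1.0, while the combined region {a, b, c, d} holds 3 of 6 possible links, so the third index is 0.5.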
[4.3.4] Fitness Function

Our main goal is to select good nodes to recommend from a pool of nodes based on a single parameter. Hence, after computing the three index values of a particular node, we convert them into a single value. This conversion is performed by the Fitness Function:

$$M(n, w) = I_1(n_c, n) \cdot w_1 + I_2(n_c, n) \cdot w_2 + I_3(n_c, n) \cdot w_3$$

Here $n_c$ denotes the central node and n the candidate node. In essence, the Fitness Function is a weighted sum of the index values, where w1, w2, w3 are the weights associated with each node. At the start of the algorithm, all these weights have the same value, which is 1. The three weights are fed to the Genetic Algorithm, which tries to optimize their values at every iteration. The Fitness Function values are sorted in decreasing order, and only the nodes with high values are kept for future iterations. Since the function separates the unfit nodes from the fit nodes, it is called the ‘Fitness’ function.

Genetic Algorithm

The genetic algorithm is a probabilistic search algorithm that iteratively transforms a set (called a population) of mathematical objects (binary strings), each with an associated fitness value, into a new population of offspring objects using Darwin's principle of natural selection and operations patterned after naturally occurring genetic operations, such as crossover and mutation.

Figure 5.1: Genetic algorithm flow chart
The main procedure (Fig 5.1) returns the best weight combination for recommending friends to a specific user:

• Generate an initial population with random weight values.
• Until the fitness function value of the best individual of the population does not improve for five generations, do:
  o Evaluate the fitness function for each individual in the population.
  o Exclude the worst individuals in the population according to the fitness function value.
  o Generate new individuals by applying crossover to the remaining individuals.
  o Apply mutation operations to the children.
  o Merge children and parents, eliminating duplicates.
• Return the weight combination of the best individual.

[5.1] Crossover

Selecting individuals from the population and producing offspring from them is known as crossover. Multiple crossover techniques exist, such as single-point crossover, multipoint crossover, and random-point crossover. We implemented the project using single-point crossover. In single-point crossover, a locus is chosen at which the remaining alleles are swapped between the parents. Each new generation replaces the worst individuals with children of the best individuals.

In the flowchart shown below (Fig 5.2), our aim is to select a dog that barks loudly. Weights are distributed among the dogs depending on their barking sound. The loudest barking dog is assigned a weight of 7 and the quietest a weight of 2. The higher-weighted dogs are selected and prepared for crossover. After crossover, the two worst barking dogs are replaced by the children of the best individuals.
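
The main procedure listed at the start of this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the project's implementation: the function name `evolve`, the `elite` cutoff of four survivors, and the toy fitness, crossover, and mutation functions are all hypothetical, and the operators are passed in as parameters so the loop itself stays visible.

```python
def evolve(population, fitness, crossover, mutate, patience=5, elite=4):
    """Return the best individual once fitness stalls for `patience` generations."""
    best = max(population, key=fitness)
    stalled = 0
    while stalled < patience:
        # Evaluate every individual and exclude all but the `elite` fittest.
        parents = sorted(population, key=fitness, reverse=True)[:elite]
        # Apply crossover to each ordered pair of survivors, then mutate the children.
        children = [mutate(crossover(a, b)) for a in parents for b in parents if a != b]
        # Merge children and parents, eliminating duplicates (order preserved).
        population = list(dict.fromkeys(parents + children))
        champion = max(population, key=fitness)
        if fitness(champion) > fitness(best):
            best, stalled = champion, 0
        else:
            stalled += 1
    return best


# Toy run: individuals are (w1, w2, w3) tuples; the index values 3, 0.6, 0.04 are made up.
def toy_fitness(w):
    return 3 * w[0] + 0.6 * w[1] + 0.04 * w[2]


start = [(10, 200, 5), (250, 3, 90), (40, 40, 40), (7, 255, 255)]
best = evolve(
    start,
    toy_fitness,
    crossover=lambda a, b: (a[0], b[1], b[2]),  # gene-level swap, for brevity
    mutate=lambda child: child,                 # mutation disabled to keep the run deterministic
)
print(best)  # (250, 255, 255)
```

Individuals are tuples so that duplicates can be removed with a dictionary; a bit-string representation would work equally well with this loop.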
Figure 5.2: Processing levels inside the genetic algorithm

The function that generates the child population from two parents is summarized as:

• Crossover is applied to each pair from the Cartesian product of the individuals of the elite population.
• Crossover is applied to chromosomes of the same type in the two parents. Two chromosomes generate two new chromosomes, and by combining the three chromosomes from each parent, six different new individuals can be generated. Crossover is performed at a single random cutoff point.

[5.2] Mutation

Mutation is used to ensure that the individuals are not all exactly the same. It gives a child its own traits so that its standing in the population is unique. The probability of mutation is usually between 1 and 2 tenths of a percent.

Figure 5.3: Mutation by flipping bits
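
Bit-flip mutation can be sketched in Python as follows. This is an illustrative sketch, not the project's Java code: the function name `mutate` and the representation of a chromosome as a string of "0" and "1" characters are assumptions, and the rate of 0.04 in the usage lines is the 4% mutation probability reported in the Results section.

```python
import random


def mutate(chromosome, rate, rng=random):
    """Flip each bit of a bit string independently with probability `rate`."""
    flipped = []
    for bit in chromosome:
        if rng.random() < rate:
            flipped.append("1" if bit == "0" else "0")
        else:
            flipped.append(bit)
    return "".join(flipped)


# Usage: a seeded generator makes a run repeatable.
rng = random.Random(42)
child = mutate("110111001111111001101011", 0.04, rng)
print(child)
```

With a 24-bit chromosome and a 4% rate, about one bit is flipped per child on average, so many children pass through unchanged.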
[5.3] Chromosome:

In our project we used a binary genetic algorithm in which each chromosome is represented by the weights w1, w2, and w3. Each individual in a generation is represented by 3 bytes, each ranging from 0 to 255:

Chromosome = [w1 w2 w3]

Initially, weights are assigned at random based on whether the user to be recommended is connected to the central user, as described in Step 5 of Section 6.2.
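
The chromosome layout can be sketched in Python as a pair of helper functions. The function names `encode` and `decode` are hypothetical, and the byte order (w1 in the first byte, most significant bit first) is an assumption; the sample bit string is the first chromosome listed in the Results section.

```python
def decode(chromosome):
    """Split a 24-bit string into the three weights (w1, w2, w3)."""
    if len(chromosome) != 24 or set(chromosome) - {"0", "1"}:
        raise ValueError("expected a 24-character bit string")
    return tuple(int(chromosome[k:k + 8], 2) for k in (0, 8, 16))


def encode(w1, w2, w3):
    """Pack three weights in the range 0..255 into a 24-bit string."""
    for w in (w1, w2, w3):
        if not 0 <= w <= 255:
            raise ValueError("each weight must fit in one byte")
    return "".join(format(w, "08b") for w in (w1, w2, w3))


print(decode("110111001111111001101011"))  # (220, 254, 107)
print(encode(220, 254, 107))               # 110111001111111001101011
```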
Recommendation Procedure

The recommendation procedure is represented in the figure below. It consists of six steps, each explained in detail later:

• Get data from Twitter
• Form the network graph
• Perform graph reduction
• Calculate the three indexes
• Apply Genetic Algorithm
• Recommend whom to follow

Figure 6.1: Recommendation Procedure

[6.1] Recommendation Steps in detail

Step 1: Get Data from Twitter

We first choose the central node, i.e. the user who will receive recommendations. We then collect information on all nodes up to the 4th level, using Twitter4J, a Java library for the Twitter API. The 1st level node is the central user itself. The 2nd level nodes are all those followed by the central user. The 3rd level nodes are those followed by nodes in the 2nd level; thus the nodes recommended to the user belong to level 3. Level 4 nodes are those followed by nodes in level 3 (superset of nodes being recommended to the user). This is shown in Figure 6.2 below:
Figure 6.2: Network View

Step 2: Form the Network Graph

To calculate the index values, we first need the adjacency matrix, which can be obtained only from a graph. Hence, once we have the information on all users, we form the network graph so that the index values can be found. This step is particularly important for calculating the second and third indexes.

Step 3: Perform Graph Reduction

In this step we remove from level 3 all those nodes that also appear in level 2 because of a link between level 2 nodes. Consider nodes A and B at level 2. If A is following B, then from the central user's perspective B also falls in level 3, so our system might ask the central user to follow B. But B is already followed by the central user and should not appear in the list of recommended nodes. Hence we discard such nodes, which is why this step is called graph reduction.

Step 4: Calculate the three indexes

The concept and formulae of the three indexes were explained earlier. For each node we find the three index values and then the Fitness Function value. Thus each node is associated with a single fitness function value.
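
The Fitness Function calculation in Step 4 can be sketched in Python. The function name `fitness` is hypothetical. The index values are those reported for user 71496042 in Table 7.2, and the weights are the three bytes of that user's chromosome in List 7.1, read under the assumption that w1 is the first byte.

```python
def fitness(indexes, weights):
    """Weighted sum I1*w1 + I2*w2 + I3*w3."""
    return sum(index * weight for index, weight in zip(indexes, weights))


indexes = (4, 0.533333333, 0.023470661672908864)  # Table 7.2, user 71496042
weights = (220, 254, 107)  # bytes of 110111001111111001101011 (assumed order w1, w2, w3)
value = fitness(indexes, weights)
print(round(value, 3))  # 1017.978
```

The result agrees with the fitness value of 1017.978 listed for this user in the Results section, which supports the assumed byte order.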
Step 5: Apply Genetic Algorithm

As mentioned previously, the three weights are given as input to the Genetic Algorithm, which tries to optimize them in every iteration. In our case each chromosome represents an individual in the population. Each chromosome has three genes, represented by the weights w1, w2, w3 assigned to the indexes I1, I2, I3 respectively, and each gene is 8 bits long. If the level 1 user is directly followed by a level 3 user, we assign more weight to that level 3 user: its weight values are 127 or more. Our goal is to optimize the weights such that the fittest of the population are recommended to the user. Before starting each iteration we keep only those nodes with high fitness function values. We take the weights of these nodes and apply the Crossover and Mutation techniques to get optimized weights.

Step 6: Recommend whom to follow

Finally, after running the Genetic Algorithm for four iterations, we obtain a set of nodes with their fitness function values. We arrange these nodes in decreasing order of fitness function value and recommend the top nodes to the user.

[6.2] Pseudo Code

We implemented the program in Java. The pseudo code for the steps defined earlier is shown below:

Step 3: Perform Graph Reduction
For each user in level-3
    if user equals level-2 user then
        remove from level-3
    end
end

Step 4: Calculate the three indexes
For each user in level-3
    Calculate First Index, Second Index, Third Index
    Check if level-3 user is connected to level-1
    if so set flag to Y
    end
end
Step 5: Apply Genetic Algorithm
Initial step, performed only once for a central user node: assign random weights to the nodes and form the initial chromosomes.
For each user in level-3
    If user is directly connected to level-1 user then
        assign random weight ≥127 corresponding to each index
    Else
        assign random weight <127 corresponding to each index
    end
End
Step 1: Create the population using the weights.
Step 2: Select two random people from the population at a time until the count equals the population size.
Step 3: Perform crossover operation to get offspring.
Step 4: Perform mutation operation on offspring.
Step 5: Calculate Fitness Function for each person in the population.
Step 6: Sort the population in descending order of Fitness Function value.
Step 7: Discard the least fit half of the population.
Step 8: Repeat Step 2 to Step 7 for four iterations.

Step 6: Recommend whom to follow
Select the unique users out of the remaining population for recommendation. Give preference according to their Fitness Function value.

Results:

[7.1] Dataset:

For our testing we collected 11,456 user Ids related to a single user. We found that the number of users in level 3 of the network graph was 59. After graph reduction the number of recommendable users dropped to 56.

The total size of the chromosome is 24 bits. The Genetic Algorithm is applied for 4 generations so that the characteristics of the fittest parents are passed on to the rest of the population. There is a tradeoff between the number of generations and the accuracy of the results: more generations give better output but increase the number of computations. The mutation probability is set to 4%.

Total number of users    Number of users in level 3    Number of recommendable users
11,456                   59                            56

Table 7.1 Population
The accuracy of our recommendation system is limited because we were not able to get private user data. Hence, some users who had a better chance of being recommended were neglected because we could not collect whom they were following.

[7.2] Network Graphs:

Connectivity Graph:

Figure 7.1: Connectivity Graph

Figure 7.1 represents the connectivity graph of the 11,456 users. The network is quite dense, which agrees with the social network principle. Because the graph is so dense, individual connections between users cannot be distinguished.

Common Users:

In Figure 7.2, user 73048275 is the central user for whom users are recommended, and user 45597677 belongs to level 3 and is a candidate for recommendation; the rest of the nodes are the common users between the central user and the level 3 user. Figure 7.2 shows that no link exists between users 73048275 and 45597677. After running all iterations, user 45597677 is recommended to user 73048275, since the central user is not already following him. Table 7.2 gives a 1st index of 3 for this user, which agrees with Figure 7.2, which shows three common users between 73048275 and 45597677. This connectivity graph is also used in the calculation of the 2nd Index.
Figure 7.2: Common Users

Sub Graph:

Figure 7.3: Sub Graph

Figure 7.3 represents the connectivity graph of the common and non-common users followed by the central user and the level 3 user. This connectivity graph is used in the calculation of the 3rd Index.
Crossover:

We show here a glimpse of how Crossover and Mutation work. The result shown below is after the first iteration of the Genetic Algorithm applied to the user with id 73048275. The chromosomes of 2 nodes from level 3 are shown below:

The crossover point for child 1 is 3. This means that the child inherits the first three bits from one parent and the next five bits from the other parent. This holds for each byte of the chromosome. For example, the first three bits of the first byte in Child 1 come from Parent 1 and the remaining five bits come from Parent 2. This is shown by the violet line.

Similarly, the crossover point for child 2 is 5. This means that the child inherits five bits from one parent and the next three bits from the other parent. This holds for each byte of the chromosome. For example, the first five bits of the second byte in Child 2 come from Parent 2 while the remaining three bits come from Parent 1. This is shown by the red line.

The next step is to perform Mutation on the chromosomes of the child nodes. The result is shown below:

Since the probability of mutating a bit is 4%, only one bit gets flipped in the chromosome of child 1. The chromosome of child 2 remains unaffected.
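
The per-byte single-point crossover described in this section can be sketched in Python. This is an illustration under stated assumptions: the function name `crossover_per_byte` is hypothetical, and because the parent chromosomes of the original example are not reproduced in the text, two chromosomes from List 7.1 are used as stand-in parents.

```python
def crossover_per_byte(first, second, point):
    """For every byte, take the bits before `point` from `first` and the rest from `second`."""
    child = []
    for start in range(0, len(first), 8):
        child.append(first[start:start + point])
        child.append(second[start + point:start + 8])
    return "".join(child)


# Stand-in parents (chromosomes of users 71496042 and 45597677 in List 7.1).
parent1 = "110111001111111001101011"
parent2 = "111110011110101101110000"

child1 = crossover_per_byte(parent1, parent2, 3)  # 3 bits from parent 1, 5 from parent 2
child2 = crossover_per_byte(parent2, parent1, 5)  # 5 bits from parent 2, 3 from parent 1
print(child1)  # 110110011110101101110000
print(child2)  # 111111001110111001110011
```

Applying the same cut to each byte keeps every weight a mix of the corresponding weights of both parents, rather than passing whole weights through unchanged.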
Recommendation:

We have tested our recommendation system on numerous users, and it mostly gave successful results. Our recommendation system recommends users to follow from the user's network graph. For recommending a user, various factors are considered:

1) Is the level 1 user directly followed by the level 3 user?
2) How many common people do the level 3 user and the level 1 user follow (1st and 2nd Index)?
3) What is the spread of the level 3 user (3rd Index), i.e. the number of users he is following?

How these factors influence the user recommendation process: If a level 3 user has more users in common with the level 1 user, he has a higher chance of being recommended. His chances improve further if he follows the level 1 user. If the level 3 user does not have many users in common with the level 1 user, the 3rd index comes into play: the greater his spread, the greater his chance of being recommended.

Recommendation List 1:

Recommendation for user with user Id: 73048275
The value within the brackets represents the fitness value of that user.

• UserId 71496042 user name ankit goel chromosome (1017.978) 110111001111111001101011
• UserId 75572867 user name Abhishek Agarwal chromosome (644.8637) 111100011110101101110000
• UserId 45597677 user name Arun Kartha chromosome (892.2909) 111110011110101101110000
• UserId 33821494 user name Ishaan Guliani chromosome (434.57565) 011010001100110000001101
• UserId 60321463 user name LateNightTales chromosome (321.128) 001000110010010101111101
• UserId 41877111 user name harman chromosome (135.34) 011000100111000001101011

List 7.1 Output 1
User Id     1st Index    2nd Index             3rd Index               Level 1
71496042    4            0.533333333           0.023470661672908864    y
75572867    2            0.576                 0.0553306342780027      y
45597677    3            0.6                   0.03831168831168831     n
33821494    3            0.6                   0.013512304250559284    n
41877111    2            0.6666666666666666    0.03333333333333333     n
60321463    1            0.4321                0.02666666666666667     n

Table 7.2 Indexes

Table 7.2 lists the individuals to be recommended to the central user. Users are arranged taking into account each index value. The character in the last column indicates whether the user is following the central user: y means the recommended user follows the central user, and n means he does not.

List 7.1 shown above is the actual recommended list for the user with Id 73048275. The priority of each recommended user is based on his fitness, which is calculated through the genetic algorithm. Comparing the users in Table 7.2, we can observe that all three indexes are important. However, being directly connected to the level 1 user plays a vital role, since higher weights are assigned to users who are directly connected to the level 1 user, which agrees with the principle behind our recommendation system. Users 45597677 and 33821494 have the same 1st and 2nd index, but their network spread (3rd index) ultimately gave preference to user 45597677.

Recommended List 2

Recommendation for user with user Id: 113943142

• UserId 56304605 user name Rajdeep Sardesai chromosome (495.9645) 100110110011001100100101
• UserId 145125358 user name Amitabh Bachhan chromosome (428.36288) 011000001101111011010111
• UserId 135421739 user name sachin tendulkar chromosome (416.52875) 011101010110001101010110
• UserId 113609977 user name Umal Muranjan chromosome (401.42856) 011000001101111011010111
• UserId 161304900 user name Polynamous chromosome (315.66666) 011000001101111011010111
• UserId 97865628 user name Farhan Akhtar chromosome (311.89334) 011101100111000100001110
• UserId 116135959 user name Viral Desai chromosome (200.155) 011011010000110000010111
• UserId 6509832 user name CNN-IBN News chromosome (194.233) 010000001101111011010111

The above list is an actual recommended list for a different user (user 2). Here most of the users being recommended are celebrities. This is because none of the users in level 3 directly follow the level 1 user.

Conclusion:

We have proposed and implemented a new way to recommend whom a user can follow. This method differs from the conventional Twitter recommendation system. Unlike Twitter, which gives more importance to famous personalities, we gave equal importance to all nodes. Our approach is based on the Friends-of-Friends approach. The effectiveness of our system is somewhat limited because we were not able to retrieve private user information. The Genetic Algorithm used is difficult and complex; nevertheless we get good results.

Improvements

One major problem we faced was that we cannot access data from private users. Hence, we had to ignore private users.

Another problem we faced was rate limiting. To prevent abuse through many sequential searches in a given period of time, Twitter blocks requests from a machine with the same IP address. As a result, we could not run the algorithm on users who followed a large number of nodes.

This problem could be solved using the Twitter Streaming API, which does not have this rate limiting problem. The drawback is that we would get access only to those users who are currently tweeting.
References

• A Graph-Based Friend Recommendation System Using Genetic Algorithm, Nitai B. Silva, Ing-Ren Tsang, George D.C. Cavalcanti, and Ing-Jyh Tsang.
• Introduction to social network methods, Robert A. Hanneman and Mark Riddle.
• Practical Genetic Algorithms, Randy L. Haupt, S. E. Haupt.
• twitter4j, http://twitter4j.org
• Jung, http://jung.sourceforge.net/