A basic explanation of graph mining for social network analysis (SNA). It describes some SNA metrics and benefits, focusing on the telecommunications field. The slides also include a basic Spark GraphX script for analysing the graph.
2. Definition
From the point of view of data mining, a social network is a heterogeneous and
multirelational data set represented by a graph. The graph is typically very
large, with nodes (or vertices) corresponding to objects and edges
corresponding to links representing relationships or interactions between
objects. Both nodes and links have attributes
(Han & Kamber, 2006).
[Figure: two subscriber nodes connected by an edge. Example edge types: call, SMS, IM, balance transfer, …; mention, follow, like, …]
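The definition above can be made concrete with a minimal sketch in plain Python (this is not the Spark GraphX script the slides refer to). The subscriber ids, attribute names, and values are invented for illustration; the only point is that both vertices and edges carry attributes and that one pair of subscribers can be linked by several relation types.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    vertex_id: str
    attrs: dict = field(default_factory=dict)  # e.g. device, ARPU


@dataclass
class Edge:
    src: str
    dst: str
    relation: str  # e.g. "call", "sms", "balance_transfer"
    attrs: dict = field(default_factory=dict)  # e.g. minutes, count


# Hypothetical subscribers and interactions (illustrative values only).
vertices = {
    "A": Vertex("A", {"device": "smartphone", "arpu": 12.5}),
    "B": Vertex("B", {"device": "feature phone", "arpu": 4.0}),
}
edges = [
    Edge("A", "B", "call", {"minutes": 42}),
    Edge("A", "B", "sms", {"count": 17}),
    Edge("B", "A", "balance_transfer", {"amount": 5}),
]

# Multirelational: collect the relation types seen for each ordered pair.
relations = {}
for e in edges:
    relations.setdefault((e.src, e.dst), set()).add(e.relation)
```

Here `relations[("A", "B")]` holds both `call` and `sms`, which is what "multirelational" means in practice.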
3. Benefits of SNA
Identify the role of a subscriber in the community:
• Community leader
• Bridge
• Passive
• Follower
Identify high-value/prospect communities by looking at:
• Community size
• Closeness
• Members' profiles (device, usage, ARPU, location)
• On-net/off-net share in the community
Suspected same subscriber: compare two social networks to identify the single identity of a subscriber who appears in both.
Further utilization:
• New product campaigns targeting community leaders, bridges, and high-value communities
• Retention program prioritization for community leaders, bridges, and high-value communities
• Product adoption campaigns for followers in communities that have already adopted the product
• Identifying rotational churners to exclude them from retention campaigns, or to evaluate dealers
• SN variables can be used to enhance other predictive models. For example, social network variables can increase the lift of a churn model for high-value customers (Imaduddin, 2014)
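The slide does not say how two social networks are compared in the "suspected same subscriber" case. One simple possibility, offered here only as an assumption, is to compare the contact sets of two numbers with Jaccard similarity: a number in the new network whose contacts largely overlap with those of a number in the old network may belong to the same person. The phone ids, contact ids, and the 0.5 threshold below are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two contact sets (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical contact sets per number in two networks (e.g. two periods).
old_network = {"0811": {"x", "y", "z", "w"}, "0822": {"p", "q"}}
new_network = {"0899": {"x", "y", "z", "v"}, "0877": {"q", "r", "s"}}

THRESHOLD = 0.5  # assumed cut-off; would need tuning on real data

suspected = [
    (old, new, round(jaccard(c_old, c_new), 2))
    for old, c_old in old_network.items()
    for new, c_new in new_network.items()
    if jaccard(c_old, c_new) >= THRESHOLD
]
```

With this toy data only the pair `0811` / `0899` passes the threshold (3 shared contacts out of 5 distinct ones). The same idea could flag rotational churners, but that use is an inference, not something the slide states.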
4. Social Network Graph Mining
By mining the graph of a social network, we can extract valuable information such as:
• Degree (in-degree, out-degree, max-degree). Degree is the number of edges attached to a vertex (node). A vertex with a high in-degree receives information from many others; a vertex with a high out-degree sends information to many others.
• PageRank. PageRank measures the importance of each vertex in a graph. If a Twitter user is followed by many others, the user will be ranked highly. For a CDR-based social network, reverse the graph direction before using the PageRank function to identify the important vertices.
• Local clustering coefficient (LCC). LCC represents how tightly knit a customer's network is. The higher the LCC, the closer the network. The LCC calculation is derived from the triangle count of each vertex.
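The degree and PageRank measures can be illustrated with a small plain-Python sketch. It is not the GraphX implementation: the power iteration, the damping factor of 0.85, the fixed 50 iterations, and the even redistribution of rank from vertices without outgoing edges are all assumptions made for this example, and the call list is invented.

```python
from collections import Counter

# Hypothetical CDR edges: (caller, callee).
calls = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "A")]

out_degree = Counter(src for src, _ in calls)
in_degree = Counter(dst for _, dst in calls)
max_out_degree = max(out_degree.values())


def pagerank(edges, damping=0.85, iterations=50):
    """Simple power-iteration PageRank over a list of directed edges."""
    nodes = sorted({v for edge in edges for v in edge})
    n = len(nodes)
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by vertices with no outgoing edge is spread evenly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Reverse the direction first, as suggested for CDR-based networks,
# so that rank flows towards the subscribers who place the calls.
reversed_calls = [(dst, src) for src, dst in calls]
rank = pagerank(reversed_calls)
top_vertex = max(rank, key=rank.get)
```

In this toy graph `A` calls three different subscribers, so after reversing the edges `A` receives the most rank and comes out as `top_vertex`.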
$$\mathrm{LCC} = \frac{\#\text{triangle}}{\binom{n}{2}}, \qquad n = \#\text{neighbour}$$
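As a worked example of the LCC formula, the sketch below counts, for each vertex, the connected pairs among its neighbours (the triangles through that vertex) and divides by the number of possible pairs, C(n, 2). It treats the graph as undirected and returns 0.0 for vertices with fewer than two neighbours; both choices are assumptions, as is the toy edge list.

```python
from itertools import combinations
from math import comb

# Hypothetical undirected edges between subscribers.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]

neighbours = {}
for u, v in edges:
    neighbours.setdefault(u, set()).add(v)
    neighbours.setdefault(v, set()).add(u)


def lcc(vertex):
    """Local clustering coefficient: #triangle / C(n, 2), n = #neighbour."""
    nbrs = neighbours[vertex]
    n = len(nbrs)
    if n < 2:
        return 0.0  # assumed convention: no pair of neighbours to connect
    triangles = sum(
        1 for u, v in combinations(sorted(nbrs), 2) if v in neighbours[u]
    )
    return triangles / comb(n, 2)
```

For vertex `A` there are three neighbours (`B`, `C`, `D`) and only the pair `B`-`C` is connected, so the result is 1/3; for `B`, whose two neighbours `A` and `C` are connected, it is 1.0.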
14. Solving Real World Problem
• Define the vertices. Are they subscribers, web pages, Twitter accounts?
• Define the edges, i.e. how the vertices are connected. E.g. total call minutes in a month > 5 minutes, SMS > 10, etc.
• Identify the mega hubs. A mega hub is a vertex that is connected to a massive number of vertices (something like a call center or a spammer). Mega hubs can be removed or processed separately, depending on the problem.
• Identify the measures needed (PageRank, degree, LCC, triangle count, etc.)
• Build the data source (keep the vertex property data and the connection data separate, and join them later), and store it distributed on Hadoop.
• Build the code, run it, and feed the results back to the data warehouse or Hadoop for further utilization
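The edge-definition and mega-hub steps above can be sketched in plain Python on a toy data set. The call-minute and SMS thresholds come from the slide's example; treating them as alternatives ("or"), the mega-hub cut-off of 4 connections, and all subscriber ids and usage figures are assumptions for illustration. On real data this logic would run on the distributed platform rather than in memory.

```python
from collections import Counter

# Hypothetical monthly usage: (caller, callee, call minutes, SMS count).
usage = [
    ("A", "B", 12.0, 0),
    ("A", "C", 1.0, 15),
    ("B", "C", 0.5, 2),  # below both thresholds, so no edge
    ("HUB", "A", 6.0, 0),
    ("HUB", "B", 7.0, 0),
    ("HUB", "C", 8.0, 0),
    ("HUB", "D", 9.0, 0),
]

MIN_CALL_MINUTES = 5  # from the slide's example
MIN_SMS = 10          # from the slide's example
MEGA_HUB_DEGREE = 4   # assumed cut-off for this toy data set

# Step 1: an edge exists when either usage threshold is exceeded.
edges = [
    (src, dst)
    for src, dst, minutes, sms in usage
    if minutes > MIN_CALL_MINUTES or sms > MIN_SMS
]

# Step 2: find vertices connected to an unusually large number of others.
degree = Counter()
for src, dst in edges:
    degree[src] += 1
    degree[dst] += 1
mega_hubs = {v for v, k in degree.items() if k >= MEGA_HUB_DEGREE}

# Step 3: drop every edge that touches a mega hub before computing measures.
clean_edges = [
    (src, dst)
    for src, dst in edges
    if src not in mega_hubs and dst not in mega_hubs
]
```

After filtering, only the edges `A`-`B` and `A`-`C` remain; `HUB` is set aside and could be analysed separately if the problem calls for it.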
15. References & Resources
• Han, J., & Kamber, M. (2006). Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann.
• Imaduddin, G. (2014). Evaluation and Improvement of Churn Model Using Customer Value and Social Network. Jakarta: Universitas Indonesia.
• Apache Spark Overview. https://spark.apache.org/docs/latest/
• Databricks Training Resources. https://databricks.com/spark-training-resources
• GraphX Programming Guide. https://spark.apache.org/docs/latest/graphx-programming-guide.html
• Social Network Analysis. http://en.wikipedia.org/wiki/Social_network_analysis
• Spark Scala API Doc. https://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.package
• The Scala Programming Language. http://www.scala-lang.org/