Social Network Analysis at
         Linkedin
                 Mitul Tiwari
     Search, Network, and Analytics (SNA)
                  LinkedIn
                      1
Who am I?




    2
Outline


About Me

About LinkedIn

Analytics products at LinkedIn

Fun Facts: Sequencing the Startup DNA


                    3
LinkedIn by the numbers

120,000,000+ users (August 4, 2011)

2+ new user registrations per second

81+ Million monthly unique users* (Comscore)

2 Billion People Searches in 2010

7.1 Billion page views in Q2 2010

2+ Million companies with LinkedIn Company Pages

2+ M LinkedIn Groups

                            4
Broad Range of Products




           5
User Profile




     6
Connections




     7
Communications

Communications




                        10


                 8
Hiring Solutions




       9
Job Search




    10
Company Pages




      11
Outline


About Me

About LinkedIn

Analytics products at LinkedIn

Fun Facts: Sequencing the Startup DNA


                    12
What do I mean by Analytics Products?




                 13
People You May Know




         14
Profile Stats: WVMP




        15
Viewers of this profile also ...




               16
Skills




  17
InMaps




  18
Related Searches


Millions of Searches everyday

Help users to explore and refine their queries




                     19
Analytics Products: Key Ideas

Recommendations
 People You May Know, Related Searches, Viewers of
 this profile ...

Insight
 Profile Stats: Who Viewed My Profile, Skills

Visualization
 InMaps
                      20
Analytics Products: Challenges


 LinkedIn: largest professional network

 120+ million members on LinkedIn

 Billions of pageviews

 Terabytes of data to process


                      21
Outline


Analytics products at LinkedIn

Deep Dive - Connection Strength

Deep Dive - Related Searches

Fun Facts: Sequencing the Startup DNA


                    22
Connection Strength


Let’s build “Connection Strength”

Systems and Tools we use

Hadoop MapReduce

Managing workflow

Serving data in production

                     23
Connection Strength


How well do people know each other?

Connection Strength

Application: reorder updates to show updates
from close connections



                      24
Connection Strength
  How well do
                          Alice
people know each
     other?


               Bob                Carol




                     25
Connection Strength
  How well do
                          Alice
people know each
     other?


               Bob                Carol




                     26
Connection Strength
  How well do
                               Alice
people know each
     other?


               Bob                     Carol



                   Triangle closing


                          27
Connection Strength
  How well do
                              Alice
people know each
     other?


               Bob                    Carol



                 Triangle closing
Prob(Bob knows Carol) ~ the # of common connections

                         28
Triangle Closing in Pig
-- connections in (source_id, dest_id) format in both directions
connections = LOAD `connections` USING PigStorage();
group_conn = GROUP connections BY source_id;
pairs = FOREACH group_conn GENERATE
        generatePair(connections.dest_id) as (id1, id2);

common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
              flatten(group) as (source_id, dest_id),
              COUNT(pairs) as common_connections;
STORE common_conn INTO `common_conn` USING PigStorage();


                                      29
Pig Overview
Load: load data, specify format

Store: store data, specify format

Foreach, Generate: Projections, similar to select

Group by: group by column(s)

Join, Filter, Limit, Order, ...

User Defined Functions (UDFs)
                        30
Triangle Closing in Pig
-- connections in (source_id, dest_id) format in both directions
connections = LOAD `connections` USING PigStorage();
group_conn = GROUP connections BY source_id;
pairs = FOREACH group_conn GENERATE
        generatePair(connections.dest_id) as (id1, id2);

common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
              flatten(group) as (source_id, dest_id),
              COUNT(pairs) as common_connections;
STORE common_conn INTO `common_conn` USING PigStorage();


                                      31
Triangle Closing in Pig
-- connections in (source_id, dest_id) format in both directions
connections = LOAD `connections` USING PigStorage();
group_conn = GROUP connections BY source_id;
pairs = FOREACH group_conn GENERATE
        generatePair(connections.dest_id) as (id1, id2);

common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
              flatten(group) as (source_id, dest_id),
              COUNT(pairs) as common_connections;
STORE common_conn INTO `common_conn` USING PigStorage();


                                      32
Triangle Closing in Pig
-- connections in (source_id, dest_id) format in both directions
connections = LOAD `connections` USING PigStorage();
group_conn = GROUP connections BY source_id;
pairs = FOREACH group_conn GENERATE
        generatePair(connections.dest_id) as (id1, id2);

common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
              flatten(group) as (source_id, dest_id),
              COUNT(pairs) as common_connections;
STORE common_conn INTO `common_conn` USING PigStorage();


                                      33
Triangle Closing in Pig
-- connections in (source_id, dest_id) format in both directions
connections = LOAD `connections` USING PigStorage();
group_conn = GROUP connections BY source_id;
pairs = FOREACH group_conn GENERATE
        generatePair(connections.dest_id) as (id1, id2);

common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
              flatten(group) as (source_id, dest_id),
              COUNT(pairs) as common_connections;
STORE common_conn INTO `common_conn` USING PigStorage();


                                      34
Triangle Closing in Pig
-- connections in (source_id, dest_id) format in both directions
connections = LOAD `connections` USING PigStorage();
group_conn = GROUP connections BY source_id;
pairs = FOREACH group_conn GENERATE
        generatePair(connections.dest_id) as (id1, id2);

common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
              flatten(group) as (source_id, dest_id),
              COUNT(pairs) as common_connections;
STORE common_conn INTO `common_conn` USING PigStorage();


                                      35
Triangle Closing Example
                                   Alice




                  Bob                       Carol

                               connections = LOAD `connections` USING
1.(A,B),(B,A),(A,C),(C,A)      PigStorage();
2.(A,{B,C}),(B,{A}),(C,{A})
3.(A,{B,C}),(A,{C,B})
4.(B,C,1), (C,B,1)
                              36
Triangle Closing Example
                                    Alice




                  Bob                         Carol


1.(A,B),(B,A),(A,C),(C,A)
                              group_conn = GROUP connections BY
2.(A,{B,C}),(B,{A}),(C,{A})   source_id;
3.(A,{B,C}),(A,{C,B})
4.(B,C,1), (C,B,1)
                               37
Triangle Closing Example
                                     Alice




                  Bob                             Carol


1.(A,B),(B,A),(A,C),(C,A)
2.(A,{B,C}),(B,{A}),(C,{A})
                              pairs = FOREACH group_conn GENERATE
3.(A,{B,C}),(A,{C,B})         generatePair(connections.dest_id) as (id1, id2);
4.(B,C,1), (C,B,1)
                                38
Triangle Closing Example
                                     Alice




                  Bob                           Carol


1.(A,B),(B,A),(A,C),(C,A)
2.(A,{B,C}),(B,{A}),(C,{A})   common_conn = GROUP pairs BY (id1, id2);
                              common_conn = FOREACH common_conn
3.(A,{B,C}),(A,{C,B})         GENERATE flatten(group) as (source_id, dest_id),
4.(B,C,1), (C,B,1)            COUNT(pairs) as common_connections;
                                39
Connection Strength



Age

Company Overlap

School Overlap



                  40
Related Searches


Let’s build “Related Searches”

Systems and Tools we use

Hadoop MapReduce

Managing workflow

Serving data in production

                      41
Systems and Tools


Kafka (LinkedIn)

Hadoop (Apache)

Azkaban (LinkedIn)

Voldemort (LinkedIn)


                     42
Data Cycle




    43
Systems and Tools

Kafka
 publish-subscribe messaging system

 transfer data from production to HDFS

Hadoop

Azkaban

Voldemort

                      44
Systems and Tools

Kafka

Hadoop
 Java MapReduce and Pig

 process data

Azkaban

Voldemort

                    45
Systems and Tools

Kafka

Hadoop

Azkaban
 Hadoop workflow management tool

 to manage hundreds of Hadoop jobs

Voldemort

                     46
Systems and Tools

Kafka

Hadoop

Azkaban

Voldemort
 Key-value store

 store output of Hadoop jobs and serve in production

                      47
Related Searches


Let’s build “Related Searches”

Systems and Tools we use

Hadoop MapReduce

Managing workflow

Serving data in production

                      48
Hadoop MapReduce

Large-scale distributed data processing

Map phase: partition the problem in smaller
steps

Reduce phase: aggregate the output of smaller
steps

Allows parallelism and fault-tolerance

                     49
Related Searches


Let’s build “Related Searches”

Systems and Tools we use

Hadoop MapReduce

Managing workflow

Serving data in production

                      50
Our Workflow

         Triangle
Age                   School Overlap
         Closing



         Union




       push-to-prod



                 51
Our Workflow

                Triangle
  Age                        School Overlap
                Closing



                Union




push-to-qa    push-to-prod



                        52
Sample Workflow




      53
Azkaban Workflow Management

  Dependency management
  Regular Scheduling
  Monitoring
  Diverse jobs: Java, Pig, Clojure
  Configuration/Parameters
  Resource control/locking
  Restart/Stop/Retry
  Visualization
  History
  Logs
                           54
Our Workflow

         Triangle
Age                   School Overlap
         Closing



         Union




       push-to-prod



                 55
Our Workflow

         Triangle
Age                   School Overlap
         Closing



         Union




       push-to-prod



                 56
Related Searches


Let’s build “Related Searches”

Systems and Tools we use

Hadoop MapReduce

Managing workflow

Serving data in production

                      57
Voldemort

Large amount of data/Scalable

Quick lookup/low latency

Versioning and Rollback

Fault tolerance through replication

Read only

Offline index building

                        58
Voldemort RO Store




        59
Outline


Analytics products at LinkedIn

Deep Dive - Connection Strength

Deep Dive - Related Searches

Fun Facts: Sequencing the Startup DNA


                    60
Related Searches


Millions of Searches everyday

Help users to explore and refine their queries




                     61
How to build Related Searches?

 Collaborative Filtering: searches done in the
 same session



        Q1       Q2        Q3    Q4

             Time



                      62
How to build Related Searches?

 Searches correlated by results clicks


         Q1                 R1




         Qn                 Rm


                       63
How to build Related Searches?

 Searches with overlapping terms



 Q1     Jeff LinkedIn


 Q2         LinkedIn CEO




                        64
Outline


Analytics products at LinkedIn

Deep Dive - Connection Strength

Deep Dive - Related Searches

Fun Facts: Sequencing the Startup DNA


                    65
Sequencing the Startup DNA




            Done by Monica Rogati at
                   LinkedIn


            66
Sequencing the Startup DNA

•Top majors: Entrepreneurship, Computer Engineering
•Bottom: Nursing, Administration




                         67
Sequencing the Startup DNA

•Most founders age at first startup between 20 and 40




                          68
Sequencing the Startup DNA

•Top regions: San Francisco, NYC




                         69
Sequencing the Startup DNA


•Top Business Schools:
Stanford, Harvard, MIT Sloan




                         70
Things Covered


Analytics products at LinkedIn

Deep Dive - Connection Strength

Deep Dive - Related Searches

Fun Facts: Sequencing the Startup DNA


                    71
SNA Team



Thanks to SNA Team at LinkedIn

http://sna-projects.com

We are hiring!



                    72
Questions?




    73

Social Network Analysis at LinkedIn

  • 1.
    Social Network Analysisat Linkedin Mitul Tiwari Search, Network, and Analytics (SNA) LinkedIn 1
  • 2.
  • 3.
    Outline About Me About LinkedIn Analyticsproducts at LinkedIn Fun Facts: Sequencing the Startup DNA 3
  • 4.
    LinkedIn by thenumbers 120,000,000+ users (August 4, 2011) 2+ new user registrations per second 81+ Million monthly unique users* (Comscore) 2 Billion People Searches in 2010 7.1 Billion page views in Q2 2010 2+ Million companies with LinkedIn Company Pages 2+ M LinkedIn Groups 4
  • 5.
    Broad Range ofProducts 5
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.
    Outline About Me About LinkedIn Analyticsproducts at LinkedIn Fun Facts: Sequencing the Startup DNA 12
  • 13.
    What do Imean by Analytics Products? 13
  • 14.
  • 15.
  • 16.
    Viewers of thisprofile also ... 16
  • 17.
  • 18.
  • 19.
    Related Searches Millions ofSearches everyday Help users to explore and refine their queries 19
  • 20.
    Analytics Products: KeyIdeas Recommendations People You May Know, Related Searches, Viewers of this profile ... Insight Profile Stats: Who Viewed My Profile, Skills Visualization InMaps 20
  • 21.
    Analytics Products: Challenges LinkedIn: largest professional network 120+ million members on LinkedIn Billions of pageviews Terabytes of data to process 21
  • 22.
    Outline Analytics products atLinkedIn Deep Dive - Connection Strength Deep Dive - Related Searches Fun Facts: Sequencing the Startup DNA 22
  • 23.
    Connection Strength Let’s build“Connection Strength” Systems and Tools we use Hadoop MapReduce Managing workflow Serving data in production 23
  • 24.
    Connection Strength How welldo people know each other? Connection Strength Application: reorder updates to show updates from close connections 24
  • 25.
    Connection Strength How well do Alice people know each other? Bob Carol 25
  • 26.
    Connection Strength How well do Alice people know each other? Bob Carol 26
  • 27.
    Connection Strength How well do Alice people know each other? Bob Carol Triangle closing 27
  • 28.
    Connection Strength How well do Alice people know each other? Bob Carol Triangle closing Prob(Bob knows Carol) ~ the # of common connections 28
  • 29.
    Triangle Closing inPig -- connections in (source_id, dest_id) format in both directions connections = LOAD `connections` USING PigStorage(); group_conn = GROUP connections BY source_id; pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) as (id1, id2); common_conn = GROUP pairs BY (id1, id2); common_conn = FOREACH common_conn GENERATE flatten(group) as (source_id, dest_id), COUNT(pairs) as common_connections; STORE common_conn INTO `common_conn` USING PigStorage(); 29
  • 30.
    Pig Overview Load: loaddata, specify format Store: store data, specify format Foreach, Generate: Projections, similar to select Group by: group by column(s) Join, Filter, Limit, Order, ... User Defined Functions (UDFs) 30
  • 31.
    Triangle Closing inPig -- connections in (source_id, dest_id) format in both directions connections = LOAD `connections` USING PigStorage(); group_conn = GROUP connections BY source_id; pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) as (id1, id2); common_conn = GROUP pairs BY (id1, id2); common_conn = FOREACH common_conn GENERATE flatten(group) as (source_id, dest_id), COUNT(pairs) as common_connections; STORE common_conn INTO `common_conn` USING PigStorage(); 31
  • 32.
    Triangle Closing inPig -- connections in (source_id, dest_id) format in both directions connections = LOAD `connections` USING PigStorage(); group_conn = GROUP connections BY source_id; pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) as (id1, id2); common_conn = GROUP pairs BY (id1, id2); common_conn = FOREACH common_conn GENERATE flatten(group) as (source_id, dest_id), COUNT(pairs) as common_connections; STORE common_conn INTO `common_conn` USING PigStorage(); 32
  • 33.
    Triangle Closing inPig -- connections in (source_id, dest_id) format in both directions connections = LOAD `connections` USING PigStorage(); group_conn = GROUP connections BY source_id; pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) as (id1, id2); common_conn = GROUP pairs BY (id1, id2); common_conn = FOREACH common_conn GENERATE flatten(group) as (source_id, dest_id), COUNT(pairs) as common_connections; STORE common_conn INTO `common_conn` USING PigStorage(); 33
  • 34.
    Triangle Closing inPig -- connections in (source_id, dest_id) format in both directions connections = LOAD `connections` USING PigStorage(); group_conn = GROUP connections BY source_id; pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) as (id1, id2); common_conn = GROUP pairs BY (id1, id2); common_conn = FOREACH common_conn GENERATE flatten(group) as (source_id, dest_id), COUNT(pairs) as common_connections; STORE common_conn INTO `common_conn` USING PigStorage(); 34
  • 35.
    Triangle Closing inPig -- connections in (source_id, dest_id) format in both directions connections = LOAD `connections` USING PigStorage(); group_conn = GROUP connections BY source_id; pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) as (id1, id2); common_conn = GROUP pairs BY (id1, id2); common_conn = FOREACH common_conn GENERATE flatten(group) as (source_id, dest_id), COUNT(pairs) as common_connections; STORE common_conn INTO `common_conn` USING PigStorage(); 35
  • 36.
    Triangle Closing Example Alice Bob Carol connections = LOAD `connections` USING 1.(A,B),(B,A),(A,C),(C,A) PigStorage(); 2.(A,{B,C}),(B,{A}),(C,{A}) 3.(A,{B,C}),(A,{C,B}) 4.(B,C,1), (C,B,1) 36
  • 37.
    Triangle Closing Example Alice Bob Carol 1.(A,B),(B,A),(A,C),(C,A) group_conn = GROUP connections BY 2.(A,{B,C}),(B,{A}),(C,{A}) source_id; 3.(A,{B,C}),(A,{C,B}) 4.(B,C,1), (C,B,1) 37
  • 38.
    Triangle Closing Example Alice Bob Carol 1.(A,B),(B,A),(A,C),(C,A) 2.(A,{B,C}),(B,{A}),(C,{A}) pairs = FOREACH group_conn GENERATE 3.(A,{B,C}),(A,{C,B}) generatePair(connections.dest_id) as (id1, id2); 4.(B,C,1), (C,B,1) 38
  • 39.
    Triangle Closing Example Alice Bob Carol 1.(A,B),(B,A),(A,C),(C,A) 2.(A,{B,C}),(B,{A}),(C,{A}) common_conn = GROUP pairs BY (id1, id2); common_conn = FOREACH common_conn 3.(A,{B,C}),(A,{C,B}) GENERATE flatten(group) as (source_id, dest_id), 4.(B,C,1), (C,B,1) COUNT(pairs) as common_connections; 39
  • 40.
  • 41.
    Related Searches Let’s build“Related Searches” Systems and Tools we use Hadoop MapReduce Managing workflow Serving data in production 41
  • 42.
    Systems and Tools Kafka(LinkedIn) Hadoop (Apache) Azkaban (LinkedIn) Voldemort (LinkedIn) 42
  • 43.
  • 44.
    Systems and Tools Kafka publish-subscribe messaging system transfer data from production to HDFS Hadoop Azkaban Voldemort 44
  • 45.
    Systems and Tools Kafka Hadoop Java MapReduce and Pig process data Azkaban Voldemort 45
  • 46.
    Systems and Tools Kafka Hadoop Azkaban Hadoop workflow management tool to manage hundreds of Hadoop jobs Voldemort 46
  • 47.
    Systems and Tools Kafka Hadoop Azkaban Voldemort Key-value store store output of Hadoop jobs and serve in production 47
  • 48.
    Related Searches Let’s build“Related Searches” Systems and Tools we use Hadoop MapReduce Managing workflow Serving data in production 48
  • 49.
    Hadoop MapReduce Large-scale distributeddata processing Map phase: partition the problem in smaller steps Reduce phase: aggregate the output of smaller steps Allows parallelism and fault-tolerance 49
  • 50.
    Related Searches Let’s build“Related Searches” Systems and Tools we use Hadoop MapReduce Managing workflow Serving data in production 50
  • 51.
    Our Workflow Triangle Age School Overlap Closing Union push-to-prod 51
  • 52.
    Our Workflow Triangle Age School Overlap Closing Union push-to-qa push-to-prod 52
  • 53.
  • 54.
    Azkaban Workflow Management Dependency management Regular Scheduling Monitoring Diverse jobs: Java, Pig, Clojure Configuration/Parameters Resource control/locking Restart/Stop/Retry Visualization History Logs 54
  • 55.
    Our Workflow Triangle Age School Overlap Closing Union push-to-prod 55
  • 56.
    Our Workflow Triangle Age School Overlap Closing Union push-to-prod 56
  • 57.
    Related Searches Let’s build“Related Searches” Systems and Tools we use Hadoop MapReduce Managing workflow Serving data in production 57
  • 58.
    Voldemort Large amount ofdata/Scalable Quick lookup/low latency Versioning and Rollback Fault tolerance through replication Read only Offline index building 58
  • 59.
  • 60.
    Outline Analytics products atLinkedIn Deep Dive - Connection Strength Deep Dive - Related Searches Fun Facts: Sequencing the Startup DNA 60
  • 61.
    Related Searches Millions ofSearches everyday Help users to explore and refine their queries 61
  • 62.
    How to buildRelated Searches? Collaborative Filtering: searches done in the same session Q1 Q2 Q3 Q4 Time 62
  • 63.
    How to buildRelated Searches? Searches correlated by results clicks Q1 R1 Qn Rm 63
  • 64.
    How to buildRelated Searches? Searches with overlapping terms Q1 Jeff LinkedIn Q2 LinkedIn CEO 64
  • 65.
    Outline Analytics products atLinkedIn Deep Dive - Connection Strength Deep Dive - Related Searches Fun Facts: Sequencing the Startup DNA 65
  • 66.
    Sequencing the StartupDNA Done by Monica Rogati at LinkedIn 66
  • 67.
    Sequencing the StartupDNA •Top majors: Entrepreneurship, Computer Engineering •Bottom: Nursing, Administration 67
  • 68.
    Sequencing the StartupDNA •Most founders age at first startup between 20 and 40 68
  • 69.
    Sequencing the StartupDNA •Top regions: San Francisco, NYC 69
  • 70.
    Sequencing the StartupDNA •Top Business Schools: Stanford, Harvard, MIT Sloan 70
  • 71.
    Things Covered Analytics productsat LinkedIn Deep Dive - Connection Strength Deep Dive - Related Searches Fun Facts: Sequencing the Startup DNA 71
  • 72.
    SNA Team Thanks toSNA Team at LinkedIn http://sna-projects.com We are hiring! 72
  • 73.