SlideShare a Scribd company logo
1 of 39
Counting Fast
                               (Part II)

                                   Sergei Vassilvitskii
                                 Columbia University
                         Computational Social Science
                                       March 8, 2013



Thursday, March 14, 13
Last time

              Counting fast:
                – Quadratic time doesn’t scale
                – Sorting is slightly more than linear
                – Hashing allows you to do membership queries in constant time




                                                   2                      Sergei Vassilvitskii

Thursday, March 14, 13
Today

              Counting on Networks:
                – Large Graphs: Internet, Facebook, Twitter
                – Recommendation Graphs: Netflix, Amazon, etc.




                                                 3              Sergei Vassilvitskii

Thursday, March 14, 13
Friends & Followers

              Given a network:
                – When do people become friends?
                – What factors influence this?




                                                4   Sergei Vassilvitskii

Thursday, March 14, 13
Friends & Followers

              Given a network:
                – When do people become friends?
                – What factors influence this?


              Products:
                – People You May Know (PYMK). Reconnect people, help new users




                                                5                       Sergei Vassilvitskii

Thursday, March 14, 13
Friends & Followers

              Given a network:
                – When do people become friends?
                – What factors influence this?


              Products:
                – People You May Know (PYMK). Reconnect people, help new users
                – Twitter’s who to follow?




                                                6                       Sergei Vassilvitskii

Thursday, March 14, 13
Friends & Followers

              Given a network:
                – When do people become friends?
                – What factors influence this?


              Products:
                – People You May Know (PYMK). Reconnect people, help new users
                – Twitter’s who to follow?


              Recommendations:
                – Netflix, Amazon, etc. (Future lectures)




                                                  7                     Sergei Vassilvitskii

Thursday, March 14, 13
Triadic Closure

              Likely to become friends with:
                – People in similar groups
                – Friends of friends




                                             8   Sergei Vassilvitskii

Thursday, March 14, 13
Defining Tight Knit Circles

              Looking for tight-knit circles:
                – People whose friends are friends themselves


              Why?
                – Network Cohesion: Tightly knit communities foster more trust, social
                  norms. [Coleman ’88, Portes ’88]
                – Structural Holes: Individuals benefit form bridging [Burt ’04, ’07]




                                                   9                          Sergei Vassilvitskii

Thursday, March 14, 13
Clustering Coefficient




                         vs.




                         10    Sergei Vassilvitskii

Thursday, March 14, 13
Clustering Coefficient



       cc (        ) = 0.5                            cc (   ) = 0.1


                                                vs.




             Given an undirected graph
             - For each node, it’s the fraction of v’s neighbors who are neighbors
             themselves
             - Identical to the number of triangles containing the node


                                                11                          Sergei Vassilvitskii

Thursday, March 14, 13
How to Count Triangles

              Sequential Version:
                    foreach v in V
                        foreach u,w in Adjacency(v)
                           if (u,w) in E
                              Triangles[v]++




                                  v

                                                      Triangles[v]=0




                                           12                     Sergei Vassilvitskii

Thursday, March 14, 13
How to Count Triangles

              Sequential Version:
                    foreach v in V
                        foreach u,w in Adjacency(v)
                           if (u,w) in E
                              Triangles[v]++




                                  v

                                                      Triangles[v]=1
                                                w

                                      u
                                           13                     Sergei Vassilvitskii

Thursday, March 14, 13
How to Count Triangles

              Sequential Version:
                    foreach v in V
                        foreach u,w in Adjacency(v)
                           if (u,w) in E
                              Triangles[v]++




                                  v

                                                      Triangles[v]=1

                         w
                                      u
                                           14                     Sergei Vassilvitskii

Thursday, March 14, 13
How to Count Triangles

              Sequential Version:
                    foreach v in V
                        foreach u,w in Adjacency(v)
                           if (u,w) in E
                              Triangles[v]++


              Running time:
                – For each vertex, look at all pairs of neighbors
                – Number of pairs ~ quadratic in the degree of the vertex


                – What happens if the degree is very large?




                                                  15                        Sergei Vassilvitskii

Thursday, March 14, 13
Parallel Version

              But use 1,000 machines!
                – Quadratic algorithms still don’t scale
                – Simple parallelization: process each vertex separately




              Naive parallelization does not help with data skew
                – Some nodes will have very high degree
                – Example. 3.2 Million followers, must generate 10 Trillion (10^13)
                  potential edges to check.
                – Even if generating 100M edges per second this is 100K seconds ~ 27
                  hours for one vertex!




                                                  16                        Sergei Vassilvitskii

Thursday, March 14, 13
“Just 5 more minutes”

              On the LiveJournal Graph (5M nodes, 70M edges)
                – 80% of vertices are done after 5 min
                – 99% done after 35 min




                                                 17            Sergei Vassilvitskii

Thursday, March 14, 13
Adapting the Algorithm

              Approach 1: Dealing with skew directly
                – currently every triangle counted 3 times (once per vertex)
                – Running time quadratic in the degree of the vertex
                – Idea: Count each once, from the perspective of lowest degree vertex
                – Does this heuristic work?




                                                 18                            Sergei Vassilvitskii

Thursday, March 14, 13
How to Count Triangles Better

              Idea [Schank ’07]
                – Only pivot on nodes who have smaller degrees than both neighbors.
                – Neighbors of high degree nodes tend to have small degrees




                                                19                        Sergei Vassilvitskii

Thursday, March 14, 13
How to Count Triangles Better

              foreach v in V
                 foreach u in Adjacency(v) with deg(u) > deg(v):
                     foreach w in Adjacency(v) with deg(w) > deg(v):
                       if (u,w) is an edge:
                          Triangles[v]++
                          Triangles[w]++
                          Triangles[u]++




                                       20                   Sergei Vassilvitskii

Thursday, March 14, 13
Does it make a difference?




                         21      Sergei Vassilvitskii

Thursday, March 14, 13
Why does it help?

              Look at two different kinds of nodes:
                – Few friends:
                         • OK to be quadratic on small instances
                – Lots of friends
                         • Only care about number of friends with even more friends!
                         • Cannot have too many (can make this formal)




                                                            22                         Sergei Vassilvitskii

Thursday, March 14, 13
Break




                         23   Sergei Vassilvitskii

Thursday, March 14, 13
Working in Parallel

              MapReduce (review):


              Map:
                – Decide how to group the data for computation

              Reduce:
                – Given the grouping, perform the computation




                                               24                Sergei Vassilvitskii

Thursday, March 14, 13
Building People You May Know

              Friendships are undirected:
                – If Alice knows Bob, Bob knows Alice
                – Data stored as a list of all edges
                – Find all friends of friends
                – Score the possible pairs




                                                   25   Sergei Vassilvitskii

Thursday, March 14, 13
Data

              Suppose you have edges and degrees of each vertex:


              Joe        56    Mary       78
              Alice 398        Bob    198
              Dan        983   Justin 11,985,234
              ...


              An alternate view may be data stored as adjacency list:
              Joe         56    Mary 78        Don   99   Bill 1
              Alice 398         Kate 55        Bob 198    Mary 78
              ...


                                                26                  Sergei Vassilvitskii

Thursday, March 14, 13
Previous Algorithm

              Adjacency list input.
                – Map:
                         • For each node and its neighbors, output all paths through the node
                – Reduce:
                         • none




                – Map: [          |         ]
                – Output:
                – Map: [          |    ]
                – Output: None
                                                            27                                  Sergei Vassilvitskii

Thursday, March 14, 13
How to Count Triangles Better

              Idea [Schank ’07]
                – Only pivot on nodes who have smaller degrees than both neighbors.
                – Neighbors of high degree nodes tend to have small degrees




                                                28                        Sergei Vassilvitskii

Thursday, March 14, 13
Want to compute all open triads

              Data Needed:
                – Central node
                – Neighbors that have higher degree




                                                29    Sergei Vassilvitskii

Thursday, March 14, 13
Want to compute all open triads

              Data Needed:
                – Central node
                – Neighbors that have higher degree




                                                30    Sergei Vassilvitskii

Thursday, March 14, 13
Want to compute all open triads

              Data Needed:
                – Central node
                – Neighbors that have higher degree




                                                31    Sergei Vassilvitskii

Thursday, March 14, 13
Want to compute all open triads

              Data Needed:
                – Central node
                – Neighbors that have higher degree




                – Orient each edge to point to a node of higher degree, breaking ties
                  arbitrarily but consistently




                                                 32                          Sergei Vassilvitskii

Thursday, March 14, 13
Want to compute all open triads

              Map:
                – Orient each edge to point to a node of higher degree, breaking ties
                  arbitrarily but consistently
                – Given: Joe      56    Mary         78
                – Output: <Key    = Joe, Value      = Mary>
                – Given: Alice    398   Bob         198
                – Output: <Key    = Bob, Value      = Alice>

                     map(key, value):
                        split = value.split()
                        if split[3] > split[1] or
                          (split[3] == split[1] and split[0] < split[2]):
                             emit(split[0], split[2])
                        if split[3] < split[1] or
                          (split[3] == split[1] and split[0] > split[2]):
                             emit(split[2], split[0])


                                                 33                          Sergei Vassilvitskii

Thursday, March 14, 13
Want to compute all open triads

              Aggregate (Shuffle):
                – Collect all values with same key (nodes with higher degree)




              Computation:
                – Generate all 2-paths (friend of a friend relationships):




                                                   34                           Sergei Vassilvitskii

Thursday, March 14, 13
Want to compute all open triads

              Aggregate (Shuffle):
                – Collect all values with same key (nodes with higher degree)




              Computation:
                – Generate all 2-paths (friend of a friend relationships):
                – Generate all 2-paths:       ,         ,




                                                   35                           Sergei Vassilvitskii

Thursday, March 14, 13
Want to compute all open triads

              Aggregate (Shuffle):
                – Collect all values with same key (nodes with higher degree)

              Computation:
                – Generate all 2-paths (friend of a friend relationships)
                – Given: key= Joe, value={Mary, Justin, Alice}
                – Output:
                         • (key = Joe, Value = (Mary, Justin))
                         • (key = Joe, Value = (Mary, Alice))
                         • (key = Joe, Value = (Justin, Alice))

                    reduce(key, values):
                       for friend1 : values
                          for friend2 : values
                             emit(key, (friend1, friend2))


                                                    36                          Sergei Vassilvitskii

Thursday, March 14, 13
Comparing Algorithms

              Edgelist MapOnly Algorithm:
                – MapOnly
                – Output from some nodes is quadratic


              Edge at a time Algroithm:
                – Map & Reduce
                – More balanced output from each node




                                               37       Sergei Vassilvitskii

Thursday, March 14, 13
Scoring

              Some suggestions are better than others:
                – Some people are already friends!
                – Or they used to be friends...
                – Connected through a friend with 1000s of friends
                – Connected through multiple friends
                – ...




                                                  38                 Sergei Vassilvitskii

Thursday, March 14, 13
Spring Break!




                         39   Sergei Vassilvitskii

Thursday, March 14, 13

More Related Content

Viewers also liked

Computational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part IIComputational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part II
jakehofman
 
Computational Social Science, Lecture 02: An Introduction to Counting
Computational Social Science, Lecture 02: An Introduction to CountingComputational Social Science, Lecture 02: An Introduction to Counting
Computational Social Science, Lecture 02: An Introduction to Counting
jakehofman
 
Auto Elektrikoen Aurkezpena
Auto Elektrikoen AurkezpenaAuto Elektrikoen Aurkezpena
Auto Elektrikoen Aurkezpena
guest40206e1d
 

Viewers also liked (20)

Computational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part IIComputational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part II
 
Computational Social Science, Lecture 02: An Introduction to Counting
Computational Social Science, Lecture 02: An Introduction to CountingComputational Social Science, Lecture 02: An Introduction to Counting
Computational Social Science, Lecture 02: An Introduction to Counting
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
 
Modeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: OverviewModeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: Overview
 
Auto Elektrikoen Aurkezpena
Auto Elektrikoen AurkezpenaAuto Elektrikoen Aurkezpena
Auto Elektrikoen Aurkezpena
 
Iso 9001 2008 - qms
Iso 9001  2008 - qmsIso 9001  2008 - qms
Iso 9001 2008 - qms
 
лабар5
лабар5лабар5
лабар5
 
Proyecto De FormacióN E InvestigacióNúCleo Social Cultural
Proyecto De FormacióN E InvestigacióNúCleo Social CulturalProyecto De FormacióN E InvestigacióNúCleo Social Cultural
Proyecto De FormacióN E InvestigacióNúCleo Social Cultural
 
Presentación del centro Don Juan I
Presentación del centro Don  Juan IPresentación del centro Don  Juan I
Presentación del centro Don Juan I
 
звіти
звітизвіти
звіти
 
Student textual analysis
Student textual analysisStudent textual analysis
Student textual analysis
 
Tic ted bravo nicolás
Tic ted bravo nicolásTic ted bravo nicolás
Tic ted bravo nicolás
 
Evaluation Q7
Evaluation Q7Evaluation Q7
Evaluation Q7
 
Поетична свічка
Поетична свічкаПоетична свічка
Поетична свічка
 
Rukovodstvo po montazhu_i_nastrojke_touch_board_i_
Rukovodstvo po montazhu_i_nastrojke_touch_board_i_Rukovodstvo po montazhu_i_nastrojke_touch_board_i_
Rukovodstvo po montazhu_i_nastrojke_touch_board_i_
 
Virus y antivirus
Virus y antivirusVirus y antivirus
Virus y antivirus
 
SEPM Outsourcing
SEPM OutsourcingSEPM Outsourcing
SEPM Outsourcing
 
Lake maggiore
Lake maggioreLake maggiore
Lake maggiore
 

More from jakehofman

NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Science
jakehofman
 
Data-driven modeling: Lecture 10
Data-driven modeling: Lecture 10Data-driven modeling: Lecture 10
Data-driven modeling: Lecture 10
jakehofman
 
Data-driven modeling: Lecture 09
Data-driven modeling: Lecture 09Data-driven modeling: Lecture 09
Data-driven modeling: Lecture 09
jakehofman
 

More from jakehofman (16)

Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networks
 
Modeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: ClassificationModeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: Classification
 
Modeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationModeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalization
 
Modeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at ScaleModeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at Scale
 
Modeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in RModeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in R
 
Modeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation SystemsModeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation Systems
 
Modeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive BayesModeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayes
 
Modeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at ScaleModeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at Scale
 
Modeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case StudiesModeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case Studies
 
NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Science
 
Technical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal WabbitTechnical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal Wabbit
 
Data-driven modeling: Lecture 10
Data-driven modeling: Lecture 10Data-driven modeling: Lecture 10
Data-driven modeling: Lecture 10
 
Data-driven modeling: Lecture 09
Data-driven modeling: Lecture 09Data-driven modeling: Lecture 09
Data-driven modeling: Lecture 09
 
Using Data to Understand the Brain
Using Data to Understand the BrainUsing Data to Understand the Brain
Using Data to Understand the Brain
 

Recently uploaded

Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
negromaestrong
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
AnaAcapella
 

Recently uploaded (20)

Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Third Battle of Panipat detailed notes.pptx
Third Battle of Panipat detailed notes.pptxThird Battle of Panipat detailed notes.pptx
Third Battle of Panipat detailed notes.pptx
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 

Computational Social Science, Lecture 08: Counting Fast, Part II

  • 1. Counting Fast (Part II) Sergei Vassilvitskii Columbia University Computational Social Science March 8, 2013 Thursday, March 14, 13
  • 2. Last time Counting fast: – Quadratic time doesn’t scale – Sorting is slightly more than linear – Hashing allows you to do membership queries in constant time 2 Sergei Vassilvitskii Thursday, March 14, 13
  • 3. Today Counting on Networks: – Large Graphs: Internet, Facebook, Twitter – Recommendation Graphs: Netflix, Amazon, etc. 3 Sergei Vassilvitskii Thursday, March 14, 13
  • 4. Friends & Followers Given a network: – When do people become friends? – What factors influence this? 4 Sergei Vassilvitskii Thursday, March 14, 13
  • 5. Friends & Followers Given a network: – When do people become friends? – What factors influence this? Products: – People You May Know (PYMK). Reconnect people, help new users 5 Sergei Vassilvitskii Thursday, March 14, 13
  • 6. Friends & Followers Given a network: – When do people become friends? – What factors influence this? Products: – People You May Know (PYMK). Reconnect people, help new users – Twitter’s who to follow? 6 Sergei Vassilvitskii Thursday, March 14, 13
  • 7. Friends & Followers Given a network: – When do people become friends? – What factors influence this? Products: – People You May Know (PYMK). Reconnect people, help new users – Twitter’s who to follow? Recommendations: – Netflix, Amazon, etc. (Future lectures) 7 Sergei Vassilvitskii Thursday, March 14, 13
  • 8. Triadic Closure Likely to become friends with: – People in similar groups – Friends of friends 8 Sergei Vassilvitskii Thursday, March 14, 13
  • 9. Defining Tight Knit Circles Looking for tight-knit circles: – People whose friends are friends themselves Why? – Network Cohesion: Tightly knit communities foster more trust, social norms. [Coleman ’88, Portes ’88] – Structural Holes: Individuals benefit form bridging [Burt ’04, ’07] 9 Sergei Vassilvitskii Thursday, March 14, 13
  • 10. Clustering Coefficient vs. 10 Sergei Vassilvitskii Thursday, March 14, 13
  • 11. Clustering Coefficient cc ( ) = 0.5 cc ( ) = 0.1 vs. Given an undirected graph - For each node, it’s the fraction of v’s neighbors who are neighbors themselves - Identical to the number of triangles containing the node 11 Sergei Vassilvitskii Thursday, March 14, 13
  • 12. How to Count Triangles Sequential Version: foreach v in V foreach u,w in Adjacency(v) if (u,w) in E Triangles[v]++ v Triangles[v]=0 12 Sergei Vassilvitskii Thursday, March 14, 13
  • 13. How to Count Triangles Sequential Version: foreach v in V foreach u,w in Adjacency(v) if (u,w) in E Triangles[v]++ v Triangles[v]=1 w u 13 Sergei Vassilvitskii Thursday, March 14, 13
  • 14. How to Count Triangles Sequential Version: foreach v in V foreach u,w in Adjacency(v) if (u,w) in E Triangles[v]++ v Triangles[v]=1 w u 14 Sergei Vassilvitskii Thursday, March 14, 13
  • 15. How to Count Triangles Sequential Version: foreach v in V foreach u,w in Adjacency(v) if (u,w) in E Triangles[v]++ Running time: – For each vertex, look at all pairs of neighbors – Number of pairs ~ quadratic in the degree of the vertex – What happens if the degree is very large? 15 Sergei Vassilvitskii Thursday, March 14, 13
  • 16. Parallel Version But use 1,000 machines! – Quadratic algorithms still don’t scale – Simple parallelization: process each vertex separately Naive parallelization does not help with data skew – Some nodes will have very high degree – Example. 3.2 Million followers, must generate 10 Trillion (10^13) potential edges to check. – Even if generating 100M edges per second this is 100K seconds ~ 27 hours for one vertex! 16 Sergei Vassilvitskii Thursday, March 14, 13
  • 17. “Just 5 more minutes” On the LiveJournal Graph (5M nodes, 70M edges) – 80% of vertices are done after 5 min – 99% done after 35 min 17 Sergei Vassilvitskii Thursday, March 14, 13
  • 18. Adapting the Algorithm Approach 1: Dealing with skew directly – currently every triangle counted 3 times (once per vertex) – Running time quadratic in the degree of the vertex – Idea: Count each once, from the perspective of lowest degree vertex – Does this heuristic work? 18 Sergei Vassilvitskii Thursday, March 14, 13
  • 19. How to Count Triangles Better Idea [Schank ’07] – Only pivot on nodes who have smaller degrees than both neighbors. – Neighbors of high degree nodes tend to have small degrees 19 Sergei Vassilvitskii Thursday, March 14, 13
  • 20. How to Count Triangles Better foreach v in V foreach u in Adjacency(v) with deg(u) > deg(v): foreach w in Adjacency(v) with deg(w) > deg(v): if (u,w) is an edge: Triangles[v]++ Triangles[w]++ Triangles[u]++ 20 Sergei Vassilvitskii Thursday, March 14, 13
  • 21. Does it make a difference? 21 Sergei Vassilvitskii Thursday, March 14, 13
  • 22. Why does it help? Look at two different kinds of nodes: – Few friends: • OK to be quadratic on small instances – Lots of friends • Only care about number of friends with even more friends! • Cannot have too many (can make this formal) 22 Sergei Vassilvitskii Thursday, March 14, 13
  • 23. Break 23 Sergei Vassilvitskii Thursday, March 14, 13
  • 24. Working in Parallel MapReduce (review): Map: – Decide how to group the data for computation Reduce: – Given the grouping, perform the computation 24 Sergei Vassilvitskii Thursday, March 14, 13
  • 25. Building People You May Know Friendships are undirected: – If Alice knows Bob, Bob knows Alice – Data stored as a list of all edges – Find all friends of friends – Score the possible pairs 25 Sergei Vassilvitskii Thursday, March 14, 13
  • 26. Data Suppose you have edges and degrees of each vertex: Joe 56 Mary 78 Alice 398 Bob 198 Dan 983 Justin 11,985,234 ... An alternate view may be data stored as adjacency list: Joe 56 Mary 78 Don 99 Bill 1 Alice 398 Kate 55 Bob 198 Mary 78 ... 26 Sergei Vassilvitskii Thursday, March 14, 13
  • 27. Previous Algorithm Adjacency list input. – Map: • For each node and its neighbors, output all paths through the node – Reduce: • none – Map: [ | ] – Output: – Map: [ | ] – Output: None 27 Sergei Vassilvitskii Thursday, March 14, 13
  • 28. How to Count Triangles Better Idea [Schank ’07] – Only pivot on nodes who have smaller degrees than both neighbors. – Neighbors of high degree nodes tend to have small degrees 28 Sergei Vassilvitskii Thursday, March 14, 13
  • 29. Want to compute all open triads Data Needed: – Central node – Neighbors that have higher degree 29 Sergei Vassilvitskii Thursday, March 14, 13
  • 30. Want to compute all open triads Data Needed: – Central node – Neighbors that have higher degree 30 Sergei Vassilvitskii Thursday, March 14, 13
  • 31. Want to compute all open triads Data Needed: – Central node – Neighbors that have higher degree 31 Sergei Vassilvitskii Thursday, March 14, 13
  • 32. Want to compute all open triads Data Needed: – Central node – Neighbors that have higher degree – Orient each edge to point to a node of higher degree, breaking ties arbitrarily but consistently 32 Sergei Vassilvitskii Thursday, March 14, 13
  • 33. Want to compute all open triads Map: – Orient each edge to point to a node of higher degree, breaking ties arbitrarily but consistently – Given: Joe 56 Mary 78 – Output: <Key = Joe, Value = Mary> – Given: Alice 398 Bob 198 – Output: <Key = Bob, Value = Alice> map(key, value): split = value.split() if split[3] > split[1] or (split[3] == split[1] and split[0] < split[2]): emit(split[0], split[2]) if split[3] < split[1] or (split[3] == split[1] and split[0] > split[2]): emit(split[2], split[0]) 33 Sergei Vassilvitskii Thursday, March 14, 13
  • 34. Want to compute all open triads Aggregate (Shuffle): – Collect all values with same key (nodes with higher degree) Computation: – Generate all 2-paths (friend of a friend relationships): 34 Sergei Vassilvitskii Thursday, March 14, 13
  • 35. Want to compute all open triads Aggregate (Shuffle): – Collect all values with same key (nodes with higher degree) Computation: – Generate all 2-paths (friend of a friend relationships): – Generate all 2-paths: , , 35 Sergei Vassilvitskii Thursday, March 14, 13
  • 36. Want to compute all open triads Aggregate (Shuffle): – Collect all values with same key (nodes with higher degree) Computation: – Generate all 2-paths (friend of a friend relationships) – Given: key= Joe, value={Mary, Justin, Alice} – Output: • (key = Joe, Value = (Mary, Justin)) • (key = Joe, Value = (Mary, Alice)) • (key = Joe, Value = (Justin, Alice)) reduce(key, values): for friend1 : values for friend2 : values emit(key, (friend1, friend2)) 36 Sergei Vassilvitskii Thursday, March 14, 13
  • 37. Comparing Algorithms Edgelist MapOnly Algorithm: – MapOnly – Output from some nodes is quadratic Edge at a time Algroithm: – Map & Reduce – More balanced output from each node 37 Sergei Vassilvitskii Thursday, March 14, 13
  • 38. Scoring Some suggestions are better than others: – Some people are already friends! – Or they used to be friends... – Connected through a friend with 1000s of friends – Connected through multiple friends – ... 38 Sergei Vassilvitskii Thursday, March 14, 13
  • 39. Spring Break! 39 Sergei Vassilvitskii Thursday, March 14, 13