# Scoda project companygraph

Published in: Technology

### Scoda project companygraph

1. This tutorial describes how to use network analysis tools to visually explore the links between companies working on the same contract.
2. The example dataset we will use comes from the World Bank. Each row represents a contract. Inspecting the column names tells us what data we have available about each contract. Looking at the data, we can see how we could order the companies based on the value of the total contract amount; or we might order the contracts by time; or we might look to see which contracts were awarded in a particular project, or to a particular company in the event of the same company being awarded more than one contract.
3. We might also wish to look for patterns in the data that show us how the things described in one row might connect to things described in other rows. For example, can we organise the data somehow to see which companies are associated with which projects? Could a network style visualisation help us do this?
4. But if we were to draw a network, what sort of thing should we connect to what? And how would we know what to connect to each other? One way is to look at the data… at which point we might notice that some of the entries within a column take on the same value. This means that we can “connect” the data that appears in different rows using these common elements…
5. So what columns have usefully repeating elements? The projects column certainly has repeating elements, so we should be able to draw diagrams that show all the companies that connect to each project. And if a company is associated with more than one project, it should in a certain sense be seen to join those projects together…
6. A few of the contract numbers repeat, so it might be interesting to explore the extent to which companies connect to contracts. If two different companies are associated with the same contracts, that might be interesting.
7. Let’s get some data so we can start to explore the network…
8. We just need to do a little bit of tidying of the data before we make use of it. The major problem is that the Total Contract Amount column does not contain numbers, as such… In particular, we need to get rid of the dollar sign. Let’s create a new column into which we can put the cleaned values.
9. This little bit of code says: take the value of each cell in the original column and replace the $ symbol with nothing (that is, an empty string). In other words, delete the dollar sign… Put this value in the corresponding cell of the new column, and make the cell a number type.
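The clean-up step described in slide 9 can be sketched outside OpenRefine as well. The following is a minimal Python illustration, not the expression used in the slides: the function name `clean_amount` and the sample values are hypothetical, and stripping thousands separators is an added assumption that the slides do not mention.

```python
def clean_amount(raw: str) -> float:
    """Delete the dollar sign and return the remaining text as a number.

    Removing commas is an assumption added here so that values such as
    "$45,000" also convert; the slides only mention the dollar sign.
    """
    return float(raw.replace("$", "").replace(",", ""))


# Hypothetical cell values, not taken from the World Bank dataset.
raw_amounts = ["$1200.50", "$300", "$45,000"]

# The "new column": one cleaned number per original cell.
amounts = [clean_amount(value) for value in raw_amounts]
print(amounts)
```

Keeping the cleaned values in a separate list mirrors the slide's choice of writing to a new column rather than overwriting the original one.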
10. Now we can export the data using the Custom Tabular Exporter, which allows us to select just those columns we want to export. (This can be very handy when a table has a large number of columns that we are not interested in!) I have rearranged the cells in the Custom Tabular Exporter simply by clicking on them and dragging them around. We just want three columns for now: Project ID, Supplier, and our new Amount column. Now that you know how to export the data just a few columns at a time, once you are comfortable with the process of visualising the data, you should be able to take other slices through the data (such as companies related to contracts) and visualise them yourself. You might also like to try using a similar method on a data set of your own…
11. There’s a final bit of tidying to do before we can use this data in Gephi, the application we’ll be using to visualise the network. In particular, Gephi expects the data to be presented to it with particular column names. Open the exported CSV data in a text editor and rename the columns: Source,Target,Weight (no spaces?) Note: you could have also renamed the columns in OpenRefine before exporting them…
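The column renaming in slide 11 can also be done with a few lines of code instead of a text editor. This is a sketch using Python's standard `csv` module; the helper name `rename_header` and the sample row are hypothetical, and the column order (Project ID, Supplier, Amount) is assumed from the order in which slide 10 lists the exported columns.

```python
import csv
import io


def rename_header(csv_text, new_names):
    """Return csv_text with its first (header) row replaced by new_names."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if len(rows[0]) != len(new_names):
        raise ValueError("header length does not match the new column names")
    rows[0] = list(new_names)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Hypothetical export; the data row is made up for illustration.
exported = "Project ID,Supplier,Amount\nP000001,Acme Ltd,1200.5\n"
gephi_ready = rename_header(exported, ["Source", "Target", "Weight"])
print(gephi_ready)
```

With real files, the same function could be applied to the text read from the exported CSV and the result written back out before importing into Gephi.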
12. We might also wish to look for patterns in the data that show us how the things described in one row might connect to things described in other rows. For example, can we organise the data somehow to see which companies are associated with which projects? Could a network style visualisation help us do this?
13. Network diagrams allow us to show relationships between different things. Networks are referred to in mathematical terms as graph structures, or graphs. You may be more familiar with thinking of things like line charts and bar charts as graphs, but when it comes to networks, we use the term graph to describe the mathematical structure that defines the network. The circles (or nodes) represent “things” in the network, in this case, particular companies or projects. The lines (or edges) represent relationships between the things in the network. In this example, the edges represent contracts that associate a particular company with one or more projects (or conversely, associate a project with one or more companies). Where nodes are placed in the diagram can be used to convey information about the structure of the network. Many different algorithms exist to lay out (that is, place, or position) the nodes at specific points in the diagram. Typically, we try to place nodes that are heavily interconnected by edges close to each other. Nodes that are grouped closely together on the page might then be assumed to be associated in some way because of the increasing number of links that connect them to each other.
14. Launch Gephi and from the File menu select New Project. Click on the Data Laboratory tab, and then Import Spreadsheet. Load in the file (with amended column names) as an Edges Table. The default settings should be fine…
15. Click on the Overview tab. You should see the network that connects Companies to Project IDs displayed there… But what does it mean? And can we tidy it up a little?!
16. I used the Yifan Hu layout to generate this view over the network. Yifan Hu is a good all-round layout engine that works particularly well when the data is hierarchically structured. Another good general purpose layout algorithm is ForceAtlas2.
17. Whilst we might get a feeling for the structure and shape of the dataset as a whole from the overall visualisation, we often want to inspect one or more of the nodes in detail. The quickest way of doing this is to look at the labels… You may also have noticed that some lines are thicker than others. In this case, the line thicknesses are proportional to the contract value, which we set in the Weight column. If a company is associated with more than a single contract on a particular project, the edge weight will be proportional to the overall (total) sum of values of all the contracts relating that company to that project.
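The summing of contract values described in slide 17 can be illustrated with a short sketch. It assumes each row is a `(source, target, weight)` tuple; the function name `merge_parallel_edges` and the sample rows are hypothetical, and this shows the idea rather than Gephi's internal import code.

```python
from collections import defaultdict


def merge_parallel_edges(rows):
    """Sum the weights of rows that share the same (source, target) pair."""
    totals = defaultdict(float)
    for source, target, weight in rows:
        totals[(source, target)] += weight
    return dict(totals)


# Hypothetical rows: "Acme" holds two contracts on project "P1".
rows = [
    ("P1", "Acme", 100.0),
    ("P1", "Acme", 50.0),
    ("P1", "Bolt", 30.0),
]
edges = merge_parallel_edges(rows)
print(edges)
```

The two Acme contracts collapse into a single edge whose weight is their total, which is what the thicker line in the diagram represents.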
18. As well as using space (or position) and colour to represent structural elements of the network, we can also use edge weight (that is, the thickness or width of the lines connecting nodes to each other) to represent some feature of the network. In this case, we might use edge weight to represent the value of the contract that connects a company with a project, or the number of contracts that a company has on a particular project. When placing nodes, we might also use edge weight to contribute to the determination of how closely two connected nodes should be placed to each other. If you think of the edge thickness in terms of the size, thickness or strength of a mechanical spring, you might perhaps start to imagine how nodes connected by thick springs will be pulled closer to each other than nodes connected by much weaker springs.
19. As well as edge thickness, we might also make use of node size to highlight some feature of the network. In this example, we use node size to represent the degree of each node, that is, the number of edges connected to it. Sometimes, we might want to highlight nodes that have small numbers of connections, for example to identify projects with very few companies contracted to them. In this case, we might make nodes with only a single incoming edge very large, and nodes with a large number of edges much smaller. The node size thus represents how well connected a node is. In this case, the size of each project node indicates how many companies are associated with it, and the size of each company node depicts how many project contracts the company is engaged with. Note that we can combine edge weight and node size, for example, by setting node size proportional to the summed weights of edges that are connected to the node. Hopefully, you are already starting to see how a network diagram can provide a range of powerful visual representations for helping us explore the structure of a network and identify key elements of it.
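The two node measures mentioned in slide 19, degree and the summed weight of connected edges, can be computed directly from an edge list. The sketch below assumes `(source, target, weight)` tuples with one tuple per distinct edge; the name `degree_and_strength` and the sample edges are hypothetical.

```python
from collections import Counter, defaultdict


def degree_and_strength(edges):
    """Return (degree, strength) dictionaries keyed by node.

    degree counts the edges touching a node; strength sums their weights.
    """
    degree = Counter()
    strength = defaultdict(float)
    for source, target, weight in edges:
        for node in (source, target):
            degree[node] += 1
            strength[node] += weight
    return dict(degree), dict(strength)


# Hypothetical project-to-company edges.
edges = [("P1", "Acme", 100.0), ("P1", "Bolt", 30.0), ("P2", "Acme", 20.0)]
degree, strength = degree_and_strength(edges)
print(degree)
print(strength)
```

Either dictionary could then drive node size: `degree` for "how well connected", `strength` for "how much contract value".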
20. We can size the nodes according to statistical values calculated over the network. In this case, we might want to highlight nodes according to the total value of contracts flowing into them (for companies) or out of them (for projects). The weighted average statistic calculates the corresponding value for each node in the network. The spline operator in the Ranking tab, where we set the node size, allows us to tweak the relationship between the value used to size the node and the node size. The default is a simple linear proportional map. However, we may find that the range of values we want to map are “clumped” together (for example, one very large value and a range of smaller values clumped together at the other end of the overall range). In such a case, we might want to tweak the mapping to provide a little more salience when it comes to distinguishing between the values that are otherwise clumped together. As well as making node size proportional to some quantity, we can also set the label size to be proportional to the node size.
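To see why a non-linear mapping helps with clumped values, here is a small sketch that maps values to node sizes either linearly or after a transform. It does not reproduce Gephi's spline editor: a square root is swapped in as a simple stand-in curve, and the function name `scale_sizes`, the size range and the sample values are all assumptions.

```python
import math


def scale_sizes(values, min_size=10.0, max_size=50.0, transform=lambda v: v):
    """Map each value to a size in [min_size, max_size] after transform."""
    transformed = {node: transform(v) for node, v in values.items()}
    low = min(transformed.values())
    span = max(transformed.values()) - low
    if span == 0:
        # All values are equal, so every node gets the minimum size.
        return {node: min_size for node in transformed}
    return {
        node: min_size + (t - low) / span * (max_size - min_size)
        for node, t in transformed.items()
    }


# Hypothetical values: two small ones clumped together and one outlier.
values = {"A": 1.0, "B": 4.0, "C": 100.0}
linear = scale_sizes(values)
eased = scale_sizes(values, transform=math.sqrt)
print(linear)
print(eased)
```

Under the linear map, A and B end up almost the same size; the square-root curve pulls them further apart while keeping the same overall size range.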
21. There are several other tools available to us that allow us to explore other properties of the network. For example, there is a wide selection of filters that allow us to select particular filtered views of the network. In this case, we use the degree range filter to show only nodes that have degree of two or more. This filters out nodes that have degree 1, for example, companies that are only associated with a single project. The result is a view over the network that shows which companies are associated with two or more projects, and which projects they are. The node sizes are indicative of the total overall value of contracts associated with each particular node. So for example, we see that Siemens AG is associated with contracts from projects P072018 and P090104. The large node size suggests that the sum total of contracts Siemens AG has received via these projects is quite significant. In addition, the line from P072018 to Siemens AG suggests that the total value of contracts (or maybe just a single contract) Siemens AG has received from that project is quite large.
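The effect of the degree range filter in slide 21 can be approximated as follows. This is a single-pass sketch: degrees are computed once on the full edge list, and only edges whose two endpoints both pass the threshold are kept. Whether this matches Gephi's filter in every detail is an assumption, and the name `filter_by_min_degree` and the sample edges are hypothetical.

```python
from collections import Counter


def filter_by_min_degree(edges, min_degree=2):
    """Keep edges whose endpoints both have degree >= min_degree."""
    degree = Counter()
    for source, target, _weight in edges:
        degree[source] += 1
        degree[target] += 1
    keep = {node for node, d in degree.items() if d >= min_degree}
    return [(s, t, w) for s, t, w in edges if s in keep and t in keep]


# Hypothetical edges: "Bolt" and "Cord" each work on a single project.
edges = [
    ("P1", "Acme", 1.0),
    ("P2", "Acme", 1.0),
    ("P1", "Bolt", 1.0),
    ("P2", "Cord", 1.0),
]
filtered = filter_by_min_degree(edges)
print(filtered)
```

Only Acme, the company linked to two projects, survives the filter along with the two projects it connects.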
22. So far, our network diagram has shown us how companies relate to projects, and conversely, how projects relate to companies. But sometimes we may want to know rather more directly the extent to which two things are connected by virtue of having a common partner, for example, which companies worked on the same projects together, or which projects are linked by virtue of having used the same companies. When the data is represented as a graph, we can manipulate the graph in order to generate derived graphs that can capture these sorts of relationship directly.
23. When we have a dataset represented in the form of a network, we can start to analyse it by looking at additional network properties. For example, for the projects and companies graph, we might process the graph so as to remove project nodes and replace the edges with edges that connect companies that were on one or more projects with each other. We might even use edge weight to depict how many projects there were in common between two companies.
24. From the workspace menu, duplicate the original network (remember to turn off all the filters! We want the whole network.) You will automatically be moved to a new workspace containing a copy of the original network. (Navigate between workspaces from the workspace selector at the bottom right hand corner of the whole application window.) In the Multimode Networks Projection panel, click on Graph Coloring to try to split the network into complementary types of node (companies and projects). Hopefully, the tool will return with the report that Bipartite: true. That is, two complementary sets of nodes have been found (nodes in the first group are only ever connected to nodes in the second group). Click on Load attributes and select the Node Color Multimode option.
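The bipartite check reported by the Graph Coloring step can be illustrated with a standard breadth-first two-colouring. This is a generic textbook sketch, not the plugin's own code; the function name `two_colour` and the sample edges are hypothetical.

```python
from collections import defaultdict, deque


def two_colour(edges):
    """Return {node: 0 or 1} if the graph is bipartite, otherwise None."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)

    colour = {}
    for start in neighbours:
        if start in colour:
            continue  # already coloured as part of an earlier component
        colour[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbours[node]:
                if other not in colour:
                    colour[other] = 1 - colour[node]
                    queue.append(other)
                elif colour[other] == colour[node]:
                    return None  # two linked nodes share a colour
    return colour


# Hypothetical project-to-company edges (bipartite by construction).
edges = [("P1", "Acme"), ("P1", "Bolt"), ("P2", "Acme")]
colouring = two_colour(edges)
print(colouring)
```

If any company were linked directly to another company, the function would return `None`, which corresponds to the tool reporting that the network is not bipartite.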
25. To check what the multimode tool has called nodes of each type, click on the edit button in the palette toolbar, and click on a project node. An edit panel will appear; make a note of what colour the project type node has been labeled. We can now use the multimode network projection tool to process the network by joining together company nodes that are connected by a common project, and deleting the project nodes. That is, we want to connect blue company nodes to blue company nodes if they are connected by edges that pass through a common red project node. Once we have made the mapping, we can delete the inner red project nodes. Running the projection results in several distinct clusters of companies that are connected to each other by virtue of being associated with the same project, as well as some companies that bridge different clusters by virtue of being associated with companies from different projects.
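The projection described in slides 23 to 25 (link companies that share a project, then drop the project nodes) can be sketched in a few lines. Edge weight here counts shared projects, as slide 23 suggests; the function name `project_onto_targets` and the sample edges are hypothetical, and this is not the plugin's implementation.

```python
from collections import defaultdict
from itertools import combinations


def project_onto_targets(edges):
    """Project (source, target) pairs onto the target nodes.

    Returns {(a, b): n}, where targets a and b share n common sources.
    """
    members = defaultdict(set)
    for source, target in edges:
        members[source].add(target)

    weights = defaultdict(int)
    for targets in members.values():
        # Every pair of targets attached to the same source gets linked.
        for a, b in combinations(sorted(targets), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical (project, company) edges.
edges = [
    ("P1", "Acme"), ("P1", "Bolt"),
    ("P2", "Acme"), ("P2", "Bolt"), ("P2", "Cord"),
]
company_links = project_onto_targets(edges)
print(company_links)
```

Swapping the roles, for example `project_onto_targets([(c, p) for p, c in edges])`, gives the converse view in which projects are linked by the companies they have in common.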
26. Conversely, we might remove the company nodes, and identify a new set of edges that connect projects that shared one or more common contracted companies. Again, edge thickness might be used to show how tightly connected two projects were by virtue of increasing numbers of common contracted companies.
27. By projecting the original network onto the network that shows links between projects that arise from common companies, we get a much clearer picture about how many projects there are, as well as possible linkages between them.
28. Here are some of the things you have hopefully learned… feel free to add anything else you might have learned to the list…
29. For more information, and a wide range of further tutorials on all matters data related, visit the School Of Data at SchoolOfData.org, or on Twitter via @SchoolOfData.