Divvy Bike Challenge Visualizations

CSC 465 - DATA VISUALIZATION: FINAL PROJECT

GROUP #1:
HASSAN AL ALAIWI, RICARDO LOURENÇO, AND MATT SIEDLECKI

March 2015

Contents
Abstract
Description
Scope
Dataset
Dataset Variables
Final Visualizations
  Usage by Weekday/Weekend and Time of Day
  Network Map of Chicago Loop
  Circular Network Visualization
Discussion
  Usage by Weekday/Weekend and Time of Day
  Network Map of Chicago Loop
  Circular Network Visualization and Related Analysis
Summary of Team Member Contributions

Abstract
This final project report presents our final data visualizations of the Divvy bikes dataset. It details each visualization technique used to display information about the dataset and its possible correlation implications. To keep the report coherent and concrete, a few segments have been drawn from the project milestones that were previously submitted as part of the project's progress. Higher-resolution files of the final visualizations are enclosed.

Description
Every year, Divvy launches a data challenge, releasing its dataset so that participants can scrutinize and visualize the data under different categories. This year, Divvy celebrates its first full-year dataset (2014), with over 3.2 million rows of data, which makes the challenge even more demanding and more enticing for data scientists and other participants.
Scope
Our task is to fulfill the CSC 465 project objective of visualizing a dataset using the most suitable visualization techniques discussed throughout the course. Since the dataset includes geographic data suitable for mapping, we completed multiple graphs that visualize it both statistically and geographically. The software used for this purpose was RStudio, Tableau, JMP, and ArcGIS, which together provided a sufficient platform for our objective. These visualizations answer the following questions¹ as clearly and accurately as we could achieve:
• When and where are riders going?
• What are the most and least busy stations?
• What interesting usage patterns emerge?
• How can the riders' demographics be presented?

Dataset
The full-year dataset is broken down by quarter, with a total of 2.4 million records. Because of its size, we mainly worked with a simple random sample of 100,000 records, a manageable size that is still sufficient for our educational purposes.
Source: Chicago Divvy Bikes website (annual data challenge)
Website: http://www.divvybikes.com/datachallenge
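
The sampling step described above can be sketched as follows. This is only a minimal illustration, assuming the trip records have already been read into a list; the function name `simple_random_sample`, the seed value, and the toy rows are hypothetical and are not taken from the project files.

```python
import random


def simple_random_sample(rows, k, seed=465):
    """Return k rows drawn uniformly at random, without replacement."""
    rng = random.Random(seed)  # arbitrary fixed seed so the sample is reproducible
    if k >= len(rows):
        return list(rows)  # nothing to sample if the table is already small enough
    return rng.sample(rows, k)


# Toy stand-in for the trip table; the real table has millions of rows.
trips = [{"trip_id": i} for i in range(1000)]
sample = simple_random_sample(trips, 100)
```

Every record has the same chance of being selected, which is what makes the reduced table a simple random sample rather than, for example, the first 100,000 rows of one quarter.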
  
	
  	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  
¹ Some of these inquiries are part of the 2015 Divvy Data Challenge.
² Description of the dataset variables was provided by the Divvy Data Challenge.

Dataset Variables
The dataset consists of two tables (sub-datasets), both of which are used in the project².
• Trips dataset (the main dataset):
This is the main project dataset. It contains one record for every trip, created whenever a bike is rented from a station. The following 12 variables are captured in every row:
    trip_id: ID attached to each trip taken – (Type: Category – serial key)
    starttime: day and time trip started, in CST – (Type: Date&Time)
    stoptime: day and time trip ended, in CST – (Type: Date&Time)
    bikeid: ID attached to each bike – (Type: Category)
    tripduration: time of trip in seconds – (Type: Numeric)
    from_station_name: name of station where trip originated – (Type: Category)
    to_station_name: name of station where trip terminated – (Type: Category)
    from_station_id: ID of station where trip originated – (Type: Category)
    to_station_id: ID of station where trip terminated – (Type: Category)
    usertype: "Customer" is a rider who purchased a 24-Hour Pass; "Subscriber" is a rider who purchased an Annual Membership – (Type: Category)
    gender: gender of rider – (Type: Binary)
    birthyear: birth year of rider – (Type: Numeric)

Ø Stations	
  dataset	
  (Table	
  relationship	
  dataset):	
  
This	
  relations	
  dataset	
  includes	
  the	
  location	
  details	
  of	
  Divvy	
  stations	
  which	
  will	
  be	
  used	
  in	
  the	
  
project	
  to	
  map	
  the	
  start	
  and	
  end	
  locations	
  of	
  each	
  bike	
  trip	
  used	
  in	
  the	
  main	
  table.	
  The	
  5	
  
variables	
  are:	
  
name:	
  station	
  name	
  –	
  (Type:	
  Category)	
  
latitude:	
  station	
  latitude	
  –	
  (Type:	
  GPS	
  location)	
  
longitude:	
  station	
  longitude	
  –	
  (Type:	
  GPS	
  location)	
  
dpcapacity:	
  number	
  of	
  total	
  docks	
  at	
  each	
  station	
  as	
  of	
  12/31/2014	
  –	
  (Type:	
  Numeric)	
  
online	
  date:	
  date	
  the	
  station	
  went	
  live	
  in	
  the	
  system	
  –	
  (Type:	
  Date&Time)	
  
	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  
Final Visualizations

Visualization #1: Divvy Bike Usage by Weekday/Weekend and Time of Day

Visualization #2: Network Map of Chicago Loop

Visualization #3: Circular Network Visualization

Divvy Bike Usage by Time of Day and Day of Week: Discussion

Overview
For this visualization, small multiple maps were combined with histograms to display information about Divvy bike usage by time of day and day of week. Reading horizontally, the maps show the data in 4-hour blocks starting at midnight. The histograms at the top and bottom display total system usage.

The segments drawn on each map illustrate the popular routes, selected using a combination of two thresholds: a maximum number of segments per map and a minimum usage level required for a route to be considered. The design decisions behind which bike trip segments were included are discussed further in the Design Considerations section.

The viewer is able to discern a number of pieces of information from this graphic, including:
• Usage is much higher during the week than on the weekend.
• Weekday usage has a bimodal distribution, with peaks during the morning and evening commuting times.
• More commuters use the bikes in the evening than in the morning.
• Weekend usage is much lower than weekday usage.
• Weekend usage has a unimodal distribution centered in the early afternoon.
• In general, off-peak hours have riders scattered throughout the city, especially near train stations, while usage is more concentrated during the day.
• Weekend usage is heavily concentrated along the lakeshore, Lincoln Park, Navy Pier, and some smaller tourist locations such as the Hyde Park Museum Campus.
• The lakeshore path is more prominent during evening commute hours than during morning commute hours, possibly due to higher system usage during that time.

Design Considerations
Number of Small Multiple Maps
The final visualization splits the entire day into 4-hour blocks and shows 6 maps for weekday rides and 6 for weekend rides. We chose to use an equal number of hours in each map to make it clear to the viewer. That gave us the option of 2-, 4-, 6-, or 12-hour blocks. Twelve-hour blocks were not seriously considered because they would not reveal many interesting patterns in the data. On the other hand, 2-hour blocks would have doubled the number of maps in the final visualization, and we concluded that would be too many. Ultimately, 4-hour blocks were convenient because they clearly differentiate the afternoon (noon-4 PM) from the commuting hours after 4 PM.
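
As an illustration of the panel layout described above, the sketch below assigns a trip start time to one of the 12 panels (weekday or weekend, crossed with six 4-hour blocks). It assumes the start time has already been parsed into a Python `datetime`; the function name `time_block` and the label format are hypothetical choices made for this example.

```python
from datetime import datetime

BLOCK_HOURS = 4  # block width used in the final visualization


def time_block(start):
    """Return (day_type, block_label) for a trip start time."""
    # datetime.weekday() numbers Monday as 0 and Sunday as 6.
    day_type = "weekend" if start.weekday() >= 5 else "weekday"
    first_hour = (start.hour // BLOCK_HOURS) * BLOCK_HOURS
    label = f"{first_hour:02d}:00-{first_hour + BLOCK_HOURS:02d}:00"
    return day_type, label


# 2014-07-04 was a Friday and 2014-07-05 a Saturday.
print(time_block(datetime(2014, 7, 4, 17, 30)))  # ('weekday', '16:00-20:00')
print(time_block(datetime(2014, 7, 5, 13, 5)))   # ('weekend', '12:00-16:00')
```

Changing `BLOCK_HOURS` to 2, 6, or 12 reproduces the alternative layouts that were considered, which makes it easy to see how the number of panels grows or shrinks with the block width.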
  
	
  
Trip Segments
It became clear early on that a key design decision was how to display the trip segments.
Showing all segments produced a jumbled mess that yielded minimal useful information. We tried including all segments while drawing the lines very thin on low-traffic routes and thicker on higher-traffic routes, but the result was still very cluttered. Scaling the line thickness was also problematic because the data was highly skewed, and we did not want a small number of stations to dominate the visualization. Arguably that skew is an aspect of the data that our visualization could highlight better; however, in the final visualization all lines are the same size (relatively thin, but visible), which adds clarity by making the included trips easier to see.
We experimented with both a threshold on the number of trips per 4-hour block and a ranking of the top n trips in each block. The tradeoff was that an absolute trip cutoff leaves the off-hours with literally no data (unless the cutoff is set so low that the peak charts are overwhelmed), whereas showing the top n trips makes traffic appear, at a glance, equal at all times of day when it is actually highly skewed toward certain times.
Ultimately, a two-pronged approach was employed. First, we compromised and chose up to 75 trips for each map, but only included those that averaged at least 5 trips per hour. This allowed us to show data for non-peak times without making it appear that traffic was comparable at all hours of the day. Second, the histograms (explained in greater detail in a subsequent section) add context about which times of day experience the most traffic.
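
The combined rule (at most 75 routes per map, each averaging at least 5 trips per hour) can be sketched as below. This is an illustrative reconstruction, not the code used for the project: the function name `select_routes` and the input format are assumptions, and how the hourly average is computed for each panel is left to the caller.

```python
def select_routes(avg_trips_per_hour, max_routes=75, min_rate=5.0):
    """Pick the routes to draw on one map panel.

    avg_trips_per_hour maps (from_station, to_station) to the route's
    average trips per hour within the panel's 4-hour block.
    """
    # Absolute cutoff: drop routes that are too quiet to be worth drawing.
    eligible = [(route, rate) for route, rate in avg_trips_per_hour.items()
                if rate >= min_rate]
    # Ranking: keep only the busiest routes, up to the per-map cap.
    eligible.sort(key=lambda item: item[1], reverse=True)
    return [route for route, _ in eligible[:max_routes]]


# Toy panel with hypothetical station labels.
panel = {("A", "B"): 12.0, ("B", "C"): 4.9, ("C", "A"): 5.0}
print(select_routes(panel))  # [('A', 'B'), ('C', 'A')]
```

Applying the cutoff before the cap is what lets quiet panels show fewer than 75 routes, so the off-hours maps look visibly sparser than the peak maps.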
  
Background Tile Image
We experimented with a number of different backgrounds. There was a compromise to be made between showing additional context through a more detailed background layer and allocating more of the available pixels to the Divvy bike data. The first iteration used a very plain background that showed nothing beyond the lake and the station locations.

After we received feedback, subsequent iterations included a street map. To select a background, we tried a number of options from the OpenStreetMap package in R. The options that were considered are displayed below with the Divvy bike locations plotted on them. We attempted to strike a balance between giving the viewer enough context to place an individual point on the map and not letting the background dominate.

Plot Area
There are a total of 300 Divvy bike stations in the city, but they are not distributed evenly. In particular, most of the stations, and an overwhelming share of the most popular ones, are found in the Loop and the north side neighborhoods.
We considered several approaches for cropping the map.
1. City limits
• One approach would be to include the entire city in the map. This would highlight the disparities that exist where certain parts of the city have no Divvy stations, which is an interesting aspect of the data: the City sponsors Divvy, but stations are not available in all neighborhoods. However, this approach was not used in the final visualization, because we concluded that cropping the map around the entire city left a lot of "blank" space that could be used more effectively by zooming in on existing locations.

2. Divvy station locations
• This was our chosen approach. It is a compromise between using the space to show the Divvy data and still showing all Divvy stations that exist for this analysis.
3. Zooming in only on stations above a popularity/usage threshold, or focusing on specific neighborhood(s) such as the Loop
• While a more focused analysis of a neighborhood could have been interesting, it could also have masked interesting patterns. For example, the increased weekend activity at the Hyde Park Museum Campus and farther north along the lakefront trail would be lost if the map were zoomed in.
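
For the chosen approach (cropping to the station locations), the map extent can be derived directly from the stations table. The sketch below is a minimal example under stated assumptions: the function name `station_bounds`, the padding of 0.01 degrees, and the three sample stations with their coordinates are all hypothetical.

```python
def station_bounds(stations, margin=0.01):
    """Return (min_lon, min_lat, max_lon, max_lat) covering every station.

    margin is a padding in degrees so that stations do not sit on the map edge.
    """
    lats = [s["latitude"] for s in stations]
    lons = [s["longitude"] for s in stations]
    return (min(lons) - margin, min(lats) - margin,
            max(lons) + margin, max(lats) + margin)


# Made-up stations, only to show the shape of the input.
stations = [
    {"name": "Station A", "latitude": 41.88, "longitude": -87.63},
    {"name": "Station B", "latitude": 41.79, "longitude": -87.60},
    {"name": "Station C", "latitude": 41.95, "longitude": -87.65},
]
bounds = station_bounds(stations)
```

Filtering `stations` by a usage threshold before calling the function would give the third, more zoomed-in option instead.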
  	
  
	
  
 
	
Direction
It would have been very interesting to show the direction of the trips on the maps; however, the approaches we explored did not improve the visualization. We experimented with using color to show direction, but ran into a couple of challenges. The first was deciding which directions to group under the same color. We started with two colors, setting the color of each line according to whether the rider was heading north or south. That added some information, but we concluded it was confusing because in many cases (e.g., during commuting times) the more relevant direction is whether or not the rider is headed toward the Loop. In the next iteration, we used color to indicate whether the rider was headed to or from the Loop, based on a cutoff at Madison Street. This was acceptable, but it was not effective for the popular bike stations in the West Loop near the train stations, and it did not add much value for some of the weekend locations where people are less likely to be commuting.
Although direction was explored, it was not a dimension included in the final visualization.

Histograms
The initial iteration of this visualization included only the small multiple maps, without the histograms that now appear at the top and bottom. The maps illustrate a number of interesting aspects of the data; however, they do not explicitly answer the simple questions that a viewer curious about Divvy bike usage by day of week and time of day might ask. Since this visualization is meant to illustrate usage patterns by weekday/weekend and time of day, we wanted to make it very easy for the viewer to discern a few key facts about the data:
• Usage is much higher during the week than on the weekend.
• Weekday usage has a bimodal distribution, with peaks during the morning and evening commuting times.
• More commuters use the bikes in the evening than in the morning.
• Weekend usage is much lower than weekday usage.
• Weekend usage has a unimodal distribution centered in the early afternoon.
	
  
Our approach for illustrating those key aspects of the data was to build histograms and include them in the chart. We gave some thought to where to place them. The tradeoff was between making the histograms show as much information as possible and avoiding confusion or distraction from the graphs. Ultimately, we erred on the side of clarity and simplicity by putting the histograms above and below their respective maps. The histograms do not provide exact values (there are no labels), but they do illustrate the main themes outlined above. Adding labels and putting the two histograms on the same axis would likely have improved them as standalone visuals, but as part of the larger visual, maintaining clarity was the driving design principle.
We attempted to align the histograms with the small multiples so that each four-hour time period in the small multiples lines up with the corresponding part of the histogram. This adds consistency across the individual components of the visual and lets the histograms aid in the interpretation of the maps.

Network Analysis: Discussion
Overview
For this analysis, the idea was to represent the overall flow between Divvy stations using the whole dataset. There are many possible representations for the Divvy dataset, but because the data is georeferenced, a map can show how this bike sharing system relates to the city and its infrastructure in a compact and accurate view.
Data and Systems Used
For this visualization we used the whole 2014 Divvy Challenge dataset, after some data transformation. First, all of the routes were summarized by grouping them by their origin and destination stations and IDs, which generated a calculated variable holding the record count for each route. Each record was then georeferenced by merging the origin and destination fields (which refer to stations) with their respective geographical coordinates. Because of the data size, this procedure was done in SPSS Modeler, with CSV files as both input and output.
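
The preprocessing itself was done in SPSS Modeler; the Python sketch below only illustrates the same two steps (counting records per route, then attaching coordinates) on toy data. The function name `summarize_routes`, the output field names other than the dataset's own column names, and the sample stations are assumptions made for this example. Stations are matched by name because that is the key listed for the stations table.

```python
from collections import Counter


def summarize_routes(trips, stations):
    """Count trips per origin-destination pair and attach station coordinates."""
    # Step 1: one record count per (origin, destination) route.
    counts = Counter((t["from_station_name"], t["to_station_name"]) for t in trips)
    # Step 2: look up each endpoint's coordinates in the stations table.
    coords = {s["name"]: (s["latitude"], s["longitude"]) for s in stations}
    return [
        {
            "from_station_name": origin,
            "to_station_name": dest,
            "record_count": n,
            "from_latlon": coords[origin],
            "to_latlon": coords[dest],
        }
        for (origin, dest), n in counts.most_common()  # busiest routes first
    ]


# Toy rows with made-up station names and coordinates.
stations = [
    {"name": "Station A", "latitude": 41.88, "longitude": -87.63},
    {"name": "Station B", "latitude": 41.90, "longitude": -87.62},
]
trips = [
    {"from_station_name": "Station A", "to_station_name": "Station B"},
    {"from_station_name": "Station A", "to_station_name": "Station B"},
    {"from_station_name": "Station B", "to_station_name": "Station B"},
]
routes = summarize_routes(trips, stations)
```

Each output row has everything needed to draw one line on the map: two endpoints and a count that can drive the line's width or color.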
  
Once the preprocessing was done, the dataset was opened in ArcGIS 10.2.2. We then separately loaded the dataset containing each Divvy station, as well as a georeferenced CTA stations dataset obtained from the City of Chicago Data Portal. We hypothesized a possible relationship between CTA and Divvy, because commuters may use both forms of transportation, which is why we also explored this supplemental data source.
Methodology
Displaying movement data on maps is tricky because, in addition to the two-dimensional data that maps usually display, there are additional dimensions, including movement and time.
Time-lapse cartography is a direct option: a sequence of overlaid maps of the same region can be used to identify differences across space. When the changes are small and do not cover the whole map, it is better to use a Flow Map or a Network Map.
A Flow Map is designed to represent a relation from one (or a few) source(s) to many destinations. Its use dates back to early representations of flows between countries during the colonial period.

	
  
Figure 1 - Example of a Flow Map. Charles Minard, "Carte figurative et approximative des quantités de vin français exportés par mer en 1864", lith. (835 x 547), 1865. Copied from http://en.wikipedia.org/wiki/Flow_map.
A Network Map, on the other hand, aims to show many-to-many relations between features on a map. A popular use is for airline routes, showing connections between local airports and major hubs:

Figure 2 - American Airlines OneWorld Map (http://www.aa.com/content/images/production/generic/onworld-map.jpg)

Another popular example of a Network Map was recently presented by Facebook, which displayed the connections between groups of users:

Figure 3 - Facebook User Connections (obtained from Facebook.com)
In this case, rather than simply displaying the connections, the lines were overlaid on one another, and the use of transparency produced an accumulated view of these relations, allowing the viewer to see clearly where the traffic comes from and goes to, as well as its intensity.
In our case, the goal was to display all of the routes in the whole dataset, using differential scaling and color grading to make the most used routes stand out and to give an estimate of the usage level, without sampling the data, so that every route remains shown. This took considerable effort, given that almost 2,400,000 trips are described in the dataset, which complicated the georeferencing, the loading, the transformation of the data into lines, and the search for a proper representation in terms of color, transparency, and, most importantly, scaling. Multiscale representation on maps is an old challenge, because such maps are normally interactive. With GIS it is different, because each view can be rendered separately.
Representing the CTA dataset was also a challenge. The idea was to see its influence on the Divvy system. The design was therefore created by calculating a buffer of a certain distance around each CTA station, to show where a possible zone of influence between CTA and Divvy stations lies, which would suggest transfers between the two systems.
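
The buffers in the maps were computed in ArcGIS. As a rough stand-in for that step, the sketch below checks whether a Divvy station falls within a given radius of any CTA station using the haversine great-circle distance. The function names, the 0.5-mile default (matching the half-mile buffers mentioned in the final considerations), and the sample coordinates are assumptions for illustration; a distance test like this is not identical to a GIS buffer polygon.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_buffer(divvy_latlon, cta_latlons, radius_miles=0.5):
    """True if the Divvy station lies inside the buffer of at least one CTA station."""
    lat, lon = divvy_latlon
    return any(haversine_miles(lat, lon, c_lat, c_lon) <= radius_miles
               for c_lat, c_lon in cta_latlons)


# Made-up coordinates: the first Divvy point is about 0.2 miles from the CTA point.
cta = [(41.8830, -87.6300)]
near = within_buffer((41.8800, -87.6300), cta)
far = within_buffer((41.9000, -87.6300), cta)
```

Counting how many Divvy stations return True gives a simple numeric companion to the visual overlap shown by the buffers.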
  
For this, two maps were created: one with a full view of the Divvy stations, and another, at a more detailed level, showing the specifics of this interaction in the Loop.

The maps show the flow among Divvy stations (red dots), with connections drawn in just two data categories: thinner green lines for routes with record counts between log10(1) = 0 and log10(1000) = 3, and thicker red lines for routes with record counts between log10(1000) = 3 and log10(10000) = 4. These categories were normalized with logarithmic scaling because route counts differ by orders of magnitude, so that different magnitudes can be properly represented on the same graph.
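The two-category scheme described above can be sketched as follows. This is a minimal Python illustration, not the ArcGIS workflow used for the maps; the function name `flow_category`, the sample routes, and the choice to place a route with exactly 1000 trips in the red class are all assumptions, since the text lists the boundary value in both categories.

```python
import math

def flow_category(trip_count):
    """Classify a route by the base-10 logarithm of its trip count."""
    if trip_count < 1:
        raise ValueError("a route needs at least one recorded trip")
    magnitude = math.log10(trip_count)
    # Below 10^3 trips: thin green line. From 10^3 upward: thick red line
    # (assumption: the shared boundary value goes to the red class).
    return "green" if magnitude < 3 else "red"

# Hypothetical route counts, for illustration only.
route_counts = {("A", "B"): 12, ("A", "C"): 950, ("B", "C"): 4200}
categories = {route: flow_category(n) for route, n in route_counts.items()}
```

Working on the logarithm rather than the raw count keeps a route with a dozen trips and a route with several thousand trips on a scale that can be drawn with only two line weights.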
  
	
  
Final Considerations
  
Some inferences can be drawn from these maps. However, it is important to remember that this analysis does not establish causation in these relationships.
  	
  
On the first map, which gives a broader view of the Divvy stations in Chicago, there is a high concentration of routes along the lakeshore that thins out inland, clearly shown by the predominance of red flows in the east grading to green. The map also shows that no flow lines are drawn for the southern and western stations, or for many northern ones. Flow exists at those stations, but it is not represented on this map because the origin and destination stations were often the same. Considering the half-mile buffers centered on each CTA station, and setting aside the Loop, where traffic and stations overlap heavily, there appears to be a relation between green flows and CTA stations, with lines distributed more radially inland and a few toward the shore. These may be commuters connecting between their homes, workplaces, or leisure spots and the CTA, and taking Divvy bikes for the remainder of the trip. Anonymization also limits our ability to merge these datasets and perform a deeper analysis of commuting patterns.
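The half-mile buffer test can be approximated outside a GIS with a great-circle distance check. The sketch below is an illustration in Python rather than the ArcGIS buffer tool used for the maps; the station names and coordinates are hypothetical, and the haversine formula on a spherical Earth is an assumption that is adequate at this scale.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def stations_in_buffer(divvy_stations, cta_stations, radius_miles=0.5):
    """Return Divvy stations lying within radius_miles of any CTA station."""
    inside = []
    for name, (lat, lon) in divvy_stations.items():
        if any(haversine_miles(lat, lon, clat, clon) <= radius_miles
               for clat, clon in cta_stations.values()):
            inside.append(name)
    return inside

# Hypothetical coordinates, for illustration only.
cta = {"cta_x": (41.880, -87.630)}
divvy = {"near": (41.882, -87.630), "far": (41.950, -87.700)}
buffered = stations_in_buffer(divvy, cta)
```

A station that passes this test is only a candidate for a CTA connection; as noted above, the anonymized data cannot confirm that any individual rider actually transferred.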
  
The second map focuses on the Loop. The main concentration of flows is around the Grand Central Station (Metra) within the Loop, the Merchandise Mart, and the Magnificent Mile. Most of these flows can be inferred to be people traveling to and from work, because these three areas are closely tied to the main commuter train stations. A second major trend appears between the Adams/Wabash CTA station and Navy Pier, Millennium Park, the Chicago Yacht Club, and, farther south, the Adler and Field Museums. Given these characteristics, it is also reasonable to suggest that these high-intensity routes relate more to tourism than to work, judging by the use and occupation of the space. The high traffic near the lakeshore stations reinforces this, as many people go there for leisure activities at the beach and parks.
  
One lesson we took away from this part of the project is that map visualizations and their interpretation can be improved by using a full GIS system rather than just a map plot. A GIS adds interactivity and tools designed for spatial analysis, allowing the end user to explore the initial dataset and also to integrate it with others, extending the spatial analysis.
  
	
  
	
  
 
	
  
Circular Network Visualization: Discussion and Related Analysis
  
	
  
Visualization techniques:

Below are the top visualizations created as part of the same analysis that led to the circular network map visualization:
  
	
  
1. Divvy bikes rush hours

Ø Description:	
  
These	
  visualizations	
  highlight	
  the	
  utilization	
  time	
  of	
  Divvy	
  Bikes.	
  	
  
The	
  heatmap	
  in	
  plots	
  the	
  “hours	
  of	
  the	
  day”	
  in	
  the	
  x-­‐axis	
  and	
  the	
  “days	
  of	
  the	
  week”	
  in	
  the	
  y-­‐
axis.	
  The	
  count	
  of	
  the	
  rented	
  bikes	
  is	
  represented	
  through	
  the	
  heatmap	
  matrix.	
  Colors	
  of	
  the	
  
heatmap	
  varies	
  between:	
  (Green	
  –	
  Yellow	
  -­‐	
  Red)	
  in	
  response	
  to	
  the	
  bikes	
  “count”	
  levels	
  which	
  I	
  
believe	
  clearly	
  draw	
  our	
  eyes	
  to	
  the	
  peak	
  hours	
  plotted	
  in	
  the	
  red/orange	
  colors	
  of	
  the	
  heatmap.	
  
Throughout	
  my	
  analysis	
  and	
  examining	
  the	
  data,	
  I	
  release	
  the	
  importance	
  of	
  segregating	
  users’	
  
types:	
  (Subscribers	
  &	
  Customers)	
  in	
  separate	
  plots	
  for	
  almost	
  all	
  my	
  visualizations.	
  
On	
  the	
  left	
  side,	
  there	
  is	
  a	
  plot	
  of	
  data	
  clustering	
  of	
  (day	
  of	
  the	
  week)	
  based	
  on	
  which	
  levels	
  on	
  
the	
  y-­‐axis	
  are	
  sorted.	
  
 
	
  
	
  
Ø Data	
  Analysis:	
  
Subscribers’	
  heatmap:	
  
It	
  appears	
  that	
  the	
  bikes’	
  highest	
  demand	
  during	
  the	
  weekdays	
  moves	
  along	
  the	
  rush	
  hours	
  
(7:00	
  –	
  8:00	
  and	
  16:00	
  –	
  18:00).	
  There	
  is	
  also	
  a	
  small	
  -­‐-­‐	
  but	
  worth	
  mentioning	
  -­‐-­‐	
  and	
  spread	
  
demand	
  of	
  the	
  bikes	
  between	
  10:00	
  and	
  15:00	
  during	
  the	
  weekends.	
  
Subscribers	
  would	
  also	
  tend	
  to	
  rent/return	
  bikes	
  at	
  relatively	
  late	
  times	
  on	
  Friday	
  night	
  and	
  
Saturday	
  night	
  -­‐-­‐	
  displayed	
  through	
  the	
  lighter	
  green	
  color.	
  
We	
  can	
  also	
  see	
  that	
  subscribers	
  seem	
  to	
  be	
  leaving	
  their	
  work	
  a	
  little	
  early	
  (or	
  on	
  time)	
  on	
  
Friday	
  and	
  therefore	
  return	
  home	
  a	
  little	
  early	
  probably	
  for	
  some	
  weekend	
  plans.	
  
On	
  a	
  similar	
  note,	
  we	
  can	
  use	
  the	
  scale	
  to	
  approximately	
  count	
  number	
  of	
  rentals	
  per	
  hour.	
  It	
  
seems	
  that	
  more	
  people	
  use	
  the	
  bikes	
  to	
  return	
  home	
  rather	
  than	
  going	
  to	
  work.	
  Probably,	
  
people	
  avoid	
  arriving	
  to	
  work	
  sweating	
  and	
  tired	
  or	
  they	
  avoid	
  arriving	
  late	
  to	
  work	
  and	
  
therefore	
  they	
  prefer	
  to	
  arrive	
  refreshed	
  and	
  on	
  time.	
  
Customers’ heatmap:

Non-subscribers (casual customers) show an inverse pattern. Their highest demand is on weekends between 10:00 and 19:00. On weekdays, most demand occurs between 11:00 and 18:00.

The cluster analysis also shows that more customers use Divvy bikes on the first and last working days of the week (Monday and Friday). I can only assume that tourists visiting the city tend to take one day off work (Monday or Friday) for a longer weekend, which would explain the busier traffic on the first and last weekdays.
  
	
  
Ø History	
  of	
  revisions:	
  
The	
  revision	
  of	
  this	
  final	
  heatmap	
  evolved	
  over	
  time.	
  I	
  started	
  with	
  the	
  simple	
  heatmap	
  function	
  
that	
  was	
  covered	
  in	
  class.	
  Then,	
  I	
  made	
  some	
  more	
  research	
  about	
  other	
  available	
  heatmaps	
  in	
  
R	
  to	
  discover	
  the	
  newly	
  created	
  heatmap3	
  package	
  -­‐-­‐	
  launched	
  on	
  June	
  2014.	
  
Some	
  further	
  revisions	
  were	
  implemented	
  on	
  the	
  map	
  including:	
  colors,	
  data	
  cluster,	
  scale	
  and	
  
axis.	
  	
  
One	
  very	
  tricky	
  part	
  was	
  to	
  reformat	
  the	
  data	
  in	
  a	
  matrix-­‐style,	
  which	
  is	
  necessary	
  for	
  heatmaps.	
  
Re-­‐grouping	
  the	
  data	
  rows	
  by	
  their	
  corresponding	
  hours	
  and	
  days	
  took	
  a	
  lot	
  of	
  time	
  and	
  
research.	
  	
  
In	
  the	
  final	
  graph,	
  the	
  two	
  users-­‐types	
  were	
  separated	
  to	
  enable	
  more	
  in-­‐depth	
  analysis.	
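The matrix reshaping step can be sketched as below. The report's heatmaps were built in R with heatmap3; this Python version is only an illustration of the regrouping logic. The function name `hourly_matrix`, the `(start_time, user_type)` tuple layout, and the timestamp format are assumptions, not the actual Divvy file layout.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]  # row labels

def hourly_matrix(trips, user_type, time_format="%Y-%m-%d %H:%M"):
    """Count trips per (weekday, hour) for one user type.

    Returns a 7 x 24 list of lists: one row per weekday (Monday first),
    one column per hour of the day.
    """
    counts = Counter()
    for start_time, trip_user_type in trips:
        if trip_user_type == user_type:
            ts = datetime.strptime(start_time, time_format)
            counts[(ts.weekday(), ts.hour)] += 1
    return [[counts[(day, hour)] for hour in range(24)] for day in range(7)]

# Hypothetical trip records, for illustration only.
trips = [
    ("2014-06-30 08:15", "Subscriber"),
    ("2014-06-30 08:40", "Subscriber"),
    ("2014-07-05 13:05", "Customer"),
]
subscriber_matrix = hourly_matrix(trips, "Subscriber")
customer_matrix = hourly_matrix(trips, "Customer")
```

Building one matrix per user type mirrors the decision to plot Subscribers and Customers separately, since their weekday and weekend patterns run in opposite directions.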
  
	
  
[Figures: heatmap revisions, Version 1 to Version 4]

  
2. Divvy bikes traffic flow among Chicago districts

  
	
  
Ø Description:	
  
These	
  two	
  visualizations	
  highlight	
  the	
  traffic	
  flow	
  of	
  Divvy	
  bikes	
  between	
  Chicago	
  Districts	
  for	
  
both	
  subscribers	
  and	
  customers.	
  	
  
We	
  can	
  see	
  an	
  inner	
  arc	
  and	
  three	
  outer	
  arcs.	
  The	
  inner	
  arc	
  represent	
  Chicago	
  districts	
  (six	
  
districts	
  are	
  present	
  in	
  this	
  database)	
  each	
  with	
  a	
  different	
  color.	
  Each	
  district	
  has	
  two	
  different	
  
sets	
  of	
  arrows/lines:	
  the	
  set	
  that	
  has	
  the	
  same	
  color	
  as	
  the	
  district	
  represents	
  outgoing	
  traffic	
  
(divvy	
  bikes)	
  starting	
  from	
  that	
  location	
  -­‐-­‐	
  whereas	
  the	
  set	
  with	
  different	
  color	
  than	
  the	
  district	
  
represents	
  incoming	
  traffic	
  arriving	
  to	
  that	
  location.	
  The	
  overall	
  magnitude	
  of	
  district	
  traffic	
  is	
  
represented	
  through	
  the	
  scale	
  in	
  the	
  inner	
  arc	
  -­‐-­‐	
  whereas	
  the	
  magnitude	
  of	
  each	
  arrow/route	
  is	
  
represented	
  by	
  the	
  thickness	
  of	
  the	
  arrow.	
  
Each	
  one	
  of	
  the	
  outer	
  arcs	
  represents	
  a	
  percentage	
  of	
  traffic	
  flow.	
  The	
  first	
  arc	
  (the	
  very	
  outer	
  
one)	
  shows	
  the	
  percentage	
  weight	
  of	
  the	
  overall	
  (incoming	
  and	
  outgoing)	
  traffic	
  in	
  that	
  
particular	
  one	
  district.	
  The	
  second	
  outer	
  arc	
  shows	
  the	
  percentage	
  of	
  the	
  incoming	
  traffic.	
  The	
  
third	
  arc	
  shows	
  the	
  percentage	
  of	
  the	
  outgoing	
  traffic.	
  These	
  arcs	
  are	
  mainly	
  used	
  for	
  
comparison	
  purposes.	
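The three percentages shown on the outer arcs can be derived from a table of district-to-district trip counts. The sketch below is a hedged Python illustration, not the software that drew the diagrams; the function name `district_shares`, the sample counts, and the definition of the overall share as the average of the incoming and outgoing shares are assumptions.

```python
def district_shares(flows):
    """Compute outgoing, incoming, and overall traffic shares per district.

    flows maps (origin_district, destination_district) to a trip count.
    """
    total = sum(flows.values())
    outgoing, incoming = {}, {}
    for (origin, destination), trips in flows.items():
        outgoing[origin] = outgoing.get(origin, 0) + trips
        incoming[destination] = incoming.get(destination, 0) + trips
    shares = {}
    for district in sorted(set(outgoing) | set(incoming)):
        out_n = outgoing.get(district, 0)
        in_n = incoming.get(district, 0)
        shares[district] = {
            "outgoing": out_n / total,
            "incoming": in_n / total,
            # Assumption: each trip counts once at each end, so divide by 2 * total.
            "overall": (out_n + in_n) / (2 * total),
        }
    return shares

# Hypothetical trip counts, for illustration only.
flows = {
    ("North Side", "North Side"): 50,
    ("North Side", "Loop"): 30,
    ("Loop", "North Side"): 20,
}
shares = district_shares(flows)
```

With this definition the overall shares of all districts sum to one, which makes the outermost arc directly comparable across districts.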
  
 
	
  
	
  
Ø Data	
  Analysis:	
  
Subscribers’	
  network	
  diagram:	
  
From	
  the	
  traffic	
  scale,	
  it	
  appears	
  that	
  the	
  North	
  Side	
  is	
  the	
  busiest	
  location	
  with	
  the	
  largest	
  
traffic	
  whereas	
  the	
  South	
  Side	
  is	
  the	
  least	
  busy	
  area.	
  A	
  lot	
  of	
  users	
  commute	
  to	
  stations	
  within	
  
the	
  North	
  Side	
  Area	
  or	
  to	
  North	
  Loop.	
  Chicago	
  North	
  Side	
  is	
  considered	
  to	
  be	
  the	
  most	
  densely	
  
populated	
  residential	
  area3
,	
  which	
  explains	
  the	
  heavy	
  Divvy	
  traffic.	
  
Interestingly,	
  both	
  the	
  Loop	
  and	
  West	
  Loop	
  areas	
  have	
  almost	
  the	
  same	
  magnitude	
  of	
  traffic	
  
flow,	
  although	
  I	
  was	
  expecting	
  a	
  busier	
  traffic	
  being	
  in	
  the	
  city	
  center.	
  In	
  addition,	
  traffic	
  in	
  West	
  
Loop	
  and	
  South	
  Loop	
  looks	
  almost	
  symmetrical.	
  When	
  comparing	
  the	
  two	
  outer	
  arcs	
  (incoming	
  
and	
  outgoing),	
  we	
  can	
  see	
  that	
  both	
  are	
  identical.	
  They	
  have	
  the	
  same	
  magnitude	
  and	
  even	
  the	
  
same	
  colors	
  order,	
  which	
  means	
  that	
  we	
  have	
  a	
  very	
  consistence	
  traffic	
  flow	
  in	
  these	
  two	
  areas.	
  
We	
  can	
  also	
  note	
  minimal	
  traffic	
  between	
  far-­‐apart	
  areas	
  such	
  as:	
  Southern	
  and	
  Northern	
  areas.	
  
Customers’ network diagram:

As in the subscribers’ diagram, the North Side is still the busiest area for casual customers. Interestingly, the two diagrams show exactly the same trend for the North Side: major traffic occurs within the North Side and to the North Loop.

Another interesting observation is that the West Loop has shrunk significantly, yet it is still very symmetrical. Unlike subscribers, casual customers are less interested in using the Union and Ogilvie train stations, which account for heavy traffic flow to and from the suburbs.

In both diagrams, trips to and from the North Side and North Loop account for between 50% and 70% of the overall traffic at Divvy stations.
  	
  
	
  
	
  
Ø History	
  of	
  revisions:	
  
At	
  the	
  beginning,	
  I	
  was	
  not	
  sure	
  what	
  the	
  best	
  way	
  to	
  visualize	
  the	
  traffic	
  from/to	
  Divvy	
  stations	
  
in	
  a	
  meaningful	
  way.	
  I	
  started	
  with	
  a	
  simple	
  heatmap	
  to	
  display	
  network	
  in	
  a	
  simple	
  and	
  
effective	
  way.	
  It	
  worked	
  just	
  fine	
  but	
  it	
  was	
  not	
  very	
  evocative	
  and	
  conclusion	
  did	
  not	
  stand	
  out.	
  
Then,	
  I	
  came	
  across	
  the	
  new	
  software	
  and	
  tried	
  to	
  map	
  all	
  the	
  250	
  stations	
  into	
  one	
  network	
  
diagram.	
  The	
  graph	
  was	
  not	
  expressive	
  and	
  had	
  a	
  spaghetti	
  shape.	
  The	
  names	
  of	
  the	
  stations	
  
were	
  overlapping	
  and	
  the	
  thickness	
  of	
  the	
  lines	
  did	
  not	
  have	
  any	
  insights.	
  
An	
  important	
  suggestion	
  rises	
  during	
  the	
  final	
  presentation	
  to	
  group	
  the	
  stations	
  together.	
  So,	
  I	
  
made	
  an	
  attempt	
  to	
  group	
  stations	
  by	
  Chicago	
  76	
  neighborhoods.	
  The	
  diagram	
  became	
  much	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  
3
Source:	
  http://en.wikipedia.org/wiki/Community_areas_in_Chicago#North_side
 
	
   25	
  
better	
  but	
  required	
  some	
  additional	
  grouping.	
  The	
  final	
  visualization	
  did	
  group	
  the	
  
neighborhoods	
  further	
  by	
  their	
  geographical	
  locations	
  (Chicago	
  districts4
).	
  
	
  
The tricky (and most time-consuming) part was grouping stations by their locations. As an international student, I found it a very fun exercise to get to know the different neighborhoods of Chicago :). I used Wikipedia and the Chicago Portal to go through the locations precisely and build the final version.
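The grouping step amounts to a lookup from station to district followed by a count per district pair. The sketch below illustrates this in Python under stated assumptions: the station names, the `station_to_district` mapping, and the function name `aggregate_by_district` are hypothetical, and the real mapping was compiled by hand from Wikipedia and the Chicago Portal.

```python
from collections import Counter

def aggregate_by_district(trips, station_to_district):
    """Collapse station-level trips into district-to-district trip counts.

    trips is an iterable of (origin_station, destination_station) pairs.
    Stations missing from the mapping are collected rather than guessed.
    """
    flows = Counter()
    unmapped = set()
    for origin, destination in trips:
        if origin not in station_to_district or destination not in station_to_district:
            unmapped.update(s for s in (origin, destination) if s not in station_to_district)
            continue
        flows[(station_to_district[origin], station_to_district[destination])] += 1
    return flows, unmapped

# Hypothetical stations and districts, for illustration only.
station_to_district = {"Station A": "North Side", "Station B": "North Side", "Station C": "Loop"}
trips = [
    ("Station A", "Station B"),
    ("Station A", "Station C"),
    ("Station C", "Station A"),
    ("Station A", "Station Z"),  # Station Z has no district assigned
]
flows, unmapped = aggregate_by_district(trips, station_to_district)
```

Reducing 250 stations to a handful of districts is what turns the spaghetti-shaped graph into a readable one, at the cost of hiding variation inside each district.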
  
	
  
[Figures: network diagram revisions, Version 1 to Version 5]

⁴ Source: http://en.wikipedia.org/wiki/Community_areas_in_Chicago

  
3. Divvy bike rentals over the seasons of the year
  
	
  
Ø Description:	
  
I	
  have	
  been	
  trying	
  to	
  plot	
  a	
  time	
  series	
  visualization	
  as	
  this	
  idea	
  seems	
  unique	
  and	
  came	
  to	
  me	
  
suddenly	
  while	
  I	
  was	
  doing	
  some	
  research.	
  The	
  x-­‐axis	
  represents	
  the	
  timeline	
  of	
  2014.	
  The	
  y-­‐axis	
  
has	
  dual	
  axes.	
  The	
  left	
  axis	
  represents	
  the	
  total	
  bikes	
  rentals	
  whereas	
  the	
  right	
  axis	
  represents	
  
the	
  mean	
  temperature	
  in	
  (ᴏ
F).	
  The	
  two	
  lines	
  are	
  differentiated	
  by	
  two	
  different	
  colors.	
  
	
  
Ø Data	
  Analysis:	
  
We	
  are	
  here	
  trying	
  to	
  see	
  the	
  correlation	
  between	
  the	
  temperature	
  and	
  number	
  of	
  rented	
  bikes	
  
over	
  time.	
  	
  
	
  
We	
  know	
  that	
  it	
  is	
  difficult	
  to	
  use	
  bikes	
  during	
  rainy	
  or	
  snowy	
  seasons,	
  which	
  is	
  in	
  general	
  
associated	
  with	
  temperature.	
  During	
  the	
  months	
  of	
  December	
  through	
  March	
  where	
  
temperature	
  is	
  around	
  20	
  ᴏ
F,	
  we	
  have	
  the	
  least	
  bikes’	
  rental	
  activities.	
  However,	
  when	
  
temperature	
  starts	
  warming	
  up,	
  rentals	
  start	
  to	
  pick	
  up	
  till	
  it	
  reaches	
  its	
  peak	
  season:	
  June	
  
through	
  September	
  where	
  temperature	
  is	
  around	
  70	
  ᴏ
F.	
  
	
  
We	
  can	
  see	
  that	
  both	
  curves	
  are	
  almost	
  identical	
  with	
  the	
  exception	
  of	
  a	
  few	
  outlier	
  days	
  where	
  
we	
  had	
  a	
  dramatic	
  drop	
  or	
  rise	
  of	
  rentals	
  against	
  the	
  trend.	
  These	
  outliers	
  can	
  be	
  further	
  
explored	
  using	
  a	
  new	
  dimension	
  of	
  dataset	
  specialized	
  in	
  Chicago	
  events	
  possibly.	
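The visual similarity of the two curves could be quantified with a correlation coefficient. The sketch below computes Pearson's r in plain Python; it is an illustration only, the report does not state that this statistic was calculated, and the sample values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical daily values, for illustration only.
mean_temperature_f = [20, 35, 50, 70, 72]
daily_rentals = [400, 1500, 4000, 9000, 9500]
r = pearson(mean_temperature_f, daily_rentals)
```

A value of r close to 1 would support the reading that rentals track temperature, though, as with the maps, correlation alone does not establish that temperature causes the change.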
  	
  
	
  
Ø History	
  of	
  revisions:	
  
The	
  most	
  challenging	
  part	
  here	
  was	
  the	
  necessity	
  of	
  having	
  a	
  continuous	
  timeline	
  in	
  order	
  to	
  
have	
  an	
  accurate	
  time	
  series.	
  Since	
  I	
  was	
  working	
  on	
  a	
  sample	
  data,	
  I	
  had	
  to	
  go	
  back	
  and	
  work	
  
 
	
   28	
  
with	
  the	
  original	
  and	
  complete	
  dataset	
  (2.4MM	
  data	
  rows)	
  for	
  this	
  visualization.	
  I	
  tried	
  every	
  
time	
  to	
  plot	
  the	
  time	
  data	
  against	
  different	
  statistical	
  variables	
  in	
  order	
  to	
  explore	
  different	
  
aspect	
  of	
  the	
  data.	
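One way to guarantee a continuous timeline is to generate every calendar day in the range and fill in a count of zero where no trips were recorded. This is a minimal Python sketch of that idea, not the procedure used in the report (which went back to the complete dataset); the function name `fill_daily_counts` and the sample counts are assumptions.

```python
from datetime import date, timedelta

def fill_daily_counts(counts, start, end):
    """Return (day, count) pairs for every day from start to end inclusive.

    counts maps a date to the number of rentals on that day; days missing
    from the mapping are reported with a count of zero.
    """
    n_days = (end - start).days + 1
    timeline = []
    for offset in range(n_days):
        day = start + timedelta(days=offset)
        timeline.append((day, counts.get(day, 0)))
    return timeline

# Hypothetical daily counts with a gap on January 2, for illustration only.
sample_counts = {date(2014, 1, 1): 5, date(2014, 1, 3): 7}
timeline = fill_daily_counts(sample_counts, date(2014, 1, 1), date(2014, 1, 4))
```

Filling with zero is itself a modeling choice: on a sampled dataset a missing day may simply mean no trip from that day was drawn, which is why working from the complete dataset was the safer route.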
  
	
  
[Figures: time series revisions, Version 1 to Version 3]

  
4. Divvy bike station capacities and concentrations
  
	
  
	
  
Ø Description:	
  
The	
  two	
  visualizations	
  associate	
  the	
  geographical	
  variables	
  with	
  numerical	
  variables	
  in	
  order	
  to	
  
have	
  a	
  more	
  meaningful	
  and	
  informative	
  impact.	
  
Number	
  of	
  docks	
  (bikes	
  spaces)	
  is	
  plotted	
  against	
  stations	
  locations	
  in	
  the	
  first	
  graph.	
  A	
  color	
  
scale	
  of	
  the	
  stations’	
  docks	
  size	
  is	
  displayed	
  to	
  quickly	
  identify	
  larger	
  stations.	
  
The	
  second	
  graph	
  plots	
  divvy	
  bikes	
  users	
  in	
  accordance	
  to	
  their	
  stations	
  locations.	
  The	
  frequency	
  
of	
  rentals	
  is	
  plotted	
  using	
  a	
  darker	
  color	
  scale.	
  	
  
 
	
  
	
  
Ø Data	
  Analysis:	
  
The	
  first	
  graph	
  shows	
  the	
  geographical	
  distribution	
  of	
  the	
  Divvy	
  bike	
  stations	
  around	
  the	
  city	
  of	
  
Chicago.	
  The	
  colors	
  indicate	
  the	
  docks	
  capacity	
  of	
  each	
  station.	
  The	
  red	
  color	
  stations	
  are	
  those	
  
with	
  large	
  capacity	
  of	
  docks	
  namely	
  around:	
  Navy	
  Pier,	
  Millennium	
  Park	
  and	
  Union/Ogilvie	
  
Station.	
  I’ve	
  also	
  added	
  a	
  contour	
  of	
  (level	
  7)	
  to	
  further	
  draw	
  a	
  smoother	
  look	
  to	
  the	
  map.	
  The	
  
locations	
  with	
  larger	
  number	
  of	
  docks	
  indicate	
  more	
  rental	
  transactions	
  occur	
  there	
  and	
  
therefore	
  two	
  possibilities:	
  either	
  stations	
  run	
  out	
  of	
  bikes	
  (high	
  demand	
  as	
  a	
  start	
  off)	
  or	
  
stations	
  run	
  out	
  of	
  space	
  (high	
  returns	
  as	
  final	
  destination).	
  In	
  either	
  case,	
  increase	
  of	
  docks	
  
space	
  was	
  deemed	
  necessary.	
  
The	
  second	
  graph	
  includes	
  designing	
  the	
  geographical	
  map	
  by	
  the	
  “users-­‐type”.	
  This	
  graph	
  
enables	
  us	
  to	
  better	
  explore	
  commuters’	
  final	
  destinations.	
  	
  
It	
  appears	
  that	
  most	
  “customer”	
  bikers	
  commute	
  North	
  to	
  the	
  Lincoln	
  Park	
  Zoo	
  and	
  all	
  the	
  way	
  
South	
  to	
  the	
  Museum	
  Campus	
  passing	
  by	
  the	
  Magnificent	
  Mile	
  and	
  Navy	
  Pier.	
  The	
  “Subscribers”	
  
bikers	
  however	
  cluster	
  around	
  the	
  Loop	
  busy	
  area	
  and	
  mainly	
  around	
  the	
  Union/Ogilvie	
  
Stations.	
  This	
  graph	
  helps	
  us	
  in	
  identifying	
  dense	
  locations	
  of	
  certain	
  divvy	
  users	
  and	
  therefore	
  
might	
  utilize	
  this	
  information	
  for	
  marketing	
  or	
  customer	
  advocacy	
  purposes.	
  
	
  
Ø History	
  of	
  revisions:	
  
I	
  have	
  started	
  plotting	
  this	
  data	
  in	
  Tableau	
  where	
  it	
  nicely	
  plotted	
  the	
  data	
  points	
  (longitude	
  and	
  
latitude)	
  on	
  the	
  map.	
  However,	
  I	
  could	
  not	
  move	
  much	
  further	
  from	
  there.	
  Therefore,	
  I	
  used	
  
another	
  software	
  that	
  worked	
  better	
  with	
  GIS	
  data	
  in	
  order	
  to	
  build	
  a	
  more	
  detailed	
  map	
  using	
  
OpenStreetMap.	
  The	
  2nd
	
  version	
  included	
  some	
  unexpected	
  users-­‐type	
  (dependent),	
  which	
  
appeared	
  to	
  be	
  a	
  special	
  case.	
  A	
  recommendation	
  rose	
  during	
  the	
  final	
  presentation	
  to	
  exclude	
  
this	
  category	
  as	
  it	
  was	
  supposed	
  to	
  be	
  cleaned	
  by	
  Divvy	
  Technical	
  Team.	
  
	
  
[Figures: station map revisions, Version 1 and Version 2]

  
Summary of Team Member Contributions

Throughout the project, the team collaborated on many pieces of the project, including:
• Sharing output of data pre-processing steps
• Suggesting software that could be relevant for various tasks
• Providing feedback on visuals
• Reviewing final analysis
Each of the three team members was the primary author of one of the final visualizations (and of the main section discussing it).
Matt Siedlecki: made the traffic by weekday/weekend and time of day patterns using small multiples
Ricardo B. Lourenço: created the network map focused on the Loop, using log scaling to summarize the full dataset
Hassan A Al Alaiwi: prepared the circular network visual, as well as several related pieces of analysis
	
  
	
  
	
  
	
  


Divvy Bike Visualizations

  • 1. Divvy  Bike  Challenge  Visualizations     CSC  465    -­‐  DATA  VISUALIZATION:  FINAL  PROJECT     GROUP  #1:   HASSAN  AL  ALAIWI,  RICARDO  LOURENÇO,  AND  MATT  SIEDLECKI          March   2015  
  • 2. Contents
Abstract (p. 3)
Description (p. 3)
Scope (p. 3)
Dataset (p. 3)
Dataset Variables (p. 4)
Final Visualizations (p. 5)
    Usage by Weekday/Weekend and Time of Day (p. 5)
    Network Map of Chicago Loop (p. 6)
    Circular Network Visualization (p. 7)
Discussion (p. 8)
    Usage by Weekday/Weekend and Time of Day (p. 8)
    Network Map of Chicago Loop (p. 13)
    Circular Network Visualization and Related Analysis (p. 19)
Summary of Team Member Contributions (p. 31)
  • 3. Abstract
In this final project report, we shed light on our final data visualizations of the chosen Divvy bikes dataset. The report details each visualization technique used to display information about the dataset and its possible correlation implications. To keep the report coherent and concrete, a few segments have been taken from the project milestones that were previously submitted as part of the project progress. Higher resolution files of the final visualizations are enclosed.

Description
Every year, Divvy launches a data challenge, releasing its trip dataset so that participants can scrutinize and visualize the data under different categories. This year, Divvy released its first full-year dataset (2014), with over 3.2 million rows of data, which makes the challenge more demanding and more enticing for data scientists and other participants.

Scope
We are tasked with fulfilling the CSC 465 project objective of visualizing a dataset using the best visualization techniques discussed throughout the course. Since the dataset includes mapping and geographical data, we completed multiple graphs that visualize the dataset both statistically and geographically. The software used for this purpose was RStudio, Tableau, JMP, and ArcGIS, which together provided a sufficient platform for our objective. These visualizations answer the following questions [1] as clearly and accurately as we could manage:
    • When and where are riders going?
    • What are the most and least busy stations?
    • What interesting usage patterns emerge?
    • How can the riders' demographics be presented?

Dataset
The full-year dataset is broken down by quarter, with a total of 2.4 million records. However, due to the size of the dataset, we mainly used a simple random sample of the data to obtain a manageable 100,000 records, which is still sufficient for our educational purposes.
Source: Chicago Divvy Bikes website (annual data challenge)
Website: http://www.divvybikes.com/datachallenge

[1] Some of these inquiries are part of the 2015 Divvy Data Challenge.
[2] Description of the dataset variables was provided by the Divvy Data Challenge.
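As an illustration only (the report does not include the code used for this step), drawing a simple random sample without replacement can be sketched in Python as follows. The function name, the seed, and the toy records are assumptions made for the example; the actual sample was 100,000 of the trip records.

```python
import random


def simple_random_sample(records, k, seed=None):
    """Return k records drawn uniformly at random, without replacement.

    If k is at least the number of records, all records are returned.
    """
    rng = random.Random(seed)  # a private generator keeps the draw reproducible
    if k >= len(records):
        return list(records)
    return rng.sample(records, k)


# Toy stand-in for the trip table: one dict per trip (hypothetical data).
trips = [{"trip_id": i} for i in range(1_000)]
sample = simple_random_sample(trips, 100, seed=465)
print(len(sample))
```

Fixing the seed makes the same sample reproducible across runs, which helps when several visualizations are meant to describe the same subset of trips.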
  • 4. Dataset Variables
The dataset consists of two tables (sub-datasets), which will be used in the project [2].

Trips dataset (the main dataset):
This dataset is the main project dataset. It includes a record of every trip, created whenever a bike is rented from a station. The following 12 variables are captured in every data row:
    trip_id: ID attached to each trip taken – (Type: Category – serial key)
    starttime: day and time trip started, in CST – (Type: Date&Time)
    stoptime: day and time trip ended, in CST – (Type: Date&Time)
    bikeid: ID attached to each bike – (Type: Category)
    tripduration: time of trip in seconds – (Type: Numeric)
    from_station_name: name of station where trip originated – (Type: Category)
    to_station_name: name of station where trip terminated – (Type: Category)
    from_station_id: ID of station where trip originated – (Type: Category)
    to_station_id: ID of station where trip terminated – (Type: Category)
    usertype: "Customer" is a rider who purchased a 24-Hour Pass; "Subscriber" is a rider who purchased an Annual Membership – (Type: Category)
    gender: gender of rider – (Type: Binary)
    birthyear: birth year of rider – (Type: Numeric)

Stations dataset (table relationship dataset):
This relations dataset includes the location details of Divvy stations, which will be used in the project to map the start and end locations of each bike trip in the main table.
The 5 variables are:
    name: station name – (Type: Category)
    latitude: station latitude – (Type: GPS location)
    longitude: station longitude – (Type: GPS location)
    dpcapacity: number of total docks at each station as of 12/31/2014 – (Type: Numeric)
    online date: date the station went live in the system – (Type: Date&Time)
  • 5. Final Visualizations
Visualization #1: Divvy Bike Usage by Weekday/Weekend and Time of Day
  • 6. Visualization #2: Network Map of Chicago Loop
  • 7. Visualization #3: Circular Network Visualization
  • 8. Divvy Bike Usage by Time of Day and Day of Week: Discussion

Overview
For this visualization, small multiple maps were combined with histograms to display information about Divvy bike usage by time of day and day of week. Looking horizontally, the maps show data in 4-hour blocks starting at midnight. The histograms at the top and bottom display total system usage.

The segments drawn on the map illustrate the popular routes, selected using a combination of thresholds for the number of segments on a map and the minimum usage needed to be considered for the map. Additional discussion of the design decisions made to select which bike trip segments were included follows in the design considerations section.

The viewer is able to discern a number of pieces of information from this graphic, including:
    • Usage is much higher during the week than on the weekend.
    • Weekday usage has a bimodal distribution with peaks during morning and evening commuting times.
  • 9.
    • More commuters use the bikes in the evening than in the morning.
    • Weekend usage is much lower than weekday usage.
    • Weekend usage has a unimodal distribution centered in the early afternoon.
    • In general, off-peak hours have riders scattered throughout the city, especially near train stations, while usage is more concentrated during the day.
    • Weekend usage is heavily concentrated along the lakeshore, Lincoln Park, Navy Pier, and some smaller tourist locations such as the Hyde Park Museum Campus.
    • The lakeshore path is more prominent in evening commute hours than in morning commute hours, possibly due to higher system usage during that time.

Design Considerations

Number of Small Multiple Maps
The final visualization splits the entire day into 4-hour blocks and shows 6 maps for weekday rides and 6 for weekend rides. We chose to use an equal number of hours in each map to keep the graphic clear to the viewer. That gave us the option of 2, 4, 6, or 12-hour blocks. Twelve-hour blocks were not seriously considered because they would not show many interesting patterns in the data. On the other hand, 2-hour blocks would have doubled the number of maps in the final visualization, and we concluded that would be too much. Ultimately, 4-hour blocks were convenient because they clearly differentiate the afternoon (Noon-4PM) from the commuting hours after 4PM.

Trip Segments
It became clear early on that a key design decision was how to display the trip segments. Showing all segments produced a jumbled mess that yielded minimal useful information.
We attempted including all segments, making the lines very thin on low-traffic routes and thicker on higher-traffic routes, but still found the result very cluttered. Additionally, scaling the line thickness was problematic because the data was highly skewed and we did not want a small number of stations to dominate the visualization. Arguably that is an aspect of the data that could be better highlighted in my visualization; however, in the final visualization all lines are the same (relatively thin, but viewable) size, which adds clarity by making it easier to see the trips that are included.

We experimented with both an absolute threshold on the number of trips per 4-hour block and a ranking of the top n trips in each block. The tradeoff was that an absolute trip cutoff leaves the off-hours with literally no data (unless you overwhelm the peak charts), while showing the top n trips makes it appear at a glance that traffic is equal at all times of day when it is actually highly skewed toward certain times of day.

Ultimately a two-pronged approach was employed to deal with this. First, we compromised and chose up to 75 trips for each map, but only included them if they averaged at least 5 trips per hour. This allowed us to show data for non-peak times without making it appear visually that traffic was comparable at all hours of the day. Second, the histograms (which will be explained in greater detail in a subsequent section) add context about which times of day experience the most traffic.

  • 10. Background Tile Image
We experimented with a number of different backgrounds. There was a compromise to be made between showing additional context in a more detailed background layer and allocating more of the available pixels to data about the Divvy bikes. The first iteration used a very plain background that showed nothing beyond the lake and station locations.

After getting feedback, subsequent iterations included a street map. To select a background, a number of options from the OpenStreetMap package in R were tried. The options that were considered are displayed below with the Divvy bike locations plotted on them. We attempted to strike a balance between giving the viewer enough context to place an individual point on the map and keeping the background from dominating.
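The route-selection rule described under Trip Segments (at most 75 routes per map, each averaging at least 5 trips per hour) can be sketched as below. This is an illustrative Python reading of the rule, not the code used for the visualization: the function name, the input shape (a mapping from route to its average trips per hour within one 4-hour block), and the toy numbers are all assumptions.

```python
def select_routes(avg_trips_per_hour, max_routes=75, min_per_hour=5.0):
    """Pick the routes to draw on one small-multiple map.

    avg_trips_per_hour maps (from_station, to_station) to the route's average
    trips per hour within a single 4-hour block. Routes below the minimum
    rate are dropped first; the busiest max_routes of the rest are kept.
    """
    eligible = [
        (route, rate)
        for route, rate in avg_trips_per_hour.items()
        if rate >= min_per_hour
    ]
    eligible.sort(key=lambda item: item[1], reverse=True)  # busiest first
    return [route for route, _ in eligible[:max_routes]]


# Hypothetical rates for one block.
peak_block = {("A", "B"): 12.0, ("B", "C"): 6.5, ("C", "A"): 2.0}
print(select_routes(peak_block))
```

Applying the rate filter before the top-n cut is what keeps quiet blocks visibly sparse: a block with few qualifying routes simply shows fewer lines instead of being padded up to 75.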
  • 11. Plot Area
There are a total of 300 Divvy bike stations in the city, but they are not distributed evenly. In particular, most of the station locations, and an overwhelming number of the most popular ones, are found in the Loop and north side neighborhoods.

We considered several approaches for cropping the map.
1. City limits
    • One approach would be to include the entire city in the map. This would highlight the discrepancies that exist where certain parts of the city have no Divvy stations. The lack of Divvy stations in certain areas of the city was an interesting aspect of the data: the City sponsors Divvy, but stations are not available in all neighborhoods. However, this approach was not used in the final visualization, because we concluded that cropping the map around the entire city left a lot of "blank" space that could be used more effectively by zooming in on existing locations.
2. Divvy station locations
    • This was my chosen approach. It was a compromise between using the space to show the Divvy data and still showing all Divvy stations that exist for this analysis.
3. Zooming in only on stations above a popularity/usage threshold, or focusing on specific neighborhood(s) such as the Loop
    • While a more focused analysis of a neighborhood could have been interesting, it could also have masked interesting patterns. For example, the increased weekend activity around the Hyde Park Museum Campus and farther north along the lakefront trail would be lost if the map were zoomed in.
  • 12. Direction
It would have been very interesting to show the direction of the trips on the maps; however, the approaches that were explored did not improve the visualization. We experimented with using color to show direction, but ran into a couple of challenges. The first challenge was defining which directions to group under each color. I started by using two colors and adjusting the color of the line depending on whether the rider was heading north or south. That added some information, but I concluded that it was confusing, because in many cases (e.g., commuting times) the more relevant direction is whether or not you are headed toward the Loop. So in the next iteration I used color to indicate whether you were headed to or from the Loop, based on a cutoff at Madison Street. This was okay, but I felt that it was not effective for the popular bike stations in the West Loop near the train stations, and that it did not add much value for some of the weekend locations where people are less likely to be commuting.
Although direction was explored, it was not a dimension included in the final visualization.

Histograms
The initial iteration of this visualization included only the small multiple maps, without the histograms that are now included at the top and bottom. The maps illustrate a number of interesting aspects of the data; however, they do not explicitly answer simple questions that you might have if you were curious about Divvy bike usage by day of week and time of day.
Specifically, since this visualization attempted to illustrate usage patterns by weekday/weekend and time of day, we wanted to make it very easy for the viewer to discern a few key facts about the data:
    • Usage is much higher during the week than on the weekend.
    • Weekday usage has a bimodal distribution with peaks during morning and evening commuting times.
    • More commuters use the bikes in the evening than in the morning.
    • Weekend usage is much lower than weekday usage.
    • Weekend usage has a unimodal distribution centered in the early afternoon.

Our approach for illustrating those key aspects of the data was to make histograms and include them in the chart. There was some thought as to where to place the histograms. It was a tradeoff between making the histograms show as much information as possible and avoiding confusion or distraction from the maps. Ultimately, I erred on the side of clarity and simplicity by putting the histograms above and below their respective maps. The histograms do not provide exact values (there are no labels), but they do illustrate the main themes outlined above. Adding labels and putting the two histograms on the same axis would likely have improved the histograms as standalone visuals, but as part of the larger visual, maintaining clarity was the driving design principle. We attempted to align the histograms with the small multiples so that each four-hour period in the small multiples lines up with the corresponding portion of the histogram, adding consistency across the individual components of the visual and helping the histograms aid in the interpretation of the maps.
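For illustration, the counts behind such weekday and weekend histograms, and the mapping from an hour to its 4-hour map block, can be sketched in Python as follows. The visualization itself was not built with this code; the function names and the sample timestamps are assumptions for the example.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(start_times):
    """Return (weekday, weekend) Counters of trips per start hour (0-23)."""
    weekday, weekend = Counter(), Counter()
    for ts in start_times:
        # datetime.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
        bucket = weekend if ts.weekday() >= 5 else weekday
        bucket[ts.hour] += 1
    return weekday, weekend


def four_hour_block(hour):
    """Index of the 4-hour block containing this hour (0 = midnight to 4AM)."""
    return hour // 4


# Hypothetical trip start times.
starts = [
    datetime(2014, 7, 7, 8, 15),   # Monday morning
    datetime(2014, 7, 7, 17, 5),   # Monday evening
    datetime(2014, 7, 7, 17, 40),  # Monday evening
    datetime(2014, 7, 5, 13, 30),  # Saturday afternoon
]
weekday, weekend = hourly_counts(starts)
print(dict(weekday), dict(weekend), four_hour_block(17))
```

Keeping hourly resolution in the histogram while the maps use 4-hour blocks means six consecutive groups of four bars correspond to the six maps in a row.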
  • 13. Network Analysis: Discussion

Overview
For this analysis, the idea was to represent the overall flow between Divvy stations using the whole dataset. There are many possible representations for the Divvy dataset, but because the data is georeferenced, a map lets you see how this bike sharing system relates to the city and its infrastructure in a compact and accurate view.

Data and Systems Used
For this visualization we used the whole 2014 Divvy Challenge dataset after some data transformation. First, all routes were summarized by grouping trips by their origin and destination stations and IDs. This generated a calculated variable holding the record count for each route. Next, each record was georeferenced by merging the origin and destination fields (related to stations) with their respective geographical coordinates. This procedure was done in SPSS Modeler, due to the data size, with CSV files as input and output.
Once the preprocessing was done, the dataset was opened in ArcGIS 10.2.2. The dataset containing each Divvy station was then loaded separately, and we also loaded a georeferenced CTA stations dataset obtained from the City of Chicago Data Portal. We hypothesized a possible relationship between CTA and Divvy, because commuters may use both forms of transportation, which is why we also explored this supplemental data source.

Methodology
Displaying movement data using maps is tricky because, in addition to the two-dimensional data that maps usually display, we also have additional dimensions including movement and time.
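The preprocessing described under Data and Systems Used was done in SPSS Modeler. Purely as an illustration of the same two steps (counting trips per origin-destination pair, then attaching station coordinates), a Python sketch could look like this. The function names, the choice to key stations by name, and the station names and coordinates are assumptions for the example.

```python
from collections import Counter


def count_routes(trips):
    """Count trips per (from_station_name, to_station_name) pair."""
    return Counter((t["from_station_name"], t["to_station_name"]) for t in trips)


def georeference(route_counts, stations):
    """Attach origin and destination coordinates to each route count.

    stations maps station name to (latitude, longitude). Routes that mention
    a station missing from the stations table are skipped.
    """
    rows = []
    for (origin, dest), n in sorted(route_counts.items()):
        if origin in stations and dest in stations:
            rows.append({
                "from": origin, "to": dest, "trips": n,
                "from_lat": stations[origin][0], "from_lon": stations[origin][1],
                "to_lat": stations[dest][0], "to_lon": stations[dest][1],
            })
    return rows


# Hypothetical stations and trips.
stations = {"S1": (41.88, -87.63), "S2": (41.89, -87.62), "S3": (41.79, -87.60)}
trips = [
    {"from_station_name": "S1", "to_station_name": "S2"},
    {"from_station_name": "S1", "to_station_name": "S2"},
    {"from_station_name": "S2", "to_station_name": "S1"},
    {"from_station_name": "S3", "to_station_name": "S3"},
]
routes = georeference(count_routes(trips), stations)
print(routes[0])
```

Note that a route whose origin and destination are the same station (S3 to S3 here) has identical endpoints, so it has no length when drawn as a line on a map.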
Time-lapse cartography is a direct option: you can use a sequence of overlaid maps of the same region to try to identify differences across space. When you are dealing with small changes that do not affect the whole map, it is better to use a Flow Map or a Network Map.
A Flow Map is designed to represent a relation of one (or a few) source(s) to many destinations. Its use dates back to early representations of relations between countries in the colonial period.
  • 14. Figure 1 - Example of a Flow Map. Minard, C. J. "Carte figurative et approximative des quantités de vin français exportés par mer en 1864". lith. (835 x 547), 1865. Copied from http://en.wikipedia.org/wiki/Flow_map.
A Network Map, on the other hand, aims to show relations of many to many features on a map. A popular use is for airline routes, with connections between local airports and major hubs:
Figure 2 - American Airlines OneWorld Map (http://www.aa.com/content/images/production/generic/onworld-map.jpg)
  • 15. Another popular example of a Network Map was recently presented by Facebook, which displayed the connections between groups of users:
Figure 3 - Facebook User Connections (obtained on Facebook.com)
In this case, rather than just displaying the connections, they were overlaid on one another, and the use of transparency produced an accumulated view of these relations, allowing the viewer to see clearly where the traffic comes from and goes to, as well as its intensity.
For our case, the goal was to display all the routes in the whole dataset, using differential scaling and color grading to make the most used routes stand out, along with an estimate of the usage level, but without sampling the data, so that all routes remain shown. This was a considerable effort, given that almost 2,400,000 trips are described in the dataset, which complicated georeferencing, loading, and transforming the data into lines, as well as finding a proper representation in terms of color, transparency, and, most importantly, scaling. Multiscale representation on maps is an old challenge, because they are normally interactive. With GIS it is different, because each view can be rendered separately.
Representing the CTA dataset was also a challenge. The idea was to see its influence on the Divvy system. So the design was created by calculating a buffer at a certain distance around each CTA station, to see where a possible zone of influence between CTA and Divvy stations is located, suggesting commuting between those systems.
For this, two maps were created: one with a whole view of the Divvy stations, and another, at a more detailed level, to show the specifics of this interaction in the Loop.
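The buffers themselves were computed in ArcGIS. As a rough stand-in for that operation, the Python sketch below tests which Divvy stations fall within a fixed great-circle distance of any CTA station, using the haversine formula. It is a point-distance check, not a true buffer polygon. The half-mile default follows the buffer radius mentioned in the Final Considerations; the function names, station labels, and coordinates are assumptions for the example.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def stations_in_buffer(divvy, cta, radius_miles=0.5):
    """Names of Divvy stations within radius_miles of at least one CTA station."""
    return [
        name
        for name, (lat, lon) in divvy.items()
        if any(haversine_miles(lat, lon, c_lat, c_lon) <= radius_miles
               for c_lat, c_lon in cta)
    ]


# Hypothetical coordinates: "near" is about 0.35 miles north of the CTA
# station, "far" is about 1.4 miles north of it.
divvy = {"near": (41.885, -87.63), "far": (41.90, -87.63)}
cta = [(41.88, -87.63)]
print(stations_in_buffer(divvy, cta))
```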
  • 16.
  • 17.
  • 18. The presented maps show the flow among Divvy stations (red dots), with connections drawn in just two data categories: thinner green lines for routes with record counts between log10(1) = 0 and log10(1,000) = 3, and thicker red lines for routes with record counts between log10(1,000) = 3 and log10(10,000) = 4. These categories were normalized with logarithmic scaling because of the difference in magnitude between routes, as a way to properly represent different orders of magnitude on the same graph.

Final Considerations
Looking at those maps, some inferences can be made. However, it is important to remember that this analysis does not suggest causation in these relationships.
On the first map, with a broader view of the Divvy stations in Chicago, it is possible to see a high concentration of routes along the lakeshore, thinning out inland, clearly shown by the predominance of red flows in the east grading to green. On this map it is also possible to see that for the southern and western stations, as well as many northern ones, no flow lines are drawn. Flow exists at those stations, but it is not represented on this map because the origin and destination station were often the same place. Considering the half-mile buffers drawn around each CTA station, it is possible to see that, apart from the Loop, where much traffic and many stations overlap, there is a relation between green flows and CTA stations, with a more radial distribution of lines running inland and a few toward the shore. Perhaps these are commuters connecting from their homes, workplaces, or leisure destinations to the CTA and then taking Divvy bikes for the remainder of their trip. Anonymization also limits our ability to merge these datasets and perform a deeper analysis of commuting patterns.
The second map focuses on the Loop. The main concentration of flows is in the surroundings of the Grand Central Station (Metra), within the Loop, Merchandise Mart, and the Magnificent Mile. Most of these can be inferred to be people going to and from work, because these three areas are closely tied to the main commuter train stations. A second major trend is seen around the Adams/Wabash CTA station and Navy Pier, Millennium Park, the Chicago Yacht Club, and, to the south, the Adler and Field Museums. With these characteristics, it is also possible to suggest that these high-intensity routes are related more to tourism than to work, given the use and occupation of the space. The high traffic near the lakeshore stations reinforces this, as those are places where many people go for leisure activities at the beach and parks.
A lesson we took away from this part of the project was that it is possible to improve map visualizations and interpretations by using a full GIS system rather than just a map plot. It adds interactivity and tools designed for spatial analysis, allowing the end user not only to explore the initial dataset but also to integrate it with others, extending the spatial analysis.
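The two-class logarithmic symbology described for these maps (thin green lines below 10^3 trips, thick red lines from 10^3 up to 10^4) can be sketched as follows. The maps were styled in ArcGIS, so this Python function is only an illustration. Treating a count of exactly 1,000 as red, and drawing any count above 10,000 as red as well, are assumptions, since the text does not say how those cases were handled.

```python
import math


def classify_route(count):
    """Return (color, width) for a route from its trip count on a log10 scale."""
    if count < 1:
        raise ValueError("a drawn route needs at least one trip")
    magnitude = math.log10(count)  # 0 for 1 trip, 3 for 1,000 trips
    if magnitude < 3:
        return ("green", "thin")
    # Assumption: everything at or above 10^3 trips is drawn in the red class.
    return ("red", "thick")


for n in (1, 250, 999, 1500, 9000):
    print(n, classify_route(n))
```

Binning on the exponent instead of the raw count keeps a handful of very busy routes from flattening every other route into a single class.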
  • 19. Circular Network Visualization: Discussion and Related Analysis

Visualization techniques:
Below are the top visualizations that were created as part of the same analysis that led to the circular network map visualization:

1. Divvy bikes rush hours

Description:
These visualizations highlight the utilization time of Divvy bikes.
The heatmap plots the hours of the day on the x-axis and the days of the week on the y-axis. The count of rented bikes is represented in the heatmap matrix. The heatmap colors vary from green through yellow to red in response to the bike count levels, which I believe clearly draws the eye to the peak hours plotted in the red/orange colors of the heatmap.
Throughout my analysis and examination of the data, I realized the importance of separating the user types (Subscribers and Customers) into separate plots for almost all my visualizations.
On the left side, there is a plot of the clustering of the days of the week, based on which the levels on the y-axis are sorted.
  • 20. Data Analysis:
Subscribers' heatmap:
It appears that the highest demand for bikes on weekdays falls in the rush hours (7:00 – 8:00 and 16:00 – 18:00). There is also a small, but worth mentioning, and spread-out demand for bikes between 10:00 and 15:00 on weekends.
Subscribers also tend to rent or return bikes relatively late on Friday and Saturday nights, as displayed by the lighter green color.
We can also see that subscribers seem to leave work a little early (or on time) on Friday and therefore return home a little early, probably for weekend plans.
On a similar note, we can use the scale to approximate the number of rentals per hour. It seems that more people use the bikes to return home than to go to work. People probably want to avoid arriving at work sweaty and tired, or arriving late, and therefore prefer another way to arrive refreshed and on time.
Customers' heatmap:
Non-subscribers (casual customers) have an inverse demand. Their highest demand is on weekends between 10:00 and 19:00. On weekdays, most demand occurs between 11:00 and 18:00.
It is also worth mentioning, based on the cluster analysis, that more customers use Divvy bikes on the first and last working days of the week (Monday and Friday respectively). I would only assume that tourists visiting the city tend to take one day off work (Monday or Friday) to have a longer weekend, which would explain the busier traffic on the first and last weekdays.

History of revisions:
This final heatmap evolved over several revisions.
I started with the simple heatmap function that was covered in class. Then I did more research on other heatmap options in R and discovered the newly created heatmap3 package, launched in June 2014.
Some further revisions were made to the map, including colors, data clustering, scale, and axes.
One very tricky part was reformatting the data into the matrix layout that heatmaps require. Regrouping the data rows by their corresponding hours and days took a lot of time and research.
In the final graph, the two user types were separated to enable more in-depth analysis.
Version 1   Version 2
  • 21. Version 3   Version 4
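The reshaping step mentioned in the revision history, turning trip rows into a day-by-hour count matrix for one user type, can be illustrated in Python as below. The heatmaps in this report were built in R with heatmap3, so this is only a sketch of the idea; the function name and the sample trips are assumptions.

```python
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def day_hour_matrix(trips, usertype):
    """Build a 7 x 24 matrix of trip counts (rows Mon..Sun, columns hour 0..23).

    trips is an iterable of (usertype, starttime) pairs; only rows matching
    the requested usertype are counted.
    """
    matrix = [[0] * 24 for _ in DAYS]
    for kind, start in trips:
        if kind == usertype:
            matrix[start.weekday()][start.hour] += 1
    return matrix


# Hypothetical trips.
trips = [
    ("Subscriber", datetime(2014, 7, 7, 8, 15)),   # Monday morning
    ("Subscriber", datetime(2014, 7, 7, 8, 50)),   # Monday morning
    ("Subscriber", datetime(2014, 7, 7, 17, 5)),   # Monday evening
    ("Customer", datetime(2014, 7, 5, 13, 30)),    # Saturday afternoon
]
subscribers = day_hour_matrix(trips, "Subscriber")
customers = day_hour_matrix(trips, "Customer")
print(subscribers[0][8], customers[5][13])
```

Building one matrix per user type mirrors the decision to plot Subscribers and Customers separately.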
  • 22. 2. Divvy bikes traffic flow among Chicago districts
  • 23. Description:
These two visualizations highlight the traffic flow of Divvy bikes between Chicago districts for both subscribers and customers.
We can see an inner arc and three outer arcs. The inner arc represents the Chicago districts (six districts are present in this database), each with a different color. Each district has two different sets of arrows/lines: the set that has the same color as the district represents outgoing traffic (Divvy bikes) starting from that location, whereas the set with a different color than the district represents incoming traffic arriving at that location. The overall magnitude of district traffic is represented by the scale in the inner arc, whereas the magnitude of each arrow/route is represented by the thickness of the arrow.
Each of the outer arcs represents a percentage of traffic flow. The first arc (the outermost one) shows the percentage weight of the overall (incoming and outgoing) traffic in that particular district. The second outer arc shows the percentage of the incoming traffic. The third arc shows the percentage of the outgoing traffic. These arcs are mainly used for comparison purposes.
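To illustrate the quantities behind such a diagram, the Python sketch below aggregates station-to-station trips into district-to-district flows and computes each district's share of outgoing and incoming trips. It is not the software used for the circular visualization. The function name, the station-to-district lookup, and the toy trips are assumptions; the shares are taken relative to the total number of trips, which is one possible reading of the percentages shown on the arcs.

```python
from collections import Counter


def district_flows(trips, district_of):
    """Aggregate (from_station, to_station) trips into district-level flows.

    Returns (flows, shares): flows counts trips per (origin district,
    destination district); shares gives each district's outgoing and
    incoming trips as a percentage of all trips.
    """
    flows = Counter((district_of[o], district_of[d]) for o, d in trips)
    total = sum(flows.values())
    outgoing, incoming = Counter(), Counter()
    for (origin, dest), n in flows.items():
        outgoing[origin] += n
        incoming[dest] += n
    shares = {
        name: {
            "outgoing_pct": 100 * outgoing[name] / total,
            "incoming_pct": 100 * incoming[name] / total,
        }
        for name in sorted(set(outgoing) | set(incoming))
    }
    return flows, shares


# Hypothetical station-to-district lookup and trips.
district_of = {1: "North Side", 2: "North Side", 3: "Loop", 4: "West Loop"}
trips = [(1, 2), (1, 3), (3, 1), (4, 3)]
flows, shares = district_flows(trips, district_of)
print(shares["North Side"])
```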
  • 24. Data Analysis:
Subscribers' network diagram:
From the traffic scale, it appears that the North Side is the busiest location, with the largest traffic, whereas the South Side is the least busy area. Many users commute to stations within the North Side area or to the North Loop. Chicago's North Side is considered to be the most densely populated residential area [3], which explains the heavy Divvy traffic.
Interestingly, the Loop and West Loop areas have almost the same magnitude of traffic flow, although I was expecting busier traffic in the city center. In addition, traffic in the West Loop and South Loop looks almost symmetrical. When comparing the two outer arcs (incoming and outgoing), we can see that they are identical. They have the same magnitude and even the same color order, which means that we have a very consistent traffic flow in these two areas.
We can also note minimal traffic between far-apart areas, such as the southern and northern areas.
Customers' network diagram:
As in the subscribers' diagram, the North Side is still the busiest location for casual customers. What is interesting is that the two diagrams show exactly the same trend in the North Side: major traffic occurs within the North Side and to the North Loop.
Another interesting observation is that the West Loop has shrunk significantly, yet it is still very symmetrical. Unlike subscribers, casual customers are less interested in using the Union and Ogilvie train stations, which account for heavy traffic flow to and from the suburbs.
In both models, trips from/to the North Side and North Loop account for between 50% and 70% of the overall traffic at Divvy stations.

➢ History of revisions:

At the beginning, I was not sure how best to visualize the traffic from/to Divvy stations in a meaningful way. I started with a simple heatmap to display the network in a simple and effective way. It worked fine, but it was not very evocative and the conclusions did not stand out.

Then I came across the new software and tried to map all 250 stations into one network diagram. The graph was not expressive and had a spaghetti shape. The station names overlapped, and the thickness of the lines did not convey any insight.

An important suggestion arose during the final presentation: group the stations together. So I made an attempt to group stations by Chicago's 76 neighborhoods. The diagram became much better but required some additional grouping. The final visualization groups the neighborhoods further by their geographical locations (Chicago districts⁴).

The tricky (and most time-consuming) part was grouping stations by their locations. As an international student, I found it a very fun exercise to get to know the different neighborhoods of Chicago ☺. I used Wikipedia and the Chicago Portal to go through the locations precisely and build the final version.

[Figures: Version 1, Version 2, Version 3, Version 4, Version 5]

³ Source: http://en.wikipedia.org/wiki/Community_areas_in_Chicago#North_side
⁴ Source: http://en.wikipedia.org/wiki/Community_areas_in_Chicago
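The two-level grouping described in the revision history (station to neighborhood, then neighborhood to district) can be sketched as a pair of lookup tables. The station names and trip counts below are hypothetical placeholders, and the small mapping is illustrative only; it is not the assignment actually used for the final diagram.

```python
from collections import Counter

# Hypothetical lookup tables; a real version would cover every station.
station_to_neighborhood = {
    "Station A": "Lincoln Park",
    "Station B": "Lake View",
    "Station C": "Loop",
}
neighborhood_to_district = {
    "Lincoln Park": "North Side",
    "Lake View": "North Side",
    "Loop": "Loop",
}

def district_of(station):
    """Resolve a station to its district, or 'Unassigned' if unmapped."""
    neighborhood = station_to_neighborhood.get(station)
    return neighborhood_to_district.get(neighborhood, "Unassigned")

# Hypothetical station-level trip counts: (from_station, to_station) -> trips.
station_flows = {
    ("Station A", "Station B"): 40,
    ("Station A", "Station C"): 25,
    ("Station C", "Station B"): 10,
    ("Station C", "Station D"): 3,  # Station D is deliberately unmapped
}

# Collapse station-level flows into district-level flows.
district_flows = Counter()
for (start, end), count in station_flows.items():
    district_flows[(district_of(start), district_of(end))] += count

for pair, count in sorted(district_flows.items()):
    print(pair, count)
```

Keeping an explicit "Unassigned" bucket makes it easy to spot stations that still need to be placed by hand.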
3. Divvy bike rentals across the seasons

➢ Description:

I wanted to plot a time series visualization; the idea seemed unique and came to me while I was doing some research. The x-axis represents the timeline of 2014. There are two y-axes: the left axis represents the total bike rentals, whereas the right axis represents the mean temperature (°F). The two lines are drawn in different colors.

➢ Data Analysis:

Here we are trying to see the correlation between the temperature and the number of rented bikes over time.

We know that it is difficult to use bikes during rainy or snowy seasons, which are in general associated with temperature. During the months of December through March, when the temperature is around 20 °F, bike rental activity is at its lowest. However, when the temperature starts warming up, rentals pick up until they reach the peak season, June through September, when the temperature is around 70 °F.

We can see that both curves are almost identical, with the exception of a few outlier days on which rentals dropped or rose dramatically against the trend. These outliers could be explored further by adding another dimension, possibly a dataset of Chicago events.

➢ History of revisions:

The most challenging part here was the need for a continuous timeline in order to have an accurate time series. Since I had been working on a sample of the data, I had to go back and work with the original, complete dataset (2.4 million rows) for this visualization. In each version, I plotted the time data against a different statistical variable in order to explore a different aspect of the data.

[Figures: Version 1, Version 2, Version 3]
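The relationship between temperature and rentals can also be summarized numerically. The sketch below counts rentals per day and computes a Pearson correlation against the daily mean temperature. The three sample days and their trip counts are invented for illustration (only the 20 °F and 70 °F levels echo the text), and the report itself does not state that a correlation coefficient was computed.

```python
from collections import Counter
from datetime import date
from math import sqrt

# Hypothetical trip start dates (one entry per rental).
trip_dates = (
    [date(2014, 1, 15)] * 2
    + [date(2014, 4, 15)] * 5
    + [date(2014, 7, 15)] * 9
)

# Hypothetical daily mean temperature in degrees Fahrenheit.
mean_temp_f = {
    date(2014, 1, 15): 20.0,
    date(2014, 4, 15): 50.0,
    date(2014, 7, 15): 70.0,
}

# Total rentals per calendar day (the left axis of the chart).
rentals_per_day = Counter(trip_dates)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Align both series on the same continuous list of days.
days = sorted(mean_temp_f)
temps = [mean_temp_f[d] for d in days]
rentals = [rentals_per_day[d] for d in days]  # missing days count as 0

r = pearson(temps, rentals)
print(f"correlation between temperature and rentals: {r:.2f}")
```

Iterating over the temperature calendar rather than the trip dates keeps days with zero rentals in the series, which matters for the continuous timeline mentioned above.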
4. Divvy bike station capacities and concentrations

➢ Description:

The two visualizations associate geographical variables with numerical variables in order to be more meaningful and informative.

In the first graph, the number of docks (bike spaces) is plotted against station location. A color scale for dock count is displayed so that larger stations can be identified quickly.

The second graph plots Divvy bike users according to their station locations. The frequency of rentals is encoded with a color scale, with darker colors for higher frequency.
➢ Data Analysis:

The first graph shows the geographical distribution of the Divvy bike stations around the city of Chicago. The colors indicate the dock capacity of each station. The red stations are those with a large dock capacity, namely around Navy Pier, Millennium Park and Union/Ogilvie Station. I also added a contour (level 7) to give the map a smoother look. Locations with more docks indicate that more rental transactions occur there, which leaves two possibilities: either stations run out of bikes (high demand as a starting point) or stations run out of space (many returns as a final destination). In either case, an increase in dock space was deemed necessary.

The second graph breaks the geographical map down by user type. This graph enables us to better explore commuters' final destinations.

It appears that most "customer" bikers ride north to the Lincoln Park Zoo and all the way south to the Museum Campus, passing by the Magnificent Mile and Navy Pier. The "subscriber" bikers, however, cluster around the busy Loop area and mainly around the Union/Ogilvie stations. This graph helps identify where particular types of Divvy users are concentrated, information that could be used for marketing or customer advocacy purposes.

➢ History of revisions:

I started by plotting this data in Tableau, which plotted the data points (longitude and latitude) nicely on the map. However, I could not get much further from there. Therefore, I used other software that worked better with GIS data in order to build a more detailed map using OpenStreetMap. The second version included an unexpected user type (dependent), which appeared to be a special case. A recommendation arose during the final presentation to exclude this category, as it was supposed to have been cleaned by the Divvy technical team.

[Figures: Version 1, Version 2]
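The cleaning step and the per-station counts behind the second map can be sketched as follows. The trip records are hypothetical, and the exact user-type labels ("Subscriber", "Customer", "Dependent") are assumed from the description above.

```python
from collections import Counter

# Hypothetical trip records: (start_station, user_type).
trips = [
    ("Station A", "Subscriber"),
    ("Station A", "Subscriber"),
    ("Station A", "Customer"),
    ("Station B", "Customer"),
    ("Station B", "Dependent"),  # the special case to be excluded
]

# User types dropped before plotting (assumed label).
EXCLUDED_USER_TYPES = {"Dependent"}

# Rental frequency per (station, user type); drives the color scale.
rentals = Counter(
    (station, user_type)
    for station, user_type in trips
    if user_type not in EXCLUDED_USER_TYPES
)

for (station, user_type), count in sorted(rentals.items()):
    print(station, user_type, count)
```

Joining these counts to each station's latitude and longitude gives the values a mapping tool would need to color the points.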
Summary of Team Member Contributions

Throughout the project, the team collaborated on many pieces of the project, including:
• Sharing output of data pre-processing steps
• Suggesting software that could be relevant for various tasks
• Providing feedback on visuals
• Reviewing final analysis

Each of the three team members was the primary author of one of the final visualizations (and of the main section discussing it).

Matt Siedlecki: made the visualization of traffic by weekday/weekend and time of week patterns using small multiples

Ricardo B. Lourenço: created the network map focused on the Loop, utilizing log-scaling to summarize the full dataset

Hassan A Al Alaiwi: prepared the circular network visual, as well as several related pieces of analysis