An Introduction to Data Intensive Computing

Chapter 3: Processing Big Data

Robert Grossman
University of Chicago
Open Data Group

Collin Bennett
Open Data Group

November 14, 2011
  
1. Introduction (0830-0900)
   a. Data clouds (e.g. Hadoop)
   b. Utility clouds (e.g. Amazon)
2. Managing Big Data (0900-0945)
   a. Databases
   b. Distributed File Systems (e.g. Hadoop)
   c. NoSQL databases (e.g. HBase)
3. Processing Big Data (0945-1000 and 1030-1100)
   a. Multiple Virtual Machines & Message Queues
   b. MapReduce
   c. Streams over distributed file systems
4. Lab using Amazon's Elastic MapReduce (1100-1200)
  
	
  
Section 3.1
Processing Big Data Using Utility and Data Clouds

[Image: a Google production rack of servers from about 1999.]
  
• How do you do analytics over commodity disks and processors?
• How do you improve the efficiency of programmers?
  
Serial & SMP Algorithms
[Diagram: a serial algorithm runs a single task against its local disk and memory; a symmetric multiprocessing (SMP) algorithm runs several tasks against shared local disk and memory.]
  
Pleasantly (= Embarrassingly) Parallel
• Need to partition data, start tasks, and collect results (see the sketch below).
• Often the tasks are organized into a DAG.
[Diagram: three nodes, each with a local disk running three tasks, coordinated via MPI.]
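As a tiny, concrete sketch of the partition/start/collect pattern above, assuming nothing beyond the Python standard library (the word-count task is a stand-in for real work):

from multiprocessing import Pool

def task(partition):
    # Stand-in for real work: count the words in one partition of the data.
    return sum(len(line.split()) for line in partition)

if __name__ == '__main__':
    data = ["it was the best of times", "it was the worst of times"] * 1000
    partitions = [data[i::4] for i in range(4)]  # partition the data
    pool = Pool(4)                               # start tasks
    counts = pool.map(task, partitions)          # collect results
    pool.close()
    print(sum(counts))                           # combine: 12000 words in total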
  
How Do You Program A Data Center?
  
The Google Data Stack
• The Google File System (2003)
• MapReduce: Simplified Data Processing… (2004)
• BigTable: A Distributed Storage System… (2006)
  
Google's Large Data Cloud
Google's early data stack, circa 2000:
[Diagram, top to bottom:]
  Applications
  Data Services: Google's BigTable
  Compute Services: Google's MapReduce
  Storage Services: Google File System (GFS)
  
Hadoop's Large Data Cloud (Open Source)
Hadoop's stack:
[Diagram, top to bottom:]
  Applications
  Data Services: NoSQL, e.g. HBase
  Compute Services: Hadoop's MapReduce
  Storage Services: Hadoop Distributed File System (HDFS)
  
A very nice recent book by Barroso and Holzle.
  
The Amazon Data Stack

"Amazon uses a highly decentralized, loosely coupled, service oriented architecture consisting of hundreds of services. In this environment there is a particular need for storage technologies that are always available. For example, customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados." (SOSP '07)
  
Amazon Style Data Cloud
[Diagram: a Load Balancer in front of two pools of EC2 instances; the Simple Queue Service connects the tiers; S3 Storage Services and SimpleDB (SDB) sit behind them.]
  
Open Source Versions
• Eucalyptus
  – Ability to launch VMs
  – S3-like storage
• OpenStack
  – Ability to launch VMs
  – S3-like storage (Swift)
• Cassandra
  – Key-value store like S3
  – Columns like BigTable
• Many other open source Amazon-style services are available.
  
Some Programming Models for Data Centers
• Operations over a data center of disks
  – MapReduce ("string-based" scans of data)
  – User-Defined Functions (UDFs) over the data center
  – Launch VMs that all have access to highly scalable and available disk-based data
  – SQL and NoSQL over the data center
• Operations over a data center of memory
  – Grep over distributed memory
  – UDFs over distributed memory
  – Launch VMs that all have access to highly scalable and available memory-based data
  – SQL and NoSQL over distributed memory
  
Section 3.2
Processing Data By Scaling Out Virtual Machines
  
Processing Big Data Pattern 1:
Launch Independent Virtual Machines and Task Them via a Messaging Service
  
Task With Messaging Service & Use S3 (Variant 1)
[Diagram: a Control VM launches and tasks Worker VMs; messaging services (AWS SQS, an AMQP service, etc.) carry task messages to the worker task VMs, which read and write S3.]
  
Task With Messaging Service & Use NoSQL DB (Variant 2)
[Diagram: the same structure, with AWS SimpleDB as the store instead of S3.]
  
Task With Messaging Service & Use Clustered FS (Variant 3)
[Diagram: the same structure, with GlusterFS as the store.]
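All three variants share one control pattern: a control VM enqueues task messages, and worker VMs consume them. The sketch below is a minimal local analogy in Python, assuming threads and an in-process queue as stand-ins for the worker VMs and the messaging service; in the real variants the queue becomes SQS or AMQP calls and the results store becomes S3, SimpleDB, or GlusterFS.

import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

work = queue.Queue()        # stand-in for the messaging service
results = queue.Queue()     # stand-in for S3 / SimpleDB / GlusterFS

def worker():
    # Each "worker VM" loops: pull a task message, do the work, store the result.
    while True:
        item = work.get()
        if item is None:    # poison pill: no more work
            break
        results.put((item, item * item))  # toy task

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for n in range(10):         # the "control VM" enqueues the task messages
    work.put(n)
for _ in threads:
    work.put(None)
for t in threads:
    t.join()
out = []
while not results.empty():
    out.append(results.get())
print(sorted(out))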
  
Section 3.3
MapReduce

Google 2004 Technical Report
  
Core Concepts
• Data are (key, value) pairs, and that's it.
• Partition the data over commodity nodes filling racks in a data center.
• Software handles failures, restarts, etc. This is the hard part.
• Basic examples:
  – Word count
  – Inverted index
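In the notation of the 2004 paper, the user supplies exactly two functions, and the framework supplies everything else (partitioning, scheduling, restarts, the shuffle):

    map(k1, v1)          -> list(k2, v2)
    reduce(k2, list(v2)) -> list(v2)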
  
Processing Big Data Pattern 2:
MapReduce
  
[Diagram: three Task Tracker nodes each run Map Tasks over HDFS data and write intermediate output to local disk; a Shuffle & Sort phase moves that output to Reduce Tasks, which write their results back to HDFS.]
  
Example: Word Count & Inverted Index
• How do you count the words in a million books?
  – (best, 7)
• Inverted index:
  – (best; page 1, page 82, …)
  – (worst; page 1, page 12, …)
[Image: cover of serial Vol. V, 1859, London.]
  
• Assume you have a cluster of 50 computers, each with an attached local disk that is half full of web pages.
• What is a simple parallel programming framework that would support the computation of word counts and inverted indices?
  
Basic Pattern: Strings
1. Extract words from web pages in parallel.
2. Hash and sort the words.
3. Count (or construct an inverted index) in parallel.
  
What about data records? The same pattern applies.

Strings (web pages):
1. Extract words from web pages in parallel.
2. Hash and sort the words.
3. Count (or construct an inverted index) in parallel.

Data records:
1. Extract binned field values from data records in parallel.
2. Hash and sort the binned field values.
3. Count (or construct an inverted index) in parallel.
  
Map-Reduce Example
• Input is files with one document per record.
• User specifies the map function:
  – key = document URL
  – value = document contents

Input of map:  ("doc cdickens two cities", "it was the best of times")
Output of map: ("it", 1), ("was", 1), ("the", 1), ("best", 1)
  
Example (cont'd)
• The MapReduce library gathers together all pairs with the same key (the shuffle/sort phase).
• The user-defined reduce function combines all the values associated with the same key.

Input of reduce:
  key = "it",    values = 1, 1
  key = "was",   values = 1, 1
  key = "best",  values = 1
  key = "worst", values = 1
Output of reduce: ("it", 2), ("was", 2), ("best", 1), ("worst", 1)
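A minimal pure-Python simulation of the three phases for this example (no Hadoop required; the function and variable names are ours):

from itertools import groupby
from operator import itemgetter

docs = {"doc cdickens two cities": "it was the best of times"}

# Map phase: emit (word, 1) for every word of every document.
pairs = []
for url, contents in docs.items():
    pairs.extend((word, 1) for word in contents.split())

# Shuffle/sort phase: gather together all pairs with the same key.
pairs.sort(key=itemgetter(0))

# Reduce phase: combine all the values associated with the same key.
counts = [(word, sum(v for _, v in group))
          for word, group in groupby(pairs, key=itemgetter(0))]
print(counts)  # [('best', 1), ('it', 1), ('of', 1), ('the', 1), ('times', 1), ('was', 1)]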
  
Why Is Word Count Important?
• It is one of the most important examples for the type of text processing often done with MapReduce.
• There is an important mapping:

      document   <----->   data record
      words      <----->   (field, value)

                Inversion
  
                 Pleasantly Parallel        MapReduce
Data structure   Arbitrary                  (key, value) pairs
Functions        Arbitrary                  Map & Reduce
Middleware       MPI (message passing)      Hadoop
Ease of use      Difficult                  Medium
Scope            Wide                       Narrow
Challenge        Getting something working  Moving to MapReduce
  	
  
Common MapReduce Design Patterns
• Word count
• Inversion (inverted index)
• Computing simple statistics
• Computing windowed statistics
• Sparse matrix (document-term, data record-FieldBinValue, …)
• Site-entity statistics
• PageRank
• Partitioned and ensemble models
• EM
  
Section 3.4
User Defined Functions over DFS

sector.sf.net
  
Processing Big Data Pattern 3:
User Defined Functions over Distributed File Systems
  
Sector/Sphere
• Sector/Sphere is a platform for data intensive computing.
  	
  
Idea 1: Apply User Defined Functions (UDF) to Files in a Distributed File System
[Diagram: one UDF plays the role of map/shuffle, another the role of reduce.]
This generalizes Hadoop's implementation of MapReduce over the Hadoop Distributed File System.
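As a rough local analogy (this is not Sector's actual API), picture a user-defined function applied in parallel to every file segment in a directory; the segments/*.txt directory and the line-counting UDF here are hypothetical placeholders:

import glob
from multiprocessing import Pool

def udf(path):
    # A user-defined function applied to one file segment: count its lines.
    with open(path) as f:
        return (path, sum(1 for _ in f))

if __name__ == '__main__':
    pool = Pool(4)
    print(pool.map(udf, sorted(glob.glob('segments/*.txt'))))  # hypothetical directory
    pool.close()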
  
Idea 2: Add Security From the Start
• The security server maintains information about users and slaves.
• User access control: password and client IP address.
• File-level access control.
• Messages are encrypted over SSL. A certificate is used for authentication.
• Sector is a good basis for HIPAA-compliant applications.
[Diagram: a Security Server provides AAA to the Master and Client over SSL; the Client exchanges data directly with the Slaves.]
Idea 3: Extend the Stack to Include Network Transport Services
[Diagram: the Google/Hadoop stack has Compute Services, Data Services, and Storage Services; Sector's stack adds a Routing & Transport Services layer beneath its Storage Services.]
  
Section 3.5
Computing With Streams:
Warming Up With Means and Variances
  
Warm Up: Partitioned Means
• Means and variances cannot be computed naively when the data is in distributed partitions.

Step 1. Compute the local tuple (Σ xi, Σ xi², ni) in parallel for each partition.

Step 2. Compute the global mean and variance from these tuples.
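A minimal sketch of the two steps in Python, using the population variance and made-up partitions:

partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

# Step 1: compute (sum, sum of squares, count) per partition; this part parallelizes.
tuples = [(sum(p), sum(x * x for x in p), len(p)) for p in partitions]

# Step 2: combine the tuples into the global mean and variance.
S = sum(s for s, q, n in tuples)
Q = sum(q for s, q, n in tuples)
N = sum(n for s, q, n in tuples)
mean = S / N
variance = Q / N - mean * mean  # E[x^2] - (E[x])^2
print(mean, variance)           # 3.5 2.9166...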
  	
  
	
  
Trivial Observation 1
If si = Σ xi is the i'th local sum, then the global mean = Σ si / Σ ni.

• If only the local means for each partition are passed (without the corresponding counts), then there is not enough information to compute the global mean.
• The same trick works for variance, but you need to pass the triples (Σ xi, Σ xi², ni).
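In symbols (a sketch, stated for the population variance): if partition i contributes the triple (s_i, q_i, n_i), where s_i is its sum and q_i its sum of squares, then

\[
\mu \;=\; \frac{\sum_i s_i}{\sum_i n_i},
\qquad
\sigma^2 \;=\; \frac{\sum_i q_i}{\sum_i n_i} \;-\; \mu^2 .
\]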
  
	
  
Trivial Observation 2
• To reduce the data passed over the network, combine appropriate statistics as early as possible.
• Consider the average. Recall that with MapReduce there are 4 steps (Map, Shuffle, Sort and Reduce), and Reduce pulls data from the local disks of the nodes that performed the Map.
• A Combine step in MapReduce combines local data before it is pulled for the Reduce step.
• There are built-in combiners for counts, means, etc.
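A minimal sketch of the idea, assuming a Python streaming mapper that combines "in the mapper": it emits one partial count per distinct word per input split, instead of one pair per word occurrence. (Hadoop's Combiner mechanism achieves the same effect between the Map and Reduce steps.)

import sys
from collections import defaultdict

def main(separator='\t'):
    partial = defaultdict(int)
    for line in sys.stdin:
        for word in line.split():
            partial[word] += 1  # combine locally, as early as possible
    # Far fewer (word, count) pairs now cross the network to the reducers.
    for word, count in partial.items():
        print('%s%s%d' % (word, separator, count))

if __name__ == '__main__':
    main()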
  
Section 3.6
Hadoop Streams
  
Processing Big Data Pattern 4:
Streams over Distributed File Systems
  
Hadoop Streams
• In addition to the Java API, Hadoop offers
  – a streaming interface for any language that supports reading and writing to standard in and out
  – Pipes for C++
• Why would I want to use something besides Java? Because Hadoop Streams provide direct access (without JNI/NIO) to
  – C++ libraries like Boost and the GNU Scientific Library (GSL)
  – R modules
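For reference, a streaming job is launched along these lines (the location of the streaming jar varies by Hadoop version and distribution, and the input/output paths here are placeholders):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -input myInput -output myOutput \
  -mapper mapper.py -reducer reducer.py \
  -file mapper.py -file reducer.py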
  
Pros and Cons
• Java
  + Best documented
  + Largest community
  – More LOC per MR job
• Python
  + Efficient memory handling
  + Programmers can be very efficient
  – Limited logging / debugging
• R
  + Vast collection of statistical algorithms
  – Poor error handling and memory handling
  – Less familiar to developers
  
Word Count Python Mapper

import sys

def read_input(file):
    # One record per line; split it into words.
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            # Emit (word, 1) pairs, tab-separated, one per line.
            print('%s%s%d' % (word, separator, 1))

if __name__ == '__main__':
    main()
Word Count Python Reducer

import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby works because Hadoop sorts the map output by key.
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print('%s%s%d' % (word, separator, total_count))

if __name__ == '__main__':
    main()
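Because the mapper and reducer only read standard in and write standard out, they can be smoke-tested without a cluster by letting sort play the role of the shuffle (input.txt is any local text file):

cat input.txt | python mapper.py | sort | python reducer.py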
MalStone Benchmark

                          MalStone A    MalStone B
Hadoop MapReduce          455m 13s      840m 50s
Hadoop Streams (Python)    87m 29s      142m 32s
C++ implemented UDFs       33m 40s       43m 44s

Sector/Sphere 1.20 and Hadoop 0.18.3 with no replication, on Phase 1 of the Open Cloud Testbed in a single rack. The data consisted of 500 million 100-byte records per node on 20 nodes.
  
Word Count R Mapper

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
# Helper assumed (it is not defined on the original slide): split a line on whitespace.
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep = ""), sep = "")
}
close(con)
	
  
Word Count R Reducer

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
	
  
Word Count R Reducer (cont'd)

    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    }
    else assign(word, count, envir = env)
}
close(con)
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")
	
  
Word Count Java Mapper

public static class Map
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
	
  
Word Count Java Reducer

public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
	
  
Code Comparison – Word Count Mapper

Python

import sys

def read_input(file):
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            print('%s%s%d' % (word, separator, 1))

R

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep = ""), sep = "")
}
close(con)

Java

public static class Map
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
Code Comparison – Word Count Reducer

Python

import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    data = read_mapper_output(sys.stdin, separator=separator)
    for word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for word, count in group)
        print('%s%s%d' % (word, separator, total_count))

R

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    }
    else assign(word, count, envir = env)
}
close(con)
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")

Java

public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Questions?

For the most current version of these notes, see rgrossman.com
  
