Sponsored by




               Welcome to
               Cloud Chicago	

            Hosted by




                             Live Tweet on the second
                             screen by using:
                             #cloudcamp
                             @cloudcamp_chi
Thursday, December 13, 12
Agenda

      6:00pm Registration, Food, Drinks and
      Networking
      6:30 Opening Remarks, Patrick Kerpan, CohesiveFT
      6:45 Lightning Talks
                            Dave Falck, Model Metrics: node.js on AWS
                            Paul Mantz, CohesiveFT: Working with APIs
                            Bob Chojnacki, Jellyvision Labs: Hadoop on AWS
                            Karl Zimmerman, Steadfast: Keep control with the Private Cloud
      7:45 Unpanel: “Who’s in Control of Your Cloud? Security and
      Visibility”
                            Emceed by Mike Dorosh, IBM & Patrick Kerpan, CohesiveFT
      8:30 Breakout Sessions
      9:00 Wrap Up - Drinks, anyone?
Dave Falck, Customer Solutions Engineer
Node.js + AWS

@davidfalck
Why the Node.js Buzz?

* LinkedIn’s entire mobile software stack is completely built in Node
* Why? Scale.
* Huge performance gains compared to what they were using before (Ruby on Rails)
* Went from running 15 servers with 15 instances (virtual servers) on each physical machine, to just four instances that can handle double the traffic.
What is Node.js?

* Javascript platform based on Google Chrome V8 JS Engine
* Ryan Dahl (Joyent)
* Event-driven, non-blocking I/O model to allow your applications to scale while keeping you from having to deal with threads, polling, timeouts, and event loops
* FAST
  * Used for real-time, data-intensive apps (mobile!)
* POPULAR
Node.js on GitHub
Hello World

var http = require('http');

http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});
  res.end('Hello World\n');
}).listen(1337, '127.0.0.1');
What makes Node.js so fast?

* Thread-based networking is inefficient and difficult
* Node shows much better memory efficiency under high loads than systems which allocate 2 MB thread stacks for each connection.
* Users of Node are free from worries of dead-locking the process (*there are no locks*)
* Almost no function in Node directly performs I/O, so the process never blocks.
* Because nothing blocks, less-than-expert programmers are able to develop fast systems
Under the Node.js hood

Javascript?
Under the Node.js hood

* Javascript!
  * Platform independent
  * Easy to use
  * Ubiquitous
* Google Chrome’s V8 Javascript Engine
  * Translates JS into machine code (not interpreted)
When not to use Node.js

* Node.js is not ideal for CPU-intensive jobs like sorting, transformations, number crunching, analytics…
* In traditional CRUD web apps that need to be highly concurrent, performance degradation will occur when the data needs to be transformed…
* You can offload processing to another language that is better at making use of the CPU
* Cultural fit? Too new? You decide…
Node.js + AWS

* Dec 6th: AWS released a developer preview of node.js libraries to access AWS:
  * DynamoDB
  * S3
  * EC2
  * SWF
* Allows you to manage parallel calls to several AWS web services
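The "parallel calls" point boils down to the SDK's callback style: each call is asynchronous, so several can be in flight at once. A small hypothetical fan-out helper (not part of the SDK) shows the pattern you would use to wait for all of them:

```javascript
// Run several callback-style tasks concurrently and collect their
// results in order; `done` fires once every task has answered.
function parallel(tasks, done) {
  var results = [];
  var pending = tasks.length;
  if (pending === 0) return done(results);
  tasks.forEach(function (task, i) {
    task(function (value) {
      results[i] = value;               // keep the original ordering
      if (--pending === 0) done(results);
    });
  });
}

// With the AWS SDK each task would wrap a service call, roughly:
//   function (cb) { s3.listBuckets(function (err, data) { cb(data); }); }
```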
Node.js + Other Clouds

* Azure
* Joyent
* EngineYard
* Heroku
More info

* http://nodejs.org
* http://en.wikipedia.org/wiki/Nodejs
* http://aws.typepad.com/aws/2012/12/aws-sdk-for-nodejs-now-available-in-preview-form.html
* http://www.jamesward.com/2011/06/21/getting-started-with-node-js-on-the-cloud/
* http://venturebeat.com/2011/08/16/linkedin-node/
Paul Mantz, Software Engineer
APIs in Cloud Environments
Paul Mantz

Copyright CohesiveFT - Dec 13, 2012
API Command-Line Clients

Benefits of Creating API Command-Line Clients
• Lowers barrier of entry
• Familiar to technical consumers
• Advanced use cases
• Integrates into existing toolsets
API Command-Line Clients

                   Excellent Internal Developer Tool
                   • Excellent for testing and rapid development
                   • Useful operations tool




API Command-Line Clients

                   Reference Implementation
                   • Gives developers an example to integrate the API
                   • Helps users model workflows
                   • DSL




API Command-Line Clients

                   Excellent Demo Tool
                   • Quick installation, often one file




Bob Chojnacki, Programmer
Big Data in the Cloud

A Journey into the unknown
Who Jellyvision is and why analytics are important to us

• We create interactive experiences
  – Desktop
  – Mobile
• … which ask questions, inform people, generate leads
• “Virtual Advisors”
• We also collect analytics in real time to generate reports about:
  – How people answered a question
  – Where they dropped out
  – Lots of impressive stats!
The Problem

• Longer-term projects and high-volume projects causing MySQL to bust at the seams
• Some types of reports taking too long, or causing MySQL to crash if we include too much data
• In all fairness, we could probably tune MySQL, throw it on bigger servers, more memory
• Diminishing returns
• MySQL is fine for collecting the data…
The Solution

• Hadoop!
• Why Hadoop? Lots of possibilities out there, but which one to use? Cassandra, CouchDB, Hadoop, Membase, MongoDB, Neo4j, …
• Big Data meetups tended to have lots of people using Hadoop
• And I knew others using it.
• And Hortonworks had a fancy point-and-click solution I could use to get started quickly
Options with options

• Now that I picked Hadoop, I had several options, and options within options, to use to analyze my data:
  – Hive, Pig, MapReduce, Java, R
• I knew Java
• MapReduce seemed to make sense
• I’ll probably play with Hive and Pig next
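One more option worth noting is Hadoop Streaming, which accepts any executable as mapper or reducer. As a hypothetical sketch (the field layout and counting scheme are assumptions, not from the talk), a count-events-per-question pair could look like this in JavaScript:

```javascript
// Mapper: emit "<questionId>\t1" for a "visitId,questionId,answer"
// record; Hadoop Streaming feeds records on stdin, one per line,
// and reads tab-separated key/value pairs from stdout.
function mapLine(line) {
  return line.split(',')[1] + '\t1';
}

// Reducer: Streaming groups mapper output by key, so summing the
// values per key yields the per-question event counts.
function reduceCounts(lines) {
  var counts = {};
  lines.forEach(function (l) {
    var kv = l.split('\t');
    counts[kv[0]] = (counts[kv[0]] || 0) + Number(kv[1]);
  });
  return counts;
}
```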
It’s All About The Data

• Visit data
• Event data
• Denormalization of data
• Generated a ton of fake data:
  – Started with 600K visits, 3M events
  – Moved up to 1.8M visits, 60M events
Make it so

• First experience: Hortonworks Virtual Sandbox
  – Single-node AMI at Amazon
  – Hadoop 1.0
  – 600K visits, 3M events
• On our existing platform we needed to break reports up into smaller chunks for some data because MySQL could not handle it.
• Results! What would have taken hours took only 5 minutes on a single-node Hadoop "cluster”
• In reality, some of the queries I could also run with command-line tools (wc, grep, awk) on the data considerably faster than even Hadoop.
• Important lessons learned so far:
  – Think outside the RDBMS: they are great, but it may not make sense for all types of data
Looking at more real data

• Now, let’s generate data that is much closer to some of our product
• Instead of one question and answer, how about 15 questions? Adding in some other events gives a total of 34 events.
• Throw in some people returning, some of them multiple times
• Throw in some people who don't start the conversation, etc.
• Run my little auto-data-generator and BOOM! 20 million events and 4.4GB later I have my data…
• … which took up too much disk space to run on the demo system I was using. Might as well turbo-charge this puppy...
More disk space!

• Full install of Hadoop (Hortonworks HDP)
• Single node
• 600K visits, 20M events
  – 6m 29s, ~30s after map phase completed
• 1.8M visits, 60M events
  – 18m 3s, ~90s after map phase completed
More nodes

• 3 nodes: 11m
• 4 nodes: 9m 16s
• Yay! Nodes!
Caveats

• Not using Hadoop to its fullest / basically a weekend job
• Algorithms employed in this example probably won't end up in a book alongside Knuth’s
Next steps

• Make sure results on real data line up
• Integrate with team to generate the reports they need
End stuff

• Thanks to the folks at Hortonworks who answered my frantic and spastic questions.
Karl Zimmerman, President
Keep Your Control.
Private Cloud with Karl Zimmerman, CEO of Steadfast.
Private Cloud:
What do we mean?

 Private cloud is a form of cloud computing where the
 customer has some control/ownership of the service
 implementation. It is a scalable, elastic IaaS solution
 based on cloud computing but with more control over
 resources.
Private Cloud:
What are the advantages?

 Security
 Availability
 No vendor lock-in
 Ease of management
Private Cloud:
Security



 Dedicated & segregated resources
 More options to integrate with existing security
Private Cloud:
Availability


 Understanding and control of the infrastructure
 Get the resources you need, when you need them
 You're not subject to the whims of other users
Private Cloud:
Vendor Lock-In



 No "secret sauce."
 Utilize true open source
Private Cloud:
Management


 Easier to find employees with general IT knowledge
 Utilize a broader array of tools and software
 Get support/assistance from multiple levels
Private Cloud:
To Summarize

Private cloud can deliver what you need out of a public cloud while giving you more control. Worries about losing control over security and availability, along with issues like vendor lock-in and management, vanish into thin air like, well, a cloud. And the fact that it doesn’t have to cost you more is a plus, too.
Unpanel: “Who’s in Control of Your Cloud? Security and Visibility”

Emceed by:
Mike Dorosh, Program Manager, Cloud Technical Partnerships, IBM
& Patrick Kerpan, CEO, CohesiveFT