Log everything!
Dr. Stefan Schadwinkel and Mike Lohmann




Who we are.




Dr. Stefan Schadwinkel: Analytics
Author (heise.de, Cereb. Cortex, EJN, J. Neurophysiol.)

Mike Lohmann: Architecture
Author (PHPMagazin, iX, heise.de)




Agenda.




- What we do. What we need to do. What we are doing.
- Requirement: Log everything!
- Infrastructure and technologies.
- We want happy business users.
  




ICANS GmbH




PokerStrategy.com by the numbers




PokerStrategy.com: Education since 2005

- 7,600,000 requests/day
- 6,000,000 registered users
- 2,800,000 page impressions (PI)/day
- 700,000 posts/day
- 19 languages



Topics of this talk




- How to use existing technologies and standards.
- Out-of-the-box solutions and ready-to-use scripts.
- Scalability and simplicity of the solution.
- "Good enough" for now!
- The way from a requirement to a solution.
- Open-source Symfony2 bundles for logging.
- Live demo.




What we do.




- We teach poker.
- We create web applications.
- We serve millions of users in different countries, respecting a multitude of market rules.
- We make business decisions driven by complex data analytics.




What we need to do.




- We need to try out other teaching topics, fast.
- We need to gather data from all of these try-outs, accumulate it, and base business decisions on its analysis.
- We need a bigger infrastructure to gather more data.
- We need to hire more (good) people! :-)




What we are doing.




- We build ECF (Education Community Framework).
- We (can) log everything!
- We (now) use Amazon S3 and Amazon EMR to have scalable storage and a MapReduce solution.
- We hire (good) people! :-)




Requirement: Log everything.




- "Are you mad?!"
- "Be more specific, please!"
- "But what about the user's data?!"
  




Logging Tools / Technologies




Producer:   Symfony2 application server and databases
Transport:  Now: RabbitMQ + Erlang consumer. Was: Flume
Storage:    Now: S3 storage, Hadoop via Amazon EMR. Was: virtualized in-house Hadoop
Analytics:  MapReduce, Hive, BI via QlikView




Logging Infrastructure




Producer:   LB → reverse proxy → app servers 1-x, plus the databases
Transport:  each app server's local RabbitMQ shovels into the central RabbitMQ
            cluster, from which the consumer reads
Storage:    S3 and the Hadoop cluster
Analytics:  QlikView, Graylog, Zabbix



Producer




Request to /Home
  → Page Controller fires a PageHit-Event
  → PageHit Event Listener calls Logger::log()
  → Monolog Logger: Processor → Formatter → Handler
  → LogMessage, as JSON, into the local RabbitMQ
  → Shovel to the central cluster

The LogMessage travels as JSON; an example follows below.
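The JSON emitted by the handler follows Monolog's standard record structure; the concrete channel, message, and context values in this sketch are illustrative, not our production schema:

  {"message":"page.hit","context":{"route":"/Home"},"level":200,"level_name":"INFO",
   "channel":"ecf","datetime":"2012-10-15T10:00:00+02:00","extra":{}}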



Producer




- LoggingComponent: provides interfaces, filters, and handlers.
- LoggingBundle: glues it all together with Symfony2.

https://github.com/ICANS/IcansLoggingComponent
https://github.com/ICANS/IcansLoggingBundle




Transport – First Try




- Hey, if we use Hadoop, why not use Flume?
  - Part of the ecosystem
  - Central config
  - Extensible via plugins
  - Flexible flow configuration
  - How? Flume Nodes → Flume Sinks




Transport – First Try




- But... wait!
  - Ecosystem? Just like Hadoop version numbers...
  - Admins say: central config woes!
  - Issues: multi-master, logical vs. physical nodes, Java heap space, etc.
  - Will my plugin run with flume-ng?
  - Ever tried to keep your complex flow and switch reliability levels?

  Read: Our admins still hate me...




Transport – Second Try




- RabbitMQ vs. Flume Nodes
  - Each app server has its own local RabbitMQ.
  - The local RabbitMQ shovels its data to a central RabbitMQ cluster.
  - Similar to the Flume Node concept.
  - Decentralized config: producers and consumers simply connect. A sketch of such a shovel config follows below.
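A minimal sketch of the per-server shovel, assuming the static rabbitmq_shovel plugin configuration of that era; the shovel name, queue, and central host are placeholders, not our production values:

  %% /etc/rabbitmq/rabbitmq.config on each app server (illustrative)
  [{rabbitmq_shovel,
    [{shovels,
      [{logs_to_central,
        [{sources,      [{broker, "amqp://localhost"}]},
         {destinations, [{broker, "amqp://central-rabbit.internal"}]},
         {queue, <<"log_messages">>},
         {reconnect_delay, 5}]}]}]}].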




Transport – Second Try




- But... wait! We still need sinks.
  - Custom-crafted RabbitMQ consumers.
  - We could write them in PHP, but...
  - Erlang, teh awesome!
    - Battle-hardened OTP framework.
    - "Let it crash!" ... and recover.
    - Hot code change. If you want.

  Read: Runs forever.




Storage – First Try




- Use out-of-the-box Hadoop (Cloudera).
- But:
  - Virtualized infrastructure
  - Unknown usage patterns
  - Must be cost-effective
  - Major Hadoop version upgrades




Storage – Second Try




- Use Amazon Web Services.
- Provides flexible virtualized infrastructure.
- Cost-effective storage: S3.
- Hadoop on demand: EMR.




Storage – Storage Amazon S3




- The Erlang RabbitMQ consumer simply copies the incoming data to S3.
  - Easy: exchange the "hadoop" command with "s3cmd". A sketch follows below.
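In shell terms, the consumer's upload step changes roughly like this (file and path names are illustrative):

  # before: push a finished chunk into HDFS
  hadoop fs -put chunk-0001.gz /logs/incoming/2012/10/01/
  # now: push the same chunk to S3 instead
  s3cmd put chunk-0001.gz s3://[BUCKET]/incoming/2012/10/01/chunk-0001.gz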




Storage – Storage Amazon S3




- The S3 bucket receives many small, compressed log file chunks.
- Amazon provides s3DistCp, which does distributed data copying:
  - Aggregate many small files into partitioned large chunks.
  - Change compression. (See the sketch below.)
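A sketch of such a compaction run added as a step to a running job flow, assuming the stock emr-s3distcp jar shipped with EMR; the groupBy pattern and output codec are illustrative:

  ./elastic-mapreduce --jobflow $JOBFLOW \
    --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
    --args '--src,s3://[BUCKET]/incoming/,--dest,s3://[BUCKET]/compacted/,--groupBy,.*/([0-9]+/[0-9]+/[0-9]+)/.*,--outputCodec,lzo'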




Analytics




- We want happy business users.
- We want to answer questions.
  - People want answers to questions they have. Now.
  - No, they couldn't tell you that question yesterday. If they had known, they would have already asked for the answer. Yesterday.
- We also want data-driven applications.
  - Production system analysis.
  - Fraud prevention.
  - Recommendations.
  - Social metrics for our users.


Analytics




- Remember MapReduce.
  - Custom jobs:
    - Streaming: use your favorite language.
    - Java API: Cascading; use your favorite JVM language: Java, Groovy, Clojure, Scala.
  - Data queries:
    - Hive: similar to SQL.
    - Pig: data flow.
    - Cascalog: Datalog-like query language using Clojure and Cascading.




Analytics




- Cascalog is Clojure, Clojure is Lisp:

  (?<- (stdout) [?person] (age ?person ?age) … (< ?age 30))

  - ?<- is the query operator.
  - (stdout) is the Cascading output tap.
  - [?person] lists the columns of the dataset generated by the query.
  - (age ?person ?age) is a "generator"; (< ?age 30) is a "predicate".
  - Use as many generators and predicates as you want; both can be any Clojure
    function, and Clojure can call anything that is available within a JVM.




Analytics




- We use Cascalog to preprocess and organize the incoming flow of log messages.




Analytics




- Let's run the Cascalog processing on Amazon EMR:

  ./elastic-mapreduce --create --name "Log Message Compaction" \
    --bootstrap-action s3://[BUCKET]/mapreduce/configure-daemons \
    --num-instances $NUM \
    --slave-instance-type m1.large \
    --master-instance-type m1.large \
    --jar s3://[BUCKET]/mapreduce/compaction/icans-cascalog.jar \
    --step-action TERMINATE_JOB_FLOW \
    --step-name "Cascalog" \
    --main-class icans.cascalogjobs.processing.compaction \
    --args "s3://[BUCKET]/incoming/*/*/*/","s3://[BUCKET]/icanslog","s3://[BUCKET]/icanslog-error"




Analytics




- After the Cascalog query we have:

  s3://[BUCKET]/icanslog/[WEBSITE]/icans.content/year=2012/month=10/day=01/part-00000.lzo

  The year=/month=/day= layout is exactly Hive partitioning!




Analytics




- Now we can access the log data within Hive:
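The live Hive session is lost with the original screenshot; a minimal sketch of the table definition it could have used, assuming an external table over the partitioned S3 layout (the single message column is an assumption):

  hive -e "
  CREATE EXTERNAL TABLE [WEBSITE]_icanslog_content (message STRING)
  PARTITIONED BY (year STRING, month STRING, day STRING)
  LOCATION 's3://[BUCKET]/icanslog/[WEBSITE]/icans.content/';
  ALTER TABLE [WEBSITE]_icanslog_content
  ADD PARTITION (year='2012', month='10', day='01');
  "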




Analytics




- Now we can run Hive queries on the [WEBSITE]_icanslog_content table!
- But we also want to store the result to S3 (see the sketch below).
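A sketch that does both at once, assuming Hive's INSERT OVERWRITE DIRECTORY with an S3 target; the aggregation itself is illustrative:

  hive -e "
  INSERT OVERWRITE DIRECTORY 's3://[BUCKET]/stats/daily-hits'
  SELECT year, month, day, COUNT(*)
  FROM [WEBSITE]_icanslog_content
  GROUP BY year, month, day;
  "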




Analytics




- Now, get the stats:
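The live stats query is also lost with the screenshot; a hedged example of the kind of query this step could run (columns beyond the partition keys are assumptions):

  hive -e "
  SELECT message, COUNT(*) AS hits
  FROM [WEBSITE]_icanslog_content
  WHERE year = '2012' AND month = '10'
  GROUP BY message
  ORDER BY hits DESC
  LIMIT 10;
  "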




Analytics




- We can now simply copy the data from S3 (see the sketch below) and import it into any local analytical tool, such as:
  - Excel (it must really make business people happy...)
  - QlikView (anyone can be happy with it...)
  - R (if I want an answer...)
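Fetching the results is a single s3cmd call; the local path and the note about Hive's default field delimiter are the only assumptions here:

  s3cmd get --recursive s3://[BUCKET]/stats/daily-hits/ ./daily-hits/
  # Hive writes \001 (ctrl-A) separated text files; set the delimiter
  # accordingly when loading into Excel, QlikView, or R.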




Thank you.




Questions?




Contacts.




Dr. Stefan Schadwinkel: stefan.schadwinkel@icans-gmbh.com (ICANS_StScha)
Mike Lohmann: mike.lohmann@icans-gmbh.com (mikelohmann)




Tools/Technologies




ICANS GmbH
Valentinskamp 18
20354 Hamburg
Germany


Phone:   +49 40 22 63 82 9-0
Fax:     +49 40 38 67 15 92


Web: www.icans-gmbh.com
