Extending Piwik At R7.com

How we deployed the Piwik web analytics system to handle a huge amount of unpredicted traffic, adding some cloud and modern scalability techniques. Files: https://github.com/lorieri/piwik-presentation


Presentation Transcript

  • Extending Piwik at r7.com, Phase 1 – Collecting data. Adding some cloud and modern scalability to a traditional LAMP stack. Leonardo Lorieri, r7.com system architect, 'lorieri at gmail.com', Feb/2012
  • Why Piwik?
    •   - Open Source = flexible, understandable, free!
    •   - Great interface
    •   - Mobile app
    •   - REST API
    •   - Developers know the market's needs
    •   - Efficient on small machines
    •   - Lots of possible improvements
    •   - Lots of improvements already in the roadmap
    •   - Great and supportive community (Thank you all!)
  • Our plan, goals and trade-offs
    •   - Don't change the original code
    •     - reduces development and maintenance costs
    •   - Count only visits and page views
    •     - to be fast and focused (even though you can still use the .js tracker,
    •       it is easy to get lost in the UI's beauty and all its functionality)
    •   - Handle odd, unexpected traffic peaks
    •     - from TV announcements
    •   - Count not only websites
    •     - media delivery, internal searches, debugging
    •   - At least 99% accuracy
    •   - Have numbers to compare with other analytics tools
    •   - We've lost P3P for now
  • Our big problem - The TV Effect
    • from Gaiser's presentation at http://www.slideshare.net/rgaiser/r7-no-aws-qcon-sp-2011
    • Traffic peak during a TV Show
  • Regular Piwik Setup (based on Rodrigo Campos' presentation http://www.slideshare.net/xinu/capacity-planning-for-linux-systes)
    •   - Apache/Nginx
    •   - PHP
    •   - MySQL
  • Bigger Piwik Setup (based on Rodrigo Campos' presentation http://www.slideshare.net/xinu/capacity-planning-for-linux-systes)
    •   - Apache/Nginx
    •   - PHP
    •   - MySQL
  • Regular PHP-Scaling Piwik Setup (based on Rodrigo Campos' presentation http://www.slideshare.net/xinu/capacity-planning-for-linux-systes)
    •   - Load balancer/Nginx
    •   - Apache/Nginx
    •   - PHP
    •   - MySQL replication
    •     (slave for backup only,
    •     piwik is not "slave ready")
  • Two problems, one easy solution
    •   Problem: Data Collection for the TV Effect
    •   Easy solution: make it asynchronous
    •  
    •   Problem: Data processing
    •   Hard solution: huge ($$$) servers and complex tuning
  • Asynchronous Piwik Setup (based on Rodrigo Campos' presentation http://www.slideshare.net/xinu/capacity-planning-for-linux-systes)
    •   - Load balancer/Nginx collects the visits: answers the <img src=> request, manages the user cookies, writes the access logs
    •   - NOT even PHP on the collectors
    •   - Perl/Python worker processes the logs and replays them against the REST API
    •   - MySQL master, plus Apache+PHP for the Admin/Reports UI and the archive cron
    •   - MySQL slave
  • Nginx (more details later)
    •   - Small virtual machines can handle thousands of requests per second
    •   - Visits divided into logs by virtual host
    •   - HttpUserIdModule
    •     - automatically creates and handles user id cookies
    •   - HttpLogModule
    •     - formats logs as NCSA combined (logging cookies and referrers)
    •   - HttpEmptyGifModule
    •     - responds with an empty gif
    •   - HttpHeadersModule
    •     - expires -1;
    •   (all modules available in Ubuntu's nginx-extras package)
    •   - Logrotate
    •     - unix tool to rotate logs
    •     - every 5 minutes for us
    •  
  • Log processing "Worker" (more details later)
    •   - Copy and uncompress the available logs
    •   - Format them as REST API requests (a sketch follows this slide)
    •     - force the visit date
    •     - force the client IP
    •     - force the idVisitor
    •     - force the User Agent
    •     - use the log's referrer as the URL
    •     - use the referrer as the page title (useful when logging multiple hostnames)
    •   - Send the requests to the Piwik server in parallel
    •     - we are using 270 concurrent requests, which
    •       gives about 1300 requests per second
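    • A minimal sketch of that log-line-to-request transformation (not our actual worker code), assuming the NCSA combined format produced by the Nginx log_format shown later; the site id, the token and the 'ua' parameter name are placeholders/assumptions:
    •
    •   import hashlib
    •   import re
    •   import time
    •   import urllib
    •
    •   # combined log plus '"$myid"' at the end, matching the nginx log_format shown later
    •   LOG_RE = re.compile(r'(\S+) - (\S+) \[(.*?)\]\s+"(.*?)" (\d+) (\S+) "(.*?)" "(.*?)" "(.*?)"')
    •
    •   PIWIK = 'http://127.0.0.1/piwik.php'
    •   IDSITE = 1                          # placeholder site id
    •   TOKEN = 'YOUR_PIWIK_ADMIN_TOKEN'    # placeholder admin token
    •
    •   def line_to_request(line):
    •       m = LOG_RE.match(line)
    •       if not m:
    •           return None
    •       ip, user, when, request, status, size, referrer, agent, uid = m.groups()
    •       # Piwik wants the visit date in UTC; this assumes the collectors log in UTC
    •       ts = time.strptime(when.split()[0], '%d/%b/%Y:%H:%M:%S')
    •       params = {
    •           'rec': 1,
    •           'idsite': IDSITE,
    •           'rand': int(time.time() * 1000),
    •           'url': referrer,            # the referrer is the page that was viewed
    •           'action_name': referrer,    # also used as the page title
    •           'cip': ip,                  # force the client IP
    •           'cdt': time.strftime('%Y-%m-%d %H:%M:%S', ts),
    •           'ua': agent,                # user agent override (parameter name is an assumption)
    •           '_id': hashlib.md5(uid).hexdigest()[:16],  # see the REST API slide
    •           'token_auth': TOKEN,
    •       }
    •       return PIWIK + '?' + urllib.urlencode(params)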
  • Mysql Master (more details later)
    •  
    •   - The only machine that has to be huge, saving money
    •   - The Piwik admin and reports interface lives here
    •     - it could be somewhere else, but the machine is huge anyway
    •   - MySQL tuning
    •   - RAID tuning
    •   - Linux networking tuning
    •     - same on all machines, to handle many concurrent TCP connections
    •   
  • Php tuning (more details later)
    •   - Max execution time
    •   - Max input time   <--- a PHP bug reports it as max execution time
    •   - Memory limit
    •   - APC
    •   - APC shm_size
    •  
    • Some problems:
    •   - Consider Apache: it is slower than Nginx, but more stable,
    •     much easier to debug, and easier to control concurrency
    •   - MySQLi is more stable and has better debugging than PDO_MySQL
    •   - mod_php is more stable and easier to debug than FastCGI
  • Piwik tuning
    • Follow the rules: http://piwik.org/faq/new-to-piwik/#faq_137
    •   - disable unused plugins
    •   - since the cookies come from Nginx, you can set this in config.ini:
    • [Tracker]
    • trust_visitors_cookies=1
    •  
  • Handling TV Effect
    • nginx requests/second > maximum requests/second at traffic peaks
    •   - autoscaling guarantees it
    •   - autoscaling provides scheduled capacity changes
    • total requests in a day  <  (REST API requests/second) * (seconds in a day)
    •   - even though peak request rates vary by 1000% in a short time, the total daily traffic is easily handled when it comes from a queue at a fixed requests/second rate; it just takes a bit longer to catch up (a back-of-the-envelope check follows this slide)
    • maximum Apache concurrent requests > maximum concurrent worker connections
    •   - the program that processes the logs cannot make more requests than Apache can handle; we configure Apache for 1000 concurrent requests and the worker for 260 concurrent requests, so Apache has some free slots for other admin tasks
    • MySQL max connections > Apache concurrent requests
    •   - otherwise you will get "too many connections"
    • archive.php performance > REST API input rate
    •   - you can't input more than archive.php will be able to handle, otherwise you will end up with logs you will never be able to process
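    • A quick back-of-the-envelope check of those inequalities, using the rates quoted in these slides; the daily totals and the MySQL max_connections value are only illustrative:
    •
    •   # Capacity sanity check for the asynchronous setup (illustrative numbers).
    •   API_RATE = 1300                 # REST API requests/second sustained by the worker
    •   SECONDS_PER_DAY = 86400
    •
    •   daily_capacity = API_RATE * SECONDS_PER_DAY   # ~112 million requests/day
    •   daily_traffic = 50 * 10**6                    # example: 50M tracked requests/day
    •   assert daily_traffic < daily_capacity, "the worker would never catch up"
    •
    •   # concurrency chain: worker connections < Apache slots < MySQL max_connections
    •   WORKER_CONCURRENCY = 260
    •   APACHE_MAX_CLIENTS = 1000
    •   MYSQL_MAX_CONNECTIONS = 1200                  # example value; must exceed Apache's limit
    •   assert WORKER_CONCURRENCY < APACHE_MAX_CLIENTS < MYSQL_MAX_CONNECTIONS
    •
    •   # backlog after a TV peak: extra queued requests divided by the replay rate
    •   peak_backlog = 10 * 10**6                     # example: 10M extra requests from one show
    •   print "catch-up time: %.1f hours" % (peak_backlog / float(API_RATE) / 3600)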
  • Our real setup, how we deployed it
    • AWS autoscaling for the Nginx machines
    •     - easy high availability; increases and decreases the number of collecting machines automatically,
    •       saving money
    •     - logrotate runs when a machine is "terminated", to make sure no requests are lost
    • AWS SNS
    •     - Easily notifies when a new log file is ready to use, making it easy to synchronize
    •       the file processing
    •     - Notifies multiple queues
    •     - With multiple queues, we can feed the same logs to multiple analysis tools. We use
    •       one for web analysis and another for Flash player debugging
    • AWS SQS
    •     - Easy queue service, so we don't need expensive and complex high-availability setups for it
    • AWS S3
    •     - Cheap and virtually unlimited storage
    •     - Very easy access to files
    •     - Durable: Amazon guarantees better durability than regular data centers
    • Nginx
    •     - embedded Perl script to get the real IP on Amazon (the perl module is also included in Ubuntu's
    •       package)
    • Logrotate
    •     - Added an s3cmd command (package also available on Ubuntu) to upload the log to an S3 bucket,
    •       and an SNS CLI command (sns-publish) to send a notification once it finishes
  • Our setup diagram (components)
    •   - Visits -> ELB (Elastic Load Balancer) -> nginx autoscaling pool
    •   - one log file per virtualhost per machine, every 5 minutes, uploaded to an S3 bucket
    •   - SNS notifications: one notification per S3 file, fanned out to SQS queues
    •   - other workers/processors consume the same queues for other projects
    •   - worker (in our datacenter): mysql slave, apache, piwik api, python-boto, python-twisted
    •   - BigAss MySQL: mysql master, piwik (mysql connection from the slave/worker)
    •   - Piwik users access the admin/reports UI
  • Our Worker - Part 1
    • Our choice of the REST API was based on the same PHP scaling philosophy: small standalone processes that are easy to multiply. Also, as with MySQL replication, it is easier and healthier to process lots of small pieces than to freeze the servers with huge processes.
    • To input the requests in parallel we used Python Twisted, as shown in this blog post:
    • http://oubiwann.blogspot.com/2008/06/async-batching-with-twisted-walkthrough.html
    • We installed Apache and Piwik on the MySQL slave machine (of course PHP connects to the master's MySQL), then we tuned Apache, MySQL and the TCP connections (as shown before). We access the REST API via http://127.0.0.1/.
    • From the Twisted blog post mentioned earlier, we changed maxRun to 260, added some logging and error handling (we check whether a gif was returned and its size; otherwise we log the failed request to be reprocessed later), and we implemented the callLater mentioned in the blog's comments, with 0.03 seconds (a capped-concurrency sketch follows this slide).
    • To get messages and files from Amazon, we are using python-boto
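    • A minimal sketch of capped-concurrency requests with Twisted, in the spirit of the blog post above but not its exact code (it uses a DeferredSemaphore instead of the post's batching helper); the URL list and the gif check are placeholders:
    •
    •   from twisted.internet import defer, reactor
    •   from twisted.web.client import getPage
    •
    •   CONCURRENCY = 260                    # the worker's maxRun value
    •   urls = ['http://127.0.0.1/piwik.php?rec=1&idsite=1'] * 100   # placeholder request URLs
    •
    •   sem = defer.DeferredSemaphore(CONCURRENCY)
    •
    •   def check(body, url):
    •       # the worker checks that a gif came back and its size; this is a rough stand-in
    •       if not body.startswith('GIF'):
    •           print "failed, reprocess later:", url
    •
    •   def fetch(url):
    •       d = sem.run(getPage, url)        # at most CONCURRENCY requests in flight
    •       d.addCallback(check, url)
    •       d.addErrback(lambda failure: check('', url))
    •       return d
    •
    •   dl = defer.DeferredList([fetch(u) for u in urls])
    •   dl.addBoth(lambda _: reactor.stop())
    •   reactor.run()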
  • Our Worker - Part 2
    • Work Flow:
    •     - check for new messages in the AWS SQS queue
    •     - if there is a message, it means a new file is available; the message
    •       contains an S3 file path
    •     - with the S3 file path, download the file and uncompress it
    •     - transform the NCSA log into REST API request URLs
    •     - put the URLs in an array
    •     - delete the message from the queue
    •     - run the Twisted reactor over that array, making requests to the Piwik server in parallel
    •     - if a request fails, log it to be reprocessed later and
    •       raise an alarm in the monitoring system
    •       (we use Zabbix, btw; for more information: http://lorieri.github.com/zabbix/)
    •  
    • Note: it is good to have one SNS topic and one SQS queue per virtual host if you have many of them.
    •  
    • Python details later
  • Better costs management
    • - Contributing to Piwik and sharing our ideas brings more ideas and more improvements, and
    •     one consequence of that is reduced costs
    • - CPU on Amazon is cheap, and you pay as you go by the hour
    • - Traffic on Amazon is cheap, and you pay as you use it, with no long-term contracts
    • - By dividing the work we can manage resources better, like having only one or two huge machines
    •   for the MySQLs and lots of small virtual Nginx machines in an autoscaling setup. It is easy to move the workers'
    •   processing to other machines
    • - Not changing Piwik's code reduces maintenance and development costs
    • - High availability on Amazon is easy and cheap
    • - Storage durability on Amazon is automatic and cheap
    • - Storage retrieval and management on Amazon is very easy and fast
    • - Distribution control on Amazon is easy and cheap
    • - Having an easy way to access the logs makes it simple to replay traffic, so you can run as many tests as
    •   you need and try as many tools as you want, improving resource usage and reducing costs
    • - Amazon reduces its prices and improves its services all the time
  • Not only web analytics
    • We are also using Piwik to log video plays
    • Once a user hits the play button in the Flash player, it triggers a
    • GET request similar to this:
    • http://player.mysite.com/CATEGORY/VIDEONAME
    • And we use the video name as the Action's Page Title
    • It will appear in Piwik's Actions interface divided by category,
    • and by video name in the action page titles (a small sketch follows this slide)
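    • A small sketch of how such a player request could be turned into a tracking request (illustrative only; the site id is a placeholder):
    •
    •   import urllib
    •   import urlparse
    •
    •   # a player request as logged by nginx on the player virtual host
    •   url = 'http://player.mysite.com/CATEGORY/VIDEONAME'
    •
    •   # use the path (category/video name) as the action_name; Piwik splits page
    •   # titles on '/', so plays show up grouped by category in the Actions report
    •   action_name = urlparse.urlparse(url).path.lstrip('/')
    •
    •   params = urllib.urlencode({'rec': 1, 'idsite': 2, 'action_name': action_name, 'url': url})
    •   print 'http://127.0.0.1/piwik.php?' + params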
  • Real Numbers
    • Sorry, we can't provide real numbers, but we can do tests and show how far we can go.
    •   - Collecting data
    •     Nginx: as many requests per second as we need; it is just a matter of adding more cheap
    •     Nginx virtual machines
    •   - REST API
    •     Running it outside the master machine we got 1500 requests/s; our MySQL master
    •     has 2 quad-core CPUs, 64 GB of memory and RAID 10
    •   - Download of logs
    •     If you run inside Amazon, the traffic is free, the bandwidth is huge
    •     and the latency is small. We download the logs outside Amazon and it is not our
    •     bottleneck yet
    •   - Distributed task control
    •     SNS and SQS do it for us; not a bottleneck yet
    •   - We are still testing how much data we can archive over a month or two; it is already
    •     possible to archive one hour of 1000 requests/s in 30 minutes (including the S3 download
    •     and uncompression), enough to log 50 million requests a day. But the tests are still at an early stage (the arithmetic is sketched after this slide).
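    • A rough sketch of the headroom implied by that test figure, using only the numbers quoted above:
    •
    •   # One hour of traffic at 1000 req/s is processed in 30 minutes (download + gunzip included)
    •   INPUT_RATE = 1000
    •   hour_of_traffic = INPUT_RATE * 3600            # 3.6M requests
    •   archive_rate = hour_of_traffic / (30 * 60.0)   # ~2000 requests/second of processing
    •
    •   daily_ceiling = archive_rate * 86400           # ~173M requests/day the pipeline could keep up with
    •   daily_target = 50 * 10**6                      # the "50 million a day" figure above
    •   print "rate: %.0f req/s, ceiling: %.0fM/day, target: %dM/day" % (
    •       archive_rate, daily_ceiling / 1e6, daily_target / 10**6)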
  • CODE OR GTFO!
    • It is hard to show all the code.
    • Most of the tools are regular Ubuntu and Amazon tools, and some others are relevant only to us. But some code and a few links can help a lot.
    • Unfortunately I can't teach everything, including how to use all the Amazon tools, but the key points will be shown, like how to get the real user IP address on Amazon and some of the Linux and MySQL tuning.
  • MySql tuning details - Raid
    • Raid: http://hwraid.le-vert.net/
    • Our Raid: http://hwraid.le-vert.net/wiki/LSIMegaRAIDSAS
    • Our commands:
    • Check battery status:
    • /usr/sbin/megacli -AdpBbuCmd -GetBbuStatus -a0 | grep -e '^isSOHGood'|grep ': Yes'
    • Check that the write cache policy (WriteBack) is enabled:
    • /usr/sbin/megacli -LDInfo -LAll -aAll|tee /tmp/chefraidstatus |grep 'Default Cache Policy: WriteBack'
    •  
    • Turn on the cache:
    • /usr/sbin/megacli -LDSetProp Cached -LALL -aALL
    •  
    • Turn off the cache in case the battery is not good:
    • /usr/sbin/megacli -LDSetProp Direct -LALL -aALL
    •  
    • Turn on the HDD cache:
    • /usr/sbin/megacli -LDSetProp EnDskCache -LAll -aAll
    • Turn on the adaptive read-ahead cache:
    • /usr/sbin/megacli -LDSetProp ADRA -LALL -aALL
    • DO NOT FORGET TO MONITOR THE RAID: there are tools for it on the website above
    •  
  • MySql tuning details - Innodb
    • All the MySQL tuning you can find here:
    •          http://www.slideshare.net/matsunobu/linux-and-hw-optimizations-for-mysql-7614520
    • Our tuning:
    •  
    • table_cache=1024
    • tmp_table_size=6G
    • max_heap_table_size=6G
    • thread_cache=16
    • query_cache_size=1G
    • query_cache_limit=4M
    • default-storage-engine = InnoDB
    • expire_logs_days = 5
    • ignore-builtin-innodb
    • plugin-load=innodb=ha_innodb_plugin.so
    • max_binlog_size = 1024M
    • skip-name-resolve
    • innodb_flush_log_at_trx_commit=2
    • innodb_thread_concurrency=32  # we have 16 cpu threads
    • innodb_buffer_pool_size = 40G # we have 64G of memory
    • innodb_flush_method=O_DIRECT
    • innodb_additional_mem_pool_size=100M
    • innodb_log_buffer_size = 18M
    • innodb_log_file_size = 300M
    • interactive_timeout = 999999
    • wait_timeout = 999999
  • Linux tuning
    • /etc/sysctl.conf:
    • vm.swappiness = 0
    • net.core.somaxconn = 1024
    • net.ipv4.tcp_rmem = 4096 4096 16777216
    • net.ipv4.tcp_wmem = 4096 4096 16777216
    • net.ipv4.tcp_timestamps = 0
    • net.ipv4.tcp_sack = 1
    • net.ipv4.tcp_window_scaling = 1
    • net.ipv4.tcp_fin_timeout = 20
    • net.ipv4.tcp_keepalive_intvl = 30
    • net.ipv4.tcp_keepalive_probes = 5
    • net.ipv4.tcp_tw_reuse = 1
    • net.core.netdev_max_backlog = 5000
    • net.ipv4.ip_local_port_range = 2000 65535
    • fs.file-max=999999
    • /etc/security/limits.conf
    •  
    • # max open files
    • *               -       nofile         999999
    • /etc/default/nginx
    • ULIMIT="-n 999999"
  • Nginx confs - Getting Real User IP on AWS ELB
    • # apt-get install nginx-extras
    • To get the real user IP in an Elastic Load Balancer setup, add these lines inside the http context in /etc/nginx/nginx.conf:
    • perl_set $ip 'sub {
    •         my $r = shift;
    •         local $_ = $r->header_in("X-Forwarded-For");
    •         # XXX only works well because we know the AWS network uses 10.x.x.x ip addresses
    •         # Thanks Zed9h
    •         my $ip0 = m{.*\b(
    •                 (?:
    •                         \d|
    •                         1[1-9]|
    •                         [2-9]\d|
    •                         [12]\d{2}
    •                 )\.\d+\.\d+\.\d+
    •         )\b}xo && $1;
    •         # $ip0 ne $ip1 && "$ip0 ne $ip1\t\t$_"; # debug
    •         $ip0 || $r->remote_addr
    • }';
    • (Thanks Zed for the Perl script)
  • Nginx confs - Adding a virtual host (1/2)
    • Create a file at /etc/nginx/sites-available/VHOST.conf
    • server {
    •         listen   80; ## listen for ipv4; this line is default and implied
    •         server_name VHOST.MYSITE.com;
    •         root /usr/share/nginx/www;
    •         index index.html index.htm;
    •
    •         userid on;
    •         userid_name uid;
    •         userid_domain MYSITE.com;
    •         userid_expires max;
    •         set $myid $uid_got;
    •
    •         location = /crossdomain.xml {
    •                 echo '<?xml version="1.0"?><!DOCTYPE cross-domain-policy SYSTEM "http://www.macromedia.com/xml/dtds/cross-domain-policy.dtd"><cross-domain-policy><allow-access-from domain="*" /></cross-domain-policy>';
    •                 expires       modified +24h;
    •                 access_log off;
    •                 error_log /var/log/nginx/error.log;
    •         }
    •
    •         location / {
    •                 if ($uid_got = ""){
    •                         set $myid $uid_set;
    •                 }
    •                 expires -1;
    • #               return 204;  # use this if you want an empty response
    •                 empty_gif;   # use this if you want an empty gif response
    •         }
  • Nginx confs - Adding a virtual host (2/2)
    •         location /healthcheck {
    •                 try_files $uri =404;
    •                 access_log off;
    •                 error_log /var/log/nginx/error.log;
    •         }
    •
    •         location /nginx_status {
    •                 stub_status on;
    •                 allow 127.0.0.1;
    •                 deny all;
    •                 access_log off;
    •                 error_log /var/log/nginx/error.log;
    •         }
    •
    •         # !!!!!!!!!!!!!!!!!!!!
    •         # the log format is for Amazon AWS only; if you have the real IP, change
    •         # the ip variable to $remote_addr
    •         # (note: nginx expects log_format in the http context; move it to nginx.conf if it complains)
    •         log_format VHOST        '$ip - $remote_user [$time_local]  '
    •                                 '"$request" $status $body_bytes_sent '
    •                                 '"$http_referer" "$http_user_agent" '
    •                                 '"$myid"';
    •
    •         # /mnt is the AWS instance's fastest partition
    •         access_log /mnt/log/nginx/VHOST.access.log VHOST;
    •         error_log /mnt/log/nginx/VHOST.error.log;
    • }
  • Testing Nginx
    • $ curl localhost/
    • the result must be a gif
    • $ curl -I localhost/
    • a uid cookie must be set in the response headers
  • Php tuning details
    • # apt-get install php-apc
    •  
    •  
    • create a file /etc/php5/conf.d/piwik.ini
    • memory_limit = 15G
    • max_execution_time = 0
    • max_input_time = 0
    • apc.shm_size = 64
    • * Piwik tuning on previous slides
  • AWS SNS
    •  
    • It is out of scope to teach how to create an Autoscaling group, an SNS topic, an S3 bucket and an SQS queue.
    • We will only show how we use them.
    •  
    • Create an SNS topic on Amazon ("MYTOPIC") and attach an SQS queue to it ("MYQUEUE")
    • Install the SNS client and unzip it somewhere, let's say /usr/local/bin:
    • Download from Amazon the file: SimpleNotificationServiceCli-2010-03-31.zip
    • Install the JDK:
    • # apt-get install openjdk-6-jdk
    • Create a .conf file with a key and secret to access SNS, let's say /usr/local/sns.conf:
    • AWSAccessKeyId=XXXXXXXXXXXX
    • AWSSecretKey=XXXXXXXXX
    • Create a source file at /usr/local/sns_env.source:
    • export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/
    • export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/bin/SimpleNotificationServiceCli-1.0.2.3/bin/
    • export AWS_SNS_HOME=/usr/local/bin/SimpleNotificationServiceCli-1.0.2.3/
    • export EC2_REGION=us-east-1
    • export AWS_CREDENTIAL_FILE=/usr/local/sns.conf
  • AWS S3 and s3cmd
    • Create an S3 bucket on Amazon
    • $ apt-get install s3cmd
    • $ s3cmd --configure
    • $ cp ~/.s3cfg /usr/local/s3cmd.cfg
  • Log rotate (1/3)
    • The script was adapted from Ubuntu's init and logrotate scripts and put in a cronjob. You could use logrotate itself, though.
    •  
    • in the crontab:
    • */5 * * * *  nice /bin/bash /usr/local/bin/VHOST.sendS3.sh >> /mnt/log/VHOST.send.log 2>&1
    •  
    • sendS3.sh:
    •  
    • #!/bin/bash
    • date # print the date to the log
    •  
    • DEBUGS3=`mktemp`
    • atexit() {
    •         rm -f $DEBUGS3
    • }
    • trap atexit 0
    •  
    • BUCKET="MYBUCKET"
    • PROJECT="MYVHOST"
    • ARCHIVEDIR="/mnt/MYVHOST/"
    • S3CMD_CONF="/usr/local/s3cmd.cfg"
    • ORIGINPATH="/mnt/log/nginx/VHOST.access.log"
    • SNS_ENV="/usr/local/sns_env.source"
    • SNS_TOPIC="MYTOPIC"
    •  
    • HOST=`hostname` # we use the instance-id on amazon
    • DATE=$(date --utc +%Y%m%d_%H%M%S)
    • DATEDIR=$(date --utc +%Y/%m/%d)
    • POSTPATH="$PROJECT/$DATEDIR/$PROJECT-$DATE-$HOST.log"
    • LOCALPATH="$ARCHIVEDIR/$POSTPATH"
    • GZLOCALPATH="$LOCALPATH.gz"
    • REMOTEPATH="s3://$BUCKET/$POSTPATH.gz"
  • Log rotate (2/3)
    • echo "->Trying file: $REMOTEPATH"
    • LOCALDIR="$(dirname "$LOCALPATH")"
    •  
    • # sleep 1 recommended by nginx's wiki
    • mkdir -p "$LOCALDIR" && mv "$ORIGINPATH" "$LOCALPATH" &&
    •   { [ ! -f /var/run/nginx.pid ] || kill -USR1 `cat /var/run/nginx.pid` ; } &&
    •   sleep 1 && gzip "$LOCALPATH" &&
    •   { MD5=$(/usr/bin/md5sum "$GZLOCALPATH" | awk '{ print $1 }') ; }
    •  
    • # try the upload up to 3 times
    • if [ -z "$MD5" ]
    • then
    •         echo "ERROR ON MD5"
    •         OK=1
    • else
    •         OK=$(/usr/bin/s3cmd -d --no-progress -c "$S3CMD_CONF" put "$GZLOCALPATH" "$REMOTEPATH" 2>&1 | grep -q "DEBUG: MD5 sums: computed=$MD5, received="$MD5""; echo $?)
    •         if [ "$OK" -eq "1" ]
    •         then
    •                 OK=$(/usr/bin/s3cmd -d --no-progress -c "$S3CMD_CONF" put "$GZLOCALPATH" "$REMOTEPATH" 2>&1 | grep -q "DEBUG: MD5 sums: computed=$MD5, received="$MD5""; echo $?)
    •                 if [ "$OK" -eq "1" ]
    •                 then
    •                         /usr/bin/s3cmd -d --no-progress -c "$S3CMD_CONF" put "$GZLOCALPATH" "$REMOTEPATH" 2>&1 | tee "$DEBUGS3"
    •                         OK=$(grep -q "DEBUG: MD5 sums: computed=$MD5, received="$MD5"" "$DEBUGS3"; echo $?)
    •                 fi
    •         fi
    • fi
  • Log rotate (3/3)
    •  
    • # if ok, publish a message on SNS
    •  
    • if [ "$OK" = "0" ]
    • then
    •         source "$SNS_ENV"
    •         echo -n '-> Message: '
    •         sns-publish "$SNS_TOPIC" --message "$REMOTEPATH"
    •         OK=${PIPESTATUS[0]}
    • fi
    • echo "OK=$OK"
    •  
    • # for monitoring
    • #/usr/bin/zabbix_sender -s "$HOST" -z XXXXXX.com -k XXXXXX -o "$OK"
  • Rotating and uploading logs on reboot and shutdown
    • This is a protection for the Autoscaling group, where machines are created and
    • terminated all the time.
    • Create a file at /etc/init.d/VHOST.sendme.sh:
    • #!/bin/bash
    • /bin/echo TERMINATED `date --utc` >> /mnt/log/nginx/VHOST.access.log
    • /usr/bin/nice -20 /bin/bash /usr/local/bin/VHOST.sendS3.sh
    • Then execute:
    • # update-rc.d VHOST.sendme.sh stop 21 0 6 .
    • (the dot at the end of the line is required)
  • Worker details (1/3)
    • I'm not a developer; my worker Python code is too ugly to be shown. It is very similar to the blog post mentioned earlier; the only additions are downloading the S3 files and reading messages from SQS. The functions are similar to these:
    • Connect to S3 and SQS
    •  
    •  
    • import json
    • import logging
    • from boto.sqs.connection import SQSConnection
    • from boto.s3.connection import S3Connection
    • from boto.sqs.message import RawMessage # for SNS messages
    •  
    • print "connecting to sqs"
    • logging.info("connecting to sqs")
    • connsqs = SQSConnection('xxxxxxxxxxxx', 'xxxxxxxxxxxxx')
    •  
    • print "connecting to s3"
    • logging.info("connecting to s3")
    • conns3 = S3Connection('xxxxxxxxxxxxxxxx', 'xxxxxxxxxxxxx')
  • Worker details (2/3)
    • Reading and deleting SQS messages, and putting results in an array:
    • print "getting queue"
    • logging.info("getting queue")
    • my_queue = connsqs.get_queue('MYQUEUE')
    • my_queue.set_message_class(RawMessage) # raw messages from SNS
    •  
    • maxmsgs = 10
    • msgs = []
    • msg = my_queue.read()
    • while msg:
    •         logging.info("getting message")
    •         msgsingle = json.loads(msg.get_body())['Message']
    •         logging.info(msgsingle)
    •         msgs.append(msgsingle)
    •  
    •         logging.info("deleting message")
    •         my_queue.delete_message(msg)
    •         if len(msgs) < maxmsgs:
    •                 logging.info("getting more messages")
    •                 msg = my_queue.read()
    •         else:
    •                 msg = False
  • Worker details (3/3)
    • Getting files from S3 and putting lines in an array:
    • import gzip
    • import os
    •  
    • lines = []
    • filename = '/tmp/tmppiwikpy.%s.txt' % os.getpid()
    • for msg_data in msgs:
    •         llog = "trying file " + msg_data
    •         logging.info(llog)
    •         if "s3://MYBUCKET/" in msg_data:
    •                 s3obj = msg_data.replace("s3://MYBUCKET/", "")
    •                 llog = "downloading " + msg_data
    •                 logging.info(llog)
    •                 key = conns3.get_bucket('MYBUCKET').get_key(s3obj)
    •                 key.get_contents_to_filename(filename)
    •                 llog = "decompressing file " + msg_data
    •                 logging.info(llog)
    •                 fgz = gzip.open(filename, 'r')
    •                 line = fgz.readline()
    •                 while line:
    •                         lines.append(line)
    •                         line = fgz.readline()
    •  
    •                 llog = "closing and deleting temporary file"
    •                 logging.info(llog)
    •                 fgz.close()
    •                 os.remove(filename)
  • Piwik REST API
    • Check it here:
    • http://piwik.org/docs/tracking-api/#toc-tracking-api-rest-documentation
    • Your worker script has to create a URL like this:
    •  
    • http://127.0.0.1/piwik.php?action_name=NAME&idsite=XX&rand=RANDOMNUMBER&rec=1&url=URL&cip=USER_IP&token_auth=YOUR_PIWIK_ADMIN_TOKEN&_id=COOKIE_FROM_NGINX&cdt=DATE_OF_VISIT
    •  
    • The _id is derived from the Nginx cookie: the first 16 characters of its md5 sum, the same as Piwik does internally (a small sketch follows this slide)
    •  
    • The date of visit must be in UTC
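    • A minimal sketch of that _id derivation, assuming the raw value of the Nginx uid cookie is passed in; it mirrors what this slide describes, not Piwik's internal code:
    •
    •   import hashlib
    •
    •   def piwik_visitor_id(nginx_uid_cookie):
    •       # first 16 hex characters of the md5 of the nginx userid cookie value,
    •       # used as the _id parameter of the tracking request
    •       return hashlib.md5(nginx_uid_cookie).hexdigest()[:16]
    •
    •   # example with a made-up cookie value
    •   print piwik_visitor_id("CgoKCk9PT09PT09P")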
  • Others / Next steps
    •   - If you deploy this outside Amazon, it makes sense to send the log lines to a queue
    •     and have lots of small workers reading them and replaying them into Piwik. It is
    •     easier to handle, skip or reprocess a failed line than an entire log file
    •   - We still have room to improve: we are not using SSD cards, and we haven't done
    •     any partitioning or sharding
    •   - Visit logs and action logs will need to be changed in order to make the database
    •     cheaper and more scalable
    •   - Our next step will be to try to improve the archiving, probably our next bottleneck
  • What is missing on Piwik
    •   - split read and write connections to the MySQL database, so we can get
    •     the benefits of MySQL replication, like running a dedicated slave for the
    •     archive.php selects and a dedicated slave for non-admin users
    •   - create a database per website. It is easier to maintain and reduces
    •     the MySQL index sizes so they fit in memory. You can partition the
    •     tables by idsite, which helps
    •   - people reading this presentation and sending feedback :)
    •     please use Piwik's forums for this
    •   - feature request: optionally have a MySQL connection per website, or be
    •     able to configure Piwik's interface to import data from other Piwik
    •     installations, having all websites in a single place. That way we can
    •     have smaller databases for each website. (Zabbix has a similar feature)
    • - feature request: have optional MySQL connection profiles, so we can
    •   set smaller buffers for smaller tasks, improving memory usage
    • Thanks Piwik !
    • Now we have modern analytics for old problems
    • and a modern scaling setup for a traditional LAMP stack
    • Thanks Zed (aka Carlo) for all programming support,
    • Gaiser for all Amazon tips, Matt for all Piwik tips.
    • And thanks to the R7 managers Denis and Vechiato for believing in this and providing the time and
    • resources to make it happen, and to R7 director Brandi for reviewing it and allowing us to share.