Log everything! @DC13

Big commercial websites breathe data: they create a lot of it very fast, but they also need feedback based on that very same data to become better and better.
In this talk we show our ideas, the drawbacks, and the solutions for building your own big data infrastructure.
We further explore the possibilities of accessing and harnessing the data using map/reduce and near-realtime approaches, to prepare you for the most challenging part of it all: gaining relevant knowledge you did not have before.

This talk was held at the Developer Conference 2013 (http://www.developer-conference.eu/session_post/log-everything/)

1. Log Everything! @DC13

2. Stefan & Mike
    Dr. Stefan Schadwinkel – Co-Founder / Analytics Engineer – stefan.schadwinkel@deck36.de
    Mike Lohmann – Co-Founder / Software Engineer – mike.lohmann@deck36.de

3. ABOUT DECK36 – Who We Are
    – DECK36 is a young spin-off from ICANS
    – Small team of 7 engineers
    – Longstanding expertise in designing, implementing and operating complex web systems
    – Developing our own data-intelligence-focused tools and web services
    – Offering our expert knowledge in Automation & Operations, Architecture & Engineering, Analytics & Data Logistics

4. WHAT WE WILL TALK ABOUT – Topics
    – Log everything! – The Data Pipeline.
    – Tackling the Leviathan – Realtime Stream Processing with Storm.
    – JS Client DataCollector: Live Demo
    – Storm Processing with PHP: Live Demo

5. Log everything! – The Data Pipeline

6. THE DATA PIPELINE – Requirements
    Background: Building and operating multiple education communities
    Baseline: PokerStrategy.com KPIs
    – 6M registered users, 700k posts/month, 2.8M page impressions/day, 7.6M requests/day
    New products → new business models → new questions
    – Extendable generic solution
    – Storage and accessibility more important than specific, optimized applications

7. THE DATA PIPELINE – Requirements
    Producer → Transport → Storage → Analytics → Realtime Stream Processing
    Producer
    – Monolog plugin, JS client
    Transport
    – Flume 0.9.4 m( → RabbitMQ, Erlang consumer
    – Evaluated Apache Kafka
    Storage
    – Hadoop HDFS (our very own) → Amazon S3

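    The producer code itself isn't on the slide; as a minimal sketch of the idea, a PHP producer could push log events to RabbitMQ via Monolog's stock AmqpHandler and php-amqplib. Host, credentials and exchange name below are invented placeholders, not the actual DECK36 plugin:

    <?php
    // Minimal producer sketch: Monolog publishing log events to RabbitMQ.
    // Host, credentials and exchange name are illustrative placeholders.
    require 'vendor/autoload.php';

    use Monolog\Logger;
    use Monolog\Handler\AmqpHandler;
    use PhpAmqpLib\Connection\AMQPStreamConnection;

    $connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
    $channel = $connection->channel();
    $channel->exchange_declare('log-events', 'topic', false, true, false);

    $logger = new Logger('website');
    $logger->pushHandler(new AmqpHandler($channel, 'log-events'));

    // Each call publishes one JSON-encoded log record to the exchange.
    $logger->info('user.registered', ['userId' => 12345, 'source' => 'web']);
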
8. THE DATA PIPELINE – Logging Pipeline
    Producer → Transport → Storage → Analytics → Realtime Stream Processing
    Analytics
    – Hadoop MapReduce → Amazon EMR, Python, R
    – Exports to Excel (CSV), QlikView → Amazon Redshift
    Realtime Stream Processing
    – Twitter Storm

9. THE DATA PIPELINE – Unified Message Format
    – Fixed, guaranteed envelope
    – Processing driven by message content
    – A single message compresses (LZOP) to about 70% of its original size (1184 B → 817 B)
    – Message bulk compresses to about 12–14% of the original size (at 42k and 325k messages)

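    The slide doesn't enumerate the envelope fields; the following is a hypothetical sketch of such a fixed envelope around a free-form payload. All field names are assumptions, not the actual UMF specification:

    <?php
    // Hypothetical unified message: fixed envelope, content-driven processing.
    $message = [
        // Fixed, guaranteed envelope: every producer must fill these.
        'version'   => '1.0',
        'type'      => 'icans.content',   // drives routing and compaction
        'timestamp' => gmdate('c'),       // ISO 8601, UTC
        'host'      => gethostname(),
        'uuid'      => bin2hex(random_bytes(16)),
        // Free-form payload: consumers dispatch on "type".
        'payload'   => ['userId' => 12345, 'action' => 'post.created'],
    ];
    echo json_encode($message), PHP_EOL;
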
10. Unified Message Format

11. THE DATA PIPELINE – Compaction
    RabbitMQ consumer (Erlang) stores data to the cloud
    – Relatively large number of files
    – Mixed messages
    We want
    – A few files
    – Messages grouped by "event type" and "time partition"
    – Data transformation
    Determined by message content:
    s3://[BUCKET]/icanslog/[WEBSITE]/icans.content/year=2012/month=10/day=01/part-00000.lzo
    → Hive partitioning!

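    To illustrate "determined by message content": the event type and timestamp of each message map directly onto the partitioned S3 key. The helper below is a hypothetical illustration, not code from the talk:

    <?php
    // Hypothetical helper: derive the Hive-partitioned S3 prefix for a message
    // from its envelope, following the path scheme on the slide.
    function partitionPath(array $message, $bucket, $website)
    {
        // Timestamps carry their own offset; normalize to UTC for partitioning.
        $ts = new DateTime($message['timestamp'], new DateTimeZone('UTC'));
        return sprintf(
            's3://%s/icanslog/%s/%s/year=%s/month=%s/day=%s/',
            $bucket, $website, $message['type'],
            $ts->format('Y'), $ts->format('m'), $ts->format('d')
        );
    }
    // e.g. s3://my-bucket/icanslog/example.com/icans.content/year=2012/month=10/day=01/
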
12. THE DATA PIPELINE – Compaction
    Using Cascalog
    – Based on Clojure (LISP) and Cascading
    – Provides a Datalog-like query language
    – Don't LISP? → JCascalog
    Very handy features (unavailable in Hive or Pig)
    – Cascading output taps can be parameterized by data records
    – Trap location for corrupted records (the job finishes for all the correct messages)
    – Runs within the JVM → large available codebase, arbitrary processing is simple

13. Cascalog Query Syntax
    Cascalog is Clojure, Clojure is Lisp:

    (?<-                  ;; query operator
     (stdout)             ;; Cascading output tap
     [?person]            ;; columns of the dataset generated by the query
     (age ?person ?age)   ;; "generator"
     (< ?age 30))         ;; "predicate"

    Generators and predicates:
    – as many as you want
    – both can be any Clojure function
    – Clojure can call anything that is available within a JVM

14. Cascalog Query Syntax
    Run the Cascalog processing on Amazon EMR:

    ./elastic-mapreduce [standard parameters omitted] \
      --jar s3://[BUCKET]/mapreduce/compaction/icans-cascalog.jar \
      --main-class icans.cascalogjobs.processing.compaction \
      --args "s3://[BUCKET]/incoming/*/*/*/","s3://[BUCKET]/icanslog","s3://[BUCKET]/icanslog-error"

15. THE DATA PIPELINE – Data Queries with Hive
    Hive is table-based and provides SQL-like syntax
    – Assumes one storage location (directory) per table
    – Simple to use if you know SQL
    – Widely used, rapid development for "simple" queries
    Hive @ Amazon
    – Table locations can be S3
    – "Cluster on demand" → requires rebuilding the Hive metadata
    – CREATE TABLE for source and target S3 locations
    – Import table metadata (auto-discovery for partitions)
    – INSERT OVERWRITE to query source table(s) and store to the target S3 location

16. Hive @ Amazon (1)

17. Hive @ Amazon (2)
    We can now simply copy the data from S3 and import it into any local analytical tool,
    e.g. Excel, Redshift, QlikView, R, etc.

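    Slides 16 and 17 showed the Hive session itself; as a rough stand-in, the statements below follow the recipe from slide 15. Table names, columns and bucket paths are invented placeholders:

    <?php
    // HiveQL following slide 15's recipe, written out by a small PHP script.
    // Run the resulting file on the EMR cluster with: hive -f compaction-report.hql
    $hiveScript = <<<'HQL'
    -- Source table over the compacted, partitioned S3 data
    CREATE EXTERNAL TABLE icans_content (message STRING)
    PARTITIONED BY (`year` STRING, `month` STRING, `day` STRING)
    LOCATION 's3://[BUCKET]/icanslog/[WEBSITE]/icans.content/';

    -- Auto-discover the year=/month=/day= partitions
    MSCK REPAIR TABLE icans_content;

    -- Target table, also on S3
    CREATE EXTERNAL TABLE daily_counts (dt STRING, events BIGINT)
    LOCATION 's3://[BUCKET]/icansreport/daily_counts/';

    -- Query the source and store the result at the target location
    INSERT OVERWRITE TABLE daily_counts
    SELECT concat(`year`, '-', `month`, '-', `day`), count(*)
    FROM icans_content
    GROUP BY `year`, `month`, `day`;
    HQL;
    file_put_contents('compaction-report.hql', $hiveScript);
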
18. Further Reading
    – More details in the Log Everything! ebook
    – Available at Amazon and DeveloperPress

19. THE DATA PIPELINE – Still: It's Batch Processing
    – While quite efficient in flight, the logistics of getting the job started are significant.
    – Only cost-efficient for long-distance travel.

20. THE DATA PIPELINE – Instant Insight through Stream Processing
    – Often, only updates for the recent day, week, or month are necessary
    – Time is of the essence when direct feedback or user interaction is desired

21. More Wind In The Sails With Storm

22. REALTIME STREAM PROCESSING – Instant Insight through Stream Processing
    – Distributed realtime processing framework
    – Battle-proven by Twitter
    – All *BINGO abilities fulfilled!
    – Hadoop = batch data processing; Storm = realtime data processing
    – More (and maybe new) *BINGO: DRPC, ETL, RTET, spouts, bolts, tuples, topologies
    – Easy to use (really!)

23. Realtime Stream Processing Infrastructure with Storm
    Producer → Transport → Analytics → Storage
    [Diagram: apps & servers and a NodeJS tier produce into a queue; the Storm cluster
    (Nimbus master, Zookeeper, supervisors running workers) consumes it for realtime
    data stream analytics; results go to S3 and a DB, with Zabbix and Graylog alongside]

24. REALTIME STREAM PROCESSING – JS Client Features
    – Event system
    – Master/slave tabs
    – Local queuing of data
    – Ability to use Node modules
    – Easy to extend
    – Complete development suite
    – Deliver bundles with or without vendors

25. Realtime Stream Processing – Loading the JS Client

    <script .. src="https://cdn.tradimo.com/js/starlog-client.min.js?5193e1ba0325c756b78d87384d2f80e9"></script>

    Handshake between browser and NodeJS backend:
    1. GET starlog-client.min.js – the server creates a signed cookie and answers with Set-Cookie: UUID
    2. The client requests /socket.io/1/websockets with Upgrade: websocket and Cookie: UUID
    3. The server checks the cookie and answers HTTP 101 – Protocol Change
       (Connection: Upgrade, Upgrade: websocket) – connection established
    4. The client collects data and sends it in UMF; the NodeJS backend forwards it
       to the queue (backend magic) and sends counts back to the client

26. Realtime Stream Processing – JS Client in Action
    Use case: if the number of clicks on a domain % 10 == 0, send the "Star Trek Commander" badge

    [Flow: the ClickEvent collector registers an onclick handler; clicked data is written
    to localstorage; the observed Clicked-Data is wrapped in a UMF message and sent over
    the socket connection to NodeJS]

27. Realtime Stream Processing – JS Client in Action

    function ClickFetcher() {
        this.collectData = function (callback) {
            var clicked = 1;
            logger.debug('ClickFetcher - collectData called!');
            window.onclick = function () {
                var collectedData = {
                    key: window.location.host.toString() + window.location.pathname.toString(),
                    value: {
                        payload: clicked,
                        timestamp: +new Date()
                    }
                };
                localstorage.set(collectedData, function (storageResult) {
                    logger.debug("err = " + storageResult.hasError());
                    logger.debug("storageResult = " + storageResult);
                }, false, true, true);
                clicked++;
            };
        };
    }

    var clickFetcher = new ClickFetcher();
    starlogclient.on(starlogclient.COLLECTINGDATA, clickFetcher.collectData);

28. Client Live Demo
    https://localhost:3001/test/1-page-stub.html

29. REALTIME STREAM PROCESSING – Producer Libraries
    – LoggingComponent: provides interfaces, filters and handlers
      https://github.com/ICANS/IcansLoggingComponent
    – LoggingBundle: glues it all together for Symfony2
      https://github.com/ICANS/IcansLoggingBundle
    – Drupal Logging Module: uses the LoggingComponent
      https://github.com/ICANS/drupal-logging-module
    – JS Frontend Client: LogClient framework for browsers
      https://github.com/DECK36/starlog-js-frontend-client

30. Realtime Stream Processing – PHP & Storm
    Use case: if the number of clicks on a domain % 10 == 0, send the "Star Trek Commander" badge
    Using PHP for that!
    https://github.com/Lazyshot/storm-php/blob/master/lib/storm.php

    [Flow: Clicked-Data UMF → queue → Storm → event: "Star Trek Commander" badge]

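    Storm's multilang protocol lets bolts run as external PHP processes speaking JSON over stdin/stdout, which is what the linked storm-php library wraps. The class below sketches only the badge logic; its name and method signature are illustrative, not storm-php's actual API:

    <?php
    // Illustrative badge logic for a Storm bolt (not storm-php's actual API).
    // In a real topology this would run inside the bolt's per-tuple callback,
    // which hands over the deserialized Clicked-Data UMF message.
    class BadgeBolt
    {
        /** @var array clicks seen per domain, keyed by host */
        private $clicksPerDomain = [];

        /** Handle one Clicked-Data message; return a badge event or null. */
        public function process(array $umf)
        {
            // The JS client stores host + pathname in "key"; reduce it to the host.
            $domain = parse_url('https://' . $umf['payload']['key'], PHP_URL_HOST);

            if (!isset($this->clicksPerDomain[$domain])) {
                $this->clicksPerDomain[$domain] = 0;
            }
            $count = ++$this->clicksPerDomain[$domain];

            // Every 10th click on a domain earns the badge.
            if ($count % 10 === 0) {
                return [
                    'type'    => 'badge.awarded',
                    'payload' => ['badge' => 'Star Trek Commander', 'domain' => $domain],
                ];
            }
            return null;
        }
    }

    The returned badge event would travel back through the queue to the NodeJS tier and on to the browser, closing the loop from slide 26.
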
31. Storm & PHP Live Demo

32. REALTIME STREAM PROCESSING – Get Inspired!
    Powered by Storm: https://github.com/nathanmarz/storm/wiki/Powered-By
    – 50+ companies (Twitter, Yahoo, Groupon, Ooyala, Baidu, Wayfair, …)
    – Ads & realtime bidding; data-centric systems (economic, environmental, health); user interactions
    – Language-agnostic backend systems (operate Storm, develop in PHP)
    – Streaming "counts": sentiment analysis, frequent items, multi-armed bandits, …
    – DRPC: custom user feeds, complex queries (e.g. tracing graph links)
    – Realtime, distributed ETL: buffering/retries; integrating data from third-party APIs
      or machine learning; storing to DBs, search engines, etc.

33. Questions?

34. Thanks a lot!

35. You can find us:
    github.com/DECK36
    info@deck36.de
    deck36.de
