Log everything!
Dr. Stefan Schadwinkel and Mike Lohmann




Who we are.




Dr. Stefan Schadwinkel: Analytics
Author (heise.de, Cereb. Cortex, EJN, J. Neurophysiol.)

Mike Lohmann: Architecture
Author (PHPMagazin, iX, heise.de)




Agenda.




- What we do. What we need to do. What we are doing.
- Requirement: Log everything!
- Infrastructure and technologies.
- We want happy business users.
  




ICANS GmbH




PokerStrategy.com by the numbers




PokerStrategy.com: Education since 2005

- 7,600,000 requests/day
- 6,000,000 registered users
- 2,800,000 page impressions (PI)/day
- 700,000 posts/day
- 19 languages



Topics of this talk




- How to use existing technologies and standards.
- Out-of-the-box solutions and ready-to-use scripts.
- Scalability and simplicity of the solution.
- "Good enough" for now!
- The way from a requirement to a solution.
- Open-source Symfony2 bundles for logging.
- Live demo.




What we do.




- We teach poker.
- We create web applications.
- We serve millions of users in different countries, respecting a multitude of market rules.
- We make business decisions driven by complex data analytics.




What we need to do.




- We need to try out other teaching topics, fast.
- We need to gather data from all of these try-outs, accumulate it, and base business decisions on its analysis.
- We need a bigger infrastructure to gather more data.
- We need to hire more (good) people! :-)




What we are doing.




- We build ECF (Education Community Framework).
- We (can) log everything!
- We (now) use Amazon S3 and Amazon EMR to have scalable storage and a MapReduce solution.
- We hire (good) people! :-)




Requirement: Log everything.




- "Are you mad?!"
- "Be more specific, please!"
- "But what about the user's data?!"
  




Logging Tools / Technologies




Producer:   Symfony2 application server and databases
Transport:  Now: RabbitMQ + Erlang consumer. Was: Flume
Storage:    Now: S3 storage, Hadoop via Amazon EMR. Was: virtualized in-house Hadoop
Analytics:  MapReduce, Hive, BI via QlikView




Logging Infrastructure




Producer:   LB → reverse proxy → app servers 1-x, plus the databases
Transport:  each app server's local RabbitMQ shovels into the central RabbitMQ
            cluster, from which the consumer reads
Storage:    S3 and the Hadoop cluster
Analytics:  QlikView, Graylog, Zabbix



Producer




Request to /Home
  → Page Controller fires a PageHit-Event
  → PageHit Event Listener calls Logger::log()
  → Monolog Logger: Processor → Formatter → Handler
  → LogMessage, as JSON, into the local RabbitMQ
  → Shovel to the central cluster

The LogMessage travels as JSON; an example follows below.
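The JSON emitted by the handler follows Monolog's standard record structure; the concrete channel, message, and context values in this sketch are illustrative, not our production schema:

  {"message":"page.hit","context":{"route":"/Home"},"level":200,"level_name":"INFO",
   "channel":"ecf","datetime":"2012-10-15T10:00:00+02:00","extra":{}}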



Producer




- LoggingComponent: provides interfaces, filters, and handlers.
- LoggingBundle: glues it all together with Symfony2.

https://github.com/ICANS/IcansLoggingComponent
https://github.com/ICANS/IcansLoggingBundle




Transport – First Try




- Hey, if we use Hadoop, why not use Flume?
  - Part of the ecosystem
  - Central config
  - Extensible via plugins
  - Flexible flow configuration
  - How? Flume Nodes → Flume Sinks




Transport – First Try




- But... wait!
  - Ecosystem? Just like Hadoop version numbers...
  - Admins say: central config woes!
  - Issues: multi-master, logical vs. physical nodes, Java heap space, etc.
  - Will my plugin run with flume-ng?
  - Ever tried to keep your complex flow and switch reliability levels?

  Read: Our admins still hate me...




Transport – Second Try




- RabbitMQ vs. Flume Nodes
  - Each app server has its own local RabbitMQ.
  - The local RabbitMQ shovels its data to a central RabbitMQ cluster.
  - Similar to the Flume Node concept.
  - Decentralized config: producers and consumers simply connect. A sketch of such a shovel config follows below.
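A minimal sketch of the per-server shovel, assuming the static rabbitmq_shovel plugin configuration of that era; the shovel name, queue, and central host are placeholders, not our production values:

  %% /etc/rabbitmq/rabbitmq.config on each app server (illustrative)
  [{rabbitmq_shovel,
    [{shovels,
      [{logs_to_central,
        [{sources,      [{broker, "amqp://localhost"}]},
         {destinations, [{broker, "amqp://central-rabbit.internal"}]},
         {queue, <<"log_messages">>},
         {reconnect_delay, 5}]}]}]}].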




Transport – Second Try




- But... wait! We still need sinks.
  - Custom-crafted RabbitMQ consumers.
  - We could write them in PHP, but...
  - Erlang, teh awesome!
    - Battle-hardened OTP framework.
    - "Let it crash!" ... and recover.
    - Hot code change. If you want.

  Read: Runs forever.




Storage – First Try




- Use out-of-the-box Hadoop (Cloudera).
- But:
  - Virtualized infrastructure
  - Unknown usage patterns
  - Must be cost-effective
  - Major Hadoop version upgrades




Storage – Second Try




- Use Amazon Web Services.
- Provides flexible virtualized infrastructure.
- Cost-effective storage: S3.
- Hadoop on demand: EMR.




Storage – Storage Amazon S3




- The Erlang RabbitMQ consumer simply copies the incoming data to S3.
  - Easy: exchange the "hadoop" command with "s3cmd". A sketch follows below.
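In shell terms, the consumer's upload step changes roughly like this (file and path names are illustrative):

  # before: push a finished chunk into HDFS
  hadoop fs -put chunk-0001.gz /logs/incoming/2012/10/01/
  # now: push the same chunk to S3 instead
  s3cmd put chunk-0001.gz s3://[BUCKET]/incoming/2012/10/01/chunk-0001.gz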




Storage – Storage Amazon S3




- The S3 bucket receives many small, compressed log file chunks.
- Amazon provides s3DistCp, which does distributed data copying:
  - Aggregate many small files into partitioned large chunks.
  - Change compression. (See the sketch below.)
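A sketch of such a compaction run added as a step to a running job flow, assuming the stock emr-s3distcp jar shipped with EMR; the groupBy pattern and output codec are illustrative:

  ./elastic-mapreduce --jobflow $JOBFLOW \
    --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
    --args '--src,s3://[BUCKET]/incoming/,--dest,s3://[BUCKET]/compacted/,--groupBy,.*/([0-9]+/[0-9]+/[0-9]+)/.*,--outputCodec,lzo'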




Analytics




- We want happy business users.
- We want to answer questions.
  - People want answers to questions they have. Now.
  - No, they couldn't tell you that question yesterday. If they had known, they would have already asked for the answer. Yesterday.
- We also want data-driven applications.
  - Production system analysis.
  - Fraud prevention.
  - Recommendations.
  - Social metrics for our users.


Analytics




- Remember MapReduce.
  - Custom jobs:
    - Streaming: use your favorite language.
    - Java API: Cascading; use your favorite JVM language: Java, Groovy, Clojure, Scala.
  - Data queries:
    - Hive: similar to SQL.
    - Pig: data flow.
    - Cascalog: Datalog-like query language using Clojure and Cascading.




Analytics




- Cascalog is Clojure, Clojure is Lisp:

  (?<- (stdout) [?person] (age ?person ?age) … (< ?age 30))

  - ?<- is the query operator.
  - (stdout) is the Cascading output tap.
  - [?person] lists the columns of the dataset generated by the query.
  - (age ?person ?age) is a "generator"; (< ?age 30) is a "predicate".
  - Use as many generators and predicates as you want; both can be any Clojure
    function, and Clojure can call anything that is available within a JVM.




Analytics




- We use Cascalog to preprocess and organize the incoming flow of log messages.




Analytics




- Let's run the Cascalog processing on Amazon EMR:

  ./elastic-mapreduce --create --name "Log Message Compaction" \
    --bootstrap-action s3://[BUCKET]/mapreduce/configure-daemons \
    --num-instances $NUM \
    --slave-instance-type m1.large \
    --master-instance-type m1.large \
    --jar s3://[BUCKET]/mapreduce/compaction/icans-cascalog.jar \
    --step-action TERMINATE_JOB_FLOW \
    --step-name "Cascalog" \
    --main-class icans.cascalogjobs.processing.compaction \
    --args "s3://[BUCKET]/incoming/*/*/*/","s3://[BUCKET]/icanslog","s3://[BUCKET]/icanslog-error"




Analytics




- After the Cascalog query we have:

  s3://[BUCKET]/icanslog/[WEBSITE]/icans.content/year=2012/month=10/day=01/part-00000.lzo

  The year=/month=/day= layout is exactly Hive partitioning!




Analytics




- Now we can access the log data within Hive:
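The live Hive session is lost with the original screenshot; a minimal sketch of the table definition it could have used, assuming an external table over the partitioned S3 layout (the single message column is an assumption):

  hive -e "
  CREATE EXTERNAL TABLE [WEBSITE]_icanslog_content (message STRING)
  PARTITIONED BY (year STRING, month STRING, day STRING)
  LOCATION 's3://[BUCKET]/icanslog/[WEBSITE]/icans.content/';
  ALTER TABLE [WEBSITE]_icanslog_content
  ADD PARTITION (year='2012', month='10', day='01');
  "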




Analytics




- Now we can run Hive queries on the [WEBSITE]_icanslog_content table!
- But we also want to store the result to S3 (see the sketch below).
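A sketch that does both at once, assuming Hive's INSERT OVERWRITE DIRECTORY with an S3 target; the aggregation itself is illustrative:

  hive -e "
  INSERT OVERWRITE DIRECTORY 's3://[BUCKET]/stats/daily-hits'
  SELECT year, month, day, COUNT(*)
  FROM [WEBSITE]_icanslog_content
  GROUP BY year, month, day;
  "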




Analytics




- Now, get the stats:
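The live stats query is also lost with the screenshot; a hedged example of the kind of query this step could run (columns beyond the partition keys are assumptions):

  hive -e "
  SELECT message, COUNT(*) AS hits
  FROM [WEBSITE]_icanslog_content
  WHERE year = '2012' AND month = '10'
  GROUP BY message
  ORDER BY hits DESC
  LIMIT 10;
  "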




Analytics




- We can now simply copy the data from S3 (see the sketch below) and import it into any local analytical tool, such as:
  - Excel (it must really make business people happy...)
  - QlikView (anyone can be happy with it...)
  - R (if I want an answer...)
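Fetching the results is a single s3cmd call; the local path and the note about Hive's default field delimiter are the only assumptions here:

  s3cmd get --recursive s3://[BUCKET]/stats/daily-hits/ ./daily-hits/
  # Hive writes \001 (ctrl-A) separated text files; set the delimiter
  # accordingly when loading into Excel, QlikView, or R.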




Thank you.




Questions?




Contacts.




Dr. Stefan Schadwinkel: stefan.schadwinkel@icans-gmbh.com (ICANS_StScha)
Mike Lohmann: mike.lohmann@icans-gmbh.com (mikelohmann)




Tools/Technologies




ICANS GmbH
Valentinskamp 18
20354 Hamburg
Germany


Phone:   +49 40 22 63 82 9-0
Fax:     +49 40 38 67 15 92


Web: www.icans-gmbh.com
