Singer, Pinterest's Logging Infrastructure

Krishna Gade and Roger Wang talk about Pinterest and Singer, our Logging Infrastructure.

Published in: Technology

  1. Krishna Gade, Data Engineering Manager. Discover Pinterest: Big Data and Apache Mesos
  2. Connor Doyle (Mesosphere), Roger Wang (Pinterest), Bernardo Gomez Palacio (Guavus)
  3. Pinterest is a data product.
  4. DATA [diagram labels: A/B Experimentation, Promoted Pins, Product Insights, Spam Control, Related Pins, Home Feed, Search Quality]
  5. Numbers
     • > 30 billion Pins
     • 10 billion messages a day logged to Kafka
     • 10 petabytes of data in S3
     • Ingest 20 terabytes of new data each day
     • A petabyte a day processed in Hadoop
     • 6 Hadoop clusters of 3000 nodes in AWS
     • Over 100 regular users running over 2,000 jobs each day
  6. 4x Data Growth
  7. Data Architecture Overview [diagram labels: pins; repins, likes; impressions; Kafka; App; Storm; Hadoop; Singer; HBase; Redshift; Insights; Features]
  8. Roadmap
     • Switch to Kafka 0.8 for all data streams
     • Invest in scalable stream processing for realtime insights and products
     • Migrate to a robust Hadoop 2.0 platform
     • Experiment with Spark, especially for machine learning
     • Unified batch and stream compute framework
  9. Roger Wang, Software Engineer. Singer: A High-Performance Logging Infrastructure
  10. Logging Infrastructure before Singer [diagram labels: Storm; kafka agent; kafka agent; Host; Kafka Consumer; S3; Kafka copier; Kafka Cluster; Hadoop cluster]
  11. Logging Infrastructure with Singer [diagram labels: Storm; kafka agent; kafka agent; Host; singer agent; Kafka Consumer; S3; Secor; Kafka Cluster; Hadoop cluster]
  12. Singer Logging Agent
      • Simple logging mechanism for applications
        • Decouple applications from log repository
        • Existing applications that log to disk
      • Isolate applications from Singer agent failure
      • Isolate applications from log repository failure
        • Avoid internal buffering and log loss
      • Better resource usage
        • Connection consolidation
        • Flexible batching
  13. Singer Features
      • At-least-once delivery
      • Configurable adaptive log latency by periodic tailing
      • Dynamically discover new log streams
      • Dynamically pick up new log configuration
      • Pluggable log stream reader
      • Pluggable log stream writer
      • Rich set of stats via Ostrich
  14. Singer Architecture [diagram labels: LogStream monitor; Configuration watcher; Log configuration; LogStream processors A-1, A-2, B-1, C-1, each with a Reader and a Writer; Log repository]
  15. Singer Concepts and Components
      • LogStream/LogFile
      • LogPosition
      • LogStreamMonitor
      • LogStreamProcessor
      • LogStreamReader/LogFileReader
      • LogStreamWriter
  16. Log Stream Monitor [diagram labels: LogStream monitor; Log Stream A-1, Processor, Stats; Log Stream B-1, Processor, Stats; Log Stream B-2 (empty log stream), Processor, Stats; LogStream Registrar; Periodic Task]
  17. Log Stream Processor [flowchart "Processing a batch"; labels: Reader; Writer; Refresh LogStream; Load position and seek reader; Process batch; Commit position; EOS (Yes/No); next batch; update stats; calculate next processing time; schedule next processing cycle; Abort on exception (three exits)]
  18. Adaptive Log Processing Interval
      • No message: next cycle = min(MaxInterval, 2*current interval)
      • > 1 messages: next cycle = MinInterval
      • Interval range: [MinInterval, MaxInterval]
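The interval rule on slide 18 can be sketched in a few lines of Java. This is a minimal illustration, not Singer's code: the class name `AdaptiveInterval` and its `next` method are hypothetical, and it assumes that any non-empty batch resets the interval to the minimum (the slide's "> 1 messages" is read loosely here).

```java
import java.time.Duration;

// Hypothetical helper illustrating the slide's rule; not part of Singer's API.
final class AdaptiveInterval {
    private final Duration minInterval;
    private final Duration maxInterval;

    AdaptiveInterval(Duration minInterval, Duration maxInterval) {
        if (minInterval.isZero() || minInterval.isNegative()
                || maxInterval.compareTo(minInterval) < 0) {
            throw new IllegalArgumentException("need 0 < min <= max");
        }
        this.minInterval = minInterval;
        this.maxInterval = maxInterval;
    }

    // Assumption: a batch with at least one message counts as a busy stream.
    Duration next(Duration current, int messagesRead) {
        if (messagesRead > 0) {
            return minInterval; // busy stream: check again as soon as allowed
        }
        // Idle stream: back off exponentially, capped at the maximum.
        Duration doubled = current.multipliedBy(2);
        return doubled.compareTo(maxInterval) > 0 ? maxInterval : doubled;
    }
}
```

With a 1 s minimum and an 8 s maximum, an idle stream would be polled after 2 s, 4 s, 8 s, 8 s, and so on, and would drop back to 1 s as soon as a cycle reads messages.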
  19. Pluggable Log Stream Reader
      • LogFileReader: LogMessage with LogPosition
      • LogMessage: {key: <binary>; timestamp: <timestamp>; message: <binary>}
      • LogPosition: inode + byte offset
  20. Log Message
      Envelope Thrift message passed between Reader and Writer:
      • key (binary): uninterpreted binary used to co-locate messages. Example: a session id, so that all log entries in the session are on the same partition. No serde cost.
      • timestamp: nanosecs
      • message (binary): uninterpreted binary data. Examples: text log line, Thrift message, or file path. No serde cost.
  21. Log Position
      • Caching can give wrong byte offset
      • Implement a generic buffered Java InputStream which tracks byte offsets
      • Restrictions: Reader should not cache or read ahead.
      Fields:
      • LogFile (inode): next log file to read from
      • byteOffset (byte offset from head of file): next byte to read from the file
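One way to read slide 21's "InputStream which tracks byte offsets" is a wrapper that counts the bytes actually handed to the reader, so that buffering underneath it cannot skew the recorded position. The sketch below is an assumption about the idea, not Singer's implementation; the class name `OffsetTrackingInputStream` is hypothetical.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper: the offset advances only for bytes returned to the
// caller, regardless of how much the wrapped stream has buffered internally.
final class OffsetTrackingInputStream extends FilterInputStream {
    private long byteOffset;

    OffsetTrackingInputStream(InputStream in, long startOffset) {
        super(in);
        this.byteOffset = startOffset;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            byteOffset++;
        }
        return b;
    }

    // FilterInputStream.read(byte[]) delegates here, so nothing is counted twice.
    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            byteOffset += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        byteOffset += skipped;
        return skipped;
    }

    // mark/reset would rewind the stream without rewinding the counter.
    @Override
    public boolean markSupported() {
        return false;
    }

    @Override
    public void mark(int readLimit) {
        // intentionally a no-op
    }

    @Override
    public void reset() throws IOException {
        throw new IOException("mark/reset not supported");
    }

    long getByteOffset() {
        return byteOffset;
    }
}
```

Wrapping a `java.io.BufferedInputStream` in this class keeps the buffering below the counter, so a position committed after a batch points at the next unread byte rather than at the end of the buffer.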
  22. Log Rotation
      [Diagram: files log, log.1 … log.7 shown before and after rotation; inode labels 10 to 17 before, 10 to 16 and 18 after]
      1. Use the inode to identify a log file.
      2. Check the inode<->filename mapping when opening a file by name.
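The two rules on slide 22 can be illustrated with the JDK's file attribute API. This is a sketch under assumptions, not Singer's code: the class `Inodes` and its methods are hypothetical names, and the `"unix:ino"` attribute is only available on Unix-like platforms (the lookup throws `UnsupportedOperationException` elsewhere).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for identifying log files by inode across rotations.
final class Inodes {
    private Inodes() {}

    // Reads the inode number via the JDK's "unix" attribute view.
    static long inodeOf(Path file) throws IOException {
        return ((Number) Files.getAttribute(file, "unix:ino")).longValue();
    }

    // True if the file currently found under this name is still the file
    // that a saved position (identified by inode) refers to.
    static boolean refersTo(Path file, long expectedInode) throws IOException {
        return Files.exists(file) && inodeOf(file) == expectedInode;
    }
}
```

After a rename-based rotation, `refersTo(log, savedInode)` would return false for the freshly created `log`, while the rotated file (for example `log.1`) keeps the saved inode, which is why the name alone is not enough to resume reading.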
  23. Duplicate inodes
      [Diagram: files log, log.1 … log.7 shown before and after rotation; inode labels 10 to 17 before, 10 to 16 and 18 after]
      Skip the cycle to wait for log rotation.
  24. Log File Reader Caveats
      • Corrupted block
      • Partial LogMessage
      • Log File Reader kept open between processing cycles to avoid file-opening cost
  25. Pluggable Log Stream Writer
      • Writer interprets LogMessage
      • Examples:
        • Log archiver interprets the message as a file path
        • Kafka writer creates a Kafka message without deserializing the content in the envelope.
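To make the envelope and pluggable-writer idea from slides 20 and 25 concrete, here is a small Java sketch. The field names follow the slides, but everything else is an assumption for illustration: the interface signature, the `PartitioningWriter` class, and the use of in-memory lists in place of a real Kafka producer are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Envelope fields as named on the slides; the Java shape is an assumption.
final class LogMessage {
    final byte[] key;          // uninterpreted, used only for co-location
    final long timestampNanos;
    final byte[] message;      // uninterpreted payload

    LogMessage(byte[] key, long timestampNanos, byte[] message) {
        this.key = key;
        this.timestampNanos = timestampNanos;
        this.message = message;
    }
}

// Hypothetical writer interface; Singer's real signature may differ.
interface LogStreamWriter {
    void write(List<LogMessage> batch);
}

// Stand-in for a Kafka-style writer: in-memory partitions, no real producer.
final class PartitioningWriter implements LogStreamWriter {
    final List<List<byte[]>> partitions = new ArrayList<>();

    PartitioningWriter(int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    // Equal keys always map to the same partition.
    static int partitionFor(byte[] key, int partitionCount) {
        return Math.floorMod(Arrays.hashCode(key), partitionCount);
    }

    @Override
    public void write(List<LogMessage> batch) {
        for (LogMessage m : batch) {
            // The payload is forwarded as-is; it is never deserialized here.
            partitions.get(partitionFor(m.key, partitions.size())).add(m.message);
        }
    }
}
```

Because the writer only hashes the key and forwards the payload bytes untouched, two entries sharing a session id as key land in the same partition without any serialization work in the agent.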
  26. Log Configuration [diagram labels: Puppet master; puppet agent; Watcher; Restart Singer on change]
  27. Singer Deployment
      • Debian package: part of base image?
      • Dynamic configuration update through Puppet
      • Resource footprint enforced
      • Rich stats exported through Ostrich to OpenTSDB
  28. Alternatives
      • Scribe
      • Logstash
      • …
  29. What’s next?
      • Resilient file format so that we can skip corrupted blocks
      • Pluggable log processing policy