[@IndeedEng] Logrepo: Enabling Data-Driven Decisions

Video available at: http://youtu.be/y0WC1cxLsfo

At Indeed our applications generate billions of log events each month across our seven data centers worldwide. These events store user and test data that form the foundation for decision making at Indeed. We built a distributed event logging system, called Logrepo, to record, aggregate, and access these logs. In this talk, we'll examine the architecture of Logrepo and how it evolved to scale.

Jeff Chien joined Indeed as a software engineer in 2008. He has worked on the job search frontend and backend, advertiser, company data, and apply teams, and enjoys building scalable applications.

Jason Koppe is a Systems Administrator who has been with Indeed since late 2008. He's worked on infrastructure automation, monitoring, application resiliency, incident response and capacity planning.

[@IndeedEng] Logrepo: Enabling Data-Driven Decisions

  1. 1. go.indeed.com/IndeedEngTalks
  2. 2. Logrepo Enabling Data-Driven Decisions
  3. 3. Jeff Chien Software Engineer Indeed Apply Team
  4. 4. Scale More job searches worldwide than any other employment website. ● Over 100 million unique users ● Over 3 billion searches per month ● Over 24 million jobs ● Over 50 countries ● Over 28 languages
  5. 5. I help people get jobs.
  6. 6. Job seeker flow using Indeed Apply 1. Search 2. View job 3. Click “Apply Now” 4. Submit application
  7. 7. Knowing how users interact with our system helps us make better products
  8. 8. Likelihood of applying to a job Have to upload a resume Have Indeed Resume
  9. 9. We Have Questions ● What percentage of applications use Indeed resumes? ● How many searches for “java” in “Austin”? ● How often are resumes edited? ● How long does it take to aggregate jobs?
  10. 10. Complicated Questions How many applications … to jobs from CareerBuilder … by job seekers who searched for “java” in “Austin” … used an Indeed resume? Is the percentage different on mobile compared to web? How much has this changed in 2011 compared to 2014?
  11. 11. More Information Better Decisions
  12. 12. More information Need to log events ● job searches ● clicks ● applies
  13. 13. What to log Client information - unique user identifier, user agent, ip address… User behavior - clicks, alert signups… Performance - backend request duration, memory usage... A/B test groups - control and test groups
  14. 14. Better decisions Use empirical data to make decisions Not based on assumptions or the highest-paid person’s opinion!
  15. 15. Objective Collect data on user actions and system performance from many different applications in multiple data centers
  16. 16. How we build systems Simple Fast Resilient Scalable
  17. 17. Simple Easy interface Reuse familiar technologies
  18. 18. Fast No impact to runtime performance Data available soon
  19. 19. Resilient Does not lose data in spite of system or network failures
  20. 20. Scalable Can handle large quantities of data
  21. 21. Requirements Powerful enough to express diverse data
  22. 22. Requirements Powerful enough to express diverse data Store all data forever
  23. 23. Requirements Powerful enough to express diverse data Store all data forever Events stored at least once
  24. 24. Requirements Powerful enough to express diverse data Store all data forever Events stored at least once Easy to add new data to logs
  25. 25. Requirements Powerful enough to express diverse data Store all data forever Events stored at least once Easy to add new data to logs Easy to access logs in bulk
  26. 26. Requirements Powerful enough to express diverse data Store all data forever Events stored at least once Easy to add new data to logs Easy to access logs in bulk Time range based access
  27. 27. Non-Goals Random access to individual events Real time access to events Complex data types
  28. 28. Logrepo A distributed event logging system Est. 2006
  29. 29. Logrepo stores log entries Everything is a string Key/value pairs URL-encoded
  30. 30. Organic click log entry uid=18dtbolr20nk23qh&type=orgClk&v=0&tk=18dtbnn3p0nk20g9&jobId=500&onclick=1&avgCmpRtg=2.9&url=http%3A%2F%2Fwww.indeed.com%2Frc%2Fclk&href=http%3A%2F%2Fwww.indeed.com%2Fjobs%3Fq%3D%26l%3DNewburgh%252C%2BNY%26start%3D20&agent=Mozilla%2F5.0+%28Windows+NT+6.1%3B+WOW64%3B+rv%3A26.0%29+Gecko%2F20100101+Firefox%2F26.0&raddr=173.50.255.255&ckcnt=17&cksz=1033&ctk=18dtbc6960nk20vd&ctkRcv=1&&
  31. 31. URL-decoded organic click log entry uid=18dtbolr20nk23qh& type=orgClk& v=0& tk=18dtbnn3p0nk20g9& jobId=500& onclick=1& avgCmpRtg=2.9& url=http://www.indeed.com/rc/clk& href=http://www.indeed.com/jobs?q=&l=Newburgh%2C+NY&start=20& agent=Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0& ...
  32. 32. URL-decoded organic click log entry uid=18dtbolr20nk23qh& type=orgClk& v=0& tk=18dtbnn3p0nk20g9& jobId=500& onclick=1& avgCmpRtg=2.9& url=http://www.indeed.com/rc/clk& href=http://www.indeed.com/jobs?q=&l=Newburgh%2C+NY&start=20& agent=Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0& ...
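
A log entry like the one above can be decoded with a few lines of Java: split on "&" and URL-decode each key=value pair. A minimal sketch (the class and method names here are ours, not part of the Logrepo client library):

    import java.io.UnsupportedEncodingException;
    import java.net.URLDecoder;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class LogEntryDecoder {
        // Splits a raw logrepo line on '&' and URL-decodes each key=value pair,
        // preserving the original key order (uid always comes first).
        public static Map<String, String> decode(String rawEntry) throws UnsupportedEncodingException {
            Map<String, String> fields = new LinkedHashMap<String, String>();
            for (String pair : rawEntry.split("&")) {
                if (pair.isEmpty()) {
                    continue; // the trailing "&&" produces empty tokens
                }
                int eq = pair.indexOf('=');
                String key = eq < 0 ? pair : pair.substring(0, eq);
                String value = eq < 0 ? "" : URLDecoder.decode(pair.substring(eq + 1), "UTF-8");
                fields.put(key, value);
            }
            return fields;
        }

        public static void main(String[] args) throws UnsupportedEncodingException {
            String raw = "uid=18dtbolr20nk23qh&type=orgClk&v=0&tk=18dtbnn3p0nk20g9&jobId=500"
                    + "&url=http%3A%2F%2Fwww.indeed.com%2Frc%2Fclk&&";
            decode(raw).forEach((key, value) -> System.out.println(key + "=" + value));
        }
    }
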
  33. 33. Advantages Human-readable
  34. 34. Advantages Human-readable Arbitrary keys
  35. 35. Advantages Human-readable Arbitrary keys Low overhead to add new key/value pairs
  36. 36. Advantages Human-readable Arbitrary keys Low overhead to add new key/value pairs Self-describing
  37. 37. Advantages Human-readable Arbitrary keys Low overhead to add new key/value pairs Self-describing Easy to parse in any language
  38. 38. Required log entry keys Every log entry has uid and type Type is an arbitrary string uid=18dtbolr20nk23qh&type=orgClk&...
  39. 39. UID format uid=18ducm8u50nk23qh&type=jobsearch&... UID is always the first key Unique 16 characters Base 32 [0-9a-v]
  40. 40. UID breakdown uid=18ducm8u50nk23qh Date = 2014-01-10 Time = 09:35:24.357 Server id = 1512 App instance id = 2 UID Version = 0 Random value = 3921
  41. 41. UID generation Unique IDs are unique Random value avoids UID collisions Random value is between 0 and 8191 Up to 8000 events per application instance per millisecond
  42. 42. UID format benefits Contains useful metadata Compact format reduces memory requirements Easy to compare or sort events by time
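
The slides don't spell out the exact UID bit layout, but the examples in this talk decode cleanly if the leading nine base-32 characters are read as milliseconds since the Unix epoch (for instance, 18dqpc3s5 from the latency slide below decodes to 1389247205253). A sketch under that assumption, with names of our own choosing:

    public class UidTimestamps {
        private static final String BASE32_ALPHABET = "0123456789abcdefghijklmnopqrstuv";

        // Reads the leading nine base-32 characters of a UID as milliseconds
        // since the Unix epoch; the layout is inferred from the talk's examples.
        public static long timestampMillis(String uid) {
            long millis = 0;
            for (int i = 0; i < 9; i++) {
                millis = millis * 32 + BASE32_ALPHABET.indexOf(uid.charAt(i));
            }
            return millis;
        }

        public static void main(String[] args) {
            // Prints 1389368124357, i.e. 2014-01-10 09:35:24.357 US Central time,
            // matching the UID breakdown on the slide above.
            System.out.println(timestampMillis("18ducm8u50nk23qh"));
        }
    }
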
  43. 43. Job seeker events 1. Search for jobs 2. Click on job 3. Apply to job All events are part of the same flow
  44. 44. Parent-child relationships between events Events can reference other events with &tk=18ducm8u50nk23qh... Children know their parents Parents don’t know their children Extremely powerful model
  45. 45. Parent-child relationships between events An organic click points to the search it occurred on uid=18dtbnn3p0nk20g9&type=jobsearch&v=0&... uid=18dtbolr20nk23qh&type=orgClk&v=0 &tk=18dtbnn3p0nk20g9&...
  46. 46. More jobsearch child events Sponsored job clicks Javascript errors Job alert signups And many more...
  47. 47. Job seeker views a job job view 18en3o3ov16r25rp load IndeedApply user submission post to employer uid=18en3o3ov16r25rp&type=viewjob&...
  48. 48. Indeed Apply loads job view 18en3o3ov16r25rp load IndeedApply 18en3o3s216ph6d5 user submission post to employer uid=18en3o3s216ph6d5&type=loadJs &vjtk=18en3o3ov16r25rp&...
  49. 49. Prepare job application job view 18en3o3ov16r25rp load IndeedApply 18en3o3s216ph6d5 user submission 18en3qe0u16pi5ct post to employer uid=18en3qe0u16pi5ct&type=appSubmit &loadJsTk=18en3o3s216ph6d5&...
  50. 50. Submit job application job view 18en3o3ov16r25rp load IndeedApply 18en3o3s216ph6d5 user submission 18en3qe0u16pi5ct post to employer 18en3qe2r0nji3h6 uid=18en3qe2r0nji3h6&type=postApp &appSubmitTk=18en3qe0u16pi5ct&... POST /apply HTTPS/1.1 Host: employer.com { "applicant": { "name": "John Doe", "email": "jobseeker@gmail.com", "phone": "555-555-5555" }, "jobTitle": "Software Engineer" ...
  51. 51. Javascript latency ping At the start of page load, the browser executes js to ping Indeed. The server receives the ping and logs an event
  52. 52. Parent job search and child js latency ping uid=18dqpc3lm16pi2an&type=jobsearch&... uid=18dqpc3s516pi566&type=lat&tk=18dqpc 3lm16pi2an
  53. 53. Subtracting UID timestamps yields duration uid=18dqpc3s516pi566&type=lat&tk=18dqpc3lm16pi2an uid timestamp Jan 9, 2014 00:00:05.253 tk timestamp Jan 9, 2014 00:00:05.046 Latency = 1389247205253 - 1389247205046 = 207 ms Approximates perceived latency to jobseeker
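
With the timestampMillis helper from the UID sketch earlier in this transcript, the slide's calculation is:

    // Hypothetical helper from the earlier UID sketch, not the Logrepo API.
    long latencyMs = UidTimestamps.timestampMillis("18dqpc3s516pi566")   // latency ping event
                   - UidTimestamps.timestampMillis("18dqpc3lm16pi2an");  // parent jobsearch event
    // 1389247205253 - 1389247205046 = 207
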
  54. 54. West coast perceived latency in California vs. Washington
  55. 55. Writing log entries from apps LogEntry entry = factory.createLogEntry("search"); entry.setProperty("q", query); entry.setProperty("acctId", accountId); entry.setProperty("time", elapsedMillis); // ... entry.commit();
  56. 56. Creating a log entry LogEntry entry = factory.createLogEntry("search"); Creates a log entry with UID and type set UID timestamp tied to createLogEntry() call
  57. 57. Populating a log entry entry.setProperty("q", query); entry.setProperty("acctId", accountId); entry.setProperty("time", elapsedMillis); // ...
  58. 58. Lists Separate values with commas String groups = "foo,bar,baz"; logEntry.setProperty("grps", groups); // uid=...&grps=foo%2Cbar%2Cbaz&...
  59. 59. Lists of Tuples Encapsulate each tuple in parenthesis Comma-separate elements within tuple // Two jobs with (job id, score) String jobs = "(123,1.0)(400,0.8)"; logEntry.setProperty("jobs", jobs); // uid=...&jobs=%28123%2C1.0%29%28400%2C0.8%29&...
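
A small helper for building these values before handing them to setProperty might look like the sketch below (the helper names are ours; the Logrepo client only ever sees the final strings):

    public class LogValueFormats {
        // "foo", "bar", "baz" -> "foo,bar,baz"
        public static String list(String... values) {
            return String.join(",", values);
        }

        // jobIds {123, 400} with scores {1.0, 0.8} -> "(123,1.0)(400,0.8)"
        public static String jobScoreTuples(long[] jobIds, double[] scores) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < jobIds.length; i++) {
                sb.append('(').append(jobIds[i]).append(',').append(scores[i]).append(')');
            }
            return sb.toString();
        }
    }

With it, the earlier calls become logEntry.setProperty("grps", LogValueFormats.list("foo", "bar", "baz")) and logEntry.setProperty("jobs", LogValueFormats.jobScoreTuples(new long[]{123, 400}, new double[]{1.0, 0.8})).
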
  60. 60. Committing a log entry After log entry is fully populated... entry.commit();
  61. 61. Jason Koppe System Administrator
  62. 62. I engineer systems that help people get jobs.
  63. 63. Before logrepo
  64. 64. Before logrepo
  65. 65. log4j - Java logging framework ● Code - what ● Configuration - define what goes to where ● Appender - where (file, smtp) http://logging.apache.org/log4j/1.2/
  66. 66. Before logrepo
  67. 67. Reusing log4j for logrepo
  68. 68. Redundancy from the start Write to local disk (FileAppender) Write to remote server #1 (? Appender) Write to remote server #2 (? Appender)
  69. 69. Writing to a remote server syslog Protocol for transporting messages across an IP network Est. 1980s http://tools.ietf.org/html/rfc5424
  70. 70. Using log4j with syslog Out-of-the-box, log4j only supported UDP syslog UDP could result in data loss
  71. 71. Avoiding data loss TCP guarantees data transfer Use TCP!
  72. 72. Creating a reliable Appender SyslogTcpAppender ● created by Indeed ● TCP-enabled log4j syslog Appender ● buffers messages before transport Resilient for short network and syslog server downtimes
  73. 73. Choosing a syslog daemon syslog-ng syslog daemon which supports TCP Est. 1998 http://www.balabit.com/network-security/syslog-ng
  74. 74. Redundancy with log4j Write to local disk (FileAppender) Write to remote server #1 (SyslogTcpAppender) Write to remote server #2 (SyslogTcpAppender)
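
SyslogTcpAppender itself isn't shown in the talk, but its described behavior (write each entry over a persistent TCP connection, buffer in memory while the connection is down, replay once it returns) can be sketched as a log4j 1.x appender. Everything below, from the class name to the host names, port, and file path, is illustrative rather than Indeed's implementation:

    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayDeque;
    import java.util.Deque;

    import org.apache.log4j.AppenderSkeleton;
    import org.apache.log4j.FileAppender;
    import org.apache.log4j.Logger;
    import org.apache.log4j.PatternLayout;
    import org.apache.log4j.spi.LoggingEvent;

    // Sketch of a buffering TCP appender in the spirit of SyslogTcpAppender.
    public class TcpSyslogAppenderSketch extends AppenderSkeleton {
        private final String host;
        private final int port;
        private final Deque<String> buffer = new ArrayDeque<String>();
        private Writer writer;

        public TcpSyslogAppenderSketch(String host, int port) {
            this.host = host;
            this.port = port;
        }

        @Override
        protected void append(LoggingEvent event) {
            // Queue the rendered entry, then try to drain the queue over TCP.
            buffer.addLast(event.getRenderedMessage() + "\n");
            drain();
        }

        private void drain() {
            try {
                if (writer == null) {
                    writer = new OutputStreamWriter(
                            new Socket(host, port).getOutputStream(), StandardCharsets.UTF_8);
                }
                while (!buffer.isEmpty()) {
                    writer.write(buffer.peekFirst());
                    buffer.removeFirst();
                }
                writer.flush();
            } catch (IOException e) {
                // Connection lost: keep the queued entries and retry on the next
                // append, which is what rides out short network or server outages.
                writer = null;
            }
        }

        @Override
        public void close() {
            drain();
        }

        @Override
        public boolean requiresLayout() {
            return false;
        }

        // The redundant wiring from the slide: local file plus two remote servers.
        public static void main(String[] args) throws IOException {
            Logger logrepo = Logger.getLogger("logrepo");
            logrepo.setAdditivity(false);
            logrepo.addAppender(new FileAppender(new PatternLayout("%m%n"), "logrepo-raw.log"));
            logrepo.addAppender(new TcpSyslogAppenderSketch("logrepo1.example.com", 514));
            logrepo.addAppender(new TcpSyslogAppenderSketch("logrepo2.example.com", 514));
            logrepo.info("uid=18dtbolr20nk23qh&type=orgClk&v=0&...");
        }
    }
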
  75. 75. Redundancy over TCP
  76. 76. Each syslog-ng server receives unsorted log entries and immediately flushes them to files on disk called raw logs
  77. 77. Quick redundancy over TCP
  78. 78. Optimized for redundancy raw logs are probably out-of-order each app writes to syslog independently
  79. 79. Optimize for read access patterns LogRepositoryBuilder (“Builder”) ● sort ● deduplicate ● compress
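
In spirit, the Builder's job for each batch of raw log lines is: sort by UID (lexicographic order is time order, because the UID leads with the timestamp in a fixed-width, ASCII-ordered base-32 alphabet), drop the duplicates produced by the redundant syslog writes, and gzip the result into a segment file. A simplified sketch that ignores the per-type and per-UID-prefix partitioning covered on the following slides:

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.List;
    import java.util.TreeSet;
    import java.util.zip.GZIPOutputStream;

    public class BuilderSketch {
        // Sorts raw entries (whole-line comparison starts with the fixed-width UID,
        // so this is time order), removes exact duplicates, and writes one
        // gzip-compressed segment file.
        public static void buildSegment(List<String> rawEntries, Path segmentFile) throws IOException {
            TreeSet<String> sortedUnique = new TreeSet<String>(rawEntries);
            try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                    new GZIPOutputStream(Files.newOutputStream(segmentFile)), StandardCharsets.UTF_8))) {
                for (String entry : sortedUnique) {
                    out.write(entry);
                    out.newLine();
                }
            }
        }

        public static void main(String[] args) throws IOException {
            List<String> raw = Arrays.asList(
                    "uid=15mt000010k7&type=orgClk&v=1&k=3",
                    "uid=15mt000000k1&type=orgClk&v=1&k=4",
                    "uid=15mt000000k1&type=orgClk&v=1&k=4"); // duplicate from the redundant path
            buildSegment(raw, Paths.get("0.log4181.seg.gz"));
        }
    }
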
  80. 80. Builder architecture
  81. 81. Builder architecture
  82. 82. Builder architecture
  83. 83. Builder architecture
  84. 84. Builder creates segment files uid=15mt000000k1&type=orgClk&v=1&k=4... uid=15mt000010k7&type=orgClk&v=1&k=3... uid=15mt000020k8&type=orgClk&v=1&k=2... uid=15mt000030ss&type=orgClk&v=1&k=9...
  85. 85. Repeated strings compress well uid=15mt000000k1&type=orgClk&v=1&k=4... uid=15mt000010k7&type=orgClk&v=1&k=3... uid=15mt000020k8&type=orgClk&v=1&k=2... uid=15mt000030ss&type=orgClk&v=1&k=9... compresses by 85%
  86. 86. Archive directory structure /orgClk/15mt/0.log4181.seg.gz logentry type
  87. 87. Archive directory structure /orgClk/15mt/0.log4181.seg.gz 4-char UID prefix, base 32
  88. 88. Archive directory structure /orgClk/15mt/0.log4181.seg.gz 4-char UID prefix, base 32 ~9.3 hour time period
  89. 89. Archive directory structure /orgClk/15mt/0.log4181.seg.gz 5-char UID prefix, base 32
  90. 90. Archive directory structure /orgClk/15mt/0.log4181.seg.gz 5-char UID prefix, base 32 ~17 minute time period
  91. 91. Archive directory structure /orgClk/15mt/0.log4181.seg.gz unique number
  92. 92. Archive directory structure /orgClk/15mt/0.log4181.seg.gz unique number Supports more than 1 segment file per type per 5-char UID prefix
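
Putting the pieces of the path together, a segment's location follows from the entry type, the UID, and the Builder-assigned segment number. A tiny sketch (the helper name is ours):

    public class ArchivePaths {
        // /<type>/<first 4 UID chars>/<5th UID char>.log<segment number>.seg.gz
        public static String segmentPath(String type, String uid, int segmentNumber) {
            return "/" + type + "/" + uid.substring(0, 4) + "/"
                    + uid.charAt(4) + ".log" + segmentNumber + ".seg.gz";
        }

        public static void main(String[] args) {
            // Prints /orgClk/15mt/0.log4181.seg.gz
            System.out.println(segmentPath("orgClk", "15mt000000k1", 4181));
        }
    }
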
  93. 93. Multiple segment files Keep Builder memory usage fixed When Builder memory fills, it flushes to disk Each flush creates files for 5-char UID prefix
  94. 94. Multiple segment files Keep Builder memory usage fixed When Builder memory fills, it flushes to disk Each flush creates files for 5-char UID prefix
  95. 95. Multiple segment files Keep Builder memory usage fixed When Builder memory fills, it flushes to disk Each flush creates files for 5-char UID prefix
  96. 96. Builder creates the archive
  97. 97. Redundancy
  98. 98. Redundancy
  99. 99. Ensure archive consistency ● Delayed Builder on second server ● Add new segment files for log entries missed by first Builder ● Causes multiple segment files for a 5-char UID prefix
  100. 100. Providing access to logrepo LogRepositoryReader (“Reader”) ● simple request protocol ● reads from (multiple) segment files ● provides sorted stream of entries to TCP client as quickly as possible
  101. 101. Reader request protocol 1. Start time 2. End time 3. Logrepo type
  102. 102. Reader request using netcat start time (ms since 1970-01-01, the start of Unix time) $ echo 1295905740000 1295913600000 orgClk
  103. 103. Reader request using netcat end time (ms since 1970-01-01) $ echo 1295905740000 1295913600000 orgClk
  104. 104. Reader request using netcat logrepo type $ echo 1295905740000 1295913600000 orgClk
  105. 105. Reader request using netcat send echo across a TCP session $ echo 1295905740000 1295913600000 orgClk | nc 192.168.0.1 9999
  106. 106. Reader request using netcat UID-sorted results $ echo 1295905740000 1295913600000 orgClk | nc 192.168.0.1 9999 uid=15mt00l710k3262q&type=orgClk&v=0&... uid=15mt00l780k137d9&type=orgClk&v=0&... ... uid=15mt7ggvj142h06k&type=orgClk&v=0&...
  107. 107. Reading entries from archive 1295905740000 1295913600000 orgClk 1. Isolate to the type directory
  108. 108. Reading entries from archive 1295905740000 1295913600000 orgClk 2. Convert request timestamps to UID prefix uidPrefixFromTime(1295905740000) = 15mt0 uidPrefixFromTime(1295913600000) = 15mt7
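
uidPrefixFromTime follows directly from the UID layout: write the millisecond timestamp as the nine leading base-32 characters of a UID and keep the first five. A sketch (the Reader's actual code isn't shown in the talk) that reproduces the prefixes on this slide:

    public class UidPrefixes {
        private static final String BASE32_ALPHABET = "0123456789abcdefghijklmnopqrstuv";

        // Encodes a millisecond timestamp as the nine leading base-32 UID
        // characters and keeps the first `length` of them.
        public static String uidPrefixFromTime(long millis, int length) {
            char[] chars = new char[9];
            for (int i = 8; i >= 0; i--) {
                chars[i] = BASE32_ALPHABET.charAt((int) (millis & 31));
                millis >>= 5;
            }
            return new String(chars, 0, length);
        }

        public static void main(String[] args) {
            System.out.println(uidPrefixFromTime(1295905740000L, 5)); // 15mt0
            System.out.println(uidPrefixFromTime(1295913600000L, 5)); // 15mt7
        }
    }
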
  109. 109. Reading entries from archive 1295905740000 1295913600000 orgClk 15mt0 3. Find segments matching first UID prefix ls orgClk/15mt/0* orgClk/15mt/0.log3094.seg.gz orgClk/15mt/0.log4181.seg.gz
  110. 110. Reading entries from archive 1295905740000 1295913600000 orgClk 4. Read sorted segments simultaneously, merge into a single sorted stream /orgClk/15mt/0.log3094.seg.gz: uid=15mt000080g1i0j5&type=orgClk&... uid=15mt00l780k137d9&type=orgClk&... /orgClk/15mt/0.log4181.seg.gz: uid=15mt00l710k3262q&type=orgClk&... uid=15mt00l790k1i2rs&type=orgClk&...
  111. 111. Reading entries from archive 1295905740000 1295913600000 orgClk 4. Read sorted segments simultaneously, merge into a single sorted stream /orgClk/15mt/0.log3094.seg.gz: 1 uid=15mt000080g1i0j5&type=orgClk&... 3 uid=15mt00l780k137d9&type=orgClk&... /orgClk/15mt/0.log4181.seg.gz: 2 uid=15mt00l710k3262q&type=orgClk&... 4 uid=15mt00l790k1i2rs&type=orgClk&...
  112. 112. Reading entries from archive 1295905740000 1295913600000 orgClk 4. Read sorted segments simultaneously, merge into a single sorted stream 1 uid=15mt000080g1i0j5&type=orgClk&... 2 uid=15mt00l710k3262q&type=orgClk&... 3 uid=15mt00l780k137d9&type=orgClk&... 4 uid=15mt00l790k1i2rs&type=orgClk&...
  113. 113. Reading entries from archive 1295905740000 1295913600000 orgClk 5. Only return log entries between timestamps 1 uid=15mt000080g1i0j5&type=orgClk&... 2 uid=15mt00l710k3262q&type=orgClk&... 3 uid=15mt00l780k137d9&type=orgClk&... 4 uid=15mt00l790k1i2rs&type=orgClk&...
  114. 114. Reading entries from archive 1295905740000 1295913600000 orgClk 15mt0 15mt7 15mt1 15mt2 15mt3 15mt4 15mt5 15mt6 6. Read segments for each UID prefix, one prefix at a time
  115. 115. Reading entries from archive 1295905740000 1295913600000 orgClk 7. Stop reading files when entry crosses request boundary
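
Step 4 is a standard k-way merge of streams that are already UID-sorted. A sketch using a priority queue keyed on each segment's current line (wrap each segment's GZIPInputStream in a BufferedReader to get these readers; the real Reader also applies the time bounds from steps 5 and 7, which are omitted here):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.util.List;
    import java.util.PriorityQueue;

    public class SegmentMergeSketch {
        private static final class Cursor {
            String line;
            final BufferedReader reader;
            Cursor(BufferedReader reader) throws IOException {
                this.reader = reader;
                this.line = reader.readLine();
            }
        }

        // Merges UID-sorted segment streams into one UID-sorted stream.
        // Whole-line comparison works because every entry starts with the
        // fixed-width uid key.
        public static void merge(List<BufferedReader> segments) throws IOException {
            PriorityQueue<Cursor> heap =
                    new PriorityQueue<Cursor>((a, b) -> a.line.compareTo(b.line));
            for (BufferedReader segment : segments) {
                Cursor cursor = new Cursor(segment);
                if (cursor.line != null) {
                    heap.add(cursor);
                }
            }
            while (!heap.isEmpty()) {
                Cursor next = heap.poll();
                System.out.println(next.line);       // emit the smallest current entry
                next.line = next.reader.readLine();  // advance that segment
                if (next.line != null) {
                    heap.add(next);
                }
            }
        }
    }
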
  116. 116. The first years (2007 & 2008) ● Single datacenter ● App servers ● 2 logrepo servers ● syslog-ng ● Builder ● Reader
  117. 117. Growth job seekers
  118. 118. Growth products job seekers
  119. 119. Growth products datacenters job seekers
  120. 120. Growth log entries
  121. 121. Multi-datacenter rationale Latency Redundancy
  122. 122. Multi-datacenter rationale Job seekers
  123. 123. Logrepo in multiple datacenters ● Single datacenter ● Consumers ● Reader ● Every datacenter ● Applications producing logentries ● 2 syslog servers ● Builders (minimize Internet traffic)
  124. 124. Single datacenter archival /dc1/orgClk/15mt/0.log4181.seg.gz random number 25-bit timestamp prefix, base 32 ~17-minute time period event type (orgClk means organic search result click)
  125. 125. Multiple datacenter archival /dc1/orgClk/15mt/0.log4181.seg.gz random number 25-bit timestamp prefix, base 32 ~17-minute time period event type (orgClk means organic search result click) datacenter
  126. 126. Datacenter dirs avoid collisions ~$ ls */orgClk/15mt/0* dc1/orgClk/15mt/0.log1481.seg.gz dc3/orgClk/15mt/0.log1481.seg.gz Different datacenters
  127. 127. Datacenter dirs avoid collisions ~$ ls */orgClk/15mt/0* dc1/orgClk/15mt/0.log1481.seg.gz dc3/orgClk/15mt/0.log1481.seg.gz Same segment filename Independent Builders
  128. 128. UID breakdown uid=18ducm8u50nk23qh Date = 2014-01-10 Time = 09:35:24.357 Server id = 1512 App instance id = 2 UID Version = 0 Random value = 3921
  129. 129. UID breakdown uid=18ducm8u50nk23qh Date = 2014-01-10 Time = 09:35:24.357 Server id = 1512 App instance id = 2 UID Version = 0 Random value = 3921
  130. 130. Using server ID for uniqueness Each datacenter gets 256 server IDs 1. DC #1 uses 0 - 255 2. DC #2 uses 256 - 511 3. DC #3 uses 512 - 767 4. ...
  131. 131. The next years (2009 - 2011) ● Multiple datacenters ● 2 logrepo servers ● syslog-ng ● Builder ● Consumer datacenter ● Reader ● Consumers
  132. 132. More logentries More consumers
  133. 133. Diverse requests
  134. 134. Single server disk bottleneck
  135. 135. Scaling logrepo reads Bottleneck: single active Reader server Goal: spread logrepo accesses across a cluster of servers
  136. 136. Read logrepo from HDFS Hadoop Distributed File System (HDFS) “a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.” http://hadoop.apache.org/docs/stable1/hdfs_design.html
  137. 137. Using HDFS for logrepo access
  138. 138. Using HDFS for logrepo access
  139. 139. Using HDFS for logrepo access
  140. 140. Resilient logrepo in HDFS Store each logentry on 3 servers
  141. 141. Push to HDFS quickly Mirror every segment file into HDFS
  142. 142. Push to HDFS quickly /dc1/orgClk/15mt/0.log4181.seg.gz 5-char UID prefix, base 32 ~17-minute time period 500,000+ files per day
  143. 143. HDFS optimized for fewer files Reducing the number of logrepo files in HDFS keeps us efficient
  144. 144. HDFS optimized for fewer files Reducing the number of logrepo files in HDFS keeps us efficient HDFSArchiver
  145. 145. Archive yesterday in HDFS /dc1/orgClk/15mt/0.log4181.seg.gz type 20-bit timestamp prefix ~9.3 hour period 2,500 files per day
  146. 146. Scaling logrepo in HDFS 500,000+ files per day 2,500 files per day
  147. 147. Logrepo A distributed event logging system Created @IndeedEng ● Application Open source ● log4j
  148. 148. Logrepo A distributed event logging system Created @IndeedEng ● Application ● SyslogTcpAppender Open source ● log4j
  149. 149. Logrepo A distributed event logging system Created @IndeedEng ● Application ● SyslogTcpAppender Open source ● log4j ● syslog-ng
  150. 150. Logrepo A distributed event logging system Created @IndeedEng ● Application ● SyslogTcpAppender ● Builder Open source ● log4j ● syslog-ng
  151. 151. Logrepo A distributed event logging system Created @IndeedEng ● Application ● SyslogTcpAppender ● Builder Open source ● log4j ● syslog-ng ● gzip
  152. 152. Logrepo A distributed event logging system Created @IndeedEng ● Application ● SyslogTcpAppender ● Builder ● Reader Open source ● log4j ● syslog-ng ● gzip
  153. 153. Logrepo A distributed event logging system Created @IndeedEng ● Application ● SyslogTcpAppender ● Builder ● Reader Open source ● log4j ● syslog-ng ● gzip ● rsync+ssh
  154. 154. Logrepo A distributed event logging system Created @IndeedEng ● Application ● SyslogTcpAppender ● Builder ● Reader Open source ● log4j ● syslog-ng ● gzip ● rsync+ssh ● Hadoop
  155. 155. Logrepo A distributed event logging system Created @IndeedEng ● Application ● SyslogTcpAppender ● Builder ● Reader ● HDFSPusher Open source ● log4j ● syslog-ng ● gzip ● rsync+ssh ● Hadoop
  156. 156. Logrepo A distributed event logging system Created @IndeedEng ● Application ● SyslogTcpAppender ● Builder ● Reader ● HDFSPusher ● HDFSReader Open source ● log4j ● syslog-ng ● gzip ● rsync+ssh ● Hadoop
  157. 157. Logrepo A distributed event logging system Created @IndeedEng ● Application ● SyslogTcpAppender ● Builder ● Reader ● HDFSPusher ● HDFSReader ● HDFSArchiver Open source ● log4j ● syslog-ng ● gzip ● rsync+ssh ● Hadoop
  158. 158. All time logrepo = 150 TB compressed
  159. 159. jobsearch event set abredistime acmetime addltime adsc adsdelay adsi badsc badsi boostojc boostoji bsjc bsjcwia bsji bsjindapplies bsjindappviews bsjrev bsjwia ckcnt cksz counts ctkage ctkagedays dayofweek dcpingtime domTotalTime ds-mpo dsmiss dstime featemp fj freekwac freekwarev freesjc freesjrev frmtime galatdelay iplat iplong jslatdelay jsvdelay kwac kwacdelay kwai kwarev kwcnt lacinsize lacsgsize lmstime mpotime mprtime navTotTime ndxtime ojc ojclong ojcshort ojcwia oji ojindapplies ojindappviews ojwia oocsc page prcvdlatency primfollowcnt prvwoji prvwojlat prvwojopentime prvwojreq radsc radsi recidlookupbudget rectime redirCount redirTime relfollowcnt respTime returnvisit rojc roji rqcnt rqlcnt rqqcnt rrsjc rrsji rrsjrev rsavail rsjc rsji rsused rsviable serpsize sjc sjcdelay sjclong sjcnt sjcshort sjcwia sji sjindapplies sjindappviews sjrev sjwia sllat sllong sqc sqi sugtime svj svjnostar svjstar tadsc tadsi time timeofday totcnt totfollowcnt totrev tottime tsjc tsjcwia tsji tsjindapplies tsjindappviews tsjrev tsjwia unqcnt vp wacinsize wacsgsize
  160. 160. acmepage acmereviewmod acmeservice acmesession adclick adcrequest adcrev adschannel adsclick adsenseclick adve advt agghttp aggjira aggjob aggjob_waldorf aggsherlock aggsourcehealth agstiming api apijsv apisearch archiveindex archiveindex_shingled_test bin carclicks click clickanalytics cobrand dctmismatch draw dupepairs dupepairs_mini dupepairs_old dupepairsall dupepairsall_mini ejchecker emilyops feedbridge globalnav googlebot_organic homepage impression indeedapply jhst jobalert jobalertorganic jobalertsearch jobalertsponsored jobexpiration jobexpiration2 jobexpiration3 jobprocessed jobqueueblock jobsearch jssquery keywordAd locsvc lucyindexermain mechanicalturk mindyops mobhomepage mobil mobile mobileorganic mobilesponsored mobrecjobs mobsearch mobviewjob myindeed myindfunnel myindpage myindrezcreate myindsession old opsesjasx organic orgmodel orgmodelsubset orgmodelsubset90 passportaccount passportpage passportsignin ramsaccess recjobs recommendservice resumedata resumesearch rexcontacts rexfunnel reximpression rexsearch rezSrchSearch rezalert rezalertfunnel rezfunnel rezjserr rezsrchrequest rezview searchablejobs seo session sjmodel sponsored sysadappinfo sysadapptiming testndx testndx1 testndx2 tmp usrsvccache usrsvcrequest viewjob webusersignin
  161. 161. Every day at Indeed ● Create 5 billion log entries ● App spends 0.03 ms to create each log entry ● Add 500 GB to the archive ● Add 1.5 TB to HDFS ● Consumers read from HDFS at 18.5 GB/s ● 100s of consumers request 1000 different logrepo types
  162. 162. Four types of consumers Ad-hoc command line Standard Java programs Hadoop map/reduce Real-time monitoring
  163. 163. Command line access $ echo 1388556000000 1388642400000 jobsearch | nc logrepo 9999 uid=18d6666o916r15g3&type=jobsearch&q=VP+IT uid=18d6666ob0mp27aa&type=jobsearch&q=Lab+Tech uid=18d6666ob0nl15ce&type=jobsearch&q=daycare uid=18d6666og0nk24rb&type=jobsearch&q=Chef+Upscale ...
  164. 164. Slowest searches from log entries Reuses standard unix tools and patterns $ echo 1388556000000 1388642400000 jobsearch | nc logrepo 9999 | egrep -o '&searchTime=[^&]+' | egrep -o '[0-9]+' | sort -r -n | head
  165. 165. Programmatic access is trivial We have clients for ● java ● python ● php ● pig
  166. 166. A typical logrepo consumer (single machine) Reads one primary log event type Reads a dozen child events per primary Total size of each event set = 10KB
  167. 167. A typical logrepo consumer (single machine) Millions of events read per run Thousands of consumers run each day Tens of terabytes processed each day
  168. 168. Efficient Parsing Important for single machine consumers Log entry parsing too slow Fast Minimize memory usage
  169. 169. URL String Parsing (now available on github) 4x faster than String.split(...), generates 50% less garbage Parses 1 million log entries of size 0.5K each in 3 seconds https://github.com/indeedeng http://go.indeed.com/urlparsing
  170. 170. Hadoop clients Reliable, scalable, distributed computing
  171. 171. Hadoop clients Reliable, scalable, distributed computing Most new consumers use Hadoop
  172. 172. Hadoop clients Reliable, scalable, distributed computing Most new consumers use Hadoop Read log entries directly from HDFS
  173. 173. Hadoop clients Reliable, scalable, distributed computing Most new consumers use Hadoop Read log entries directly from HDFS Divide and conquer to scale
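
Reading archived segments straight out of HDFS needs nothing more exotic than the standard Hadoop FileSystem API plus gzip decompression. A hedged sketch with illustrative paths (this is not Indeed's HDFSReader):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSegmentReaderSketch {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            // All dc1 orgClk segments for the 4-char UID prefix 15mt; the
            // /logrepo root is illustrative, not the real layout.
            FileStatus[] segments = fs.globStatus(new Path("/logrepo/dc1/orgClk/15mt/*.seg.gz"));
            if (segments == null) {
                return; // nothing archived under that prefix
            }
            for (FileStatus status : segments) {
                try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                        new GZIPInputStream(fs.open(status.getPath())), StandardCharsets.UTF_8))) {
                    String entry;
                    while ((entry = reader.readLine()) != null) {
                        // One URL-encoded log entry per line, UID-sorted within each file.
                        System.out.println(entry);
                    }
                }
            }
        }
    }
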
  174. 174. Monitoring Want to monitor ● Business metrics ● Operational metrics “Available soon” isn’t good enough
  175. 175. Datadog Third party monitoring service Stream metrics to Datadog HQ Real-time dashboards
  176. 176. Datadog
  177. 177. miniEPL 'jobsearch.organic_clk': "SELECT COUNT(*), 'clicks' AS unit FROM orgClk", 'jobsearch.totTime': "SELECT int(totTime), 'ms' AS unit FROM jobsearch(totTime IS NOT NULL)", 'mobile.mobsearch.oji': "SELECT tupleCount (orgRes), 'results' AS unit FROM mobsearch",
  178. 178. Getting logs into Datadog
  179. 179. Data redundancy Replaying events Click charging
  180. 180. Replaying events 1. Job alert email sign up broke for logged in users
  181. 181. Replaying events 1. Job alert email sign up broke for logged in users 2. Got alert parameters + jobsearch uid from access logs
  182. 182. Replaying events 1. Job alert email sign up broke for logged in users 2. Got alert parameters + jobsearch uid from access logs 3. Got account id from jobsearch log entries
  183. 183. Replaying events 1. Job alert email sign up broke for logged in users 2. Got alert parameters + jobsearch uid from access logs 3. Got account id from jobsearch log entries 4. Recreated job alert sign ups
  184. 184. Click charging 1. Store sponsored click data in database
  185. 185. Click charging 1. Store sponsored click data in database 2. Log sponsored click data to logrepo
  186. 186. Click charging 1. Store sponsored click data in database 2. Log sponsored click data to logrepo 3. Verify logs match database
  187. 187. Click charging 1. Store sponsored click data in database 2. Log sponsored click data to logrepo 3. Verify logs match database 4. Charge for clicks
  188. 188. Click charging 1. Store sponsored click data in database 2. Log sponsored click data to logrepo 3. Verify logs match database 4. Charge for clicks 5. Profit!
  189. 189. What does logrepo enable? Answering business and operational questions Data-driven decisions
  190. 190. Average cover letter length inside US vs. outside US?
  191. 191. Mobile searches per hour in JP vs. UK?
  192. 192. Resume creation by country?
  193. 193. Email alert opens by email domain?
  194. 194. Percent of app downloads from iOS, Android, Windows?
  195. 195. How quickly does a datacenter take on traffic after a failover?
  196. 196. Q&A https://github.com/indeedeng http://go.indeed.com/urlparsing
  197. 197. Next @IndeedEng Talk Big Value from Big Data: Building Decision Trees at Scale Andrew Hudson, Indeed CTO February 26, 2014 http://engineering.indeed.com/talks
