
Everything I Knew About Logging and Search is Wrong - New Relic FutureStack Presentation by Jon Gifford



Forget everything you know about log data… if your understanding isn't out of date, there's a good chance it's being applied wrong. Search paved the path for the Internet revolution and changed how we look at everything. Now search has changed the way you can use your endless log and machine data to proactively address and answer operational issues.

In this presentation, Jon Gifford shares what a decade and a half leading development on some of the world's most complex, distributed and cloud-centric search engines has taught him about solving the daily critical problems of DevOps and Engineering teams. He also explains how looking at massive volumes of log data daily for years has completely reshaped how he views, uses and expects both logging and search technology to function in the years ahead, and covers best practices, insider secrets and predictions about what tomorrow's search development will bring.

Published in: Technology, Education


  1. Everything you know about Logging and Search is Wrong (Or, How I Learned To Stop Worrying And Love The “Logs”)
     Jon Gifford - Loggly FutureStack - October 2013 | Logging as a service
     Introduction
     • I’m Jon Gifford
       • Co-Founder and Chief Search Officer for Loggly
       • 15+ years of search experience
       • Lots of different search products, lots and lots of machines, lots and lots and lots of logs...
       • Bias: Lucene, Java, *nix, Distributed, Realtime, Measure, Measure, Measure
     • Loggly is “Logging As A Service”
       • “Logging” is not necessarily what you think it is
       • “As A Service” means we deal with the complexities, our customers get the good stuff with none of the pain
  2. What You Think You Know
     Search is...
     • Finding stuff fast, in the chaos of real life
       • The Web (WebCrawler, AltaVista, Yahoo, Google, ...)
       • Enterprise (Google, Autonomy, IBM, ...)
       • eCommerce (Amazon, eBay, NewEgg, ...)
       • PC (Spotlight, Explorer, Editor/IDE, grep, ...)
       • Smartphone (Spotlight, QSB, Siri, ...)
       • Everywhere else (DVR, ...)
  3. Logs are...
     • Inconsistent, incomplete, incoherent, intransigent, and mispelled. Good luck!
     • System logs. E.g. OSX
         Sep 2 00:30:14 Whatu newsyslog[26381]: logfile turned over
         Sep 2 00:49:06 Whatu.local WindowServer[86]: CGXSetWindowBackgroundBlurRadius: Invalid window 0xffffffff
         Sep 2 00:49:06 Whatu.local loginwindow[57]: find_shared_window: WID -1
         Sep 2 00:49:06 Whatu.local loginwindow[57]: CGSGetWindowTags: Invalid window 0xffffffff
     • Daemon logs. E.g. mail
         2013-07-30T08:01:01.000+00:00 solrc01 postfix/pickup[7416]: 46F472D320: uid=1021 from=<xyzzy>
         2013-07-30T08:01:01.000+00:00 solrc01 postfix/cleanup[18159]: 46F472D320: message-id=<>
         2013-07-30T08:01:01.000+00:00 solrc01 postfix/qmgr[1421]: 46F472D320: from=<>, size=653, nrcpt=1 (queue active)
     • Application logs. E.g. ZooKeeper
         2013-09-12 21:39:46,356 - INFO [Snapshot Thread:FileTxnSnapLog@254] - Snapshotting: 6100eef911
         2013-09-12 21:39:46,358 - INFO [SyncThread:1:FileTxnLog@199] - Creating new log file: log.6100eef913
         2013-09-12 21:42:22,003 - INFO [CommitProcessor:1:NIOServerCnxn@1435] - Closed socket connection for client / which had sessionid 0x1410b16bf5f0527
         2013-09-12 21:42:22,335 - INFO [NIOServerCxn.Factory:$Factory@251] - Accepted socket connection from /
     My Story - Early Days
  4. DIY @ LookSmart
     • 1997-2004: Custom C/C++, FAST
       • Big, slow batch jobs to build index (weeks!). Big team (30-50 people). Big deployments (100’s of boxes).
       • LookSmart Directory, WWW, FindArticles, Live! (Y! Answers)
     • Heavy focus on relevance, including paid boosting
     • Lots of plumbing, lots of endoscopy
     • Q from Hell: Give me 100k results?
     • Pain & Suffering...
       • WTF is happening?
     WTF? TMA!
     • Fundamental to understanding your system(s)
       • Troubleshooting: What broke? When? Why?
       • Monitoring: What’s happening now?
       • Analysis: What is happening over time?
     • No “one true way”
       • Everyone has their preferred tools, no-one can agree on just one
       • Logs are obvious source, but...
         • Usually for SysAdmins, by SysAdmins / for Devs, by Devs.
         • Massive fragmentation (files all over the place)
     • Flying blind is a good way to crash and burn
  5. What I Know Now (2004)
     • Search is HARD when you DIY
       • But... it is the most fun you can have with your keyboard, but you really have to WANT to be a search company
     • Distributed Systems are complicated beasts
       • Data partitioning and query routing plus caching are incredibly important for performance and reliability (and they are indivisible)
       • Crawling, Indexing and Search clusters each have their own unique failure modes
       • Little things: Lose a box? disk? Good luck figuring out why
     • TMA is not negotiable
       • Without it, you have NO chance of keeping the system alive, let alone improving it. The quality of the data you use is crucial.
     Lucene @ LookSmart
     • 2004: Lucene 1.3
       • Smaller, Faster, Cheaper - Furl
       • Micro-batching - hourly updates, from RDBMS
       • Lucene library integrated into existing system
         • partitioned by customer
         • makes some things simple, but has limits
       • “Real-time” relevance - hacking Lucene norm to avoid “update”
     • TMA: Same pain...
       • Re-used/rewrote existing tools. Still laggy & maintenance heavy. Still using logs as source of data.
       • First inkling of an idea... Can we use Lucene for TMA?
  6. Streaming @ Technorati
     • 2005/6: Spread + Lucene
       • Lots of complex Lucene surgery & wrapping
       • Spread-based data & control bus
       • Distributed filters and facets, time-sharded sub-minute updates, RT monitoring. Nice!
       • Crazy fast counts, replaced RDBMS for Authority
     • Q from Hell: Can I get 1 million results?
     • TMA: Less pain, kinda sorta
       • Spread-based monitoring makes log aggregation “less important” until a box goes AWOL :-(
     Relevance @ Scout
     • 2007-9: Spread + ExternalFileField + Lucene
       • Less Lucene surgery, but time-sharding still requires wrapping
       • Many sources, sub-minute updates, more complex relevancy
       • ExternalFileField incredibly good for relevance, but difficult to co-ordinate with index changes
       • UI emphasized graphs, word clouds, sentiment
       • Visualization of complex data doable, users love it
     • Q from Hell: not so common here
     • TMA: The Devil you know
       • Good enough sometimes is enough
  7. What I Know Now (2009)
     • Q from Hell is INCREDIBLY frustrating
     • Search is getting really really good
       • New respect for boolean queries on fielded data
       • Real numerics mean analytics are improving, fast
       • SolrCloud with NRT is just over the horizon (Oct 2012)
     • Logs are just data
       • But... You’re probably logging the wrong thing, in the wrong format, for the wrong consumer
     • Logging is just transport
       • Syslog / REST log shipping means real-time aggregation
     My Story - Loggly
  8. SolrCloud* @ Loggly
     • 2010-12: 0MQ + fast commit + JSON
     • SolrCloud bones + plugins
       • Time-sharding impacts
         • shard allocation, merging, search
       • Custom 0MQ DataImportHandler
       • JSON means semi-structured data is easy (native)
     • TMA: Everything from ONE data source!
       • syslog + REST means easy, standardized transport feeding Solr in pretty damn close to real time
       • Native numerics + Facets + Histograms are AWESOME!
     ElasticSearch @ Loggly
     • 2013: Kafka + NRT + parsing + JSON
       • Massive performance improvements (Lucene 4)
       • Added parsing to enrich incoming data
     • Focus on Product, not Plumbing
       • Simplified cluster/index/shard management
       • Easy to use search UI with full blown Lucene under the hood
       • Point and click Analytics on any field
       • NRT Alerting to wake you up at night
     • TMA: Holy Grail
       • Truly NRT, fully aggregated, analytics engine. Game Over :-)
  9. What’s still hard?
     • Scale: Billions of events per day
       • From thousands of Customers, to hundreds of Indices.
     • Performance: Resource balancing
       • Indexing, searching, migrating, merging, filters, ...
       • Lots of ways to chew up CPU, RAM, disk, network
     • Reliability: Distributed systems
       • Big, interconnected systems (mis-)behave in complex ways, and true understanding is both a priori and a posteriori
     • Sanity: Automation
       • No-one can run hundreds of machines by hand
     What We Know (2013)
     • Search is Analytics
       • Query is a chainsaw/cleaver/scalpel
       • Facets/Aggregations/Pivots are getting better and better
       • Native numerics really are a magic bullet
     • Logs/Logging is real-time time-series data
       • Transport, format, etc are far less important than you think
       • Get the data in, any way you can
     • For some problems, “Search” is all you need
       • Free yourself from the SQL/MapReduce handcuffs ;-)
       • But don’t force it.
  10. Our Story - The Future
     Coming Soon...
     • “Search” continues to improve
       • ES vs Solr competition good for both, and Lucene
       • Update (finally!), Geo, “JOIN”, ...
     • Getting close to being a DataStore
       • Functions/Scripts are micro-MapReduce
       • Really, truly, can be the source of Truth
     • Problems continue to evolve
       • Monitoring -> Analytics -> Product -> BI
       • Users are not just Ops & Devs
       • UI really really matters
  11. Coming Later
     • Everything interesting is real-time time-series
       • Forget batch, it died a long time ago
     • Inner Loop becomes Stream Engine
       • Filter with Query, Transform with built-ins or custom code
     • Predictive Analytics is everywhere
       • “Bounce AppX on BoxY”, “Get milk on your way home”
     • Network Effect kicks in
       • Lots of companies share the same problems
     Coming NOW!
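
The slides argue that "Search is Analytics" once logs arrive as semi-structured JSON with native numerics, facets and histograms. The sketch below is one rough way to picture that idea in plain Python. It is not Loggly's, Solr's or ElasticSearch's implementation: the sample events, the field names (`timestamp`, `host`, `status`, `ms`) and the helper functions `search`, `facet` and `histogram` are all assumptions made up for illustration.

```python
import json
from collections import Counter
from datetime import datetime

# Hypothetical JSON log events; every field name here is illustrative only.
RAW_EVENTS = [
    '{"timestamp": "2013-10-01T12:00:05", "host": "web01", "status": 200, "ms": 12}',
    '{"timestamp": "2013-10-01T12:00:40", "host": "web02", "status": 500, "ms": 340}',
    '{"timestamp": "2013-10-01T12:01:10", "host": "web01", "status": 200, "ms": 15}',
    '{"timestamp": "2013-10-01T12:01:55", "host": "web01", "status": 500, "ms": 410}',
]


def parse(lines):
    """Turn JSON log lines into dicts; numbers stay numbers (native numerics)."""
    return [json.loads(line) for line in lines]


def search(events, predicate):
    """A stand-in for a query: keep only the events matching the predicate."""
    return [event for event in events if predicate(event)]


def facet(events, field):
    """Count how many events carry each distinct value of a field."""
    return Counter(event[field] for event in events)


def histogram(events):
    """Count events per one-minute bucket, keyed by the bucket's start time."""
    counts = Counter()
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        bucket = ts.replace(second=0, microsecond=0).isoformat()
        counts[bucket] += 1
    return counts


events = parse(RAW_EVENTS)
errors = search(events, lambda e: e["status"] >= 500)  # numeric range filter

print(facet(events, "host"))   # events per host
print(facet(errors, "host"))   # which hosts produced the errors
print(histogram(events))       # event volume over time
```

The same three steps (filter, count by field, count by time bucket) are what a search engine performs at scale across an index; here they run over a small in-memory list only to show the shape of the questions being asked.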