• Founded 2004, first software release in 2006
• HQ: San Francisco / Region HQ: London, Hong Kong
• Over 600 employees, based in 10 countries
• Q2 Revenue: $44.5 million; +71% year-over-year
• Free download to massive scale
• On-premise, in the cloud and SaaS
• 4,400+ Customers
• Customers in over 80 countries
• 54 of the Fortune 100
So where did we come up with this name? It’s from the term Spelunking – to explore underground caves. Splunking is to explore large amounts of machine data.
Machine data is an incredibly valuable resource, but organizations rarely get the value they need from it. Splunk helps these organizations solve a very difficult problem: collecting, storing and analyzing this data to provide strategic insights for IT and the business. Our mission is simple: to take machine data and make it accessible, usable and valuable to everyone, and hopefully this will include your organization.
Splunk is the leading enterprise solution for managing and analyzing machine data. It provides a unified way to organize and to extract actionable insights from the massive amounts of machine data generated across diverse sources. One person can download and implement Splunk in hours, rather than having a team of people take months or even years to deploy a solution. You can connect to your data in a few clicks and create powerful dashboards with a few more.
Key capabilities:
• Splunk collects machine data securely and reliably from wherever it’s generated.
• Splunk stores and indexes all of the data in real time in a centralized location and protects it with role-based access controls.
• Splunk turns your machine data into a NoSQL data fabric that can be searched, browsed, navigated, analyzed and visualized. This enables IT professionals and businesses to solve a wide range of mission-critical problems, all without the inherent limitations of traditional approaches.
• Search and analyze live streaming data and terabytes of historically indexed data from one place.
• Splunk automatically monitors your data for trends and specific patterns of activity or behavior, then immediately notifies the people who need to know.
• Powerful search, drilldown and reporting capabilities meet the needs of novice users and expert analysts alike.
• Easy-to-create dashboards put critical insights from your machine data into the hands of the people who need them.
Here’s the context for all the material that follows. The “Enterprise Services” program is all about…
• Logs scattered everywhere = complex ecosystem
• Looming horizon = data explosion
• Story: going live, millions of hits start coming in, and we try to figure out what is actually happening
4 hours. No joke. We were drawn to innovate; just try something new and see what happens.
A list of consumers of the Locations service over a 24-hour period. Story: we identified a bad API key before the developer knew what was wrong.
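The bad-API-key story comes down to grouping requests by consumer and looking at each consumer's error rate. A minimal Python sketch of that idea follows; it is not the search used at Target. The `(api_key, http_status)` event shape, the function names and the 50% threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def error_rate_by_key(events):
    """events: iterable of (api_key, http_status) pairs (hypothetical shape).

    Returns {api_key: (request_count, error_rate)}.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for api_key, status in events:
        totals[api_key] += 1
        if status >= 400:  # treat any 4xx/5xx response as an error
            errors[api_key] += 1
    return {key: (totals[key], errors[key] / totals[key]) for key in totals}


def suspect_keys(events, max_error_rate=0.5):
    """API keys whose error rate exceeds an (assumed) threshold."""
    rates = error_rate_by_key(events)
    return sorted(key for key, (_, rate) in rates.items() if rate > max_error_rate)
```

A key that fails on nearly every call stands out immediately in such a summary, which is what lets you contact the developer before they notice.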
We’re taking a look at our infrastructure design because of this.
We are able to report on non-functional requirements. Going forward, we can do a better job of not over-estimating infrastructure needs, which saves money, avoids idle inventory sitting on the shelf, and opens the door to putting the right money in the right places.
You saw the original map at the beginning of our presentation; as we expose more APIs, what can we learn from them?
How are we adhering to this advice? We have accomplished many of these metrics already. Most of these are achievable with Splunk.
The more you have in Splunk, the more complete the monitoring picture can be.
Great for perf/load testing; see all the errors in one place. You can even put the Jenkins logs in Splunk and show the results across all APIs being developed.
• Allow apps to have multiple ways to get logs into Splunk
• No UF on consumer devices
• Build transactions across multiple layers of the infra
• Use UFs on end points everywhere = FASTEST
• Else, consolidate and mount Splunk = FAST
• Else, use CLS RESTful API = SLOW
“Nothing is wrong. Your data is wrong.” The challenge is getting people to trust what Splunk is telling us. Story: one of the nodes was down, and initially people didn’t believe the data was right.
Putting Data to Work by Splunking All the Things at Target - Gartner AADI 2012
Splunk Company Overview
• Company (NASDAQ: SPLK)
  – Founded 2004, first software release in 2006
  – HQ: San Francisco / Region HQ: London, Hong Kong
  – Over 600 employees, based in 10 countries
  – FY 12 Revenue: $121MM; FY 13 Guidance: $183MM; Q2 FY 13 Revenue: $44.5 million
• Business Model / Products
  – Free download to massive scale
  – Software deployed on-premise and in the cloud; Splunk Storm delivered via a SaaS model
• 4,400+ Customers
  – Customers in over 80 countries
  – 54 of the Fortune 100
  – Largest license: 100 Terabytes per day
Splunk
• Spelunking: to explore underground caves
• Splunking: to explore and visualize large amounts of machine data
Mission
• Make machine data accessible, usable and valuable to everyone.
Splunk Collects and Indexes Any Machine Data
[Diagram of machine data sources. Customer-facing data: click-stream, shopping cart and online transaction data. Outside the datacenter: manufacturing, logistics, CDRs & IPDRs, power consumption, RFID data, GPS data. Data types: logfiles, configs, messages, traps, alerts, metrics, scripts, changes, tickets. Platforms: Windows, Linux/Unix, virtualization & cloud, applications, databases, networking.]
Splunk Collects and Indexes Any Machine Data
[Same diagram of machine data sources, with callouts:]
• Any amount, any location, any source
• No upfront schema
• No custom connectors
• No RDBMS
• No need to filter/forward logs
Turning Machine Data into Operational Intelligence
Integrated Collection, Storage and Visualization.
• Real-time Collection and Indexing
• Ad hoc search
• Monitor and alert
• Report and analyze
• Custom dashboards
• Developer Platform
Turning Machine Data into Operational Intelligence
Machine Data, through Integrated Collection, Storage and Visualization, becomes Operational Intelligence:
• Business Insights: Gain real-time insight from your machine data to make better-informed business decisions.
• Operational Visibility: Gain operational visibility to make better-informed IT decisions.
• Proactive Monitoring: Monitor infrastructure to identify issues, problems and attacks before they impact your customers and services.
• Search and Investigation: Find and fix problems across the organization using machine data.
Enabling Application Intelligence for Dev & Production
• Talks to every technology in your stack
• Correlates data across the different tiers – find causal links
• Built for Big Data - Visualize, analyze, trend all your data at scale
[Diagram labels: end user devices, networking/load balancing, web, app, services, servers, databases, messaging, legacy systems, security, virtualization, storage.]
Operational Intelligence Across Use Cases
• Application Management
• IT Ops
• Web Intelligence
• Business Analytics
• Internet of Things
• Security
• Compliance
• DEVELOPER FRAMEWORK
Broad Adoption Across 4,400+ Customers
Over Half the Fortune 100
• Financial Services & Insurance
• Retail
• Technology
• Cloud and Online Services
• Government
• Healthcare
• Manufacturing
• Media & Entertainment
• Energy and Utilities
• Education
• Telecommunications
• Travel and Leisure
Putting Data to Work by Splunking All the Things at Target
Dan Cundiff, Target Corporation
About Me
• Technical Architect
• 7+ years development experience working across several groups: security, social media and knowledge management, and service oriented architectures
• Currently focused on API development, creating RESTful APIs that are used in and outside of the enterprise across a wide range of devices, applications, and business partners
• Enjoy automating - all the things - exchanging pro tips on continuous integration and deployment
• @pmotch
Context: Enterprise Services @ Target
• Data and transactional APIs for all the domains in our business
  – Products (inventory, price, description, etc)
  – Locations
  – Coupons
  – etc
• APIs exposed inside and outside
• Mostly RESTful APIs, some pub sub/messaging
• Used by mobile devices, applications, partners on the outside, etc.
• Constantly evolving, rapidly improving, all the time
Part Problem. Part Opportunity.
• First API go-live:
  – Millions of log events per day (grep/cut/sed/awk not cutting it)
  – Logs scattered everywhere
  – Limited access to logs
  – Needed end to end visibility of web services
  – Needed ability to discover information in logs
  – Can we be pro-active? Faster reactive?
• Looming horizon:
  – BILLIONS of log events coming
  – Questions changing every day from business, support, execs, developers
Solution. Gave Splunk a Try.
• Installed Splunk on a lab server
• Hooked up Splunk to the logs
• Quickly created 15+ searches and reports
• Generated a dashboard for visibility and trending
• Total time to do all this in Splunk: ~4 hours
Why Splunk?
Find What We Don’t Know:
• Understand “Normal”
• Actionable events
• Identify standard tolerances
• Find things we didn’t know existed
Proactive:
• Indicators of outliers, anomalies, percentage changes, standard deviations
• Quick and flexible dashboards
• Drilldown
Full Stack Visibility:
• API gateway
• Network (load balancers, firewalls)
• Web/app
• OS
Community!:
• Community (Splunkbase, blogs, etc)
• Google-able™
• App store!
Splunk delivers us a new type of intelligence.
Understanding “Normal”
• Overall volume of requests
• API response time SLAs
• Error code by proportion
• Error code by volume
• All the data in one place allows us to track multiple indicators of “Normal”
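Two of these indicators, error code by proportion and deviation from the usual request volume, can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a Splunk search. The function names and the two-standard-deviation threshold are assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def error_code_proportions(status_codes):
    """Share of each HTTP status code among all requests."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


def outliers(values, threshold=2.0):
    """Indices of values more than `threshold` standard deviations from the mean.

    `values` could be requests per hour or response times; the threshold
    of 2.0 is an assumed default, tune it to your own data.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # perfectly flat series: nothing deviates
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]
```

Once "normal" is expressed as a proportion or a band around the mean, anything outside it becomes a candidate for an alert.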
Better Understand Consumers
• Who is using it, and how?
• What’s their experience?
Better Understand Consumers, Part 2
• Load testing in production?
Understanding Our Infrastructure
• Expected design vs actual implementation
• Not balancing workload as expected
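Checking whether the workload is balanced as designed amounts to comparing each node's share of requests with the share the design expects. The Python sketch below assumes an even split across nodes and a 10-point tolerance; both are illustrative assumptions, as are the function and node names.

```python
from collections import Counter


def load_share(node_per_request):
    """Fraction of requests handled by each node."""
    counts = Counter(node_per_request)
    total = sum(counts.values())
    return {node: n / total for node, n in counts.items()}


def unbalanced_nodes(node_per_request, expected_nodes, tolerance=0.10):
    """Nodes whose share differs from an even split by more than `tolerance`.

    A node that appears in `expected_nodes` but handled no requests
    (for example because it is down) is reported as well.
    """
    share = load_share(node_per_request)
    expected = 1 / len(expected_nodes)
    return sorted(
        node
        for node in expected_nodes
        if abs(share.get(node, 0.0) - expected) > tolerance
    )
```

Passing the list of nodes from the design, rather than only the nodes seen in the logs, is what makes a silent node visible.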
Understanding Providers
• How are providers responding?
• Is overhead added to the API response?
Splunking all the Things
• Consumer apps
• Provider systems
• OS, firewalls, proxies
• External API gateway logs
• Anything in between (middleware, integrations, etc)
• Correlate with logs from apps degrees away (e.g. .com web logs)
• Development (perf test results, git, Jenkins/CI, wiki, etc)
Dashboards
• Global dashboard summarizing all APIs
• BI dashboards
• Executive dashboards
• Custom dashboards for different roles bring the right information to the appropriate fingertips
Dashboards, Part 2
• Environment dashboards for each API
  – CI
  – Test
  – Stage
  – Prod
Dashboards, Part 3
• Alert trending dashboards for each API
Splunking Continuous Integration
• Drill down into CI results linked straight from Jenkins
  – Filtered by date OR transaction GUID
Splunking Continuous Integration, Part 2
• We practice code as documentation
• Every commit, Jenkins runs, extracts documentation from code, and puts it in the respective wiki pages (pretty cool! – automated / no humans)
• Splunk monitors wiki changes using the MediaWiki API
• Monitor CI + human wiki changes
• https://github.com/pmotch/wikislurp
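To give a feel for what monitoring wiki changes through the MediaWiki API can involve, here is a small Python sketch. It is not the wikislurp implementation. It only builds a `list=recentchanges` query URL and turns a response of the documented shape into flat log lines; the endpoint `https://wiki.example.com/api.php` and the output field names are hypothetical.

```python
from urllib.parse import urlencode


def recent_changes_url(api_endpoint: str, limit: int = 50) -> str:
    """Build a MediaWiki recent-changes query URL for the given api.php endpoint."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|user|timestamp|comment",
        "rclimit": limit,
        "format": "json",
    }
    return f"{api_endpoint}?{urlencode(params)}"


def changes_to_log_lines(response: dict):
    """Convert a parsed recentchanges JSON response into timestamp-first k=v lines."""
    lines = []
    for change in response.get("query", {}).get("recentchanges", []):
        lines.append(
            f'{change["timestamp"]} source=wiki '
            f'user="{change["user"]}" title="{change["title"]}"'
        )
    return lines


if __name__ == "__main__":
    # Hypothetical endpoint; fetch the URL with any HTTP client and pass the JSON on.
    print(recent_changes_url("https://wiki.example.com/api.php", limit=10))
```

Lines produced this way can sit next to the Jenkins results, so automated and human wiki edits show up on the same timeline.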
Common Logging Service
• CLS is our strategy for getting logs from all places into Splunk
• How
  – Use UFs on end points everywhere
  – Else, consolidate and mount Splunk
  – Else, use CLS RESTful API
• Enables end-to-end visibility
  – Insert GUIDs across all the hops in the transaction
• Use out of the box log formats (e.g. Log4j)
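The GUID idea is easiest to see with a toy example: generate one ID at the edge, log it at every hop, then group events by that ID to rebuild the transaction. The Python sketch below illustrates this under stated assumptions. The field name `guid`, the `layer` values and the log line layout are hypothetical, not the CLS format.

```python
import re
import uuid
from collections import defaultdict

# Matches k=v pairs; values are either "quoted strings" or runs of non-space characters.
KV_PATTERN = re.compile(r'(\w+)=("[^"]*"|\S+)')


def new_transaction_id() -> str:
    """Create a GUID once at the edge; every downstream hop logs the same value."""
    return str(uuid.uuid4())


def parse_kv(line: str) -> dict:
    """Extract k=v pairs from a flat log line."""
    return {key: value.strip('"') for key, value in KV_PATTERN.findall(line)}


def group_by_guid(lines):
    """Group events from different layers by their shared GUID."""
    transactions = defaultdict(list)
    for line in lines:
        event = parse_kv(line)
        if "guid" in event:
            transactions[event["guid"]].append(event)
    return dict(transactions)


if __name__ == "__main__":
    guid = new_transaction_id()
    lines = [
        f"2012-12-03T14:30:00Z layer=gateway guid={guid} status=200",
        f"2012-12-03T14:30:00Z layer=app guid={guid} elapsed_ms=42",
    ]
    for events in group_by_guid(lines).values():
        print([event["layer"] for event in events])
```

Because every layer logs the same ID, the grouping works regardless of which ingestion path delivered each event.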
Lessons
• RTFM
  – Keep logs flat
  – Keep timestamp (ISO8601) at the beginning
  – k=v
• Iterate quick, push to prod; minimal tweaks to Splunk
• Flatten out of box audit events (XML)
  – Toggle at runtime
• Don’t re-invent the wheel, use what your system provides, Splunk can handle it!
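The three formatting lessons (flat logs, ISO 8601 timestamp first, k=v pairs) fit in one small helper. The Python sketch below is only an illustration. The name `format_event`, the sample field names and the quoting rule for values with spaces are assumptions, not a format prescribed by the talk.

```python
from datetime import datetime, timezone


def format_event(timestamp: datetime, **fields) -> str:
    """Return one flat log line: ISO 8601 timestamp first, then k=v pairs."""
    parts = [timestamp.isoformat()]
    for key, value in fields.items():
        text = str(value)
        # Quote empty values and values containing whitespace so each pair stays one token.
        if text == "" or any(ch.isspace() for ch in text):
            text = '"' + text.replace('"', '\\"') + '"'
        parts.append(f"{key}={text}")
    return " ".join(parts)


if __name__ == "__main__":
    ts = datetime(2012, 12, 3, 14, 30, 0, tzinfo=timezone.utc)  # sample timestamp
    print(format_event(ts, api="locations", status=200, msg="store lookup ok"))
    # prints: 2012-12-03T14:30:00+00:00 api=locations status=200 msg="store lookup ok"
```

One event per line with self-describing pairs means fields can be picked out at search time without a custom parser.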
Lessons, Part 2
• Don’t pre-optimize up front
  – Governance
  – Standards
  – Alerting
  – Access controls
• Optimize as needed
Lessons, Part 4
• Create best practices, standards, etc in a wiki
Challenges: Organizational
• “Stop. We already have tools that do this. Use those.”
  – tgtMAKE saves the day
  – tgtMAKE = R&D
  – R&D = $, servers, flak shelter, people network
• Make it real strategy
  – Demo to as many key players as possible
  – Drum up interest
  – Show actual value
Challenges: Organizational, Part 2
• The data can’t be trusted?
Recap
• Be bold.
• Tooling matters.
• Sell it.
• Splunk all the things!
• Iterate, adapt, change quickly.