Splunk is a different kind of company with a different kind of product. Our technology is built by IT pros for IT pros, as software that people will want to use, from novice to guru. The product features one code base. Splunk software is standards-based and built on an open architecture. In addition, Splunk is flexible and extensible, allowing you to access any data in any format and provide it for viewing across an organization. The Splunk architecture was designed to scale from a single user to truly massive and distributed global deployments. Splunk software doesn’t dumb down or normalize data to fit into a database, potentially removing context. And finally, we are easy to work with and provide a transparent support environment. Our documentation and product roadmap are public, and we even have real engineers on our IRC channel.
Splunk automatically extracts a set of default fields for each event it indexes. You can "create" more "custom" fields by defining additional index-time and search-time field extractions. You can accomplish this manual field extraction through the use of search commands, the Interactive Field Extractor, and configuration files.
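As a rough illustration of what a search-time field extraction does, the sketch below pulls key=value pairs out of a raw event with a regular expression. It is plain Python, not Splunk's implementation or configuration syntax; the sample event, the pattern and the function name `extract_fields` are all assumptions made for the example.

```python
import re

# Hypothetical raw event, invented for this example.
EVENT = "2013-12-18 10:15:02 action=deny src=10.0.0.5 dst=10.0.0.9 port=22"

# Assumed convention: fields appear as key=value pairs separated by whitespace.
KV_PATTERN = re.compile(r"(\w+)=(\S+)")


def extract_fields(raw_event):
    """Return a dict of the key=value fields found in one raw event."""
    return dict(KV_PATTERN.findall(raw_event))


fields = extract_fields(EVENT)
print(fields["action"], fields["src"])
```

Because the extraction runs when the event is read rather than when it is stored, changing the pattern changes the fields without touching the raw data.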
Using Splunk's Common Information Model as a guide, you can normalize field names in your IT data so that loading external applications like firewall reports will "just work" with your existing fields. Tag event types to add information to your data. Any event type can have multiple tags. For example, you can tag all firewall event types as firewall, tag a subset of firewall event types as deny and tag another subset as allow. Once an event type is tagged, any event type matching the tagged pattern will also be tagged.
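The firewall tagging example can be sketched in a few lines of Python. This is only a conceptual model: the event type names, the predicates and the `tags_for` helper are hypothetical and do not correspond to Splunk configuration or APIs.

```python
# Hypothetical event types: a name mapped to a predicate over an event's fields.
EVENT_TYPES = {
    "firewall_deny": lambda e: e.get("sourcetype") == "firewall" and e.get("action") == "deny",
    "firewall_allow": lambda e: e.get("sourcetype") == "firewall" and e.get("action") == "allow",
}

# Each event type can carry multiple tags.
TAGS = {
    "firewall_deny": {"firewall", "deny"},
    "firewall_allow": {"firewall", "allow"},
}


def tags_for(event):
    """Collect the tags of every event type whose predicate matches the event."""
    tags = set()
    for name, matches in EVENT_TYPES.items():
        if matches(event):
            tags |= TAGS.get(name, set())
    return tags


print(sorted(tags_for({"sourcetype": "firewall", "action": "deny"})))
```

A report written against the tag firewall then covers both subsets without naming either event type.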
Splunk software enables organizations to gain new insights from this data, and a key focus for Splunk 6 is to empower a broader base of users in the organization with this insight, extending beyond core IT users. The Pivot interface enables non-technical and technical users alike to quickly generate sophisticated charts, visualizations and dashboards using simple drag and drop. Users can access different chart types from the Splunk toolbox to easily visualize their data in different ways. Queries using the Pivot interface are powered by underlying data models, which are usually designed and implemented by users who understand the format and semantics of their indexed data, and who are familiar with the Splunk Search Processing Language (SPL). Unlike traditional BI visualization tools focused on structured data analytics, the Pivot interface enables both non-technical and technical users to easily explore, manipulate and visualize raw, unstructured and polystructured data. It complements existing BI technologies by providing relevant business insights from a rapidly exploding new class of data.
Generate reports on the fly from hard-to-understand data. Create powerful, information-rich reports for analysis without advanced knowledge of search commands. Schedule delivery of any report via PDF and share it with management, business users or other stakeholders. Combine multiple charts, views, reports and external data. View and edit on any desktop, tablet or mobile device.
Splunk Enterprise is a standalone solution and the industry-leading platform for machine data with all of Splunk’s core use cases. For customers who are storing historical data in Hadoop, we offer Hunk to run analytics on data stored natively in Hadoop. Hunk targets new use cases, including: data analytics for new product and service launches; synthesis of data from all customer touch points; comprehensive security analytics for modern threats; and easier big data app development than in raw Hadoop. Furthermore, you can use Splunk Enterprise Hadoop Connect to send data between Splunk Enterprise and Hadoop. Many accounts may decide to buy both Splunk Enterprise for real-time monitoring and real-time search together with Hunk for exploratory analytics of historical data stored in Hadoop. With this combination, you can run searches across native indexes in Splunk Enterprise and Hunk virtual indexes for data in Hadoop.
Splunk Enterprise enables real-time analytics with managed forwarders for data ingest. For the most part, you can use monitor to add nearly all your data sources from files and directories. However, you might want to use upload to add one-time inputs, such as an archive of historical data. You can enable Splunk to accept an input on any TCP or UDP port. Splunk consumes any data sent on these ports. Use this method for syslog (default port is UDP 514), or set up netcat and bind to a port. TCP is the protocol underlying Splunk's data distribution and is the recommended method for sending data from any remote machine to your Splunk server. Splunk can index remote data from syslog-ng or any other application that transmits via TCP. However, there are times when you want to use scripts to feed data to Splunk for indexing, or prepare data from a non-standard source so Splunk can properly parse events and extract fields. You can use shell scripts, Python scripts, Windows batch files, PowerShell, or any other utility that can format and stream the data that you want Splunk to index. You can stream the data to Splunk or write the data from a script to a file. All data that comes into Splunk enters through the parsing pipeline as large chunks. During parsing, Splunk breaks these chunks into events, which it hands off to the indexing pipeline, where final processing occurs. During both parsing and indexing, Splunk acts on the data, transforming it in various ways. Most of these processes are configurable, so you can adapt them to your needs. To kick off a real-time search in Splunk Web, use the time range menu to select a preset Real-time time range window, such as 30 seconds or 1 minute. You can also specify a sliding time range window to apply to your real-time search. This defines a real-time buffer. The Splunk Index is the repository for Splunk Enterprise data. Splunk Enterprise transforms incoming data into events, which it stores in indexes.
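To make the parsing step more concrete, here is a minimal Python sketch of breaking a chunk into events. It assumes, purely for illustration, that a new event starts on any line beginning with a `YYYY-MM-DD HH:MM:SS` timestamp and that other lines continue the previous event. The rule, the sample chunk and the function name are assumptions, not Splunk's actual line-breaking logic.

```python
import re

# Assumed rule: a line that starts with this timestamp format begins a new event.
TIMESTAMP_AT_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def break_into_events(chunk):
    """Split a chunk of raw text into events, keeping continuation lines together."""
    events = []
    for line in chunk.splitlines():
        if TIMESTAMP_AT_START.match(line) or not events:
            events.append(line)          # start a new event
        else:
            events[-1] += "\n" + line    # continuation of the previous event
    return events


# Hypothetical chunk containing one multi-line event and one single-line event.
CHUNK = (
    "2013-12-18 10:15:02 ERROR request failed\n"
    "  at handler.process\n"
    "2013-12-18 10:15:03 INFO request ok\n"
)

events = break_into_events(CHUNK)
print(len(events))
```

The same idea explains why timestamp identification and line breaking appear together in the pipeline: the timestamp is what marks an event boundary in this sketch.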
Hunk brings Splunk software's big data analytics stack to your data in Hadoop. Explore, analyze and visualize data, create dashboards and share reports from one integrated platform that works with Apache Hadoop or the Hadoop distribution of your choice. The Splunk Virtual Index decouples the data storage tier from the data access and analytics tiers, so that Hunk can transparently route requests to different data stores. Hunk uses this foundational patent-pending technology to enable seamless interactive exploration, analysis and visualization for data stored in Hadoop. You can create multiple virtual indexes that extend across one or more Hadoop clusters. Virtual indexes contain pointers to the data, such as assigning all files in a directory as an index, so you can prune partitions for faster search performance. With Hunk, even time stamp extraction and event breaking are done at search time.
One of the key innovations in this product is Splunk Virtual Index technology. This patent-pending capability enables the seamless use of almost the entire Splunk technology stack, including the Splunk Search Processing Language for interactive exploration, analysis and visualization of data stored anywhere, as if it were stored in a Splunk Index. Splunk Analytics for Hadoop uses this foundational technology and is the first product to come from this innovation. To configure the virtual index, specify the external resource provider the virtual index is serviced by and specify the data paths that belong to this virtual index.
A virtual index is a search-time concept that allows a Splunk search to access data and optionally push computation to external systems. A virtual index behaves as an addressable data container that can be referenced by a search. Virtual indexes contain pointers to the data, such as "all files in this directory belong in this index". Since the data that resides in the external system is not under direct management of Splunk, retention policies cannot be applied to the datasets that make up virtual indexes. And data in external systems such as Hadoop will often not be optimized for search. Hunk is able to provide access to and perform analytics on data that resides in an external system by encapsulating the data into addressable units using virtual indexes, while utilizing external resource processes to handle the details of pushing down computations to the external system. There are several key reasons for having multiple indexes: to control user access, to organize how you search data across disparate data sets, and to speed searches. You can define a virtual index as the contents of an entire Hadoop cluster, or as subsets of data in that cluster, such as by data type.
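The idea of a virtual index as a named pointer to data paths, and of pruning partitions before a search runs, can be sketched as follows. This is a conceptual Python model only: the `VirtualIndex` class, its field names, the date-based directory layout and the `prune` function are hypothetical and are not Hunk's configuration format or API.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class VirtualIndex:
    name: str
    provider: str      # hypothetical name of the external resource provider
    path_prefix: str   # every file under this prefix belongs to the index


# Assumed layout: files are partitioned into .../YYYY/MM/DD/... directories.
DATE_IN_PATH = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")


def prune(index, paths, earliest, latest):
    """Keep only the index's paths whose date partition overlaps the search window."""
    selected = []
    for path in paths:
        if not path.startswith(index.path_prefix):
            continue                     # not part of this virtual index
        match = DATE_IN_PATH.search(path)
        if match is None:
            selected.append(path)        # no date in the path, so it cannot be pruned
            continue
        day = date(*(int(part) for part in match.groups()))
        if earliest <= day <= latest:
            selected.append(path)
    return selected


weblogs = VirtualIndex(name="weblogs", provider="hadoop-prod", path_prefix="/data/weblogs/")
PATHS = [
    "/data/weblogs/2013/12/01/part-0.log",
    "/data/weblogs/2013/12/17/part-0.log",
    "/data/other/2013/12/17/part-0.log",
]
selected = prune(weblogs, PATHS, date(2013, 12, 10), date(2013, 12, 18))
print(selected)
```

Skipping whole directories this way is what makes a narrower time range cheaper, even though the underlying files are not otherwise optimized for search.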
Hunk starts the streaming and reporting modes concurrently. Streaming results show until the reporting results come in. This allows users to search interactively by pausing and refining queries. This is a major, unique advantage of Hunk compared to alternative approaches such as Hive or SQL on Hadoop, which require a fixed schema in an effort to speed up searches, while Hunk retains the combination of schema on the fly with results preview.
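A small Python sketch can illustrate the pattern of running a quick preview and a complete computation side by side. It is a toy model under stated assumptions: the blocks are in-memory lists, the "search" is a count of lines containing ERROR, and the function names are invented. It says nothing about how Hunk actually schedules work on a Hadoop cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def count_errors(lines):
    """Toy search: count lines that contain the string ERROR."""
    return sum(1 for line in lines if "ERROR" in line)


def streaming_preview(blocks, first_n=1):
    """Preview computed from only the first few blocks."""
    return count_errors(line for block in blocks[:first_n] for line in block)


def full_report(blocks):
    """Complete result computed over every block."""
    return count_errors(line for block in blocks for line in block)


def search(blocks):
    """Start both modes concurrently and return (preview, final)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        preview_future = pool.submit(streaming_preview, blocks)
        report_future = pool.submit(full_report, blocks)
        preview = preview_future.result()   # partial answer a user could see first
        final = report_future.result()      # complete answer that supersedes it
    return preview, final


# Hypothetical data: two blocks of log lines.
BLOCKS = [
    ["INFO ok", "ERROR disk full"],
    ["ERROR timeout", "ERROR timeout", "INFO ok"],
]

preview, final = search(BLOCKS)
print(preview, final)
```

The preview is deliberately incomplete; its value is that a user can judge whether the query is worth refining before the full result arrives.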
Before data is processed by Hunk, you can plug in your own data preprocessor. Preprocessors have to be written in Java and can transform the data in some way before Hunk gets a chance to. Data preprocessors can vary in complexity from simple translators (say, Avro to JSON) to image, video or document processing. Hunk translates Avro to JSON. These translations happen on the fly and are not persisted.
Transcript of "December 2013 HUG: Hunk - Splunk over Hadoop"
Splunk is a Different Approach
Built by IT pros for IT pros
It’s all about the user from novice to guru
One code base
Laptop to datacenter, Unix to Windows, agent to server
Files versus database, scriptable. APIs. SDKs, standards
Flexible and extensible
Any data, any format, different views, built to be extended
Scales to big data
Not filtered, not “dumbed” down, not locked into a fixed schema
Public documentation, public roadmap, real engineers on IRC
Inside Search-time Knowledge Extraction
Automatically discovered fields
And user-defined fields
... enable statistics and precise search on specific fields:
Inside Search-time Knowledge Extraction
Searches saved as event types
Plus tagging of event types, hosts and other fields
... enable normalized reporting, knowledge
sharing and granular access control.
Powerful, Easy-to-use Analytics for Everyone
Data Models and Pivot
• Data models describe how
underlying data is
represented and accessed
• Drag-and-drop interface
enables anyone to analyze
raw, unstructured data
• Click to visualize any chart
type; reports dynamically
update when fields change
All chart types available in the chart toolbox
Add constraints to
filter out events
Select fields from
Data models: hierarchical object view of underlying data
Visualize and Share Data with Role-based Security
Build and Personalize
• Rapidly build advanced graphs
and charts on-the-fly
Combine charts, views and
external data in dashboards
View and edit on any desktop
or mobile device
Drill down to raw data
Protect data with role-based
Dashboards and Views
• Simple XML,
• REST API
• Custom styling,
behavior & visuals
• iframe embed
• Integrate charts, dashboards and query results into other applications
• Create workflows that trigger an action in an external system or use REST endpoints
• ODBC driver (beta) to integrate with 3rd-party visualization software
Analytics Use Cases by Splunk Product
Ad hoc analytics of
historical data in Hadoop
Developers building big data apps on top of Hadoop
Vibrant and passionate developer community
Splunk Hadoop Connect
Real-Time Analytics with Managed Forwarders
• Source, event typing
• Character set
• Line breaking
• Timestamp identification
• Regex transforms
Naturally suitable for MapReduce
Reduces adoption time
Challenge: Hadoop “apps” written in Java & all SPL code is in C++
Porting SPL to Java would be a daunting task (120+ commands)
Reuse the C++ code somehow
– JNI – not easy nor stable
– use “splunkd” (the binary) to process the data
Schema on read
Apply Splunk’s index-time schema at search time
– Event breaking, time stamping etc
Anything else would be brittle & maintenance nightmare
Runtime overhead (manpower $ >> computation $)
Challenge: Hadoop “apps” written in Java & all index-time schema logic
is implemented in C++
No one likes to stare at a blank screen!
Challenge: Hadoop is designed for batch-like jobs
• Transfers first several blocks from HDFS to the Hunk Search Head for immediate processing
• Pushes computation to the DataNodes and TaskTrackers for the complete search
• Hunk starts the streaming and reporting modes concurrently
• Streaming results show until the reporting results come in
• Allows users to search interactively by pausing and refining queries
Data Processing Pipeline
You can plug in
e.g. Apache Avro or