December 2013 HUG: Hunk - Splunk over Hadoop


Published in: Technology

  • Splunk is a different kind of company with a different kind of product. Our technology is built by IT pros for IT pros, as software that people will want to use, from novice to guru. The product features one code base. Splunk software is standards-based and built on an open architecture. In addition, Splunk is flexible and extensible, allowing you to access any data in any format and make it available for viewing across an organization. The Splunk architecture was designed to scale from a single user to truly massive, distributed global deployments. Splunk software doesn’t dumb down or normalize data to fit into a database, potentially removing context. Finally, we are easy to work with and provide a transparent support environment: our documentation and our product roadmap are public, and we even have real engineers on our IRC channel.
  • Splunk automatically extracts a set of default fields for each event it indexes. You can create additional custom fields by defining index-time and search-time field extractions. You can perform this manual field extraction with search commands, the Interactive Field Extractor, and configuration files.
  • Using Splunk's Common Information Model as a guide, you can normalize field names in your IT data so that loading external applications like firewall reports will "just work" with your existing fields. Tag event types to add information to your data. Any event type can have multiple tags. For example, you can tag all firewall event types as firewall, tag a subset of firewall event types as deny and tag another subset as allow. Once an event type is tagged, any event type matching the tagged pattern will also be tagged.
  • Splunk software enables organizations to gain new insights from this data, and a key focus for Splunk 6 is to empower a broader base of users in the organization with this insight, including users beyond core IT. The Pivot interface enables non-technical and technical users alike to quickly generate sophisticated charts, visualizations and dashboards using simple drag and drop. Users can access different chart types from the Splunk toolbox to easily visualize their data in different ways. Queries using the Pivot interface are powered by underlying data models, which are usually designed and implemented by users who understand the format and semantics of their indexed data, and who are familiar with the Splunk Search Processing Language (SPL). Unlike traditional BI visualization tools focused on structured data analytics, the Pivot interface enables both non-technical and technical users to easily explore, manipulate and visualize raw, unstructured and polystructured data. It complements existing BI technologies by providing relevant business insights from a rapidly exploding new class of data.
  • Generate reports on the fly from hard-to-understand data. Create powerful, information-rich reports for analysis, without advanced knowledge of search commands. Schedule delivery of any report via PDF and share it with management, business users or other stakeholders. Combine multiple charts, views, reports and external data. View and edit on any desktop, tablet or mobile device.
  • Dashboards and Views: Build interactive dashboards and user workflows with Simple XML, JavaScript and Django. Easily add custom styling, behavior and visualizations. One-click access to develop in the Splunk web framework. More Options for UI Extensibility: Integrate charts, dashboards, and query results into other applications. Create workflow actions that trigger an action in an external system (for example, an alert creates a change request in a help-desk system). External / scripted lookups from a database or other system. Applications or interfaces developed on Splunk's REST API and SDKs. ODBC driver in beta to integrate with 3rd party visualization software such as Tableau, QlikTech and TIBCO Spotfire.
  • Splunk Enterprise is a standalone solution and the industry-leading platform for machine data, covering all of Splunk’s core use cases. For customers who are storing historical data in Hadoop, we offer Hunk to run analytics on data stored natively in Hadoop. Hunk targets new use cases, including: data analytics for new product and service launches; synthesis of data from all customer touch points; comprehensive security analytics for modern threats; and easier big data app development than in raw Hadoop. Furthermore, you can use Splunk Enterprise Hadoop Connect to send data between Splunk Enterprise and Hadoop. Many accounts may decide to buy both Splunk Enterprise, for real-time monitoring and real-time search, and Hunk, for exploratory analytics of historical data stored in Hadoop. With this combination, you can run searches across native indexes in Splunk Enterprise and Hunk virtual indexes for data in Hadoop.
  • Splunk Enterprise enables real-time analytics with managed forwarders for data ingest. For the most part, you can use monitor to add nearly all your data sources from files and directories. However, you might want to use upload to add one-time inputs, such as an archive of historical data. You can enable Splunk to accept an input on any TCP or UDP port. Splunk consumes any data sent on these ports. Use this method for syslog (default port is UDP 514), or set up netcat and bind to a port. TCP is the protocol underlying Splunk's data distribution and is the recommended method for sending data from any remote machine to your Splunk server. Splunk can index remote data from syslog-ng or any other application that transmits via TCP. However, there are times when you want to use scripts to feed data to Splunk for indexing, or to prepare data from a non-standard source so Splunk can properly parse events and extract fields. You can use shell scripts, Python scripts, Windows batch files, PowerShell, or any other utility that can format and stream the data that you want Splunk to index. You can stream the data to Splunk or write the data from a script to a file. All data that comes into Splunk enters through the parsing pipeline as large chunks. During parsing, Splunk breaks these chunks into events, which it hands off to the indexing pipeline, where final processing occurs. During both parsing and indexing, Splunk acts on the data, transforming it in various ways. Most of these processes are configurable, so you can adapt them to your needs. To kick off a real-time search in Splunk Web, use the time range menu to select a preset real-time time range window, such as 30 seconds or 1 minute. You can also specify a sliding time range window to apply to your real-time search. This defines a real-time buffer. The Splunk Index is the repository for Splunk Enterprise data. Splunk Enterprise transforms incoming data into events, which it stores in indexes.
  • Hunk brings Splunk software's big data analytics stack to your data in Hadoop. Explore, analyze and visualize data, create dashboards and share reports from one integrated platform that works with Apache Hadoop or the Hadoop distribution of your choice. The Splunk Virtual Index decouples the data storage tier from the data access and analytics tiers, so that Hunk can transparently route requests to different data stores. Hunk uses this foundational patent-pending technology to enable seamless interactive exploration, analysis and visualization for data stored in Hadoop. You can create multiple virtual indexes that extend across one or more Hadoop clusters. Virtual indexes contain pointers to the data, such as assigning all files in a directory as an index, so you can prune partitions for faster search performance. With Hunk, even time stamp extraction and event breaking are done at search time.
  • One of the key innovations in this product is Splunk Virtual Index technology. This patent-pending capability enables the seamless use of almost the entire Splunk technology stack, including the Splunk Search Processing Language, for interactive exploration, analysis and visualization of data stored anywhere, as if it were stored in a Splunk Index. Splunk Analytics for Hadoop uses this foundational technology and is the first product to come from this innovation. To configure a virtual index, specify the external resource provider that services it and the data paths that belong to it.
  • A virtual index is a search-time concept that allows a Splunk search to access data and optionally push computation to external systems. A virtual index behaves as an addressable data container that can be referenced by a search. Virtual indexes contain pointers to the data, such as "all files in this directory belong in this index". Since the data that resides in the external system is not under direct management of Splunk, retention policies cannot be applied to the datasets that make up virtual indexes. Data in external systems such as Hadoop will also often not be optimized for search. Hunk is able to provide access to and perform analytics on data that resides in external systems by encapsulating the data into addressable units using virtual indexes, while utilizing external resource processes to handle the details of pushing down computations to the external system. There are several key reasons for having multiple indexes: to control user access, to organize how you search data across disparate data sets, and to speed searches. You can define a virtual index as the contents of an entire Hadoop cluster, or as subsets of data in that cluster, such as by data type.
  • Hunk starts the streaming and reporting modes concurrently. Streaming results show until the reporting results come in. This allows users to search interactively by pausing and refining queries. It is a major, unique advantage of Hunk compared to alternative approaches such as Hive or SQL on Hadoop, which require a fixed schema in an effort to speed up searches, while Hunk retains the combination of schema on the fly with results preview.
  • Before data is processed by Hunk, you can plug in your own data preprocessor. Preprocessors have to be written in Java and can transform the data in some way before Hunk gets a chance to. Data preprocessors can vary in complexity from simple translators (say, Avro to JSON) to complex image, video or document processing. Hunk translates Avro to JSON. These translations happen on the fly and are not persisted.
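The notes above describe a virtual index as an addressable container that holds pointers to data, is serviced by an external resource provider, and is resolved only at search time. The following is a minimal Python sketch of that idea, not Hunk's implementation: the `Provider` and `VirtualIndex` classes, the `search` function, and the example paths are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class Provider:
    """Hypothetical stand-in for an external resource provider (e.g. one Hadoop cluster)."""
    name: str
    files: list  # paths currently held by the external system


@dataclass
class VirtualIndex:
    """An addressable container that stores pointers (path patterns), not the data itself."""
    name: str
    provider: Provider
    path_patterns: list = field(default_factory=list)

    def resolve(self):
        """At search time, list the provider's files that belong to this index."""
        return [f for f in self.provider.files
                if any(fnmatchcase(f, p) for p in self.path_patterns)]


def search(indexes, index_name):
    """Route a search on `index=<name>` to the files behind that virtual index."""
    by_name = {ix.name: ix for ix in indexes}
    return by_name[index_name].resolve()


# Illustrative data only: one cluster, two virtual indexes defined by directory.
cluster = Provider("hadoop-cluster-1", [
    "/data/syslog/2013/12/01.log",
    "/data/apache/access.log",
    "/data/syslog/2013/12/02.log",
])
indexes = [
    VirtualIndex("syslog", cluster, ["/data/syslog/*"]),
    VirtualIndex("apache_logs", cluster, ["/data/apache/*"]),
]
syslog_files = search(indexes, "syslog")
```

Because the index stores only patterns, nothing is copied or retained on the search side; narrowing a pattern to a single partition directory is what would let a search skip the remaining files.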

    1. © 2012 Splunk, Inc. Inside Splunk Enterprise and Hunk: Architecture, Analytics and Use Cases. Todd Papaioannou, CTO; Ledion Bitincka, Principal Architect; Brett Sheppard, Big Data PMM Director. December 2013
    2. Splunk is a Different Approach. Built by IT pros for IT pros: it’s all about the user, from novice to guru. One code base: laptop to datacenter, Unix to Windows, agent to server. Open architecture: files versus database, scriptable, APIs, SDKs, standards. Flexible and extensible: any data, any format, different views, built to be extended. Scales to big data: not filtered, not “dumbed” down, not locked into a fixed schema. Transparent support: public documentation, public roadmap, real engineers on IRC.
    3. Inside Search-time Knowledge Extraction. Automatically discovered fields and user-defined fields enable statistics and precise search on specific fields.
    4. Inside Search-time Knowledge Extraction. Searches saved as event types, plus tagging of event types, hosts and other fields, enable normalized reporting, knowledge sharing and granular access control.
    5. Powerful, Easy-to-use Analytics for Everyone. Data Models and Pivot: • Data models describe how underlying data is represented and accessed • Drag-and-drop interface enables anyone to analyze raw, unstructured data • Click to visualize any chart type; reports dynamically update when fields change. Screenshot callouts: all chart types available in the chart toolbox; save report to share; time window; add constraints to filter out events; select fields from data model; data models are a hierarchical object view of underlying data.
    6. Visualize and Share Data with Role-based Security. Build and Personalize: • Rapidly build advanced graphs and charts on-the-fly • Combine charts, views and external data in dashboards and reports • View and edit on any desktop or mobile device • Drill down to raw data • Protect data with role-based access controls
    7. Integration Methods. Dashboards and Views: • Simple XML, JavaScript, Django • Interactive dashboards and user workflows • Custom styling, behavior & visuals. UI Extensibility: • REST API • iframe embed • Integrate charts, dashboards and query results into other applications • Create workflows that trigger an action in an external system or use REST endpoints • ODBC driver (beta) to integrate with 3rd-party visualization software
    8. Analytics Use Cases by Splunk Product. Real-time indexing and real-time search: App Dev & App Mgmt., IT Ops., Digital Intelligence, Security & Compliance, Business Analytics. Ad hoc analytics of historical data in Hadoop: Product and Service Analytics, Complete 360° Customer View, Security Analytics, developers building big data apps on top of Hadoop. Also shown: Splunk Apps, a vibrant and passionate developer community, and Splunk Hadoop Connect.
    9. Real-Time Analytics with Managed Forwarders. Diagram: data enters through Monitor, TCP/UDP and Scripted inputs into the Parsing Queue, then passes through the Parsing Pipeline (source, event typing; character set normalization; line breaking; timestamp identification; regex transforms), the Index Queue and the Indexing Pipeline, which writes raw data and index files to the Splunk Index. A Real-time Buffer feeds the Real-time Search Process.
    10. Hunk: Splunk Analytics for Hadoop
    11. Inside Hunk
    12. The Problem: Easy to get data in. Large amounts of data already in Hadoop. Hard to get value out.
    13. Data -> Value (today): Collect, Prepare, Ask
    14. Data -> Value (ideally): Collect, Prepare, Ask
    15. What if? Hadoop + Splunk =
    16. Hadoop + Splunk = Hunk
    17. Free Download: Go now to and download your 60-day free trial, with no limit on the size of the Hadoop cluster
    18. Goals
    19. Process the data in place. Maintain support for Splunk Processing Language (SPL). True schema on read. Interactive. Ease of setup & use.
    20. Challenges
    21. GOALS: Support SPL. Naturally suitable for MapReduce. Reduces adoption time. Challenge: Hadoop “apps” written in Java & all SPL code is in C++. Porting SPL to Java would be a daunting task (120+ commands). Reuse the C++ code somehow: JNI (not easy nor stable), or use “splunkd” (the binary) to process the data.
    22. GOALS: Schema on read. Apply Splunk’s index-time schema at search time (event breaking, time stamping, etc.). Anything else would be brittle & a maintenance nightmare. Extremely flexible. Runtime overhead (manpower >>$ computation). Challenge: Hadoop “apps” written in Java & all index-time schema logic is implemented in C++.
    23. GOALS: Interactive. No one likes to stare at a blank screen! Challenge: Hadoop is designed for batch-like jobs.
    24. Virtual Indexes
    25. Hunk Uses Virtual Indexes • Enables seamless use of the Splunk stack on data in Hadoop • Automatically handles MapReduce • Technology is patent pending
    26. Examples of Virtual Indexes. Diagram: a Hunk Search Head queries virtual indexes backed by External Systems 1, 2 and 3: index = syslog (/home/syslog/…), index = apache_logs, index = sensor_data, index = twitter.
    27. Deployment Overview
    28. Data processing
    29. Mixed-mode Search. Streaming: transfers first several blocks from HDFS to the Hunk Search Head for immediate processing. Reporting: pushes computation to the DataNodes and TaskTrackers for the complete search. • Hunk starts the streaming and reporting modes concurrently • Streaming results show until the reporting results come in • Allows users to search interactively by pausing and refining queries
    30. Data Processing Pipeline. Raw data (HDFS) → custom processing (you can plug in data preprocessors, e.g. Apache Avro or format readers) → stdin → indexing pipeline (event breaking, timestamping) → search pipeline (event typing, lookups, tagging, search processors). Stage labels: MapReduce/Java and splunkd/C++.
    31. Demo
    32. Thank You
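
The mixed-mode search on slide 29 can be illustrated with a small Python simulation. This is a sketch under stated assumptions, not Hunk's code: `count_levels` stands in for a search, a thread pool stands in for the two execution modes (in Hunk the reporting side is a MapReduce job on the cluster), and the function names and sample events are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_levels(blocks):
    """The 'search': count events by their first token across the given blocks."""
    counts = Counter()
    for block in blocks:
        for event in block:
            counts[event.split()[0]] += 1
    return counts


def mixed_mode_search(blocks, preview_blocks=2):
    """Yield a streaming preview first, then the complete reporting result."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Both modes are started concurrently.
        reporting = pool.submit(count_levels, blocks)                   # full data set
        streaming = pool.submit(count_levels, blocks[:preview_blocks])  # first few blocks only
        # The preview is shown until the complete result replaces it.
        yield "streaming", streaming.result()
        yield "reporting", reporting.result()


# Illustrative data only: three "HDFS blocks" of log events.
blocks = [["ERROR a", "INFO b"], ["ERROR c"], ["INFO d", "INFO e"]]
results = list(mixed_mode_search(blocks))
```

A caller that stops consuming after the first yielded item gets the "pause and refine" behaviour described on the slide: it has a partial answer to look at without waiting for the complete job.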