Hunk: Splunk Analytics for Hadoop


  • This session is designed for audiences who have seen an introduction to Hunk and would like a more comprehensive understanding of how Hunk works. I’ll cover each of these eight topics.
  • Hunk is a new product for organizations deploying Hadoop and is priced and packaged separately from Splunk Enterprise. A Splunk Enterprise license is not required to run Hunk. Hunk is the integrated analytics platform for data in Hadoop. It supports business use cases that unlock the value of data stored in Hadoop: data analytics to launch and optimize products and services; synthesis of data from all customer touch points; comprehensive security analytics for modern threats; and easier app development than in raw Hadoop, with tools and frameworks that developers already know. It is easy to use for any business or IT user, versus the scarce skills needed to manually write MapReduce jobs or define Hive data schemas. It is a fully integrated analytics product: explore, analyze, visualize, create dashboards, create data models, pivot, and share. There is no fixed schema, so raw and unstructured data can be searched; results preview while MapReduce jobs start; and app development is easier than in raw Hadoop.
  • Hunk is essentially the Splunk Enterprise technology stack sitting on top of Hadoop, with some limitations (no real time, and several functions in the Splunk processing language that do not apply to virtual indexes). Hunk is a high-performance, scalable software server written in C/C++ and Python. It indexes and searches logs and other big data stored in the Hadoop Distributed File System, called HDFS, or MapR's proprietary variant of HDFS. Hunk works with machine data generated by any application, server or device. The Splunk Developer API is accessible via REST, SOAP or the command line. After downloading Hunk, installing it on your choice of 64-bit Linux operating system, and starting it, you'll find two Hunk server processes running on your host: splunkd and splunkweb. splunkweb is a Python-based application server providing the Splunk Web user interface. It allows users to search and navigate machine data virtually indexed by Hunk servers and to manage your Hunk deployment through the browser interface. splunkd is a distributed C/C++ server that creates a virtual index from machine data and handles search requests. An ODBC driver (in beta as of September 2013) will provide integration with third-party data visualization software.
  • Connect Hunk to your Hadoop cluster as an external results provider. The external results provider is a search-time helper process responsible for accessing the external system (Hadoop), translating or interpreting the search request, and pushing as much of the computation as possible to the external system. Connect to the Hadoop Distributed File System (HDFS) and MapReduce from Apache downloads or from your choice of Hadoop distribution, including Cloudera, Hortonworks, MapR or Pivotal. Hunk only requires basic Hadoop: HDFS and MapReduce. You can continue to use additional projects and subprojects with your Hadoop cluster, but all Hunk requires is MapReduce and HDFS (or MapR's proprietary variant of HDFS).
  • Connect Hunk to multiple Hadoop clusters.
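The connection is configured through provider and virtual index stanzas on the Hunk search head. The fragment below is a sketch of the general shape of Hunk's indexes.conf configuration; the hostnames, paths, and index names are illustrative assumptions and should be checked against the Hunk documentation for your version.

```ini
# indexes.conf on the Hunk search head (values are examples, not defaults)
[provider:hadoop-cluster1]
vix.family = hadoop
vix.env.JAVA_HOME = /usr/java/latest
vix.env.HADOOP_HOME = /usr/lib/hadoop
vix.fs.default.name = hdfs://namenode1.example.com:8020
vix.splunk.home.hdfs = /user/hunk/workdir

# A virtual index backed by that provider
[weblogs]
vix.provider = hadoop-cluster1
vix.input.1.path = /data/weblogs/...
```

A second Hadoop cluster would get its own [provider:...] stanza, with its virtual indexes pointing at that provider.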
  • There are significant challenges with these approaches to asking and answering questions of data in Hadoop. Not shown is a less common option, spreadsheet-like interfaces, which raise their own problems: they are batch job builders with no interactive engine, and their interfaces are merely "spreadsheet-like", not actual Microsoft Excel or Apple Numbers.
  • Hunk (Splunk Analytics for Hadoop) is a full-featured, integrated product that delivers interactive data exploration, analysis and visualization for Hadoop. Full-featured, integrated product: delivers interactive data exploration, analysis and visualization for Hadoop. Insights for everyone: empowers broader user groups to derive actionable insights from raw data in Hadoop. Works with what you have today: works with leading Hadoop distributions to maximize enterprise technology investments.
  • Hunk does not replace your Hadoop distribution: it coexists with your Apache HDFS and MapReduce downloads or your Hortonworks, Cloudera, or MapR distribution. Hunk does not replace or require Splunk Enterprise; it is a separate product designed for new use cases involving data in Hadoop. Hunk offers iterative search, but not real-time or needle-in-the-haystack searches; that's Splunk Enterprise. Hunk has no data ingest management; for that, use tools from Apache Hadoop or your Hadoop distribution vendor, or Hadoop connectors from enterprise software or business intelligence vendors. Note: needle in a haystack means one-in-a-million searches.
  • Splunk Enterprise is a standalone solution and the industry-leading platform for machine data with all of Splunk's core use cases. For customers who are storing historical data in Hadoop, we offer Hunk to run analytics on data stored natively in Hadoop. Hunk targets new use cases, including data analytics for new product and service launches; synthesis of data from all customer touch points; comprehensive security analytics for modern threats; and easier big data app development than in raw Hadoop. Furthermore, you can use Splunk Hadoop Connect to send data between Splunk Enterprise and Hadoop. Many accounts may decide to buy both: Splunk Enterprise for real-time monitoring and real-time search, together with Hunk for exploratory analytics of historical data stored in Hadoop. With this combination, you can run searches across native indexes in Splunk Enterprise and Hunk virtual indexes for data in Hadoop.
  • A rich developer platform and tool chain that includes a robust API and software developer kits in Java, JavaScript, Python, PHP, C# and Ruby to enable developer teams to rapidly build powerful big data applications. DEV.SPLUNK.COM activity highlights a strong developer community.
  • What you'll need to get started:
– Data in Hadoop to analyze
– Hadoop client libraries, from your Hadoop distribution vendor or from Apache
– Hadoop access rights: Hunk requires permission to read from HDFS and run MapReduce jobs
– Java 1.6+
– HDFS scratch space: the amount depends on the size of the interim results; between 10 and 20 GB is common
– DataNode local temp disk space: at most 5 GB per DataNode
  • On the first search, MapReduce auto-populates the Splunk binaries. The orchestration process begins when Hunk copies the Hunk binary .tgz file to HDFS. Hunk supports both the MapReduce JobTracker and the YARN MapReduce Resource Manager. Each TaskTracker (called ApplicationContainer in YARN) fetches the binary. The binary files expand in the specified location on each TaskTracker; the default location is configurable. TaskTrackers not involved in the first search receive the Hunk binary in a subsequent search that involves them. This process is one example of why Hunk needs some scratch space in HDFS and in the local file system (TaskTrackers/DataNodes). Background on Hadoop: typically a Hadoop cluster has a single master and multiple worker nodes. The master node (also referred to as the NameNode) coordinates reads and writes to the worker nodes (also referred to as DataNodes). HDFS reliability is achieved by replicating the data across multiple machines; by default the replication value is 3 and the chunk size is 64 MB. The JobTracker dispatches tasks to worker nodes (TaskTrackers) in the cluster. Priority is given to nodes that host the data on which the task will operate; if the task cannot run on such a node, next priority goes to neighboring nodes, in order to minimize network traffic. Upon job completion, each worker node writes its own results locally, and HDFS ensures replication across the cluster. HDFS = NameNode + DataNodes; MapReduce engine = JobTracker + TaskTrackers.
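Given the defaults cited above (64 MB chunks, replication factor 3), you can estimate a file's raw HDFS footprint. This small sketch only illustrates the arithmetic; it is not Hunk or Hadoop code:

```java
public class HdfsFootprint {
    // Estimate block count for a file at a given block size
    // (64 MB is the default mentioned above).
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64 MB default chunk size
        long fileSize = 1024L * 1024 * 1024;  // a 1 GB file
        long blocks = blockCount(fileSize, blockSize);
        long rawBytes = fileSize * 3;         // default replication factor 3
        System.out.println(blocks + " blocks, " + rawBytes + " raw bytes");
        // A 1 GB file occupies 16 blocks and roughly 3 GB of raw cluster storage.
    }
}
```

This is also why the HDFS scratch space and DataNode temp space listed in the prerequisites matter: interim results are stored and replicated like any other HDFS data.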
  • Search execution: the Hunk search head takes the list of contents of the directories in the virtual index. The search head filters directories and files based on the search and time range (partition pruning). The search process reads cluster metadata from the NameNode and JobTracker (MapReduce Resource Manager in YARN), computes File Splits, and constructs and submits the MapReduce jobs. Hunk streams a few File Splits from HDFS and processes them in the search head to provide quick previews. The search head consumes and merges the MapReduce results (providing incremental previews) while the MapReduce jobs kick off. The data nodes run a copy of splunkd to process the jobs and write the results to a working directory in HDFS. Final results are stored in the Hunk search head. Hunk utilizes the Splunk Search Processing Language, the industry-leading method to enable interactive data exploration across large, diverse data sets. There is no requirement to "understand" data up front. Customers of Splunk Enterprise can reuse their Search Processing Language knowledge and skills on data stored in Hadoop. Any command whose output depends on the event input order would yield different results: Splunk Enterprise guarantees that events are delivered in descending time order, while Hunk does not. This is the reason why transaction and localize do not work. We can see the results from the intermediate Hadoop Map jobs getting streamed into the Splunk UI even before all the Map jobs are finished; once all the Hadoop Maps are done processing, Splunk displays the full results. In essence, Splunk acts as the Hadoop Reduce phase, and there is no need to use Hadoop for that phase.
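The partition pruning step can be sketched as follows. The directory layout and date format are assumptions for illustration only, not Hunk's actual implementation: the idea is simply that paths whose embedded date falls outside the search's time range never reach MapReduce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PartitionPruning {
    // Assumes a layout like /data/weblogs/2013/09/14/... (illustrative).
    static final Pattern DATE = Pattern.compile("/(\\d{4})/(\\d{2})/(\\d{2})/");

    // Keep only paths whose date falls inside [from, to] (ISO dates,
    // so lexicographic comparison matches chronological order).
    static List<String> prune(List<String> paths, String from, String to) {
        List<String> kept = new ArrayList<>();
        for (String p : paths) {
            Matcher m = DATE.matcher(p);
            if (!m.find()) { kept.add(p); continue; } // undated: cannot prune
            String day = m.group(1) + "-" + m.group(2) + "-" + m.group(3);
            if (day.compareTo(from) >= 0 && day.compareTo(to) <= 0) kept.add(p);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> paths = List.of(
            "/data/weblogs/2013/09/13/part-0",
            "/data/weblogs/2013/09/14/part-0",
            "/data/weblogs/2013/10/01/part-0");
        System.out.println(prune(paths, "2013-09-14", "2013-09-30"));
    }
}
```

The payoff is the same as in Hunk: a search over one day of a year-long dataset only schedules MapReduce work for that day's directories.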
  • Before data is processed by Hunk, you can plug in your own data preprocessor. Preprocessors have to be written in Java and can transform the data before Hunk gets a chance to. Data preprocessors vary in complexity from simple translators (say, Avro to JSON) to full image, video or document processing. Hunk itself translates Avro to JSON; these translations happen on the fly and are not persisted.
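As a toy illustration of the "simple translator" end of that spectrum, here is a CSV-to-JSON transform. This is not Hunk's actual preprocessor plug-in interface; the class and method names are hypothetical, and a real preprocessor would implement the Java API that Hunk defines.

```java
public class CsvToJsonPreprocessor {
    // Pair each comma-separated value with a caller-supplied header and
    // emit one JSON object per record. Hypothetical sketch only.
    static String toJson(String[] headers, String csvLine) {
        String[] values = csvLine.split(",", -1);
        StringBuilder sb = new StringBuilder("{");
        for (int i = 0; i < headers.length && i < values.length; i++) {
            if (i > 0) sb.append(",");
            sb.append("\"").append(headers[i]).append("\":\"")
              .append(values[i].trim()).append("\"");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        String[] headers = {"host", "status", "bytes"};
        System.out.println(toJson(headers, "web01, 200, 5120"));
        // {"host":"web01","status":"200","bytes":"5120"}
    }
}
```

As with Hunk's built-in Avro translation, output like this would be produced on the fly at search time, not persisted.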
  • Hunk applies structure at search time. It is designed for data exploration across large datasets: preview data and iterate quickly. There is no requirement to understand the data upfront, no limit to the number of results returned by Hadoop or the number of searches, and no brittle schema to maintain or update. Find patterns and trends across disparate data sets in a "grab bag" Hadoop cluster. Use the Search Processing Language or create data models and pivot. Unlike Splunk Enterprise, Hunk applies schema for all fields, including transactions and localizations, at search time.
  • MapReduce considerations: stats, chart, timechart, top and similar commands work well in a distributed environment; they MapReduce well. Time- and order-dependent commands don't work well in a distributed environment; they don't MapReduce well. For large summary indexes, consider a dedicated "summarizer" instance with plenty of CPU to execute search jobs: summary jobs won't interfere with user searches, and the instance aggregates and stores the results away from the indexers. Report acceleration is not supported by Hunk 6.0 but may be supported in a future release.
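The reason stats-style commands "MapReduce well" is that their partial results merge associatively: each mapper can aggregate its own split, and the reduce step just combines counts, independent of event order or split boundaries. A minimal sketch of that property (not Hunk code):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeableStats {
    // Per-split partial aggregation, analogous to "stats count by field".
    static Map<String, Long> countBy(List<String> events) {
        Map<String, Long> counts = new HashMap<>();
        for (String e : events) counts.merge(e, 1L, Long::sum);
        return counts;
    }

    // Reduce step: merging partials gives the same answer as one pass
    // over all events, regardless of how the data was split or ordered.
    static Map<String, Long> merge(Map<String, Long> a, Map<String, Long> b) {
        Map<String, Long> out = new HashMap<>(a);
        b.forEach((k, v) -> out.merge(k, v, Long::sum));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> split1 = countBy(List.of("GET", "POST", "GET"));
        Map<String, Long> split2 = countBy(List.of("GET", "PUT"));
        System.out.println(merge(split1, split2).get("GET")); // prints 3
    }
}
```

Commands like transaction have no such merge step: their output depends on seeing events in order, which a distributed, split-parallel execution cannot guarantee.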
  • Hunk starts the streaming and reporting modes concurrently. Streaming results show until the reporting results come in. This allows users to search interactively by pausing and refining queries. It is a major, unique advantage of Hunk over alternative approaches such as Hive or SQL on Hadoop, which require a fixed schema in an effort to speed up searches, while Hunk retains the combination of schema on the fly and results preview.
  • Pause or stop jobs in progress and revise queries interactively. We're mindful of the resources we use in Hadoop. Pausing in Hunk happens in the search head; the Hadoop jobs keep running until the TCP buffer runs out. If you abandon a search for more than 30 seconds, Hunk kills the search.
  • There's no one path to explore data. Preview results and refine your queries. Hunk applies normalization as it's needed, for faster implementation and flexibility. Hunk supports the easy-to-use Splunk Search Processing Language along with data models and pivot to provide multiple views into the same data. Find insights following a flexible, iterative workflow; go back and forth across components at the speed of thought. The components of the data workflow are:
– Explore: search data from one place with the powerful Search Processing Language (SPL), designed for data exploration across large datasets. Preview data and iterate quickly, with no fixed schema and no requirement to "understand" data upfront. Easy-to-use interactive analytics.
– Analyze: deep analysis, pattern detection and anomaly finding, with over 100 statistical commands.
– Model: make unstructured data more valuable. A data model describes how the underlying machine data is represented and accessed, defines hierarchical relationships, and enables a single authoritative view of the underlying raw data.
– Pivot: powerful analytics anyone can use. A drag-and-drop interface to easily build complex queries and reports; click to choose chart types; reports dynamically update.
– Visualize: interactive reporting and visualization of data. An interactive reports view to rapidly build advanced graphs and charts, generate visualizations on the fly, and drill down to the raw data in Hadoop; an ODBC connector to third-party data visualization software.
– Share: build, personalize and share custom dashboards and PDFs; combine multiple charts, views, reports and external data; set role and group access security for web dashboards; view and edit on any desktop, tablet or mobile device.
And do all of this from one integrated platform for data in Hadoop.
  • Hunk: Splunk Analytics for Hadoop

    1. Copyright © 2013 Splunk Inc. Hunk: Technical Overview. Juergen Magiera, Sales Engineer
    2. Legal Notices. During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release. Splunk, Splunk>, Splunk Storm, Listen to Your Data, SPL and The Engine for Machine Data are trademarks and registered trademarks of Splunk Inc. in the United States and other countries. All other brand names, product names, or trademarks belong to their respective owners. ©2014 Splunk Inc. All rights reserved.
    3. Agenda: 1. What is Hunk? 2. Powerful Developer Platform 3. Preparation 4. Connect Hunk to HDFS and MapReduce 5. Create Virtual Indexes 6. MapReduce as the Orchestration Framework 7. Search Data in Hadoop 8. Flexible, Iterative Workflow for Business Users
    4. Explore, Analyze, Visualize Data in Hadoop: no fixed schema to search unstructured data; preview results while MapReduce jobs start; easier app development than in raw Hadoop. Unlock the business value of data in Hadoop: fast to learn instead of scarce skills; integrated (explore, analyze and visualize).
    5. Hunk Server
    6. Connect to HDFS and MapReduce: connect to Apache HDFS and MapReduce or your choice of Hadoop distribution. (Diagram: Hadoop Cluster 1)
    7. Unmet Needs for Hadoop Analytics.
    Option 1, "do it yourself" Hadoop / Pig. Problems: scarce skill sets to hire; need to know MapReduce; wait for slow jobs to finish; no results preview; no built-in visualization; no granular authentication; slow time to value.
    Option 2, Hive or SQL on Hadoop. Problems: pre-defined fixed schema; need knowledge of the data; miss data that "doesn't fit"; no results preview; no built-in visualization; scarce skill sets to hire; slow time to value.
    Option 3, extract to in-memory store. Problems: data too big to move; limited drill down to raw data; no results preview; another data mart; expensive hardware.
    8. Hadoop in Real Life vs. Using Hunk: a word-count MapReduce job written for raw Hadoop, versus the equivalent Hunk search.

    Java MapReduce fragment (truncated on the slide; the mapper continues beyond this excerpt):

```java
public class WordCount extends Configured implements Tool {
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    static enum Counters { INPUT_WORDS }

    private Text word = new Text();
    private boolean caseSensitive = true;
    private Set<String> patternsToSkip = new HashSet<String>();
    private long numRecords = 0;
    private String inputFile;

    public void configure(JobConf job) {
      caseSensitive = job.getBoolean("wordcount.case.sensitive", true);
      inputFile = job.get("map.input.file");
      if (job.getBoolean("wordcount.skip.patterns", false)) {
        Path[] patternsFiles = new Path[0];
        try {
          patternsFiles = DistributedCache.getLocalCacheFiles(job);
        } catch (IOException ioe) {
          System.err.println("Caught exception while getting cached files: "
              + StringUtils.stringifyException(ioe));
        }
        for (Path patternsFile : patternsFiles) {
          parseSkipFile(patternsFile);
        }
      }
    }
    // ... (mapper continues)
```

    The same job expressed in Hunk:

```
index=Hadoop | wc usestopwords=f | stats sum(count) by word
```
    9. Integrated Analytics Platform for Hadoop Data: a full-featured, integrated product; insights for everyone; works with what you have today. Explore, analyze, visualize, build dashboards and share, on top of Hadoop (MapReduce & HDFS).
    10. What Hunk Does Not Do: 1. Hunk does not replace your Hadoop distribution. 2. Hunk does not replace or require Splunk Enterprise. 3. Interactive but not real time. 4. No data ingest management (that's Flume or Sqoop). 5. No Hadoop operations management.
    11. Product Portfolio: Splunk Enterprise (real-time indexing and real-time search, Splunk Apps, a vibrant and passionate developer community) covers IT Ops, Security & Compliance, Web Intelligence, App Dev & App Mgmt, and Business Analytics, with Splunk Hadoop Connect and DB Connect as bridges. Hunk adds ad hoc analytics of historical data in Hadoop and supports developers building big data apps on top of Hadoop, with use cases such as a 360° customer view, complete security analytics, and product and service analytics.
    12. Powerful Developer Platform with Familiar Tools: an API and SDKs for JavaScript, Java, Python, PHP, C# and Ruby. Add new UI components and integrate into existing systems with known languages and frameworks.
    13. Prerequisites: data in Hadoop to analyze; Hadoop access rights; Hadoop client libraries; Java 1.6+; HDFS scratch space; DataNode local temp disk space.
    14. MapReduce as the Orchestration Framework: 1. The Hunk search head copies the splunkd binary (.tgz) to HDFS. 2. Each TaskTracker copies the .tgz. 3. The binary expands in the specified location on each TaskTracker. 4. TaskTrackers not involved in the first search receive the binary in subsequent searches.
    15. Data Processing Pipeline: raw data (HDFS) → custom processing (MapReduce/Java; you can plug in data preprocessors, e.g. Apache Avro or format readers) → stdin → indexing pipeline (event breaking, timestamping) → search pipeline (event typing, lookups, tagging, search processors), running in splunkd/C++.
    16. Hunk Applies Schema on the Fly: Hunk applies schema for all fields, including transactions, at search time. Structure is applied at search time; there is no brittle schema to work around; automatically find patterns and trends.
    17. Mixed-mode Search: Streaming transfers the first several blocks from HDFS to the Hunk search head for immediate processing; Reporting pushes computation to the DataNodes and TaskTrackers for the complete search. Hunk starts the streaming and reporting modes concurrently; streaming results show until the reporting results come in, allowing users to search interactively by pausing and refining queries.
    18. Flexible, Iterative Workflow for Business Users: Explore → Analyze → Model → Pivot → Visualize → Share. Interactive analytics: preview results; normalization as it's needed; faster implementation and flexibility; an easy search language plus data models and pivot; multiple views into the same data.
    19. Thank You