Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs


Most organizations still rely on batch and offline processing of data streams to gain meaningful analysis and insight into their business. However, in our instant-gratification world, real-time computation and analysis of streaming data is crucial to gaining insight into patterns and threats. A trend is emerging toward real-time, instant analysis of live data streams, promoting the value of logs and a move toward functional programming.

This shift in technology is not about what and how to store the data, but what we can do with it to see emerging patterns and trends across multiple resources, applications, services and environments. Log data represents a wealth of information, yet is often sporadic, unstructured, scattered across the enterprise and difficult to track.

These slides provide insights into some of the most helpful Big Data tools used by the largest social media and data-centric organizations for competitive trends, instant analysis and feedback from large-volume data streams. We show how using the Big Data tools Storm and Elasticsearch, together with an elastic UI, can turn application logs into real-time analytical views.

You will also learn how Big Data:

Contains data that is elastic, minimally structured, flexible and scalable

Helps process live streams into meaningful data

Promotes a move toward functional programming

Affects the enterprise data architecture

Works with real-time CEP tools like Storm for functional programming

Published in: Technology

  1. 1. Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs Eric Roch, Principal & Ben Hahn, Senior Technical Architect
  2. 2. Perficient is a leading information technology consulting firm serving clients throughout North America. We help clients implement business-driven technology solutions that integrate business processes, improve worker productivity, increase customer loyalty and create a more agile enterprise to better respond to new business opportunities. About Perficient
  3. 3. • Founded in 1997 • Public, NASDAQ: PRFT • 2013 revenue $373 million • Major market locations throughout North America • Atlanta, Boston, Charlotte, Chicago, Cincinnati, Columbus, Dallas, Denver, Detroit, Fairfax, Houston, Indianapolis, Los Angeles, Minneapolis, New Orleans, New York City, Northern California, Philadelphia, Southern California, St. Louis, Toronto and Washington, D.C. • Global delivery centers in China, Europe and India • >2,100 colleagues • Dedicated solution practices • ~90% repeat business rate • Alliance partnerships with major technology vendors • Multiple vendor/industry technology and growth awards Perficient Profile
  4. 4. BUSINESS SOLUTIONS Business Intelligence Business Process Management Customer Experience and CRM Enterprise Performance Management Enterprise Resource Planning Experience Design (XD) Management Consulting TECHNOLOGY SOLUTIONS Business Integration/SOA Cloud Services Commerce Content Management Custom Application Development Education Information Management Mobile Platforms Platform Integration Portal & Social Our Solutions Expertise
  5. 5. Eric Roch Principal Eric leads Perficient's national connected solutions practice • Includes focus on SOA/integration, cloud, mobile and Big Data • Author & industry speaker • 25 years+ of experience in various aspects of information technology including: • Executive-level management • Enterprise architecture • Application development Speakers Ben Hahn Sr. Technical Architect Ben Hahn is a Sr. Technical Architect • Includes focus on transactions, logging & exceptions processing • Author & speaker • 20+ years of experience in various aspects of information technology including: • Software solutions • Enterprise infrastructure • Product management • Open Source software community contributor
  6. 6. • Often defined as data that exceeds the capacities of conventional database systems because it’s too large and moves too fast for traditional database systems to handle in an architecturally cohesive way. The three V’s of Big Data are: • Volume • Most companies have 100 TB of data • Facebook ingests 500 TB in a single day • 40 ZettaBytes (that’s 43 trillion GB) of data by 2020 • Velocity • NYSE captures 4-5 TB of data in a single day • A Boeing 737 generates 243 TB in a single flight • The Google self-driving car generates 750MB of data per second! • Variety • Twitter, Clickstreams, Audio, Video • GPS, Sensor data, Facebook content • Infrastructure and application logs What is Big Data?
  7. 7. POLL QUESTION: What is your current adoption level for big data? • Evaluation • Prototype • Production
  8. 8. But Not Everyone is Google! Where’s the Big Data coming from?
  9. 9. POLL QUESTION Have you used open source software for big data solutions? • Yes • No
  10. 10. Machine Data definitely has the three V’s of Big Data Machine Data is Big Data
  11. 11. What Can We Gain From Machine Data? Valuable information can be mined from machine data, including: • Transaction monitoring • Error detection • Behavior trends • Audit logging • Infrastructure states • Anomaly detection • Geospatial analysis • Network analysis
  12. 12. Log Analysis vs. Business Analytics • Ingest - Versus ETL • Big Data - Bidirectional integration with Hadoop • Query language - MapReduce function on unstructured data • Drill anywhere - Investigate on all the data versus a predefined schema or cube • Information discovery - Discover relationships based on patterns in the data • Ad-hoc versus dimensional - Log analysis is not based on a predefined structure derived from a point-in-time set of requirements • Explicit logging - Versus implicit correlation
  13. 13. Polling Question: Do you mine machine data for business insights? • Yes • No
  14. 14. Innovations From Cloud and OSS • Hadoop and MapReduce - Derived from Google's MapReduce and Google File System • Storm – Distributed event processor open sourced by Twitter • Presto - SQL query engine open sourced by Facebook, built to work with petabyte-sized data warehouses • Google BigQuery - Run SQL-like queries against terabytes of data in seconds • Amazon DynamoDB - NoSQL database service to store and retrieve any amount of data, and serve any level of request traffic • Elasticsearch – Distributed full-text search engine from the OSS community
  15. 15. POLLING QUESTION Do you plan to use cloud-based solutions for big data? • Yes • No
  16. 16. • 2004 - Google published a paper called MapReduce: Simplified Data Processing on Large Clusters, characterized by: • Map and shuffle key-value pairs, then aggregate/reduce these intermediate pairs • Origins in the map and reduce primitives of functional languages • Massive parallelism and elasticity via commodity hardware • Fault tolerance via master-worker nodes Big Data Processing: MapReduce 2
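The map, shuffle and reduce steps on this slide can be sketched in a few lines of Python. This is a single-process illustration only, not Hadoop or Google's implementation; the sample log lines, their format, and the function names are assumptions made for the example.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample log lines; the "date time LEVEL message" format is assumed.
LOG_LINES = [
    "2014-04-01 10:00:01 ERROR payment timeout",
    "2014-04-01 10:00:02 INFO order created",
    "2014-04-01 10:00:03 ERROR payment timeout",
    "2014-04-01 10:00:04 WARN slow response",
]


def map_phase(line):
    """Map: emit one (key, value) pair per line, here (log level, 1)."""
    level = line.split()[2]
    return (level, 1)


def shuffle_phase(pairs):
    """Shuffle: group the intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate each key's values into a single result."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


def count_levels(lines):
    """Run the three phases in sequence over an iterable of lines."""
    return reduce_phase(shuffle_phase(map(map_phase, lines)))
```

Calling `count_levels(LOG_LINES)` gives a count per log level. In a real cluster the map and reduce calls would run on different worker nodes, with the shuffle moving data between them.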
  17. 17. • Based on Lambda (λ) calculus • ALL computational functions and data can be expressed as a series of functions and predicates of functions • Declarative language rather than imperative • First-class functions – Functions can be passed as arguments and returned as results, just like values. This also allows currying and partial functions. • Call by name – Function expressions are not evaluated until they are actually used. • Recursion – Functions can be defined in terms of themselves, potentially in an endless loop. • Immutable state and values – Pure functional programming does not use mutable variables but rather immutable values as they appear at any moment in time. This has big effects on scalability and concurrency. • Referential Transparency - Function calls can be replaced by their values because functions have no side effects. • Pattern matching – Data type matching as well as data structure composition and deep object type matching • Examples: Erlang, Haskell, Lisp, Clojure, Scala What are functional languages? And MapReduce is Better with Functional Languages 2
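A few of the concepts on this slide can be approximated in Python, which is not a pure functional language. The sketch below is illustrative only: all names are made up for the example, and a lazy generator stands in for call by name rather than being the same thing.

```python
from functools import partial
from itertools import count, islice


def apply_twice(f, x):
    """Higher-order function: takes another function as an argument."""
    return f(f(x))


def add(a, b):
    return a + b


# Partial application: fix the first argument to get a new one-argument function.
add_five = partial(add, 5)


def depth(tree):
    """Recursion: nesting depth of a list of lists (a non-list has depth 0)."""
    if not isinstance(tree, list):
        return 0
    return 1 + max((depth(child) for child in tree), default=0)


# Lazy evaluation: squares are computed only when a consumer asks for them,
# so the conceptually infinite sequence is safe to define.
squares = (n * n for n in count(1))
first_three = list(islice(squares, 3))

# Immutability: tuples cannot be changed in place; "updating" builds a new value.
event = ("login", "alice")
updated = event + ("ok",)
```

Here `apply_twice(add_five, 0)` applies the partially applied function two times, and `event` is left untouched when `updated` is created.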
  18. 18. Imperative Model: Pascal, C, Basic, etc. Evolution (or Devolution?) of Databases 2
  19. 19. Object Oriented Programming Model: Java, C++, C#. Evolution (or Devolution?) of Databases 2
  20. 20. Functional Programming Model: Scala, Clojure, F# Evolution (or Devolution?) of Databases 2 • Because commodity hardware in the cloud is highly elastic, the resources needed to query and run transactions can be scaled in response to the data volumes at the store level. • Data is stored using the functional programming concept of immutability by only appending data as point-in-time values. • MapReduce functions can be balanced and distributed across machines as nodes fail or new nodes are added. • First-class functions and call by name allow functions and lambda expressions to be passed into MapReduce calls as arguments, allowing ad-hoc functionality to be added. • Pattern matching allows very complex pattern matches on complex structures like XML. • Transactions use functional expressions like compare-and-swap operations to ensure ACIDity. • SQL or query expressions can be reduced to MapReduce functions or lambda expressions and/or patterns and distributed in parallel across the nodes. • Using recursion, complex structures like XML can be mapped and reduced from a single expression.
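The append-only storage, compare-and-swap and lambda-as-query ideas on this slide can be shown with a toy in-memory store. This is a minimal single-process sketch under stated assumptions: `AppendOnlyStore`, `Version` and their methods are hypothetical names, and no real database works exactly this way.

```python
from typing import Any, NamedTuple


class Version(NamedTuple):
    """An immutable point-in-time value."""
    number: int
    value: Any


class AppendOnlyStore:
    def __init__(self):
        # key -> list of Version; entries are only ever appended, never edited.
        self._history = {}

    def latest(self, key):
        versions = self._history.get(key)
        return versions[-1] if versions else None

    def compare_and_swap(self, key, expected_version, new_value):
        """Append new_value only if the caller saw the current version.

        A key that was never written is treated as version 0.
        """
        current = self.latest(key)
        current_number = current.number if current else 0
        if current_number != expected_version:
            return False  # another writer got there first; caller must retry
        self._history.setdefault(key, []).append(
            Version(current_number + 1, new_value))
        return True

    def history(self, key):
        """All point-in-time values for a key, oldest first."""
        return tuple(self._history.get(key, ()))

    def query(self, predicate):
        """Ad-hoc query: the predicate is passed in as a function."""
        return {key: versions[-1].value
                for key, versions in self._history.items()
                if predicate(versions[-1].value)}
```

A stale writer that passes an outdated version number gets `False` back and nothing is overwritten, while `history` still returns every earlier value.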
  21. 21. MapReduce Machine Data: What Do We Need? • A dynamic process for parsing and mapping unstructured data to structured data in real time • Support for a wide range of data formats (text, XML, JSON, CSV, EDI, etc.) • Intelligent pattern matching capabilities • Ability to correlate meaningful transactional data and metrics from disparate data (reducing) • Machine data is static and immutable, so append-only fast writes with eventual consistency are ideal • Fast filter, search and query capabilities to display results
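The first requirements above (parse mixed formats into structured records, then correlate them) can be sketched as follows. The text log layout, the `txn` field and the function names are assumptions made for illustration; real logs would need patterns of their own.

```python
import json
import re
from collections import defaultdict

# Assumed plain-text layout: "date time LEVEL txn=<id> message".
TEXT_PATTERN = re.compile(
    r"(?P<timestamp>\S+ \S+) (?P<level>[A-Z]+) txn=(?P<txn>\S+) (?P<message>.*)"
)


def parse(line):
    """Map one raw line (JSON or plain text) to a dict, or None if unrecognized."""
    line = line.strip()
    if line.startswith("{"):
        try:
            record = json.loads(line)
        except ValueError:
            return None
        return record if isinstance(record, dict) else None
    match = TEXT_PATTERN.match(line)
    return match.groupdict() if match else None


def correlate(lines):
    """Reduce: group messages from different formats by transaction id."""
    by_txn = defaultdict(list)
    for record in filter(None, map(parse, lines)):
        by_txn[record.get("txn")].append(record.get("message"))
    return dict(by_txn)
```

Lines that match neither format are dropped here; a production pipeline would more likely route them to a separate stream for inspection.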
  22. 22. Open Source Big Data Landscape Source: www.bigdata‐
  23. 23. Apache Hadoop: The Elephant in the Room • What about Apache Hadoop? • Apache Hadoop comprises HDFS and Hadoop MapReduce, both based on Google's GFS and MapReduce • Batch-oriented MapReduce jobs through Schedulers and JobTrackers • We require real-time MapReduce processes • We need index, query and search on data in real time with a well-defined interface • We can use it for secondary storage of long-term persistent logs – Lambda Architecture (Batch vs Speed Layer)
  24. 24. Apache Storm: Use Real-time MapReduce for Machine Data Streams • Developed by BackType and acquired by Twitter • Distributed computational framework that allows real-time MapReduce functionality on streams from any data source, using the concept of Spouts and Bolts • Read from any data stream using Spouts (Kafka, JMS, HTTP, etc.) • Transactional and guaranteed message processing • Parallelism and scalability • Fault tolerance (Master-Worker for MapReduce) • MapReduce Topologies • Offers real-time MapReduce jobs (or Bolts) • Other tools: Apache Spark
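The spout and bolt idea can be imitated with Python generators to show how tuples flow through a topology. This is a conceptual sketch only and does not use the Storm API; the function names and the "LEVEL message" line format are assumptions, and real Storm adds distribution, acknowledgement and fault tolerance.

```python
from collections import Counter


def log_spout(lines):
    """Spout: the source of the stream; emits one tuple per raw line."""
    for line in lines:
        yield (line,)


def parse_bolt(stream):
    """Bolt (map step): split 'LEVEL message' lines, dropping malformed ones."""
    for (line,) in stream:
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            yield (parts[0], parts[1])


def rolling_count_bolt(stream):
    """Bolt (reduce step): emit an updated count as each tuple arrives."""
    counts = Counter()
    for level, _message in stream:
        counts[level] += 1
        yield (level, counts[level])


def run_topology(lines):
    """Wire spout -> parse bolt -> count bolt and collect the emitted tuples."""
    return list(rolling_count_bolt(parse_bolt(log_spout(lines))))
```

Unlike a batch MapReduce job, the count bolt emits a result after every tuple, so a dashboard reading its output sees totals change as the log lines arrive.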
  25. 25. Apache Storm: Use Real-time MapReduce for Machine Data Streams MapReduce - Declarative and simplicity of functional languages within Storm
  26. 26. Elasticsearch: Distributed Document Search • Distributed search server engine using Apache Lucene • It's a schema-less document store using JSON as its document format. New fields can be added dynamically. All fields are indexed by default • Uses index shards to distribute queries and searches across clusters. Queries and searches are run in parallel • A cluster can host multiple indexes, which can be queried as a group or singly. Index aliases allow indexes to be added or dropped dynamically • Append-only model using versioning. Writes are very fast depending on the wait model (wait for all shards to be written, or a quorum, or none) • Well-defined RESTful API interface. Very powerful query features • Other tools: Apache Solr
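As an illustration of the RESTful JSON interface, the sketch below builds a search request body in Python without sending it. It follows the query DSL of current Elasticsearch releases, which may differ from the version shown in the 2014 talk. The index name, the field names (`message`, `level`, `host`, `@timestamp`) and the helper name are assumptions, and `level` and `host` are assumed to be mapped as non-analyzed keyword fields.

```python
import json


def build_error_search(term, minutes=15, size=20):
    """Build a search body: full-text match on `message`, restricted to
    recent ERROR entries, with a count of matching entries per host."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Scored full-text match on the log message.
                "must": [{"match": {"message": term}}],
                # Unscored filters: exact level and a relative time window.
                "filter": [
                    {"term": {"level": "ERROR"}},
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
                ],
            }
        },
        # Terms aggregation: bucket the matching documents by host.
        "aggs": {"errors_per_host": {"terms": {"field": "host"}}},
    }


body = build_error_search("timeout")
# Hypothetical index name; the body would be sent as the JSON payload.
request_line = "POST /logs-2014.04.01/_search"
payload = json.dumps(body, indent=2)
```

Because the request is plain JSON over HTTP, the same body can be sent from any HTTP client or a browser-based UI.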
  27. 27. Elasticsearch: Distributed Document Search Elasticsearch: Distributed Query and searches using index shards and replicas
  28. 28. A Really Cool UI to Show This Off • Kibana – Works seamlessly with Elasticsearch, queries Elasticsearch directly from JavaScript • Everything is user driven, with very little coding except some configuration settings in YAML • Very dynamic screen interface • Screen layout, queries, filters, graphs and histograms are saved directly to Elasticsearch • Great design and user interface
  29. 29. Putting It In Action: Demo
  30. 30. As a reminder, please submit your questions in the chat box. We will get to as many as possible! 4/1/2014
  31. 31. Daily unique content about content management, user experience, portals and other enterprise information technology solutions across a variety of industries.
  32. 32. Thank you for your participation today. Please fill out the survey at the close of this session. 4/1/2014