Feb 2013 HUG: A Visual Workbench for Big Data Analytics on Hadoop


Two of the major barriers to effective Hadoop deployments in the enterprise are the complexity and limited applicability of MapReduce. Software developers with Hadoop and MapReduce experience are in short supply, which slows big data initiatives. Getting faster results across a broad range of analytic scenarios requires working at a higher level of abstraction, supported by new programming paradigms and tools. In this talk we present one such approach, based on our experience developing a visual workbench for big data analytics on Hadoop. This approach enables data scientists and analysts to build and execute complex big data workflows for Hadoop with minimal training and without MapReduce knowledge. Libraries of pre-built operators for data preparation and analytics reduce the time and effort required to develop big data projects on Hadoop. The framework is extensible, allowing new operators to be added as needed. The efficiency of the underlying dataflow framework shortens run times, allowing faster iterations of discovery and analysis.

Presenter: Jim Falgout, Chief Technologist, Pervasive Big Data & Analytics

Published in: Technology

  1. A Visual Workbench for Big Data Analytics on Hadoop
     bigdata.pervasive.com • +1.855.356.DATA
  2. Visual Workbench for Hadoop
     • Agenda
       – Pervasive Software
       – History of DataRush
       – Dataflow Concepts
       – Hadoop Integration
       – Demo
       – Performance Testing
  3. Who is Pervasive?
     Global Software Company
     • Tens of thousands of users across the globe
     • Operations in Americas, EMEA, Asia
     • ~260 employees
     Strong Financials
     • $51 million revenue (trailing 12-month)
     • 48 consecutive quarters of profitability
     • $46 million in the bank
     • NASDAQ:PVSW since 1997
     Leader in Data Innovation
     • 25% of top-line revenue re-invested in R&D
     • Software to manage, integrate and analyze data, in the cloud or on-premises, throughout the entire data lifecycle
  4. History of DataRush
     • Initially developed as next-gen data engine for integration
     • Requirements
       – High data throughput
       – Scalable (data, multicore)
       – Based on dataflow concepts
       – Component based architecture
       – Easy to extend
       – Easily fits in visual development environment
     • Embedded in Pervasive products (DataProfiler)
     • Extended with SDK for more general use
  5. Dataflow Concepts
     • Operators (nodes) linked together in a directed graph
     • Data flows along edges
     • Shared nothing architecture
     • Provides pipeline parallelism
     • Supports data parallelism
     • Data scalable
  6. Compilation to Execution Plan
     Compiled to a set of physical graphs
     [Diagram: four parallel copies of the pipeline Reader → FilterRows → DeriveFields → Group(partial) → Repartition → Group(final) → Writer, divided into Phase 1 and Phase 2]
  7. Operator Library
  8. KNIME
     • KNIME
       – Open source analytics workflow tool for the desktop
       – Web site: www.knime.org
       – Supports team collaboration and resource sharing:
         • KNIME Teamspace
         • KNIME Server
         • KNIME Report
     • Integrated with DataRush
       – DataRush dataflow executor integrated as a plug-in extension
       – Includes DataRush operators
       – Product: RushAnalytics for KNIME
  9. DataRush + KNIME
  10. Integration with Hadoop
      • Data Level
        – HDFS access
          • File system abstraction: works with all I/O operators
          • Distributed execution: uses splits much like MR
        – HBase
          • Temporal key-value data store based on column families
          • Fast loading using HFile integration
          • Fast temporal queries
      • Execution
        – Distributed execution uses distributed DataRush engines (not MapReduce)
        – Integrating with YARN for resource sharing
  11. Distributed Execution
      [Architecture diagram. Components: Web Browser, Perf Monitor, Cluster Manager, Node Manager, Client, Executor, HDFS, Local Data, Phase Graph. Edge labels: Initiates Job, Allocates Resources, Spawns]
  12. Distributed I/O
      • Allows downstream operators to be parallelized
      • Parallelization concepts are the same whether the graph is run locally or distributed
      [Diagram: AssignSplits feeding four parallel ReadSplit operators]
  13. Demo
  14. Performance Test
      • DataRush versus PIG
        – Used TPC-H data
        – Generated 1TB data set in HDFS
        – Ran several “queries” coded in DataRush and PIG
        – Run times in seconds (smaller is better)
      Cluster Configuration:
      • 5 worker nodes
      • 2 X Intel E5-2650 (8 core)
      • 64GB RAM
      • 24 X 1TB SATA 7200 rpm
      TPC-H: 1 Terabyte Test: Run times in seconds
      Query   DataRush   PIG
      Q1      401        2036
      Q3      660        1414
      Q6      273        363
      Q9      1198       2356
      Q10     626        1027
      Q18     543        1742
      Q21     892        3528
  15. DataRush/RushAnalytics Solutions
      • Opera Solutions
        – Data science solutions provider
        – Embedding DataRush in engineered solutions
      • Healthcare
        – Claims cleansing & processing
      • Retail
        – Market basket analysis
        – Product category resolution (MDM)
      • Telecom
        – CDR processing & analysis
      “Pervasive DataRush’s efficiency and ability to automatically scale, whether on a single server or a Hadoop cluster, supports our vision for consistent, reusable, scalable Big Data analytics.”
      – Armando Escalante, Chief Operating Officer, Opera Solutions
  16. Summary
      • Easy development of Hadoop workloads
        – Using drag-and-drop desktop GUI
        – Team oriented - Supports collaboration with others
        – No code to write - MapReduce included
      • Scalable Execution
        – Executes within Hadoop cluster
        – Scales from desktop to server to cluster with no workflow changes
        – Scales as cluster does
        – Handles small to very large data sizes
        – TPC-H performance testing shows improved performance over comparable PIG scripts
  17. Questions?
      • My contact info: jfalgout@pervasive.com, @jimfalgout
      • Website: bigdata.pervasive.com
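
The operator chain on slides 5 and 6 (Reader, FilterRows, DeriveFields, Group(partial), Repartition, Group(final), Writer) can be illustrated with a small sketch. This is plain Python, not the DataRush API: the function names only mirror the operator labels on the slide, and the row fields (`region`, `price`, `qty`, `revenue`) and the `qty > 0` filter are hypothetical. It runs the partitions one after another, so it shows the shape of the plan rather than real parallel execution.

```python
import zlib
from collections import defaultdict

# Hypothetical stand-ins for the operators named on the slide.
# None of this is the DataRush API; it only mimics the data flow.

def filter_rows(rows, predicate):
    """FilterRows: pass through only the rows that satisfy the predicate."""
    return (row for row in rows if predicate(row))

def derive_fields(rows):
    """DeriveFields: add a computed column (assumed here: revenue)."""
    for row in rows:
        yield {**row, "revenue": row["price"] * row["qty"]}

def group_partial(rows, key, value):
    """Group(partial): sum values per key within a single partition."""
    acc = defaultdict(float)
    for row in rows:
        acc[row[key]] += row[value]
    return dict(acc)

def repartition(partials, n_out):
    """Repartition: route each key's subtotals to one bucket by key hash."""
    buckets = [defaultdict(list) for _ in range(n_out)]
    for partial in partials:
        for k, subtotal in partial.items():
            target = zlib.crc32(k.encode("utf-8")) % n_out
            buckets[target][k].append(subtotal)
    return buckets

def group_final(bucket):
    """Group(final): combine the subtotals collected for each key."""
    return {k: sum(subtotals) for k, subtotals in bucket.items()}

def run(partitions, n_out=2):
    """Apply the first four operators per partition, then repartition and finish."""
    partials = [
        group_partial(
            derive_fields(filter_rows(part, lambda r: r["qty"] > 0)),
            key="region",
            value="revenue",
        )
        for part in partitions
    ]
    result = {}
    for bucket in repartition(partials, n_out):
        result.update(group_final(bucket))
    return result
```

Because every key lands in exactly one bucket after repartitioning, the final totals do not depend on how many input partitions or output buckets are used, which is the property that lets the same logical graph run on one core or many.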
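
Slides 10 and 12 describe reading HDFS data through splits, with an AssignSplits step feeding several ReadSplit operators. A minimal sketch of that idea follows, assuming fixed-size byte ranges and a simple round-robin assignment. The slides do not say how DataRush actually assigns splits (a real scheduler would likely consider data locality), so `make_splits` and `assign_splits` are hypothetical helpers.

```python
def make_splits(file_size, split_size):
    """Cut a file of file_size bytes into (offset, length) ranges."""
    return [
        (offset, min(split_size, file_size - offset))
        for offset in range(0, file_size, split_size)
    ]

def assign_splits(splits, n_readers):
    """Hand out splits round-robin, one list of splits per reader."""
    assignment = [[] for _ in range(n_readers)]
    for i, split in enumerate(splits):
        assignment[i % n_readers].append(split)
    return assignment
```

Each reader then processes only its own byte ranges, so the operators downstream of a reader can run once per reader without coordinating with the others.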
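
The run times on slide 14 can be turned into per-query ratios with a few lines of Python. The pairing of values to systems is an assumption read off the flattened chart: in each pair the smaller number is taken to be DataRush, which matches the legend order and the summary's claim of improved performance over PIG.

```python
# Run times in seconds from the slide 14 chart: (DataRush, PIG).
# Assumption: the smaller value of each pair belongs to DataRush.
RUN_TIMES = {
    "Q1": (401, 2036),
    "Q3": (660, 1414),
    "Q6": (273, 363),
    "Q9": (1198, 2356),
    "Q10": (626, 1027),
    "Q18": (543, 1742),
    "Q21": (892, 3528),
}

def speedups(run_times):
    """Return PIG time divided by DataRush time for each query."""
    return {query: pig / datarush for query, (datarush, pig) in run_times.items()}

def totals(run_times):
    """Return the summed (DataRush, PIG) run times across all queries."""
    datarush_total = sum(datarush for datarush, _ in run_times.values())
    pig_total = sum(pig for _, pig in run_times.values())
    return datarush_total, pig_total

if __name__ == "__main__":
    for query, ratio in speedups(RUN_TIMES).items():
        print(f"{query}: {ratio:.2f}x")
    print("totals (DataRush, PIG):", totals(RUN_TIMES))
```

Under that pairing, the ratio varies considerably by query, so a single overall figure would hide the difference between the scan-heavy and join-heavy cases.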