BIG Data processing using HPCC Systems
Above and Beyond Hadoop
Arun Rathinasabapathy
Senior Software Engineer
September 23, 2016
Big Data Processing Above and Beyond Hadoop: Data-intensive computing represents a new computing paradigm that addresses Big Data processing requirements with high-performance architectures supporting scalable parallel processing, allowing government, commercial organizations, and research environments to process massive amounts of data and implement applications previously thought impractical or infeasible. The fundamental challenges of data-intensive computing are managing and processing exponentially growing data volumes, significantly reducing associated data-analysis cycles to support practical, timely applications, and developing new algorithms that can scale to search and process massive amounts of data.

The open source HPCC (High-Performance Computing Cluster) Systems platform offers a unified approach to these requirements: (1) a scalable, integrated hardware and software architecture designed for parallel processing of data-intensive computing applications, and (2) a new programming paradigm in the form of a high-level, declarative, data-centric programming language designed specifically for Big Data processing.

This presentation explores the challenges of data-intensive computing from a programming perspective and describes the ECL programming language and the HPCC architecture designed for data-intensive computing applications. HPCC is an alternative to the Hadoop platform, and ECL is compared to Pig Latin, a high-level language developed for the Hadoop MapReduce architecture.

1. BIG Data processing using HPCC Systems: Above and Beyond Hadoop
   Arun Rathinasabapathy, Senior Software Engineer
   September 23, 2016
2. Welcome!
   • Module 1: LexisNexis – Introduction
   • Module 2: BIG DATA – ???
   • Module 3: Introducing HPCC
   • Module 4: HPCC System components
   • Module 5: ECL IDE & Data Graphs
   • Module 6: ECL language
   • Module 7: Six Degrees of Kevin Bacon
   • Module 8: HPCC modules vs Apache Hadoop modules
   • Module 9: HPCC vs Hadoop language
   • Module 10: Dali server vs Task Tracker & Data Node service
   • Module 11: Fault resilience
   • Module 12: Components comparison
   • Module 13: ECL vs Hadoop performance comparison
   • Module 14: Why HPCC – case studies and why it's superior to Hadoop
   • Module 15: Join our academic community
3. About LexisNexis Risk Solutions
   Data, Analytics and Technology: LexisNexis Risk Solutions leverages its industry-leading Big Data computing platform with vast data assets and a proprietary fast-linking technology.
   Solutions We Provide: Our solution lines help detect and prevent fraud, streamline processes, investigate suspicious activity, and provide timely insights for business decisions.
   Markets We Serve: We serve multiple industries, including Insurance, Financial Services, Receivables Management, Retail, Health Care and Communications, as well as local, state, and federal governments. We work with Fortune 1000 and mid-market clients globally across industries, and with federal and state governments.
   • Customers in more than 100 countries
   • 8 of the world's top 10 banks
   • 100% of the top 50 U.S. banks
   • 80% of the Fortune 500 companies
   • 100% of U.S. P&C insurance carriers
4. Vast Data Assets
   Partial snapshot of our U.S. data sets as of 08/01/2016:
   • 10 billion unique name/address combinations
   • 4.8 billion property records
   • 4.2 billion motor vehicle registrations
   • 41 million active U.S. business entities (LexIDs)
   • 1.5 billion bankruptcy records monitored
   • 272.8 million unique cell phones
   • 21.9 billion insurance records
   • 477 million criminal records ...
   • 18.6 billion consumer records
   • 1 billion vehicle title records
   • Over 6 Petabytes of Data
   • 45 Billion Public Records
5. Module 2 – BIG DATA – Understanding the Basics
6. HPCC Systems Technology: Big Data Is Our Core Competency
   SPEED
   • Scales to extreme workloads quickly and easily
   • Increases development speed, leading to faster production and delivery
   • Improves developer productivity
   CAPACITY
   • Enables massive joins, merges, transformations, sorts, or tough N² problems
   • Increases business responsiveness
   • Accelerates creation of new services via rapid prototyping capabilities
   • Offers a platform for collaboration and innovation, leading to better results
   COST SAVINGS
   • Leverages commodity hardware, so fewer people can do much more in less time
   • Uses IT resources efficiently via sharing and higher system utilization
   COMPLEX PROCESSING
   • Disambiguates entities with a high level of speed and accuracy
   • Constructs graphs from complex, large data sets for easier data analytics
   • Enables graph traversal to recognize areas of hidden value
   • Identifies important attributes that contribute to predictive models
7. Module 4 – HPCC System Components
   • Data Refinery (THOR) – Used to process every one of billions of records in order to create billions of "improved" records.
   • ECL Agent – Also used to process simple jobs that would be an inefficient use of the THOR cluster.
   • Rapid Data Delivery Engine (ROXIE) – Used to search quickly for a particular record or set of records.
   • Enterprise Control Language (ECL) – Declarative, data-centric, distributed processing language for Big Data.
   • Enterprise Services Platform (ESP) – Provides an easy interface to access ECL queries using XML, HTTP, SOAP (Simple Object Access Protocol) and REST (Representational State Transfer); a minimal sketch of such a query follows this list.
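   To make the division of labor concrete, here is a minimal ECL sketch of the kind of parameterized query that would be published to ROXIE and exposed through ESP. The logical file name, layout, and parameter name are assumptions for illustration, not part of the original deck.

      // Hypothetical record layout and logical file name, for illustration only.
      PersonRec := RECORD
          STRING20 firstName;
          STRING20 lastName;
          STRING2  state;
      END;

      // Data like this would typically be prepared on THOR beforehand.
      people := DATASET('~demo::people', PersonRec, THOR);

      // STORED turns the definition into a query parameter; once the query is
      // published to ROXIE, ESP exposes it over SOAP/REST (e.g., via WsECL).
      STRING20 searchLast := '' : STORED('LastName');

      OUTPUT(people(lastName = searchLast));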
8. Module 5 – ECL IDE & Data Graphs
   • Many complex data problems require a series of advanced functions to solve them.
   • With HPCC Systems technology, complex data challenges can be represented naturally as a transformative data graph.
   • The nodes of the data graph can be processed in parallel as distinct data flows (see the sketch after this slide).
   • The ECL IDE turns code into graphs that facilitate the understanding and processing of large-scale, complex data analytics.
   • Each section of the graph includes information such as function, records processed, or skew.
   • Each node can be drilled into for specific details.
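   As a rough illustration of how ECL maps onto a data graph, the sketch below (file name and layout are invented) defines two independent dataflows; each activity (disk read, sort, crosstab, output) becomes a node in the graph the IDE displays, and independent branches are candidates for parallel execution.

      // Invented logical file and layout, for illustration only.
      rec := RECORD
          STRING20  name;
          UNSIGNED1 age;
      END;
      raw := DATASET('~demo::people', rec, THOR);

      // Two independent branches of the data graph:
      byName   := SORT(raw, name);                                      // branch 1: sort
      ageStats := TABLE(raw, {age, UNSIGNED4 cnt := COUNT(GROUP)}, age); // branch 2: crosstab

      OUTPUT(byName);     // each action ends a subgraph shown in the IDE
      OUTPUT(ageStats);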
9. Module 6 – ECL Language
   • An easy-to-use, data-centric programming language optimized for large-scale data management and query processing
   • Highly efficient – automatically distributes workload across all nodes
   • 80% more efficient than C++, Java and SQL – 1/3 reduction in programmer time to maintain/enhance existing applications
   • Benchmarked against SQL (5 times more efficient) for code generation
   • Automatic parallelization and synchronization of sequential algorithms for parallel and distributed processing
   • Large library of built-in modules to handle common data manipulation tasks
   Declarative programming language: powerful, extensible, implicitly parallel, maintainable, complete and homogeneous. A brief example of the declarative style follows.
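   To show what "declarative" means in practice, here is a minimal sketch (file name and layout invented): each line is a definition rather than an executed statement, and the compiler fuses the definitions into one optimized, distributed job when an action such as OUTPUT is reached.

      // Invented logical file and layout, for illustration only.
      people := DATASET('~demo::people',
                        {STRING20 name; STRING2 state;}, THOR);

      floridians := people(state = 'FL');                 // a definition, not a step
      deduped    := DEDUP(SORT(floridians, name), name);  // still just a definition

      OUTPUT(COUNT(deduped));  // the action: only now is anything executed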
10. Module 7 – Six Degrees of Kevin Bacon
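   The slide itself is a diagram, but the underlying exercise can be sketched in a few lines of ECL. The cast file and layout below are invented for illustration: a self-join over movies yields the first-degree co-stars, and repeating the join extends the search one degree at a time.

      // Invented cast file and layout, for illustration only.
      CastRec := RECORD
          STRING50 actor;
          STRING80 movie;
      END;
      casts := DATASET('~demo::imdb::cast', CastRec, THOR);

      baconMovies := casts(actor = 'Kevin Bacon');

      // Degree 1: everyone who shares a movie with Kevin Bacon.
      degree1 := JOIN(baconMovies, casts,
                      LEFT.movie = RIGHT.movie AND RIGHT.actor != 'Kevin Bacon',
                      TRANSFORM({STRING50 actor}, SELF.actor := RIGHT.actor));

      OUTPUT(DEDUP(SORT(degree1, actor), actor));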
11. Module 8 – HPCC Modules vs Apache Hadoop Modules
   HPCC Systems Modules
   • File systems: the Distributed File System comes in two flavors (a sketch of their roles follows this slide)
     • The Thor distributed file system (Thor DFS) is optimized for Big Data ETL
     • The ROXIE distributed file system (Roxie DFS) is optimized for highly concurrent query processing
   Hadoop Modules
   • Hadoop Common – contains libraries and utilities needed by other Hadoop modules
   • Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
   • Hadoop YARN – a resource-management platform responsible for managing resources in clusters and using them to schedule users' applications
   • Hadoop MapReduce – a programming model for large-scale data processing
   • All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework
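   As a small, hypothetical illustration of the two HPCC file systems' roles: the Thor DFS holds raw and refined data written by ETL jobs, while keyed indexes built from that data are what ROXIE serves. All names and layouts below are invented.

      // Invented layout and logical file names, for illustration only.
      rec := RECORD
          UNSIGNED8 id;
          STRING20  name;
      END;
      raw := DATASET('~demo::raw::people', rec, THOR);   // lives on the Thor DFS

      // ETL on Thor: write a cleaned, sorted copy back to the Thor DFS.
      OUTPUT(SORT(raw, id), , '~demo::refined::people', OVERWRITE);

      // Build a payload index keyed on id; files like this are what
      // ROXIE serves for fast, highly concurrent lookups.
      idx := INDEX(raw, {id}, {name}, '~demo::key::people_by_id');
      BUILD(idx);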
12. Module 9 – HPCC Systems vs Hadoop Language
   • ECL is the primary programming language for the HPCC environment. ECL is compiled into optimized C++, which is then compiled into DLLs for execution on the Thor and ROXIE platforms.
   • ECL can include inline C++ code encapsulated in functions (a minimal sketch follows). External services can be written in any language and compiled into shared libraries of functions callable from ECL.
   • A Pipe interface allows execution of external programs written in any language to be incorporated into jobs.
   • The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts.
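   A minimal sketch of the inline C++ facility mentioned above, using ECL's BEGINC++ construct; the function itself is invented for illustration.

      // Inline C++ embedded in ECL via BEGINC++ / ENDC++.
      INTEGER4 addOne(INTEGER4 x) := BEGINC++
          return x + 1;
      ENDC++;

      OUTPUT(addOne(41));  // 42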
13. Module 10 – Dali Server vs Task Tracker & Data Node Service
   In Hadoop, each slave node includes a Task Tracker service and a Data Node service. A master node includes a Job Tracker service, which can be configured on a separate hardware node or run on one of the slave hardware nodes. A master Name Node service is also required to provide name services; it can likewise run on one of the slave nodes or on a separate node.
   In the HPCC environment, a separate server called the Dali server provides file system name services and manages work units for jobs. A Thor cluster is also configured with a master node and multiple slave nodes. A ROXIE cluster is a peer-coupled cluster in which each node runs Server and Agent tasks for query execution and key and file processing.
14. Module 11 – Fault Resilience
   HPCC: The DFS for Thor and Roxie stores replicas of file parts on other nodes (configurable) to protect against disk and node failure. Replicas are automatically used while copying data to a replacement node, and a ROXIE system continues running after a node failure with a reduced number of nodes.
   Hadoop: HDFS stores multiple replicas (user-specified) of data blocks on other nodes (configurable) to protect against disk and node failure, with automatic recovery. The MapReduce architecture also includes speculative execution: when a slow or failed Map task is detected, additional Map tasks are started to recover from node failures.
15. Module 12 – Components Comparison
   • HDFS – distributed file system that stores files for Hadoop. HPCC equivalent: none; HPCC uses the native filesystem to store files.
   • Name Node – keeps track of all files stored in HDFS, including all the blocks allocated to each file. HPCC equivalent: Thor master node; the DFU is responsible for tracking file parts across nodes.
   • Data Node – sub node that stores Hadoop files. HPCC equivalent: Thor slave nodes; unlike Hadoop, Thor can store data on both the master and the slave nodes.
   • Job Tracker – schedules job runs and manages resources. HPCC equivalent: Dali.
   • Task Tracker – runs subtasks assigned to the sub node. HPCC equivalent: Dali, which monitors task completion on each Thor sub node.
   • Hive – provides data-warehouse structure over HDFS files and SQL-like declarative access to it. HPCC equivalent: Roxie + Thor; Thor performs data-warehousing functions such as aggregations and builds keyed B+ tree indexes, while Roxie provides fast keyed access to the aggregated data.
   • Pig/Sqoop – provide easy declarative language constructs to perform jobs on Hadoop. HPCC equivalent: ECL, a declarative, SQL-like language.
16. Module 13 – ECL vs Hadoop Performance Comparison
17. Module 14 – Why HPCC: Case Studies vs Hadoop
   • Testimonial videos are available at http://hpccsystems.com/why-HPCC/case-studies; several of them compare use cases between HPCC and Hadoop.
   • http://hpccsystems.com/Why-HPCC/HPCC-vs-Hadoop/Superior-to-Hadoop lists what makes HPCC superior to Hadoop.
18. Module 15 – Join Our Academic Community
   Benefits of joining the community:
   • Internship opportunities
   • Invitation-only conferences
   • Free training for qualifying projects
   • Access to an external cluster, as available
   • How to join: click http://hpccsystems.com/community/academic/join
   LexisNexis offers free online introductory classes to learn HPCC Systems, the open source platform for BIG Data processing and analytics. Benefits of attending classes:
   • FREE: fits all budgets
   • Professional development at your own pace: attend class as your schedule allows
   • Increase your proficiency in solving BIG Data challenges: successive classes gradually build your expertise
   • Be a part of a growing community: meet other programmers, share your experience, and trade tips and tricks
   • How to start: click http://learn.lexisnexis.com/hpcc
19. Questions? Arun.Rathinasabapathy@lexisnexis.com
