Big Data Processing Above and Beyond Hadoop: Data-intensive computing represents a new computing paradigm for Big Data processing. It uses high-performance architectures that support scalable parallel processing, allowing government, commercial, and research organizations to process massive amounts of data and to implement applications previously thought impractical or infeasible. The fundamental challenges of data-intensive computing are managing and processing exponentially growing data volumes, significantly reducing the associated data-analysis cycles to support practical, timely applications, and developing new algorithms that can scale to search and process massive amounts of data. The open source HPCC (High-Performance Computing Cluster) Systems platform offers a unified approach to these requirements: (1) a scalable, integrated hardware and software architecture designed for parallel processing of data-intensive applications, and (2) a new programming paradigm in the form of a high-level, declarative, data-centric language designed specifically for Big Data processing. This presentation explores the challenges of data-intensive computing from a programming perspective and describes the ECL programming language and the HPCC architecture designed for data-intensive applications. HPCC is an alternative to the Hadoop platform, and ECL is compared with Pig Latin, a high-level language developed for the Hadoop MapReduce architecture.
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
1. BIG Data processing using HPCC Systems
Above and Beyond Hadoop
Arun Rathinasabapathy
Senior Software Engineer
September 23, 2016
2. Welcome!
• Module 1: LexisNexis – Introduction
• Module 2: BIG DATA – Understanding the Basics
• Module 3: Introducing HPCC
• Module 4: HPCC System Components
• Module 5: ECL IDE & Data Graphs
• Module 6: ECL Language
• Module 7: Six Degrees of Kevin Bacon
• Module 8: HPCC Modules vs Apache Hadoop Modules
• Module 9: HPCC vs Hadoop Language
• Module 10: Dali Server vs Task Tracker & Data Node Service
• Module 11: Fault Resilience
• Module 12: Components Comparison
• Module 13: ECL vs Hadoop Performance Comparison
• Module 14: Why HPCC – Case Studies and Why It's Superior to Hadoop
• Module 15: Join Our Academic Community
3. About LexisNexis Risk Solutions

Data, Analytics and Technology
LexisNexis Risk Solutions leverages its industry-leading Big Data computing platform with vast data assets and a proprietary fast-linking technology.

Solutions We Provide
Our solution lines help detect and prevent fraud, streamline processes, investigate suspicious activity, and provide timely insights for business decisions.

Markets We Serve
We serve multiple industries, including Insurance, Financial Services, Receivables Management, Retail, Health Care and Communications, as well as local, state, and federal governments. We work with Fortune 1000 and mid-market clients globally across industries, and with federal and state governments.
• Customers in more than 100 countries
• 8 of the world's top 10 banks
• 100% of the top 50 U.S. banks
• 80% of the Fortune 500 companies
• 100% of U.S. P&C insurance carriers
4. Vast Data Assets

Partial snapshot of our U.S. data sets as of 08/01/2016:
• Over 6 Petabytes of Data
• 45 Billion Public Records
• 21.9 billion insurance records
• 18.6 billion consumer records
• 10 billion unique name/address combinations
• 4.8 billion property records
• 4.2 billion motor vehicle registrations
• 1.5 billion bankruptcy records monitored
• 1 billion vehicle title records
• 477 million criminal records
• 272.8 million unique cell phones
• 41 million active U.S. business entities (LexIDs)
5. Module 2 – BIG DATA – Understanding the Basics
6. HPCC Systems Technology: Big Data Is Our Core Competency

SPEED
• Scales to extreme workloads quickly and easily
• Increases development speed, leading to faster production and delivery
• Improves developer productivity

CAPACITY
• Enables massive joins, merges, transformations, sorts, and tough N² problems
• Increases business responsiveness
• Accelerates creation of new services via rapid prototyping capabilities
• Offers a platform for collaboration and innovation, leading to better results

COST SAVINGS
• Leverages commodity hardware so fewer people can do much more in less time
• Uses IT resources efficiently via sharing and higher system utilization

COMPLEX PROCESSING
• Disambiguates entities with a high level of speed and accuracy
• Constructs graphs from complex, large data sets for easier data analytics
• Enables graph traversal to recognize areas of hidden value
• Identifies important attributes that contribute to predictive models
7. Module 4 – HPCC System Components
• Data Refinery (THOR) – Used to process every one of billions of records in order to create billions of "improved" records.
• ECL Agent – Used to process simple jobs that would be an inefficient use of the THOR cluster.
• Rapid Data Delivery Engine (ROXIE) – Used to search quickly for a particular record or set of records.
• Enterprise Control Language (ECL) – Declarative, data-centric, distributed processing language for Big Data.
• Enterprise Services Platform (ESP) – Provides an easy interface to access ECL queries using XML, HTTP, SOAP (Simple Object Access Protocol), and REST (Representational State Transfer); see the sketch after this list.
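To make the ESP bullet concrete, here is a minimal ECL sketch of the kind of query ESP can expose as a SOAP/REST service. The logical file name, record layout, and parameter name are assumptions for illustration, not taken from the deck:

// Hypothetical record layout and logical file, for illustration only
PersonRec := RECORD
    STRING30 firstName;
    STRING30 lastName;
    STRING2  state;
END;
people := DATASET('~demo::people', PersonRec, FLAT);

// ESP surfaces STORED definitions as input parameters of the published query
STRING30 searchLast := '' : STORED('LastName');

OUTPUT(people(lastName = searchLast), NAMED('Results'));

Once a query like this is published, ESP can expose it over the SOAP and REST interfaces listed in the bullet above.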
8. Module 5 – ECL IDE & Data Graphs
• Many complex data problems require a series of advanced functions to solve them.
• With HPCC Systems technology, complex data challenges can be represented naturally as a transformative data graph.
• The nodes of the data graph can be processed in parallel as distinct data flows.
• The ECL IDE turns code into graphs that facilitate the understanding and processing of large-scale, complex data analytics.
• Each section of the graph includes information such as the function performed, records processed, or skew.
• Each node can be drilled into for specific details; the sketch after this list shows a small job whose graph would contain distinct filter, sort, and output nodes.
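As a hedged illustration of the code-to-graph idea, here is a tiny ECL job; the file name and fields are invented for the example. Submitted through the ECL IDE, it would render as a data graph with distinct disk-read, filter, sort, and output activities that can be inspected node by node:

// Hypothetical transaction file, for illustration only
TxRec := RECORD
    STRING20 custId;
    REAL8    amount;
END;
txs := DATASET('~demo::transactions', TxRec, FLAT);

bigTxs := txs(amount > 1000);      // appears as a filter node in the graph
sorted := SORT(bigTxs, -amount);   // appears as a sort node
OUTPUT(CHOOSEN(sorted, 100));      // first-N and output nodes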
9. Module 6 – ECL Language
• An easy-to-use, data-centric programming language optimized for large-scale data management and query processing
• Highly efficient: automatically distributes workload across all nodes
• 80% more efficient than C++, Java, and SQL, with a 1/3 reduction in programmer time to maintain and enhance existing applications
• Roughly five times more efficient than SQL in code-generation benchmarks
• Automatic parallelization and synchronization of sequential algorithms for parallel and distributed processing
• Large library of built-in modules to handle common data manipulation tasks

A declarative programming language: powerful, extensible, implicitly parallel, maintainable, complete, and homogeneous. A short example follows.
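A minimal sketch of ECL's declarative style, assuming a hypothetical logical file '~demo::persons'. The definitions state what to compute; the compiler decides how to partition and parallelize the work across the cluster:

// Hypothetical record layout, for illustration only
PersonRec := RECORD
    STRING30  name;
    STRING2   state;
    UNSIGNED1 age;
END;
persons := DATASET('~demo::persons', PersonRec, FLAT);

adults := persons(age >= 18);   // a definition, not an imperative step

// Cross-tab: count of adults per state, computed in parallel on all nodes
byState := TABLE(adults,
                 { state, UNSIGNED4 cnt := COUNT(GROUP) },
                 state);

OUTPUT(SORT(byState, -cnt));

Note there is no explicit threading, partitioning, or message passing: execution order is determined by data dependencies, not by the order in which the lines are written.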
10. Module 7 – Six Degrees of Kevin Bacon
11. Module 8 – HPCC Modules vs Apache Hadoop Modules

HPCC Systems Modules
• Distributed File Systems
• The Thor distributed file system (Thor DFS) is optimized for Big Data ETL (sketched in ECL after this list)
• The ROXIE distributed file system (Roxie DFS) is optimized for highly concurrent query processing

Hadoop Modules
• Hadoop Common – contains libraries and utilities needed by other Hadoop modules
• Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
• Hadoop YARN – a resource-management platform responsible for managing cluster resources and scheduling users' applications
• Hadoop MapReduce – a programming model for large-scale data processing
• All Hadoop modules are designed with the fundamental assumption that hardware failures (of individual machines or racks of machines) are common and should be handled automatically in software by the framework
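The division of labor between the two HPCC file systems can be sketched in ECL: Thor runs the ETL step that builds a keyed index, and Roxie later serves concurrent queries against it. All names here are hypothetical:

// Hypothetical layout and file names, for illustration only
PersonRec := RECORD
    STRING30 lastName;
    STRING30 firstName;
    STRING2  state;
END;
people := DATASET('~demo::people', PersonRec, FLAT);

// ETL on Thor: build the keyed index file
byName := INDEX(people, {lastName, firstName}, {state},
                '~demo::key::people_byname');
BUILD(byName, OVERWRITE);

// A Roxie query would then read it with keyed access, e.g.:
// OUTPUT(byName(lastName = 'BACON'));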
12. Module 9 – HPCC Systems vs Hadoop Language
• ECL is the primary programming language for the HPCC environment. ECL is compiled into optimized C++, which is then compiled into DLLs for execution on the Thor and ROXIE platforms.
• ECL can include inline C++ code encapsulated in functions (see the sketch after this list). External services can be written in any language and compiled into shared libraries of functions callable from ECL.
• A Pipe interface allows external programs written in any language to be incorporated into jobs.
• The Hadoop framework itself is mostly written in Java, with some native code in C and command-line utilities written as shell scripts.
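As a small example of the inline C++ capability mentioned above (the function itself is invented for illustration), ECL lets you embed a C++ body directly with BEGINC++/ENDC++:

// ECL prototype; the body between BEGINC++ and ENDC++ is compiled as C++
REAL8 CelsiusToF(REAL8 c) := BEGINC++
    return c * 9.0 / 5.0 + 32.0;
ENDC++;

OUTPUT(CelsiusToF(100.0)); // 212.0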
13. Module 10 – Dali Server vs Task Tracker & Data Node Service

Hadoop
• Each slave node includes a Task tracker service and a Data node service.
• A master node includes a Job tracker service, which can be configured on a separate hardware node or run on one of the slave hardware nodes.
• A master Name node service is also required to provide name services; it can run on one of the slave nodes or on a separate node.

HPCC
• A Thor cluster is likewise configured with a master node and multiple slave nodes.
• A separate server called the Dali server provides file system name services and manages work units for jobs in the HPCC environment.
• A ROXIE cluster is a peer-coupled cluster in which each node runs Server and Agent tasks for query execution and for key and file processing.
14. Module 11 – Fault Resilience

HPCC
• The DFS for Thor and Roxie stores replicas of file parts on other nodes (configurable) to protect against disk and node failure. Replicas are used automatically while data is copied to a replacement node.
• The ROXIE system continues running after a node failure, with a reduced number of nodes.

Hadoop
• HDFS stores multiple replicas (user-specified) of data blocks on other nodes (configurable) to protect against disk and node failure, with automatic recovery.
• The MapReduce architecture includes speculative execution: when a slow or failed Map task is detected, additional Map tasks are started to recover from node failures.
15. Module 12 – Components Comparison

Hadoop Component | Purpose | HPCC Equivalent | Notes
HDFS | Distributed file system to store files for Hadoop | None | HPCC uses the native filesystem to store files
Name node | Keeps track of all files stored in HDFS, including all the blocks allocated to each file | Thor master node | The DFU is responsible for tracking file parts across nodes
Data node | Sub node that stores Hadoop files | Thor slave nodes | Unlike the Hadoop name node, the Thor master can store data as well as the slave nodes
Job tracker | Schedules job runs and manages resources | Dali |
Task tracker | Runs subtasks assigned to the sub node | Dali | Dali monitors task completion on each Thor sub node
Hive | Provides DW structure to HDFS files and SQL-like declarative access to the DW | Roxie + Thor | Thor performs data warehousing functions such as aggregations and creates keyed B+ tree indexes; Roxie provides fast keyed access to the aggregated data
Pig/Sqoop | Provide easy declarative language constructs to perform jobs on Hadoop | ECL | ECL is a declarative SQL-like language (see the sketch after this table)
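To ground the Pig/Sqoop row, here is what the classic Pig pattern of load, group, and count looks like in ECL; the web-log file and fields are assumptions for the example:

// Hypothetical web-log file, for illustration only
LogRec := RECORD
    STRING15  ipAddress;
    STRING200 url;
END;
logs := DATASET('~demo::weblogs', LogRec, CSV);

// Equivalent of Pig's GROUP BY url followed by a COUNT per group
hits := TABLE(logs, { url, UNSIGNED4 hitCount := COUNT(GROUP) }, url);

OUTPUT(TOPN(hits, 10, -hitCount)); // ten most requested URLs

In Pig Latin the same logic would be a LOAD, GROUP, FOREACH ... GENERATE COUNT, ORDER, and LIMIT pipeline; in ECL it is a pair of declarative definitions.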
16. Module 13 – ECL vs Hadoop Performance comparison
17. Module 14 – Why HPCC – Case Studies vs Hadoop
• See the testimonial videos at http://hpccsystems.com/why-HPCC/case-studies; some of them compare use cases between HPCC and Hadoop.
• See http://hpccsystems.com/Why-HPCC/HPCC-vs-Hadoop/Superior-to-Hadoop for a list of what makes HPCC superior to Hadoop.
18. Module 15 – Join our academic community
Benefits of joining the community:
• Internship opportunities
• Invitation-only conferences
• Free training for qualifying projects
• Access to an external cluster, as available
• How to join: http://hpccsystems.com/community/academic/join

Benefits of attending classes:
• FREE: fits all budgets
• Professional development at your own pace: attend class as your schedule allows
• Increase your proficiency in solving Big Data challenges: successive classes gradually build your expertise
• Be part of a growing community: meet other programmers, share your experience, and trade tips and tricks
• How to start: http://learn.lexisnexis.com/hpcc
LexisNexis offers free online introductory classes to learn HPCC Systems, the open source platform for Big Data processing and analytics.