BIG Data processing using HPCC Systems
Above and Beyond Hadoop
Arun Rathinasabapathy
Senior Software Engineer
September 23, 2016
Welcome!
• Module 1: LexisNexis – Introduction
• Module 2: BIG DATA – ???
• Module 3: Introducing HPCC
• Module 4: HPCC System Components
• Module 5: ECL IDE & Data Graphs
• Module 6: ECL Language
• Module 7: Six Degrees of Kevin Bacon
• Module 8: HPCC Modules vs Apache Hadoop Modules
• Module 9: HPCC vs Hadoop Language
• Module 10: Dali Server vs Task Tracker & Data Node Service
• Module 11: Fault Resilience
• Module 12: Components Comparison
• Module 13: ECL vs Hadoop Performance Comparison
• Module 14: Why HPCC – Case Studies and Why It’s Superior to Hadoop
• Module 15: Join Our Academic Community
About LexisNexis Risk Solutions
Data, Analytics and Technology
LexisNexis Risk Solutions leverages its industry-leading Big
Data computing platform with vast data assets and a
proprietary fast-linking technology.
Solutions We Provide
Our solution lines help detect and prevent fraud,
streamline processes, investigate suspicious activity, and
provide timely insights for business decisions.
Markets We Serve
We serve multiple industries, including Insurance, Financial
Services, Receivables Management, Retail, Health Care and
Communications, as well as local, state, and federal
governments.
Big Data Processing with HPCC Systems
We work with Fortune 1000 and mid-market
clients globally across industries, and federal
and state governments.
• Customers in more than 100 countries
• 8 of the world’s top 10 banks
• 100% of the top 50 U.S. banks
• 80% of the Fortune 500 companies
• 100% of U.S. P&C insurance carriers
Vast Data Assets
• 10 billion unique name/address combinations
• 4.8 billion property records
• 4.2 billion motor vehicle registrations
• 41 million active U.S. business entities (LexIDs)
• 1.5 billion bankruptcy records monitored
• 272.8 million unique cell phones
• 21.9 billion insurance records
• 477 million criminal records
• 18.6 billion consumer records
• 1 billion vehicle title records
Partial snapshot of our U.S. data sets as of 08/01/2016
• Over 6 Petabytes of Data
• 45 Billion Public Records
Module 2 – BIG DATA – Understanding the Basics
HPCC Systems Technology: Big Data Is Our Core Competency
SPEED
• Scales to extreme workloads quickly and easily
• Increases speed of development, leading to faster production and delivery
• Improves developer productivity
CAPACITY
• Enables massive joins, merges, transformations, sorts, or tough N² problems
• Increases business responsiveness
• Accelerates creation of new services via rapid prototyping capabilities
• Offers a platform for collaboration and innovation leading to better results
COST SAVINGS
• Leverages commodity hardware so fewer people can do much more in less time
• Uses IT resources efficiently via sharing and higher system utilization
COMPLEX PROCESSING
• Disambiguates entities with a high level of speed and accuracy
• Constructs graphs from complex, large data sets for easier data analytics
• Enables graph traversal to recognize areas of hidden value
• Identifies important attributes that contribute to predictive models
Module 4 – HPCC System Components
• Data Refinery (THOR) – Processes every one of billions of records to create billions of "improved" records.
• ECL Agent – Also processes simple jobs that would be an inefficient use of the THOR cluster.
• Rapid Data Delivery Engine (ROXIE) – Searches quickly for a particular record or set of records.
• Enterprise Control Language (ECL) – A declarative, data-centric, distributed processing language for Big Data.
• Enterprise Services Platform (ESP) – Provides an easy interface to access ECL queries using XML, HTTP, SOAP (Simple Object Access Protocol), and REST (Representational State Transfer).
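The split between the two engines can be sketched in a few lines of plain Python. This is only an analogy, not HPCC or ECL code: the sample records and the helper names `refine`, `build_index`, and `lookup` are made up for illustration. A batch pass touches every record (the THOR role), and a keyed index then answers narrow lookups without a full scan (the ROXIE role).

```python
# Illustrative only: plain Python standing in for the two HPCC engine roles.
from collections import defaultdict

# Made-up sample data.
raw_records = [
    {"id": 1, "name": " alice smith ", "state": "ga"},
    {"id": 2, "name": "BOB JONES", "state": "fl"},
    {"id": 3, "name": "Carol Diaz ", "state": "GA"},
]

def refine(record):
    """THOR-style step: touch every record and emit an "improved" record."""
    return {
        "id": record["id"],
        "name": " ".join(record["name"].split()).title(),
        "state": record["state"].upper(),
    }

def build_index(records, key):
    """Build a keyed index once so later lookups avoid a full scan."""
    index = defaultdict(list)
    for record in records:
        index[record[key]].append(record)
    return index

refined = [refine(r) for r in raw_records]  # batch pass over all records
by_state = build_index(refined, "state")    # prepared for fast delivery

def lookup(state):
    """ROXIE-style step: fetch a small set of records by key."""
    return by_state.get(state, [])

print([r["name"] for r in lookup("GA")])  # ['Alice Smith', 'Carol Diaz']
```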
Module 5 – ECL IDE & Data Graphs
• Many complex data problems require a series of advanced functions to solve them.
• With HPCC Systems technology, complex data challenges can be represented naturally with a transformative data graph.
• The nodes of the data graph can be processed in parallel as distinct data flows.
• ECL IDE turns code into graphs that facilitate the understanding and processing of large-scale, complex data analytics.
• Each section of the graph includes information such as function, records processed or skew.
• Each node can be drilled into for specific details.
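As a rough illustration of the data-graph idea, the sketch below runs a tiny three-node graph in plain Python: two independent filter nodes run concurrently and feed a join node, and each node reports how many records it processed. The data, node names, and the `stats` dictionary are hypothetical; the real graphs are generated from ECL code, not written by hand like this.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up inputs: (person id, name, state) and (owner id, vehicle kind).
people = [("p1", "Ann", "GA"), ("p2", "Raj", "FL"), ("p3", "Lee", "GA")]
vehicles = [("p1", "sedan"), ("p3", "truck"), ("p3", "van"), ("p9", "coupe")]

def filter_people():
    """Graph node 1: keep people in one state."""
    return [p for p in people if p[2] == "GA"]

def filter_vehicles():
    """Graph node 2: drop one vehicle kind."""
    return [v for v in vehicles if v[1] != "coupe"]

def join(left, right):
    """Graph node 3: match each person with the vehicles they own."""
    return [(pid, name, kind)
            for pid, name, _ in left
            for owner, kind in right
            if owner == pid]

with ThreadPoolExecutor() as pool:
    # The two filter nodes do not depend on each other, so they can run concurrently.
    left_future = pool.submit(filter_people)
    right_future = pool.submit(filter_vehicles)
    left, right = left_future.result(), right_future.result()

joined = join(left, right)

# Per-node "records processed" counts, similar in spirit to what a graph view shows.
stats = {
    "filter_people": len(left),
    "filter_vehicles": len(right),
    "join": len(joined),
}
print(stats)  # {'filter_people': 2, 'filter_vehicles': 3, 'join': 3}
```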
Module 6 – ECL Language
• An easy-to-use, data-centric programming language optimized for large-scale data management and query processing
• Highly efficient: automatically distributes workload across all nodes
• 80% more efficient than C++, Java and SQL, with a 1/3 reduction in programmer time to maintain/enhance existing applications
• Benchmarked against SQL for code generation: 5 times more efficient
• Automatic parallelization and synchronization of sequential algorithms for parallel and distributed processing
• Large library of built-in modules to handle common data manipulation tasks
Declarative programming language … powerful, extensible, implicitly parallel, maintainable, complete and homogeneous
Module 7 – Six Degrees of Kevin Bacon
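The "six degrees" problem asks how many co-star links separate an actor from Kevin Bacon. A minimal way to picture it is a breadth-first search over a co-star graph, sketched here in plain Python rather than ECL. The co-star links and the "Actor A" to "Actor E" names are invented for illustration and are not real filmography data.

```python
from collections import deque

# Made-up co-star links for illustration; not real filmography data.
costars = {
    "Kevin Bacon": {"Actor A", "Actor B"},
    "Actor A": {"Kevin Bacon", "Actor C"},
    "Actor B": {"Kevin Bacon"},
    "Actor C": {"Actor A", "Actor D"},
    "Actor D": {"Actor C"},
    "Actor E": set(),  # no links, so unreachable
}

def degrees_from(source, graph):
    """Breadth-first search: number of co-star links from source to each reachable actor."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    return distance

bacon_numbers = degrees_from("Kevin Bacon", costars)
print(bacon_numbers["Actor D"])      # 3
print("Actor E" in bacon_numbers)    # False
```

At data-set scale the same expansion is usually expressed as repeated joins between the current frontier and the link table, which is the kind of workload a distributed engine parallelizes.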
Module 8 – HPCC Modules vs Apache Hadoop Modules
HPCC Systems Modules
• File Systems
• Distributed File System
• Thor distributed file system (Thor DFS) is optimized for Big Data ETL
• ROXIE distributed file system (Roxie DFS) is optimized for highly concurrent query processing
Hadoop Modules
• Hadoop Common – contains libraries and utilities needed by other Hadoop modules.
• Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
• Hadoop YARN – a resource-management platform responsible for managing resources in clusters and using them for scheduling of users' applications.
• Hadoop MapReduce – a programming model for large-scale data processing.
• All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework.
Module 9 – HPCC Systems vs Hadoop Language
• ECL is the primary programming language for the HPCC environment. ECL is compiled into optimized C++, which is then compiled into DLLs for execution on the Thor and ROXIE platforms.
• ECL can include inline C++ code encapsulated in functions. External services can be written in any language and compiled into shared libraries of functions callable from ECL.
• A Pipe interface allows execution of external programs written in any language to be incorporated into jobs.
• The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts.
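The pipe idea, streaming records through an external program and reading its output back into the job, can be sketched with the Python standard library. This is an analogy only and not the ECL Pipe syntax: `pipe_through` is a hypothetical helper, and the "external program" here is a one-line Python script chosen so the example runs anywhere.

```python
import subprocess
import sys

def pipe_through(command, records):
    """Send one record per line to an external program and collect its output lines."""
    completed = subprocess.run(
        command,
        input="\n".join(records) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise if the external program exits with an error
    )
    return completed.stdout.splitlines()

# Stand-in external program: a one-line Python script that upper-cases its input.
external = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

result = pipe_through(external, ["alpha,1", "beta,2"])
print(result)  # ['ALPHA,1', 'BETA,2']
```

Because the contract is just records in and records out, the external program can be written in any language.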
Module 10 – Dali Server vs Task Tracker & Data Node Service
Hadoop
• Each slave node in Hadoop includes a Task tracker service and a Data node service.
• A master node includes a Job tracker service, which can be configured as a separate hardware node or run on one of the slave hardware nodes.
• A master Name node service is also required to provide name services and can be run on one of the slave nodes or a separate node.
HPCC Systems
• A separate server called the Dali server provides file system name services and manages work units for jobs in the HPCC environment.
• A Thor cluster is also configured with a master node and multiple slave nodes.
• A ROXIE cluster is a peer-coupled cluster where each node runs Server and Agent tasks for query execution and key and file processing.
Module 11 – Fault Resilience
HPCC Systems
• The DFS for Thor and Roxie stores replicas of file parts on other nodes (configurable) to protect against disk and node failure. Replicas are automatically used while copying data to the new node.
• A ROXIE system continues running following a node failure with a reduced number of nodes.
Hadoop
• HDFS stores multiple replicas (user-specified) of data blocks on other nodes (configurable) to protect against disk and node failure with automatic recovery.
• The MapReduce architecture includes speculative execution: when a slow or failed Map task is detected, additional Map tasks are started to recover from node failures.
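Replica placement can be pictured with a short Python sketch. The round-robin rule used here (each file part goes to one node and its copy to the next node) is an assumption made for illustration, since the slides only say that placement is configurable; `place_parts` and `readable_after_failure` are hypothetical helper names.

```python
def place_parts(num_parts, num_nodes, copies=2):
    """Map each file part to a primary node plus replica nodes (round-robin rule, assumed)."""
    if copies > num_nodes:
        raise ValueError("need at least as many nodes as copies")
    placement = {}
    for part in range(num_parts):
        primary = part % num_nodes
        # Replicas go to the nodes that follow the primary, wrapping around.
        placement[part] = [(primary + offset) % num_nodes for offset in range(copies)]
    return placement

def readable_after_failure(placement, failed_node):
    """For each part, pick a surviving node that still holds a copy (None if all copies are lost)."""
    return {
        part: next((node for node in nodes if node != failed_node), None)
        for part, nodes in placement.items()
    }

placement = place_parts(num_parts=4, num_nodes=4)
survivors = readable_after_failure(placement, failed_node=2)
print(placement)  # {0: [0, 1], 1: [1, 2], 2: [2, 3], 3: [3, 0]}
print(survivors)  # {0: 0, 1: 1, 2: 3, 3: 3}
```

With two copies per part, any single node failure leaves every part readable, and the surviving copy is also the source for repopulating a replacement node.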
Module 12 – Components Comparison
Hadoop Component | Purpose | HPCC Equivalent | Notes
HDFS | Distributed file system to store files for Hadoop | None | HPCC uses native filesystem to store files
Name node | Keep track of all files stored in HDFS including all the blocks allocated to each file | Thor master node | The DFU is responsible for tracking file parts across nodes
Data node | Sub node that stores Hadoop files | Thor slave nodes | Like Hadoop name node, Thor can store data in both the master and slave nodes
Job tracker | Scheduling job runs and managing resources | Dali |
Task tracker | Run subtasks assigned to the sub node | Dali | Monitors task completion on each Thor sub node
Hive | Provides DW structure to HDFS files and SQL-like declarative access to DW | Roxie + Thor | Thor is used to perform data warehousing functions like aggregations and create keyed B+ Tree indexes. Roxie is used to provide fast keyed access to aggregated data
Pig/Sqoop | Provide easy declarative language constructs to perform jobs on Hadoop | ECL | ECL is a declarative SQL-like language
Module 13 – ECL vs Hadoop Performance comparison
Module 14 – Why HPCC – Case Studies vs Hadoop
• Testimonial videos are available at http://hpccsystems.com/why-HPCC/case-studies; some of them compare use cases between HPCC and Hadoop.
• http://hpccsystems.com/Why-HPCC/HPCC-vs-Hadoop/Superior-to-Hadoop lists what makes HPCC superior to Hadoop.
Module 15 – Join our academic community
Benefits of joining the community:
• Internship opportunities
• Invitation-only conferences
• Free training for qualifying projects
• Access to an external cluster, as available
• How to join: http://hpccsystems.com/community/academic/join
Benefits of attending classes:
• FREE: fits all budgets
• Professional development at your own pace: attend class as your schedule allows
• Increase your proficiency in solving BIG data challenges: successive classes gradually build your expertise
• Be a part of a growing community: meet other programmers, share your experience, and trade tips and tricks
• How to start: http://learn.lexisnexis.com/hpcc
LexisNexis offers free online introductory program classes to learn HPCC Systems,
the open source platform for BIG Data processing and analytics.
Questions?
Arun.Rathinasabapathy@lexisnexis.com
19 Big Data Processing with HPCC Systems
20

More Related Content

What's hot

Distributed deep learning
Distributed deep learningDistributed deep learning
Distributed deep learningMehdi Shibahara
 
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Databricks
 
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovA Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovSpark Summit
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezBig Data Spain
 
CI/CD for Machine Learning with Daniel Kobran
CI/CD for Machine Learning with Daniel KobranCI/CD for Machine Learning with Daniel Kobran
CI/CD for Machine Learning with Daniel KobranDatabricks
 
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, SparkDistributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, SparkJan Wiegelmann
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsStavros Kontopoulos
 
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...Srivatsan Ramanujam
 
Machine Learning with Hadoop
Machine Learning with HadoopMachine Learning with Hadoop
Machine Learning with HadoopSangchul Song
 
Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Turi, Inc.
 
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyApache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyDatabricks
 
Distributed Deep Learning on Spark
Distributed Deep Learning on SparkDistributed Deep Learning on Spark
Distributed Deep Learning on SparkMathieu Dumoulin
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...MLconf
 
Convolutional Neural Networks at scale in Spark MLlib
Convolutional Neural Networks at scale in Spark MLlibConvolutional Neural Networks at scale in Spark MLlib
Convolutional Neural Networks at scale in Spark MLlibDataWorks Summit
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkJen Aman
 
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Databricks
 
DeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François GarillotDeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François GarillotSteve Moore
 
Deep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry LarkoDeep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry LarkoSri Ambati
 
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and PandasDistributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and PandasDatabricks
 
New Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 EditionNew Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 EditionSri Ambati
 

What's hot (20)

Distributed deep learning
Distributed deep learningDistributed deep learning
Distributed deep learning
 
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
 
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovA Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier Dominguez
 
CI/CD for Machine Learning with Daniel Kobran
CI/CD for Machine Learning with Daniel KobranCI/CD for Machine Learning with Daniel Kobran
CI/CD for Machine Learning with Daniel Kobran
 
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, SparkDistributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutions
 
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
 
Machine Learning with Hadoop
Machine Learning with HadoopMachine Learning with Hadoop
Machine Learning with Hadoop
 
Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark
 
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyApache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
 
Distributed Deep Learning on Spark
Distributed Deep Learning on SparkDistributed Deep Learning on Spark
Distributed Deep Learning on Spark
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
Convolutional Neural Networks at scale in Spark MLlib
Convolutional Neural Networks at scale in Spark MLlibConvolutional Neural Networks at scale in Spark MLlib
Convolutional Neural Networks at scale in Spark MLlib
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
 
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
 
DeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François GarillotDeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François Garillot
 
Deep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry LarkoDeep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry Larko
 
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and PandasDistributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
 
New Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 EditionNew Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 Edition
 

Similar to Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016

Big data processing using HPCC Systems Above and Beyond Hadoop
Big data processing using HPCC Systems Above and Beyond HadoopBig data processing using HPCC Systems Above and Beyond Hadoop
Big data processing using HPCC Systems Above and Beyond HadoopHPCC Systems
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1Thanh Nguyen
 
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data PlatformModernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data PlatformHortonworks
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecasesudhakara st
 
Accelerate Big Data Processing with High-Performance Computing Technologies
Accelerate Big Data Processing with High-Performance Computing TechnologiesAccelerate Big Data Processing with High-Performance Computing Technologies
Accelerate Big Data Processing with High-Performance Computing TechnologiesIntel® Software
 
Open source stak of big data techs open suse asia
Open source stak of big data techs   open suse asiaOpen source stak of big data techs   open suse asia
Open source stak of big data techs open suse asiaMuhammad Rifqi
 
CCD-410 Cloudera Study Material
CCD-410 Cloudera Study MaterialCCD-410 Cloudera Study Material
CCD-410 Cloudera Study MaterialRoxycodone Online
 
The modern analytics architecture
The modern analytics architectureThe modern analytics architecture
The modern analytics architectureJoseph D'Antoni
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...DataWorks Summit/Hadoop Summit
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Oracle Unified Information Architeture + Analytics by Example
Oracle Unified Information Architeture + Analytics by ExampleOracle Unified Information Architeture + Analytics by Example
Oracle Unified Information Architeture + Analytics by ExampleHarald Erb
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHitendra Kumar
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big dealeduarderwee
 

Similar to Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016 (20)

Big data processing using HPCC Systems Above and Beyond Hadoop
Big data processing using HPCC Systems Above and Beyond HadoopBig data processing using HPCC Systems Above and Beyond Hadoop
Big data processing using HPCC Systems Above and Beyond Hadoop
 
Hpcc
HpccHpcc
Hpcc
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
 
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data PlatformModernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
Accelerate Big Data Processing with High-Performance Computing Technologies
Accelerate Big Data Processing with High-Performance Computing TechnologiesAccelerate Big Data Processing with High-Performance Computing Technologies
Accelerate Big Data Processing with High-Performance Computing Technologies
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Unit IV.pdf
Unit IV.pdfUnit IV.pdf
Unit IV.pdf
 
Open source stak of big data techs open suse asia
Open source stak of big data techs   open suse asiaOpen source stak of big data techs   open suse asia
Open source stak of big data techs open suse asia
 
CCD-410 Cloudera Study Material
CCD-410 Cloudera Study MaterialCCD-410 Cloudera Study Material
CCD-410 Cloudera Study Material
 
The modern analytics architecture
The modern analytics architectureThe modern analytics architecture
The modern analytics architecture
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 
Hadoop in action
Hadoop in actionHadoop in action
Hadoop in action
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Hadoop
HadoopHadoop
Hadoop
 
Talend for big_data_intorduction
Talend for big_data_intorductionTalend for big_data_intorduction
Talend for big_data_intorduction
 
Oracle Unified Information Architeture + Analytics by Example
Oracle Unified Information Architeture + Analytics by ExampleOracle Unified Information Architeture + Analytics by Example
Oracle Unified Information Architeture + Analytics by Example
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big deal
 

More from MLconf

Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...MLconf
 
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingTed Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingMLconf
 
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...MLconf
 
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushIgor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushMLconf
 
Josh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceJosh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceMLconf
 
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...MLconf
 
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...MLconf
 
Meghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMeghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMLconf
 
Noam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data CollectionNoam Finkelstein - The Importance of Modeling Data Collection
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016

  • 1. BIG Data processing using HPCC Systems: Above and Beyond Hadoop. Arun Rathinasabapathy, Senior Software Engineer. September 23, 2016
  • 2. Welcome!
      • Module 1: LexisNexis – Introduction
      • Module 2: BIG DATA --- ???
      • Module 3: Introducing HPCC
      • Module 4: HPCC System components
      • Module 5: ECL IDE & Data Graphs
      • Module 6: ECL language
      • Module 7: Six Degrees of Kevin Bacon
      • Module 8: HPCC modules vs Apache Hadoop Modules
      • Module 9: HPCC vs Hadoop Language
      • Module 10: Dali server vs Task tracker & Data node service
      • Module 11: Fault Resilience
      • Module 12: Components Comparison
      • Module 13: ECL vs Hadoop Performance comparison
      • Module 14: Why HPCC – Case studies and why it’s superior to Hadoop
      • Module 15: Join our academic community
  • 3. About LexisNexis Risk Solutions
      Data, Analytics and Technology: LexisNexis Risk Solutions leverages its industry-leading Big Data computing platform with vast data assets and a proprietary fast-linking technology.
      Solutions We Provide: Our solution lines help detect and prevent fraud, streamline processes, investigate suspicious activity, and provide timely insights for business decisions.
      Markets We Serve: We serve multiple industries, including Insurance, Financial Services, Receivables Management, Retail, Health Care and Communications, as well as local, state, and federal governments.
      We work with Fortune 1000 and mid-market clients globally across industries, and with federal and state governments.
      • Customers in more than 100 countries
      • 8 of the world’s top 10 banks
      • 100% of the top 50 U.S. banks
      • 80% of the Fortune 500 companies
      • 100% of U.S. P&C insurance carriers
      Big Data Processing with HPCC Systems
  • 4. Vast Data Assets
      • 10 billion unique name/address combinations
      • 4.8 billion property records
      • 4.2 billion motor vehicle registrations
      • 41 million active U.S. business entities (LexIDs)
      • 1.5 billion bankruptcy records monitored
      • 272.8 million unique cell phones
      • 21.9 billion insurance records
      • 477 million criminal records
      • ...
      • 18.6 billion consumer records
      • 1 billion vehicle title records
      Partial snapshot of our U.S. data sets as of 08/01/2016
      • Over 6 Petabytes of Data
      • 45 Billion Public Records
  • 5. Module 2 – BIG DATA – Understanding the Basics
  • 6. HPCC Systems Technology: Big Data Is Our Core Competency
      SPEED
      • Scales to extreme workloads quickly and easily
      • Increases speed of development, leading to faster production/delivery
      • Improves developer productivity
      CAPACITY
      • Enables massive joins, merges, transformations, sorts, or tough N² problems
      • Increases business responsiveness
      • Accelerates creation of new services via rapid prototyping capabilities
      • Offers a platform for collaboration and innovation leading to better results
      COST SAVINGS
      • Leverages commodity hardware so fewer people can do much more in less time
      • Uses IT resources efficiently via sharing and higher system utilization
      COMPLEX PROCESSING
      • Disambiguates entities with a high level of speed and accuracy
      • Constructs graphs from complex, large data sets for easier data analytics
      • Enables graph traversal to recognize areas of hidden value
      • Identifies important attributes that contribute to predictive models
  • 7. Module 4 – HPCC System Components
      • Data Refinery (THOR) – Used to process every one of billions of records in order to create billions of "improved" records.
      • ECL Agent – Also used to process simple jobs that would be an inefficient use of the THOR cluster.
      • Rapid Data Delivery Engine (ROXIE) – Used to search quickly for a particular record or set of records.
      • Enterprise Control Language (ECL) – Declarative, data-centric, distributed processing language for Big Data.
      • Enterprise Services Platform (ESP) – Provides an easy interface to access ECL queries using XML, HTTP, SOAP (Simple Object Access Protocol), and REST (Representational State Transfer).
  • 8. Module 5 – ECL IDE & Data Graphs
      • Many complex data problems require a series of advanced functions to solve them.
      • With HPCC Systems technology, complex data challenges can be represented naturally with a transformative data graph.
      • The nodes of the data graph can be processed in parallel as distinct data flows.
      • ECL IDE turns code into graphs that facilitate the understanding and processing of large-scale, complex data analytics.
      • Each section of the graph includes information such as function, records processed, or skew.
      • Each node can be drilled into for specific details.
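The claim that nodes of a data graph can be processed in parallel can be illustrated with a small sketch. It is written in Python rather than ECL, and the activity names (`read_people`, `join`, and so on) are hypothetical, not taken from the slides. It groups a dependency graph into stages whose members have no unmet dependencies and so could run at the same time. It is only an illustration of the idea, not a description of how ECL IDE or Thor actually schedules a graph.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical activity graph: each key depends on the activities in its set.
GRAPH = {
    "read_people": set(),
    "read_addresses": set(),
    "clean_people": {"read_people"},
    "clean_addresses": {"read_addresses"},
    "join": {"clean_people", "clean_addresses"},
    "output": {"join"},
}


def parallel_stages(deps):
    """Group activities into stages; everything in one stage is independent."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all activities runnable right now
        stages.append(ready)
        sorter.done(*ready)                 # mark them finished, unlocking successors
    return stages


if __name__ == "__main__":
    for number, stage in enumerate(parallel_stages(GRAPH), start=1):
        print(number, stage)
```

In this toy graph the two read activities form one stage and the two clean activities form the next, which is the sense in which separate branches of a data graph are distinct data flows.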
  • 9. Module 6 – ECL Language
      • An easy-to-use, data-centric programming language optimized for large-scale data management and query processing
      • Highly efficient: automatically distributes workload across all nodes
      • 80% more efficient than C++, Java and SQL; 1/3 reduction in programmer time to maintain/enhance existing applications
      • Benchmarked against SQL (5 times more efficient) for code generation
      • Automatic parallelization and synchronization of sequential algorithms for parallel and distributed processing
      • Large library of built-in modules to handle common data manipulation tasks
      Declarative programming language … powerful, extensible, implicitly parallel, maintainable, complete and homogeneous
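To make "automatically distributes workload across all nodes" concrete, here is a minimal sketch of one common way a platform can spread a job: hash-partition records by key, aggregate each partition locally, then merge the partial results. It is Python, not ECL, and the function names, the `state` field, and the node count are assumptions made for illustration. It does not describe the code the ECL compiler generates.

```python
import zlib
from collections import Counter


def partition(records, key, num_nodes):
    """Assign each record to a node by hashing its key (deterministic CRC32)."""
    parts = [[] for _ in range(num_nodes)]
    for record in records:
        node = zlib.crc32(str(record[key]).encode("utf-8")) % num_nodes
        parts[node].append(record)
    return parts


def count_by_key(records, key, num_nodes=4):
    """Count records per key: local counts per node, then a global merge."""
    parts = partition(records, key, num_nodes)
    # Each Counter stands in for work one node could do independently.
    local_counts = [Counter(record[key] for record in part) for part in parts]
    total = Counter()
    for counts in local_counts:
        total.update(counts)
    return dict(total)


if __name__ == "__main__":
    people = [
        {"name": "A", "state": "GA"},
        {"name": "B", "state": "FL"},
        {"name": "C", "state": "GA"},
    ]
    print(count_by_key(people, "state"))
```

Because records with the same key always hash to the same partition, each per-node count is complete for its keys. In a declarative language the programmer states only the aggregation and leaves this kind of partitioning to the platform.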
  • 10. Module 7 – Six Degrees of Kevin Bacon
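The Six Degrees of Kevin Bacon problem asks for the shortest chain of shared film credits linking an actor to Kevin Bacon. As a rough sketch of the underlying graph search (in Python, not the ECL used in the talk, and with a made-up cast list), a breadth-first search over a co-star graph looks like this. The film titles, actor names, and function names are all hypothetical.

```python
from collections import defaultdict, deque


def build_costars(films):
    """Map each actor to the set of actors they share at least one film with."""
    graph = defaultdict(set)
    for cast in films.values():
        for actor in cast:
            graph[actor].update(other for other in cast if other != actor)
    return graph


def bacon_number(films, actor, target="Kevin Bacon"):
    """Return the number of co-star links from actor to target, or None."""
    graph = build_costars(films)
    if actor not in graph or target not in graph:
        return None
    seen = {actor}
    queue = deque([(actor, 0)])
    while queue:
        current, distance = queue.popleft()
        if current == target:
            return distance
        for costar in graph[current]:
            if costar not in seen:
                seen.add(costar)
                queue.append((costar, distance + 1))
    return None  # no chain of shared films connects the two actors


if __name__ == "__main__":
    films = {
        "Film A": ["Kevin Bacon", "Actor 1"],
        "Film B": ["Actor 1", "Actor 2"],
        "Film C": ["Actor 3"],
    }
    print(bacon_number(films, "Actor 2"))
```

Breadth-first search visits actors in order of distance, so the first time the target is reached the distance is minimal. On a cluster, the same idea is usually expressed as repeated joins of a frontier set against the co-star relation.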
  • 11. Module 8 – HPCC Modules vs Apache Hadoop Modules
      HPCC Systems Modules
      • File Systems
      • Distributed File System
          • Thor distributed file system (Thor DFS) is optimized for Big Data ETL
          • ROXIE distributed file system (Roxie DFS) is optimized for high concurrent query processing
      Hadoop Modules
      • Hadoop Common – contains libraries and utilities needed by other Hadoop modules
      • Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
      • Hadoop YARN – a resource-management platform responsible for managing resources in clusters and using them for scheduling of users' applications
      • Hadoop MapReduce – a programming model for large-scale data processing
      • All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework
  • 12. Module 9 – HPCC Systems vs Hadoop Language
      • ECL is the primary programming language for the HPCC environment. ECL is compiled into optimized C++, which is then compiled into DLLs for execution on the Thor and ROXIE platforms.
      • ECL can include inline C++ code encapsulated in functions. External services can be written in any language and compiled into shared libraries of functions callable from ECL.
      • A Pipe interface allows execution of external programs written in any language to be incorporated into jobs.
      • The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts.
  • 13. Module 10 – Dali server vs Task tracker & Data node service
      Hadoop
      • Each slave node in Hadoop includes a Task tracker service and a Data node service.
      • A master node includes a Job tracker service, which can be configured as a separate hardware node or run on one of the slave hardware nodes.
      • A master Name node service is also required to provide name services; it can run on one of the slave nodes or on a separate node.
      HPCC
      • A separate server called the Dali server provides file system name services and manages work units for jobs in the HPCC environment.
      • A Thor cluster is also configured with a master node and multiple slave nodes.
      • A ROXIE cluster is a peer-coupled cluster where each node runs Server and Agent tasks for query execution and key and file processing.
  • 14. Module 11 – Fault Resilience
      HPCC
      • The DFS for Thor and Roxie stores replicas of file parts on other nodes (configurable) to protect against disk and node failure. Replicas are automatically used while copying data to the new node.
      • A ROXIE system continues running following a node failure with a reduced number of nodes.
      Hadoop
      • HDFS stores multiple replicas (user-specified) of data blocks on other nodes (configurable) to protect against disk and node failure, with automatic recovery.
      • The MapReduce architecture includes speculative execution: when a slow or failed Map task is detected, additional Map tasks are started to recover from node failures.
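The replica idea described for both file systems can be sketched in a few lines. This Python example assumes a simple round-robin layout in which each file part is also copied to the next node; that layout, the function names, and the node numbering are illustrative assumptions, not the documented placement policy of HPCC or HDFS. It shows why every part stays readable after a single node failure.

```python
def place_parts(num_parts, num_nodes, copies=2):
    """Return {part: [nodes holding a copy]} using an assumed round-robin layout."""
    if copies > num_nodes:
        raise ValueError("cannot place more copies than there are nodes")
    return {
        part: [(part + offset) % num_nodes for offset in range(copies)]
        for part in range(num_parts)
    }


def readable_from(placement, failed_nodes):
    """For each part, pick the first surviving node that holds a copy, else None."""
    result = {}
    for part, nodes in placement.items():
        alive = [node for node in nodes if node not in failed_nodes]
        result[part] = alive[0] if alive else None
    return result


if __name__ == "__main__":
    placement = place_parts(num_parts=4, num_nodes=4)
    print(placement)
    print(readable_from(placement, failed_nodes={1}))
```

With two copies per part, losing one node leaves every part available from its other copy. Losing two adjacent nodes in this layout makes one part unreadable, which is why the number of replicas is configurable.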
  • 15. Module 12 – Components Comparison
      Hadoop Component | Purpose | HPCC Equivalent | Notes
      HDFS | Distributed file system to store files for Hadoop | None | HPCC uses native filesystem to store files
      Name node | Keep track of all files stored in HDFS, including all the blocks allocated to each file | Thor master node | The DFU is responsible for tracking file parts across nodes
      Data node | Sub node that stores Hadoop files | Thor slave nodes | Like Hadoop name node, Thor can store data in both the master and slave nodes
      Job tracker | Scheduling job runs and managing resources | Dali |
      Task tracker | Run subtasks assigned to the sub node | | Dali monitors task completion on each Thor sub node
      Hive | Provides DW structure to HDFS files and SQL-like declarative access to DW | Roxie + Thor | Thor is used to perform data warehousing functions like aggregations and create keyed B+ Tree indexes. Roxie is used to provide fast keyed access to aggregated data
      Pig/Sqoop | Provide easy declarative language constructs to perform jobs on Hadoop | ECL | ECL is a declarative SQL-like language
  • 16. Module 13 – ECL vs Hadoop Performance comparison
  • 17. Module 14 – Why HPCC – Case Studies vs Hadoop
      • Testimonial videos are available at http://hpccsystems.com/why-HPCC/case-studies; some of them compare use cases between HPCC and Hadoop.
      • http://hpccsystems.com/Why-HPCC/HPCC-vs-Hadoop/Superior-to-Hadoop lists what makes HPCC superior to Hadoop.
  • 18. Module 15 – Join our academic community
      Benefits of joining the community:
      • Internship opportunities
      • Invitation-only conferences
      • Free training for qualifying projects
      • Access to an external cluster, as available
      • How to join: http://hpccsystems.com/community/academic/join
      LexisNexis offers free online introductory program classes to learn HPCC Systems, the open source platform for BIG Data processing and analytics.
      Benefits of attending classes:
      • FREE: fits all budgets
      • Professional development at your own pace: attend class as your schedule allows
      • Increase your proficiency in solving BIG data challenges: successive classes gradually build your expertise
      • Be a part of a growing community: meet other programmers, share your experience, and trade tips and tricks
      • How to start: http://learn.lexisnexis.com/hpcc