This document provides an overview of LexisNexis Risk and HPCC Systems, an open source big data platform. It describes how HPCC Systems uses a distributed file system to store and process large datasets across clusters. The document compares HPCC Systems to other platforms and provides details on its query language (ECL), machine learning capabilities, and support for both structured and unstructured data.
2. Who is LexisNexis? What is HPCC Systems?
LexisNexis is a provider of legal, tax, regulatory, news, and business information and analysis to the legal, corporate, government, accounting, and academic markets.
LexisNexis has been in business since 1977 and has over 30,000 employees worldwide. LexisNexis Risk is the division of LexisNexis that focuses on data, Big Data processing, linking, and vertical expertise. It supports HPCC Systems as an open source project under the Apache 2.0 License.
http://hpccsystems.com/
3. Comparison
Hadoop: block based; Java; petabytes; 1-80,000 jobs/day; since 2005
HPCC Systems (Thor, Roxie): file based; C++; exabytes; since 2000
Indexed: 2K-3K jobs/sec*
In-memory: 30-40 jobs/min*
Non-indexed: 4-1,040,000 jobs/day
*based on job (size / result set / complexity)
4. Non-Indexed Full Data Set
1 20
Customers Development Business
http://hpccsystems.com/why-hpcc/benchmarks
5. Cluster Architecture
“I’m sub-second fast.”
“I can query all or part of your data.”
Thor: single threaded; hard disk; index (optional)
Roxie: multi-threaded; hard disk, in-memory, SSD, or either/both; index (optional)
6. How do the platforms handle the same data?
Example
300GB File
Name State Age
Kevin CA 45
Mark MI 27
Sara FL 64
Customer Data May 2010
7. Store Data
Name Node
Data Nodes: big blocks a, b, c
Kevin CA 45
Mark MI 27
Sara FL 64
Data is stored in random blocks.
8. Store Data
Name Node
block a = server 1
block b = server 2
block c = server 3
Data Nodes: big blocks a, b, c
Kevin CA 45
Mark MI 27
Sara FL 64
Block locations are stored in memory.
9. Store Data
Kevin CA 45
Mark MI 27
Sara FL 64
Data is distributed evenly across the cluster with replica copies and is seen as one file (example below).
K.. CA 45 M.. MI 27 S.. FL 64
Thor Master
Thor Slaves
File Name
~/customers_2010-05
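The even distribution with replica copies can be sketched in Python. This is only an illustration, not HPCC Systems code: the function name `distribute`, the round-robin placement, and the choice to put each replica on the next node are all assumptions made for the example.

```python
from collections import defaultdict


def distribute(records, num_nodes, replicas=1):
    """Spread records round-robin over nodes and copy each to the next node(s).

    Hypothetical placement policy, chosen only for illustration.
    """
    primary = defaultdict(list)  # node -> records it owns
    copies = defaultdict(list)   # node -> replica records it holds
    for i, record in enumerate(records):
        owner = i % num_nodes
        primary[owner].append(record)
        for r in range(1, replicas + 1):
            copies[(owner + r) % num_nodes].append(record)
    return dict(primary), dict(copies)


records = [("Kevin", "CA", 45), ("Mark", "MI", 27), ("Sara", "FL", 64)]
primary, copies = distribute(records, num_nodes=3)
for node in sorted(primary):
    print(node, primary[node], copies.get(node, []))
```

Each node ends up with one record of its own and one replica of a neighbour's record, so losing a single node does not lose data in this toy layout.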
10. Store Data
Kevin CA 45
Mark MI 27
Sara FL 64
File locations are stored on disk.
File Location & Job Scheduler
K.. CA 45 M.. MI 27 S.. FL 64
Thor Master
Thor Slaves
Dali
File Name
~/customers_2010-05
11. What state do most people live in?
Blocks a, b, and c are scanned for the wanted data.
Name Node
Data Nodes
12. What state do most people live in?
Name Node
Data Nodes: blocks a, b, c
Mapper
CA 1
FL 1
MI 1
FL 1
CA 1
MI 1
MI ..
Found data is sent to Mapper(s) as key/value pairs and stored.
13. What state do most people live in?
Name Node
Data Nodes: blocks a, b, c
Mapper
CA 1
FL 1
MI 1
FL 1
CA 1
MI 1
MI ..
Reducer
CA 120
MI 500
FL 7
Stored data is sent to Reducer(s) to be aggregated.
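The map and reduce steps can be sketched in a few lines of Python. This is a toy, single-process illustration of the idea, not Hadoop code: the `blocks` layout and the `mapper` and `reducer` function names are assumptions, and only the three sample rows from the example file are used.

```python
from collections import defaultdict

# Sample rows from the example file, one row per block (assumed layout).
blocks = {
    "a": [("Kevin", "CA", 45)],
    "b": [("Mark", "MI", 27)],
    "c": [("Sara", "FL", 64)],
}


def mapper(block):
    """Scan a whole block and emit one (state, 1) pair per record."""
    return [(state, 1) for _name, state, _age in block]


def reducer(pairs):
    """Sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Map phase: every block is scanned and its key/value pairs are collected.
pairs = [pair for block in blocks.values() for pair in mapper(block)]
# Reduce phase: the collected pairs are aggregated per state.
totals = reducer(pairs)
print(totals)
```

In a real cluster the intermediate pairs are written out between the two phases, which is the write load the slides refer to.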
14. What state do most people live in?
Name Node
Data Nodes: blocks a, b, c
Mapper
CA 1
FL 1
MI 1
FL 1
CA 1
MI 1
MI ..
Reducer
CA 120
MI 500
FL 7
SSDs cannot be used in the Mapper or Reducer because of too many writes.
15. What state do most people live in?
File Location & Job Scheduler
Thor Master K CA 45 M MI 27 S FL 64
Thor Slaves
Dali
ESP
1a. A pre-compiled query is triggered. (Mostly used in Roxie)
1b. Ad-hoc query.
2. Query is sent to Dali to get file locations.
16. What state do most people live in?
File Location & Job Scheduler
Thor Master K CA 45 M MI 27 S FL 64
Thor Slaves
Dali
ESP
3. Job is placed in a queue to be sent to Thor Master. Thor Master coordinates job execution on Thor Slave nodes.
17. What state do most people live in?
Thor Master K CA 45 M MI 27 S FL 64
Thor Slaves
Dali
ESP
File Location & Job Scheduler
Jobs are run locally on the slaves and/or coordinated globally by the master.
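The split between local work on the slaves and global coordination by the master can be sketched as follows. This is plain Python standing in for the idea, not ECL or Thor internals: the `parts` layout and the `local_count` and `global_merge` names are hypothetical.

```python
from collections import Counter

# Hypothetical file parts, one per Thor slave.
parts = [
    [("Kevin", "CA", 45)],
    [("Mark", "MI", 27)],
    [("Sara", "FL", 64)],
]


def local_count(part):
    """Work done on one slave, using only the records it holds."""
    return Counter(state for _name, state, _age in part)


def global_merge(local_results):
    """Work coordinated by the master: combine the per-slave counts."""
    total = Counter()
    for counts in local_results:
        total.update(counts)
    return total


totals = global_merge(local_count(part) for part in parts)
# Sort by count, highest first.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Only the small per-slave counts travel to the master here; the records themselves stay where they are stored.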
18. What state do most people live in?
Thor Master K CA 45 M MI 27 S FL 64
Thor Slaves
Dali
ESP
MI 500
CA 120
FL 7
File Location & Job Scheduler
4. The job result is returned, optionally grouped and sorted at run time.
19. What state do most people live in?
Thor Master K CA 45 M MI 27 S FL 64
Thor Slaves
Dali
ESP
MI 500
CA 120
FL 7
File Location & Job Scheduler
SORT, GROUP, DEDUP, JOIN, MERGE, BETWEEN, LENGTH, REGEX, ROUND, SUM, COUNT, TRIM, WHEN, AVE, CASE, NORMALIZE, DENORMALIZE, K-MEANS, more ...
Many other actions can be performed on the data in a single job.
20. Closer Look at Finding Data
The full block is scanned to find your data.
Blocks can be many terabytes in size.
K CA 45
a b c
21. Closer Look at Finding Data
K CA 45
a b c
CA , 1
When data is found, it is sent to the mapper.
22. Closer Look at Finding Data
K CA 45
a b c
Name State Age
Data location is known.
“Apply Schema on Read” happens at query time.
Data is processed locally.
23. Closer Look at Finding Data
K CA 45
a b c
File size can range from a few bytes to 4 exabytes, with no limit on the total number of files that can be stored.
24. Speed
2013
Memory: 128GB - 1TB
Disk: 8TB - 16TB or more
Only 1.5 - 12.5% of the data is in memory, and only recently used data is kept there.
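The 1.5 - 12.5% range can be reproduced with a short calculation. The sketch assumes that 128GB - 1TB is the memory size, that the percentages are taken against 8TB of disk, and that 1TB = 1024GB; none of these assumptions is stated on the slide.

```python
GB = 1
TB = 1024 * GB  # assumption: binary units


def memory_share(memory, disk):
    """Percentage of the on-disk data that fits in memory."""
    return 100 * memory / disk


low = memory_share(128 * GB, 8 * TB)   # smallest memory, 8TB disk
high = memory_share(1 * TB, 8 * TB)    # largest memory, 8TB disk
print(f"{low}% to {high}%")
```

Under these assumptions the low end is 1.5625%, which the slide rounds to 1.5%, and the high end is 12.5%. With 16TB of disk both shares would be half as large.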
25. Speed - Part 1
File Name
~/customers_2010-05
Kevin CA 45
Mark MI 27
Sara FL 64
File Name
~/customers_2010-05_index
• index per file
• customize by field(s)
Thor Master K CA 45 M MI 27 S FL 64
Thor Slaves
CA row #3
MI row #17
MI row #4
FL row #5
Indexing
Index Index Index
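The idea of a per-file index on chosen field(s) can be sketched in Python. This is a toy in-memory structure, not the HPCC Systems index format: `build_index`, `lookup`, and the `FIELDS` mapping are hypothetical names, and row numbers start at 0.

```python
from collections import defaultdict

rows = [
    ("Kevin", "CA", 45),
    ("Mark", "MI", 27),
    ("Sara", "FL", 64),
]
FIELDS = {"name": 0, "state": 1, "age": 2}


def build_index(rows, field):
    """Map each value of `field` to the row numbers where it occurs."""
    column = FIELDS[field]
    index = defaultdict(list)
    for row_number, row in enumerate(rows):
        index[row[column]].append(row_number)
    return dict(index)


def lookup(rows, index, value):
    """Fetch matching rows through the index instead of scanning all rows."""
    return [rows[n] for n in index.get(value, [])]


state_index = build_index(rows, "state")
print(lookup(rows, state_index, "MI"))
```

A query on the indexed field touches only the rows listed for that value, which is why indexed jobs can avoid a full scan.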
27. Example Index
1 40
Non-Indexed
1 200+
To
Indexed
male row #345
female row #4
male row #97
female row #267
CA row #3
MI row #17
MI row #4
FL row #5
28. Speed - Part 2
Roxie
Index In-Memory
Roxie Master K CA 45 M MI 27 S FL 64
Index Index Index
Roxie Slaves
29. Speed - Part 2
Roxie
Index In-Memory
or
Index In-Memory & Part or All Data
Index Index Index
Roxie Master K CA 45 M MI 27 S FL 64
Roxie Slaves
30. Speed - Part 2
Roxie
Index In-Memory
or
Index In-Memory & Part or All Data
Roxie is Multi-Threaded
Index Index Index
Roxie Master K CA 45 M MI 27 S FL 64
Roxie Slaves
31. Speed - Part 2
Roxie
Index In-Memory
or
Index In-Memory & Part or All Data
Roxie is Multi-Threaded
Index Index Index
Roxie Master K CA 45 M MI 27 S FL 64
Roxie Slaves
SSDs are OK: few writes, many reads.
32. Speed - Part 2
Roxie
Index In-Memory
or
Index In-Memory & Part or All Data
Roxie is Multi-Threaded
Index Index Index
Roxie Master K CA 45 M MI 27 S FL 64
Roxie Slaves
2004
33. Common Cluster
Dali ESP
Thor Master
Thor Slaves
Roxie Master
Roxie Slaves
Data is mostly unstructured. Use Thor to do ETL & create indexes. Send results to Roxie for user queries.
34. High Speed Cluster
Dali ESP
Roxie Master
Roxie Slaves
Data is mostly structured. Main goal is to have fast queries all the time.
35. Storage Cluster
Dali ESP
Thor Master
Thor Slaves
Data is structured or unstructured. Main goal is to store lots of data and query it using indexes on all or part of the data in the cluster.
36. Complex or Multi-Step Queries
Name Node
Data Nodes / Task Tracker: blocks a, b, c
Mapper
Reducer
Job Tracker
Job Tracker coordinates multi-step jobs.
37. Job Tracker
3 hours 1 hour 1 hour 6 hours
CA 120
MI 500
FL 7
Food 31
Water 99
Candy 84
Wed 80
Fri 73
Sun 96
1 2 3
4 5 6
7 8 9
1 hour
Sum 80
Count 73
38. How do I Query HPCC Systems?
ECL (Enterprise Control Language) is a C++ based query language for use with the HPCC Systems Big Data platform. ECL's syntax and format are simple and easy to learn.
Note: ECL is very similar to Hadoop's Pig, but more expressive and feature-rich.
39. ECL (Enterprise Control Language)
C++ based query language
SQL w/ JOINS
Map/Reduce
GraphDB
Machine Learning
Simple to Complex Queries
40. Query is Completed in a Single Job Asynchronously
Count
Sort
Group
Classification
Country = ‘US’
Country = ‘US’
Join
Index of
~/facebook_2013
~/twitter_2013
~/facebook_2013
(ROXIE) 0.27 seconds to (THOR) few hours
SORT, GROUP, DEDUP, JOIN, MERGE, BETWEEN, LENGTH, REGEX, ROUND, SUM, COUNT, TRIM, WHEN, AVE, CASE, NORMALIZE, DENORMALIZE, K-MEANS, more ...
41. Machine Learning Built-in
http://hpccsystems.com/ml
Regression: Linear Regression
Classification: Naive Bayes, Perceptron, Decision Trees, Logistic Regression
Clustering: K-Means, KD Trees, Agglomerative/Hierarchical
Association Analysis: AprioriN, EclatN, Rules
Michael Payne, of Clemson University, on high speed machine learning with PB-BLAS in HPCC Systems.
http://youtu.be/s_HWlMwi6iI
43. Un-Structured Data
Lorem Ipsum is simply dummy text of the printing
Regular Expression in C++ or Pattern Match in ECL
Regular Expression in Java
Reg Ex+ +
meta data
stored only
Filtered Data
+
Indexes
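Filtering unstructured text with a regular expression and keeping only metadata about the matches can be sketched as below. Python's `re` module is used here in place of the C++, Java, or ECL pattern matching named on the slide, and the pattern itself (capitalised words) is a hypothetical stand-in for whatever is actually being searched for.

```python
import re

text = "Lorem Ipsum is simply dummy text of the printing"

# Hypothetical pattern: capitalised words stand in for the items of interest.
pattern = re.compile(r"\b[A-Z][a-z]+\b")

# Keep only metadata about each match (what matched and where),
# not a second copy of the full text.
metadata = [
    {"match": m.group(), "start": m.start(), "end": m.end()}
    for m in pattern.finditer(text)
]
for entry in metadata:
    print(entry)
```

The resulting list is small and structured, so it can be indexed like any other file.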
44. Full Text Search
Lorem Ipsum is
simply dummy text
of the printing
Pattern Match in ECL
and
Reg Ex + or
46. “I want sub-second speed but made an investment in HDFS.”
Roxie Master K CA 45 M MI 27 S FL 64
Index Index Index
Roxie Slaves
Blocks: a b c
Hadoop / HPCC Transport Plug-in
Name Node
Data Nodes / Task Tracker
http://hpccsystems.com/products-and-services/products/modules/hadoop-integration
47. Migrating from Hadoop to HPCC Systems
Roxie Master K CA 45 M MI 27 S FL 64
Index Index Index
Roxie Slaves
Name Node
Data Nodes / Task Tracker
Thor Master
Thor Slaves
Slowly replace Hadoop with Thor.
49. HPCC Systems Security
User / Group Authentication
Third Party Authentication
Kerberos OK
Encrypt Data on Disk optional
50. For More HPCC “How To’s”
Go to SlideShare
http://www.slideshare.net/FujioTurner/
51. Watch how to install HPCC Systems in 5 Minutes
Download HPCC Systems Open Source Community Edition
http://hpccsystems.com/download/
http://www.youtube.com/watch?v=8SV43DCUqJg
or
Source Code
https://github.com/hpcc-systems