HPCC Systems is a big data platform developed by LexisNexis that uses in-memory indexing to enable sub-second querying of large datasets. It uses the ECL query language and consists of Thor and Roxie components. Data can be loaded into Thor for querying or copied to Roxie, where it can be indexed and queried directly from memory in under a second through the ESP interface using published ECL queries. The document provides examples of loading and querying data using both Thor and Roxie.
Handwritten Text Recognition for manuscripts and early printed texts
Ā
Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems
1. HPCC Systems - Big Data!
!
Roxie -
In-Memory Data & Index ,
Sub-Second Query Cluster
By Fujio Turner
@FujioTurner
2. Who is ? What is HPCC Systems?
LexisNexis is a provider of legal,
tax, regulatory, news, business
information, and analysis to
legal, corporate, government,!
accounting and academic
markets. !
!
LexisNexis has been in
business since 1977 with over
30,000 employees worldwide.
LexisNexis Risk is the division
of the LexisNexis which focuses
on data, Big Data processing,
linking and vertical expertise
and supports HPCC Systems
as an open source project
under Apache 2.0 License.
3. Comparison
Block Based File Based
JAVA C++
Petabytes
1-80,000 Jobs/day
Since 2005
Exabytes
Non-Indexed 4X-13X
Indexed: 2K-3K Jobs/sec
Since 2000
? ? ? ? ? ?
Thor Roxie
4. Non-Indexed Full Data Set
1 20
Customers Development Business
http://hpccsystems.com/why-hpcc/benchmarks
5. How do I Query HPCC Systems?
ECL (Enterprise Control Language) is a C++ based query
language for use with HPCC Systems Big Data platform.
ECLs syntax and format is very simple and easy to learn.!
!
Note - ECL is very similar to Hadoopās pig ,but!
more expressive and feature rich.
6. ECL (Enterprise Control Language)
C++ based query language
SQL w/ JOINS
Map/Reduce
GraphDB
Machine
Learning
Simple to Complex Queries
7. āIām sub-second
fast.ā
āI can query all
or part of your
data.ā
Cluster Architecture
Thor Roxie
Hard Disk
Index(optional)
Hard Disk
Index(optional)
In-memory Index
SSD
Either/Both
Rapid Online XML Inquiry Engine
8. Roxie
Date 2004
Query
Languages ECL
or
In-Memory Index Only
Index w/ Part or All Data
Data Type Normalized
Query
Methods
or DeNormalized
REST Direct TCP
or Unstructured
SOAP
9. Example 1
File Load Into Thor Query
VS
File Load Into Roxie Index Publish Query
10. HPCC Systems Sample Data for Examples 1
Sample Data
http://hpccsystems.com/download/docs/learning-ecl
12. 4.
5.
1. Upload file*!
2. Distribute to cluster!
3. Name of file in cluster!
4. Size of each row!
5. Push to cluster
*2GB file size limit through web
No limit if uploaded via SOAP
Load Data
17. Indexing!
In-Memory
Ex. Creating an index by āLastNameā
allPeople := DATASET(ā~test::originalperson_copyā,
{Layout_People, UNSIGNED8 RecPtr {virtual(fileposition)}},
File Position Number!
pseudo recordID!
āAlter Tableā(new column)
Make Index
Index Filename
FLAT);
rx := INDEX(allPeople,{LastName,RecPtr},ā~test::key_person_copyā,PRELOAD);
BUILDINDEX(rx);
In-Memory
18. Indexing!
In-Memory with Luggage
Index Only
rx := INDEX(allPeople,{LastName,RecPtr},ā~test::key_person_copyā,PRELOAD);
Index + Part or All Data
rx := INDEX(allPeople,{FirstName,MiddleName},{LastName,RecPtr},ā~test::key_person_copyā,PRELOAD);
+
Store Data!
In-Memory!
with Index
19. Query
allPeople := DATASET(ā~test::originalperson_copyā, {Layout_People, UNSIGNED8 RecPtr {virtual(fileposition)}},FLAT);
datax:= INDEX(allPeople,{LastName,RecPtr},ā~test::key_person_copyā);
WHERE `LastName` = āSmithā from Index
filterdata; //Output
w/ Index
Data
Query
filterdata:= FETCH(allPeople,datax(LastName=āSmithā),RIGHT. RecPtr);
20. āIām sub-second
fast.ā Publish Your Code
What Is publishing your code?
ECL is built from C++. So your ECL needs to be
compiled before it runs.When you publish a
query, you pre-compile your ECL and send it to
the ESP server where it will be stored. ESP, on
port 8002 , will listen to any requests and execute
the published query.
Can I do ad-hoc queries without publishing?
Yes, but it will not be sub-second fast as it is not
pre-compiled.
22. How to Publish Your ECL
1.Select your ad-hoc ālastnameā query from before
2.Name you query
ālastnameā
3. Publish
23. ESP on port 8002
Enterprise Service Platform
1.Select ālastnameā query
2.Select your data output format
āOutput JSONā 3.Click Submit button
26. 2013-06-08
2013-06-07
Twitter
2013-06-06
Twitter
Twitter
Use this file
name to get
all the real files
2013-06
Twitter
2013-06-06
ā¦ā¦ā¦.. -07
ā¦ā¦ā¦.. -08
Logical File
Real File
SuperFile!
Organizing Your Files
+ Append new real files on the fly
27. 2013-06-08
2013-06-07
Twitter
2013-06-06
Twitter
Twitter
SuperKeys No Sub-Super
2013
Twitter
Files or Keys
in Roxie
Organizing Your Indexes
28. For More HPCC!
āHow Toāsā!
Go to SlideShare
http://www.slideshare.net/FujioTurner/
29. Watch how to install
HPCC Systems
in 5 Minutes
Download HPCC Systems
Open Source
Community Edition
http://hpccsystems.com/download/
http://www.youtube.com/watch?v=8SV43DCUqJg
or
Source Code
https://github.com/hpcc-systems