Big Data Architectural Overview
HPCC Systems vs Hadoop
By Fujio Turner (@FujioTurner)
Who is LexisNexis? What is HPCC Systems?
LexisNexis is a provider of legal, tax, regulatory, news, and business information and analysis to the legal, corporate, government, accounting, and academic markets.

LexisNexis has been in business since 1977 and has over 30,000 employees worldwide.
LexisNexis Risk is the division of LexisNexis that focuses on data, Big Data processing, linking, and vertical expertise. It supports HPCC Systems as an open source project under the Apache 2.0 License.
http://hpccsystems.com/
Comparison
Hadoop | HPCC Systems (Thor, Roxie)
Block based | File based
Java | C++
Petabytes | Exabytes
1-80,000 jobs/day | Indexed: 2K-3K jobs/sec*; In-memory: 30-40 jobs/min*; Non-indexed: 4-1,040,000 jobs/day
Since 2005 | Since 2000
*based on job (size / result set / complexity)
[Benchmark chart: Non-Indexed, Full Data Set; scale 1 to 20; labels: Customers, Development, Business]
http://hpccsystems.com/why-hpcc/benchmarks
“I’m sub-second 
fast.” 
“I can query all 
or part of your 
data.” 
Cluster Architecture
Thor: Single Threaded; Hard Disk; Index (optional)
Roxie: Multi-Threaded; Hard Disk; Index (optional); In-memory; SSD; Either/Both
How do the platforms handle the same data?
Example: 300GB File (Customer Data, May 2010)
Name State Age
Kevin CA 45
Mark MI 27
Sara FL 64
Store Data
Name Node
Data Nodes: big blocks a, b, c
Kevin CA 45
Mark MI 27
Sara FL 64
Data is stored in random blocks.
Store Data
Name Node
block a = server 1
block b = server 2
block c = server 3
Data Nodes: big blocks a, b, c
Kevin CA 45
Mark MI 27
Sara FL 64
Block locations are stored in memory.
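The block-to-server table on this slide can be pictured as a simple in-memory lookup. The following is only a minimal sketch: the dictionary, the `locate` helper, and the server names are hypothetical illustrations, not the Name Node's actual data structures.

```python
# Hypothetical in-memory block map, mirroring "block a = server 1" on the slide.
block_locations = {"a": "server 1", "b": "server 2", "c": "server 3"}


def locate(block_id):
    """Return the data node holding a block, or None if the block is unknown."""
    return block_locations.get(block_id)


print(locate("b"))
```

Because the whole map lives in memory, a lookup is fast, but the map has to fit in the memory of the node that holds it.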
Store Data 
Kevin CA 45 
Mark MI 27 
Sara FL 64 
Data is distributed 
evenly in the cluster 
with replica copies 
and is seen as a 
file (example below). 
K.. CA 45 M.. MI 27 S.. FL 64 
Thor Master 
Thor Slaves 
File Name 
~/customers_2010-05
Store Data 
Kevin CA 45 
Mark MI 27 
Sara FL 64 
File locations are 
stored on disk. 
File Location & Job Scheduler 
K.. CA 45 M.. MI 27 S.. FL 64 
Thor Master 
Thor Slaves 
Dali 
File Name 
~/customers_2010-05
What state do most people live in?
Name Node
Data Nodes: blocks a, b, c
Blocks are scanned for wanted data.
What state do most people live in?
Name Node
Data Nodes: blocks a, b, c
Mapper: CA 1, FL 1, MI 1, FL 1, CA 1, MI 1, MI ..
Found data is sent to Mapper(s) in Key/Value pairs and stored.
What state do most people live in?
Name Node
Data Nodes: blocks a, b, c
Mapper: CA 1, FL 1, MI 1, FL 1, CA 1, MI 1, MI ..
Reducer: CA 120, MI 500, FL 7
Stored data is sent to Reducer(s) to be aggregated.
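The Mapper and Reducer steps for the "What state do most people live in?" question can be sketched in a few lines. This is an illustration in plain Python, not Hadoop code: the function names are hypothetical, and only the first three rows come from the slides (the rest are invented so that one state wins).

```python
from collections import defaultdict

# First three rows come from the slides; the rest are invented for illustration.
rows = [
    ("Kevin", "CA", 45),
    ("Mark", "MI", 27),
    ("Sara", "FL", 64),
    ("Ann", "MI", 33),
    ("Luis", "CA", 52),
    ("Joy", "MI", 41),
]


def mapper(row):
    """Emit a (state, 1) key/value pair for one record."""
    _name, state, _age = row
    return (state, 1)


def reducer(pairs):
    """Sum the values for each key."""
    totals = defaultdict(int)
    for state, count in pairs:
        totals[state] += count
    return dict(totals)


pairs = [mapper(row) for row in rows]      # map phase: one pair per record
totals = reducer(pairs)                    # reduce phase: aggregate per state
top_state = max(totals, key=totals.get)    # state with the highest count
print(totals, top_state)
```

In a real cluster the intermediate pairs are written out between the two phases, which is the "stored" step the slide refers to.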
What state do most people live in?
Name Node
Data Nodes: blocks a, b, c
Mapper: CA 1, FL 1, MI 1, FL 1, CA 1, MI 1, MI ..
Reducer: CA 120, MI 500, FL 7
Cannot use SSD in Mapper or Reducer due to too many writes.
What state do most people live in?
Thor Master
Thor Slaves: K CA 45 | M MI 27 | S FL 64
Dali (File Location & Job Scheduler)
ESP
1a. A pre-compiled query is triggered. (Mostly used in Roxie)
1b. Ad-hoc query.
2. Query is sent to Dali to get file locations.
What state do most people live in?
Thor Master
Thor Slaves: K CA 45 | M MI 27 | S FL 64
Dali (File Location & Job Scheduler)
ESP
3. Job is placed in a queue to be sent to the Thor Master. The Thor Master coordinates job execution on the Thor Slave nodes.
What state do most people live in?
Thor Master
Thor Slaves: K CA 45 | M MI 27 | S FL 64
Dali (File Location & Job Scheduler)
ESP
Jobs are done locally on the slaves and/or coordinated globally by the master.
What state do most people live in?
Thor Master
Thor Slaves: K CA 45 | M MI 27 | S FL 64
Dali (File Location & Job Scheduler)
ESP
4. The job result is returned, optionally grouped and sorted at run time.
MI 500
CA 120
FL 7
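The local-then-global flow described on these slides (work done on each slave, then coordinated by the master, then returned sorted) might look like the sketch below. It is plain Python rather than ECL, and the partition contents are hypothetical.

```python
from collections import Counter

# Hypothetical per-slave partitions of the customer file (state column only).
slave_partitions = [
    ["CA", "MI", "MI"],
    ["MI", "FL", "CA"],
    ["MI", "MI", "CA"],
]

# Local step: each slave counts the states in its own partition.
local_counts = [Counter(partition) for partition in slave_partitions]

# Global step: the master merges the partial counts.
merged = Counter()
for partial in local_counts:
    merged.update(partial)

# Sort by count, largest first, matching the MI / CA / FL ordering on the slide.
result = sorted(merged.items(), key=lambda item: item[1], reverse=True)
print(result)
```

Only the small per-slave counts travel to the master, not the raw rows.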
What state do most people live in?
Thor Master
Thor Slaves: K CA 45 | M MI 27 | S FL 64
Dali (File Location & Job Scheduler)
ESP
MI 500
CA 120
FL 7
Multiple other actions can be done on the data in a single job:
SORT
GROUP
DEDUP
JOIN
MERGE
BETWEEN
LENGTH
REGEX
ROUND
SUM
COUNT
TRIM
WHEN
AVE
CASE
NORMALIZE
DENORMALIZE
K-MEANS
more ...
Closer Look at Finding Data
The full block is scanned to find your data.
Blocks can be many terabytes in size.
a, b, c: K CA 45
Closer Look at Finding Data
a, b, c: K CA 45
When data is found, it is sent to the mapper.
CA, 1
Closer Look at Finding Data
a, b, c: K CA 45 (Name State Age)
Data location is known.
"Apply Schema on Read" at query time.
Data is processed locally.
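"Apply Schema on Read" means the stored lines stay untyped and the field names and types are attached only when a query runs. A minimal sketch, assuming whitespace-separated records like the ones on the slides; the `schema` list and `read_with_schema` helper are hypothetical names, not HPCC Systems APIs.

```python
# Raw lines are stored untyped; the schema is applied only when a query reads them.
raw_lines = ["Kevin CA 45", "Mark MI 27", "Sara FL 64"]

# Hypothetical schema: (field name, type converter), following the Name/State/Age header.
schema = [("Name", str), ("State", str), ("Age", int)]


def read_with_schema(line, schema):
    """Split one raw line and cast each field according to the schema."""
    values = line.split()
    return {name: cast(value) for (name, cast), value in zip(schema, values)}


records = [read_with_schema(line, schema) for line in raw_lines]
over_40 = [record["Name"] for record in records if record["Age"] > 40]
print(over_40)
```

A different query could apply a different schema to the same stored lines without rewriting the data.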
Closer Look at Finding Data
File size can be a few bytes to 4 exabytes, with no limit on the total number of files that can be stored.
a, b, c: K CA 45
Speed
Memory: 128GB - 1TB
Disk: 8TB - 16TB or more
(2013)
Only 1.5 - 12.5% of the data is in memory, and only recently used data is kept there.
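The 1.5 - 12.5% range appears consistent with dividing the 128GB - 1TB figures by the 8TB figure. The sketch below checks that arithmetic; reading the first range as memory and the second as disk, and using binary units, are both assumptions.

```python
GB_PER_TB = 1024  # binary units, an assumption


def memory_share(memory_gb, disk_tb):
    """Percentage of on-disk data that would fit in memory."""
    return 100 * memory_gb / (disk_tb * GB_PER_TB)


low = memory_share(128, 8)     # 128GB of memory against 8TB of disk
high = memory_share(1024, 8)   # 1TB of memory against 8TB of disk
print(low, high)
```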
Speed - Part 1 
File Name 
~/customers_2010-05 
Kevin CA 45 
Mark MI 27 
Sara FL 64 
File Name 
~/customers_2010-05_index 
• index per file 
• customize by field(s) 
Thor Master K CA 45 M MI 27 S FL 64 
Thor Slaves 
CA row #3 
MI row #17 
MI row #4 
FL row #5 
Indexing 
Index Index Index
[Chart: Non-Indexed to Indexed; scales 1 to 40 and 1 to 200+]
Example Index: male row #345, female row #4, male row #97, female row #267
Example Index: CA row #3, MI row #17, MI row #4, FL row #5
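An index of this kind maps each field value to the row numbers that contain it, so a query can jump to matching rows instead of scanning the file. A minimal sketch using the state entries from the example index; `build_index` is a hypothetical helper, not an HPCC Systems function.

```python
from collections import defaultdict

# Row number -> state, taken from the example index entries on the slide.
rows = {3: "CA", 4: "MI", 5: "FL", 17: "MI"}


def build_index(rows):
    """Map each field value to the sorted list of row numbers containing it."""
    index = defaultdict(list)
    for row_number in sorted(rows):
        index[rows[row_number]].append(row_number)
    return dict(index)


state_index = build_index(rows)
print(state_index["MI"])
```

Looking up "MI" returns the matching row numbers directly, which is why indexed queries can avoid a full scan.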
Speed - Part 2
Roxie
Index In-Memory
or
Index In-Memory & Part or All Data
Roxie is Multi-Threaded
SSDs are OK - write few / read many
Roxie Master
Roxie Slaves: K CA 45 | M MI 27 | S FL 64, each with an Index
2004
Thor Master 
Common Cluster 
Dali ESP 
Thor Slaves 
Roxie Master 
Roxie Slaves 
Data is mostly 
unstructured. Use Thor to 
do ETL & create indexes. 
Send results to Roxie for 
user queries.
High Speed Cluster 
Dali ESP 
Roxie Master 
Data is mostly structured. 
Main goal is to have fast 
queries all the time. 
Roxie Slaves
Thor Master 
Storage Cluster 
Dali ESP 
Data is structured or unstructured.
Main goal is to store lots of data
and query using indexes on all or
part of the data in the cluster.
Thor Slaves
Complex or Multi-Step Queries
Name Node
Job Tracker
Data Nodes / Task Tracker: blocks a, b, c
Mapper
Reducer
Job Tracker coordinates multi-step jobs.
Job Tracker 
3 hours 1 hour 1 hour 6 hours
CA 120 
MI 500 
FL 7 
Food 31 
Water 99 
Candy 84 
Wed 80 
Fri 73 
Sun 96 
1 2 3 
4 5 6 
7 8 9 
1 hour
Sum 80 
Count 73
How do I Query HPCC Systems?
ECL (Enterprise Control Language) is a C++ based query language for use with the HPCC Systems Big Data platform. ECL's syntax and format are simple and easy to learn.

Note - ECL is very similar to Hadoop's Pig, but more expressive and feature rich.
ECL (Enterprise Control Language) 
C++ based query language 
SQL w/ JOINS 
Map/Reduce 
GraphDB 
Machine 
Learning 
Simple to Complex Queries
Query is Completed in a Single Job! 
Asynchronously 
Count 
Sort 
Group 
Classification 
Country = ‘US’ 
Country = ‘US’ 
Join 
Index of 
~/facebook_2013 
~/twitter_2013 
~/facebook_2013 
(ROXIE) 0.27 seconds to (THOR) a few hours
SORT
GROUP
DEDUP
JOIN
MERGE
BETWEEN
LENGTH
REGEX
ROUND
SUM
COUNT
TRIM
WHEN
AVE
CASE
NORMALIZE
DENORMALIZE
K-MEANS
more ...
Machine Learning Built-in
http://hpccsystems.com/ml
Regression: Linear Regression
Classification: Naive Bayes, Perceptron, Decision Trees, Logistic Regression
Clustering: K-Means, KD Trees, Agglomerative/Hierarchical
Association Analysis: AprioriN, EclatN, Rules
Michael Payne, of Clemson University, on high speed machine learning with PB-BLAS in HPCC Systems.
http://youtu.be/s_HWlMwi6iI
Un-Structured Data?
Example: 300GB File (lots of text)
Lorem Ipsum is simply dummy text of the printing
Un-Structured Data
Lorem Ipsum is simply dummy text of the printing
HPCC Systems: Regular Expression in C++ or Pattern Match in ECL
Hadoop: Regular Expression in Java
Reg Ex + meta data stored only
Filtered Data + Indexes
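The idea of running a regular expression over unstructured text and keeping only the filtered matches plus some metadata can be sketched as follows. Python's `re` module stands in here for the C++, Java, or ECL pattern matching named on the slide; the sample text, the e-mail pattern, and the `offset` metadata field are all invented for illustration.

```python
import re

# Hypothetical unstructured input; the first sentence echoes the slide's sample text.
text = (
    "Lorem Ipsum is simply dummy text of the printing industry. "
    "Contact kevin@example.com or sara@example.com for details."
)

# Illustrative pattern: pull out e-mail-like tokens and ignore everything else.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

# Keep only the matched values plus a little metadata (where each match was found).
filtered = [
    {"value": match.group(0), "offset": match.start()}
    for match in EMAIL.finditer(text)
]
print([entry["value"] for entry in filtered])
```

The filtered records are small and structured, so they can then be indexed like any other data set.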
Full Text Search
Lorem Ipsum is simply dummy text of the printing
Pattern Match in ECL and/or Reg Ex
Management & Administration 
vs 
More Moving Parts = More Downtime
"I want sub-second speed but made an investment in HDFS."
Roxie Master
Roxie Slaves: K CA 45 | M MI 27 | S FL 64, each with an Index
Hadoop / HPCC Transport Plug-in
Name Node
Data Nodes / Task Tracker: blocks a, b, c
http://hpccsystems.com/products-and-services/products/modules/hadoop-integration
Migrating from Hadoop to HPCC Systems 
Roxie Master K CA 45 M MI 27 S FL 64 
Index Index Index 
Roxie Slaves 
Name Node 
Data Nodes / Task Tracker 
Thor Master 
Thor Slaves 
Slowly replace Hadoop with Thor.
Alternative Query Methods
HPCC Systems Security 
User / Group Authentication 
Third Party Authentication 
Kerberos OK 
Encrypt Data on Disk optional
For More HPCC "How To's"
Go to SlideShare: http://www.slideshare.net/FujioTurner/
Watch how to install HPCC Systems in 5 Minutes: http://www.youtube.com/watch?v=8SV43DCUqJg
Download HPCC Systems Open Source Community Edition: http://hpccsystems.com/download/
or Source Code: https://github.com/hpcc-systems

Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

HPCC Systems vs Hadoop

  • 1. Big Data Architectural Overview By Fujio Turner VS @FujioTurner
  • 2. Who is LexisNexis? What is HPCC Systems? LexisNexis is a provider of legal, tax, regulatory, news, business information, and analysis to legal, corporate, government, accounting and academic markets. LexisNexis has been in business since 1977 with over 30,000 employees worldwide. LexisNexis Risk is the division of LexisNexis which focuses on data, Big Data processing, linking and vertical expertise and supports HPCC Systems as an open source project under Apache 2.0 License. http://hpccsystems.com/
  • 3. Comparison. Hadoop: Block Based, Java, Petabytes, 1-80,000 Jobs/day, Since 2005. HPCC Systems (Thor, Roxie): File Based, C++, Exabytes, Since 2000; Indexed: 2K-3K Jobs/sec*; In-Memory: 30 - 40 Jobs/min*; Non-Indexed: 4-1,040,000 Jobs/day. *Based on job (size / result set / complexity)
  • 4. Non-Indexed Full Data Set 1 20 Customers Development Business http://hpccsystems.com/why-hpcc/benchmarks
  • 5. “I’m sub-second fast.” “I can query all or part of your data.” Cluster Architecture Thor Roxie Single Threaded Hard Disk Index(optional) Multi-Threaded Hard Disk Index(optional) In-memory SSD Either/Both
  • 6. How do the platforms handle the same data? Example: 300GB File, Customer Data May 2010. Name State Age: Kevin CA 45, Mark MI 27, Sara FL 64
  • 7. Store Data. Name Node, Data Nodes, big blocks a, b, c. Kevin CA 45, Mark MI 27, Sara FL 64. Data is stored in random blocks.
  • 8. Store Data. Name Node: block a = server 1, block b = server 2, block c = server 3. Data Nodes, big blocks a, b, c. Kevin CA 45, Mark MI 27, Sara FL 64. Block locations are stored in memory.
  • 9. Store Data Kevin CA 45 Mark MI 27 Sara FL 64 Data is distributed evenly in the cluster with replica copies and is seen as a file (example below). K.. CA 45 M.. MI 27 S.. FL 64 Thor Master Thor Slaves File Name ~/customers_2010-05
  • 10. Store Data Kevin CA 45 Mark MI 27 Sara FL 64 File locations are stored on disk. File Location & Job Scheduler K.. CA 45 M.. MI 27 S.. FL 64 Thor Master Thor Slaves Dali File Name ~/customers_2010-05
  • 11. What state do most people live in? Name Node, Data Nodes (blocks a, b, c). Blocks are scanned for wanted data.
  • 12. What state do most people live in? Name Node, Data Nodes (blocks a, b, c), Mapper: CA 1, FL 1, MI 1, FL 1, CA 1, MI 1, MI .. Found data is sent to Mapper(s) in Key/Value pairs and stored.
  • 13. What state do most people live in? Name Node, Data Nodes (blocks a, b, c), Mapper: CA 1, FL 1, MI 1, FL 1, CA 1, MI 1, MI .. Reducer: CA 120, MI 500, FL 7. Stored data is sent to Reducer(s) to be aggregated.
  • 14. What state do most people live in? Name Node, Data Nodes (blocks a, b, c), Mapper: CA 1, FL 1, MI 1, FL 1, CA 1, MI 1, MI .. Reducer: CA 120, MI 500, FL 7. Cannot use SSD in Mapper or Reducer due to too many writes.
  • 15. What state do most people live in? ESP, Dali (File Location & Job Scheduler), Thor Master, Thor Slaves: K CA 45, M MI 27, S FL 64. 1a. A pre-compiled query is triggered (mostly used in Roxie). 1b. Ad-hoc query. 2. Query is sent to Dali to get file locations.
  • 16. What state do most people live in? ESP, Dali (File Location & Job Scheduler), Thor Master, Thor Slaves: K CA 45, M MI 27, S FL 64. 3. Job is placed in a queue to be sent to Thor Master. Thor Master coordinates job execution on Thor Slave nodes.
  • 17. What state do most people live in? ESP, Dali (File Location & Job Scheduler), Thor Master, Thor Slaves: K CA 45, M MI 27, S FL 64. Jobs are done locally on slaves and/or coordinated globally by the master.
  • 18. What state do most people live in? ESP, Dali (File Location & Job Scheduler), Thor Master, Thor Slaves: K CA 45, M MI 27, S FL 64. 4. Job is returned, optionally grouped and sorted at run time: MI 500, CA 120, FL 7.
  • 19. What state do most people live in? ESP, Dali (File Location & Job Scheduler), Thor Master, Thor Slaves: K CA 45, M MI 27, S FL 64. Result: MI 500, CA 120, FL 7. SORT, GROUP, DEDUP, JOIN, MERGE, BETWEEN, LENGTH, REGEX, ROUND, SUM, COUNT, TRIM, WHEN, AVE, CASE, NORMALIZE, DENORMALIZE, K-MEANS, more …. Multiple other actions can be done on the data in a single job.
  • 20. Closer Look at Finding Data. Blocks a, b, c; wanted row: K CA 45. Full block is scanned to find your data. Blocks can be many terabytes in size.
  • 21. Closer Look at Finding Data. Blocks a, b, c; wanted row: K CA 45. When data is found, it is sent to the mapper: CA , 1
  • 22. Closer Look at Finding Data. Blocks a, b, c; wanted row: K CA 45. Data location is known. “Apply Schema on Read” during time of query. Data is processed locally. Name State Age
  • 23. Closer Look at Finding Data. Blocks a, b, c; wanted row: K CA 45. File size can be a few bytes to 4 exabytes with no limits on the total number of files that can be stored.
  • 24. Speed. 128GB - 1TB; 8TB - 16TB or more (2013). 1.5 - 12.5% of data is in memory, and only recently used data is in memory.
  • 25. Speed - Part 1 File Name ~/customers_2010-05 Kevin CA 45 Mark MI 27 Sara FL 64 File Name ~/customers_2010-05_index • index per file • customize by field(s) Thor Master K CA 45 M MI 27 S FL 64 Thor Slaves CA row #3 MI row #17 MI row #4 FL row #5 Indexing Index Index Index
  • 26. 1 40 Non-Indexed 1 200 To Indexed
  • 27. Example Index Example Index 1 40 Non-Indexed 1 200+ To Indexed male row #345 female row #4 male row #97 female row #267 CA row #3 MI row #17 MI row #4 FL row #5
  • 28. Speed - Part 2 Roxie Index In-Memory Roxie Master K CA 45 M MI 27 S FL 64 Index Index Index Roxie Slaves
  • 29. Speed - Part 2 Roxie Index In-Memory or Index In-Memory & Part or All Data Index Index Index Roxie Master K CA 45 M MI 27 S FL 64 Roxie Slaves
  • 30. Speed - Part 2 Roxie Index In-Memory or Index In-Memory & Part or All Data Roxie is Multi-Threaded Index Index Index Roxie Master K CA 45 M MI 27 S FL 64 Roxie Slaves
  • 31. Speed - Part 2 Roxie Index In-Memory or Index In-Memory & Part or All Data Roxie is Multi-Threaded Index Index Index Roxie Master K CA 45 M MI 27 S FL 64 Roxie Slaves SSDs are OK - write few / read many
  • 32. Speed - Part 2 Roxie Index In-Memory or Index In-Memory & Part or All Data Roxie is Multi-Threaded Index Index Index Roxie Master K CA 45 M MI 27 S FL 64 Roxie Slaves 2004
  • 33. Thor Master Common Cluster Dali ESP Thor Slaves Roxie Master Roxie Slaves Data is mostly unstructured. Use Thor to do ETL & create indexes. Send results to Roxie for user queries.
  • 34. High Speed Cluster Dali ESP Roxie Master Data is mostly structured. Main goal is to have fast queries all the time. Roxie Slaves
  • 35. Thor Master Storage Cluster Dali ESP Data is structured or unstructured. Main goal is to store lots of data and query using indexes on all or part of the data in the cluster. Thor Slaves
  • 36. Complex or Multi-Step Queries. Name Node, Data Nodes / Task Tracker (blocks a, b, c), Mapper, Reducer, Job Tracker. Job Tracker coordinates multi-step jobs.
  • 37. Job Tracker 3 hours 1 hours 1 hours 6 hours CA 120 MI 500 FL 7 Food 31 Water 99 Candy 84 Wed 80 Fri 73 Sun 96 1 2 3 4 5 6 7 8 9 1 hours Sum 80 Count 73
  • 38. How do I Query HPCC Systems? ECL (Enterprise Control Language) is a C++ based query language for use with the HPCC Systems Big Data platform. ECL's syntax and format are very simple and easy to learn. Note - ECL is very similar to Hadoop’s Pig, but more expressive and feature-rich.
  • 39. ECL (Enterprise Control Language) C++ based query language SQL w/ JOINS Map/Reduce GraphDB Machine Learning Simple to Complex Queries
  • 40. Query is Completed in a Single Job Asynchronously. Count Sort Group Classification Country = ‘US’ Country = ‘US’ Join Index of ~/facebook_2013 ~/twitter_2013 ~/facebook_2013 (ROXIE) 0.27 seconds to (THOR) few hours. SORT, GROUP, DEDUP, JOIN, MERGE, BETWEEN, LENGTH, REGEX, ROUND, SUM, COUNT, TRIM, WHEN, AVE, CASE, NORMALIZE, DENORMALIZE, K-MEANS, more ….
  • 41. Machine Learning Built-in http://hpccsystems.com/ml Regression: Linear Regression. Classification: Naive Bayes, Perceptron, Decision Trees, Logistic Regression. Clustering: K-Means, KD Trees, Agglomerative/Hierarchical. Association Analysis: AprioriN, EclatN, Rules. Michael Payne, of Clemson University, on high speed machine learning with PB-BLAS in HPCC Systems. http://youtu.be/s_HWlMwi6iI
  • 42. Un-Structured Data? Example: Lorem Ipsum is simply dummy text of the printing lots of text. 300GB File
  • 43. Un-Structured Data Lorem Ipsum is simply dummy text of the printing Regular Expression in C++ or Pattern Match in ECL Regular Expression in Java Reg Ex+ + meta data stored only Filtered Data + Indexes
  • 44. Full Text Search Lorem Ipsum is simply dummy text of the printing Pattern Match in ECL and Reg Ex + or
  • 45. Management & Administration vs More Moving Parts = More Downtime
  • 46. “I want sub-second speed but made investment in HDFS.” Roxie Master K CA 45 M MI 27 S FL 64 Index Index Index Roxie Slaves. Hadoop / HPCC Transport Plug-in. Name Node, Data Nodes / Task Tracker (blocks a, b, c). http://hpccsystems.com/products-and-services/products/modules/hadoop-integration
  • 47. Migrating from Hadoop to HPCC Systems Roxie Master K CA 45 M MI 27 S FL 64 Index Index Index Roxie Slaves Name Node Data Nodes / Task Tracker Thor Master Thor Slaves Slowly replace Hadoop with Thor.
  • 49. HPCC Systems Security User / Group Authentication Third Party Authentication Kerberos OK Encrypt Data on Disk optional
  • 50. For More HPCC “How To’s” Go to SlideShare http://www.slideshare.net/FujioTurner/
  • 51. Watch how to install HPCC Systems in 5 Minutes: http://www.youtube.com/watch?v=8SV43DCUqJg Download HPCC Systems Open Source Community Edition: http://hpccsystems.com/download/ or Source Code: https://github.com/hpcc-systems
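
The "What state do most people live in?" walkthrough on slides 11 to 13 (blocks scanned, key/value pairs sent to mappers, values aggregated by reducers) and the per-field index on slide 25 can be sketched in a few lines of Python. This is only an illustration of the ideas on the slides, not Hadoop or HPCC Systems code: the function names, the block size, and the fourth sample row ("Dana") are assumptions made up for the example.

```python
from collections import defaultdict

# Sample rows (name, state, age). The first three come from the deck;
# the fourth row is a hypothetical addition so that one state wins.
ROWS = [
    ("Kevin", "CA", 45),
    ("Mark", "MI", 27),
    ("Sara", "FL", 64),
    ("Dana", "MI", 31),
]


def split_into_blocks(rows, block_size):
    """Cut the rows into fixed-size blocks, as a block-based store would."""
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def mapper(block):
    """Scan one whole block and emit a (state, 1) pair for every row."""
    return [(state, 1) for _name, state, _age in block]


def shuffle(pairs):
    """Group the emitted values by key before they reach the reducers."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Aggregate all values for one key into a single count."""
    return key, sum(values)


def most_common_state(rows, block_size=2):
    """Run the map, shuffle, and reduce steps and pick the largest count."""
    blocks = split_into_blocks(rows, block_size)
    pairs = [pair for block in blocks for pair in mapper(block)]
    counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
    return max(counts, key=counts.get), counts


def build_state_index(rows):
    """Map each state to the row numbers that hold it (cf. 'CA row #3')."""
    index = defaultdict(list)
    for row_number, (_name, state, _age) in enumerate(rows):
        index[state].append(row_number)
    return dict(index)


if __name__ == "__main__":
    state, counts = most_common_state(ROWS)
    print(state, counts)
    print(build_state_index(ROWS))
```

The first half touches every row in every block to answer the question, while the index lets a query for a single state jump straight to the matching row numbers, which is the contrast the indexing slides draw.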