SlideShare a Scribd company logo
Sector vs. Hadoop A Brief Comparison Between the Two Systems
Background Sector is a relatively “new” system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector simply another clone of GFS/MapReduce? No. These slides compare most recent versions of Sector and Hadoop as of Nov. 2010. Both software are still under active development. If you find any inaccurate information, please contact Yunhong Gu [yunhong.gu # gmail]. We will try to keep these slides accurate and up to date.
Design Goals Sector Hadoop Three-layer functionality: Distributed file system Data collection, sharing, and distribution (over wide area networks) Massive in-storage parallel data processing with simplified interface Two-layer functionality: Distributed file system Massive in-storage parallel data processing with simplified interface
History Sector Hadoop Started around 2005 - 2006, as a P2P file sharing and content distribution system for extreme large scientific datasets. Switched to centralized general purpose file system in 2007. Introduced in-storage parallel data processing in 2007. First “modern” version released in 2009. Started as a web crawler & indexer, Nutch, that adopted GFS/MapReduce between 2004 – 2006. Y! took over the project in 2006. Hadoop split from Nutch. First “modern” version released in 2008.
Architecture Sector Hadoop Master-slave system Masters store metadata, slaves store data Multiple active masters Clients perform IO directly with slaves Master-slave system Namenode stores metadata, datanodes store data Single namenode (single point of failure) Clients perform IO directly with datanodes
Distributed File System Sector HDFS General purpose IO Optimized for large files File based (file not split by Sector but users have to take care of it) Use replication for fault tolerance Write once read many (no random write) Optimized for large files Block based (64MB block as default) Use replication for fault tolerance
Replication Sector HDFS System level default in configuration Per-file replica factor can be specified in a configuration file and can be changed at run-time Replicas are stored as far away as possible, but within a distance limit, configurable at per-file level File location can be limited to certain clusters only System level default in configuration Per-file replica factor is supported during file creation For 3 replicas (default), 2 on the same rack, the 3rd on a different rack
Security Sector Hadoop A Sector security server is used to maintain user credentials and permission, server ACL, etc. Security server can be extended to connect to other sources, e.g., LDAP Optional file transfer encryption UDP-based hole punching firewall traversing for clients Still in active development, new features in 2010 Kerberos/token based security framework to authenticate users No file transfer encryption
Wide Area Data Access Sector HDFS Sector ensures high performance data transfer with UDT, a high speed data transfer protocol As Sector pushes replicas as far away from each other as possible, a remote Sector client may find a nearby replica Thus, Sector can be used as content distribution network for very large datasets HDFS has no special consideration for wide area access. Its performance for remote access would be close to a stock FTP server. Its security mechanism may also be a problem for remote data access.
In-Storage Data Processing Sphere HadoopMapReduce Apply arbitrary user-defined functions (UDFs) on data segments UDFs can be Map, Reduce, or others Support native MapReduce as well Support the MapReduce framework
Sphere UDF vs. HadoopMapReduce Sphere UDF HadoopMapReduce Parsing: permanent record offset index if necessary Data segments (records, blocks, files, and directories) are processed by UDFs Transparent load balancing and fault tolerance Sphere is about 2 – 4x faster in various benchmarks Parsing: run-time data parsing with default or user-defined parser Data records are processed by Map and Reduce operations Transparent load balancing and fault tolerance
Why Sphere is Faster than Hadoop? C++ vs. Java Different internal data flows Sphere UDF model is more flexible than MapReduce UDF dissembles MapReduce and gives developers more control to the process Sphere has better input data locality Better performance for applications that process files and group of files as minimum input unit Sphere has better output data locality Output location can be used to optimize iterative and combinative processing, such as join Different implementations and optimizations UDT vs. TCP (significant in wide area systems)
Compatibility with Existing Systems Sector/Sphere Hadoop Sector files can be accessed from outside if necessary Sphere can simply apply an existing application executable on multiple files or even directories in parallel, if the executable accepts a file or a directory as input Data in HDFS can only be accessed via HDFS interfaces In Hadoop, executables that process files may also be wrapped in Map or Reduce operations, but will cause extra data movement if file size is greater than block size Hadoop cannot process multiple files within one operation.
Conclusions Sector is a unique system that integrates distributed file system, content sharing network, and parallel data processing framework. Hadoop is mainly focused on large data processing within a single data center. They overlap on the parallel data processing support.

More Related Content

What's hot

Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
Kelly Technologies
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
NilaNila16
 
Snapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File SystemSnapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File System
Bhavesh Padharia
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologies
appaji intelhunt
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file system
srikanthhadoop
 
HADOOP
HADOOPHADOOP
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
MaharajothiP
 
Seminar ppt
Seminar pptSeminar ppt
Seminar ppt
RajatTripathi34
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basicssaili mane
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
Shweta Patnaik
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
AmirReza Mohammadi
 
Cppt
CpptCppt

What's hot (18)

Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Hadoop
HadoopHadoop
Hadoop
 
Snapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File SystemSnapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File System
 
hadoop
hadoophadoop
hadoop
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologies
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file system
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
HADOOP
HADOOPHADOOP
HADOOP
 
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Seminar ppt
Seminar pptSeminar ppt
Seminar ppt
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basics
 
Hdfs
HdfsHdfs
Hdfs
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
Cloud storage
Cloud storageCloud storage
Cloud storage
 
Cppt
CpptCppt
Cppt
 

Similar to Sector Vs Hadoop

Hadoop
HadoopHadoop
Hadoop overview.pdf
Hadoop overview.pdfHadoop overview.pdf
Hadoop overview.pdf
Sunil D Patil
 
module 2.pptx
module 2.pptxmodule 2.pptx
module 2.pptx
ssuser6e8e41
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
AnkitChauhan817826
 
Hdfs design
Hdfs designHdfs design
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
Nalini Mehta
 
Hdfs
HdfsHdfs
big data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing databig data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing data
preetik9044
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
shrey mehrotra
 
IRJET- A Novel Approach to Process Small HDFS Files with Apache Spark
IRJET- A Novel Approach to Process Small HDFS Files with Apache SparkIRJET- A Novel Approach to Process Small HDFS Files with Apache Spark
IRJET- A Novel Approach to Process Small HDFS Files with Apache Spark
IRJET Journal
 
HDFS
HDFSHDFS
Unit-3.pptx
Unit-3.pptxUnit-3.pptx
Unit-3.pptx
JasmineMichael1
 
Hadoop
HadoopHadoop
Hadoop
Ankit Prasad
 
Hadoop installation by santosh nage
Hadoop installation by santosh nageHadoop installation by santosh nage
Hadoop installation by santosh nage
Santosh Nage
 
Design of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDesign of Hadoop Distributed File System
Design of Hadoop Distributed File System
Dr. C.V. Suresh Babu
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
veeracynixit
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
veeracynixit
 
PARALLEL FILE SYSTEM FOR LINUX CLUSTERS
PARALLEL FILE SYSTEM FOR LINUX CLUSTERSPARALLEL FILE SYSTEM FOR LINUX CLUSTERS
PARALLEL FILE SYSTEM FOR LINUX CLUSTERS
RaheemUnnisa1
 
Bigdata and Hadoop Introduction
Bigdata and Hadoop IntroductionBigdata and Hadoop Introduction
Bigdata and Hadoop Introduction
umapavankumar kethavarapu
 

Similar to Sector Vs Hadoop (20)

Hadoop
HadoopHadoop
Hadoop
 
Hadoop overview.pdf
Hadoop overview.pdfHadoop overview.pdf
Hadoop overview.pdf
 
module 2.pptx
module 2.pptxmodule 2.pptx
module 2.pptx
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
 
Hdfs design
Hdfs designHdfs design
Hdfs design
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Hdfs
HdfsHdfs
Hdfs
 
big data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing databig data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing data
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
IRJET- A Novel Approach to Process Small HDFS Files with Apache Spark
IRJET- A Novel Approach to Process Small HDFS Files with Apache SparkIRJET- A Novel Approach to Process Small HDFS Files with Apache Spark
IRJET- A Novel Approach to Process Small HDFS Files with Apache Spark
 
HDFS
HDFSHDFS
HDFS
 
Unit-3.pptx
Unit-3.pptxUnit-3.pptx
Unit-3.pptx
 
Hadoop
HadoopHadoop
Hadoop
 
Unit 1
Unit 1Unit 1
Unit 1
 
Hadoop installation by santosh nage
Hadoop installation by santosh nageHadoop installation by santosh nage
Hadoop installation by santosh nage
 
Design of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDesign of Hadoop Distributed File System
Design of Hadoop Distributed File System
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
PARALLEL FILE SYSTEM FOR LINUX CLUSTERS
PARALLEL FILE SYSTEM FOR LINUX CLUSTERSPARALLEL FILE SYSTEM FOR LINUX CLUSTERS
PARALLEL FILE SYSTEM FOR LINUX CLUSTERS
 
Bigdata and Hadoop Introduction
Bigdata and Hadoop IntroductionBigdata and Hadoop Introduction
Bigdata and Hadoop Introduction
 

Recently uploaded

To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 

Recently uploaded (20)

To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 

Sector Vs Hadoop

  • 1. Sector vs. Hadoop A Brief Comparison Between the Two Systems
  • 2. Background Sector is a relatively “new” system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector simply another clone of GFS/MapReduce? No. These slides compare most recent versions of Sector and Hadoop as of Nov. 2010. Both software are still under active development. If you find any inaccurate information, please contact Yunhong Gu [yunhong.gu # gmail]. We will try to keep these slides accurate and up to date.
  • 3. Design Goals Sector Hadoop Three-layer functionality: Distributed file system Data collection, sharing, and distribution (over wide area networks) Massive in-storage parallel data processing with simplified interface Two-layer functionality: Distributed file system Massive in-storage parallel data processing with simplified interface
  • 4. History Sector Hadoop Started around 2005 - 2006, as a P2P file sharing and content distribution system for extreme large scientific datasets. Switched to centralized general purpose file system in 2007. Introduced in-storage parallel data processing in 2007. First “modern” version released in 2009. Started as a web crawler & indexer, Nutch, that adopted GFS/MapReduce between 2004 – 2006. Y! took over the project in 2006. Hadoop split from Nutch. First “modern” version released in 2008.
  • 5. Architecture Sector Hadoop Master-slave system Masters store metadata, slaves store data Multiple active masters Clients perform IO directly with slaves Master-slave system Namenode stores metadata, datanodes store data Single namenode (single point of failure) Clients perform IO directly with datanodes
  • 6. Distributed File System Sector HDFS General purpose IO Optimized for large files File based (file not split by Sector but users have to take care of it) Use replication for fault tolerance Write once read many (no random write) Optimized for large files Block based (64MB block as default) Use replication for fault tolerance
  • 7. Replication Sector HDFS System level default in configuration Per-file replica factor can be specified in a configuration file and can be changed at run-time Replicas are stored as far away as possible, but within a distance limit, configurable at per-file level File location can be limited to certain clusters only System level default in configuration Per-file replica factor is supported during file creation For 3 replicas (default), 2 on the same rack, the 3rd on a different rack
  • 8. Security Sector Hadoop A Sector security server is used to maintain user credentials and permission, server ACL, etc. Security server can be extended to connect to other sources, e.g., LDAP Optional file transfer encryption UDP-based hole punching firewall traversing for clients Still in active development, new features in 2010 Kerberos/token based security framework to authenticate users No file transfer encryption
  • 9. Wide Area Data Access Sector HDFS Sector ensures high performance data transfer with UDT, a high speed data transfer protocol As Sector pushes replicas as far away from each other as possible, a remote Sector client may find a nearby replica Thus, Sector can be used as content distribution network for very large datasets HDFS has no special consideration for wide area access. Its performance for remote access would be close to a stock FTP server. Its security mechanism may also be a problem for remote data access.
  • 10. In-Storage Data Processing Sphere HadoopMapReduce Apply arbitrary user-defined functions (UDFs) on data segments UDFs can be Map, Reduce, or others Support native MapReduce as well Support the MapReduce framework
  • 11. Sphere UDF vs. HadoopMapReduce Sphere UDF HadoopMapReduce Parsing: permanent record offset index if necessary Data segments (records, blocks, files, and directories) are processed by UDFs Transparent load balancing and fault tolerance Sphere is about 2 – 4x faster in various benchmarks Parsing: run-time data parsing with default or user-defined parser Data records are processed by Map and Reduce operations Transparent load balancing and fault tolerance
  • 12. Why Sphere is Faster than Hadoop? C++ vs. Java Different internal data flows Sphere UDF model is more flexible than MapReduce UDF dissembles MapReduce and gives developers more control to the process Sphere has better input data locality Better performance for applications that process files and group of files as minimum input unit Sphere has better output data locality Output location can be used to optimize iterative and combinative processing, such as join Different implementations and optimizations UDT vs. TCP (significant in wide area systems)
  • 13. Compatibility with Existing Systems Sector/Sphere Hadoop Sector files can be accessed from outside if necessary Sphere can simply apply an existing application executable on multiple files or even directories in parallel, if the executable accepts a file or a directory as input Data in HDFS can only be accessed via HDFS interfaces In Hadoop, executables that process files may also be wrapped in Map or Reduce operations, but will cause extra data movement if file size is greater than block size Hadoop cannot process multiple files within one operation.
  • 14. Conclusions Sector is a unique system that integrates distributed file system, content sharing network, and parallel data processing framework. Hadoop is mainly focused on large data processing within a single data center. They overlap on the parallel data processing support.