SlideShare a Scribd company logo
1 of 14
Sector vs. Hadoop A Brief Comparison Between the Two Systems
Background Sector is a relatively “new” system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector simply another clone of GFS/MapReduce? No. These slides compare most recent versions of Sector and Hadoop as of Nov. 2010. Both software are still under active development. If you find any inaccurate information, please contact Yunhong Gu [yunhong.gu # gmail]. We will try to keep these slides accurate and up to date.
Design Goals Sector Hadoop Three-layer functionality: Distributed file system Data collection, sharing, and distribution (over wide area networks) Massive in-storage parallel data processing with simplified interface Two-layer functionality: Distributed file system Massive in-storage parallel data processing with simplified interface
History Sector Hadoop Started around 2005 - 2006, as a P2P file sharing and content distribution system for extreme large scientific datasets. Switched to centralized general purpose file system in 2007. Introduced in-storage parallel data processing in 2007. First “modern” version released in 2009. Started as a web crawler & indexer, Nutch, that adopted GFS/MapReduce between 2004 – 2006. Y! took over the project in 2006. Hadoop split from Nutch. First “modern” version released in 2008.
Architecture Sector Hadoop Master-slave system Masters store metadata, slaves store data Multiple active masters Clients perform IO directly with slaves Master-slave system Namenode stores metadata, datanodes store data Single namenode (single point of failure) Clients perform IO directly with datanodes
Distributed File System Sector HDFS General purpose IO Optimized for large files File based (file not split by Sector but users have to take care of it) Use replication for fault tolerance Write once read many (no random write) Optimized for large files Block based (64MB block as default) Use replication for fault tolerance
Replication Sector HDFS System level default in configuration Per-file replica factor can be specified in a configuration file and can be changed at run-time Replicas are stored as far away as possible, but within a distance limit, configurable at per-file level File location can be limited to certain clusters only System level default in configuration Per-file replica factor is supported during file creation For 3 replicas (default), 2 on the same rack, the 3rd on a different rack
Security Sector Hadoop A Sector security server is used to maintain user credentials and permission, server ACL, etc. Security server can be extended to connect to other sources, e.g., LDAP Optional file transfer encryption UDP-based hole punching firewall traversing for clients Still in active development, new features in 2010 Kerberos/token based security framework to authenticate users No file transfer encryption
Wide Area Data Access Sector HDFS Sector ensures high performance data transfer with UDT, a high speed data transfer protocol As Sector pushes replicas as far away from each other as possible, a remote Sector client may find a nearby replica Thus, Sector can be used as content distribution network for very large datasets HDFS has no special consideration for wide area access. Its performance for remote access would be close to a stock FTP server. Its security mechanism may also be a problem for remote data access.
In-Storage Data Processing Sphere HadoopMapReduce Apply arbitrary user-defined functions (UDFs) on data segments UDFs can be Map, Reduce, or others Support native MapReduce as well Support the MapReduce framework
Sphere UDF vs. HadoopMapReduce Sphere UDF HadoopMapReduce Parsing: permanent record offset index if necessary Data segments (records, blocks, files, and directories) are processed by UDFs Transparent load balancing and fault tolerance Sphere is about 2 – 4x faster in various benchmarks Parsing: run-time data parsing with default or user-defined parser Data records are processed by Map and Reduce operations Transparent load balancing and fault tolerance
Why Sphere is Faster than Hadoop? C++ vs. Java Different internal data flows Sphere UDF model is more flexible than MapReduce UDF dissembles MapReduce and gives developers more control to the process Sphere has better input data locality Better performance for applications that process files and group of files as minimum input unit Sphere has better output data locality Output location can be used to optimize iterative and combinative processing, such as join Different implementations and optimizations UDT vs. TCP (significant in wide area systems)
Compatibility with Existing Systems Sector/Sphere Hadoop Sector files can be accessed from outside if necessary Sphere can simply apply an existing application executable on multiple files or even directories in parallel, if the executable accepts a file or a directory as input Data in HDFS can only be accessed via HDFS interfaces In Hadoop, executables that process files may also be wrapped in Map or Reduce operations, but will cause extra data movement if file size is greater than block size Hadoop cannot process multiple files within one operation.
Conclusions Sector is a unique system that integrates distributed file system, content sharing network, and parallel data processing framework. Hadoop is mainly focused on large data processing within a single data center. They overlap on the parallel data processing support.

More Related Content

What's hot

Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemNilaNila16
 
Snapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File SystemSnapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File SystemBhavesh Padharia
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesappaji intelhunt
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file systemsrikanthhadoop
 
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.MaharajothiP
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basicssaili mane
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 

What's hot (18)

Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Hadoop
HadoopHadoop
Hadoop
 
Snapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File SystemSnapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File System
 
hadoop
hadoophadoop
hadoop
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologies
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file system
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
HADOOP
HADOOPHADOOP
HADOOP
 
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Seminar ppt
Seminar pptSeminar ppt
Seminar ppt
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basics
 
Hdfs
HdfsHdfs
Hdfs
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
Cloud storage
Cloud storageCloud storage
Cloud storage
 
Cppt
CpptCppt
Cppt
 

Similar to Sector Vs Hadoop

Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxAnkitChauhan817826
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with HadoopNalini Mehta
 
big data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing databig data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing datapreetik9044
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfsshrey mehrotra
 
IRJET- A Novel Approach to Process Small HDFS Files with Apache Spark
IRJET- A Novel Approach to Process Small HDFS Files with Apache SparkIRJET- A Novel Approach to Process Small HDFS Files with Apache Spark
IRJET- A Novel Approach to Process Small HDFS Files with Apache SparkIRJET Journal
 
Hadoop installation by santosh nage
Hadoop installation by santosh nageHadoop installation by santosh nage
Hadoop installation by santosh nageSantosh Nage
 
Design of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDesign of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDr. C.V. Suresh Babu
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoopveeracynixit
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoopveeracynixit
 
PARALLEL FILE SYSTEM FOR LINUX CLUSTERS
PARALLEL FILE SYSTEM FOR LINUX CLUSTERSPARALLEL FILE SYSTEM FOR LINUX CLUSTERS
PARALLEL FILE SYSTEM FOR LINUX CLUSTERSRaheemUnnisa1
 

Similar to Sector Vs Hadoop (20)

Hadoop
HadoopHadoop
Hadoop
 
Hadoop overview.pdf
Hadoop overview.pdfHadoop overview.pdf
Hadoop overview.pdf
 
module 2.pptx
module 2.pptxmodule 2.pptx
module 2.pptx
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
 
Hdfs design
Hdfs designHdfs design
Hdfs design
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Hdfs
HdfsHdfs
Hdfs
 
big data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing databig data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing data
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
IRJET- A Novel Approach to Process Small HDFS Files with Apache Spark
IRJET- A Novel Approach to Process Small HDFS Files with Apache SparkIRJET- A Novel Approach to Process Small HDFS Files with Apache Spark
IRJET- A Novel Approach to Process Small HDFS Files with Apache Spark
 
HDFS
HDFSHDFS
HDFS
 
Unit-3.pptx
Unit-3.pptxUnit-3.pptx
Unit-3.pptx
 
Hadoop
HadoopHadoop
Hadoop
 
Unit 1
Unit 1Unit 1
Unit 1
 
Hadoop installation by santosh nage
Hadoop installation by santosh nageHadoop installation by santosh nage
Hadoop installation by santosh nage
 
Design of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDesign of Hadoop Distributed File System
Design of Hadoop Distributed File System
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
PARALLEL FILE SYSTEM FOR LINUX CLUSTERS
PARALLEL FILE SYSTEM FOR LINUX CLUSTERSPARALLEL FILE SYSTEM FOR LINUX CLUSTERS
PARALLEL FILE SYSTEM FOR LINUX CLUSTERS
 
Bigdata and Hadoop Introduction
Bigdata and Hadoop IntroductionBigdata and Hadoop Introduction
Bigdata and Hadoop Introduction
 

Recently uploaded

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 

Recently uploaded (20)

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

Sector Vs Hadoop

  • 1. Sector vs. Hadoop A Brief Comparison Between the Two Systems
  • 2. Background Sector is a relatively “new” system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector simply another clone of GFS/MapReduce? No. These slides compare most recent versions of Sector and Hadoop as of Nov. 2010. Both software are still under active development. If you find any inaccurate information, please contact Yunhong Gu [yunhong.gu # gmail]. We will try to keep these slides accurate and up to date.
  • 3. Design Goals Sector Hadoop Three-layer functionality: Distributed file system Data collection, sharing, and distribution (over wide area networks) Massive in-storage parallel data processing with simplified interface Two-layer functionality: Distributed file system Massive in-storage parallel data processing with simplified interface
  • 4. History Sector Hadoop Started around 2005 - 2006, as a P2P file sharing and content distribution system for extreme large scientific datasets. Switched to centralized general purpose file system in 2007. Introduced in-storage parallel data processing in 2007. First “modern” version released in 2009. Started as a web crawler & indexer, Nutch, that adopted GFS/MapReduce between 2004 – 2006. Y! took over the project in 2006. Hadoop split from Nutch. First “modern” version released in 2008.
  • 5. Architecture Sector Hadoop Master-slave system Masters store metadata, slaves store data Multiple active masters Clients perform IO directly with slaves Master-slave system Namenode stores metadata, datanodes store data Single namenode (single point of failure) Clients perform IO directly with datanodes
  • 6. Distributed File System Sector HDFS General purpose IO Optimized for large files File based (file not split by Sector but users have to take care of it) Use replication for fault tolerance Write once read many (no random write) Optimized for large files Block based (64MB block as default) Use replication for fault tolerance
  • 7. Replication Sector HDFS System level default in configuration Per-file replica factor can be specified in a configuration file and can be changed at run-time Replicas are stored as far away as possible, but within a distance limit, configurable at per-file level File location can be limited to certain clusters only System level default in configuration Per-file replica factor is supported during file creation For 3 replicas (default), 2 on the same rack, the 3rd on a different rack
  • 8. Security Sector Hadoop A Sector security server is used to maintain user credentials and permission, server ACL, etc. Security server can be extended to connect to other sources, e.g., LDAP Optional file transfer encryption UDP-based hole punching firewall traversing for clients Still in active development, new features in 2010 Kerberos/token based security framework to authenticate users No file transfer encryption
  • 9. Wide Area Data Access Sector HDFS Sector ensures high performance data transfer with UDT, a high speed data transfer protocol As Sector pushes replicas as far away from each other as possible, a remote Sector client may find a nearby replica Thus, Sector can be used as content distribution network for very large datasets HDFS has no special consideration for wide area access. Its performance for remote access would be close to a stock FTP server. Its security mechanism may also be a problem for remote data access.
  • 10. In-Storage Data Processing Sphere HadoopMapReduce Apply arbitrary user-defined functions (UDFs) on data segments UDFs can be Map, Reduce, or others Support native MapReduce as well Support the MapReduce framework
  • 11. Sphere UDF vs. HadoopMapReduce Sphere UDF HadoopMapReduce Parsing: permanent record offset index if necessary Data segments (records, blocks, files, and directories) are processed by UDFs Transparent load balancing and fault tolerance Sphere is about 2 – 4x faster in various benchmarks Parsing: run-time data parsing with default or user-defined parser Data records are processed by Map and Reduce operations Transparent load balancing and fault tolerance
  • 12. Why Sphere is Faster than Hadoop? C++ vs. Java Different internal data flows Sphere UDF model is more flexible than MapReduce UDF dissembles MapReduce and gives developers more control to the process Sphere has better input data locality Better performance for applications that process files and group of files as minimum input unit Sphere has better output data locality Output location can be used to optimize iterative and combinative processing, such as join Different implementations and optimizations UDT vs. TCP (significant in wide area systems)
  • 13. Compatibility with Existing Systems Sector/Sphere Hadoop Sector files can be accessed from outside if necessary Sphere can simply apply an existing application executable on multiple files or even directories in parallel, if the executable accepts a file or a directory as input Data in HDFS can only be accessed via HDFS interfaces In Hadoop, executables that process files may also be wrapped in Map or Reduce operations, but will cause extra data movement if file size is greater than block size Hadoop cannot process multiple files within one operation.
  • 14. Conclusions Sector is a unique system that integrates distributed file system, content sharing network, and parallel data processing framework. Hadoop is mainly focused on large data processing within a single data center. They overlap on the parallel data processing support.