BNM INSTITUTE OF TECHNOLOGY
A Technical Seminar on
“BANIAN: A CROSS-PLATFORM INTERACTIVE
QUERY SYSTEM FOR STRUCTURED BIG DATA”
Under the Guidance of
Ms. Tulasi Sunitha M.
Assistant Professor
Dept. of CSE, BNMIT
CONTENTS
1. INTRODUCTION
2. LITERATURE SURVEY
3. SYSTEM ARCHITECTURE
4. SPLITTING AND SCHEDULING
5. CROSS PLATFORM QUERY
6. EVALUATION
INTRODUCTION
• In 2007, the GFS and MapReduce systems developed by Google could process
20 PB of web pages per day.
• In 2012, the HDFS and HBase clusters deployed by Facebook scanned 300
million images daily.
• In 2013, the search engine system developed by Baidu could handle 100 PB
of data per day.
Continued….
• At present, parallel databases based on the Massively Parallel Processing
(MPP) architecture can manage hundreds of TB of data.
• MapReduce is a programming framework proposed by Google and a
typical technology for processing big data.
• By combining HDFS with the splitting and scheduling model of parallel
databases, Banian integrates large-scale storage management with
interactive query and analysis.
LITERATURE SURVEY
• One line of research incorporates MapReduce into MPP databases such as
Greenplum and Teradata.
• Hive is the most typical example of SQL on Hadoop. It maps files onto
database tables and provides an SQL query interface.
• Dremel is an interactive data analysis system proposed by Google.
Continued….
• Impala is an MPP SQL query engine developed by Cloudera.
• BlinkDB, proposed by UC Berkeley, is a large-scale parallel processing
engine capable of running interactive SQL commands on PB-level datasets.
• Spark originated from the cluster computing platform at AMPLab, UC
Berkeley.
SYSTEM ARCHITECTURE
• The architecture of Banian is divided into three main layers according to
logical function: the storage layer, the scheduling and execution layer,
and the application layer.
Continued….
• The storage layer contains three important interfaces:
I. The interface that provides the data block distribution information
of a file to the scheduler module through the NameNode;
II. The read/write interface of local data for the query engine module;
III. The read/write interface of HDFS for the ETL module.
Continued….
• The scheduling and execution layer is the core component of Banian. It
contains three modules: Scheduler, Query Engine, and Metadata Server.
• The scheduler receives SQL commands from the application layer.
• The metadata server maintains a fast lookup table for caching data
block information.
• The query engine is deployed on each sub-node. It is responsible for
receiving and executing the operation list allocated by the scheduler.
SPLITTING AND SCHEDULING
Continued….
The complete workflow by which the scheduling and execution layer processes
SQL commands is as follows:
1. After receiving SQL commands, the execution and analysis units perform
lexical and grammatical analysis to generate the task tree.
2. Traverse each entry of the task tree, query the metadata server according
to the table information, and obtain the corresponding file information.
3. Transform tasks into file operations, i.e., the task tree into the
operation tree. Query the fast lookup table, and go to Step 5 in the case
of a cache hit.
4. Traverse each entry of the operation tree, query the HDFS NameNode
according to the file information, and obtain the corresponding data block
positions.
5. The coordinator unit sends the operation list to the query engine on the
corresponding sub-node.
6. The query engine starts the workflow after receiving the operation list
and reads local data directly for execution.
7. The aggregation unit collects all results from the query engines and
sends them to the application layer.
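The cache-lookup part of this workflow can be sketched in C. This is only an illustration under stated assumptions, not Banian's implementation: the names `BlockEntry`, `fast_table`, `locate_block`, and `query_namenode` are hypothetical, the cache is a simple direct-mapped table, and the NameNode query is replaced by a stub with a placeholder placement rule.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_SLOTS 64
#define NAME_LEN 64

/* Hypothetical cache entry: file name -> sub-node holding its data block. */
struct BlockEntry {
    char file[NAME_LEN];
    int  node_id;
    int  used;
};

static struct BlockEntry fast_table[CACHE_SLOTS]; /* zero-initialized */
static int namenode_queries = 0; /* counts simulated NameNode round trips */

/* djb2 string hash, used to pick a slot in the direct-mapped table. */
static size_t slot_for(const char *file) {
    unsigned long h = 5381;
    for (; *file; ++file)
        h = h * 33 + (unsigned char)*file;
    return (size_t)(h % CACHE_SLOTS);
}

/* Stand-in for the NameNode query; a real system would contact HDFS. */
static int query_namenode(const char *file) {
    ++namenode_queries;
    return (int)(strlen(file) % 10); /* placeholder placement rule */
}

/* Return the sub-node for `file`, consulting the fast lookup table first. */
int locate_block(const char *file) {
    assert(file != NULL && strlen(file) < NAME_LEN);
    struct BlockEntry *e = &fast_table[slot_for(file)];
    if (e->used && strcmp(e->file, file) == 0)
        return e->node_id;           /* cache hit: NameNode query skipped */
    int node = query_namenode(file); /* cache miss: ask the NameNode stub */
    strcpy(e->file, file);           /* remember the answer for next time */
    e->node_id = node;
    e->used = 1;
    return node;
}
```

Calling `locate_block` twice with the same file name triggers only one simulated NameNode query; the second call is answered from the table. A colliding file name simply overwrites the slot, which keeps the sketch short at the cost of extra misses.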
Scheduler
• The scheduler is a logical unit as opposed to a physical module. It is
composed of the scheduler daemons on each physical node.
CROSS-PLATFORM QUERY
Continued….
• The SQL interface provides a command shell for users and forwards query
commands to the cross-platform module.
• The cross-platform module queries the global table and retrieves the
Location information of the target platform.
• The global table stores the configuration information of all platforms
using a data structure called Location:
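struct Location {
    char *tagname;   /* tag identifying the platform */
    char *host;      /* host of the platform */
    int port;        /* port of the platform service */
    int authority;   /* authority setting for this platform */
    char *username;  /* login user name */
    char *password;  /* login password */
};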
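A lookup in the global table might look like the following C sketch. It is an assumption-laden illustration rather than Banian's code: the table contents (tags, hosts, ports, credentials) are made-up placeholders, `find_location` is a hypothetical helper, and the string fields are declared `const` here only because the sketch initializes them from string literals.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same fields as the Location structure, with const string pointers. */
struct Location {
    const char *tagname;
    const char *host;
    int port;
    int authority;
    const char *username;
    const char *password;
};

/* Hypothetical global table; every value below is a placeholder. */
static const struct Location global_table[] = {
    { "banian", "banian.example.org", 9000,  1, "analyst", "placeholder" },
    { "hive",   "hive.example.org",   10000, 0, "analyst", "placeholder" },
};

/* Return the entry whose tagname matches, or NULL if no platform matches. */
const struct Location *find_location(const char *tagname) {
    assert(tagname != NULL);
    size_t n = sizeof global_table / sizeof global_table[0];
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(global_table[i].tagname, tagname) == 0)
            return &global_table[i];
    }
    return NULL;
}
```

A cross-platform module could call `find_location("hive")` to obtain the host and port before forwarding a query, and treat a `NULL` result as an unknown platform. A linear scan is enough here because the table holds one entry per platform.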
EVALUATION
1. Performance evaluation.
Evaluate the performance and scalability of Banian and compare the
results with those of Hive.
2. TPC-H evaluation.
Figure 6.1: (a) Query time of Q1-Q5 on Banian and Hive using the 1.2 PB
dataset D1. (b) Query time of the 22 SQL commands of the TPC-H benchmark on
Banian and Hive using the 1 TB dataset D2.
Load dataset D2 into Banian and Hive, and run a suite of business-oriented
ad hoc queries (22 SQL commands) from the TPC-H benchmark on the
experimental platform.
3. Scalability evaluation.
• Dataset D1 is split so that each node retains 12 TB of data. The table
size increases from 120 TB to 1.2 PB as the cluster size increases from 10
nodes to 100 nodes.
Figure 6.2: Query time of Q1-Q5 on Banian and Hive for cluster
sizes of 10, 20, 40, 60, 80, and 100 nodes, in sequence.
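The table sizes quoted for the scalability test follow directly from the per-node share. As a small check, assuming decimal units (1 PB = 1000 TB) and a hypothetical helper named `table_size_tb`:

```c
#include <assert.h>

#define TB_PER_NODE 12 /* from the slide: each node retains 12 TB */

/* Total table size in TB for a cluster of `nodes` nodes. */
int table_size_tb(int nodes) {
    assert(nodes > 0);
    return nodes * TB_PER_NODE;
}
```

With 10 nodes this gives 120 TB, and with 100 nodes it gives 1200 TB, i.e. 1.2 PB, matching the range stated above.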
CONCLUSIONS
• Banian combines HDFS with the splitting and scheduling engine of a
parallel database.
• The platform supports the storage of PB-level data and interactive
cross-platform query.
• The test results suggest that the performance of Banian is 5–30 times
better than that of Hive.
• Banian employs a symmetrical, loosely coupled structure and shows higher
scalability and compatibility.
