HOTEL INSPECTION DATASET ANALYSIS
A mini project on BIG DATA-HADOOP
Project Title
HOTEL INSPECTION
DATASET ANALYSIS
Presented by,
SHARON MOSES
RAGINI AKULA
CONTENTS

Abstract
List of Figures
List of Screens

1. INTRODUCTION
   1.1 Motivation
   1.2 Existing System
   1.3 Problem Definition
      1.3.1 Storing
      1.3.2 Processing
   1.4 Proposed System
   1.5 Features of Project
      1.5.1 Storing the Dataset
      1.5.2 Processing the Dataset
2. LITERATURE SURVEY
   2.1 Big Data
   2.2 Apache Hadoop
      2.2.1 Vendors
      2.2.2 Cloudera
      2.2.3 Hadoop Ecosystems
   2.3 Linux Ubuntu
   2.4 MySQL
3. SYSTEM REQUIREMENTS
   3.1 Identification of Needs
   3.2 Environmental Requirements
      3.2.1 Software Requirements
      3.2.2 Hardware Requirements
4. BUSINESS LOGIC
   4.1 System Analysis
      4.1.1 Functional Requirements
         4.1.1.1 Technical Feasibility
         4.1.1.2 Operational Feasibility
   4.2 System Design
      4.2.1 Business Flow
         4.2.1.1 Apache Hadoop Working Model-I
         4.2.1.2 Apache Hadoop Working Model-II
      4.2.2 Business Logic
5. PROJECT MODULES
   5.1 Modules Introduction
   5.2 Modules
      5.2.1 Analysing and filtering the data
      5.2.2 Identifying the headers (schema)
      5.2.3 Installing a single-node Hadoop cluster
      5.2.4 Moving the data to HDFS
      5.2.5 Creating the tables in Hive
      5.2.6 Importing data from HDFS to the Hive warehouse
      5.2.7 Analysing the data based on the queries from the client
      5.2.8 Generating the reports
6. EXECUTION OF JOBS
   6.1 Methods of Execution
      6.1.1 Executing the job from the Hive prompt
      6.1.2 Executing the job from the terminal with Hadoop
      6.1.3 Executing the job as a script
   6.2 Execution of HiveQL jobs
7. TESTING
   7.1 Introduction
   7.2 Sample unit testing
8. SCREENS
9. CONCLUSIONS
10. REFERENCES
ABSTRACT
Hotels are generally complex and costly to maintain. The quality of food and the use of spaces with different schedules, such as guest rooms, restaurants, the health club, the swimming pool, and the retail store, each depend on a functional engineering system. Maintenance therefore has to be carried out throughout the year, requiring competent staff to undertake building services, operation, and maintenance, supplemented by outsourced contractors.

In the hospitality industry, the maintenance of engineering systems is important despite its complex processes, as its effectiveness directly affects the quality of hotel service, food, and beverage, which in turn have a direct and significant effect on guests' impression of the hotel.

This project works with data from inspections done on hotels in various parts of the USA. The data records the violations made by hotel managements and their violation codes, and also the action taken by the government against each hotel according to those codes.

We analyze the data to find out where hotels are violating the codes, so that new hotels can avoid these problems and survive in the market.
LIST OF FIGURES

1. Big Data 3 V's
2. Data Measurements
3. Hadoop logo
4. Components of Hadoop
5. HDFS data distribution
6. MapReduce compute distribution
7. Performance and scalability
8. Apache Hadoop ecosystem
9. Ubuntu logo
10. MySQL logo
11. Apache Hadoop Working Model-I
12. Apache Hadoop Working Model-II
13. MapReduce logic
14. Job execution phases in Hadoop
15. Violation table schema diagram
16. Hotel table schema diagram
LIST OF SCREENS

1. Raw dataset
2. Unnecessary data fields
3. Final dataset
4. JDK installation path
5. Java path
6. Hadoop location
7. Hadoop installation crosscheck
8. Hive installation crosscheck
9. hadoop-env.sh file
10. core-site.xml
11. mapred-site.xml
12. hdfs-site.xml
13. Creating a directory
14. Listing the directories
15. Moving data to HDFS
16. Checking the files in HDFS
17. Table created successfully
18. Checking created tables
19. Data loaded to the Hive warehouse and table
20. Table description
21. Verifying the data
22. Job execution
23. Query executed and data loaded to HDFS
24. Result moved to the home directory
25. Stored output
26. Output generated by the query
27. Report generated from the query
28. Hive prompt
29. Query using hive -e
30. Query from script
31. Script home directory
32. Query written in script
33. Violation codes
34. Violations made
35. Inspections made area-wise
36. Violation counts per restaurant
37. Types of cuisines inspected
38. Most-inspected cuisines
39. Critical and non-critical issues
1. INTRODUCTION
1.1 MOTIVATION:
Hotels are generally complex and costly to maintain. The quality of food and the use of spaces with different schedules, such as guest rooms, restaurants, the health club, the swimming pool, and the retail store, each depend on a functional engineering system. Maintenance therefore has to be carried out throughout the year, requiring competent staff to undertake building services, operation, and maintenance, supplemented by outsourced contractors.

In the hospitality industry, the maintenance of engineering systems is important despite its complex processes, as its effectiveness directly affects the quality of hotel service, food, and beverage, which in turn have a direct and significant effect on guests' impression of the hotel. This project works with data from inspections done on hotels in various parts of the USA. The data records the violations made by hotel managements and their violation codes, and also the action taken by the government against each hotel according to those codes.

We analyze the data to find out where hotels are violating the codes, so that new hotels can avoid these problems and survive in the market.
1.2 EXISTING SYSTEM:
These days, the most important thing for any organization, company, or business firm is to survive in the market and compete with its competitors. To do so, the firm needs to analyze its position in the market.
Analyzing the market requires the data the firm has generated over many years. That data has multiplied rapidly, creating serious problems both in storing it and in analyzing what has been stored.

Storage technologies have improved tremendously in recent years, but analysis techniques have not kept pace. We face problems analyzing the data stored in traditional RDBMSs (MySQL, DB2, ...), and at the same time the data size is exceeding our storage capabilities.
1.3 PROBLEM DEFINITION:
The following are the problems which we are facing with the existing systems.
1.3.1 Storing:
Over the past couple of years, the data has drastically increased in size, creating many problems in storing it.
1.3.2 Processing:
Since the data is very large, we are unable to analyze the dataset within a fixed period of time, and so cannot obtain results efficiently.
1.4 PROPOSED SYSTEM:
In our proposed system we use new technologies for analyzing the datasets. The framework we use is Hadoop.
Hadoop is a framework capable of storing an enormous amount of data and of processing a dataset in less time and more efficiently than other technologies.
1.5 FEATURES OF PROJECT:
1.5.1 Storing the Dataset:
We extract the dataset from an external source into our Hadoop cluster using the Sqoop ecosystem tool.
1.5.2 Processing the Dataset:
Once the dataset is extracted, the data is analyzed using MapReduce and whichever other ecosystem tools work well with the dataset.
2. LITERATURE SURVEY
2.1 BIGDATA:
Big data is an evolving term that describes any voluminous amount of structured, semi-structured, and unstructured data that has the potential to be mined for information. Although big data doesn't refer to any specific quantity, the term is often used when speaking about petabytes and exabytes of data.

Big data describes a volume of data so massive that it is difficult to process; it exceeds current processing capacity.

Big data can be characterized by the 3 Vs: the extreme volume of data, the wide variety of types of data, and the velocity at which the data must be processed.
Big Data 3V’s
An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data consisting of billions to trillions of records, e.g. web, sales, customer contact center, social media, and mobile data.
Data Measurements
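As a quick sanity check on these units, the binary (1,024-based) convention used above can be worked out in a few lines (a sketch; the example record size is an assumption, not from the dataset):

```python
# Data size units under the binary (1,024-based) convention from the text.
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB
PB = 1024 * TB  # 1 petabyte = 1,024 terabytes
EB = 1024 * PB  # 1 exabyte  = 1,024 petabytes

# A billion records of 1 KB each is still under a terabyte; "big data"
# in the petabyte/exabyte sense is several orders of magnitude beyond this.
records = 1_000_000_000
print(records * KB / TB)  # ≈ 0.93 TB
```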
2.2 APACHE HADOOP:
Hadoop-logo
Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and fast processing.

Doug Cutting, Cloudera's chief architect, helped create Apache Hadoop out of necessity as data from the web exploded and grew far beyond the ability of traditional systems to handle it. Hadoop was initially inspired by papers published by Google outlining its approach to handling an avalanche of data, and has since become the de facto standard for storing, processing, and analyzing hundreds of terabytes, and even petabytes, of data.
Why is Hadoop important?
Since its inception, Hadoop has become one of the most talked about technologies. Why?
One of the top reasons (and why it was invented) is its ability to handle huge amounts of data –
any kind of data – quickly. With volumes and varieties of data growing each day, especially from
social media and automated sensors, that’s a key consideration for most organizations. Other
reasons include:
Low cost: The open-source framework is free and uses commodity hardware to store large
quantities of data.
Computing power: Its distributed computing model can quickly process very large volumes
of data. The more computing nodes you use, the more processing power you have.
Scalability: You can easily grow your system simply by adding more nodes. Little
administration is required.
Storage flexibility: Unlike traditional relational databases, you don’t have to preprocess data
before storing it. And that includes unstructured data like text, images and videos. You can store
as much data as you want and decide how to use it later.
Inherent data protection and self-healing capabilities: Data and application processing are
protected against hardware failure. If a node goes down, jobs are automatically redirected to
other nodes to make sure the distributed computing does not fail. And it automatically stores
multiple copies of all data.
Components of Hadoop
HDFS: (Hadoop Distributed File System)
HDFS is a fault tolerant and self-healing distributed file system designed to turn a cluster
of industry standard servers into a massively scalable pool of storage. Developed specifically for
large-scale data processing workloads where scalability, flexibility and throughput are critical,
HDFS accepts data in any format regardless of schema, optimizes for high bandwidth streaming,
and scales to proven deployments of 100PB and beyond.
HDFS Data Distribution
Data in HDFS is replicated across multiple nodes for compute performance and data protection.
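The replication idea can be illustrated with a toy placement routine (a simplified sketch only; real HDFS placement is rack-aware and considerably more sophisticated, and the node names here are invented):

```python
import itertools

def place_blocks(num_blocks, nodes, replication=3):
    """Toy block placement: spread each block's replicas across distinct
    nodes round-robin, so losing one node never loses a block."""
    ring = itertools.cycle(range(len(nodes)))
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[next(ring)] for _ in range(replication)]
    return placement

# Four hypothetical data nodes, six blocks, default replication factor of 3.
layout = place_blocks(6, ["node1", "node2", "node3", "node4"])
print(layout[0])  # ['node1', 'node2', 'node3']
```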
MapReduce:
MapReduce is a massively scalable, parallel processing framework that works in tandem
with HDFS. With MapReduce and Hadoop, compute is executed at the location of the data,
rather than moving data to the compute location; data storage and computation coexist on the
same physical nodes in the cluster. MapReduce processes exceedingly large amounts of data
without being affected by traditional bottlenecks like network bandwidth by taking advantage of
this data proximity.
MapReduce Compute Distribution
MapReduce divides workloads up into multiple tasks that can be executed in parallel.
The MapReduce framework operates exclusively on <key, value> pairs, that is, the
framework views the input to the job as a set of <key, value> pairs and produces a set of <key,
value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be Serializable by the framework and hence need to
implement the Writable interface. Additionally, the key classes have to implement the
WritableComparable interface to facilitate sorting by the framework.
Input and Output types of a MapReduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
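That pipeline can be mimicked in a few lines of ordinary Python (a conceptual sketch only; real Hadoop jobs implement Mapper and Reducer classes in Java, and the shuffle/sort happens across the cluster):

```python
from collections import defaultdict

def map_phase(records):
    # (k1, v1) -> (k2, v2): emit (word, 1) for every word in every line.
    for _, line in records:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # Shuffle/sort: group all v2 values by k2, then reduce to (k3, v3).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

records = [(1, "big data big cluster"), (2, "data node")]
print(reduce_phase(map_phase(records)))
# {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```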
2.2.1 Vendors:
Hadoop vendors base their distributions on the architecture of Apache Hadoop.
EMC:
Pivotal HD, the Apache Hadoop distribution from EMC, natively integrates EMC’s
massively parallel processing (MPP) database technology (formerly known as Greenplum, and
now known as HAWQ) with Apache Hadoop. The result is a high-performance Hadoop
distribution with true SQL processing for Hadoop. SQL-based queries and other business
intelligence tools can be used to analyze data that is stored in HDFS.
Hortonworks: Another major player in the Hadoop market, Hortonworks has the largest
number of committers and code contributors for the Hadoop ecosystem components.
(Committers are the gatekeepers of Apache projects and have the power to approve code
changes.)
Hortonworks is a spin-off from Yahoo!, which was the original corporate driver of the
Hadoop project because it needed a large-scale platform to support its search engine business. Of
all the Hadoop distribution vendors, Hortonworks is the most committed to the open source
movement, based on the sheer volume of the development work it contributes to the community,
and because all its development efforts are (eventually) folded into the open source codebase.
The Hortonworks business model is based on its ability to leverage its popular HDP
distribution and provide paid services and support. However, it does not sell proprietary
software. Rather, the company enthusiastically supports the idea of working within the open
source community to develop solutions that address enterprise feature requirements (for
example, faster query processing with Hive).
Hortonworks has forged a number of relationships with established companies in the data
management industry: Teradata, Microsoft, Informatica, and SAS, for example. Though these
companies don’t have their own, in-house Hadoop offerings, they collaborate with Hortonworks
to provide integrated Hadoop solutions with their own product sets.
The Hortonworks Hadoop offering is the Hortonworks Data Platform (HDP), which includes Hadoop as well as related tooling and projects. Unlike Cloudera, Hortonworks releases only HDP versions with production-level code from the open source community.
IBM:
Big Blue offers a range of Hadoop offerings, with the focus around value added on top of
the open source Hadoop stack.
Intel:
The Intel Distribution for Apache Hadoop (Intel Distribution) provides distributed
processing and data management for enterprise applications that analyze big data.
MapR:
For a complete distribution for Apache Hadoop and related projects that’s independent of
the Apache Software Foundation, look no further than MapR. Boasting no Java dependencies or
reliance on the Linux file system, MapR is being promoted as the only Hadoop distribution that
provides full data protection, no single points of failure, and significant ease-of-use advantages.
Three MapR editions are available: M3, M5, and M7. The M3 Edition is free and
available for unlimited production use; MapR M5 is an intermediate-level subscription software
offering; and MapR M7 is a complete distribution for Apache Hadoop and HBase that includes
Pig, Hive, Sqoop, and much more.
Performance and scalability
2.2.2 Cloudera:
Perhaps the best-known player in the field, Cloudera is able to claim Doug Cutting,
Hadoop’s co-founder, as its chief architect. Cloudera is seen by many people as the market
leader in the Hadoop space because it released the first commercial Hadoop distribution and it is
a highly active contributor of code to the Hadoop ecosystem.
Cloudera Enterprise, a product positioned by Cloudera at the center of what it calls the
“Enterprise Data Hub,” includes the Cloudera Distribution for Hadoop (CDH), an open-source-
based distribution of Hadoop and its related projects as well as its proprietary Cloudera Manager.
Also included is a technical support subscription for the core components of CDH.
Cloudera’s primary business model has long been based on its ability to leverage its
popular CDH distribution and provide paid services and support. In the fall of 2013, Cloudera
formally announced that it is focusing on adding proprietary value-added components on top of
open source Hadoop to act as a differentiator.
Also, Cloudera has made it a common practice to accelerate the adoption of alpha- and
beta-level open source code for the newer Hadoop releases. Its approach is to take components it
deems to be mature and retrofit them into the existing production-ready open source libraries that
are included in its distribution.
2.2.3 Hadoop Ecosystems:
The Hadoop platform consists of two key services: a reliable, distributed file system
called Hadoop Distributed File System (HDFS) and the high-performance parallel data
processing engine called Hadoop MapReduce, described in MapReduce below.
The combination of HDFS and MapReduce provides a software framework for
processing vast amounts of data in parallel on large clusters of commodity hardware (potentially
scaling to thousands of nodes) in a reliable, fault-tolerant manner. Hadoop is a generic
processing framework designed to execute queries and other batch read operations against
massive datasets that can scale from tens of terabytes to petabytes in size.
The popularity of Hadoop has grown in the last few years, because it meets the needs of
many organizations for flexible data analysis capabilities with an unmatched price-performance
curve. The flexible data analysis features apply to data in a variety of formats, from unstructured
data, such as raw text, to semi-structured data, such as logs, to structured data with a fixed
schema.
Hadoop has been particularly useful in environments where massive server farms are
used to collect data from a variety of sources. Hadoop is able to process parallel queries as big,
background batch jobs on the same server farm. This saves the user from having to acquire
additional hardware for a traditional database system to process the data (assuming such a system
can scale to the required size). Hadoop also reduces the effort and time required to load data into
another system; you can process it directly within Hadoop. This overhead becomes impractical in
very large data sets.
Many of the ideas behind the open source Hadoop project originated from the Internet
search community, most notably Google and Yahoo!. Search engines employ massive farms of
inexpensive servers that crawl the Internet, retrieving Web pages into local clusters where they
are analyzed with massive, parallel queries to build search indices and other useful data
structures.
The Hadoop ecosystem includes other tools to address particular needs. Hive is a SQL dialect and Pig is a dataflow language; both hide the tedium of creating MapReduce jobs behind higher-level abstractions better suited to user goals. ZooKeeper is used for federating services and Oozie is a scheduling system. Avro, Thrift, and Protobuf are platform-portable data serialization and description formats.
MapReduce:
MapReduce is now the most widely-used, general-purpose computing model and runtime
system for distributed data analytics. It provides a flexible and scalable foundation for analytics,
from traditional reporting to leading-edge machine learning algorithms. In the MapReduce
model, a compute “job” is decomposed into smaller “tasks” (which correspond to separate Java
Virtual Machine (JVM) processes in the Hadoop implementation). The tasks are distributed
around the cluster to parallelize and balance the load as much as possible. The MapReduce
runtime infrastructure coordinates the tasks, re-running any that fail or appear to hang. Users of
MapReduce don’t need to implement parallelism or reliability features themselves. Instead, they
focus on the data problem they are trying to solve.
Pig:
Pig is a platform for constructing data flows for extract, transform, and load (ETL)
processing and analysis of large datasets. Pig Latin, the programming language for Pig, provides
common data manipulation operations, such as grouping, joining, and filtering. Pig generates
Hadoop MapReduce jobs to perform the data flows. This high-level language for ad hoc analysis
allows developers to inspect HDFS stored data without the need to learn the complexities of the
MapReduce framework, thus simplifying the access to the data.
The Pig Latin scripting language is not only a higher-level data flow language but also
has operators similar to SQL (e.g., FILTER and JOIN) that are translated into a series of map and
reduce functions. Pig Latin, in essence, is designed to fill the gap between the declarative style of
SQL and the low-level procedural style of MapReduce.
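The kind of dataflow Pig Latin expresses (load, filter, group, count) looks roughly like this in plain Python (a hypothetical illustration only; the field names and rows are invented, not taken from the project's dataset):

```python
from collections import Counter

# Hypothetical inspection rows: (restaurant, city, critical_flag).
rows = [
    ("Alpha Grill", "NYC", "Critical"),
    ("Beta Diner",  "NYC", "Not Critical"),
    ("Alpha Grill", "NYC", "Critical"),
    ("Gamma Cafe",  "LA",  "Critical"),
]

# Mirrors Pig's FILTER: keep only critical violations.
critical = [r for r in rows if r[2] == "Critical"]

# Mirrors Pig's GROUP BY restaurant + COUNT per group.
counts = Counter(name for name, _, _ in critical)
print(counts.most_common())  # [('Alpha Grill', 2), ('Gamma Cafe', 1)]
```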
Hive :
Hive is a SQL-based data warehouse system for Hadoop that facilitates data
summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible
file systems (e.g., HDFS, MapR-FS, and S3) and some NoSQL databases. Hive is not a relational
database, but a query engine that supports the parts of SQL specific to querying data, with some
additional support for writing new tables or files, but not updating individual records. That is,
Hive jobs are optimized for scalability, i.e., computing over all rows, but not latency, i.e., when
you just want a few rows returned and you want the results returned quickly. Hive’s SQL dialect
is called HiveQL. Table schema can be defined that reflect the data in the underlying files or data
stores and SQL queries can be written against that data. Queries are translated to MapReduce
jobs to exploit the scalability of MapReduce. Hive also supports custom extensions written in
Java, including user-defined functions (UDFs) and serializer-deserializers for reading and
optionally writing custom formats, e.g., JSON and XML dialects. Hence, analysts have
tremendous flexibility in working with data from many sources and in many different formats,
with minimal need for complex ETL processes to transform data into more restrictive formats.
Contrast with Shark and Impala.
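Because HiveQL is close to standard SQL, an aggregate query of the kind this project runs can be prototyped against SQLite (a stand-in sketch only; the table and column names here are hypothetical, and Hive would translate the same query into MapReduce jobs over HDFS rather than run it locally):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inspections (restaurant TEXT, violation_code TEXT)")
conn.executemany(
    "INSERT INTO inspections VALUES (?, ?)",
    [("Alpha Grill", "04L"), ("Alpha Grill", "06C"), ("Beta Diner", "04L")],
)

# HiveQL-style aggregation: count violations per restaurant.
query = """
    SELECT restaurant, COUNT(*) AS violations
    FROM inspections
    GROUP BY restaurant
    ORDER BY violations DESC
"""
print(conn.execute(query).fetchall())
# [('Alpha Grill', 2), ('Beta Diner', 1)]
```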
Apache Hadoop Ecosystem
2.3 LINUX UBUNTU:
Ubuntu-logo
Ubuntu is an ancient African word meaning 'humanity to others'. It also means 'I am what
I am because of who we all are'. The Ubuntu operating system brings the spirit of Ubuntu to the
world of computers.
Linux was already established as an enterprise server platform in 2004, but free software
was not a part of everyday life for most computer users. That's why Mark Shuttleworth gathered
a small team of developers from one of the most established Linux projects – Debian – and set
out to create an easy-to-use Linux desktop: Ubuntu.
The vision for Ubuntu is part social and part economic: free software, available to
everybody on the same terms, and funded through a portfolio of services provided by Canonical.
The first official Ubuntu release, Version 4.10, codenamed the 'Warty Warthog', was launched
in October 2004, and sparked dramatic global interest as thousands of free software
enthusiasts and experts joined the Ubuntu community.
The governance of Ubuntu is somewhat independent of Canonical, with volunteer leaders
from around the world taking responsibility for many critical elements of the project. It remains a
key tenet of the Ubuntu Project that Ubuntu is a shared work between Canonical, other
companies, and the thousands of volunteers who bring their expertise to bear on making it a
world-class platform for anyone to use.
Ubuntu today has eight flavours and dozens of localised and specialised derivatives.
There are also special editions for servers, OpenStack clouds, and mobile devices. All editions
share common infrastructure and software, making Ubuntu a unique single platform that scales
from consumer electronics to the desktop and up into the cloud for enterprise computing.
The Ubuntu OS and the innovative Ubuntu for Android convergence solution make it an
exciting time for Ubuntu on mobile devices. In the cloud, Ubuntu is the reference operating
system for the OpenStack project, it’s a hugely popular guest OS on Amazon's EC2 and
Rackspace's Cloud, and it’s pre-installed on computers from Dell, HP, Asus, Lenovo and other
global vendors. And thanks to that shared infrastructure, developers can work on the desktop,
and smoothly deliver code to cloud servers running the stripped-down Ubuntu Server Edition.
After many years Ubuntu still is and always will be free to use, share and develop. We
hope it will bring a touch of light to your computing — and we hope that you'll join us in helping
to build the next version.
2.4 MySQL:
Mysql-logo
MySQL is the world's most popular open source database software, with over 100 million
copies of its software downloaded or distributed throughout its history. With its superior speed,
reliability, and ease of use, MySQL has become the preferred choice for Web, Web 2.0, SaaS,
ISV, Telecom companies and forward-thinking corporate IT Managers because it eliminates the
major problems associated with downtime, maintenance and administration for modern, online
applications.
Many of the world's largest and fastest-growing organizations use MySQL to save time
and money powering their high-volume Web sites, critical business systems, and packaged
software — including industry leaders such as Yahoo!, Alcatel-Lucent, Google, Nokia,
YouTube, Wikipedia, and Booking.com.
The flagship MySQL offering is MySQL Enterprise, a comprehensive set of production-
tested software, proactive monitoring tools, and premium support services available in an
affordable annual subscription.
MySQL is a key part of LAMP (Linux, Apache, MySQL, PHP / Perl / Python), the fast-
growing open source enterprise software stack. More and more companies are using LAMP as an
alternative to expensive proprietary software stacks because of its lower cost and freedom from
platform lock-in.
3. SYSTEM REQUIREMENTS
The purpose of this SRS document is to identify the requirements and functionalities for the Hotel Inspection Dataset Analysis project. The SRS defines how our team and the client conceive the final product and the characteristics and functionality it must have. This document also notes the optional requirements which we plan to implement but which are not mandatory for the functioning of the project.
This phase appraises the requirements for the Hotel Inspection dataset; several processes are involved in evaluating the requirements systematically. The first step in analyzing the requirements of the system is recognizing the nature of the system for a reliable investigation, and all the cases are formulated to better understand the analysis of the dataset.
Document Conventions:
The font conventions remain the same as for other documents in the project. Section headings use the largest font size, 14; subheadings use font size 12 (bold); and the body text is in font size 12. The priorities of the requirements are specified with the requirement statements.
Intended Audience and Reading Suggestions:
This document is intended for project developers, managers, users, testers and
documentation writers. This document aims at discussing design and implementation constraints,
dependencies, system features, external interface requirements and other non functional
requirements.
3.1 IDENTIFICATION OF NEEDS:
The foremost and most important necessity for a business firm or an organization is to know how it is performing in the market and, in parallel, how to overcome its competitors.

To do so, we need to analyze our data based on all the available factors. The system requirements for the project to be accomplished are:
3.2 ENVIRONMENTAL REQUIREMENTS:
3.2.1 Software Requirements:
Development & Usage:
Linux Operating System.
Apache Hadoop.
Mozilla Firefox: (or any browser).
Microsoft Excel or Open office.
3.2.2 Hardware Requirements:
Development & Usage:
Pentium 4 processor.
40GB Hard disc.
256 MB RAM (4 GB recommended).
System with all standard accessories like monitor, keyboard, mouse, etc.,
4. BUSINESS LOGIC
Logic Features:
1. Store:
The main intention of the Hotel Inspection Dataset project is to analyze the data based on the violations made by all inspected restaurants and hotels. To handle this, we first load the data into our Hadoop HDFS component.
2. Analysis:
This is the other major step; how it is done depends on the type of dataset we have. Since our Hotel Inspection Dataset is structured data, we work with the Hadoop ecosystem tool Hive.
4.1 SYSTEM ANALYSIS:
4.1.1 FUNCTIONAL REQUIREMENTS:
4.1.1.1 Technical Feasibility:
Evaluating technical feasibility is the trickiest part of a feasibility study. This is because, at this point, few detailed designs of the system exist, making it difficult to assess issues like performance and costs (on account of the kind of technology to be deployed).
A number of issues have to be considered while doing a technical analysis. Understand
the different technologies involved in the proposed system.
Before commencing the project, we have to be very clear about what are the technologies
that are to be required for the development of the new system.
Find out whether the organization currently possesses the required technologies. Is the
required technology available with the organization?
If so, is the capacity sufficient?
For instance –“Will the current printer be able to handle the new reports and forms required for
the new system?”
4.1.1.2 Operational Feasibility
Proposed projects are beneficial only if they can be turned into information systems that will meet the organization's operating requirements. Simply stated, this test of feasibility asks whether the system will work when it is developed and installed. Are there major barriers to implementation? Here are questions that will help test the operational feasibility of a project.
• Is there sufficient support for the project from management and from users? If the current system is well liked and used to the extent that people will not see reasons for change, there may be resistance.

• Are the current business methods acceptable to the users? If they are not, users may welcome a change that brings about a more operational and useful system.

• Have the users been involved in the planning and development of the project? Early involvement reduces the chances of resistance to the system in general and increases the likelihood of a successful project.
Since the proposed system was to help reduce the hardships encountered in the existing
manual system, the new system was considered operationally feasible.
4.2 SYSTEM DESIGN:
4.2.1 Business Flow:
4.2.1.1 Apache Hadoop Working Model-I:
Apache Hadoop Working Model-I
1. Create a Secure Shell (SSH) connection from the local host to the Linux (Ubuntu) kernel – ssh localhost
2. Start all daemons (NameNode, Secondary NameNode, DataNode, JobTracker, TaskTracker) – start-all.sh
3. Check whether all daemons are up – jps
4. Create a directory and move the dataset to HDFS using the Linux terminal.
5. Check the data format from the browser view; from the data point of view, choose the ecosystem component to work with.
6. Based on the ecosystem component, design the platform and execute the jobs.
7. Once the jobs are executed, generate the reports based on the dataset.
8. Analyze the reports for the improvement of the firm.
4.2.1.2 Apache Hadoop Working Model-II:
Apache Hadoop Working Model-II
1. Install a virtual machine (VMware).
2. Open the virtual machine image already created by Cloudera.
3. Start CentOS from the virtual machine and work with the terminal.
4. Create a directory and move the dataset to HDFS using the Linux terminal.
5. Check the data format from the browser view; from the data point of view, choose the ecosystem component to work with.
6. Based on the ecosystem component, design the platform and execute the jobs.
7. Once the jobs are executed, generate the reports based on the dataset.
8. Analyze the reports for the improvement of the firm.
4.2.2 Business Logic:
Functional Programming:
Multithreading is one of the popular ways of doing parallel programming, but the major
complexity of multi-threaded programming is coordinating the access of each thread to the shared
data. We need constructs like semaphores and locks, and we must use them with great care,
otherwise deadlocks will result.
User defined Map/Reduce functions:
Map/reduce is a special form of such a DAG which is applicable in a wide range of use
cases. It is organized as a "map" function which transforms a piece of data into some number of
key/value pairs. Each of these elements is then sorted by its key and routed to the same
node, where a "reduce" function is used to merge the values of the same key into a single result.
Mapper:
map(input_record) {
...
emit(k1, v1)
...
emit(k2, v2)
...
}
Reducer:
reduce (key, values) {
aggregate = initialize()
while (values.has_next) {
aggregate = merge(values.next)
}
collect(key, aggregate)
}
MapReduce logic
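The mapper/reducer pseudocode above can be sketched as a small, runnable simulation. This word-count example is illustrative (not part of the original project); the in-memory grouping stands in for Hadoop's sort-and-shuffle phase:

```python
from collections import defaultdict

def mapper(record):
    # Emit a (word, 1) pair for every word in the input record.
    for word in record.split():
        yield (word, 1)

def reducer(key, values):
    # Merge all values for the same key into a single result.
    aggregate = 0
    for v in values:
        aggregate += v
    return (key, aggregate)

def map_reduce(records):
    # Shuffle: group every emitted value by its key, as Hadoop's
    # sort phase would before handing each key to a reducer.
    groups = defaultdict(list)
    for record in records:
        for k, v in mapper(record):
            groups[k].append(v)
    return dict(reducer(k, vs) for k, vs in groups.items())

result = map_reduce(["critical violation", "critical flag"])
print(result)  # {'critical': 2, 'violation': 1, 'flag': 1}
```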
Job execution phase in Hadoop
5. PROJECT MODULES
5.1 MODULES INTRODUCTION:
The dataset holds Hotel Inspection data from recent years. We have taken the dataset
from the reference website https://data.ny.gov/. The full dataset is very large, at around three
lakh (300,000) lines. We took only a part of it, as our basic systems cannot support such a huge
dataset; handling it all needs a well-provisioned cluster configuration.
For this project we have taken a dataset of around twenty-five thousand lines.
We analyzed the raw dataset, eliminated the unnecessary fields from the data,
and gave the dataset a well-organized format.
The dataset is divided into two tables based on the data and their fields. The first table
deals with the inspection data and contains fields like id, name of restaurant, area,
address, location, inspection date, violation code, critical point of violation, and type of inspection. The
second table deals with the violation code and the violation description.
5.2 MODULES:
5.2.1 Analyzing the Data and filtering the Data:
In the first step of the project we need to analyze the data and check how it
has been formatted. We should be aware of the fields given to us and know the
importance of each and every field. If we think there is some unnecessary information
disturbing our dataset, we need to talk to our client before taking any step to change
the dataset or to remove or move any columns.
Raw-dataset
Unnecessary data fields
The unnecessary fields have been removed from the raw dataset, and the dataset has been divided
into two separate tables.
Table 1 (violation) – dataset with the violation code and its explanation.
Table 2 (hotel) – dataset with the violation code and the remaining fields from the filtered dataset.
The final dataset will now be referred to as the filtered dataset.
final-dataset
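This filtering step can be sketched in plain Python. The keep-list here is a subset of the schema in module 5.2.2, and the sample row is invented for illustration; in the project the same idea is applied to the full CSV files:

```python
import csv
import io

# Subset of columns kept for the filtered hotel table (see the schema
# in module 5.2.2); illustrative only.
KEEP = ["CAMIS", "DBA", "BORO", "VIOLATION CODE", "CRITICAL FLAG"]

def filter_columns(reader, keep):
    # Drop every field that is not in the keep-list.
    return [{k: row[k] for k in keep if k in row} for row in reader]

# Tiny inline sample standing in for the raw CSV file.
raw = io.StringIO(
    "CAMIS,DBA,BORO,PHONE,VIOLATION CODE,CRITICAL FLAG\n"
    "30075445,MORRIS PARK BAKE SHOP,BRONX,7188924968,10F,Not Critical\n")
rows = filter_columns(csv.DictReader(raw), KEEP)
print(rows[0]["DBA"])  # MORRIS PARK BAKE SHOP
```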
5.2.2 Identifying the headers (Schema):
The schema is generated based on the dataset and the data we have. This schema is
for the table hotel.
Schema for Hotel:
Name of Header Description headername in schema
ID - id (Primary Key) - id
CAMIS - Refers to the Store ID's - camis
DBA - Refers to the Restaurant - dba
BORO - Place - boro
BUILDING - Building Number - building
STREET - Street Address - street
ZIPCODE - Area zipcode - zipcode
PHONE - Store phone - phone
CUISINE DESCRIPTION - Type of Cuisine - cuisine_description
INSPECTION DATE - Inspected on Date - inspection_date
ACTION - Type of Action - action
VIOLATION CODE - Violation Codes - violation_code
CRITICAL FLAG - Seriousness of Violations - critical_flag
SCORE - Rating - score
GRADE - Grade - grade
GRADE DATE - Grade Date - grade_date
RECORD DATE - Record Date - record_date
INSPECTION TYPE - Type of Inspection - inspection_type
This schema is for the table Violation: the violation code and the violation
description.
Schema for table Violation:
Name of Header Description headername in schema
ID - id (Primary Key) - id
VIOLATION CODE - Refers to Violation code - violation_code
VIOLATION DESCRIPTION - Refers to description of code - v_desc
Violation Table Schema diagram:
Violation Table Schema diagram
Hotel Table Schema diagram:
Hotel Table Schema diagram
5.2.3 Installing Single Node Hadoop Cluster:
Java Development Kit 1.7:
Download the Java Development Kit 1.7 from the official Oracle website.
Once JDK 1.7 is downloaded, extract the file from Downloads and create a directory named
java in the root directory. The path of the directory is "/usr/lib/java".
Once the java folder is created with sudo (administrator) permissions, move the
downloaded JDK to /usr/lib/java/, so JDK 1.7 resides in /usr/lib/java/jdk1.7. The Java
path is now "/usr/lib/java/jdk1.7".
Jdk installation path
Once this part is done, we need to register the Java executables with the system; to do
so, run the scripts below.
sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/java/jdk1.7.0_67/bin/java" 1
sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/java/jdk1.7.0_67/bin/javac" 1
sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/java/jdk1.7.0_67/bin/javaws" 1
To check that the Java installation is complete, run the command "java -version".
java path
Now edit the bashrc file in Linux (Ubuntu), to do so run the command
sudo gedit ~/.bashrc
and add the following lines to the file:
export JAVA_HOME="/usr/lib/java/jdk1.7.0_67"
export PATH="$PATH:$JAVA_HOME/bin"
alias jps="/usr/lib/java/jdk1.7.0_67/bin/jps"
Install Hadoop 1.2:
Download the Hadoop 1.2 release from the official Apache Hadoop website.
Create a directory named hadoop in the /usr/lib/ path. Once the directory is created, extract the
downloaded Hadoop archive and move it to the "/usr/lib/hadoop" path with sudo permissions.
Hadoop location
Configure the Hadoop location in the bashrc file: sudo gedit ~/.bashrc
Add the lines to the file:
export HADOOP_HOME="/usr/lib/hadoop/hadoop-1.2.1"
export PATH=$PATH:$HADOOP_HOME/bin
Hadoop installation cross checking
Install Hive:
Now download the Hive 0.12.0 release from the official Apache Hive website.
Create a directory named hive in the "/usr/lib" directory and move the extracted files to "/usr/lib/hive/";
this path is the Hive directory.
Open the bashrc file: sudo gedit ~/.bashrc
Configure the file with the script:
# Hive Home Directory Configuration
export HIVE_HOME="/usr/lib/hive/hive-0.12.0"
export PATH=$PATH:$HIVE_HOME/bin
hive installation cross check
We need to configure four important files in the Hadoop environment. Open the Hadoop
configuration directory at "/usr/lib/hadoop/hadoop-1.2.1/conf".
Open the files hdfs-site.xml, mapred-site.xml, core-site.xml, and hadoop-env.sh, and add the following
lines to these files respectively:
Hadoop-env.sh file
core-site.xml
mapred-site.xml
hdfs-site.xml
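The screenshots referenced above are not reproduced here. For a typical Hadoop 1.x pseudo-distributed (single-node) setup, the key entries are usually along the following lines; the exact ports and values in the original screenshots may differ, so treat this as an assumed sketch. hadoop-env.sh typically needs only export JAVA_HOME=/usr/lib/java/jdk1.7.0_67.

```xml
<!-- core-site.xml : default file system URI (port is an assumption) -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>

<!-- mapred-site.xml : JobTracker address (port is an assumption) -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>

<!-- hdfs-site.xml : a single-node cluster keeps one replica -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```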
Hadoop installation in Cloudera:
5.2.4 Moving the data to HDFS:
Once the data schema is ready and the Hadoop installation is done, our next task is to
move the data from our local file system to the Hadoop single-node cluster, i.e., to HDFS, the
component of Hadoop where data is stored as files.
The command we use is: hadoop fs -mkdir hotel
This command creates a directory for our project in HDFS. Here we are
creating a directory hotel which is used to store our datasets:
1) Hotel dataset
2) Violation code dataset
Creating directory
hadoop fs -ls
This command lists all the directories in HDFS. We use it to cross-check
whether our directory hotel has been created.
Listing the directory
hadoop fs -copyFromLocal <src> <dest>
This command moves a file from the local file system to HDFS.
We are copying our file hotel.csv to the hotel directory of HDFS:
hadoop fs -copyFromLocal '/home/username/Desktop/hotel.csv' /user/username/hotel/
'/home/username/Desktop/hotel.csv' indicates the location of the file.
‘/user/username/hotel/’ indicates the location of HDFS.
Hotel – indicates the HDFS directory
Moving data to hdfs
From the images we can see the two files hotel.csv and codes.txt have been moved to the HDFS
directory hotel.
hadoop fs -ls hotel
This command lists all the files in the specified HDFS directory hotel. We
cross-check whether our files have been created.
hadoop fs -ls hotel
Checking the files in hdfs
It is clear that we have moved all our files to the HDFS hotel directory.
5.2.5 Creating the tables in hive:
We are all set to create the tables for our dataset.
The query for creating the hotel table:
hive -e "create table 360_hotel (camis string, dba string, boro string, building string, street
string, zipcode string, phone string, cuisine_description string, inspection_date string, action
string, violation_code string, critical_flag string, score string, grade string, grade_date string,
record_date string, inspection_type string) row format delimited fields terminated by ','"
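The create-table, load, and query cycle used in this and the following modules can be sketched locally with Python's sqlite3 standing in for Hive. This is purely for illustration (the columns are a simplified subset, the rows are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Simplified version of the Hive hotel table, with a few of the columns.
con.execute("create table hotel (dba text, boro text, violation_code text)")
# Stand-in for "load data inpath ... into table".
con.executemany("insert into hotel values (?, ?, ?)",
                [("MORRIS PARK BAKE SHOP", "BRONX", "10F"),
                 ("WENDY", "BROOKLYN", "08A")])
# Stand-in for a HiveQL query against the loaded table.
n = con.execute("select count(*) from hotel where boro='BRONX'").fetchone()[0]
print(n)  # 1
```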
Table created successfully
To see the table:
hive -e "show tables"
Checking created table
5.2.6 Importing data from HDFS to the Hive warehouse:
To Load Data:
hive -e "load data inpath '/user/training/hotel/hotel.csv' overwrite into table 360_hotel"
Data loaded to hive warehouse and table.
Hotel Table Description: hive -e "desc 360_hotel"
Table Description
Checking the tables:
hive -e "select * from hotel limit 3"
verifying the data
5.2.7 Analyzing the data based on the queries from the client:
- Frequently violated codes.
- How many stores/restaurants have been inspected, location wise.
- Number of violations made by each restaurant.
- How many areas have been covered in the inspection?
- Types of cuisines inspected.
- Inspections made, in descending order.
- Violation codes, in ascending order.
- Restaurants with no violations cited.
- Critical and non-critical violations.
- Critical and non-critical violation codes.
Frequently violated codes:
hive -e "SELECT violation_code, COUNT(violation_code) FROM hotel GROUP BY
violation_code HAVING (COUNT(violation_code) > 1) limit 5"
Job execution
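The same aggregation can be sketched in plain Python, with a Counter over the violation_code column. The sample codes below are illustrative, not taken from the real dataset:

```python
from collections import Counter

# Illustrative violation codes, standing in for the violation_code column.
codes = ["10F", "08A", "10F", "04L", "10F", "08A", "06D"]

# Equivalent of: SELECT violation_code, COUNT(violation_code) ...
# GROUP BY violation_code HAVING (COUNT(violation_code) > 1) limit 5,
# ordered by frequency.
frequent = [(c, n) for c, n in Counter(codes).most_common(5) if n > 1]
print(frequent)  # [('10F', 3), ('08A', 2)]
```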
The above displays the result on the screen, but we need the result set exported to a
spreadsheet to generate the reports.
To do so we store the result set in a table, or we store the result in HDFS; then
we can move the result data from HDFS to our local file system, from where the data is exported
to spreadsheet files to generate the reports.
The result stored in HDFS can later be saved as output.ods or output.xls.
hive -e "insert overwrite directory '/user/training/output.csv' SELECT
violation_code, COUNT(violation_code) FROM hotel GROUP BY violation_code
HAVING (COUNT(violation_code) > 1)"
The result set has been stored in HDFS under the name output.csv, and the path to access it
is '/user/training/output.csv'.
To export the file from HDFS to the local file system:
hadoop fs -copyToLocal '/user/training/output.csv' /home/training/Desktop/
Query executed and the data is loaded to the hdfs directory
The result set has been stored in HDFS. Now we need to move the result set to the
local file system.
The result set has been moved to the home directory '/home/training/'.
Stored output
These are the result files in CSV format. We need to export this data to Excel to make the report in
an efficient way.
The output generated by the query.
5.2.8 Generating the Reports:
In this module we deal with all the generated reports. We can use any data reporting
tool, or we can go with Excel.
Report generated from the query
6. EXECUTION OF JOBS
6.1 METHODS OF EXECUTION:
We can execute the jobs in hive in three different ways:
6.1.1 Executing the job from the hive prompt:
The job is written directly in the hive prompt:
hive prompt
6.1.2 Executing the job from terminal with Hadoop:
The job is executed here from the Hadoop terminal; there is no contact with the hive
prompt during the job execution:
Query using hive -e
6.1.3 Executing the job as a script:
The job is executed as a script here; once the script has been written, it is placed in the
home directory of the Linux environment.
query from script
script home directory
query written in script
6.2 EXECUTION OF HIVEQL JOBS:
How many stores/restaurants have been inspected, location wise:
hive -e "insert overwrite directory '/user/training/output2-1.csv' select count(dba) from hotel
where boro='BRONX'"
hadoop fs -copyToLocal '/user/training/output2-1.csv' /home/training/Desktop
hive -e "insert overwrite directory '/user/training/output2-2.csv' select count(dba) from hotel
where boro='BROOKLYN'"
hadoop fs -copyToLocal '/user/training/output2-2.csv' /home/training/Desktop
hive -e "insert overwrite directory '/user/training/output2-3.csv' select count(dba) from hotel
where boro='MANHATTAN'"
hadoop fs -copyToLocal '/user/training/output2-3.csv' /home/training/Desktop
hive -e "insert overwrite directory '/user/training/output2-4.csv' select count(dba) from hotel
where boro='QUEENS'"
hadoop fs -copyToLocal '/user/training/output2-4.csv' /home/training/Desktop
hive -e "insert overwrite directory '/user/training/output2-5.csv' select count(dba) from hotel
where boro='STATEN ISLAND'"
hadoop fs -copyToLocal '/user/training/output2-5.csv' /home/training/Desktop
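The five per-borough queries above could also be expressed as a single grouped count (SELECT boro, COUNT(dba) FROM hotel GROUP BY boro). A plain-Python sketch of that aggregation, with invented sample rows:

```python
from collections import Counter

# (dba, boro) pairs standing in for rows of the hotel table.
rows = [("MORRIS PARK BAKE SHOP", "BRONX"),
        ("WENDY", "BROOKLYN"),
        ("RIVIERA CATERER", "BROOKLYN"),
        ("GLORIOUS FOOD", "MANHATTAN")]

# Equivalent of: SELECT boro, COUNT(dba) FROM hotel GROUP BY boro
per_boro = Counter(boro for _, boro in rows)
print(per_boro["BROOKLYN"])  # 2
```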
Number of violations made by each restaurant:
hive -e "insert overwrite directory '/user/training/output2.csv' select distinct(dba) from hotel"
hive -e "insert overwrite directory '/user/training/output3.csv' select count(violation_code) from
hotel where dba = 'MORRIS PARK BAKE SHOP'"
hive -e "insert overwrite directory '/user/training/output3-1.csv' select count(violation_code)
from hotel where dba = 'WENDY'"
hive -e "insert overwrite directory '/user/training/output3-2.csv' select count(violation_code)
from hotel where dba = 'DJ REYNOLDS PUB AND RESTAURANT'"
hive -e "insert overwrite directory '/user/training/output3-3.csv' select count(violation_code)
from hotel where dba = 'RIVIERA CATERER'"
hive -e "insert overwrite directory '/user/training/output3-4.csv' select count(violation_code)
from hotel where dba = 'TOV KOSHER KITCHEN'"
hive -e "insert overwrite directory '/user/training/output3-5.csv' select count(violation_code)
from hotel where dba = 'BRUNOS ON THE BOULEVARD'"
hive -e "insert overwrite directory '/user/training/output3-6.csv' select count(violation_code)
from hotel where dba = 'KOSHER ISLAND'"
hive -e "insert overwrite directory '/user/training/output3-7.csv' select count(violation_code)
from hotel where dba = 'WILKEN\'S FINE FOOD'"
hive -e "insert overwrite directory '/user/training/output3-8.csv' select count(violation_code)
from hotel where dba = 'REGINA CATERERS'"
hive -e "insert overwrite directory '/user/training/output3-9.csv' select count(violation_code)
from hotel where dba = 'MAY MAY KITCHEN'"
hive -e "insert overwrite directory '/user/training/output3-10.csv' select count(violation_code)
from hotel where dba = 'NATHAN\'S FAMOUS'"
hive -e "insert overwrite directory '/user/training/output3-11.csv' select count(violation_code)
from hotel where dba = 'SEUDA FOODS'"
hive -e "insert overwrite directory '/user/training/output3-12.csv' select count(violation_code)
from hotel where dba = 'CARVEL ICE CREAM'"
hive -e "insert overwrite directory '/user/training/output3-13.csv' select count(violation_code)
from hotel where dba = 'GLORIOUS FOOD'"
How many areas have been covered in the inspection:
hive -e "select distinct(boro) from hotel"
Types of cuisines inspected:
"insert overwrite directory '/user/training/output.csv' select distinct(cuisine_description) from
hotel"
"select count(cuisine_description) from hotel where cuisine_description='African'"
"select count(cuisine_description) from hotel where cuisine_description='American'"
"select count(cuisine_description) from hotel where cuisine_description='Armenian'"
"select count(cuisine_description) from hotel where cuisine_description='Bagels/Pretzels'"
"select count(cuisine_description) from hotel where cuisine_description='Bakery'"
"select count(cuisine_description) from hotel where cuisine_description='Café/Coffee/Tea'"
"select count(cuisine_description) from hotel where cuisine_description='Caribbean'"
"select count(cuisine_description) from hotel where cuisine_description='Chicken'"
"select count(cuisine_description) from hotel where cuisine_description='Chinese'"
"select count(cuisine_description) from hotel where cuisine_description='Continental'"
"select count(cuisine_description) from hotel where cuisine_description='Donuts'"
"select count(cuisine_description) from hotel where cuisine_description='German'"
"select count(cuisine_description) from hotel where cuisine_description='Greek'"
"select count(cuisine_description) from hotel where cuisine_description='Hamburgers'"
"select count(cuisine_description) from hotel where cuisine_description='Hotdogs'"
"select count(cuisine_description) from hotel where cuisine_description='Indian'"
"select count(cuisine_description) from hotel where cuisine_description='Japanese'"
Critical Violation and non critical violation codes:
hive -e "insert overwrite directory '/user/training/critical.csv' select violation_code from hotel
where critical_flag = 'Critical'"
hive -e "insert overwrite directory '/user/training/not-critical.csv' select violation_code from hotel
where critical_flag = 'Not Critical' "
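The two queries above split the violation codes on the critical_flag column. Sketched in plain Python, with invented sample rows:

```python
# (violation_code, critical_flag) pairs standing in for hotel-table rows.
rows = [("04L", "Critical"), ("10F", "Not Critical"), ("02G", "Critical")]

# Equivalent of the two SELECT ... WHERE critical_flag = ... queries.
critical = [code for code, flag in rows if flag == "Critical"]
not_critical = [code for code, flag in rows if flag == "Not Critical"]
print(critical, not_critical)  # ['04L', '02G'] ['10F']
```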
7. TESTING
7.1 INTRODUCTION:
Software testing is a critical element of software quality assurance and represents the
ultimate review of specification, design and coding. The increasing visibility of software as a
system element and the attendant costs associated with a software failure are motivating factors for
well-planned, thorough testing. Testing is the process of executing a program with the intent of
finding an error. The design of tests for software and other engineered products can be as
challenging as the initial design of the product itself.
7.2 SAMPLE UNIT TESTING:
Unit testing is done when the data is loaded into HDFS. Once the data is loaded we cross-check
it by viewing it in the browser. Now take sample data from the browser, say one chunk
of the file:
copy the data to a text file, load the sample data into HDFS and work on it; write
the jobs on the sample data, execute the jobs, and store the results. If the job executes
successfully on the sample data, then execute the job on the main dataset with the same parameters.
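The workflow above (test the job on a small chunk, then run it unchanged on the full dataset) can be sketched as follows; the job and sample rows are invented stand-ins for the Hive jobs:

```python
def run_job(rows):
    # Stand-in for the Hive job: count violations per restaurant.
    counts = {}
    for dba, code in rows:
        counts[dba] = counts.get(dba, 0) + 1
    return counts

# Unit test on a small chunk of the data first ...
sample = [("SEUDA FOODS", "10F"), ("SEUDA FOODS", "04L")]
result = run_job(sample)
print(result)  # {'SEUDA FOODS': 2}
# ... then the same job runs unchanged on the main dataset.
```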
8. SCREENS
Violation codes
Violations made
Inspections made area wise
Violations counts from each restaurant
Types of Cuisines inspected
More inspections per cuisine
Critical and non critical issues
9. CONCLUSIONS
Hadoop is a trending technology in the market. Hadoop solves the big data problem more
effectively and efficiently. More importantly, Hadoop can analyze any kind of data. Analyzing
data with Hadoop requires much less time, and it reduces the production time,
which directly benefits the economy of the organization.
Analyzing the dataset with Apache Hadoop overcomes the issues caused by the
traditional RDBMS and the master-slave architecture of servers.
In our project we analyze the Hotel Inspection dataset using Hadoop.
This analysis makes it possible to determine the total number of hotels, their violations, and their descriptions.
10. REFERENCES
Hadoop:
https://hadoop.apache.org/
Java:
http://www.oracle.com/technetwork/java/javase/downloads/
Hive:
https://hive.apache.org/
Linux:
http://www.ubuntu.com/
[Webinar] - Use mobile forms for higher business productivityTaraSpan
 
Architecting a Data Warehouse: A Case Study
Architecting a Data Warehouse: A Case StudyArchitecting a Data Warehouse: A Case Study
Architecting a Data Warehouse: A Case StudyMark Ginnebaugh
 
Data collection of five star hotel
Data collection of five star hotelData collection of five star hotel
Data collection of five star hotelAr. Sahid Akhtar
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesIvo Andreev
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureJames Serra
 
Data Collection-Primary & Secondary
Data Collection-Primary & SecondaryData Collection-Primary & Secondary
Data Collection-Primary & SecondaryPrathamesh Parab
 

Viewers also liked (15)

1 house-inspections report
1 house-inspections report1 house-inspections report
1 house-inspections report
 
Presentation on Rebeca Gasso Aguilar's "Sources, Methods and Triangulation in...
Presentation on Rebeca Gasso Aguilar's "Sources, Methods and Triangulation in...Presentation on Rebeca Gasso Aguilar's "Sources, Methods and Triangulation in...
Presentation on Rebeca Gasso Aguilar's "Sources, Methods and Triangulation in...
 
New purchase contract sample arizona - copy aar
New purchase contract sample   arizona - copy aarNew purchase contract sample   arizona - copy aar
New purchase contract sample arizona - copy aar
 
Pre Shipment Inpection
Pre Shipment Inpection Pre Shipment Inpection
Pre Shipment Inpection
 
Septic tank inspection guide
Septic tank inspection guideSeptic tank inspection guide
Septic tank inspection guide
 
Stock Analyzer Hadoop MapReduce Implementation
Stock Analyzer Hadoop MapReduce ImplementationStock Analyzer Hadoop MapReduce Implementation
Stock Analyzer Hadoop MapReduce Implementation
 
Esempio di questionario per la job analysis
Esempio di questionario per la job analysisEsempio di questionario per la job analysis
Esempio di questionario per la job analysis
 
Data collection techniques
Data collection techniquesData collection techniques
Data collection techniques
 
[Webinar] - Use mobile forms for higher business productivity
[Webinar] - Use mobile forms for higher business productivity[Webinar] - Use mobile forms for higher business productivity
[Webinar] - Use mobile forms for higher business productivity
 
Architecting a Data Warehouse: A Case Study
Architecting a Data Warehouse: A Case StudyArchitecting a Data Warehouse: A Case Study
Architecting a Data Warehouse: A Case Study
 
Data collection of five star hotel
Data collection of five star hotelData collection of five star hotel
Data collection of five star hotel
 
Chapter 9-METHODS OF DATA COLLECTION
Chapter 9-METHODS OF DATA COLLECTIONChapter 9-METHODS OF DATA COLLECTION
Chapter 9-METHODS OF DATA COLLECTION
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best Practices
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse Architecture
 
Data Collection-Primary & Secondary
Data Collection-Primary & SecondaryData Collection-Primary & Secondary
Data Collection-Primary & Secondary
 

Similar to Hotel inspection data set analysis copy

A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)Denodo
 
What is the future of data strategy?
What is the future of data strategy?What is the future of data strategy?
What is the future of data strategy?Denodo
 
Future of Data Strategy (ASEAN)
Future of Data Strategy (ASEAN)Future of Data Strategy (ASEAN)
Future of Data Strategy (ASEAN)Denodo
 
WSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product OverviewWSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product OverviewWSO2
 
9 Hyperion Performance Myths and How to Debunk Them
9 Hyperion Performance Myths and How to Debunk Them9 Hyperion Performance Myths and How to Debunk Them
9 Hyperion Performance Myths and How to Debunk ThemDatavail
 
From Business Idea to Successful Delivery by Serhiy Haziyev & Olha Hrytsay, S...
From Business Idea to Successful Delivery by Serhiy Haziyev & Olha Hrytsay, S...From Business Idea to Successful Delivery by Serhiy Haziyev & Olha Hrytsay, S...
From Business Idea to Successful Delivery by Serhiy Haziyev & Olha Hrytsay, S...SoftServe
 
127801976 mobile-shop-management-system-documentation
127801976 mobile-shop-management-system-documentation127801976 mobile-shop-management-system-documentation
127801976 mobile-shop-management-system-documentationNitesh Kumar
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...Denodo
 
Strengthening the Quality of Big Data Implementations
Strengthening the Quality of Big Data ImplementationsStrengthening the Quality of Big Data Implementations
Strengthening the Quality of Big Data ImplementationsCognizant
 
Accelerating Data Lakes and Streams with Real-time Analytics
Accelerating Data Lakes and Streams with Real-time AnalyticsAccelerating Data Lakes and Streams with Real-time Analytics
Accelerating Data Lakes and Streams with Real-time AnalyticsArcadia Data
 
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Denodo
 
Von der Zustandsüberwachung zur vorausschauenden Wartung
Von der Zustandsüberwachung zur vorausschauenden WartungVon der Zustandsüberwachung zur vorausschauenden Wartung
Von der Zustandsüberwachung zur vorausschauenden WartungPeter Schleinitz
 
On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...Jorge Cardoso
 
Hot Technologies of 2013: Investigative Analytics
Hot Technologies of 2013: Investigative AnalyticsHot Technologies of 2013: Investigative Analytics
Hot Technologies of 2013: Investigative AnalyticsInside Analysis
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationDenodo
 
Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19Prof.Balakrishnan S
 
DX2000 from NEC lets you put big data to work
DX2000 from NEC lets you put big data to workDX2000 from NEC lets you put big data to work
DX2000 from NEC lets you put big data to workPrincipled Technologies
 

Similar to Hotel inspection data set analysis copy (20)

A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)
 
What is the future of data strategy?
What is the future of data strategy?What is the future of data strategy?
What is the future of data strategy?
 
Future of Data Strategy (ASEAN)
Future of Data Strategy (ASEAN)Future of Data Strategy (ASEAN)
Future of Data Strategy (ASEAN)
 
WSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product OverviewWSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product Overview
 
9 Hyperion Performance Myths and How to Debunk Them
9 Hyperion Performance Myths and How to Debunk Them9 Hyperion Performance Myths and How to Debunk Them
9 Hyperion Performance Myths and How to Debunk Them
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
From Business Idea to Successful Delivery by Serhiy Haziyev & Olha Hrytsay, S...
From Business Idea to Successful Delivery by Serhiy Haziyev & Olha Hrytsay, S...From Business Idea to Successful Delivery by Serhiy Haziyev & Olha Hrytsay, S...
From Business Idea to Successful Delivery by Serhiy Haziyev & Olha Hrytsay, S...
 
127801976 mobile-shop-management-system-documentation
127801976 mobile-shop-management-system-documentation127801976 mobile-shop-management-system-documentation
127801976 mobile-shop-management-system-documentation
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
 
Strengthening the Quality of Big Data Implementations
Strengthening the Quality of Big Data ImplementationsStrengthening the Quality of Big Data Implementations
Strengthening the Quality of Big Data Implementations
 
Accelerating Data Lakes and Streams with Real-time Analytics
Accelerating Data Lakes and Streams with Real-time AnalyticsAccelerating Data Lakes and Streams with Real-time Analytics
Accelerating Data Lakes and Streams with Real-time Analytics
 
Big Data: hype or necessity?
Big Data: hype or necessity?Big Data: hype or necessity?
Big Data: hype or necessity?
 
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
 
Von der Zustandsüberwachung zur vorausschauenden Wartung
Von der Zustandsüberwachung zur vorausschauenden WartungVon der Zustandsüberwachung zur vorausschauenden Wartung
Von der Zustandsüberwachung zur vorausschauenden Wartung
 
On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...
 
Hot Technologies of 2013: Investigative Analytics
Hot Technologies of 2013: Investigative AnalyticsHot Technologies of 2013: Investigative Analytics
Hot Technologies of 2013: Investigative Analytics
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
 
IoTReport
IoTReportIoTReport
IoTReport
 
Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19
 
DX2000 from NEC lets you put big data to work
DX2000 from NEC lets you put big data to workDX2000 from NEC lets you put big data to work
DX2000 from NEC lets you put big data to work
 

Recently uploaded

(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 

Recently uploaded (20)

(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 

Hotel inspection data set analysis copy

  • 1. HOTEL INSPECTION DATASET ANALYSIS 1 A mini project on BIG DATA-HADOOP Project Title HOTEL INSPECTION DATASET ANALYSIS Presented by, SHARON MOSES RAGINI AKULA
  • 2. HOTEL INSPECTION DATASET ANALYSIS 2 CONTENTS Abstract List of figures List of screens TOPIC NAME PAGE NO 1. INTRODUCTION 8-10 1.1 Motivation 8 1.2 Existing System 8 1.3 Problem Definition 9 1.3.1 Storing 9 1.3.2 Processing 9 1.4 Proposed System 9 1.5 Features of Project 10 1.5.1 Storing the Data Set 10 1.5.2 Processing the Data Set 10 2. LITERATURE SURVEY 11-27 2.1 Big Data 11 2.2 Apache Hadoop 13 2.2.1 Vendors 16 2.2.2 Cloudera 20 2.2.3 Hadoop Ecosystems 22 2.3 Linux Ubuntu 24 2.4 MySQL 26 3. SYSTEM REQUIREMENTS 28-29
  • 3. HOTEL INSPECTION DATASET ANALYSIS 3 3.1 Identification of Needs 28 3.2 Environmental Requirements 29 3.2.1 Software Requirements 29 3.2.2 Hardware Requirements 29 4. BUSINESS LOGIC 30-36 4.1 System Analysis 30 4.1.1 Functional Requirements 30 4.1.1.1 Technical Feasibility 30 4.1.1.2 Operational Feasibility 31 4.2 System Design 32 4.2.1 Business Flow 32 4.2.1.1 Apache Hadoop Working Model-I 32 4.2.1.2 Apache Hadoop Working Model-II 33 4.2.2 Business Logic 34 5. PROJECT MODULES 37-61 5.1 Modules Introduction 37 5.2 Modules 37 5.2.1 Analysing and Filtering the Data 37 5.2.2 Identifying the Headers (Schema) 39 5.2.3 Installing a Single-Node Hadoop Cluster 43 5.2.4 Moving the Data to HDFS 51 5.2.5 Creating the Tables in Hive 54 5.2.6 Importing Data from HDFS to the Hive Warehouse 56 5.2.7 Analysing the Data Based on Client Queries 58 5.2.8 Generating the Reports 61 6. EXECUTION OF JOBS 62-69
  • 4. HOTEL INSPECTION DATASET ANALYSIS 4 6.1 Methods of Execution 62 6.1.1 Executing the Job from the Hive Prompt 62 6.1.2 Executing the Job from the Terminal with Hadoop 63 6.1.3 Executing the Job as a Script 63 6.2 Execution of HiveQL Jobs 65 7. TESTING 70 7.1 Introduction 70 7.2 Sample Unit Testing 70 8. SCREENS 71-74 9. CONCLUSIONS 75 10. REFERENCES 76
  • 5. HOTEL INSPECTION DATASET ANALYSIS 5 ABSTRACT Hotels are generally complex and costly to maintain: food quality must be controlled, and spaces such as guest rooms, restaurants, health clubs, swimming pools and retail stores follow different schedules and uses, each with its own engineering system to be kept in order. Maintenance therefore has to be carried out throughout the year by competent staff who undertake building services, operation and maintenance, supplemented by outsourced contractors. In the hospitality industry the maintenance of these engineering systems is important despite its complexity, because its effectiveness directly affects the quality of hotel service, food and beverage, which in turn shape guests' impression of the hotel. This project uses data from inspections carried out on hotels in various parts of the USA. The data records the violations made by hotel managements and their violation codes, and also the action taken by the government against each hotel according to those codes. We analyze where hotels violate the codes, so that new hotels can avoid these problems and survive in the market.
  • 6. HOTEL INSPECTION DATASET ANALYSIS 6 LIST OF FIGURES Figure no Name of the figure Page no 1 Big data 3 v’s 11 2 Data Measurements 12 3 Hadoop logo 13 4 Components of Hadoop 14 5 HDFS data distribution 15 6 Map reduce compute distribution 15 7 Performance and Scalability 19 8 Apache Hadoop ecosystem 24 9 Ubuntu logo 24 10 MySql logo 26 11 Apache Hadoop working model-I 32 12 Apache Hadoop working model-II 33 13 MapReduce Logic 35 14 Job execution phase in Hadoop 36 15 Violation table schema diagram 42 16 Hotel table schema diagram 43
  • 7. HOTEL INSPECTION DATASET ANALYSIS 7 LIST OF SCREENS Screen no Name of the screen Page no 1 Raw dataset 38 2 Unnecessary data fields 38 3 Final dataset 39 4 JDK installation path 44 5 Java path 45 6 Hadoop location 46 7 Hadoop installation crosscheck 47 8 Hive installation crosscheck 48 9 Hadoop-env.sh file 49 10 Core-site.xml 49 11 Mapred-site.xml 50 12 Hdfs-site.xml 50 13 Creating directory 51 14 Listing the directories 52 15 Moving data to HDFS 53 16 Checking the files in HDFS 54 17 Table created successfully 55 18 Checking created tables 56 19 Data loaded to Hive warehouse and table 56 20 Table description 57 21 Verifying the data 57 22 Job execution 58 23 Query executed and data loaded to HDFS 59 24 Result moved to home directory 60 25 Stored output 60 26 Output generated by the query 61 27 Report generated from the query 61 28 Hive prompt 62 29 Query using hive -e 63
  • 8. HOTEL INSPECTION DATASET ANALYSIS 8 30 Query from script 64 31 Script home directory 64 32 Query written in script 65 33 Violation codes 71 34 Violations made 71 35 Inspections made area wise 72 36 Violation counts from each restaurant 72 37 Types of cuisines inspected 73 38 More inspections in cuisines 73 39 Critical and non-critical issues 74
  • 9. HOTEL INSPECTION DATASET ANALYSIS 9 1. INTRODUCTION 1.1 MOTIVATION: Hotels are generally complex and costly to maintain: food quality must be controlled, and spaces such as guest rooms, restaurants, health clubs, swimming pools and retail stores follow different schedules and uses, each with its own engineering system to be kept in order. Maintenance therefore has to be carried out throughout the year by competent staff who undertake building services, operation and maintenance, supplemented by outsourced contractors. In the hospitality industry the maintenance of these engineering systems is important despite its complexity, because its effectiveness directly affects the quality of hotel service, food and beverage, which in turn shape guests' impression of the hotel. This project uses data from inspections carried out on hotels in various parts of the USA. The data records the violations made by hotel managements and their violation codes, and also the action taken by the government against each hotel according to those codes. We analyze where hotels violate the codes, so that new hotels can avoid these problems and survive in the market. 1.2 EXISTING SYSTEM: These days, the most important thing for any organization, company or business firm is to survive in the market and compete with its competitors. To do so, the firm needs to analyze its position in the market.
  • 10. HOTEL INSPECTION DATASET ANALYSIS 10 Analyzing the market requires the data the firm has generated over many years. The data from past years has multiplied rapidly, creating serious problems both in storing it and in analyzing what is stored. In recent years storage technologies have improved tremendously, while analysis techniques have lagged behind. We face problems analyzing the data stored in traditional RDBMSs (MySQL, DB2, ...), and at the same time the data size is exceeding our storage capabilities. 1.3 PROBLEM DEFINITION: The following are the problems we face with the existing systems. 1.3.1 Storing: Over the past couple of years the data has increased drastically in size, creating many problems in storing it. 1.3.2 Processing: Since the data is very large, we cannot analyze the dataset within a fixed period of time, and so cannot obtain results efficiently. 1.4 PROPOSED SYSTEM: In our proposed system we use new technologies for analyzing the datasets. The framework we use is Hadoop.
  • 11. HOTEL INSPECTION DATASET ANALYSIS 11 This framework is capable of storing a tremendous amount of data and can process the dataset in less time and more efficiently than other technologies. 1.5 FEATURES OF PROJECT: 1.5.1 Storing the Dataset: We extract the dataset from an external source into our Hadoop cluster using the Sqoop ecosystem. 1.5.2 Processing the Dataset: Once the dataset is extracted, the data is analyzed using MapReduce and other ecosystem tools that work well with the dataset.
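A minimal HiveQL sketch of the storing step outlined above, corresponding to modules 5.2.4-5.2.6 (moving the data to HDFS, creating the table, and importing it into the Hive warehouse). The table and column names here are illustrative assumptions; the real schema is the one identified in module 5.2.2:

```sql
-- Create a Hive table for the filtered inspection dataset
-- (hypothetical columns; adjust to the actual headers in the dataset).
CREATE TABLE IF NOT EXISTS hotel_inspections (
  hotel_name      STRING,
  area            STRING,
  cuisine         STRING,
  inspection_date STRING,
  violation_code  STRING,
  critical_flag   STRING,
  action_taken    STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Import the file already copied into HDFS into the Hive warehouse
-- (the HDFS path is an example).
LOAD DATA INPATH '/user/hadoop/hotel_inspections.csv'
INTO TABLE hotel_inspections;
```

For a managed table like this, `LOAD DATA INPATH` moves the HDFS file into the Hive warehouse directory rather than copying it.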
  • 12. HOTEL INSPECTION DATASET ANALYSIS 12 2. LITERATURE SURVEY 2.1 BIG DATA: Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information. Although big data doesn't refer to any specific quantity, the term is often used when speaking about petabytes and exabytes of data. Big data describes a volume of data so massive that it exceeds current processing capacity and is difficult to process. Big data can be characterized by the 3Vs: the extreme volume of data, the wide variety of types of data, and the velocity at which the data must be processed. Big Data 3V's
• 13. HOTEL INSPECTION DATASET ANALYSIS 13 An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data consisting of billions to trillions of records, e.g. web, sales, customer contact center, social media, and mobile data. Data Measurements
• 14. HOTEL INSPECTION DATASET ANALYSIS 14 2.2 APACHE HADOOP: Hadoop-logo Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing. Doug Cutting, Cloudera’s Chief Architect, helped create Apache Hadoop out of necessity as data from the web exploded and grew far beyond the ability of traditional systems to handle it. Hadoop was initially inspired by papers published by Google outlining its approach to handling an avalanche of data, and has since become the de facto standard for storing, processing and analyzing hundreds of terabytes, and even petabytes, of data. Why is Hadoop important? Since its inception, Hadoop has become one of the most talked about technologies. Why? One of the top reasons (and why it was invented) is its ability to handle huge amounts of data – any kind of data – quickly. With volumes and varieties of data growing each day, especially from social media and automated sensors, that’s a key consideration for most organizations. Other reasons include: Low cost: The open-source framework is free and uses commodity hardware to store large quantities of data. Computing power: Its distributed computing model can quickly process very large volumes of data. The more computing nodes you use, the more processing power you have.
  • 15. HOTEL INSPECTION DATASET ANALYSIS 15 Scalability: You can easily grow your system simply by adding more nodes. Little administration is required. Storage flexibility: Unlike traditional relational databases, you don’t have to preprocess data before storing it. And that includes unstructured data like text, images and videos. You can store as much data as you want and decide how to use it later. Inherent data protection and self-healing capabilities: Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. And it automatically stores multiple copies of all data. Components of Hadoop HDFS: (Hadoop Distributed File System) HDFS is a fault tolerant and self-healing distributed file system designed to turn a cluster of industry standard servers into a massively scalable pool of storage. Developed specifically for large-scale data processing workloads where scalability, flexibility and throughput are critical, HDFS accepts data in any format regardless of schema, optimizes for high bandwidth streaming, and scales to proven deployments of 100PB and beyond.
  • 16. HOTEL INSPECTION DATASET ANALYSIS 16 HDFS Data Distribution Data in HDFS is replicated across multiple nodes for compute performance and data protection. MapReduce: MapReduce is a massively scalable, parallel processing framework that works in tandem with HDFS. With MapReduce and Hadoop, compute is executed at the location of the data, rather than moving data to the compute location; data storage and computation coexist on the same physical nodes in the cluster. MapReduce processes exceedingly large amounts of data without being affected by traditional bottlenecks like network bandwidth by taking advantage of this data proximity. MapReduce Compute Distribution MapReduce divides workloads up into multiple tasks that can be executed in parallel.
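The block-and-replica behaviour described above can be sketched with a toy model. The 64 MB Hadoop 1.x default block size, 3× replication factor, and node names are assumptions of this illustration, and the round-robin placement is a simplification of HDFS's real rack-aware policy:

```python
# Toy illustration of how HDFS splits a file into fixed-size blocks
# and places replicas on different nodes. Round-robin placement is a
# simplification; real HDFS uses rack-aware replica placement.

BLOCK_SIZE = 64 * 1024 * 1024   # Hadoop 1.x default block size
REPLICATION = 3                 # default replication factor

def place_blocks(file_size, nodes):
    """Return {block_index: [nodes holding a replica of that block]}."""
    n_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division
    placement = {}
    for b in range(n_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)]
                        for r in range(min(REPLICATION, len(nodes)))]
    return placement

# A 200 MB file needs 4 blocks of 64 MB; each lands on 3 distinct nodes.
placement = place_blocks(200 * 1024 * 1024,
                         ["node1", "node2", "node3", "node4"])
```

Losing any single node therefore never loses a block, which is the "inherent data protection" property described above.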
  • 17. HOTEL INSPECTION DATASET ANALYSIS 17 The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types. The key and value classes have to be Serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. Input and Output types of a MapReduce job: (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output) 2.2.1 Vendors: Hadoop vendors share the Hadoop architecture from Apache Hadoop. EMC: Pivotal HD, the Apache Hadoop distribution from EMC, natively integrates EMC’s massively parallel processing (MPP) database technology (formerly known as Greenplum, and now known as HAWQ) with Apache Hadoop. The result is a high-performance Hadoop distribution with true SQL processing for Hadoop. SQL-based queries and other business intelligence tools can be used to analyze data that is stored in HDFS. Hortonworks: Another major player in the Hadoop market, Hortonworks has the largest number of committers and code contributors for the Hadoop ecosystem components. (Committers are the gatekeepers of Apache projects and have the power to approve code changes.) Hortonworks is a spin-off from Yahoo!, which was the original corporate driver of the Hadoop project because it needed a large-scale platform to support its search engine business. Of
  • 18. HOTEL INSPECTION DATASET ANALYSIS 18 all the Hadoop distribution vendors, Hortonworks is the most committed to the open source movement, based on the sheer volume of the development work it contributes to the community, and because all its development efforts are (eventually) folded into the open source codebase. The Hortonworks business model is based on its ability to leverage its popular HDP distribution and provide paid services and support. However, it does not sell proprietary software. Rather, the company enthusiastically supports the idea of working within the open source community to develop solutions that address enterprise feature requirements (for example, faster query processing with Hive). Hortonworks has forged a number of relationships with established companies in the data management industry: Teradata, Microsoft, Informatica, and SAS, for example. Though these companies don’t have their own, in-house Hadoop offerings, they collaborate with Hortonworks to provide integrated Hadoop solutions with their own product sets. The Hortonworks Hadoop offering is the Hortonworks Data Platform (HDP), which includes Hadoop as well as related tooling and projects. Also unlike Cloudera, Hortonworks releases only HDP versions with production-level code from the open source community. IBM: Big Blue offers a range of Hadoop offerings, with the focus around value added on top of the open source Hadoop stack.
  • 19. HOTEL INSPECTION DATASET ANALYSIS 19 Intel: The Intel Distribution for Apache Hadoop (Intel Distribution) provides distributed processing and data management for enterprise applications that analyze big data. MapR: For a complete distribution for Apache Hadoop and related projects that’s independent of the Apache Software Foundation, look no further than MapR. Boasting no Java dependencies or reliance on the Linux file system, MapR is being promoted as the only Hadoop distribution that provides full data protection, no single points of failure, and significant ease-of-use advantages. Three MapR editions are available: M3, M5, and M7. The M3 Edition is free and available for unlimited production use; MapR M5 is an intermediate-level subscription software offering; and MapR M7 is a complete distribution for Apache Hadoop and HBase that includes Pig, Hive, Sqoop, and much more. Cloudera: Perhaps the best-known player in the field, Cloudera is able to claim Doug Cutting, Hadoop’s co-founder, as its chief architect. Cloudera is seen by many people as the market leader in the Hadoop space because it released the first commercial Hadoop distribution and it is a highly active contributor of code to the Hadoop ecosystem.
  • 20. HOTEL INSPECTION DATASET ANALYSIS 20 Performance and scalability
  • 21. HOTEL INSPECTION DATASET ANALYSIS 21 2.2.2 Cloudera: Perhaps the best-known player in the field, Cloudera is able to claim Doug Cutting, Hadoop’s co-founder, as its chief architect. Cloudera is seen by many people as the market leader in the Hadoop space because it released the first commercial Hadoop distribution and it is a highly active contributor of code to the Hadoop ecosystem. Cloudera Enterprise, a product positioned by Cloudera at the center of what it calls the “Enterprise Data Hub,” includes the Cloudera Distribution for Hadoop (CDH), an open-source- based distribution of Hadoop and its related projects as well as its proprietary Cloudera Manager. Also included is a technical support subscription for the core components of CDH. Cloudera’s primary business model has long been based on its ability to leverage its popular CDH distribution and provide paid services and support. In the fall of 2013, Cloudera formally announced that it is focusing on adding proprietary value-added components on top of open source Hadoop to act as a differentiator. Also, Cloudera has made it a common practice to accelerate the adoption of alpha- and beta-level open source code for the newer Hadoop releases. Its approach is to take components it deems to be mature and retrofit them into the existing production-ready open source libraries that are included in its distribution. 2.2.3 Hadoop Ecosystems: The Hadoop platform consists of two key services: a reliable, distributed file system called Hadoop Distributed File System (HDFS) and the high-performance parallel data processing engine called Hadoop MapReduce, described in MapReduce below.
• 22. HOTEL INSPECTION DATASET ANALYSIS 22 The combination of HDFS and MapReduce provides a software framework for processing vast amounts of data in parallel on large clusters of commodity hardware (potentially scaling to thousands of nodes) in a reliable, fault-tolerant manner. Hadoop is a generic processing framework designed to execute queries and other batch read operations against massive datasets that can scale from tens of terabytes to petabytes in size. The popularity of Hadoop has grown in the last few years because it meets the needs of many organizations for flexible data analysis capabilities with an unmatched price-performance curve. The flexible data analysis features apply to data in a variety of formats, from unstructured data, such as raw text, to semi-structured data, such as logs, to structured data with a fixed schema. Hadoop has been particularly useful in environments where massive server farms are used to collect data from a variety of sources. Hadoop is able to process parallel queries as big, background batch jobs on the same server farm. This saves the user from having to acquire additional hardware for a traditional database system to process the data (assuming such a system could scale to the required size). Hadoop also reduces the effort and time required to load data into another system; you can process it directly within Hadoop. This overhead becomes impractical with very large data sets. Many of the ideas behind the open source Hadoop project originated from the Internet search community, most notably Google and Yahoo!. Search engines employ massive farms of inexpensive servers that crawl the Internet, retrieving Web pages into local clusters where they are analyzed with massive, parallel queries to build search indices and other useful data structures.
• 23. HOTEL INSPECTION DATASET ANALYSIS 23 The Hadoop ecosystem includes other tools to address particular needs. Hive is a SQL dialect and Pig is a dataflow language; both hide the tedium of creating MapReduce jobs behind higher-level abstractions more appropriate for user goals. ZooKeeper is used for federating services and Oozie is a scheduling system. Avro, Thrift and Protobuf are platform-portable data serialization and description formats. MapReduce: MapReduce is now the most widely used, general-purpose computing model and runtime system for distributed data analytics. It provides a flexible and scalable foundation for analytics, from traditional reporting to leading-edge machine learning algorithms. In the MapReduce model, a compute “job” is decomposed into smaller “tasks” (which correspond to separate Java Virtual Machine (JVM) processes in the Hadoop implementation). The tasks are distributed around the cluster to parallelize and balance the load as much as possible. The MapReduce runtime infrastructure coordinates the tasks, re-running any that fail or appear to hang. Users of MapReduce don’t need to implement parallelism or reliability features themselves. Instead, they focus on the data problem they are trying to solve. Pig: Pig is a platform for constructing data flows for extract, transform, and load (ETL) processing and analysis of large datasets. Pig Latin, the programming language for Pig, provides common data manipulation operations, such as grouping, joining, and filtering. Pig generates Hadoop MapReduce jobs to perform the data flows. This high-level language for ad hoc analysis allows developers to inspect data stored in HDFS without needing to learn the complexities of the MapReduce framework, thus simplifying access to the data. The Pig Latin scripting language is not only a higher-level data flow language but also has operators similar to SQL (e.g., FILTER and JOIN) that are translated into a series of map and
• 24. HOTEL INSPECTION DATASET ANALYSIS 24 reduce functions. Pig Latin, in essence, is designed to fill the gap between the declarative style of SQL and the low-level procedural style of MapReduce. Hive: Hive is a SQL-based data warehouse system for Hadoop that facilitates data summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems (e.g., HDFS, MapR-FS, and S3) and some NoSQL databases. Hive is not a relational database, but a query engine that supports the parts of SQL specific to querying data, with some additional support for writing new tables or files, but not for updating individual records. That is, Hive jobs are optimized for scalability, i.e., computing over all rows, not for latency, i.e., when you just want a few rows returned and you want the results returned quickly. Hive’s SQL dialect is called HiveQL. Table schemas can be defined that reflect the data in the underlying files or data stores, and SQL queries can be written against that data. Queries are translated to MapReduce jobs to exploit the scalability of MapReduce. Hive also supports custom extensions written in Java, including user-defined functions (UDFs) and serializer-deserializers for reading and optionally writing custom formats, e.g., JSON and XML dialects. Hence, analysts have tremendous flexibility in working with data from many sources and in many different formats, with minimal need for complex ETL processes to transform data into more restrictive formats. Contrast with Shark and Impala.
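Since Hive itself is installed only later (section 5.2.3), the flavour of HiveQL can be previewed with any SQL engine. The following sqlite3 sketch runs the kind of GROUP BY aggregation the project later issues against the hotel table; the table and column names follow the schema defined in section 5.2.2, and the sample rows are invented for illustration:

```python
# HiveQL shares SQL's SELECT / WHERE / GROUP BY semantics; this
# sqlite3 sketch previews the style of query later run in Hive.
# Sample rows are invented; column names follow section 5.2.2.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hotel "
             "(dba TEXT, boro TEXT, violation_code TEXT, critical_flag TEXT)")
conn.executemany(
    "INSERT INTO hotel VALUES (?, ?, ?, ?)",
    [("CAFE A", "BRONX", "04L", "Critical"),
     ("CAFE A", "BRONX", "10F", "Not Critical"),
     ("DINER B", "QUEENS", "04L", "Critical")])

# Count critical violations per violation code -- a typical inspection query.
rows = conn.execute(
    """SELECT violation_code, COUNT(*) AS n
       FROM hotel
       WHERE critical_flag = 'Critical'
       GROUP BY violation_code
       ORDER BY n DESC""").fetchall()
# rows -> [('04L', 2)]
```

In Hive the same statement would be compiled into one or more MapReduce jobs rather than executed in-process.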
  • 25. HOTEL INSPECTION DATASET ANALYSIS 25 Apache Hadoop Ecosystem 2.3 LINUX UBUNTU: Ubuntu-logo Ubuntu is an ancient African word meaning 'humanity to others'. It also means 'I am what I am because of who we all are'. The Ubuntu operating system brings the spirit of Ubuntu to the world of computers. Linux was already established as an enterprise server platform in 2004, but free software was not a part of everyday life for most computer users. That's why Mark Shuttleworth gathered a small team of developers from one of the most established Linux projects – Debian – and set out to create an easy-to-use Linux desktop: Ubuntu.
  • 26. HOTEL INSPECTION DATASET ANALYSIS 26 The vision for Ubuntu is part social and part economic: free software, available to everybody on the same terms, and funded through a portfolio of services provided by Canonical. The first official Ubuntu release -- Version 4.10, codenamed the 'Warty Warthog' — was launched in October 2004, and sparked dramatic global interest as thousands of free software enthusiasts and experts joined the Ubuntu community. The governance of Ubuntu is somewhat independent of Canonical, with volunteer leaders from around the world taking responsibility for many critical elements of the project. It remains a key tenet of the Ubuntu Project that Ubuntu is a shared work between Canonical, other companies, and the thousands of volunteers who bring their expertise to bear on making it a world-class platform for anyone to use. Ubuntu today has eight flavours and dozens of localised and specialised derivatives. There are also special editions for servers, OpenStack clouds, and mobile devices. All editions share common infrastructure and software, making Ubuntu a unique single platform that scales from consumer electronics to the desktop and up into the cloud for enterprise computing. The Ubuntu OS and the innovative Ubuntu for Android convergence solution make it an exciting time for Ubuntu on mobile devices. In the cloud, Ubuntu is the reference operating system for the OpenStack project, it’s a hugely popular guest OS on Amazon's EC2 and Rackspace's Cloud, and it’s pre-installed on computers from Dell, HP, Asus, Lenovo and other global vendors. And thanks to that shared infrastructure, developers can work on the desktop, and smoothly deliver code to cloud servers running the stripped-down Ubuntu Server Edition.
• 27. HOTEL INSPECTION DATASET ANALYSIS 27 After many years Ubuntu still is and always will be free to use, share and develop. We hope it will bring a touch of light to your computing — and we hope that you'll join us in helping to build the next version. 2.4 MySQL: Mysql-logo MySQL is the world's most popular open source database software, with over 100 million copies of its software downloaded or distributed throughout its history. With its superior speed, reliability, and ease of use, MySQL has become the preferred choice for Web, Web 2.0, SaaS, ISV, Telecom companies and forward-thinking corporate IT Managers because it eliminates the major problems associated with downtime, maintenance and administration for modern, online applications. Many of the world's largest and fastest-growing organizations use MySQL to save time and money powering their high-volume Web sites, critical business systems, and packaged software — including industry leaders such as Yahoo!, Alcatel-Lucent, Google, Nokia, YouTube, Wikipedia, and Booking.com. The flagship MySQL offering is MySQL Enterprise, a comprehensive set of production-tested software, proactive monitoring tools, and premium support services available in an affordable annual subscription.
• 28. HOTEL INSPECTION DATASET ANALYSIS 28 MySQL is a key part of LAMP (Linux, Apache, MySQL, PHP / Perl / Python), the fast-growing open source enterprise software stack. More and more companies are using LAMP as an alternative to expensive proprietary software stacks because of its lower cost and freedom from platform lock-in.
• 29. HOTEL INSPECTION DATASET ANALYSIS 29 3. SYSTEM REQUIREMENTS The purpose of this SRS document is to identify the requirements and functionalities of the Hotel Inspection Dataset Analysis system. The SRS will define how our team and the client conceive the final product and the characteristics or functionality it must have. This document also notes the optional requirements which we plan to implement but which are not mandatory for the functioning of the project. This phase appraises the requirements for the Hotel Inspection dataset; several processes are involved in evaluating the requirements systematically. The first step in analyzing the requirements of the system is recognizing the nature of the system, so that the investigation is reliable, and all the cases are formulated to better understand the analysis of the dataset. Document Conventions: The font-size conventions remain the same as for other documents in the project. The section headings have the largest font size of 14, subheadings have a font size of 12 (bold), and body text is font size 12. The priorities of the requirements are specified with the requirement statements. Intended Audience and Reading Suggestions: This document is intended for project developers, managers, users, testers and documentation writers. It aims at discussing design and implementation constraints, dependencies, system features, external interface requirements and other non-functional requirements.
• 30. HOTEL INSPECTION DATASET ANALYSIS 30 3.1 IDENTIFICATION OF NEEDS: The foremost necessity for a business firm or an organization is to know how it is performing in the market and, in parallel, how to overcome its competitors. To do so we need to analyze our data against all the available factors. The system requirements for the project are: 3.2 ENVIRONMENTAL REQUIREMENTS: 3.2.1 Software Requirements: Development & Usage: Linux Operating System. Apache Hadoop. Mozilla Firefox (or any browser). Microsoft Excel or OpenOffice. 3.2.2 Hardware Requirements: Development & Usage: Pentium 4 processor. 40 GB hard disk. 256 MB RAM / 4 GB RAM. System with all standard accessories like monitor, keyboard, mouse, etc.
• 31. HOTEL INSPECTION DATASET ANALYSIS 31 4. BUSINESS LOGIC Logic Features: 1. Store: The main intention of the Hotel Inspection Dataset is to analyze the data based on the violations recorded for all inspected restaurants and hotels. To do this, we first load the data into the Hadoop HDFS component. 2. Analysis: This is the other major step for the dataset; this module depends on the type of dataset we have. Our Hotel Inspection dataset is structured data, so we work with the Hadoop ecosystem tool Hive. 4.1 SYSTEM ANALYSIS: 4.1.1 FUNCTIONAL REQUIREMENTS: 4.1.1.1 Technical Feasibility: Evaluating technical feasibility is the trickiest part of a feasibility study. This is because, at this point in time, not many detailed designs of the system exist, making it difficult to assess issues like performance, costs (on account of the kind of technology to be deployed), etc. A number of issues have to be considered while doing a technical analysis. Understand the different technologies involved in the proposed system. Before commencing the project, we have to be very clear about which technologies are required for the development of the new system.
• 32. HOTEL INSPECTION DATASET ANALYSIS 32 Find out whether the organization currently possesses the required technologies. Is the required technology available to the organization? If so, is the capacity sufficient? For instance: “Will the current printer be able to handle the new reports and forms required for the new system?” 4.1.1.2 Operational Feasibility: Proposed projects are beneficial only if they can be turned into information systems that will meet the organization's operating requirements. Simply stated, this test of feasibility asks whether the system will work when it is developed and installed, and whether there are major barriers to implementation. Here are questions that help test the operational feasibility of a project. • Is there sufficient support for the project from management and from users? If the current system is well liked and used to the extent that people cannot see reasons for change, there may be resistance. • Are the current business methods acceptable to the users? If they are not, users may welcome a change that will bring about a more operational and useful system. • Have the users been involved in the planning and development of the project? Early involvement reduces the chances of resistance to the system in general and increases the likelihood of a successful project.
• 33. HOTEL INSPECTION DATASET ANALYSIS 33 Since the proposed system was to help reduce the hardships encountered in the existing manual system, the new system was considered operationally feasible. 4.2 SYSTEM DESIGN: 4.2.1 Business Flow: 4.2.1.1 Apache Hadoop Working Model-I: Apache Hadoop Working Model-I
1. Create a Secure Shell (SSH) connection from localhost to the Linux (Ubuntu) kernel: ssh localhost
2. Start all daemons (NameNode, Secondary NameNode, DataNode, JobTracker, TaskTracker): start-all.sh
3. Check whether all daemons are up: jps
4. Create a directory and move the dataset to HDFS from the Linux terminal.
5. Check the data format from the browser view.
6. Based on the nature of the data, choose the ecosystem tool to work with.
7. Based on the ecosystem tool, design the platform and execute the jobs.
8. Once the jobs are executed, generate the reports based on the dataset.
9. Analyze the reports for the improvement of the firm.
• 34. HOTEL INSPECTION DATASET ANALYSIS 34 4.2.1.2 Apache Hadoop Working Model-II: Apache Hadoop Working Model-II
1. Install a virtual machine (VMware).
2. Open the virtual machine image already created by Cloudera.
3. Start CentOS from the virtual machine and work with the terminal.
4. Create a directory and move the dataset to HDFS from the Linux terminal.
5. Check the data format from the browser view.
6. Based on the nature of the data, choose the ecosystem tool to work with.
7. Based on the ecosystem tool, design the platform and execute the jobs.
8. Once the jobs are executed, generate the reports based on the dataset.
9. Analyze the reports for the improvement of the firm.
• 35. HOTEL INSPECTION DATASET ANALYSIS 35 4.2.2 Business Logic: Functional Programming: Multithreading is one of the popular ways of doing parallel programming, but the major complexity of multi-threaded programming is coordinating each thread's access to shared data. We need constructs like semaphores and locks, and must use them with great care, otherwise deadlocks will result. User-defined Map/Reduce functions: Map/reduce is a special form of such a DAG which is applicable to a wide range of use cases. It is organized as a “map” function which transforms a piece of data into some number of key/value pairs. Each of these elements is then sorted by its key and sent to the same node, where a “reduce” function is used to merge the values (of the same key) into a single result. Mapper:
map(input_record) {
  ...
  emit(k1, v1)
  ...
  emit(k2, v2)
  ...
}
• 36. HOTEL INSPECTION DATASET ANALYSIS 36 Reducer:
reduce(key, values) {
  aggregate = initialize()
  while (values.has_next) {
    aggregate = merge(values.next)
  }
  collect(key, aggregate)
}
MapReduce logic
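The mapper/reducer pseudocode above can be made concrete as a single-process Python word count. This is a sketch of the programming model only; real Hadoop jobs implement Mapper and Reducer classes (in Java) that the framework distributes across the cluster:

```python
# Single-process sketch of the map -> shuffle/sort -> reduce flow
# from the pseudocode above, using word count as the classic example.
from collections import defaultdict

def mapper(record):
    # emit(word, 1) for every word in the input record
    for word in record.split():
        yield word, 1

def reducer(key, values):
    # merge all values for one key into a single aggregate
    return key, sum(values)

def run_job(records):
    # "shuffle" phase: group every emitted value by its key
    groups = defaultdict(list)
    for record in records:
        for k, v in mapper(record):
            groups[k].append(v)
    # reduce phase: one reducer call per distinct key, in sorted key order
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))

counts = run_job(["hadoop stores data", "hadoop processes data"])
# counts -> {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

In a real cluster, the shuffle step is what moves each key's values to a single node, exactly as the text describes.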
  • 37. HOTEL INSPECTION DATASET ANALYSIS 37 Job execution phase in Hadoop
• 38. HOTEL INSPECTION DATASET ANALYSIS 38 5. PROJECT MODULES 5.1 MODULES INTRODUCTION: The dataset holds Hotel Inspection data from recent years. We have taken the dataset from the reference website https://data.ny.gov/. The dataset is very large, around three lakh (300,000) lines. We took only a part of it, as our basic systems cannot support such a huge dataset; working with the full data needs a well-configured cluster. For this project we have taken a dataset of around twenty-five thousand lines. We analyzed the raw dataset, eliminated the unnecessary fields, and gave the dataset a well-organized format. The dataset is divided into two tables based on the data and their fields. The first table holds the inspection data, with parameters such as id, name of restaurant, area, address, location, inspection date, violated code, critical point of violation, and type of inspection. The second table holds the violation code and the violation description. 5.2 MODULES: 5.2.1 Analyzing the Data and Filtering the Data: In the first step of the project we need to analyze the data and check how it has been formatted. We should be aware of the fields given to us and know the importance of each and every field; if there is unnecessary information cluttering our dataset, we need to talk to our client before changing the dataset or removing or moving any columns.
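The column-filtering step described above can be sketched with Python's csv module. The column list kept here and the sample row are illustrative assumptions, not the project's exact choices:

```python
# Sketch of the filtering step: keep only the columns the client
# needs and drop the rest. The KEEP list and sample row below are
# illustrative; the project's actual column choices may differ.
import csv
import io

KEEP = ["CAMIS", "DBA", "BORO", "VIOLATION CODE", "CRITICAL FLAG"]

def filter_columns(raw_csv, keep=KEEP):
    """Return a CSV string containing only the columns in `keep`."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=keep)
    writer.writeheader()
    for row in reader:
        writer.writerow({k: row[k] for k in keep})
    return out.getvalue()

raw = ("CAMIS,DBA,BORO,PHONE,VIOLATION CODE,CRITICAL FLAG\n"
       "30075445,CAFE A,BRONX,7183331500,04L,Critical\n")
filtered = filter_columns(raw)
# The PHONE column (and its value) is gone from `filtered`.
```

The same idea scales to the full file by streaming rows from disk instead of a string.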
• 39. HOTEL INSPECTION DATASET ANALYSIS 39 Raw dataset with unnecessary data fields
• 40. HOTEL INSPECTION DATASET ANALYSIS 40 The unnecessary fields have been removed from the raw dataset, and the dataset has been divided into two separate tables. Table 1 (violation) – dataset with the violation code and its explanation. Table 2 (hotel) – dataset with the violation code and the remaining fields from the filtered dataset. The final dataset will now be referred to as the filtered dataset. final-dataset 5.2.2 Identifying the Headers (Schema): The schema is generated based on the dataset and the data we are holding. This schema is for the table hotel.
• 41. HOTEL INSPECTION DATASET ANALYSIS 41 Schema for Hotel:
Name of Header - Description - Header name in schema
ID - id (primary key) - id
CAMIS - Refers to the store IDs - camis
DBA - Refers to the restaurant - dba
BORO - Place - boro
BUILDING - Building number - building
STREET - Street address - street
ZIPCODE - Area zip code - zipcode
PHONE - Store phone - phone
CUISINE DESCRIPTION - Type of cuisine - cuisine_description
INSPECTION DATE - Inspected on date - inspection_date
ACTION - Type of action - action
VIOLATION CODE - Violation codes - violation_code
CRITICAL FLAG - Seriousness of violations - critical_flag
SCORE - Rating - score
GRADE - Grade - grade
GRADE DATE - Grade date - grade_date
RECORD DATE - Record date - record_date
INSPECTION TYPE - Type of inspection - inspection_type
• 42. HOTEL INSPECTION DATASET ANALYSIS 42 This schema is for the table violation: the violation code and the violation description. Schema for table Violation:
Name of Header - Description - Header name in schema
ID - id (primary key) - id
VIOLATION CODE - Refers to the violation code - violation_code
VIOLATION DESCRIPTION - Refers to the description of the code - v_desc
Table for Hotel:
Name of Header - Description - Header name in schema
ID - id (primary key) - id
CAMIS - Refers to the store IDs - camis
DBA - Refers to the restaurant - dba
BORO - Place - boro
BUILDING - Building number - building
STREET - Street address - street
ZIPCODE - Area zip code - zipcode
PHONE - Store phone - phone
CUISINE DESCRIPTION - Type of cuisine - cuisine_description
INSPECTION DATE - Inspected on date - inspection_date
ACTION - Type of action - action
VIOLATION CODE - Violation codes - violation_code
• 43. HOTEL INSPECTION DATASET ANALYSIS 43 Name of Header - Description - Header name in schema
CRITICAL FLAG - Seriousness of violations - critical_flag
SCORE - Rating - score
GRADE - Grade - grade
GRADE DATE - Grade date - grade_date
RECORD DATE - Record date - record_date
INSPECTION TYPE - Type of inspection - inspection_type
Table for Violation:
Name of Header - Description - Header name in schema
ID - id (primary key) - id
VIOLATION CODE - Refers to the violation code - violation_code
VIOLATION DESCRIPTION - Refers to the description of the code - v_desc
Violation Table Schema diagram: Violation Table Schema diagram
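The mapping from raw headers to schema names shown in the tables above follows one simple rule: lower-case the header and replace spaces with underscores. A small helper makes that explicit:

```python
# The schema names in the tables above are the raw headers
# lower-cased with spaces replaced by underscores; this helper
# applies that rule.
def to_schema_name(header):
    return header.strip().lower().replace(" ", "_")

raw_headers = ["CAMIS", "DBA", "CUISINE DESCRIPTION", "INSPECTION DATE",
               "VIOLATION CODE", "CRITICAL FLAG"]
schema = [to_schema_name(h) for h in raw_headers]
# schema -> ['camis', 'dba', 'cuisine_description', 'inspection_date',
#            'violation_code', 'critical_flag']
```

(The one exception is VIOLATION DESCRIPTION, which the project shortens to v_desc by hand.)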
• 44. HOTEL INSPECTION DATASET ANALYSIS 44 Hotel Table Schema diagram: Hotel Table Schema diagram 5.2.3 Installing a Single-Node Hadoop Cluster: Java Development Kit 1.7: Download the Java Development Kit 1.7 from the official Oracle website. Once JDK 1.7 is downloaded, extract the archive and create a directory named java under /usr/lib. The path of the directory is “/usr/lib/java”
• 45. HOTEL INSPECTION DATASET ANALYSIS 45 Once the java folder is created with sudo (administrator) permissions, move the downloaded JDK into /usr/lib/java/, so JDK 1.7 lives at /usr/lib/java/jdk1.7.0_67. The Java path is now “/usr/lib/java/jdk1.7.0_67”. Jdk installation path Once this part is done, we need to register the Java executables with the system; to do so, run the commands below.
sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/java/jdk1.7.0_67/bin/java" 1
sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/java/jdk1.7.0_67/bin/javac" 1
sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/java/jdk1.7.0_67/bin/javaws" 1
  • 46. To verify that the Java installation is complete, run "java -version". java path. Now edit the .bashrc file in Linux (Ubuntu): run "sudo gedit ~/.bashrc" and add the following lines to the file:
    export JAVA_HOME="/usr/lib/java/jdk1.7.0_67"
    export PATH="$PATH:$JAVA_HOME/bin"
    alias jps="/usr/lib/java/jdk1.7.0_67/bin/jps"
  • 47. Install Hadoop 1.2.1: Download the Hadoop 1.2.1 release from the Apache Hadoop website. Create a directory named hadoop in /usr/lib, then extract the downloaded archive and move it to /usr/lib/hadoop with sudo permissions. Hadoop location. Configure the Hadoop location in the .bashrc file (sudo gedit ~/.bashrc) by adding:
    export HADOOP_HOME="/usr/lib/hadoop/hadoop-1.2.1"
    export PATH="$PATH:$HADOOP_HOME/bin"
  • 48. Hadoop installation cross-check. Install Hive: Download the Hive 0.12.0 release from the Apache Hive website. Create a directory named hive in /usr/lib and move the extracted files to /usr/lib/hive. Open the .bashrc file (sudo gedit ~/.bashrc) and add:
    # Hive home directory configuration
    export HIVE_HOME="/usr/lib/hive/hive-0.12.0"
    export PATH="$PATH:$HIVE_HOME/bin"
  • 49. Hive installation cross-check. We need to configure four important files in the Hadoop environment. Open the configuration directory at /usr/lib/hadoop/hadoop-1.2.1/conf and edit the files hdfs-site.xml, mapred-site.xml, core-site.xml, and hadoop-env.sh, adding the following lines to each file respectively:
  • 50. hadoop-env.sh file; core-site.xml
  • 51. mapred-site.xml; hdfs-site.xml
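The configuration files above appear only as screenshots in the deck. As a sketch, a minimal single-node Hadoop 1.x configuration along those lines typically contains the entries below; the specific values (ports 9000/9001, replication 1) are assumptions taken from the standard single-node setup guide, not from the screenshots. Here they are written to a temporary directory for illustration; in the real setup they belong in /usr/lib/hadoop/hadoop-1.2.1/conf.

```shell
# Minimal single-node Hadoop 1.x configuration sketch (assumed standard values).
CONF_DIR=$(mktemp -d)

# core-site.xml: where the NameNode listens
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

# hdfs-site.xml: single node, so one replica is enough
cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

# mapred-site.xml: where the JobTracker listens
cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
EOF

# hadoop-env.sh: point Hadoop at the JDK installed earlier
echo 'export JAVA_HOME="/usr/lib/java/jdk1.7.0_67"' > "$CONF_DIR/hadoop-env.sh"

echo "wrote $(ls "$CONF_DIR" | wc -l) config files"
```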
  • 52. Hadoop installation in Cloudera. 5.2.4 Moving the data to HDFS: With the data schema ready and the Hadoop installation done, the next task is to move the data from the local file system to the single-node cluster, i.e., to HDFS, the Hadoop component where data is stored as files. The command we use is:
    hadoop fs -mkdir hotel
    This command creates a directory for our project in HDFS. Here we create a directory hotel to store our two datasets: 1) the hotel dataset and 2) the violation-code dataset. Creating directory
  • 53. hadoop fs -ls
    This command lists all the directories in HDFS; we use it to cross-check that the hotel directory has been created. Listing the directory.
    hadoop fs -copyFromLocal <src> <dest>
    This command copies a file from the local file system to HDFS. We copy the file hotel.csv to the hotel directory of HDFS:
    hadoop fs -copyFromLocal '/home/username/Desktop/hotel.csv' /user/username/hotel/
    '/home/username/Desktop/hotel.csv' is the location of the file on the local file system.
  • 54. '/user/username/hotel/' is the destination in HDFS; hotel is the HDFS directory. Moving data to HDFS. From the images we can see that the two files, hotel.csv and codes.txt, have been moved to the HDFS directory hotel.
    hadoop fs -ls hotel
    This command lists all the files in the specified HDFS directory hotel; we cross-check that our files are present.
  • 55. Checking the files in HDFS. It is clear that all our files have been moved into the hotel directory in HDFS. 5.2.5 Creating the tables in Hive: We are now ready to create the tables for our dataset. The query for creating the hotel table:
    hive -e "create table 360_hotel (camis string, dba string, boro string, building string, street string, zipcode string, phone string, cuisine_description string, inspection_date string, action string, violation_code string, critical_flag string, score string, grade string, grade_date string, record_date string, inspection_type string) row format delimited fields terminated by ','"
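The deck shows DDL only for the hotel table. A matching statement for the Violation table would follow the column names given in the schema section (violation_code, v_desc); as a sketch, with the table name 360_violation being an assumption modeled on the 360_hotel naming convention:

```shell
# Sketch: DDL for the Violation table; table name 360_violation is assumed.
cat > /tmp/create_violation.hql <<'EOF'
create table 360_violation (
  id string,
  violation_code string,
  v_desc string
)
row format delimited fields terminated by ',';
EOF
# hive -f /tmp/create_violation.hql   # would run it on a working Hive install
cat /tmp/create_violation.hql
```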
  • 56. Table created successfully. To see the table:
    hive -e "show tables"
  • 57. Checking the created table. 5.2.6 Importing data from HDFS to the Hive warehouse: To load the data:
    hive -e "load data inpath '/user/training/hotel/hotel.csv' overwrite into table 360_hotel"
    The data is loaded into the Hive warehouse and table.
  • 58. Hotel table description:
    hive -e "desc 360_hotel"
    Table description. Checking the table:
    hive -e "select * from 360_hotel limit 3"
    Verifying the data
  • 59. 5.2.7 Analyzing the data based on the queries from the client:
    - Most frequently violated codes.
    - How many stores/restaurants have been inspected, by location.
    - Number of violations made by each restaurant.
    - How many areas have been covered in the inspection?
    - Types of cuisines inspected.
    - Cuisines with the most inspections, in descending order.
    - Violation codes by frequency, in ascending order.
    - Restaurants with no violations cited.
    - Critical and non-critical violations.
    - Critical and non-critical violation codes.
    Most frequently violated codes:
    hive -e "SELECT violation_code, COUNT(violation_code) FROM 360_hotel GROUP BY violation_code HAVING (COUNT(violation_code) > 1) LIMIT 5"
    Job execution
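The GROUP BY/COUNT logic of this job can be sanity-checked off-cluster with standard tools. As a sketch on a hypothetical handful of violation codes (assumed sample values, not drawn from the real dataset), the same frequency tally looks like this:

```shell
# Hypothetical violation codes (assumed sample, for illustration only)
printf '10F\n04L\n10F\n08A\n10F\n04L\n' > /tmp/codes.txt
# Equivalent of: SELECT violation_code, COUNT(*) ... GROUP BY violation_code
# sorted by count descending, top 5
sort /tmp/codes.txt | uniq -c | sort -rn | head -5
# prints (with leading spaces): 3 10F / 2 04L / 1 08A
```

The `sort | uniq -c` pair is the shell's version of GROUP BY with COUNT, which makes it a quick cross-check before running the full Hive job.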
  • 60. The query above prints the result to the screen, but we need the result set exported to a spreadsheet to generate reports. To do so we either store the result set in a table or store it in HDFS, then move the result data from HDFS to the local file system, and from there export it to Excel to generate the reports. The result is written to an HDFS directory named output.csv:
    hive -e "insert overwrite directory '/user/training/output.csv' SELECT violation_code, COUNT(violation_code) FROM 360_hotel GROUP BY violation_code HAVING (COUNT(violation_code) > 1)"
    The result set is now stored in HDFS at '/user/training/output.csv'. To copy it from HDFS to the local file system:
    hadoop fs -copyToLocal '/user/training/output.csv' /home/training/Desktop/
    Query executed and the data is loaded to the HDFS directory
  • 61. The result set has been stored in HDFS. Now we move it to the local file system; the result set lands in '/home/training/Desktop/'. Stored output
  • 62. These are the result files in CSV format. We export this dataset to Excel to build the report efficiently. The output generated by the query. 5.2.8 Generating the Reports: This module deals with all the generated reports; we can use any data-reporting tool, or simply Excel. Report generated from the query
  • 63. 6. EXECUTION OF JOBS. 6.1 METHODS OF EXECUTION: Jobs can be executed in Hive in three different ways. 6.1.1 Executing the job from the hive prompt: The job is written directly at the hive prompt. hive prompt
  • 64. 6.1.2 Executing the job from the terminal with Hadoop: The job is executed from the terminal with hive -e; the hive prompt is never opened during job execution. Query using hive -e. 6.1.3 Executing the job as a script: The job is written as a script, which is then placed in the home directory of the Linux environment.
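As a sketch of the script method, the HiveQL job is saved to a file and handed to Hive with the -f flag; the file name report.hql and its location under /tmp are assumptions for illustration.

```shell
# Sketch: write the HiveQL job to a script file (name report.hql is assumed)
cat > /tmp/report.hql <<'EOF'
SELECT violation_code, COUNT(violation_code)
FROM 360_hotel
GROUP BY violation_code;
EOF
# hive -f /tmp/report.hql   # executes the script (requires a running Hive)
wc -l < /tmp/report.hql
# prints 3
```

Keeping jobs in script files makes them easy to version, reuse, and schedule, compared to retyping them at the hive prompt.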
  • 65. query from script; script home directory
  • 66. query written in script. 6.2 EXECUTION OF HIVEQL JOBS: How many stores/restaurants have been inspected, by location:
    hive -e "insert overwrite directory '/user/training/output2-1.csv' select count(dba) from 360_hotel where boro='BRONX'"
    hadoop fs -copyToLocal '/user/training/output2-1.csv' /home/training/Desktop
    hive -e "insert overwrite directory '/user/training/output2-2.csv' select count(dba) from 360_hotel where boro='BROOKLYN'"
    hadoop fs -copyToLocal '/user/training/output2-2.csv' /home/training/Desktop
    hive -e "insert overwrite directory '/user/training/output2-3.csv' select count(dba) from 360_hotel where boro='MANHATTAN'"
    hadoop fs -copyToLocal '/user/training/output2-3.csv' /home/training/Desktop
  • 67. hive -e "insert overwrite directory '/user/training/output2-4.csv' select count(dba) from 360_hotel where boro='QUEENS'"
    hadoop fs -copyToLocal '/user/training/output2-4.csv' /home/training/Desktop
    hive -e "insert overwrite directory '/user/training/output2-5.csv' select count(dba) from 360_hotel where boro='STATEN ISLAND'"
    hadoop fs -copyToLocal '/user/training/output2-5.csv' /home/training/Desktop
    Number of violations made by each restaurant:
    hive -e "insert overwrite directory '/user/training/output2.csv' select distinct(dba) from 360_hotel"
    hive -e "insert overwrite directory '/user/training/output3.csv' select count(violation_code) from 360_hotel where dba = 'MORRIS PARK BAKE SHOP'"
    hive -e "insert overwrite directory '/user/training/output3-1.csv' select count(violation_code) from 360_hotel where dba = 'WENDY'"
    hive -e "insert overwrite directory '/user/training/output3-2.csv' select count(violation_code) from 360_hotel where dba = 'DJ REYNOLDS PUB AND RESTAURANT'"
    hive -e "insert overwrite directory '/user/training/output3-3.csv' select count(violation_code) from 360_hotel where dba = 'RIVIERA CATERER'"
    hive -e "insert overwrite directory '/user/training/output3-4.csv' select count(violation_code) from 360_hotel where dba = 'TOV KOSHER KITCHEN'"
  • 68. hive -e "insert overwrite directory '/user/training/output3-5.csv' select count(violation_code) from 360_hotel where dba = 'BRUNOS ON THE BOULEVARD'"
    hive -e "insert overwrite directory '/user/training/output3-6.csv' select count(violation_code) from 360_hotel where dba = 'KOSHER ISLAND'"
    hive -e "insert overwrite directory '/user/training/output3-7.csv' select count(violation_code) from 360_hotel where dba = 'WILKEN\'S FINE FOOD'"
    hive -e "insert overwrite directory '/user/training/output3-8.csv' select count(violation_code) from 360_hotel where dba = 'REGINA CATERERS'"
    hive -e "insert overwrite directory '/user/training/output3-9.csv' select count(violation_code) from 360_hotel where dba = 'MAY MAY KITCHEN'"
    hive -e "insert overwrite directory '/user/training/output3-10.csv' select count(violation_code) from 360_hotel where dba = 'NATHAN\'S FAMOUS'"
    hive -e "insert overwrite directory '/user/training/output3-11.csv' select count(violation_code) from 360_hotel where dba = 'SEUDA FOODS'"
    hive -e "insert overwrite directory '/user/training/output3-12.csv' select count(violation_code) from 360_hotel where dba = 'CARVEL ICE CREAM'"
    hive -e "insert overwrite directory '/user/training/output3-13.csv' select count(violation_code) from 360_hotel where dba = 'GLORIOUS FOOD'"
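Running one query per restaurant works, but a single GROUP BY query (for example, select dba, count(violation_code) from 360_hotel group by dba) produces all the counts in one pass. The same aggregation can be sketched locally with awk on a hypothetical two-column sample; the dba/code pairs below are assumed values for illustration, not real inspection records:

```shell
# Hypothetical dba,violation_code pairs (assumed sample data)
cat > /tmp/pairs.csv <<'EOF'
MORRIS PARK BAKE SHOP,10F
WENDY,04L
MORRIS PARK BAKE SHOP,08A
EOF
# Equivalent of: select dba, count(violation_code) from 360_hotel group by dba
awk -F',' '{count[$1]++} END {for (d in count) print d ": " count[d]}' /tmp/pairs.csv | sort
# prints: MORRIS PARK BAKE SHOP: 2, then WENDY: 1
```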
  • 69. How many areas have been covered in the inspection:
    hive -e "select distinct(boro) from 360_hotel"
    Types of cuisines inspected:
    hive -e "insert overwrite directory '/user/training/output.csv' select distinct(cuisine_description) from 360_hotel"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='African'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='American'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Armenian'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Bagels/Pretzels'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Bakery'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Café/Coffee/Tea'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Caribbean'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Chicken'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Chinese'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Continental'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Donuts'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='German'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Greek'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Hamburgers'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Hotdogs'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Indian'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Japanese'"
  • 70. Critical and non-critical violation codes:
    hive -e "insert overwrite directory '/user/training/critical.csv' select violation_code from 360_hotel where critical_flag = 'Critical'"
    hive -e "insert overwrite directory '/user/training/not-critical.csv' select violation_code from 360_hotel where critical_flag = 'Not Critical'"
  • 71. 7. TESTING. 7.1 INTRODUCTION: Software testing is a critical element of software quality assurance and represents the ultimate review of specification, design, and coding. The increasing visibility of software as a system element, and the attendant costs associated with software failure, are motivating forces for well-planned, thorough testing. Testing is the process of executing a program with the intent of finding errors. The design of tests for software and other engineered products can be as challenging as the initial design of the product itself. 7.2 SAMPLE UNIT TESTING: Unit testing is done when the data is loaded into HDFS. Once the data is loaded, we cross-check it by viewing it in the browser. Take a sample of the data from the browser, say one chunk of the file: copy the data to a text file, load the sample data into HDFS, write the jobs against the sample data, execute them, and store the results. If a job executes successfully on the sample data, execute it on the main dataset with the same parameters.
  • 72. 8. SCREENS. Violation codes; violations made
  • 73. Inspections made, area-wise; violation counts from each restaurant
  • 74. Types of cuisines inspected; cuisines with more inspections
  • 75. Critical and non-critical issues
  • 76. 9. CONCLUSIONS. Hadoop is a trending technology in the market. Hadoop solves the big-data problem effectively and efficiently, and, more importantly, it can analyze any kind of data. Analysis based on Hadoop requires very little time, which reduces production time and directly benefits the economics of the organization. Analyzing the dataset with Apache Hadoop overcomes the issues caused by traditional RDBMSs and the master-slave architecture of servers. In this project we analyze the Hotel Inspection dataset using Hadoop; the analysis covers the total number of hotels, their violations, and the violation descriptions.
  • 77. 10. REFERENCES
    Hadoop: https://hadoop.apache.org/
    Java: http://www.oracle.com/technetwork/java/javase/downloads/
    Hive: https://hive.apache.org/
    Linux: http://www.ubuntu.com/