HOTEL INSPECTION DATASET ANALYSIS
A mini project on BIG DATA-HADOOP
Project Title
HOTEL INSPECTION
DATASET ANALYSIS
Presented by,
SHARON MOSES
RAGINI AKULA
CONTENTS

Abstract
List of Figures
List of Screens

1. INTRODUCTION
   1.1 Motivation
   1.2 Existing System
   1.3 Problem Definition
      1.3.1 Storing
      1.3.2 Processing
   1.4 Proposed System
   1.5 Features of Project
      1.5.1 Storing the Dataset
      1.5.2 Processing the Dataset
2. LITERATURE SURVEY
   2.1 Big Data
   2.2 Apache Hadoop
      2.2.1 Vendors
      2.2.2 Cloudera
      2.2.3 Hadoop Ecosystems
   2.3 Linux Ubuntu
   2.4 MySQL
3. SYSTEM REQUIREMENTS
   3.1 Identification of Needs
   3.2 Environmental Requirements
      3.2.1 Software Requirements
      3.2.2 Hardware Requirements
4. BUSINESS LOGIC
   4.1 System Analysis
      4.1.1 Functional Requirements
         4.1.1.1 Technical Feasibility
         4.1.1.2 Operational Feasibility
   4.2 System Design
      4.2.1 Business Flow
         4.2.1.1 Apache Hadoop Working Model-I
         4.2.1.2 Apache Hadoop Working Model-II
      4.2.2 Business Logic
5. PROJECT MODULES
   5.1 Modules Introduction
   5.2 Modules
      5.2.1 Analysing and filtering the data
      5.2.2 Identifying the headers (schema)
      5.2.3 Installing a single-node Hadoop cluster
      5.2.4 Moving the data to HDFS
      5.2.5 Creating the tables in Hive
      5.2.6 Importing data from HDFS to the Hive warehouse
      5.2.7 Analysing the data based on the queries from the client
      5.2.8 Generating the reports
6. EXECUTION OF JOBS
   6.1 Methods of Execution
      6.1.1 Executing the job from the Hive prompt
      6.1.2 Executing the job from the terminal with Hadoop
      6.1.3 Executing the job as a script
   6.2 Execution of HiveQL jobs
7. TESTING
   7.1 Introduction
   7.2 Sample unit testing
8. SCREENS
9. CONCLUSIONS
10. REFERENCES
ABSTRACT
Hotels are generally complex and costly to maintain. The quality of food and the use of spaces with different schedules, such as guest rooms, restaurants, the health club, the swimming pool, and the retail store, each depend on a functional engineering system. Maintenance therefore has to be carried out throughout the year, requiring competent staff to undertake building services, operation, and maintenance, supplemented by outsourced contractors.

In the hospitality industry, the maintenance of engineering systems is important despite its complex processes, as its effectiveness directly affects the quality of hotel service, food, and beverage, which in turn have a direct and significant effect on guests' impression of the hotel.

This project works with data from inspections done on hotels in various parts of the USA. The data records the violations made by hotel managements and their violation codes, and also the action taken by the government against each hotel according to those codes.

We analyze the data to find out where hotels are violating the codes, so that new hotels can avoid these problems and survive in the market.
LIST OF FIGURES

1. Big Data 3 V's
2. Data Measurements
3. Hadoop logo
4. Components of Hadoop
5. HDFS data distribution
6. MapReduce compute distribution
7. Performance and scalability
8. Apache Hadoop ecosystem
9. Ubuntu logo
10. MySQL logo
11. Apache Hadoop Working Model-I
12. Apache Hadoop Working Model-II
13. MapReduce logic
14. Job execution phases in Hadoop
15. Violation table schema diagram
16. Hotel table schema diagram
LIST OF SCREENS

1. Raw dataset
2. Unnecessary data fields
3. Final dataset
4. JDK installation path
5. Java path
6. Hadoop location
7. Hadoop installation crosscheck
8. Hive installation crosscheck
9. hadoop-env.sh file
10. core-site.xml
11. mapred-site.xml
12. hdfs-site.xml
13. Creating a directory
14. Listing the directories
15. Moving data to HDFS
16. Checking the files in HDFS
17. Table created successfully
18. Checking created tables
19. Data loaded to the Hive warehouse and table
20. Table description
21. Verifying the data
22. Job execution
23. Query executed and data loaded to HDFS
24. Result moved to the home directory
25. Stored output
26. Output generated by the query
27. Report generated from the query
28. Hive prompt
29. Query using hive -e
30. Query from script
31. Script home directory
32. Query written in script
33. Violation codes
34. Violations made
35. Inspections made area-wise
36. Violation counts per restaurant
37. Types of cuisines inspected
38. Most-inspected cuisines
39. Critical and non-critical issues
1. INTRODUCTION
1.1 MOTIVATION:
Hotels are generally complex and costly to maintain. The quality of food and the use of spaces with different schedules, such as guest rooms, restaurants, the health club, the swimming pool, and the retail store, each depend on a functional engineering system. Maintenance therefore has to be carried out throughout the year, requiring competent staff to undertake building services, operation, and maintenance, supplemented by outsourced contractors.

In the hospitality industry, the maintenance of engineering systems is important despite its complex processes, as its effectiveness directly affects the quality of hotel service, food, and beverage, which in turn have a direct and significant effect on guests' impression of the hotel. This project works with data from inspections done on hotels in various parts of the USA. The data records the violations made by hotel managements and their violation codes, and also the action taken by the government against each hotel according to those codes.

We analyze the data to find out where hotels are violating the codes, so that new hotels can avoid these problems and survive in the market.
1.2 EXISTING SYSTEM:
These days, the most important thing for any organization, company, or business firm is to survive in the market and compete with its competitors. To do so, the firm needs to analyze its position in the market.
Analyzing the market requires the data the firm has generated over many years. That data has multiplied rapidly, creating serious problems both in storing it and in analyzing what has been stored.

Storage technologies have improved tremendously in recent years, but analysis techniques have not kept pace. We face problems analyzing the data stored in traditional RDBMSs (MySQL, DB2, ...), and at the same time the data size is exceeding our storage capabilities.
1.3 PROBLEM DEFINITION:
The following are the problems which we are facing with the existing systems.
1.3.1 Storing:
Over the past couple of years, the data has drastically increased in size, creating many problems in storing it.
1.3.2 Processing:
Since the data is very large, we are unable to analyze the dataset within a fixed period of time, and so cannot obtain results efficiently.
1.4 PROPOSED SYSTEM:
In our proposed system we use new technologies for analyzing the datasets. The framework we use is Hadoop.
Hadoop is a framework capable of storing an enormous amount of data and of processing a dataset in less time and more efficiently than other technologies.
1.5 FEATURES OF PROJECT:
1.5.1 Storing the Dataset:
We extract the dataset from an external source into our Hadoop cluster using the Sqoop ecosystem tool.
1.5.2 Processing the Dataset:
Once the dataset is extracted, the data is analyzed using MapReduce and whichever other ecosystem tools work well with the dataset.
2. LITERATURE SURVEY
2.1 BIGDATA:
Big data is an evolving term that describes any voluminous amount of structured, semi-structured, and unstructured data that has the potential to be mined for information. Although big data doesn't refer to any specific quantity, the term is often used when speaking about petabytes and exabytes of data.

Big data describes a volume of data so massive that it is difficult to process; it exceeds current processing capacity.

Big data can be characterized by the 3 Vs: the extreme volume of data, the wide variety of types of data, and the velocity at which the data must be processed.
Big Data 3V’s
An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data consisting of billions to trillions of records, e.g. web, sales, customer contact center, social media, and mobile data.
Data Measurements
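As a quick sanity check on these units, the binary (1,024-based) convention used above can be worked out in a few lines (a sketch; the example record size is an assumption, not from the dataset):

```python
# Data size units under the binary (1,024-based) convention from the text.
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB
PB = 1024 * TB  # 1 petabyte = 1,024 terabytes
EB = 1024 * PB  # 1 exabyte  = 1,024 petabytes

# A billion records of 1 KB each is still under a terabyte; "big data"
# in the petabyte/exabyte sense is several orders of magnitude beyond this.
records = 1_000_000_000
print(records * KB / TB)  # ≈ 0.93 TB
```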
2.2 APACHE HADOOP:
Hadoop-logo
Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and fast processing.

Doug Cutting, Cloudera's chief architect, helped create Apache Hadoop out of necessity as data from the web exploded and grew far beyond the ability of traditional systems to handle it. Hadoop was initially inspired by papers published by Google outlining its approach to handling an avalanche of data, and has since become the de facto standard for storing, processing, and analyzing hundreds of terabytes, and even petabytes, of data.
Why is Hadoop important?
Since its inception, Hadoop has become one of the most talked about technologies. Why?
One of the top reasons (and why it was invented) is its ability to handle huge amounts of data –
any kind of data – quickly. With volumes and varieties of data growing each day, especially from
social media and automated sensors, that’s a key consideration for most organizations. Other
reasons include:
Low cost: The open-source framework is free and uses commodity hardware to store large
quantities of data.
Computing power: Its distributed computing model can quickly process very large volumes
of data. The more computing nodes you use, the more processing power you have.
Scalability: You can easily grow your system simply by adding more nodes. Little
administration is required.
Storage flexibility: Unlike traditional relational databases, you don’t have to preprocess data
before storing it. And that includes unstructured data like text, images and videos. You can store
as much data as you want and decide how to use it later.
Inherent data protection and self-healing capabilities: Data and application processing are
protected against hardware failure. If a node goes down, jobs are automatically redirected to
other nodes to make sure the distributed computing does not fail. And it automatically stores
multiple copies of all data.
Components of Hadoop
HDFS: (Hadoop Distributed File System)
HDFS is a fault tolerant and self-healing distributed file system designed to turn a cluster
of industry standard servers into a massively scalable pool of storage. Developed specifically for
large-scale data processing workloads where scalability, flexibility and throughput are critical,
HDFS accepts data in any format regardless of schema, optimizes for high bandwidth streaming,
and scales to proven deployments of 100PB and beyond.
HDFS Data Distribution
Data in HDFS is replicated across multiple nodes for compute performance and data protection.
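The replication idea can be illustrated with a toy placement routine (a simplified sketch only; real HDFS placement is rack-aware and considerably more sophisticated, and the node names here are invented):

```python
import itertools

def place_blocks(num_blocks, nodes, replication=3):
    """Toy block placement: spread each block's replicas across distinct
    nodes round-robin, so losing one node never loses a block."""
    ring = itertools.cycle(range(len(nodes)))
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[next(ring)] for _ in range(replication)]
    return placement

# Four hypothetical data nodes, six blocks, default replication factor of 3.
layout = place_blocks(6, ["node1", "node2", "node3", "node4"])
print(layout[0])  # ['node1', 'node2', 'node3']
```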
MapReduce:
MapReduce is a massively scalable, parallel processing framework that works in tandem
with HDFS. With MapReduce and Hadoop, compute is executed at the location of the data,
rather than moving data to the compute location; data storage and computation coexist on the
same physical nodes in the cluster. MapReduce processes exceedingly large amounts of data
without being affected by traditional bottlenecks like network bandwidth by taking advantage of
this data proximity.
MapReduce Compute Distribution
MapReduce divides workloads up into multiple tasks that can be executed in parallel.
The MapReduce framework operates exclusively on <key, value> pairs, that is, the
framework views the input to the job as a set of <key, value> pairs and produces a set of <key,
value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be Serializable by the framework and hence need to
implement the Writable interface. Additionally, the key classes have to implement the
WritableComparable interface to facilitate sorting by the framework.
Input and Output types of a MapReduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
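That pipeline can be mimicked in a few lines of ordinary Python (a conceptual sketch only; real Hadoop jobs implement Mapper and Reducer classes in Java, and the shuffle/sort happens across the cluster):

```python
from collections import defaultdict

def map_phase(records):
    # (k1, v1) -> (k2, v2): emit (word, 1) for every word in every line.
    for _, line in records:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # Shuffle/sort: group all v2 values by k2, then reduce to (k3, v3).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

records = [(1, "big data big cluster"), (2, "data node")]
print(reduce_phase(map_phase(records)))
# {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```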
2.2.1 Vendors:
Hadoop vendors base their distributions on the architecture of Apache Hadoop.
EMC:
Pivotal HD, the Apache Hadoop distribution from EMC, natively integrates EMC’s
massively parallel processing (MPP) database technology (formerly known as Greenplum, and
now known as HAWQ) with Apache Hadoop. The result is a high-performance Hadoop
distribution with true SQL processing for Hadoop. SQL-based queries and other business
intelligence tools can be used to analyze data that is stored in HDFS.
Hortonworks: Another major player in the Hadoop market, Hortonworks has the largest
number of committers and code contributors for the Hadoop ecosystem components.
(Committers are the gatekeepers of Apache projects and have the power to approve code
changes.)
Hortonworks is a spin-off from Yahoo!, which was the original corporate driver of the
Hadoop project because it needed a large-scale platform to support its search engine business. Of
all the Hadoop distribution vendors, Hortonworks is the most committed to the open source
movement, based on the sheer volume of the development work it contributes to the community,
and because all its development efforts are (eventually) folded into the open source codebase.
The Hortonworks business model is based on its ability to leverage its popular HDP
distribution and provide paid services and support. However, it does not sell proprietary
software. Rather, the company enthusiastically supports the idea of working within the open
source community to develop solutions that address enterprise feature requirements (for
example, faster query processing with Hive).
Hortonworks has forged a number of relationships with established companies in the data
management industry: Teradata, Microsoft, Informatica, and SAS, for example. Though these
companies don’t have their own, in-house Hadoop offerings, they collaborate with Hortonworks
to provide integrated Hadoop solutions with their own product sets.
The Hortonworks Hadoop offering is the Hortonworks Data Platform (HDP), which includes Hadoop as well as related tooling and projects. Unlike Cloudera, Hortonworks releases only HDP versions with production-level code from the open source community.
IBM:
Big Blue offers a range of Hadoop offerings, with the focus around value added on top of
the open source Hadoop stack.
Intel:
The Intel Distribution for Apache Hadoop (Intel Distribution) provides distributed
processing and data management for enterprise applications that analyze big data.
MapR:
For a complete distribution for Apache Hadoop and related projects that’s independent of
the Apache Software Foundation, look no further than MapR. Boasting no Java dependencies or
reliance on the Linux file system, MapR is being promoted as the only Hadoop distribution that
provides full data protection, no single points of failure, and significant ease-of-use advantages.
Three MapR editions are available: M3, M5, and M7. The M3 Edition is free and
available for unlimited production use; MapR M5 is an intermediate-level subscription software
offering; and MapR M7 is a complete distribution for Apache Hadoop and HBase that includes
Pig, Hive, Sqoop, and much more.
Performance and scalability
2.2.2 Cloudera:
Perhaps the best-known player in the field, Cloudera is able to claim Doug Cutting,
Hadoop’s co-founder, as its chief architect. Cloudera is seen by many people as the market
leader in the Hadoop space because it released the first commercial Hadoop distribution and it is
a highly active contributor of code to the Hadoop ecosystem.
Cloudera Enterprise, a product positioned by Cloudera at the center of what it calls the
“Enterprise Data Hub,” includes the Cloudera Distribution for Hadoop (CDH), an open-source-
based distribution of Hadoop and its related projects as well as its proprietary Cloudera Manager.
Also included is a technical support subscription for the core components of CDH.
Cloudera’s primary business model has long been based on its ability to leverage its
popular CDH distribution and provide paid services and support. In the fall of 2013, Cloudera
formally announced that it is focusing on adding proprietary value-added components on top of
open source Hadoop to act as a differentiator.
Also, Cloudera has made it a common practice to accelerate the adoption of alpha- and
beta-level open source code for the newer Hadoop releases. Its approach is to take components it
deems to be mature and retrofit them into the existing production-ready open source libraries that
are included in its distribution.
2.2.3 Hadoop Ecosystems:
The Hadoop platform consists of two key services: a reliable, distributed file system
called Hadoop Distributed File System (HDFS) and the high-performance parallel data
processing engine called Hadoop MapReduce, described in MapReduce below.
The combination of HDFS and MapReduce provides a software framework for
processing vast amounts of data in parallel on large clusters of commodity hardware (potentially
scaling to thousands of nodes) in a reliable, fault-tolerant manner. Hadoop is a generic
processing framework designed to execute queries and other batch read operations against
massive datasets that can scale from tens of terabytes to petabytes in size.
The popularity of Hadoop has grown in the last few years, because it meets the needs of
many organizations for flexible data analysis capabilities with an unmatched price-performance
curve. The flexible data analysis features apply to data in a variety of formats, from unstructured
data, such as raw text, to semi-structured data, such as logs, to structured data with a fixed
schema.
Hadoop has been particularly useful in environments where massive server farms are
used to collect data from a variety of sources. Hadoop is able to process parallel queries as big,
background batch jobs on the same server farm. This saves the user from having to acquire
additional hardware for a traditional database system to process the data (assuming such a system
can scale to the required size). Hadoop also reduces the effort and time required to load data into
another system; you can process it directly within Hadoop. This overhead becomes impractical in
very large data sets.
Many of the ideas behind the open source Hadoop project originated from the Internet
search community, most notably Google and Yahoo!. Search engines employ massive farms of
inexpensive servers that crawl the Internet, retrieving Web pages into local clusters where they
are analyzed with massive, parallel queries to build search indices and other useful data
structures.
The Hadoop ecosystem includes other tools to address particular needs. Hive is a SQL dialect and Pig is a dataflow language; both hide the tedium of creating MapReduce jobs behind higher-level abstractions better suited to user goals. ZooKeeper is used for federating services and Oozie is a scheduling system. Avro, Thrift, and Protobuf are platform-portable data serialization and description formats.
MapReduce:
MapReduce is now the most widely-used, general-purpose computing model and runtime
system for distributed data analytics. It provides a flexible and scalable foundation for analytics,
from traditional reporting to leading-edge machine learning algorithms. In the MapReduce
model, a compute “job” is decomposed into smaller “tasks” (which correspond to separate Java
Virtual Machine (JVM) processes in the Hadoop implementation). The tasks are distributed
around the cluster to parallelize and balance the load as much as possible. The MapReduce
runtime infrastructure coordinates the tasks, re-running any that fail or appear to hang. Users of
MapReduce don’t need to implement parallelism or reliability features themselves. Instead, they
focus on the data problem they are trying to solve.
Pig:
Pig is a platform for constructing data flows for extract, transform, and load (ETL)
processing and analysis of large datasets. Pig Latin, the programming language for Pig, provides
common data manipulation operations, such as grouping, joining, and filtering. Pig generates
Hadoop MapReduce jobs to perform the data flows. This high-level language for ad hoc analysis
allows developers to inspect HDFS stored data without the need to learn the complexities of the
MapReduce framework, thus simplifying the access to the data.
The Pig Latin scripting language is not only a higher-level data flow language but also
has operators similar to SQL (e.g., FILTER and JOIN) that are translated into a series of map and
reduce functions. Pig Latin, in essence, is designed to fill the gap between the declarative style of
SQL and the low-level procedural style of MapReduce.
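The kind of dataflow Pig Latin expresses (load, filter, group, count) looks roughly like this in plain Python (a hypothetical illustration only; the field names and rows are invented, not taken from the project's dataset):

```python
from collections import Counter

# Hypothetical inspection rows: (restaurant, city, critical_flag).
rows = [
    ("Alpha Grill", "NYC", "Critical"),
    ("Beta Diner",  "NYC", "Not Critical"),
    ("Alpha Grill", "NYC", "Critical"),
    ("Gamma Cafe",  "LA",  "Critical"),
]

# Mirrors Pig's FILTER: keep only critical violations.
critical = [r for r in rows if r[2] == "Critical"]

# Mirrors Pig's GROUP BY restaurant + COUNT per group.
counts = Counter(name for name, _, _ in critical)
print(counts.most_common())  # [('Alpha Grill', 2), ('Gamma Cafe', 1)]
```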
Hive :
Hive is a SQL-based data warehouse system for Hadoop that facilitates data
summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible
file systems (e.g., HDFS, MapR-FS, and S3) and some NoSQL databases. Hive is not a relational
database, but a query engine that supports the parts of SQL specific to querying data, with some
additional support for writing new tables or files, but not updating individual records. That is,
Hive jobs are optimized for scalability, i.e., computing over all rows, but not latency, i.e., when
you just want a few rows returned and you want the results returned quickly. Hive’s SQL dialect
is called HiveQL. Table schema can be defined that reflect the data in the underlying files or data
stores and SQL queries can be written against that data. Queries are translated to MapReduce
jobs to exploit the scalability of MapReduce. Hive also supports custom extensions written in
Java, including user-defined functions (UDFs) and serializer-deserializers for reading and
optionally writing custom formats, e.g., JSON and XML dialects. Hence, analysts have
tremendous flexibility in working with data from many sources and in many different formats,
with minimal need for complex ETL processes to transform data into more restrictive formats.
Contrast with Shark and Impala.
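Because HiveQL is close to standard SQL, an aggregate query of the kind this project runs can be prototyped against SQLite (a stand-in sketch only; the table and column names here are hypothetical, and Hive would translate the same query into MapReduce jobs over HDFS rather than run it locally):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inspections (restaurant TEXT, violation_code TEXT)")
conn.executemany(
    "INSERT INTO inspections VALUES (?, ?)",
    [("Alpha Grill", "04L"), ("Alpha Grill", "06C"), ("Beta Diner", "04L")],
)

# HiveQL-style aggregation: count violations per restaurant.
query = """
    SELECT restaurant, COUNT(*) AS violations
    FROM inspections
    GROUP BY restaurant
    ORDER BY violations DESC
"""
print(conn.execute(query).fetchall())
# [('Alpha Grill', 2), ('Beta Diner', 1)]
```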
Apache Hadoop Ecosystem
2.3 LINUX UBUNTU:
Ubuntu-logo
Ubuntu is an ancient African word meaning 'humanity to others'. It also means 'I am what
I am because of who we all are'. The Ubuntu operating system brings the spirit of Ubuntu to the
world of computers.
Linux was already established as an enterprise server platform in 2004, but free software
was not a part of everyday life for most computer users. That's why Mark Shuttleworth gathered
a small team of developers from one of the most established Linux projects – Debian – and set
out to create an easy-to-use Linux desktop: Ubuntu.
The vision for Ubuntu is part social and part economic: free software, available to
everybody on the same terms, and funded through a portfolio of services provided by Canonical.
The first official Ubuntu release, Version 4.10, codenamed the 'Warty Warthog', was launched
in October 2004, and sparked dramatic global interest as thousands of free software
enthusiasts and experts joined the Ubuntu community.
The governance of Ubuntu is somewhat independent of Canonical, with volunteer leaders
from around the world taking responsibility for many critical elements of the project. It remains a
key tenet of the Ubuntu Project that Ubuntu is a shared work between Canonical, other
companies, and the thousands of volunteers who bring their expertise to bear on making it a
world-class platform for anyone to use.
Ubuntu today has eight flavours and dozens of localised and specialised derivatives.
There are also special editions for servers, OpenStack clouds, and mobile devices. All editions
share common infrastructure and software, making Ubuntu a unique single platform that scales
from consumer electronics to the desktop and up into the cloud for enterprise computing.
The Ubuntu OS and the innovative Ubuntu for Android convergence solution make it an
exciting time for Ubuntu on mobile devices. In the cloud, Ubuntu is the reference operating
system for the OpenStack project, it’s a hugely popular guest OS on Amazon's EC2 and
Rackspace's Cloud, and it’s pre-installed on computers from Dell, HP, Asus, Lenovo and other
global vendors. And thanks to that shared infrastructure, developers can work on the desktop,
and smoothly deliver code to cloud servers running the stripped-down Ubuntu Server Edition.
After many years Ubuntu still is and always will be free to use, share and develop. We
hope it will bring a touch of light to your computing — and we hope that you'll join us in helping
to build the next version.
2.4 MySQL:
Mysql-logo
MySQL is the world's most popular open source database software, with over 100 million
copies of its software downloaded or distributed throughout its history. With its superior speed,
reliability, and ease of use, MySQL has become the preferred choice for Web, Web 2.0, SaaS,
ISV, Telecom companies and forward-thinking corporate IT Managers because it eliminates the
major problems associated with downtime, maintenance and administration for modern, online
applications.
Many of the world's largest and fastest-growing organizations use MySQL to save time
and money powering their high-volume Web sites, critical business systems, and packaged
software — including industry leaders such as Yahoo!, Alcatel-Lucent, Google, Nokia,
YouTube, Wikipedia, and Booking.com.
The flagship MySQL offering is MySQL Enterprise, a comprehensive set of production-
tested software, proactive monitoring tools, and premium support services available in an
affordable annual subscription.
MySQL is a key part of LAMP (Linux, Apache, MySQL, PHP / Perl / Python), the fast-
growing open source enterprise software stack. More and more companies are using LAMP as an
alternative to expensive proprietary software stacks because of its lower cost and freedom from
platform lock-in.
3. SYSTEM REQUIREMENTS
The purpose of this SRS document is to identify the requirements and functionalities for the Hotel Inspection Dataset Analysis project. The SRS defines how our team and the client conceive the final product and the characteristics and functionality it must have. This document also notes the optional requirements which we plan to implement but which are not mandatory for the functioning of the project.
This phase appraises the requirements for the Hotel Inspection dataset; several processes are involved in evaluating the requirements systematically. The first step in analyzing the requirements of the system is recognizing the nature of the system for a reliable investigation, and all the cases are formulated to better understand the analysis of the dataset.
Document Conventions:
The font conventions remain the same as for other documents in the project. Section headings use the largest font size, 14; subheadings use font size 12 (bold); and the body text is in font size 12. The priorities of the requirements are specified with the requirement statements.
Intended Audience and Reading Suggestions:
This document is intended for project developers, managers, users, testers and
documentation writers. This document aims at discussing design and implementation constraints,
dependencies, system features, external interface requirements and other non functional
requirements.
3.1 IDENTIFICATION OF NEEDS:
The foremost and most important necessity for a business firm or an organization is to know how it is performing in the market and, in parallel, how to overcome its competitors.

To do so, we need to analyze our data based on all the available factors. The system requirements for the project to be accomplished are:
3.2 ENVIRONMENTAL REQUIREMENTS:
3.2.1 Software Requirements:
Development & Usage:
Linux Operating System.
Apache Hadoop.
Mozilla Firefox: (or any browser).
Microsoft Excel or Open office.
3.2.2 Hardware Requirements:
Development & Usage:
Pentium 4 processor.
40GB Hard disc.
256 MB RAM (4 GB recommended).
System with all standard accessories like monitor, keyboard, mouse, etc.,
4. BUSINESS LOGIC
Logic Features:
1. Store:
The main intention of the Hotel Inspection Dataset project is to analyze the data based on the violations made by all inspected restaurants and hotels. To handle this, we first load the data into our Hadoop HDFS component.
2. Analysis:
This is the other major step; how it is done depends on the type of dataset we have. Since our Hotel Inspection Dataset is structured data, we work with the Hadoop ecosystem tool Hive.
4.1 SYSTEM ANALYSIS:
4.1.1 FUNCTIONAL REQUIREMENTS:
4.1.1.1 Technical Feasibility:
Evaluating technical feasibility is the trickiest part of a feasibility study. This is because, at this point, few detailed designs of the system exist, making it difficult to assess issues like performance and costs (on account of the kind of technology to be deployed).
A number of issues have to be considered while doing a technical analysis. Understand
the different technologies involved in the proposed system.
Before commencing the project, we have to be very clear about what are the technologies
that are to be required for the development of the new system.
Find out whether the organization currently possesses the required technologies. Is the
required technology available with the organization?
If so, is the capacity sufficient?
For instance –“Will the current printer be able to handle the new reports and forms required for
the new system?”
4.1.1.2 Operational Feasibility
Proposed projects are beneficial only if they can be turned into information systems that will meet the organization's operating requirements. Simply stated, this test of feasibility asks whether the system will work when it is developed and installed. Are there major barriers to implementation? Here are questions that will help test the operational feasibility of a project.
• Is there sufficient support for the project from management and from users? If the current system is well liked and used to the extent that people will not see reasons for change, there may be resistance.

• Are the current business methods acceptable to the users? If they are not, users may welcome a change that brings about a more operational and useful system.

• Have the users been involved in the planning and development of the project? Early involvement reduces the chances of resistance to the system in general and increases the likelihood of a successful project.
Since the proposed system was to help reduce the hardships encountered in the existing
manual system, the new system was considered operationally feasible.
4.2 SYSTEM DESIGN:
4.2.1 Business Flow:
4.2.1.1 Apache Hadoop Working Model-I:
Apache Hadoop Working Model-I
1. Create a Secure Shell (SSH) connection from the local host to the Linux (Ubuntu) kernel – ssh localhost
2. Start all daemons (NameNode, Secondary NameNode, DataNode, JobTracker, TaskTracker) – start-all.sh
3. Check whether all daemons are up – jps
4. Create a directory and move the dataset to HDFS using the Linux terminal.
5. Check the data format from the browser view; from the data point of view, choose the ecosystem component to work with.
6. Based on the ecosystem component, design the platform and execute the jobs.
7. Once the jobs are executed, generate the reports based on the dataset.
8. Analyze the reports for the improvement of the firm.
4.2.1.2 Apache Hadoop Working Model-II:
Apache Hadoop Working Model-II
1. Install a virtual machine (VMware).
2. Open the virtual machine image already created by Cloudera.
3. Start CentOS from the virtual machine and work with the terminal.
4. Create a directory and move the dataset to HDFS using the Linux terminal.
5. Check the data format from the browser view; from the data point of view, choose the ecosystem component to work with.
6. Based on the ecosystem component, design the platform and execute the jobs.
7. Once the jobs are executed, generate the reports based on the dataset.
8. Analyze the reports for the improvement of the firm.
4.2.2 Business Logic:
Functional Programming:
Multithreading is one of the popular ways of doing parallel programming, but the major
complexity of multi-threaded programming is coordinating the access of each thread to the shared
data. We need constructs like semaphores and locks, and we must use them with great care,
otherwise deadlocks will result.
User defined Map/Reduce functions:
Map/reduce is a special form of such a DAG which is applicable in a wide range of use
cases. It is organized as a "map" function which transforms a piece of data into some number of
key/value pairs. Each of these elements is then sorted by its key and routed to the same
node, where a "reduce" function is used to merge the values of the same key into a single result.
Mapper:
map(input_record) {
...
emit(k1, v1)
...
emit(k2, v2)
...
}
Reducer:
reduce (key, values) {
aggregate = initialize()
while (values.has_next) {
aggregate = merge(values.next)
}
collect(key, aggregate)
}
MapReduce logic
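The mapper/reducer pseudocode above can be sketched as a small, runnable simulation. This word-count example is illustrative (not part of the original project); the in-memory grouping stands in for Hadoop's sort-and-shuffle phase:

```python
from collections import defaultdict

def mapper(record):
    # Emit a (word, 1) pair for every word in the input record.
    for word in record.split():
        yield (word, 1)

def reducer(key, values):
    # Merge all values for the same key into a single result.
    aggregate = 0
    for v in values:
        aggregate += v
    return (key, aggregate)

def map_reduce(records):
    # Shuffle: group every emitted value by its key, as Hadoop's
    # sort phase would before handing each key to a reducer.
    groups = defaultdict(list)
    for record in records:
        for k, v in mapper(record):
            groups[k].append(v)
    return dict(reducer(k, vs) for k, vs in groups.items())

result = map_reduce(["critical violation", "critical flag"])
print(result)  # {'critical': 2, 'violation': 1, 'flag': 1}
```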
Job execution phase in Hadoop
5. PROJECT MODULES
5.1 MODULES INTRODUCTION:
The dataset holds Hotel Inspection data from recent years. We have taken the dataset
from the reference website https://data.ny.gov/. The full dataset is very large, at around three
lakh (300,000) lines. We took only a part of it, as our basic systems cannot support such a huge
dataset; handling it all needs a well-provisioned cluster configuration.
For this project we have taken a dataset of around twenty-five thousand lines.
We analyzed the raw dataset, eliminated the unnecessary fields from the data,
and gave the dataset a well-organized format.
The dataset is divided into two tables based on the data and their fields. The first table
deals with the inspection data and contains fields like id, name of restaurant, area,
address, location, inspection date, violation code, critical point of violation, and type of inspection. The
second table deals with the violation code and the violation description.
5.2 MODULES:
5.2.1 Analyzing the Data and filtering the Data:
In the first step of the project we need to analyze the data and check how it
has been formatted. We should be aware of the fields given to us and know the
importance of each and every field. If we think there is some unnecessary information
disturbing our dataset, we need to talk to our client before taking any step to change
the dataset or to remove or move any columns.
Raw-dataset
Unnecessary data fields
The unnecessary fields have been removed from the raw dataset, and the dataset has been divided
into two separate tables.
Table 1 (violation) – dataset with the violation code and its explanation.
Table 2 (hotel) – dataset with the violation code and the remaining fields from the filtered dataset.
The final dataset will now be referred to as the filtered dataset.
final-dataset
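This filtering step can be sketched in plain Python. The keep-list here is a subset of the schema in module 5.2.2, and the sample row is invented for illustration; in the project the same idea is applied to the full CSV files:

```python
import csv
import io

# Subset of columns kept for the filtered hotel table (see the schema
# in module 5.2.2); illustrative only.
KEEP = ["CAMIS", "DBA", "BORO", "VIOLATION CODE", "CRITICAL FLAG"]

def filter_columns(reader, keep):
    # Drop every field that is not in the keep-list.
    return [{k: row[k] for k in keep if k in row} for row in reader]

# Tiny inline sample standing in for the raw CSV file.
raw = io.StringIO(
    "CAMIS,DBA,BORO,PHONE,VIOLATION CODE,CRITICAL FLAG\n"
    "30075445,MORRIS PARK BAKE SHOP,BRONX,7188924968,10F,Not Critical\n")
rows = filter_columns(csv.DictReader(raw), KEEP)
print(rows[0]["DBA"])  # MORRIS PARK BAKE SHOP
```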
5.2.2 Identifying the headers (Schema):
The schema is generated based on the dataset and the data we have. This schema is
for the table hotel.
Schema for Hotel:
Name of Header Description headername in schema
ID - id (Primary Key) - id
CAMIS - Refers to the Store ID's - camis
DBA - Refers to the Restaurant - dba
BORO - Place - boro
BUILDING - Building Number - building
STREET - Street Address - street
ZIPCODE - Area zipcode - zipcode
PHONE - Store phone - phone
CUISINE DESCRIPTION - Type of Cuisine - cuisine_description
INSPECTION DATE - Inspected on Date - inspection_date
ACTION - Type of Action - action
VIOLATION CODE - Violation Codes - violation_code
CRITICAL FLAG - Seriousness of Violations - critical_flag
SCORE - Rating - score
GRADE - Grade - grade
GRADE DATE - Grade Date - grade_date
RECORD DATE - Record Date - record_date
INSPECTION TYPE - Type of Inspection - inspection_type
This schema is for the table Violation: the violation code and the violation
description.
Schema for table Violation:
Name of Header Description headername in schema
ID - id (Primary Key) - id
VIOLATION CODE - Refers to Violation code - violation_code
VIOLATION DESCRIPTION - Refers to description of code - v_desc
Violation Table Schema diagram:
Violation Table Schema diagram
Hotel Table Schema diagram:
Hotel Table Schema diagram
5.2.3 Installing Single Node Hadoop Cluster:
Java Development Kit 1.7:
Download the Java Development Kit 1.7 from the official Oracle website.
Once JDK 1.7 is downloaded, extract the file from Downloads and create a directory named
java in the root directory. The path of the directory is "/usr/lib/java".
Once the java folder is created with sudo (administrator) permissions, move the
downloaded JDK to /usr/lib/java/, so JDK 1.7 resides in /usr/lib/java/jdk1.7. The Java
path is now "/usr/lib/java/jdk1.7".
Jdk installation path
Once this part is done, we need to register the Java executables with the system; to do
so, run the scripts below.
sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/java/jdk1.7.0_67/bin/java" 1
sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/java/jdk1.7.0_67/bin/javac" 1
sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/java/jdk1.7.0_67/bin/javaws" 1
To check that the Java installation is complete, run the command "java -version".
java path
Now edit the bashrc file in Linux (Ubuntu), to do so run the command
sudo gedit ~/.bashrc
and add the following lines to the file:
export JAVA_HOME="/usr/lib/java/jdk1.7.0_67"
export PATH="$PATH:$JAVA_HOME/bin"
alias jps="/usr/lib/java/jdk1.7.0_67/bin/jps"
Install Hadoop 1.2:
Download the Hadoop 1.2 release from the official Apache Hadoop website.
Create a directory named hadoop in the /usr/lib/ path. Once the directory is created, extract the
downloaded Hadoop archive and move it to the "/usr/lib/hadoop" path with sudo permissions.
Hadoop location
Configure the Hadoop location in the bashrc file: sudo gedit ~/.bashrc
Add the lines to the file:
export HADOOP_HOME="/usr/lib/hadoop/hadoop-1.2.1"
export PATH=$PATH:$HADOOP_HOME/bin
Hadoop installation cross checking
Install Hive:
Now download the Hive 0.12.0 release from the official Apache Hive website.
Create a directory named hive in the "/usr/lib" directory and move the extracted files to "/usr/lib/hive/";
this path is the Hive directory.
Open the bashrc file: sudo gedit ~/.bashrc
Configure the file with the script:
# Hive Home Directory Configuration
export HIVE_HOME="/usr/lib/hive/hive-0.12.0"
export PATH=$PATH:$HIVE_HOME/bin
hive installation cross check
We need to configure four important files in the Hadoop environment. Open the Hadoop
configuration directory at "/usr/lib/hadoop/hadoop-1.2.1/conf".
Open the files hdfs-site.xml, mapred-site.xml, core-site.xml, and hadoop-env.sh, and add the following
lines to these files respectively:
Hadoop-env.sh file
core-site.xml
mapred-site.xml
hdfs-site.xml
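The screenshots referenced above are not reproduced here. For a typical Hadoop 1.x pseudo-distributed (single-node) setup, the key entries are usually along the following lines; the exact ports and values in the original screenshots may differ, so treat this as an assumed sketch. hadoop-env.sh typically needs only export JAVA_HOME=/usr/lib/java/jdk1.7.0_67.

```xml
<!-- core-site.xml : default file system URI (port is an assumption) -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>

<!-- mapred-site.xml : JobTracker address (port is an assumption) -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>

<!-- hdfs-site.xml : a single-node cluster keeps one replica -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```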
Hadoop installation in Cloudera:
5.2.4 Moving the data to HDFS:
Once the data schema is ready and the Hadoop installation is done, our next task is to
move the data from our local file system to the Hadoop single-node cluster, i.e., to HDFS, the
component of Hadoop where data is stored as files.
The command we use is: hadoop fs -mkdir hotel
This command creates a directory for our project in HDFS. Here we are
creating a directory hotel which is used to store our datasets:
1) Hotel dataset
2) Violation code dataset
Creating directory
hadoop fs -ls
This command lists all the directories in HDFS. We use it to cross-check
whether our directory hotel has been created.
Listing the directory
hadoop fs -copyFromLocal <src> <dest>
This command moves a file from the local file system to HDFS.
We are copying our file hotel.csv to the hotel directory of HDFS:
hadoop fs -copyFromLocal '/home/username/Desktop/hotel.csv' /user/username/hotel/
'/home/username/Desktop/hotel.csv' indicates the location of the file.
‘/user/username/hotel/’ indicates the location of HDFS.
Hotel – indicates the HDFS directory
Moving data to hdfs
From the images we can see the two files hotel.csv and codes.txt have been moved to the HDFS
directory hotel.
hadoop fs -ls hotel
This command lists all the files in the specified HDFS directory hotel. We
cross-check whether our files have been created.
hadoop fs -ls hotel
Checking the files in hdfs
It is clear that we have moved all our files to the HDFS hotel directory.
5.2.5 Creating the tables in hive:
We are all set to create the tables for our dataset.
The query for creating the hotel table:
hive -e "create table 360_hotel (camis string, dba string, boro string, building string, street
string, zipcode string, phone string, cuisine_description string, inspection_date string, action
string, violation_code string, critical_flag string, score string, grade string, grade_date string,
record_date string, inspection_type string) row format delimited fields terminated by ','"
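The create-table, load, and query cycle used in this and the following modules can be sketched locally with Python's sqlite3 standing in for Hive. This is purely for illustration (the columns are a simplified subset, the rows are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Simplified version of the Hive hotel table, with a few of the columns.
con.execute("create table hotel (dba text, boro text, violation_code text)")
# Stand-in for "load data inpath ... into table".
con.executemany("insert into hotel values (?, ?, ?)",
                [("MORRIS PARK BAKE SHOP", "BRONX", "10F"),
                 ("WENDY", "BROOKLYN", "08A")])
# Stand-in for a HiveQL query against the loaded table.
n = con.execute("select count(*) from hotel where boro='BRONX'").fetchone()[0]
print(n)  # 1
```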
Table created successfully
To see the table:
hive -e "show tables"
Checking created table
5.2.6 Importing data from HDFS to the Hive warehouse:
To Load Data:
hive -e "load data inpath '/user/training/hotel/hotel.csv' overwrite into table 360_hotel"
Data loaded to hive warehouse and table.
Hotel Table Description: hive -e "desc 360_hotel"
Table Description
Checking the tables:
hive -e "select * from hotel limit 3"
verifying the data
5.2.7 Analyzing the data based on the queries from the client:
- Frequently violated codes.
- How many stores/restaurants have been inspected, location wise.
- Number of violations made by each restaurant.
- How many areas have been covered in the inspection?
- Types of cuisines inspected.
- Inspections made, in descending order.
- Violation codes, in ascending order.
- Restaurants with no violations cited.
- Critical and non-critical violations.
- Critical and non-critical violation codes.
Frequently violated codes:
hive -e "SELECT violation_code, COUNT(violation_code) FROM hotel GROUP BY
violation_code HAVING (COUNT(violation_code) > 1) limit 5"
Job execution
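The same aggregation can be sketched in plain Python, with a Counter over the violation_code column. The sample codes below are illustrative, not taken from the real dataset:

```python
from collections import Counter

# Illustrative violation codes, standing in for the violation_code column.
codes = ["10F", "08A", "10F", "04L", "10F", "08A", "06D"]

# Equivalent of: SELECT violation_code, COUNT(violation_code) ...
# GROUP BY violation_code HAVING (COUNT(violation_code) > 1) limit 5,
# ordered by frequency.
frequent = [(c, n) for c, n in Counter(codes).most_common(5) if n > 1]
print(frequent)  # [('10F', 3), ('08A', 2)]
```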
The above displays the result on the screen, but we need the result set exported to a
spreadsheet to generate the reports.
To do so we store the result set in a table, or we store the result in HDFS; then
we can move the result data from HDFS to our local file system, from where the data is exported
to spreadsheet files to generate the reports.
The result stored in HDFS can later be saved as output.ods or output.xls.
hive -e "insert overwrite directory '/user/training/output.csv' SELECT
violation_code, COUNT(violation_code) FROM hotel GROUP BY violation_code
HAVING (COUNT(violation_code) > 1)"
The result set has been stored in HDFS under the name output.csv, and the path to access it
is '/user/training/output.csv'.
To export the file from HDFS to the local file system:
hadoop fs -copyToLocal '/user/training/output.csv' /home/training/Desktop/
Query executed and the data is loaded to the hdfs directory
The result set has been stored in HDFS. Now we need to move the result set to the
local file system.
The result set has been moved to the home directory '/home/training/'.
Stored output
These are the result files in CSV format. We need to export this data to Excel to make the report in
an efficient way.
The output generated by the query.
5.2.8 Generating the Reports:
In this module we deal with all the generated reports. We can use any data reporting
tool, or we can go with Excel.
Report generated from the query
6. EXECUTION OF JOBS
6.1 METHODS OF EXECUTION:
We can execute the jobs in hive in three different ways:
6.1.1 Executing the job from the hive prompt:
The job is written directly in the hive prompt:
hive prompt
6.1.2 Executing the job from terminal with Hadoop:
The job is executed here from the Hadoop terminal; there is no contact with the hive
prompt during the job execution:
Query using hive -e
6.1.3 Executing the job as a script:
The job is executed as a script here; once the script has been written, it is placed in the
home directory of the Linux environment.
query from script
script home directory
query written in script
6.2 EXECUTION OF HIVEQL JOBS:
How many stores/restaurants have been inspected, location wise:
hive -e "insert overwrite directory '/user/training/output2-1.csv' select count(dba) from hotel
where boro='BRONX'"
hadoop fs -copyToLocal '/user/training/output2-1.csv' /home/training/Desktop
hive -e "insert overwrite directory '/user/training/output2-2.csv' select count(dba) from hotel
where boro='BROOKLYN'"
hadoop fs -copyToLocal '/user/training/output2-2.csv' /home/training/Desktop
hive -e "insert overwrite directory '/user/training/output2-3.csv' select count(dba) from hotel
where boro='MANHATTAN'"
hadoop fs -copyToLocal '/user/training/output2-3.csv' /home/training/Desktop
hive -e "insert overwrite directory '/user/training/output2-4.csv' select count(dba) from hotel
where boro='QUEENS'"
hadoop fs -copyToLocal '/user/training/output2-4.csv' /home/training/Desktop
hive -e "insert overwrite directory '/user/training/output2-5.csv' select count(dba) from hotel
where boro='STATEN ISLAND'"
hadoop fs -copyToLocal '/user/training/output2-5.csv' /home/training/Desktop
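The five per-borough queries above could also be expressed as a single grouped count (SELECT boro, COUNT(dba) FROM hotel GROUP BY boro). A plain-Python sketch of that aggregation, with invented sample rows:

```python
from collections import Counter

# (dba, boro) pairs standing in for rows of the hotel table.
rows = [("MORRIS PARK BAKE SHOP", "BRONX"),
        ("WENDY", "BROOKLYN"),
        ("RIVIERA CATERER", "BROOKLYN"),
        ("GLORIOUS FOOD", "MANHATTAN")]

# Equivalent of: SELECT boro, COUNT(dba) FROM hotel GROUP BY boro
per_boro = Counter(boro for _, boro in rows)
print(per_boro["BROOKLYN"])  # 2
```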
Number of violations made by each restaurant:
hive -e "insert overwrite directory '/user/training/output2.csv' select distinct(dba) from hotel"
hive -e "insert overwrite directory '/user/training/output3.csv' select count(violation_code) from
hotel where dba = 'MORRIS PARK BAKE SHOP'"
hive -e "insert overwrite directory '/user/training/output3-1.csv' select count(violation_code)
from hotel where dba = 'WENDY'"
hive -e "insert overwrite directory '/user/training/output3-2.csv' select count(violation_code)
from hotel where dba = 'DJ REYNOLDS PUB AND RESTAURANT'"
hive -e "insert overwrite directory '/user/training/output3-3.csv' select count(violation_code)
from hotel where dba = 'RIVIERA CATERER'"
hive -e "insert overwrite directory '/user/training/output3-4.csv' select count(violation_code)
from hotel where dba = 'TOV KOSHER KITCHEN'"
hive -e "insert overwrite directory '/user/training/output3-5.csv' select count(violation_code)
from hotel where dba = 'BRUNOS ON THE BOULEVARD'"
hive -e "insert overwrite directory '/user/training/output3-6.csv' select count(violation_code)
from hotel where dba = 'KOSHER ISLAND'"
hive -e "insert overwrite directory '/user/training/output3-7.csv' select count(violation_code)
from hotel where dba = 'WILKEN\'S FINE FOOD'"
hive -e "insert overwrite directory '/user/training/output3-8.csv' select count(violation_code)
from hotel where dba = 'REGINA CATERERS'"
hive -e "insert overwrite directory '/user/training/output3-9.csv' select count(violation_code)
from hotel where dba = 'MAY MAY KITCHEN'"
hive -e "insert overwrite directory '/user/training/output3-10.csv' select count(violation_code)
from hotel where dba = 'NATHAN\'S FAMOUS'"
hive -e "insert overwrite directory '/user/training/output3-11.csv' select count(violation_code)
from hotel where dba = 'SEUDA FOODS'"
hive -e "insert overwrite directory '/user/training/output3-12.csv' select count(violation_code)
from hotel where dba = 'CARVEL ICE CREAM'"
hive -e "insert overwrite directory '/user/training/output3-13.csv' select count(violation_code)
from hotel where dba = 'GLORIOUS FOOD'"
How many areas have been covered in the inspection:
hive -e "select distinct(boro) from hotel"
Types of cuisines inspected:
"insert overwrite directory '/user/training/output.csv' select distinct(cuisine_description) from
hotel"
"select count(cuisine_description) from hotel where cuisine_description='African'"
"select count(cuisine_description) from hotel where cuisine_description='American'"
"select count(cuisine_description) from hotel where cuisine_description='Armenian'"
"select count(cuisine_description) from hotel where cuisine_description='Bagels/Pretzels'"
"select count(cuisine_description) from hotel where cuisine_description='Bakery'"
"select count(cuisine_description) from hotel where cuisine_description='Café/Coffee/Tea'"
"select count(cuisine_description) from hotel where cuisine_description='Caribbean'"
"select count(cuisine_description) from hotel where cuisine_description='Chicken'"
"select count(cuisine_description) from hotel where cuisine_description='Chinese'"
"select count(cuisine_description) from hotel where cuisine_description='Continental'"
"select count(cuisine_description) from hotel where cuisine_description='Donuts'"
"select count(cuisine_description) from hotel where cuisine_description='German'"
"select count(cuisine_description) from hotel where cuisine_description='Greek'"
"select count(cuisine_description) from hotel where cuisine_description='Hamburgers'"
"select count(cuisine_description) from hotel where cuisine_description='Hotdogs'"
"select count(cuisine_description) from hotel where cuisine_description='Indian'"
"select count(cuisine_description) from hotel where cuisine_description='Japanese'"
Critical Violation and non critical violation codes:
hive -e "insert overwrite directory '/user/training/critical.csv' select violation_code from hotel
where critical_flag = 'Critical'"
hive -e "insert overwrite directory '/user/training/not-critical.csv' select violation_code from hotel
where critical_flag = 'Not Critical' "
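The two queries above split the violation codes on the critical_flag column. Sketched in plain Python, with invented sample rows:

```python
# (violation_code, critical_flag) pairs standing in for hotel-table rows.
rows = [("04L", "Critical"), ("10F", "Not Critical"), ("02G", "Critical")]

# Equivalent of the two SELECT ... WHERE critical_flag = ... queries.
critical = [code for code, flag in rows if flag == "Critical"]
not_critical = [code for code, flag in rows if flag == "Not Critical"]
print(critical, not_critical)  # ['04L', '02G'] ['10F']
```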
7. TESTING
7.1 INTRODUCTION:
Software testing is a critical element of software quality assurance and represents the
ultimate review of specification, design and coding. The increasing visibility of software as a
system element and the attendant costs associated with a software failure are motivating factors for
well-planned, thorough testing. Testing is the process of executing a program with the intent of
finding an error. The design of tests for software and other engineered products can be as
challenging as the initial design of the product itself.
7.2 SAMPLE UNIT TESTING:
Unit testing is done when the data is loaded into HDFS. Once the data is loaded we cross-check
it by viewing it in the browser. Now take sample data from the browser, say one chunk
of the file:
copy the data to a text file, load the sample data into HDFS and work on it; write
the jobs on the sample data, execute the jobs, and store the results. If the job executes
successfully on the sample data, then execute the job on the main dataset with the same parameters.
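The workflow above (test the job on a small chunk, then run it unchanged on the full dataset) can be sketched as follows; the job and sample rows are invented stand-ins for the Hive jobs:

```python
def run_job(rows):
    # Stand-in for the Hive job: count violations per restaurant.
    counts = {}
    for dba, code in rows:
        counts[dba] = counts.get(dba, 0) + 1
    return counts

# Unit test on a small chunk of the data first ...
sample = [("SEUDA FOODS", "10F"), ("SEUDA FOODS", "04L")]
result = run_job(sample)
print(result)  # {'SEUDA FOODS': 2}
# ... then the same job runs unchanged on the main dataset.
```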
8. SCREENS
Violation codes
Violations made
Inspections made area wise
Violations counts from each restaurant
Types of Cuisines inspected
More inspections per cuisine
Critical and non critical issues
9. CONCLUSIONS
Hadoop is a trending technology in the market. Hadoop solves the big data problem more
effectively and efficiently. More importantly, Hadoop can analyze any kind of data. Analyzing
data with Hadoop requires much less time, and it reduces the production time,
which directly benefits the economy of the organization.
Analyzing the dataset with Apache Hadoop overcomes the issues caused by the
traditional RDBMS and the master-slave architecture of servers.
In our project we analyze the Hotel Inspection dataset using Hadoop.
This analysis makes it possible to determine the total number of hotels, their violations, and their descriptions.
10. REFERENCES
Hadoop:
https://hadoop.apache.org/
Java:
http://www.oracle.com/technetwork/java/javase/downloads/
Hive:
https://hive.apache.org/
Linux:
http://www.ubuntu.com/
[Webinar] - Use mobile forms for higher business productivityTaraSpan
 
Architecting a Data Warehouse: A Case Study
Architecting a Data Warehouse: A Case StudyArchitecting a Data Warehouse: A Case Study
Architecting a Data Warehouse: A Case StudyMark Ginnebaugh
 
Data collection of five star hotel
Data collection of five star hotelData collection of five star hotel
Data collection of five star hotelAr. Sahid Akhtar
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesIvo Andreev
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureJames Serra
 
Data Collection-Primary & Secondary
Data Collection-Primary & SecondaryData Collection-Primary & Secondary
Data Collection-Primary & SecondaryPrathamesh Parab
 

Viewers also liked (15)

1 house-inspections report
1 house-inspections report1 house-inspections report
1 house-inspections report
 
Presentation on Rebeca Gasso Aguilar's "Sources, Methods and Triangulation in...
Presentation on Rebeca Gasso Aguilar's "Sources, Methods and Triangulation in...Presentation on Rebeca Gasso Aguilar's "Sources, Methods and Triangulation in...
Presentation on Rebeca Gasso Aguilar's "Sources, Methods and Triangulation in...
 
New purchase contract sample arizona - copy aar
New purchase contract sample   arizona - copy aarNew purchase contract sample   arizona - copy aar
New purchase contract sample arizona - copy aar
 
Pre Shipment Inpection
Pre Shipment Inpection Pre Shipment Inpection
Pre Shipment Inpection
 
Septic tank inspection guide
Septic tank inspection guideSeptic tank inspection guide
Septic tank inspection guide
 
Stock Analyzer Hadoop MapReduce Implementation
Stock Analyzer Hadoop MapReduce ImplementationStock Analyzer Hadoop MapReduce Implementation
Stock Analyzer Hadoop MapReduce Implementation
 
Esempio di questionario per la job analysis
Esempio di questionario per la job analysisEsempio di questionario per la job analysis
Esempio di questionario per la job analysis
 
Data collection techniques
Data collection techniquesData collection techniques
Data collection techniques
 
[Webinar] - Use mobile forms for higher business productivity
[Webinar] - Use mobile forms for higher business productivity[Webinar] - Use mobile forms for higher business productivity
[Webinar] - Use mobile forms for higher business productivity
 
Architecting a Data Warehouse: A Case Study
Architecting a Data Warehouse: A Case StudyArchitecting a Data Warehouse: A Case Study
Architecting a Data Warehouse: A Case Study
 
Data collection of five star hotel
Data collection of five star hotelData collection of five star hotel
Data collection of five star hotel
 
Chapter 9-METHODS OF DATA COLLECTION
Chapter 9-METHODS OF DATA COLLECTIONChapter 9-METHODS OF DATA COLLECTION
Chapter 9-METHODS OF DATA COLLECTION
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best Practices
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse Architecture
 
Data Collection-Primary & Secondary
Data Collection-Primary & SecondaryData Collection-Primary & Secondary
Data Collection-Primary & Secondary
 

Similar to Hotel inspection data set analysis copy

A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)Denodo
 
What is the future of data strategy?
What is the future of data strategy?What is the future of data strategy?
What is the future of data strategy?Denodo
 
Future of Data Strategy (ASEAN)
Future of Data Strategy (ASEAN)Future of Data Strategy (ASEAN)
Future of Data Strategy (ASEAN)Denodo
 
WSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product OverviewWSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product OverviewWSO2
 
9 Hyperion Performance Myths and How to Debunk Them
9 Hyperion Performance Myths and How to Debunk Them9 Hyperion Performance Myths and How to Debunk Them
9 Hyperion Performance Myths and How to Debunk ThemDatavail
 
From Business Idea to Successful Delivery by Serhiy Haziyev & Olha Hrytsay, S...
From Business Idea to Successful Delivery by Serhiy Haziyev & Olha Hrytsay, S...From Business Idea to Successful Delivery by Serhiy Haziyev & Olha Hrytsay, S...
From Business Idea to Successful Delivery by Serhiy Haziyev & Olha Hrytsay, S...SoftServe
 
127801976 mobile-shop-management-system-documentation
127801976 mobile-shop-management-system-documentation127801976 mobile-shop-management-system-documentation
127801976 mobile-shop-management-system-documentationNitesh Kumar
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...Denodo
 
Strengthening the Quality of Big Data Implementations
Strengthening the Quality of Big Data ImplementationsStrengthening the Quality of Big Data Implementations
Strengthening the Quality of Big Data ImplementationsCognizant
 
Accelerating Data Lakes and Streams with Real-time Analytics
Accelerating Data Lakes and Streams with Real-time AnalyticsAccelerating Data Lakes and Streams with Real-time Analytics
Accelerating Data Lakes and Streams with Real-time AnalyticsArcadia Data
 
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Denodo
 
Von der Zustandsüberwachung zur vorausschauenden Wartung
Von der Zustandsüberwachung zur vorausschauenden WartungVon der Zustandsüberwachung zur vorausschauenden Wartung
Von der Zustandsüberwachung zur vorausschauenden WartungPeter Schleinitz
 
On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...Jorge Cardoso
 
Hot Technologies of 2013: Investigative Analytics
Hot Technologies of 2013: Investigative AnalyticsHot Technologies of 2013: Investigative Analytics
Hot Technologies of 2013: Investigative AnalyticsInside Analysis
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationDenodo
 
Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19Prof.Balakrishnan S
 
DX2000 from NEC lets you put big data to work
DX2000 from NEC lets you put big data to workDX2000 from NEC lets you put big data to work
DX2000 from NEC lets you put big data to workPrincipled Technologies
 

Similar to Hotel inspection data set analysis copy (20)

A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)
 
What is the future of data strategy?
What is the future of data strategy?What is the future of data strategy?
What is the future of data strategy?
 
Future of Data Strategy (ASEAN)
Future of Data Strategy (ASEAN)Future of Data Strategy (ASEAN)
Future of Data Strategy (ASEAN)
 
WSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product OverviewWSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product Overview
 
9 Hyperion Performance Myths and How to Debunk Them
9 Hyperion Performance Myths and How to Debunk Them9 Hyperion Performance Myths and How to Debunk Them
9 Hyperion Performance Myths and How to Debunk Them
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
From Business Idea to Successful Delivery by Serhiy Haziyev & Olha Hrytsay, S...
From Business Idea to Successful Delivery by Serhiy Haziyev & Olha Hrytsay, S...From Business Idea to Successful Delivery by Serhiy Haziyev & Olha Hrytsay, S...
From Business Idea to Successful Delivery by Serhiy Haziyev & Olha Hrytsay, S...
 
127801976 mobile-shop-management-system-documentation
127801976 mobile-shop-management-system-documentation127801976 mobile-shop-management-system-documentation
127801976 mobile-shop-management-system-documentation
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
 
Strengthening the Quality of Big Data Implementations
Strengthening the Quality of Big Data ImplementationsStrengthening the Quality of Big Data Implementations
Strengthening the Quality of Big Data Implementations
 
Accelerating Data Lakes and Streams with Real-time Analytics
Accelerating Data Lakes and Streams with Real-time AnalyticsAccelerating Data Lakes and Streams with Real-time Analytics
Accelerating Data Lakes and Streams with Real-time Analytics
 
Big Data: hype or necessity?
Big Data: hype or necessity?Big Data: hype or necessity?
Big Data: hype or necessity?
 
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
 
Von der Zustandsüberwachung zur vorausschauenden Wartung
Von der Zustandsüberwachung zur vorausschauenden WartungVon der Zustandsüberwachung zur vorausschauenden Wartung
Von der Zustandsüberwachung zur vorausschauenden Wartung
 
On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...
 
Hot Technologies of 2013: Investigative Analytics
Hot Technologies of 2013: Investigative AnalyticsHot Technologies of 2013: Investigative Analytics
Hot Technologies of 2013: Investigative Analytics
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
 
IoTReport
IoTReportIoTReport
IoTReport
 
Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19
 
DX2000 from NEC lets you put big data to work
DX2000 from NEC lets you put big data to workDX2000 from NEC lets you put big data to work
DX2000 from NEC lets you put big data to work
 

Recently uploaded

(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 

Recently uploaded (20)

(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 

Hotel inspection data set analysis copy

  • 1. HOTEL INSPECTION DATASET ANALYSIS 1 A mini project on BIG DATA-HADOOP Project Title HOTEL INSPECTION DATASET ANALYSIS Presented by, SHARON MOSES RAGINI AKULA
  • 2. HOTEL INSPECTION DATASET ANALYSIS 2 CONTENTS Abstract List of figures List of screens TOPIC NAME PAGE NO 1. INTRODUCTION 8-10 1.1 Motivation 8 1.2 Existing System 8 1.3 Problem Definition 9 1.3.1 Storing 9 1.3.2 Processing 9 1.4 Proposed System 9 1.5 Features of Project 10 1.5.1 Storing the Data Set 10 1.5.2 Processing the Data Set 10 2. LITERATURE SURVEY 11-27 2.1 Big Data 11 2.2 Apache Hadoop 13 2.2.1 Vendors 16 2.2.2 Cloudera 20 2.2.3 Hadoop Ecosystems 22 2.3 Linux Ubuntu 24 2.4 MySQL 26 3. SYSTEM REQUIREMENTS 28-29
  • 3. HOTEL INSPECTION DATASET ANALYSIS 3 3.1 Identification of Needs 28 3.2 Environmental Requirements 29 3.2.1 Software Requirements 29 3.2.2 Hardware Requirements 29 4. BUSINESS LOGIC 30-36 4.1 System Analysis 30 4.1.1 Functional Requirements 30 4.1.1.1 Technical Feasibility 30 4.1.1.2 Operational Feasibility 31 4.2 System Design 32 4.2.1 Business Flow 32 4.2.1.1 Apache Hadoop Working Model-I 32 4.2.1.2 Apache Hadoop Working Model-II 33 4.2.2 Business Logic 34 5. PROJECT MODULES 37-61 5.1 Modules Introduction 37 5.2 Modules 37 5.2.1 Analysing and Filtering the Data 37 5.2.2 Identifying the Headers (Schema) 39 5.2.3 Installing a Single-Node Hadoop Cluster 43 5.2.4 Moving the Data to HDFS 51 5.2.5 Creating the Tables in Hive 54 5.2.6 Importing Data from HDFS to the Hive Warehouse 56 5.2.7 Analysing the Data Based on Client Queries 58 5.2.8 Generating the Reports 61 6. EXECUTION OF JOBS 62-69
  • 4. HOTEL INSPECTION DATASET ANALYSIS 4 6.1 Methods of Execution 62 6.1.1 Executing the Job from the Hive Prompt 62 6.1.2 Executing the Job from the Terminal with Hadoop 63 6.1.3 Executing the Job as a Script 63 6.2 Execution of HiveQL Jobs 65 7. TESTING 70 7.1 Introduction 70 7.2 Sample Unit Testing 70 8. SCREENS 71-74 9. CONCLUSIONS 75 10. REFERENCES 76
  • 5. HOTEL INSPECTION DATASET ANALYSIS 5 ABSTRACT Hotels are generally complex and costly to maintain: food quality must be controlled, and spaces such as guest rooms, restaurants, health clubs, swimming pools and retail stores follow different schedules and uses, each with its own engineering system to be kept in order. Maintenance therefore has to be carried out throughout the year by competent staff who undertake building services, operation and maintenance, supplemented by outsourced contractors. In the hospitality industry the maintenance of these engineering systems is important despite its complexity, because its effectiveness directly affects the quality of hotel service, food and beverage, which in turn shape guests' impression of the hotel. This project uses data from inspections carried out on hotels in various parts of the USA. The data records the violations made by hotel managements and their violation codes, and also the action taken by the government against each hotel according to those codes. We analyze where hotels violate the codes, so that new hotels can avoid these problems and survive in the market.
  • 6. HOTEL INSPECTION DATASET ANALYSIS 6 LIST OF FIGURES Figure no Name of the figure Page no 1 Big data 3 v’s 11 2 Data Measurements 12 3 Hadoop logo 13 4 Components of Hadoop 14 5 HDFS data distribution 15 6 Map reduce compute distribution 15 7 Performance and Scalability 19 8 Apache Hadoop ecosystem 24 9 Ubuntu logo 24 10 MySql logo 26 11 Apache Hadoop working model-I 32 12 Apache Hadoop working model-II 33 13 MapReduce Logic 35 14 Job execution phase in Hadoop 36 15 Violation table schema diagram 42 16 Hotel table schema diagram 43
  • 7. HOTEL INSPECTION DATASET ANALYSIS 7 LIST OF SCREENS Screen no Name of the screen Page no 1 Raw dataset 38 2 Unnecessary data fields 38 3 Final dataset 39 4 JDK installation path 44 5 Java path 45 6 Hadoop location 46 7 Hadoop installation crosscheck 47 8 Hive installation crosscheck 48 9 Hadoop-env.sh file 49 10 Core-site.xml 49 11 Mapred-site.xml 50 12 Hdfs-site.xml 50 13 Creating directory 51 14 Listing the directories 52 15 Moving data to HDFS 53 16 Checking the files in HDFS 54 17 Table created successfully 55 18 Checking created tables 56 19 Data loaded to Hive warehouse and table 56 20 Table description 57 21 Verifying the data 57 22 Job execution 58 23 Query executed and data loaded to HDFS 59 24 Result moved to home directory 60 25 Stored output 60 26 Output generated by the query 61 27 Report generated from the query 61 28 Hive prompt 62 29 Query using hive -e 63
  • 8. HOTEL INSPECTION DATASET ANALYSIS 8 30 Query from script 64 31 Script home directory 64 32 Query written in script 65 33 Violation codes 71 34 Violations made 71 35 Inspections made area wise 72 36 Violation counts from each restaurant 72 37 Types of cuisines inspected 73 38 More inspections in cuisines 73 39 Critical and non-critical issues 74
  • 9. HOTEL INSPECTION DATASET ANALYSIS 9 1. INTRODUCTION 1.1 MOTIVATION: Hotels are generally complex and costly to maintain: food quality must be controlled, and spaces such as guest rooms, restaurants, health clubs, swimming pools and retail stores follow different schedules and uses, each with its own engineering system to be kept in order. Maintenance therefore has to be carried out throughout the year by competent staff who undertake building services, operation and maintenance, supplemented by outsourced contractors. In the hospitality industry the maintenance of these engineering systems is important despite its complexity, because its effectiveness directly affects the quality of hotel service, food and beverage, which in turn shape guests' impression of the hotel. This project uses data from inspections carried out on hotels in various parts of the USA. The data records the violations made by hotel managements and their violation codes, and also the action taken by the government against each hotel according to those codes. We analyze where hotels violate the codes, so that new hotels can avoid these problems and survive in the market. 1.2 EXISTING SYSTEM: These days, the most important thing for any organization, company or business firm is to survive in the market and compete with its competitors. To do so, the firm needs to analyze its position in the market.
  • 10. HOTEL INSPECTION DATASET ANALYSIS 10 Analyzing the market requires the data the firm has generated over many years. The data from past years has multiplied rapidly, creating serious problems both in storing it and in analyzing what is stored. In recent years storage technologies have improved tremendously, while analysis techniques have lagged behind. We face problems analyzing the data stored in traditional RDBMSs (MySQL, DB2, ...), and at the same time the data size is exceeding our storage capabilities. 1.3 PROBLEM DEFINITION: The following are the problems we face with the existing systems. 1.3.1 Storing: Over the past couple of years the data has increased drastically in size, creating many problems in storing it. 1.3.2 Processing: Since the data is very large, we cannot analyze the dataset within a fixed period of time, and so cannot obtain results efficiently. 1.4 PROPOSED SYSTEM: In our proposed system we use new technologies for analyzing the datasets. The framework we use is Hadoop.
  • 11. HOTEL INSPECTION DATASET ANALYSIS 11 This framework is capable of storing a tremendous amount of data and can process the dataset in less time and more efficiently than other technologies. 1.5 FEATURES OF PROJECT: 1.5.1 Storing the Dataset: We extract the dataset from an external source into our Hadoop cluster using the Sqoop ecosystem. 1.5.2 Processing the Dataset: Once the dataset is extracted, the data is analyzed using MapReduce and other ecosystem tools that work well with the dataset.
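A minimal HiveQL sketch of the storing step outlined above, corresponding to modules 5.2.4-5.2.6 (moving the data to HDFS, creating the table, and importing it into the Hive warehouse). The table and column names here are illustrative assumptions; the real schema is the one identified in module 5.2.2:

```sql
-- Create a Hive table for the filtered inspection dataset
-- (hypothetical columns; adjust to the actual headers in the dataset).
CREATE TABLE IF NOT EXISTS hotel_inspections (
  hotel_name      STRING,
  area            STRING,
  cuisine         STRING,
  inspection_date STRING,
  violation_code  STRING,
  critical_flag   STRING,
  action_taken    STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Import the file already copied into HDFS into the Hive warehouse
-- (the HDFS path is an example).
LOAD DATA INPATH '/user/hadoop/hotel_inspections.csv'
INTO TABLE hotel_inspections;
```

For a managed table like this, `LOAD DATA INPATH` moves the HDFS file into the Hive warehouse directory rather than copying it.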
  • 12. HOTEL INSPECTION DATASET ANALYSIS 12 2. LITERATURE SURVEY 2.1 BIG DATA: Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information. Although big data doesn't refer to any specific quantity, the term is often used when speaking about petabytes and exabytes of data. Big data describes a volume of data so massive that it exceeds current processing capacity and is difficult to process. Big data can be characterized by the 3Vs: the extreme volume of data, the wide variety of types of data, and the velocity at which the data must be processed. Big Data 3V's
• 13. HOTEL INSPECTION DATASET ANALYSIS 13 An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data consisting of billions to trillions of records, e.g. web, sales, customer contact center, social media, and mobile data. Data Measurements
• 14. HOTEL INSPECTION DATASET ANALYSIS 14 2.2 APACHE HADOOP: Hadoop-logo Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing. Doug Cutting, Cloudera’s Chief Architect, helped create Apache Hadoop out of necessity as data from the web exploded and grew far beyond the ability of traditional systems to handle it. Hadoop was initially inspired by papers published by Google outlining its approach to handling an avalanche of data, and has since become the de facto standard for storing, processing and analyzing hundreds of terabytes, and even petabytes, of data. Why is Hadoop important? Since its inception, Hadoop has become one of the most talked about technologies. Why? One of the top reasons (and why it was invented) is its ability to handle huge amounts of data – any kind of data – quickly. With volumes and varieties of data growing each day, especially from social media and automated sensors, that’s a key consideration for most organizations. Other reasons include: Low cost: The open-source framework is free and uses commodity hardware to store large quantities of data. Computing power: Its distributed computing model can quickly process very large volumes of data. The more computing nodes you use, the more processing power you have.
  • 15. HOTEL INSPECTION DATASET ANALYSIS 15 Scalability: You can easily grow your system simply by adding more nodes. Little administration is required. Storage flexibility: Unlike traditional relational databases, you don’t have to preprocess data before storing it. And that includes unstructured data like text, images and videos. You can store as much data as you want and decide how to use it later. Inherent data protection and self-healing capabilities: Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. And it automatically stores multiple copies of all data. Components of Hadoop HDFS: (Hadoop Distributed File System) HDFS is a fault tolerant and self-healing distributed file system designed to turn a cluster of industry standard servers into a massively scalable pool of storage. Developed specifically for large-scale data processing workloads where scalability, flexibility and throughput are critical, HDFS accepts data in any format regardless of schema, optimizes for high bandwidth streaming, and scales to proven deployments of 100PB and beyond.
  • 16. HOTEL INSPECTION DATASET ANALYSIS 16 HDFS Data Distribution Data in HDFS is replicated across multiple nodes for compute performance and data protection. MapReduce: MapReduce is a massively scalable, parallel processing framework that works in tandem with HDFS. With MapReduce and Hadoop, compute is executed at the location of the data, rather than moving data to the compute location; data storage and computation coexist on the same physical nodes in the cluster. MapReduce processes exceedingly large amounts of data without being affected by traditional bottlenecks like network bandwidth by taking advantage of this data proximity. MapReduce Compute Distribution MapReduce divides workloads up into multiple tasks that can be executed in parallel.
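The block-and-replica behaviour described above can be sketched with a toy model. The 64 MB Hadoop 1.x default block size, 3× replication factor, and node names are assumptions of this illustration, and the round-robin placement is a simplification of HDFS's real rack-aware policy:

```python
# Toy illustration of how HDFS splits a file into fixed-size blocks
# and places replicas on different nodes. Round-robin placement is a
# simplification; real HDFS uses rack-aware replica placement.

BLOCK_SIZE = 64 * 1024 * 1024   # Hadoop 1.x default block size
REPLICATION = 3                 # default replication factor

def place_blocks(file_size, nodes):
    """Return {block_index: [nodes holding a replica of that block]}."""
    n_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division
    placement = {}
    for b in range(n_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)]
                        for r in range(min(REPLICATION, len(nodes)))]
    return placement

# A 200 MB file needs 4 blocks of 64 MB; each lands on 3 distinct nodes.
placement = place_blocks(200 * 1024 * 1024,
                         ["node1", "node2", "node3", "node4"])
```

Losing any single node therefore never loses a block, which is the "inherent data protection" property described above.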
  • 17. HOTEL INSPECTION DATASET ANALYSIS 17 The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types. The key and value classes have to be Serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. Input and Output types of a MapReduce job: (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output) 2.2.1 Vendors: Hadoop vendors share the Hadoop architecture from Apache Hadoop. EMC: Pivotal HD, the Apache Hadoop distribution from EMC, natively integrates EMC’s massively parallel processing (MPP) database technology (formerly known as Greenplum, and now known as HAWQ) with Apache Hadoop. The result is a high-performance Hadoop distribution with true SQL processing for Hadoop. SQL-based queries and other business intelligence tools can be used to analyze data that is stored in HDFS. Hortonworks: Another major player in the Hadoop market, Hortonworks has the largest number of committers and code contributors for the Hadoop ecosystem components. (Committers are the gatekeepers of Apache projects and have the power to approve code changes.) Hortonworks is a spin-off from Yahoo!, which was the original corporate driver of the Hadoop project because it needed a large-scale platform to support its search engine business. Of
  • 18. HOTEL INSPECTION DATASET ANALYSIS 18 all the Hadoop distribution vendors, Hortonworks is the most committed to the open source movement, based on the sheer volume of the development work it contributes to the community, and because all its development efforts are (eventually) folded into the open source codebase. The Hortonworks business model is based on its ability to leverage its popular HDP distribution and provide paid services and support. However, it does not sell proprietary software. Rather, the company enthusiastically supports the idea of working within the open source community to develop solutions that address enterprise feature requirements (for example, faster query processing with Hive). Hortonworks has forged a number of relationships with established companies in the data management industry: Teradata, Microsoft, Informatica, and SAS, for example. Though these companies don’t have their own, in-house Hadoop offerings, they collaborate with Hortonworks to provide integrated Hadoop solutions with their own product sets. The Hortonworks Hadoop offering is the Hortonworks Data Platform (HDP), which includes Hadoop as well as related tooling and projects. Also unlike Cloudera, Hortonworks releases only HDP versions with production-level code from the open source community. IBM: Big Blue offers a range of Hadoop offerings, with the focus around value added on top of the open source Hadoop stack.
  • 19. HOTEL INSPECTION DATASET ANALYSIS 19 Intel: The Intel Distribution for Apache Hadoop (Intel Distribution) provides distributed processing and data management for enterprise applications that analyze big data. MapR: For a complete distribution for Apache Hadoop and related projects that’s independent of the Apache Software Foundation, look no further than MapR. Boasting no Java dependencies or reliance on the Linux file system, MapR is being promoted as the only Hadoop distribution that provides full data protection, no single points of failure, and significant ease-of-use advantages. Three MapR editions are available: M3, M5, and M7. The M3 Edition is free and available for unlimited production use; MapR M5 is an intermediate-level subscription software offering; and MapR M7 is a complete distribution for Apache Hadoop and HBase that includes Pig, Hive, Sqoop, and much more. Cloudera: Perhaps the best-known player in the field, Cloudera is able to claim Doug Cutting, Hadoop’s co-founder, as its chief architect. Cloudera is seen by many people as the market leader in the Hadoop space because it released the first commercial Hadoop distribution and it is a highly active contributor of code to the Hadoop ecosystem.
  • 20. HOTEL INSPECTION DATASET ANALYSIS 20 Performance and scalability
  • 21. HOTEL INSPECTION DATASET ANALYSIS 21 2.2.2 Cloudera: Perhaps the best-known player in the field, Cloudera is able to claim Doug Cutting, Hadoop’s co-founder, as its chief architect. Cloudera is seen by many people as the market leader in the Hadoop space because it released the first commercial Hadoop distribution and it is a highly active contributor of code to the Hadoop ecosystem. Cloudera Enterprise, a product positioned by Cloudera at the center of what it calls the “Enterprise Data Hub,” includes the Cloudera Distribution for Hadoop (CDH), an open-source- based distribution of Hadoop and its related projects as well as its proprietary Cloudera Manager. Also included is a technical support subscription for the core components of CDH. Cloudera’s primary business model has long been based on its ability to leverage its popular CDH distribution and provide paid services and support. In the fall of 2013, Cloudera formally announced that it is focusing on adding proprietary value-added components on top of open source Hadoop to act as a differentiator. Also, Cloudera has made it a common practice to accelerate the adoption of alpha- and beta-level open source code for the newer Hadoop releases. Its approach is to take components it deems to be mature and retrofit them into the existing production-ready open source libraries that are included in its distribution. 2.2.3 Hadoop Ecosystems: The Hadoop platform consists of two key services: a reliable, distributed file system called Hadoop Distributed File System (HDFS) and the high-performance parallel data processing engine called Hadoop MapReduce, described in MapReduce below.
• 22. HOTEL INSPECTION DATASET ANALYSIS 22 The combination of HDFS and MapReduce provides a software framework for processing vast amounts of data in parallel on large clusters of commodity hardware (potentially scaling to thousands of nodes) in a reliable, fault-tolerant manner. Hadoop is a generic processing framework designed to execute queries and other batch read operations against massive datasets that can scale from tens of terabytes to petabytes in size. The popularity of Hadoop has grown in the last few years because it meets the needs of many organizations for flexible data analysis capabilities with an unmatched price-performance curve. The flexible data analysis features apply to data in a variety of formats, from unstructured data, such as raw text, to semi-structured data, such as logs, to structured data with a fixed schema. Hadoop has been particularly useful in environments where massive server farms are used to collect data from a variety of sources. Hadoop is able to process parallel queries as big, background batch jobs on the same server farm. This saves the user from having to acquire additional hardware for a traditional database system to process the data (assuming such a system could scale to the required size). Hadoop also reduces the effort and time required to load data into another system; you can process it directly within Hadoop. This overhead becomes impractical with very large data sets. Many of the ideas behind the open source Hadoop project originated from the Internet search community, most notably Google and Yahoo!. Search engines employ massive farms of inexpensive servers that crawl the Internet, retrieving Web pages into local clusters where they are analyzed with massive, parallel queries to build search indices and other useful data structures.
• 23. HOTEL INSPECTION DATASET ANALYSIS 23 The Hadoop ecosystem includes other tools to address particular needs. Hive is a SQL dialect and Pig is a dataflow language; both hide the tedium of creating MapReduce jobs behind higher-level abstractions more appropriate for user goals. ZooKeeper is used for federating services and Oozie is a scheduling system. Avro, Thrift and Protobuf are platform-portable data serialization and description formats. MapReduce: MapReduce is now the most widely used, general-purpose computing model and runtime system for distributed data analytics. It provides a flexible and scalable foundation for analytics, from traditional reporting to leading-edge machine learning algorithms. In the MapReduce model, a compute “job” is decomposed into smaller “tasks” (which correspond to separate Java Virtual Machine (JVM) processes in the Hadoop implementation). The tasks are distributed around the cluster to parallelize and balance the load as much as possible. The MapReduce runtime infrastructure coordinates the tasks, re-running any that fail or appear to hang. Users of MapReduce don’t need to implement parallelism or reliability features themselves. Instead, they focus on the data problem they are trying to solve. Pig: Pig is a platform for constructing data flows for extract, transform, and load (ETL) processing and analysis of large datasets. Pig Latin, the programming language for Pig, provides common data manipulation operations, such as grouping, joining, and filtering. Pig generates Hadoop MapReduce jobs to perform the data flows. This high-level language for ad hoc analysis allows developers to inspect data stored in HDFS without needing to learn the complexities of the MapReduce framework, thus simplifying access to the data. The Pig Latin scripting language is not only a higher-level data flow language but also has operators similar to SQL (e.g., FILTER and JOIN) that are translated into a series of map and
• 24. HOTEL INSPECTION DATASET ANALYSIS 24 reduce functions. Pig Latin, in essence, is designed to fill the gap between the declarative style of SQL and the low-level procedural style of MapReduce. Hive: Hive is a SQL-based data warehouse system for Hadoop that facilitates data summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems (e.g., HDFS, MapR-FS, and S3) and some NoSQL databases. Hive is not a relational database, but a query engine that supports the parts of SQL specific to querying data, with some additional support for writing new tables or files, but not for updating individual records. That is, Hive jobs are optimized for scalability, i.e., computing over all rows, not for latency, i.e., when you just want a few rows returned and you want the results returned quickly. Hive’s SQL dialect is called HiveQL. Table schemas can be defined that reflect the data in the underlying files or data stores, and SQL queries can be written against that data. Queries are translated to MapReduce jobs to exploit the scalability of MapReduce. Hive also supports custom extensions written in Java, including user-defined functions (UDFs) and serializer-deserializers for reading and optionally writing custom formats, e.g., JSON and XML dialects. Hence, analysts have tremendous flexibility in working with data from many sources and in many different formats, with minimal need for complex ETL processes to transform data into more restrictive formats. Contrast with Shark and Impala.
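Since Hive itself is installed only later (section 5.2.3), the flavour of HiveQL can be previewed with any SQL engine. The following sqlite3 sketch runs the kind of GROUP BY aggregation the project later issues against the hotel table; the table and column names follow the schema defined in section 5.2.2, and the sample rows are invented for illustration:

```python
# HiveQL shares SQL's SELECT / WHERE / GROUP BY semantics; this
# sqlite3 sketch previews the style of query later run in Hive.
# Sample rows are invented; column names follow section 5.2.2.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hotel "
             "(dba TEXT, boro TEXT, violation_code TEXT, critical_flag TEXT)")
conn.executemany(
    "INSERT INTO hotel VALUES (?, ?, ?, ?)",
    [("CAFE A", "BRONX", "04L", "Critical"),
     ("CAFE A", "BRONX", "10F", "Not Critical"),
     ("DINER B", "QUEENS", "04L", "Critical")])

# Count critical violations per violation code -- a typical inspection query.
rows = conn.execute(
    """SELECT violation_code, COUNT(*) AS n
       FROM hotel
       WHERE critical_flag = 'Critical'
       GROUP BY violation_code
       ORDER BY n DESC""").fetchall()
# rows -> [('04L', 2)]
```

In Hive the same statement would be compiled into one or more MapReduce jobs rather than executed in-process.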
  • 25. HOTEL INSPECTION DATASET ANALYSIS 25 Apache Hadoop Ecosystem 2.3 LINUX UBUNTU: Ubuntu-logo Ubuntu is an ancient African word meaning 'humanity to others'. It also means 'I am what I am because of who we all are'. The Ubuntu operating system brings the spirit of Ubuntu to the world of computers. Linux was already established as an enterprise server platform in 2004, but free software was not a part of everyday life for most computer users. That's why Mark Shuttleworth gathered a small team of developers from one of the most established Linux projects – Debian – and set out to create an easy-to-use Linux desktop: Ubuntu.
  • 26. HOTEL INSPECTION DATASET ANALYSIS 26 The vision for Ubuntu is part social and part economic: free software, available to everybody on the same terms, and funded through a portfolio of services provided by Canonical. The first official Ubuntu release -- Version 4.10, codenamed the 'Warty Warthog' — was launched in October 2004, and sparked dramatic global interest as thousands of free software enthusiasts and experts joined the Ubuntu community. The governance of Ubuntu is somewhat independent of Canonical, with volunteer leaders from around the world taking responsibility for many critical elements of the project. It remains a key tenet of the Ubuntu Project that Ubuntu is a shared work between Canonical, other companies, and the thousands of volunteers who bring their expertise to bear on making it a world-class platform for anyone to use. Ubuntu today has eight flavours and dozens of localised and specialised derivatives. There are also special editions for servers, OpenStack clouds, and mobile devices. All editions share common infrastructure and software, making Ubuntu a unique single platform that scales from consumer electronics to the desktop and up into the cloud for enterprise computing. The Ubuntu OS and the innovative Ubuntu for Android convergence solution make it an exciting time for Ubuntu on mobile devices. In the cloud, Ubuntu is the reference operating system for the OpenStack project, it’s a hugely popular guest OS on Amazon's EC2 and Rackspace's Cloud, and it’s pre-installed on computers from Dell, HP, Asus, Lenovo and other global vendors. And thanks to that shared infrastructure, developers can work on the desktop, and smoothly deliver code to cloud servers running the stripped-down Ubuntu Server Edition.
• 27. HOTEL INSPECTION DATASET ANALYSIS 27 After many years Ubuntu still is and always will be free to use, share and develop. We hope it will bring a touch of light to your computing — and we hope that you'll join us in helping to build the next version. 2.4 MySQL: Mysql-logo MySQL is the world's most popular open source database software, with over 100 million copies of its software downloaded or distributed throughout its history. With its superior speed, reliability, and ease of use, MySQL has become the preferred choice for Web, Web 2.0, SaaS, ISV, Telecom companies and forward-thinking corporate IT Managers because it eliminates the major problems associated with downtime, maintenance and administration for modern, online applications. Many of the world's largest and fastest-growing organizations use MySQL to save time and money powering their high-volume Web sites, critical business systems, and packaged software — including industry leaders such as Yahoo!, Alcatel-Lucent, Google, Nokia, YouTube, Wikipedia, and Booking.com. The flagship MySQL offering is MySQL Enterprise, a comprehensive set of production-tested software, proactive monitoring tools, and premium support services available in an affordable annual subscription.
• 28. HOTEL INSPECTION DATASET ANALYSIS 28 MySQL is a key part of LAMP (Linux, Apache, MySQL, PHP / Perl / Python), the fast-growing open source enterprise software stack. More and more companies are using LAMP as an alternative to expensive proprietary software stacks because of its lower cost and freedom from platform lock-in.
• 29. HOTEL INSPECTION DATASET ANALYSIS 29 3. SYSTEM REQUIREMENTS The purpose of this SRS document is to identify the requirements and functionalities of the Hotel Inspection Dataset Analysis system. The SRS will define how our team and the client conceive the final product and the characteristics or functionality it must have. This document also notes the optional requirements which we plan to implement but which are not mandatory for the functioning of the project. This phase appraises the requirements for the Hotel Inspection dataset; several processes are involved in evaluating the requirements systematically. The first step in analyzing the requirements of the system is recognizing the nature of the system, so that the investigation is reliable, and all the cases are formulated to better understand the analysis of the dataset. Document Conventions: The font-size conventions remain the same as for other documents in the project. The section headings have the largest font size of 14, subheadings have a font size of 12 (bold), and body text is font size 12. The priorities of the requirements are specified with the requirement statements. Intended Audience and Reading Suggestions: This document is intended for project developers, managers, users, testers and documentation writers. It aims at discussing design and implementation constraints, dependencies, system features, external interface requirements and other non-functional requirements.
• 30. HOTEL INSPECTION DATASET ANALYSIS 30 3.1 IDENTIFICATION OF NEEDS: The foremost necessity for a business firm or an organization is to know how it is performing in the market and, in parallel, how to overcome its competitors. To do so we need to analyze our data against all the available factors. The system requirements for the project are: 3.2 ENVIRONMENTAL REQUIREMENTS: 3.2.1 Software Requirements: Development & Usage: Linux Operating System. Apache Hadoop. Mozilla Firefox (or any browser). Microsoft Excel or OpenOffice. 3.2.2 Hardware Requirements: Development & Usage: Pentium 4 processor. 40 GB hard disk. 256 MB RAM / 4 GB RAM. System with all standard accessories like monitor, keyboard, mouse, etc.
• 31. HOTEL INSPECTION DATASET ANALYSIS 31 4. BUSINESS LOGIC Logic Features: 1. Store: The main intention of the Hotel Inspection Dataset is to analyze the data based on the violations recorded for all inspected restaurants and hotels. To do this, we first load the data into the Hadoop HDFS component. 2. Analysis: This is the other major step for the dataset; this module depends on the type of dataset we have. Our Hotel Inspection dataset is structured data, so we work with the Hadoop ecosystem tool Hive. 4.1 SYSTEM ANALYSIS: 4.1.1 FUNCTIONAL REQUIREMENTS: 4.1.1.1 Technical Feasibility: Evaluating technical feasibility is the trickiest part of a feasibility study. This is because, at this point in time, not many detailed designs of the system exist, making it difficult to assess issues like performance, costs (on account of the kind of technology to be deployed), etc. A number of issues have to be considered while doing a technical analysis. Understand the different technologies involved in the proposed system. Before commencing the project, we have to be very clear about which technologies are required for the development of the new system.
• 32. HOTEL INSPECTION DATASET ANALYSIS 32 Find out whether the organization currently possesses the required technologies. Is the required technology available to the organization? If so, is the capacity sufficient? For instance: “Will the current printer be able to handle the new reports and forms required for the new system?” 4.1.1.2 Operational Feasibility: Proposed projects are beneficial only if they can be turned into information systems that will meet the organization's operating requirements. Simply stated, this test of feasibility asks whether the system will work when it is developed and installed, and whether there are major barriers to implementation. Here are questions that help test the operational feasibility of a project. • Is there sufficient support for the project from management and from users? If the current system is well liked and used to the extent that people cannot see reasons for change, there may be resistance. • Are the current business methods acceptable to the users? If they are not, users may welcome a change that will bring about a more operational and useful system. • Have the users been involved in the planning and development of the project? Early involvement reduces the chances of resistance to the system in general and increases the likelihood of a successful project.
• 33. HOTEL INSPECTION DATASET ANALYSIS 33 Since the proposed system was to help reduce the hardships encountered in the existing manual system, the new system was considered operationally feasible. 4.2 SYSTEM DESIGN: 4.2.1 Business Flow: 4.2.1.1 Apache Hadoop Working Model-I: Apache Hadoop Working Model-I
1. Create a Secure Shell (SSH) connection from localhost to the Linux (Ubuntu) kernel: ssh localhost
2. Start all daemons (NameNode, Secondary NameNode, DataNode, JobTracker, TaskTracker): start-all.sh
3. Check whether all daemons are up: jps
4. Create a directory and move the dataset to HDFS from the Linux terminal.
5. Check the data format from the browser view.
6. Based on the nature of the data, choose the ecosystem tool to work with.
7. Based on the ecosystem tool, design the platform and execute the jobs.
8. Once the jobs are executed, generate the reports based on the dataset.
9. Analyze the reports for the improvement of the firm.
• 34. HOTEL INSPECTION DATASET ANALYSIS 34 4.2.1.2 Apache Hadoop Working Model-II: Apache Hadoop Working Model-II
1. Install a virtual machine (VMware).
2. Open the virtual machine image already created by Cloudera.
3. Start CentOS from the virtual machine and work with the terminal.
4. Create a directory and move the dataset to HDFS from the Linux terminal.
5. Check the data format from the browser view.
6. Based on the nature of the data, choose the ecosystem tool to work with.
7. Based on the ecosystem tool, design the platform and execute the jobs.
8. Once the jobs are executed, generate the reports based on the dataset.
9. Analyze the reports for the improvement of the firm.
• 35. HOTEL INSPECTION DATASET ANALYSIS 35 4.2.2 Business Logic: Functional Programming: Multithreading is one of the popular ways of doing parallel programming, but the major complexity of multi-threaded programming is coordinating each thread's access to shared data. We need constructs like semaphores and locks, and must use them with great care, otherwise deadlocks will result. User-defined Map/Reduce functions: Map/reduce is a special form of such a DAG which is applicable to a wide range of use cases. It is organized as a “map” function which transforms a piece of data into some number of key/value pairs. Each of these elements is then sorted by its key and sent to the same node, where a “reduce” function is used to merge the values (of the same key) into a single result. Mapper:
map(input_record) {
  ...
  emit(k1, v1)
  ...
  emit(k2, v2)
  ...
}
• 36. HOTEL INSPECTION DATASET ANALYSIS 36 Reducer:
reduce(key, values) {
  aggregate = initialize()
  while (values.has_next) {
    aggregate = merge(values.next)
  }
  collect(key, aggregate)
}
MapReduce logic
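The mapper/reducer pseudocode above can be made concrete as a single-process Python word count. This is a sketch of the programming model only; real Hadoop jobs implement Mapper and Reducer classes (in Java) that the framework distributes across the cluster:

```python
# Single-process sketch of the map -> shuffle/sort -> reduce flow
# from the pseudocode above, using word count as the classic example.
from collections import defaultdict

def mapper(record):
    # emit(word, 1) for every word in the input record
    for word in record.split():
        yield word, 1

def reducer(key, values):
    # merge all values for one key into a single aggregate
    return key, sum(values)

def run_job(records):
    # "shuffle" phase: group every emitted value by its key
    groups = defaultdict(list)
    for record in records:
        for k, v in mapper(record):
            groups[k].append(v)
    # reduce phase: one reducer call per distinct key, in sorted key order
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))

counts = run_job(["hadoop stores data", "hadoop processes data"])
# counts -> {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

In a real cluster, the shuffle step is what moves each key's values to a single node, exactly as the text describes.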
  • 37. HOTEL INSPECTION DATASET ANALYSIS 37 Job execution phase in Hadoop
• 38. HOTEL INSPECTION DATASET ANALYSIS 38 5. PROJECT MODULES 5.1 MODULES INTRODUCTION: The dataset holds Hotel Inspection data from recent years. We have taken the dataset from the reference website https://data.ny.gov/. The dataset is very large, around three lakh (300,000) lines. We took only a part of it, as our basic systems cannot support such a huge dataset; working with the full data needs a well-configured cluster. For this project we have taken a dataset of around twenty-five thousand lines. We analyzed the raw dataset, eliminated the unnecessary fields, and gave the dataset a well-organized format. The dataset is divided into two tables based on the data and their fields. The first table holds the inspection data, with parameters such as id, name of restaurant, area, address, location, inspection date, violated code, critical point of violation, and type of inspection. The second table holds the violation code and the violation description. 5.2 MODULES: 5.2.1 Analyzing the Data and Filtering the Data: In the first step of the project we need to analyze the data and check how it has been formatted. We should be aware of the fields given to us and know the importance of each and every field; if there is unnecessary information cluttering our dataset, we need to talk to our client before changing the dataset or removing or moving any columns.
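The column-filtering step described above can be sketched with Python's csv module. The column list kept here and the sample row are illustrative assumptions, not the project's exact choices:

```python
# Sketch of the filtering step: keep only the columns the client
# needs and drop the rest. The KEEP list and sample row below are
# illustrative; the project's actual column choices may differ.
import csv
import io

KEEP = ["CAMIS", "DBA", "BORO", "VIOLATION CODE", "CRITICAL FLAG"]

def filter_columns(raw_csv, keep=KEEP):
    """Return a CSV string containing only the columns in `keep`."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=keep)
    writer.writeheader()
    for row in reader:
        writer.writerow({k: row[k] for k in keep})
    return out.getvalue()

raw = ("CAMIS,DBA,BORO,PHONE,VIOLATION CODE,CRITICAL FLAG\n"
       "30075445,CAFE A,BRONX,7183331500,04L,Critical\n")
filtered = filter_columns(raw)
# The PHONE column (and its value) is gone from `filtered`.
```

The same idea scales to the full file by streaming rows from disk instead of a string.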
• 39. HOTEL INSPECTION DATASET ANALYSIS 39 Raw dataset with unnecessary data fields
• 40. HOTEL INSPECTION DATASET ANALYSIS 40 The unnecessary fields have been removed from the raw dataset, and the dataset has been divided into two separate tables. Table 1 (violation) – dataset with the violation code and its explanation. Table 2 (hotel) – dataset with the violation code and the remaining fields from the filtered dataset. The final dataset will now be referred to as the filtered dataset. final-dataset 5.2.2 Identifying the Headers (Schema): The schema is generated based on the dataset and the data we are holding. This schema is for the table hotel.
• 41. HOTEL INSPECTION DATASET ANALYSIS 41 Schema for Hotel:
Name of Header - Description - Header name in schema
ID - id (primary key) - id
CAMIS - Refers to the store IDs - camis
DBA - Refers to the restaurant - dba
BORO - Place - boro
BUILDING - Building number - building
STREET - Street address - street
ZIPCODE - Area zip code - zipcode
PHONE - Store phone - phone
CUISINE DESCRIPTION - Type of cuisine - cuisine_description
INSPECTION DATE - Inspected on date - inspection_date
ACTION - Type of action - action
VIOLATION CODE - Violation codes - violation_code
CRITICAL FLAG - Seriousness of violations - critical_flag
SCORE - Rating - score
GRADE - Grade - grade
GRADE DATE - Grade date - grade_date
RECORD DATE - Record date - record_date
INSPECTION TYPE - Type of inspection - inspection_type
• 42. HOTEL INSPECTION DATASET ANALYSIS 42 This schema is for the table violation: the violation code and the violation description. Schema for table Violation:
Name of Header - Description - Header name in schema
ID - id (primary key) - id
VIOLATION CODE - Refers to the violation code - violation_code
VIOLATION DESCRIPTION - Refers to the description of the code - v_desc
Table for Hotel:
Name of Header - Description - Header name in schema
ID - id (primary key) - id
CAMIS - Refers to the store IDs - camis
DBA - Refers to the restaurant - dba
BORO - Place - boro
BUILDING - Building number - building
STREET - Street address - street
ZIPCODE - Area zip code - zipcode
PHONE - Store phone - phone
CUISINE DESCRIPTION - Type of cuisine - cuisine_description
INSPECTION DATE - Inspected on date - inspection_date
ACTION - Type of action - action
VIOLATION CODE - Violation codes - violation_code
• 43. HOTEL INSPECTION DATASET ANALYSIS 43 Name of Header - Description - Header name in schema
CRITICAL FLAG - Seriousness of violations - critical_flag
SCORE - Rating - score
GRADE - Grade - grade
GRADE DATE - Grade date - grade_date
RECORD DATE - Record date - record_date
INSPECTION TYPE - Type of inspection - inspection_type
Table for Violation:
Name of Header - Description - Header name in schema
ID - id (primary key) - id
VIOLATION CODE - Refers to the violation code - violation_code
VIOLATION DESCRIPTION - Refers to the description of the code - v_desc
Violation Table Schema diagram: Violation Table Schema diagram
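The mapping from raw headers to schema names shown in the tables above follows one simple rule: lower-case the header and replace spaces with underscores. A small helper makes that explicit:

```python
# The schema names in the tables above are the raw headers
# lower-cased with spaces replaced by underscores; this helper
# applies that rule.
def to_schema_name(header):
    return header.strip().lower().replace(" ", "_")

raw_headers = ["CAMIS", "DBA", "CUISINE DESCRIPTION", "INSPECTION DATE",
               "VIOLATION CODE", "CRITICAL FLAG"]
schema = [to_schema_name(h) for h in raw_headers]
# schema -> ['camis', 'dba', 'cuisine_description', 'inspection_date',
#            'violation_code', 'critical_flag']
```

(The one exception is VIOLATION DESCRIPTION, which the project shortens to v_desc by hand.)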
• 44. HOTEL INSPECTION DATASET ANALYSIS 44 Hotel Table Schema diagram: Hotel Table Schema diagram 5.2.3 Installing a Single-Node Hadoop Cluster: Java Development Kit 1.7: Download the Java Development Kit 1.7 from the official Oracle website. Once JDK 1.7 is downloaded, extract the archive and create a directory named java under /usr/lib. The path of the directory is “/usr/lib/java”
• 45. HOTEL INSPECTION DATASET ANALYSIS 45 Once the java folder is created with sudo (administrator) permissions, move the downloaded JDK into /usr/lib/java/, so JDK 1.7 lives at /usr/lib/java/jdk1.7.0_67. The Java path is now “/usr/lib/java/jdk1.7.0_67”. Jdk installation path Once this part is done, we need to register the Java executables with the system; to do so, run the commands below.
sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/java/jdk1.7.0_67/bin/java" 1
sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/java/jdk1.7.0_67/bin/javac" 1
sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/java/jdk1.7.0_67/bin/javaws" 1
  • 46. To verify that the Java installation is complete, run "java -version". java path. Now edit the .bashrc file in Linux (Ubuntu): run "sudo gedit ~/.bashrc" and add the following lines to the file:
    export JAVA_HOME="/usr/lib/java/jdk1.7.0_67"
    export PATH="$PATH:$JAVA_HOME/bin"
    alias jps="/usr/lib/java/jdk1.7.0_67/bin/jps"
  • 47. Install Hadoop 1.2.1: Download the Hadoop 1.2.1 release from the Apache Hadoop website. Create a directory named hadoop in /usr/lib, then extract the downloaded archive and move it to /usr/lib/hadoop with sudo permissions. Hadoop location. Configure the Hadoop location in the .bashrc file (sudo gedit ~/.bashrc) by adding:
    export HADOOP_HOME="/usr/lib/hadoop/hadoop-1.2.1"
    export PATH="$PATH:$HADOOP_HOME/bin"
  • 48. Hadoop installation cross-check. Install Hive: Download the Hive 0.12.0 release from the Apache Hive website. Create a directory named hive in /usr/lib and move the extracted files to /usr/lib/hive. Open the .bashrc file (sudo gedit ~/.bashrc) and add:
    # Hive home directory configuration
    export HIVE_HOME="/usr/lib/hive/hive-0.12.0"
    export PATH="$PATH:$HIVE_HOME/bin"
  • 49. Hive installation cross-check. We need to configure four important files in the Hadoop environment. Open the configuration directory at /usr/lib/hadoop/hadoop-1.2.1/conf and edit the files hdfs-site.xml, mapred-site.xml, core-site.xml, and hadoop-env.sh, adding the following lines to each file respectively:
  • 50. hadoop-env.sh file; core-site.xml
  • 51. mapred-site.xml; hdfs-site.xml
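The configuration files above appear only as screenshots in the deck. As a sketch, a minimal single-node Hadoop 1.x configuration along those lines typically contains the entries below; the specific values (ports 9000/9001, replication 1) are assumptions taken from the standard single-node setup guide, not from the screenshots. Here they are written to a temporary directory for illustration; in the real setup they belong in /usr/lib/hadoop/hadoop-1.2.1/conf.

```shell
# Minimal single-node Hadoop 1.x configuration sketch (assumed standard values).
CONF_DIR=$(mktemp -d)

# core-site.xml: where the NameNode listens
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

# hdfs-site.xml: single node, so one replica is enough
cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

# mapred-site.xml: where the JobTracker listens
cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
EOF

# hadoop-env.sh: point Hadoop at the JDK installed earlier
echo 'export JAVA_HOME="/usr/lib/java/jdk1.7.0_67"' > "$CONF_DIR/hadoop-env.sh"

echo "wrote $(ls "$CONF_DIR" | wc -l) config files"
```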
  • 52. Hadoop installation in Cloudera. 5.2.4 Moving the data to HDFS: With the data schema ready and the Hadoop installation done, the next task is to move the data from the local file system to the single-node cluster, i.e., to HDFS, the Hadoop component where data is stored as files. The command we use is:
    hadoop fs -mkdir hotel
    This command creates a directory for our project in HDFS. Here we create a directory hotel to store our two datasets: 1) the hotel dataset and 2) the violation-code dataset. Creating directory
  • 53. hadoop fs -ls
    This command lists all the directories in HDFS; we use it to cross-check that the hotel directory has been created. Listing the directory.
    hadoop fs -copyFromLocal <src> <dest>
    This command copies a file from the local file system to HDFS. We copy the file hotel.csv to the hotel directory of HDFS:
    hadoop fs -copyFromLocal '/home/username/Desktop/hotel.csv' /user/username/hotel/
    '/home/username/Desktop/hotel.csv' is the location of the file on the local file system.
  • 54. '/user/username/hotel/' is the destination in HDFS; hotel is the HDFS directory. Moving data to HDFS. From the images we can see that the two files, hotel.csv and codes.txt, have been moved to the HDFS directory hotel.
    hadoop fs -ls hotel
    This command lists all the files in the specified HDFS directory hotel; we cross-check that our files are present.
  • 55. Checking the files in HDFS. It is clear that all our files have been moved into the hotel directory in HDFS. 5.2.5 Creating the tables in Hive: We are now ready to create the tables for our dataset. The query for creating the hotel table:
    hive -e "create table 360_hotel (camis string, dba string, boro string, building string, street string, zipcode string, phone string, cuisine_description string, inspection_date string, action string, violation_code string, critical_flag string, score string, grade string, grade_date string, record_date string, inspection_type string) row format delimited fields terminated by ','"
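The deck shows DDL only for the hotel table. A matching statement for the Violation table would follow the column names given in the schema section (violation_code, v_desc); as a sketch, with the table name 360_violation being an assumption modeled on the 360_hotel naming convention:

```shell
# Sketch: DDL for the Violation table; table name 360_violation is assumed.
cat > /tmp/create_violation.hql <<'EOF'
create table 360_violation (
  id string,
  violation_code string,
  v_desc string
)
row format delimited fields terminated by ',';
EOF
# hive -f /tmp/create_violation.hql   # would run it on a working Hive install
cat /tmp/create_violation.hql
```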
  • 56. Table created successfully. To see the table:
    hive -e "show tables"
  • 57. Checking the created table. 5.2.6 Importing data from HDFS to the Hive warehouse: To load the data:
    hive -e "load data inpath '/user/training/hotel/hotel.csv' overwrite into table 360_hotel"
    The data is loaded into the Hive warehouse and table.
  • 58. Hotel table description:
    hive -e "desc 360_hotel"
    Table description. Checking the table:
    hive -e "select * from 360_hotel limit 3"
    Verifying the data
  • 59. 5.2.7 Analyzing the data based on the queries from the client:
    - Most frequently violated codes.
    - How many stores/restaurants have been inspected, by location.
    - Number of violations made by each restaurant.
    - How many areas have been covered in the inspection?
    - Types of cuisines inspected.
    - Cuisines with the most inspections, in descending order.
    - Violation codes by frequency, in ascending order.
    - Restaurants with no violations cited.
    - Critical and non-critical violations.
    - Critical and non-critical violation codes.
    Most frequently violated codes:
    hive -e "SELECT violation_code, COUNT(violation_code) FROM 360_hotel GROUP BY violation_code HAVING (COUNT(violation_code) > 1) LIMIT 5"
    Job execution
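The GROUP BY/COUNT logic of this job can be sanity-checked off-cluster with standard tools. As a sketch on a hypothetical handful of violation codes (assumed sample values, not drawn from the real dataset), the same frequency tally looks like this:

```shell
# Hypothetical violation codes (assumed sample, for illustration only)
printf '10F\n04L\n10F\n08A\n10F\n04L\n' > /tmp/codes.txt
# Equivalent of: SELECT violation_code, COUNT(*) ... GROUP BY violation_code
# sorted by count descending, top 5
sort /tmp/codes.txt | uniq -c | sort -rn | head -5
# prints (with leading spaces): 3 10F / 2 04L / 1 08A
```

The `sort | uniq -c` pair is the shell's version of GROUP BY with COUNT, which makes it a quick cross-check before running the full Hive job.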
  • 60. The query above prints the result to the screen, but we need the result set exported to a spreadsheet to generate reports. To do so we either store the result set in a table or store it in HDFS, then move the result data from HDFS to the local file system, and from there export it to Excel to generate the reports. The result is written to an HDFS directory named output.csv:
    hive -e "insert overwrite directory '/user/training/output.csv' SELECT violation_code, COUNT(violation_code) FROM 360_hotel GROUP BY violation_code HAVING (COUNT(violation_code) > 1)"
    The result set is now stored in HDFS at '/user/training/output.csv'. To copy it from HDFS to the local file system:
    hadoop fs -copyToLocal '/user/training/output.csv' /home/training/Desktop/
    Query executed and the data is loaded to the HDFS directory
  • 61. The result set has been stored in HDFS. Now we move it to the local file system; the result set lands in '/home/training/Desktop/'. Stored output
  • 62. These are the result files in CSV format. We export this dataset to Excel to build the report efficiently. The output generated by the query. 5.2.8 Generating the Reports: This module deals with all the generated reports; we can use any data-reporting tool, or simply Excel. Report generated from the query
  • 63. 6. EXECUTION OF JOBS. 6.1 METHODS OF EXECUTION: Jobs can be executed in Hive in three different ways. 6.1.1 Executing the job from the hive prompt: The job is written directly at the hive prompt. hive prompt
  • 64. 6.1.2 Executing the job from the terminal with Hadoop: The job is executed from the terminal with hive -e; the hive prompt is never opened during job execution. Query using hive -e. 6.1.3 Executing the job as a script: The job is written as a script, which is then placed in the home directory of the Linux environment.
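As a sketch of the script method, the HiveQL job is saved to a file and handed to Hive with the -f flag; the file name report.hql and its location under /tmp are assumptions for illustration.

```shell
# Sketch: write the HiveQL job to a script file (name report.hql is assumed)
cat > /tmp/report.hql <<'EOF'
SELECT violation_code, COUNT(violation_code)
FROM 360_hotel
GROUP BY violation_code;
EOF
# hive -f /tmp/report.hql   # executes the script (requires a running Hive)
wc -l < /tmp/report.hql
# prints 3
```

Keeping jobs in script files makes them easy to version, reuse, and schedule, compared to retyping them at the hive prompt.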
  • 65. query from script; script home directory
  • 66. query written in script. 6.2 EXECUTION OF HIVEQL JOBS: How many stores/restaurants have been inspected, by location:
    hive -e "insert overwrite directory '/user/training/output2-1.csv' select count(dba) from 360_hotel where boro='BRONX'"
    hadoop fs -copyToLocal '/user/training/output2-1.csv' /home/training/Desktop
    hive -e "insert overwrite directory '/user/training/output2-2.csv' select count(dba) from 360_hotel where boro='BROOKLYN'"
    hadoop fs -copyToLocal '/user/training/output2-2.csv' /home/training/Desktop
    hive -e "insert overwrite directory '/user/training/output2-3.csv' select count(dba) from 360_hotel where boro='MANHATTAN'"
    hadoop fs -copyToLocal '/user/training/output2-3.csv' /home/training/Desktop
  • 67. hive -e "insert overwrite directory '/user/training/output2-4.csv' select count(dba) from 360_hotel where boro='QUEENS'"
    hadoop fs -copyToLocal '/user/training/output2-4.csv' /home/training/Desktop
    hive -e "insert overwrite directory '/user/training/output2-5.csv' select count(dba) from 360_hotel where boro='STATEN ISLAND'"
    hadoop fs -copyToLocal '/user/training/output2-5.csv' /home/training/Desktop
    Number of violations made by each restaurant:
    hive -e "insert overwrite directory '/user/training/output2.csv' select distinct(dba) from 360_hotel"
    hive -e "insert overwrite directory '/user/training/output3.csv' select count(violation_code) from 360_hotel where dba = 'MORRIS PARK BAKE SHOP'"
    hive -e "insert overwrite directory '/user/training/output3-1.csv' select count(violation_code) from 360_hotel where dba = 'WENDY'"
    hive -e "insert overwrite directory '/user/training/output3-2.csv' select count(violation_code) from 360_hotel where dba = 'DJ REYNOLDS PUB AND RESTAURANT'"
    hive -e "insert overwrite directory '/user/training/output3-3.csv' select count(violation_code) from 360_hotel where dba = 'RIVIERA CATERER'"
    hive -e "insert overwrite directory '/user/training/output3-4.csv' select count(violation_code) from 360_hotel where dba = 'TOV KOSHER KITCHEN'"
  • 68. hive -e "insert overwrite directory '/user/training/output3-5.csv' select count(violation_code) from 360_hotel where dba = 'BRUNOS ON THE BOULEVARD'"
    hive -e "insert overwrite directory '/user/training/output3-6.csv' select count(violation_code) from 360_hotel where dba = 'KOSHER ISLAND'"
    hive -e "insert overwrite directory '/user/training/output3-7.csv' select count(violation_code) from 360_hotel where dba = 'WILKEN\'S FINE FOOD'"
    hive -e "insert overwrite directory '/user/training/output3-8.csv' select count(violation_code) from 360_hotel where dba = 'REGINA CATERERS'"
    hive -e "insert overwrite directory '/user/training/output3-9.csv' select count(violation_code) from 360_hotel where dba = 'MAY MAY KITCHEN'"
    hive -e "insert overwrite directory '/user/training/output3-10.csv' select count(violation_code) from 360_hotel where dba = 'NATHAN\'S FAMOUS'"
    hive -e "insert overwrite directory '/user/training/output3-11.csv' select count(violation_code) from 360_hotel where dba = 'SEUDA FOODS'"
    hive -e "insert overwrite directory '/user/training/output3-12.csv' select count(violation_code) from 360_hotel where dba = 'CARVEL ICE CREAM'"
    hive -e "insert overwrite directory '/user/training/output3-13.csv' select count(violation_code) from 360_hotel where dba = 'GLORIOUS FOOD'"
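Running one query per restaurant works, but a single GROUP BY query (for example, select dba, count(violation_code) from 360_hotel group by dba) produces all the counts in one pass. The same aggregation can be sketched locally with awk on a hypothetical two-column sample; the dba/code pairs below are assumed values for illustration, not real inspection records:

```shell
# Hypothetical dba,violation_code pairs (assumed sample data)
cat > /tmp/pairs.csv <<'EOF'
MORRIS PARK BAKE SHOP,10F
WENDY,04L
MORRIS PARK BAKE SHOP,08A
EOF
# Equivalent of: select dba, count(violation_code) from 360_hotel group by dba
awk -F',' '{count[$1]++} END {for (d in count) print d ": " count[d]}' /tmp/pairs.csv | sort
# prints: MORRIS PARK BAKE SHOP: 2, then WENDY: 1
```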
  • 69. How many areas have been covered in the inspection:
    hive -e "select distinct(boro) from 360_hotel"
    Types of cuisines inspected:
    hive -e "insert overwrite directory '/user/training/output.csv' select distinct(cuisine_description) from 360_hotel"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='African'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='American'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Armenian'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Bagels/Pretzels'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Bakery'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Café/Coffee/Tea'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Caribbean'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Chicken'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Chinese'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Continental'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Donuts'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='German'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Greek'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Hamburgers'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Hotdogs'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Indian'"
    hive -e "select count(cuisine_description) from 360_hotel where cuisine_description='Japanese'"
  • 70. Critical and non-critical violation codes:
    hive -e "insert overwrite directory '/user/training/critical.csv' select violation_code from 360_hotel where critical_flag = 'Critical'"
    hive -e "insert overwrite directory '/user/training/not-critical.csv' select violation_code from 360_hotel where critical_flag = 'Not Critical'"
  • 71. 7. TESTING. 7.1 INTRODUCTION: Software testing is a critical element of software quality assurance and represents the ultimate review of specification, design, and coding. The increasing visibility of software as a system element, and the attendant costs associated with software failure, are motivating forces for well-planned, thorough testing. Testing is the process of executing a program with the intent of finding errors. The design of tests for software and other engineered products can be as challenging as the initial design of the product itself. 7.2 SAMPLE UNIT TESTING: Unit testing is done when the data is loaded into HDFS. Once the data is loaded, we cross-check it by viewing it in the browser. Take a sample of the data from the browser, say one chunk of the file: copy the data to a text file, load the sample data into HDFS, write the jobs against the sample data, execute them, and store the results. If a job executes successfully on the sample data, execute it on the main dataset with the same parameters.
  • 72. 8. SCREENS. Violation codes; violations made
  • 73. Inspections made, area-wise; violation counts from each restaurant
  • 74. Types of cuisines inspected; cuisines with more inspections
  • 75. Critical and non-critical issues
  • 76. 9. CONCLUSIONS. Hadoop is a trending technology in the market. Hadoop solves the big-data problem effectively and efficiently, and, more importantly, it can analyze any kind of data. Analysis based on Hadoop requires very little time, which reduces production time and directly benefits the economics of the organization. Analyzing the dataset with Apache Hadoop overcomes the issues caused by traditional RDBMSs and the master-slave architecture of servers. In this project we analyze the Hotel Inspection dataset using Hadoop; the analysis covers the total number of hotels, their violations, and the violation descriptions.
  • 77. 10. REFERENCES
    Hadoop: https://hadoop.apache.org/
    Java: http://www.oracle.com/technetwork/java/javase/downloads/
    Hive: https://hive.apache.org/
    Linux: http://www.ubuntu.com/