Big Data and Hadoop

Author:-
Neeraj Negi
What is Big Data?
Big data is a term for data sets so large and complex that they become difficult to
store, access or process with traditional database applications or tools. Such data
exceeds the processing capacity of conventional database systems: it is too big
(petabytes or exabytes), it moves too fast, or it doesn’t fit the structures of
conventional database architectures. The data is typically loosely structured and
often incomplete or hard to access.
Specifically, Big Data concerns the creation, retrieval, manipulation and analysis of
data that is exceptional in terms of volume, velocity and variety:-
1. Volume – Facebook ingests more than 500 TB of data in one day. Google
receives 2 million search queries per minute. Nuclear physics experiments at the
Large Hadron Collider at CERN generate 40 terabytes of data every second.
This volume presents the most immediate challenge to traditional IT structures. It
demands scalable storage and a distributed approach to querying.
2. Velocity – This is the rate at which data is generated and must be processed.
Many multinational companies and organizations (Google, Yahoo, Facebook,
Microsoft, etc.) capture clickstream data from their websites and use that
streaming data to make purchase recommendations, in the form of ads, to web
visitors. The analysis has to keep pace with the stream: it must make sense of the
data, produce results and trigger actions, all in real time.
3. Variety – Big data is not just strings or numbers. It also includes 3D
data, audio, video, pictures, log files, GPS data, etc. Conventional databases were
designed to handle smaller volumes of structured data with predictable and
consistent data structures. As the number of users grows, a traditional RDBMS
can become a liability for an organization, making it harder to serve those users.
Every enterprise needs to understand Big Data and how it affects them. Standard
tools and procedures are not designed to search and analyze massive datasets.
Big Data requires specialized technology to process large amounts of data
efficiently within an acceptable time. Suitable technologies include massively parallel
processing databases, search-based applications, data mining grids, distributed file
systems and databases, and cloud-based infrastructure.

Big Data Software:-
1. Hadoop - Apache Foundation.
It is an open source software project that enables the distributed processing of
large data sets across clusters of commodity servers. Hadoop makes it possible to
run applications on systems with thousands of nodes involving thousands of
terabytes. Rather than relying on high-end hardware, the resiliency of these clusters
comes from the software’s ability to detect and handle failures at the application
layer.
The Hadoop framework is used by major players including Google, Yahoo and IBM,
largely for applications involving search engines and advertising. The preferred
operating system is Linux, but Hadoop can also run on Windows, BSD and
OS X.
2. MongoDB - MongoDB, Inc.
It is a document-oriented database system classified as a NoSQL* database. MySQL
is queried using SQL, while MongoDB stores and exchanges data as BSON (Binary
JSON**) documents.
It is a handy tool for smaller database requirements. MongoDB supports indexing,
and because related data can be embedded in a single document, many queries that
would need a join in a traditional RDBMS can be answered from one document.
*A NoSQL or Not Only SQL database provides a mechanism for storage and
retrieval of data that is modeled by means other than the tabular relations used in
relational databases.
**JSON is an open standard format that uses human-readable text to transmit data
objects consisting of attribute–value pairs.

3. Splunk - Splunk Inc.
Splunk is an advanced IT search tool that offers users, administrators, and
developers the ability to instantly search all data generated by applications, servers,
and network devices in the IT infrastructure. It generates reports, graphs, alerts and
visualizations from the data which it captures and correlates in a repository. Splunk
turns machine data into valuable insights no matter what business you're in.
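
The distributed processing described for Hadoop above is commonly organized as
MapReduce: each node maps its own chunk of data to key/value pairs, the framework
groups the pairs by key, and a reduce step combines each group. The sketch below
imitates those three phases for a word count on a single machine in plain Python. It
is an illustration only and does not use the Hadoop API; the function names and the
sample chunks are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text.

    On a real cluster each chunk would be handled by a different node.
    """
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(mapped_chunks):
    """Group the emitted values by key across all chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Combine the values of each key into a single total."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(chunks):
    """Run map, shuffle and reduce in sequence (hypothetical helper)."""
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle_phase(mapped))


# Two made-up chunks standing in for blocks of a large file.
chunks = ["big data is big", "hadoop processes big data"]
counts = word_count(chunks)
print(counts)
```

Because each call to map_phase depends only on its own chunk, the map work can be
spread over many machines, and a failed chunk can simply be re-run elsewhere.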
