Big Data and Hadoop


Big Data is a problem emerging in big organizations where traditional database system are used for data storage and analysis.

Published in: Software, Technology


Author: Neeraj Negi

What is Big Data?

Big Data is a term for data sets so large and complex that they are difficult to store, access, or process with traditional database applications or tools; they exceed the processing capacity of conventional database systems. Big Data is too big (petabytes or exabytes), it moves too fast, or it does not fit the structures of conventional database architectures. The data is typically loosely structured and often incomplete. Specifically, Big Data concerns the creation, retrieval, manipulation, and analysis of data that is exceptional in terms of volume, velocity, and variety:

1. Volume – Facebook ingests more than 500 TB of data in a single day, Google receives 2 million search queries per minute, and the nuclear physics experiments at the Large Hadron Collider at CERN generate some 40 terabytes of data every second. Volume presents the most immediate challenge to traditional IT infrastructure: it demands scalable storage and a distributed approach to querying.

2. Velocity – The frequency at which data is generated and must be processed. Many corporations (Google, Yahoo, Facebook, Microsoft, etc.) capture clickstreams from their websites and use that streaming data to make purchase recommendations, in the form of ads, to web visitors. The analysis that accompanies streaming data must keep pace: it has to make sense of the stream, produce results, and trigger actions, all in real time.

3. Variety – Big Data is not just strings and numbers. It also includes 3D models, audio, video, pictures, log files, GPS data, and more. Conventional databases were designed to address smaller volumes of structured data with predictable and consistent data structures.

With a growing number of users, traditional RDBMSs have become a liability for organizations, making it harder to serve those users. Every enterprise needs to understand Big Data and how it affects it. Standard tools and procedures are not designed to analyze or search massive data sets; Big Data requires exceptional technology to process large amounts of data in an acceptable amount of time. Suitable technologies include massively parallel processing (MPP) databases, search-based applications, data-mining grids, distributed file systems and databases, and cloud-based infrastructure.
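The velocity point above, reacting to a click stream in real time, can be sketched with a rolling counter in plain Python. This is illustrative only: the product names, window size, and threshold are made up, and a production system would run such logic continuously against a live stream rather than a list.

```python
from collections import Counter, deque

def recommend_ads(clicks, window=5, threshold=3):
    """Emit an ad recommendation whenever a product reaches
    `threshold` clicks within the last `window` click events."""
    recent = deque(maxlen=window)   # sliding window of recent clicks
    counts = Counter()
    ads = []
    for product in clicks:
        if len(recent) == recent.maxlen:
            counts[recent[0]] -= 1  # oldest click is about to fall out of the window
        recent.append(product)
        counts[product] += 1
        if counts[product] >= threshold:
            ads.append(product)     # in a real system: serve an ad, in real time
    return ads

stream = ["shoes", "laptop", "shoes", "shoes", "laptop", "shoes"]
print(recommend_ads(stream))  # → ['shoes', 'shoes']
```

The key design point is that the decision is made per event as it arrives, not in a batch query after the fact, which is what distinguishes velocity from mere volume.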
Big Data Software

1. Hadoop (Apache Software Foundation) – An open-source software project that enables the distributed processing of large data sets across clusters of commodity servers. Hadoop makes it possible to run applications on systems with thousands of nodes handling thousands of terabytes of data. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer. The Hadoop framework is used by major players including Google, Yahoo, and IBM, largely for applications involving search engines and advertising. Linux is the preferred operating system, though Hadoop also runs on Windows, BSD, and OS X.

2. MongoDB (MongoDB, Inc.) – A document-oriented database system classified as a NoSQL* database. Where MySQL stores rows in tables queried with SQL, MongoDB stores documents encoded as BSON (Binary JSON**). It is a handy tool for smaller database requirements and supports operations such as indexing and ad-hoc queries more easily and efficiently than traditional RDBMSs.

*A NoSQL ("Not Only SQL") database provides a mechanism for the storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.

**JSON is an open standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs.
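Hadoop's programming model, MapReduce, can be illustrated without a cluster. The sketch below simulates the map, shuffle, and reduce phases of the canonical word-count job in plain Python; on a real Hadoop cluster the framework distributes these same phases across many nodes and restarts failed tasks automatically.

```python
from collections import defaultdict

def map_phase(line):
    # map: emit a (word, 1) pair for every word in one input line
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # shuffle: group all emitted values by key, as Hadoop does
    # between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # reduce: combine all counts for one word into a total
    return key, sum(values)

lines = ["big data and hadoop", "hadoop processes big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["hadoop"], counts["big"])  # → 2 2
```

Because each map call sees only one line and each reduce call only one key, both phases parallelize naturally, which is why the model scales to thousands of nodes.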
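MongoDB's document model is easiest to see by example. The sketch below uses plain Python and the standard json module, with made-up field names and values; a real deployment would store the equivalent BSON document through a driver such as pymongo. The point is that one self-contained document can hold nested, variably structured data that a relational schema would spread across several joined tables.

```python
import json

# one self-contained document, as MongoDB would store it (in BSON)
user = {
    "name": "Neeraj",
    "interests": ["big data", "hadoop"],            # array field: no join table needed
    "address": {"city": "Delhi", "country": "IN"},  # nested sub-document
}

# JSON is the human-readable form of the same structure
doc = json.dumps(user)
restored = json.loads(doc)
print(restored["address"]["city"])  # → Delhi
```

Documents in the same collection need not share the same fields, which is what makes this model a good fit for the loosely structured data described earlier.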
3. Splunk (Splunk Inc.) – An advanced IT search tool that offers users, administrators, and developers the ability to instantly search all data generated by applications, servers, and network devices across the IT infrastructure. Splunk captures and correlates this data in a repository and generates reports, graphs, alerts, and visualizations from it, turning machine data into valuable insights no matter what business you're in.