Big data

  3. Table of Contents 1. Traditional Approach 2. The Beginning 3. What is Big Data 4. Characteristics of Big Data 5. Why Big Data 6. Big Data Analytics 7. Big Players 8. Hadoop as an Example 9. Components of Hadoop 10. References
  4. The Beginning…  Big data burst upon the scene in the first decade of the 21st century.  The first organizations to embrace it were online and startup firms.  Firms like Google, eBay, LinkedIn and Facebook were built around big data from the beginning.  Big Data may well be the Next Big Thing in the IT world.  Like many new information technologies, big data can bring about dramatic cost reductions, substantial improvements in the time required to perform a computing task, and new product and service offerings.
  5. Traditional Approach  In this approach, an enterprise used to have a computer to store and process big data.  Data was stored in an RDBMS such as Oracle Database, MS SQL Server or DB2.  Sophisticated software was written to interact with the database, process the required data and present it to the users.  This approach works well when the volume of data is small enough to be accommodated by standard database servers.
  6. What is Big Data  ‘Big Data’ is similar to ‘small data’, but bigger in size.  Big Data refers to technologies and initiatives that involve data that is too diverse, fast-changing or massive for conventional technologies, skills and infrastructure to address efficiently.  Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques.
  7. Characteristics of Big Data (4 V’s)
  8. Volume  Big data implies enormous volumes of data.  Big Data requires processing high volumes of low-density data, that is, data of unknown value, such as Twitter data feeds, clicks on a web page, network traffic, sensor-enabled equipment capturing data at very high rates, and many more.  Today, Facebook ingests 500 terabytes of new data every day.  A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.  Every two days we create as much data as we did from the beginning of time until 2003.
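To put the Facebook figure above in perspective, the short calculation below converts 500 terabytes per day into a sustained ingest rate. It is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 10^12 bytes) and a perfectly even load across the day, neither of which the slide states.

```python
# Assumption: decimal units, 1 TB = 10**12 bytes and 1 GB = 10**9 bytes.
BYTES_PER_TB = 10**12
BYTES_PER_GB = 10**9
SECONDS_PER_DAY = 24 * 60 * 60

daily_ingest_tb = 500  # figure quoted on the slide

# Average rate if the data arrived evenly over 24 hours.
bytes_per_second = daily_ingest_tb * BYTES_PER_TB / SECONDS_PER_DAY
gb_per_second = bytes_per_second / BYTES_PER_GB

print(f"{gb_per_second:.2f} GB/s on average")
```

Under those assumptions the average works out to a little under 6 GB every second, around the clock.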
  9. Velocity  Velocity refers to the speed at which new data is generated and the speed at which data moves around.  Big data technology now allows us to analyse the data while it is being generated, without ever putting it into databases.  Machine-to-machine processes exchange data between billions of devices.  Infrastructure and sensors generate massive log data in real time.  Online gaming systems support millions of concurrent users, each producing multiple inputs per second.
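The idea of analysing data while it is being generated, without storing it first, can be illustrated with a running average that is updated one event at a time. This is a minimal single-process sketch, not a real streaming engine; `RunningStats`, `sensor_readings` and `summarize` are hypothetical names, and the readings are made-up sample values.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class RunningStats:
    """Count and mean of the values seen so far."""
    count: int = 0
    mean: float = 0.0

    def update(self, value: float) -> None:
        self.count += 1
        # Incremental mean: earlier values never need to be kept.
        self.mean += (value - self.mean) / self.count


def sensor_readings() -> Iterator[float]:
    """Hypothetical stand-in for a live feed of sensor values."""
    yield from [20.0, 22.0, 21.0, 25.0]


def summarize(stream: Iterable[float]) -> RunningStats:
    """Consume the stream once, holding only the summary in memory."""
    stats = RunningStats()
    for value in stream:
        stats.update(value)
    return stats


stats = summarize(sensor_readings())
print(stats.count, stats.mean)
```

Memory use stays constant no matter how many events arrive, which is what makes this style of processing practical for high-velocity data.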
  10. Variety  Variety refers to the many sources and types of data, both structured and unstructured.  Traditional database systems were designed to address smaller volumes of structured data, fewer updates and a predictable, consistent data structure.  Now data comes in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. This variety of unstructured data creates problems for storing, mining and analyzing data.  The real world has data in many different formats, and that is the challenge we need to overcome with Big Data.
  11. Veracity  Veracity refers to the messiness or trustworthiness of the data.  With many forms of big data, quality and accuracy are less controllable, for example Twitter posts with hashtags, abbreviations, typos and colloquial speech.  Big data and analytics technology now allow us to work with these types of data. The volumes often make up for the lack of quality or accuracy.
  12. Sources of Big Data  Today, organizations are using, sharing and storing more information in varying formats, including:  E-mail and instant messaging  Social media channels  Video and audio files  This unstructured data adds up to as much as 85% of the information that businesses store. The purpose of Big Data analytics is to extract high value from this data to enable innovation and competitive gain.
  13. Big Data Analytics  Big data is critical to our lives, and it is emerging as one of the most important technologies in the modern world.  Using the information kept on social networking sites like Facebook, marketing agencies are learning about the response to their campaigns, promotions and other advertising media.  By analyzing data such as the preferences and product perception of their consumers, product companies and retail organizations are planning their production.  Using data on the previous medical history of patients, hospitals are providing better and quicker service.
  14. Big Players
  15. Hadoop  Hadoop is an open-source framework that allows users to store and process big data in a distributed environment across clusters of computers using simple programming models.  It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.  Doug Cutting took the solution provided by Google and started an open-source project called Hadoop in 2005.  Operates on unstructured and structured data.  A large and active ecosystem.  Open source under the Apache License.
  16. Hadoop Distributed File System  Data is organized into files and directories.  Files are divided into blocks, distributed across nodes.  Blocks are replicated to handle failure.  Reliable, redundant, distributed file system optimized for large files.
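The block-and-replica idea described above can be sketched in a few lines of Python. This is a toy model, not the HDFS API: `split_into_blocks`, `place_replicas` and `readable_after_failure` are hypothetical helpers, the node names are invented, and the round-robin placement is an assumption made for simplicity (the real HDFS placement policy is more sophisticated, for example it takes racks into account).

```python
from typing import Dict, List, Set


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement: Dict[int, List[str]],
                           failed: Set[str]) -> bool:
    """A file stays readable if every block has a replica on a live node."""
    return all(any(node not in failed for node in replicas)
               for replicas in placement.values())


blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks),
                           ["node1", "node2", "node3", "node4"],
                           replication=3)
print(len(blocks), placement)
print(readable_after_failure(placement, {"node1", "node2"}))
```

With three replicas per block, the toy file survives the loss of any two nodes, because no block has all of its copies on those two machines.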
  17. MapReduce  In classic (Hadoop 1.x) MapReduce, the framework consists of a single JobTracker and several TaskTrackers in a cluster.  The JobTracker is responsible for resource management, tracking resource consumption and availability, and scheduling the component tasks of each job onto the data nodes.  The TaskTrackers execute the tasks as directed by the JobTracker and periodically report task status back to it.
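The tasks that the JobTracker schedules follow the map and reduce programming model. The word count below is a minimal single-process sketch of that model, written in plain Python rather than against the Hadoop API; `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical names, and in a real cluster the map and reduce calls would run as separate tasks on different nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is big", "data moves fast"])
print(counts)
```

Because each map call sees only one line and each reduce call sees only one key, the work can be split across many machines without the functions needing to know about each other.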
