
All About Big Data



  1. 1. All About Big Data, by Sai Venkatesh Attaluri, Head – BD & Big Data Analytics, Netxcell Limited
  2. 2. Big Data - Definition • Big data is a collection of data sets so large and complex that they become difficult to process using on-hand database management tools. The challenges include capture, curation, storage, search, sharing, analysis, and visualization. The trend toward larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions." (Wikipedia) • “Any fool can make things bigger, more complex, and more violent. It takes a touch of genius, and a lot of courage, to move in the opposite direction.” - Albert Einstein
  3. 3. Simplifying the Definition • Big data refers to data that is too big to fit on a single server, too unstructured to fit into a row-and-column database, or too continuously flowing to fit into a static data warehouse. (Thomas H. Davenport) • Put another way, big data is the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored due to the limitations of traditional data management technologies.
  4. 4. About Big Data • Every second of every day, businesses generate more data. Researchers at IDC estimate that by the end of 2013, the amount of stored data will exceed 4 zettabytes, or 4 billion terabytes. • All of that big data represents a big opportunity for organizations. • Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. • In simplest terms, "Big Data" refers to the tools, processes and procedures that allow an organization to create, manipulate, and manage extremely large data sets. This means terabytes, petabytes, or even larger collections of data such as zettabytes.
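The "4 zettabytes, or 4 billion terabytes" figure on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, which the slide does not state explicitly; the function name is illustrative only.

```python
# Decimal (SI) storage units, expressed in bytes (an assumption; the slide
# does not say whether decimal or binary units are meant).
TERABYTE = 10**12
PETABYTE = 10**15
ZETTABYTE = 10**21


def zettabytes_to_terabytes(zb: int) -> int:
    """Convert zettabytes to terabytes using power-of-ten units."""
    return zb * ZETTABYTE // TERABYTE


stored_tb = zettabytes_to_terabytes(4)
print(f"{stored_tb:,} TB")  # 4,000,000,000 TB, i.e. 4 billion terabytes
```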
  5. 5. How Does Big Data Differ from Traditional Transactional Systems?
  6. 6. TTS vs. Big Data • Traditional Transaction Systems (TTS): TTS are designed and implemented to track information whose format and use are known ahead of time. Data that resides within the fixed confines of a record or file is known as structured data. Structured data, even in large volumes, can be entered, stored, queried, and analyzed in a simple and straightforward manner, so this type of data is best served by a traditional transaction database. TTS do not support unstructured data. • Big Data: Big Data systems are deployed when the questions to be asked and the data formats to be examined aren't known ahead of time. Data that comes from a variety of sources, such as emails, text documents, videos, photos, audio files, and social media posts, is referred to as unstructured data.
  7. 7. TTS vs. Big Data (continued) • Traditional Transaction Systems: Companies whose data workloads are constant and predictable will be better served by a traditional database. In cases where organizations rely on time-sensitive data analysis, a traditional database is also the better fit. That's because shorter time-to-insight isn't about analyzing large unstructured datasets. It's about analyzing smaller data sets in real or near-real time, which is what traditional databases are well equipped to do. • Big Data: Companies challenged by increasing data demands will want to take advantage of Big Data's scalable infrastructure. Scalability allows servers to be added on demand to accommodate growing workloads. Big Data is designed for large distributed data processing that addresses every file in the database, and that type of processing takes time. For tasks where fast performance isn't critical, such as running end-of-day reports to review daily transactions, scanning historical data, and performing analytics where a slower time-to-insight is acceptable, Big Data is ideal.
  8. 8. Misconception of Big Data • Unfortunately, extracting valuable information from big data isn't as easy as it sounds. Big data amplifies any existing problems in your infrastructure, processes or even the data itself. • It is also misrepresented by the media, making it difficult for organizations to determine whether investing in Big Data will bring the expected results, improve efficiency, and yield better products and services.
  9. 9. The Promise of Big Data • Companies recognize that big data contains valuable information that can help them: obtain actionable insights, understand product performance, deepen customer relationships, understand customer behavior, prevent threats and fraud, and identify new revenue opportunities. • 80-90% of data produced today is unstructured.
  10. 10. Evolution of Big Data
  11. 11. The 4 V's of Big Data: Volume, Variety, Velocity, Veracity
  12. 12. Definition of the V's • To make the most of the information in their systems, companies must successfully deal with the 4 V's that distinguish big data: 1. Variety, 2. Volume, 3. Velocity, and 4. Veracity. • The first three (variety, volume and velocity) define big data: when you have a large volume of data coming in from a wide variety of applications and formats, and it's moving and changing at a rapid velocity, that's when you know you have big data.
  13. 13. Definition of the V's • Volume: Big Data tools and services are designed to manage extremely large and growing sources of data that require capabilities beyond those found in traditional database engines. Ex: extremely large volumes of data. • Variety: Big Data tools manage an extensive variety of data as well. This means having the capability to manage structured data, very much like the capabilities offered by a database engine. They go beyond supporting structured data to working with both non-structured data, such as documents, spreadsheets, presentation decks and the like, and log data coming from operating systems, database engines, application frameworks, retail point-of-sale systems, mobile communications systems and more. Ex: structured data, unstructured data, images, documents, etc.
  14. 14. Definition of the V's • Velocity: The ability to gather, analyze and report on rapidly changing sets of data. In some cases, this means having the capability to manage data that changes so rapidly that the updated data cannot be saved to traditional disk drives before it is changed again. Simple term: quickly moving data. • Veracity: Veracity is a measure of the accuracy and trustworthiness of your data. Veracity is a goal, one that the variety, volume and velocity of big data make harder to achieve. Simple term: trust and integrity.
  15. 15. Lots of Data • 2.5 quintillion bytes of data are generated every day! (A quintillion is 10^18.) • Data come from many quarters: social media sites, sensors, digital photos, business transactions, location-based data. • Columns: Style of Data | Source of Data | Industry Affected | Function Affected • Large Volume | Online | Financial Services | Marketing • Unstructured | Video | Health Care | Supply Chain • Continuous Flow | Sensor | Manufacturing | Human Resources • Multiple Formats | Genomic | Travel / Transport | Finance
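To put the "2.5 quintillion bytes every day" figure in perspective, it can be converted into a daily total in exabytes and an average rate per second. This is a back-of-the-envelope sketch, assuming decimal units and a perfectly even flow over 24 hours; neither assumption comes from the slide.

```python
QUINTILLION = 10**18
SECONDS_PER_DAY = 24 * 60 * 60

# 2.5 quintillion bytes, kept as an exact integer.
bytes_per_day = 25 * QUINTILLION // 10

# One exabyte is 10**18 bytes, so the daily total is 2.5 exabytes.
exabytes_per_day = bytes_per_day / 10**18

# Average rate if the data arrived evenly through the day (an assumption).
terabytes_per_second = bytes_per_day / SECONDS_PER_DAY / 10**12

print(f"{exabytes_per_day} EB per day")
print(f"about {terabytes_per_second:.1f} TB per second")
```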
  16. 16. The Need for Big Data • Aspects of the way in which users want to interact with their data: – Totality: Users have an increased desire to process and analyze all available data. – Exploration: Users apply analytic approaches where the schema is defined in response to the nature of the query. – Frequency: Users have a desire to increase the rate of analysis in order to generate more accurate and timely business intelligence. – Dependency: Users need to balance investment in existing technologies and skills with the adoption of new techniques. • So, in a nutshell, Big Data is about better analytics.
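The "Exploration" point, where the schema is defined in response to the query, is often called schema-on-read. The sketch below illustrates the idea with the Python standard library only; the event records, field names and the `query` helper are all hypothetical and are not taken from the slides.

```python
import json

# Raw events are stored as-is; the records do not share a fixed layout.
RAW_EVENTS = [
    '{"user": "a", "action": "view", "ms": 120}',
    '{"user": "b", "action": "buy", "amount": 30.0}',
    '{"user": "a", "action": "buy", "amount": 12.5, "coupon": "X1"}',
]


def query(raw_lines, fields, where):
    """Parse each raw record at query time and keep only the requested fields."""
    for line in raw_lines:
        record = json.loads(line)
        if where(record):
            # A field missing from a record becomes None instead of
            # causing the record to be rejected at load time.
            yield {name: record.get(name) for name in fields}


# The "schema" (user, amount) is chosen by this particular question.
purchases = list(
    query(RAW_EVENTS, ["user", "amount"], lambda r: r.get("action") == "buy")
)
print(purchases)
```

A different question, for example average page-view time, would pick a different set of fields from the same raw records without reloading them.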
  17. 17. Terminology for Using and Analyzing Data • Columns: Term | Time Frame | Specific Meaning • Decision Support | 1970-1985 | Use of data analysis to support decision making • Executive Support | 1980-1990 | Focus on data analysis for decisions by senior executives • Online Analytical Processing (OLAP) | 1990-2000 | Software for analyzing multidimensional data tables • Business Intelligence | 1989-2005 | Tools to support data-driven decisions, with emphasis on reporting • Analytics | 2006-2010 | Focus on statistical and mathematical analysis for decisions • Big Data | 2010-Present & Next 10 Years | Focus on very large, unstructured, fast-moving data
  18. 18. Technology Shift • Your company can take advantage of the opportunities available in big data only when you have processes and solutions that can handle all 4 V's. • Many of the previous attempts to gather information from rapidly growing, rapidly changing and broad types of data have been based upon special-purpose, complex and highly expensive computing systems. Today's Big Data solutions are built upon a different foundation. • Rather than relying on a very powerful, dedicated database system, clusters of inexpensive, powerful, industry-standard (x86) systems are harnessed to attack these very large problems. • The clustered approach uses commodity systems, storage, and memory. It also adds the benefit of being more reliable: the failure of any single system in the cluster will not stop processing.
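The claim that the failure of any single system will not stop processing can be illustrated with a toy model. The sketch below assumes the cluster keeps several copies (replicas) of each block on different machines; the slide does not name the mechanism, and the `Cluster` class, node names and placement rule are hypothetical simplifications rather than how any real product places data.

```python
class Cluster:
    """Toy cluster: every block is copied to `replicas` different nodes."""

    def __init__(self, node_names, replicas=3):
        if replicas > len(node_names):
            raise ValueError("need at least as many nodes as replicas")
        self._names = list(node_names)
        self.nodes = {name: {} for name in self._names}  # node -> {block_id: data}
        self.alive = set(self._names)
        self.replicas = replicas

    def put(self, block_id, data):
        # Place copies on consecutive nodes, starting at a position derived
        # from the block id (a deliberately simple placement rule).
        start = sum(block_id.encode()) % len(self._names)
        for i in range(self.replicas):
            name = self._names[(start + i) % len(self._names)]
            self.nodes[name][block_id] = data

    def fail(self, name):
        """Mark a node as down; its copies become unreachable."""
        self.alive.discard(name)

    def get(self, block_id):
        # Any surviving node that holds a copy can serve the read.
        for name in self._names:
            if name in self.alive and block_id in self.nodes[name]:
                return self.nodes[name][block_id]
        raise LookupError(f"all replicas of {block_id!r} are unavailable")


cluster = Cluster(["node-1", "node-2", "node-3", "node-4"], replicas=2)
cluster.put("block-0001", b"call records, 2013-06")
holders = [n for n, blocks in cluster.nodes.items() if "block-0001" in blocks]
cluster.fail(holders[0])           # one machine dies
data = cluster.get("block-0001")   # still served by the surviving replica
```

With two replicas the data survives one failure; losing both holders raises `LookupError`, which is why the replica count is a tuning decision.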
  19. 19. Gartner's Visualization on Big Data
  20. 20. Big Data – Conundrum • Problems: – Although there is a massive spike in available data, the percentage of the data that an enterprise can understand is on the decline. – The data that the enterprise is trying to understand is saturated with both useful signals and lots of noise.
  21. 21. Benefits of Big Data
  22. 22. Big Data Platform Manifesto
  23. 23. High Level Architecture of Recognizer (diagram labels: Big Data Platform on Hadoop, FM, APIs to 3rd Party APIs, Enterprise, Recommendation Engine, OBD, IVR, Data, PCA, Greybox, Others, Historical Data, Business Intelligence, Churn Prediction, Predictive Analysis)
  24. 24. Medium Level Architecture of Recognizer
  25. 25. What is Hadoop? • Hadoop is a free software framework developed by the Apache Software Foundation to support distributed processing of data. Hadoop was initially developed in the Java™ language, but today many other languages are used for scripting Hadoop. Hadoop is used as the core platform to structure Big Data and helps in performing data analytics. • This distributed processing framework is designed to harness together the power of many computers, each having its own processing and storage, and to provide the capability to quickly process large, distributed data sets.
  26. 26. Introduction to Hadoop • Hadoop Distributed File System (HDFS): The Hadoop Distributed File System is designed to support large data sets made up of rapidly changing structured and non-structured data. • MapReduce: MapReduce is a tool designed to allow analysts and developers to rapidly sift through massive amounts of data to examine only those data items that match a specified set of criteria.
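The MapReduce idea described on this slide, sifting through data and keeping only the items that match a criterion, can be sketched in plain Python. This is a single-process illustration of the map, shuffle and reduce steps, not Hadoop's actual API; the log lines and function names are made up for the example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the (key, value) pairs it emits."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def error_mapper(line):
    # Emit a pair only for lines that match the criterion (level == ERROR).
    level, service, _message = line.split(" ", 2)
    if level == "ERROR":
        yield service, 1


def count_reducer(_key, values):
    return sum(values)


LOG_LINES = [
    "INFO billing started",
    "ERROR billing timeout contacting gateway",
    "ERROR ivr dropped call",
    "ERROR billing card declined",
]

errors_per_service = reduce_phase(
    shuffle(map_phase(LOG_LINES, error_mapper)), count_reducer
)
print(errors_per_service)  # {'billing': 2, 'ivr': 1}
```

In a real cluster the map calls run on many machines in parallel, close to where the data is stored, and the shuffle moves pairs across the network; the structure of the computation is the same.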
  27. 27. Hadoop Architecture & Components (diagram labels: Sqoop, Flume, ZooKeeper, Oozie, Pig, Mahout, R Connectors, Hive, MapReduce, HDFS, HBase, MongoDB, Cloudera, Hortonworks, Kafka, YARN, Cassandra, VMware Player, SQL, NoSQL, MetaStore, Scala, Query Compiler, Hadoop Cluster, Execution Engine, Ambari)
  28. 28. Let Us See Hadoop Components • Apache Hadoop Architecture: Hadoop is a master and slave architecture that includes the NameNode as the master and the DataNode as the slave. • Apache Sqoop: Apache Sqoop is a command-line tool for transferring data between relational databases and Hadoop. Sqoop, similar to other ETL tools, uses schema metadata to infer data types and ensure type-safe data handling when the data moves from the source to Hadoop. • Apache HBase: Apache HBase is a column-oriented key/value data store built to run on top of the Hadoop Distributed File System (HDFS). HBase is designed to support high table-update rates and to scale out horizontally in distributed compute clusters. Its focus on scale enables it to support very large database tables. • Apache ZooKeeper: Apache ZooKeeper is an open source file application program interface (API) that allows distributed processes in large systems to synchronize with each other so that all clients making requests receive consistent data.
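To make "column-oriented key/value data store" more concrete, the sketch below models a table as a sparse map from row key to column to versioned values. It is an in-memory illustration of the data model only, assuming columns named as "family:qualifier"; the `ToyTable` class and its methods are hypothetical and are not the HBase client API.

```python
from collections import defaultdict


class ToyTable:
    """Sparse table: row key -> "family:qualifier" -> list of (version, value)."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(list))
        self._clock = 0  # stand-in for a timestamp

    def put(self, row_key, column, value):
        # Updates append a new version instead of overwriting in place.
        self._clock += 1
        self._rows[row_key][column].append((self._clock, value))

    def get(self, row_key, column):
        """Return the newest value of one cell, or None if the cell is empty."""
        versions = self._rows.get(row_key, {}).get(column)
        return versions[-1][1] if versions else None

    def scan(self, start, stop):
        """Yield rows with start <= key < stop, in sorted key order."""
        for row_key in sorted(self._rows):
            if start <= row_key < stop:
                latest = {c: v[-1][1] for c, v in self._rows[row_key].items()}
                yield row_key, latest


table = ToyTable()
table.put("user#0001", "info:name", "Asha")
table.put("user#0001", "usage:minutes", 310)
table.put("user#0002", "info:name", "Ravi")   # this row has no usage column
table.put("user#0001", "usage:minutes", 325)  # newer version of the same cell

latest_minutes = table.get("user#0001", "usage:minutes")
user_rows = list(table.scan("user#", "user$"))
```

Rows need not share the same columns, which is what lets such a store hold very wide, sparse tables.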
  29. 29. Let Us See Hadoop Components (continued) • Apache Hive: Hive is an open-source data warehousing system used to analyze large datasets stored in Hadoop files. It has three key functions: data summarization, query, and analysis. • HDFS: The Hadoop Distributed File System (HDFS) is a distributed file system that shares some of the features of other distributed file systems. It is used for storing and retrieving unstructured data. • MapReduce: MapReduce is a core component of Hadoop and is responsible for processing jobs in distributed mode. • Pig: Apache Pig is a platform for analyzing large datasets that includes a high-level language for expressing data analysis programs. Pig is one of the components of the Hadoop ecosystem.
  30. 30. Let Us See Hadoop Components (continued) • NoSQL (Not Only SQL database): A NoSQL database, also called Not Only SQL, is an approach to data management and database design that's useful for very large sets of distributed data. NoSQL is especially useful when an enterprise needs to access and analyze massive amounts of unstructured data or data that's stored remotely on multiple virtual servers in the cloud. • MongoDB: The MongoDB database management system is designed for running modern applications that rely on structured and unstructured data and support rapidly changing data. • Apache Cassandra: Apache Cassandra is a free, open-source, distributed storage system for managing large amounts of structured data. It differs from traditional relational database management systems in some significant ways. Cassandra is designed to scale to a very large size across many commodity servers, with no single point of failure, and provides a simple schema-optional data model designed to allow maximum power and performance at scale. • Apache Hadoop YARN (Yet Another Resource Negotiator): Apache Hadoop YARN is a cluster management technology. YARN is one of the key features in the second-generation Hadoop 2 version of the Apache Software Foundation's open
  31. 31. Let Us See Hadoop Components (continued) • Oozie: Oozie is a workflow scheduler system to manage Hadoop jobs. It is a server-based workflow engine specialized in running workflow jobs with actions that run Hadoop MapReduce and Pig jobs. Oozie is implemented as a Java web application that runs in a Java servlet container. • Apache Ambari: The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs. • Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. • Cloudera Impala: Cloudera Impala is a query engine that runs on Apache Hadoop. Impala brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation.
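A workflow scheduler such as Oozie has to run actions in an order that respects their dependencies. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9 or later); the action names are hypothetical, and real Oozie workflows are defined in XML rather than in Python, so this illustrates only the ordering problem.

```python
from graphlib import TopologicalSorter

# Each action maps to the set of actions that must finish before it starts.
WORKFLOW = {
    "import_logs": set(),
    "clean_logs": {"import_logs"},
    "mapreduce_sessions": {"clean_logs"},
    "pig_summary": {"clean_logs"},
    "publish_report": {"mapreduce_sessions", "pig_summary"},
}


def run_workflow(workflow, run_action):
    """Run every action after all of its prerequisites; return the order used."""
    order = []
    for action in TopologicalSorter(workflow).static_order():
        run_action(action)
        order.append(action)
    return order


executed = run_workflow(WORKFLOW, lambda name: print("running", name))
```

The two middle actions do not depend on each other, so a real scheduler could run them in parallel; `static_order()` simply picks one valid sequence.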
  32. 32. Let Us See Hadoop Components (continued) • Apache Spark: Apache Spark is an open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers. Apache Spark can process data from a variety of data repositories, including the Hadoop Distributed File System (HDFS), NoSQL databases and relational data stores such as Apache Hive. • Scala (Scalable Language): Scala is a programming language that mixes object-oriented methods with functional programming capabilities that support a more concise style of programming than other general-purpose languages like Java, reducing the amount of code developers have to write. • Apache Kafka: Apache Kafka is a distributed publish-subscribe messaging system designed to replace traditional message brokers. Originally created and developed by LinkedIn, then open sourced in 2011, Kafka is currently developed by the Apache Software Foundation to exploit new data infrastructures made possible by massively parallel commodity clusters. • Jaspersoft: Jaspersoft provides the most flexible, cost-effective, and widely-deployed business intelligence software in the world, enabling better decision making through highly interactive Web-based reports, dashboards, and analysis.
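The publish-subscribe model behind Kafka can be sketched as an append-only log per topic, with each consumer keeping its own read position. The sketch below is a single-process illustration under that assumption; `ToyBroker`, its methods and the topic name are hypothetical and do not correspond to the Kafka client API.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory publish-subscribe: one append-only list of messages per topic."""

    def __init__(self):
        self._topics = defaultdict(list)
        self._offsets = defaultdict(int)  # (consumer, topic) -> next index to read

    def publish(self, topic, message):
        """Append a message and return its position (offset) in the topic."""
        self._topics[topic].append(message)
        return len(self._topics[topic]) - 1

    def poll(self, consumer, topic, max_messages=10):
        """Return the next unread messages for this consumer and advance its offset."""
        start = self._offsets[(consumer, topic)]
        batch = self._topics[topic][start:start + max_messages]
        self._offsets[(consumer, topic)] = start + len(batch)
        return batch


broker = ToyBroker()
for event in ["call-start", "call-end", "sms-sent"]:
    broker.publish("telecom-events", event)

first = broker.poll("billing", "telecom-events", max_messages=2)
rest = broker.poll("billing", "telecom-events")
analytics_view = broker.poll("analytics", "telecom-events")
```

Because messages stay in the log after being read, a second subscriber ("analytics" here) sees the full history independently of how far "billing" has progressed.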
  33. 33. Let Us See Hadoop Components (continued) • Hadoop Cluster: A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment. • Distributed File System: A distributed file system is a client/server-based application that allows clients to access and process data stored on the server as if it were on their own computer. When a user accesses a file on the server, the server sends the user a copy of the file, which is cached on the user's computer while the data is being processed and is then returned to the server. • Catastrophic Failure: Catastrophic failure is a complete, sudden, often unexpected breakdown in a machine, electronic system, computer or network. Such a breakdown may occur as a result of a hardware event such as a disk drive crash, memory chip failure or surge on the power line. Catastrophic failure can also be caused by software conflicts or malware. Sometimes a single component in a critical location fails, resulting in downtime for the entire system. • Python: Python is an interpreted, object-oriented programming language similar to Perl that has gained popularity because of its clear syntax and readability. Python is said to be relatively easy to learn and portable, meaning its statements can be interpreted on a number of operating systems.
  34. 34. Hadoop Architecture & Components
  35. 35. An Introduction to R: The R Environment • R is an integrated suite of software facilities for data analysis and graphics. Among other things it has: • an effective data handling and storage facility, • a suite of operators for calculations on arrays, in particular matrices, • a large, coherent, integrated collection of intermediate tools for data analysis, • a set of statistical methodologies and models, • graphical facilities for data analysis and display, either directly at the computer or on hardcopy, and • a well developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions, and input and output facilities.
  36. 36. Thank you