  1. 1. 1. INTRODUCTION 1.1 About BigData: Big data is a buzzword, or catch-phrase, used to describe a volume of structured and unstructured data so large that it is difficult to process using traditional database and software techniques. In most enterprise scenarios the data is too big, moves too fast, or exceeds current processing capacity. While the term may seem to refer to the volume of data, that isn’t always the case. The term big data, especially when used by vendors, may also refer to the technology (including tools and processes) that an organization requires to handle large amounts of data and storage facilities. The term is believed to have originated with the web search companies, which had to query very large distributed aggregations of loosely structured data. Big data has become viable as cost-effective approaches have emerged to tame the volume, velocity and variability of massive data. Within this data lie valuable patterns and information, previously hidden because of the amount of work required to extract them. To leading corporations, such as Walmart or Google, this power has been in reach for some time, but at fantastic cost. Today’s commodity hardware, cloud architectures and open source software bring big data processing into the reach of the less well-resourced. Big data processing is eminently feasible even for small garage startups, which can cheaply rent server time in the cloud. 1.2 An Example of Big Data: An example of big data might be petabytes (1 petabyte = 1,024 terabytes) or exabytes (1 exabyte = 1,024 petabytes) of data consisting of billions to trillions of records about millions of people, all from different sources (e.g. Web, sales, customer contact center, social media, mobile data and so on). The data is typically loosely structured and is often incomplete and inaccessible. Organizations dealing with datasets of this size face difficulties in creating, manipulating, and managing them. 
Big data is particularly a problem in business analytics because standard tools and procedures are not designed to search and analyze massive datasets.
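The unit sizes quoted in section 1.2 can be checked with a few lines of Python. This is only an illustrative sketch: the names `UNITS`, `to_bytes` and `convert` are made up for this example, and it assumes the 1,024-based units used in the text.

```python
# Storage units as used in the text: each step up is a factor of 1,024.
# UNITS, to_bytes and convert are hypothetical names for this sketch.
UNITS = ["bytes", "kilobytes", "megabytes", "gigabytes",
         "terabytes", "petabytes", "exabytes"]


def to_bytes(value, unit):
    """Convert a quantity in the given unit to bytes (1,024-based)."""
    return value * 1024 ** UNITS.index(unit)


def convert(value, from_unit, to_unit):
    """Convert a quantity from one unit in UNITS to another."""
    return to_bytes(value, from_unit) / 1024 ** UNITS.index(to_unit)


print(convert(1, "petabytes", "terabytes"))  # 1024.0
print(convert(1, "exabytes", "petabytes"))   # 1024.0
```

On this scale one exabyte is 1,024 x 1,024 = 1,048,576 terabytes.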
  2. 2. 1.3 SIZE OF BIGDATA: Social networking sites need to process huge volumes of data on a daily basis. An example of such big data processed on a daily basis is given as follows:
  3. 3. 1.4 CHARACTERISTICS OF BIG DATA The characteristics of big data are as follows:
     • Volume. A typical PC might have had 10 gigabytes of storage in 2000. Today, Facebook ingests 500 terabytes of new data every day, and a Boeing 737 will generate 240 terabytes of flight data during a single flight across the US. The proliferation of smartphones and the data they create and consume, together with sensors embedded into everyday objects, will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.
     • Velocity. Click streams and ad impressions capture user behavior at millions of events per second; high-frequency stock trading algorithms reflect market changes within microseconds; machine-to-machine processes exchange data between billions of devices; infrastructure and sensors generate massive log data in real time; online gaming systems support millions of concurrent users, each producing multiple inputs per second.
     • Variety. Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media.
     Traditional database systems were designed to address smaller volumes of structured data, fewer updates, and a predictable, consistent data structure. They were also designed to operate on a single server, making increased capacity expensive and finite. As applications have evolved to serve large volumes of users, and as application development practices have become agile, the traditional use of the relational database has become a liability for many companies rather than an enabling factor in their business. Big Data databases, such as MongoDB, address these problems and provide companies with the means to create tremendous business value.
  4. 4. 2. BIGDATA ANALYTICS Big data analytics is the process of examining large amounts of data of a variety of types (big data) to uncover hidden patterns, unknown correlations and other useful information. Such information can provide competitive advantages over rival organizations and result in business benefits, such as more effective marketing and increased revenue. 2.1 GOAL OF BIGDATA ANALYTICS: The primary goal of big data analytics is to help companies make better business decisions by enabling data scientists and other users to analyze huge volumes of transaction data as well as other data sources that may be left untapped by conventional business intelligence (BI) programs. These other data sources may include Web server logs, social media activity reports, mobile-phone call detail records and information captured by sensors. Some people exclusively associate big data and big data analytics with unstructured data of that sort, but consulting firms like Gartner Inc. and Forrester Research Inc. also consider transactions and other structured data to be valid forms of big data. 2.2 TECHNOLOGIES ASSOCIATED WITH BIGDATA: Big data analytics can be done with the software tools commonly used as part of advanced analytics disciplines such as predictive analytics and data mining. But the unstructured data sources used for big data analytics may not fit in traditional data warehouses. Furthermore, traditional data warehouses may not be able to handle the processing demands posed by big data. As a result, a new class of big data technology has emerged and is being used in many big data analytics environments. The technologies associated with big data analytics include NoSQL databases, Hadoop and MapReduce. Hadoop, with MapReduce as its processing model, is an open source software framework that supports the processing of large data sets across clustered systems.
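To make the MapReduce idea mentioned in section 2.2 more concrete, the following is a minimal, single-process sketch of its three phases (map, shuffle, reduce) applied to a word count. It is plain Python and does not use Hadoop or any Hadoop API; the function names are hypothetical, and a real cluster would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is big", "data moves fast"])
print(counts)
```

Because each map call looks at only one record and each reduce call at only one key, the work can be split across machines, which is what makes the model suitable for large data sets on clustered systems.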
  5. 5. 2.3 Challenges in Big Data Analysis Heterogeneity and Incompleteness: When humans consume information, a great deal of heterogeneity is comfortably tolerated. In fact, the nuance and richness of natural language can provide valuable depth. However, machine analysis algorithms expect homogeneous data, and cannot understand nuance. In consequence, data must be carefully structured as a first step in (or prior to) data analysis. Even after data cleaning and error correction, some incompleteness and some errors in data are likely to remain. This incompleteness and these errors must be managed during data analysis. Doing this correctly is a challenge. Scale: Of course, the first thing anyone thinks of with Big Data is its size. After all, the word “big” is there in the very name. Managing large and rapidly increasing volumes of data has been a challenging issue for many decades. Timeliness: The larger the data set to be processed, the longer it will take to analyze. The design of a system that effectively deals with size is likely also to result in a system that can process a given size of data set faster. Privacy: The privacy of data is another huge concern, and one that increases in the context of Big Data. For electronic health records, there are strict laws governing what can and cannot be done. However, there is great public fear regarding the inappropriate use of personal data, particularly through linking of data from multiple sources. Managing privacy is effectively both a technical and a sociological problem, which must be addressed jointly from both perspectives to realize the promise of big data.
  6. 6. 2.4 USES OF BIGDATA ANALYTICS:
     • Enable data analysts to rapidly produce insights with no IT involvement
     • Connect to any data source of any data type using a simple, guided interface
     • Utilize 220+ built-in functions to quickly and easily analyze your data
     • Create scenarios such as market compensation comparison, global payroll cost analysis, retention risk and impact analysis, competitive benchmark, revenue pulse, and more
  7. 7. 3. BIG DATA TECHNOLOGY 3.1 Selecting a Big Data Technology: Operational vs. Analytical The Big Data landscape is dominated by two classes of technology: systems that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored; and systems that provide analytical capabilities for retrospective, complex analysis that may touch most or all of the data. These classes of technology are complementary and frequently deployed together. Operational and analytical workloads for Big Data present opposing requirements and systems have evolved to address their particular demands separately and in very different ways. Each has driven the creation of new technology architectures. Operational systems, such as the NoSQL databases, focus on servicing highly concurrent requests while exhibiting low latency for responses operating on highly selective access criteria. Analytical systems, on the other hand, tend to focus on high throughput; queries can be very complex and touch most if not all of the data in the system at any time. Both systems tend to operate over many servers operating in a cluster, managing tens or hundreds of terabytes of data across billions of records. OPERATIONAL BIG DATA For operational Big Data workloads, NoSQL Big Data systems such as document databases have emerged to address a broad set of applications, and other architectures, such as key-value stores, column family stores, and graph databases are optimized for more specific applications. NoSQL technologies, which were developed to address the shortcomings of relational databases in the modern computing environment, are faster and scale much more quickly and inexpensively than relational databases. Critically, NoSQL Big Data systems are designed to take advantage of new cloud computing architectures that have emerged over the past decade to allow massive computations to be run inexpensively and efficiently. 
This makes operational Big Data workloads much easier to manage, and cheaper and faster to implement.
  8. 8. In addition to user interactions with data, most operational systems need to provide some degree of real-time intelligence about the active data in the system. For example in a multi-user game or financial application, aggregates for user activities or instrument performance are displayed to users to inform their next actions. Some NoSQL systems can provide insights into patterns and trends based on real-time data with minimal coding and without the need for data scientists and additional infrastructure. ANALYTICAL BIG DATA Analytical Big Data workloads, on the other hand, tend to be addressed by MPP database systems and MapReduce. These technologies are also a reaction to the limitations of traditional relational databases and their lack of ability to scale beyond the resources of a single server. Furthermore, MapReduce provides a new method of analyzing data that is complementary to the capabilities provided by SQL. As applications gain traction and their users generate increasing volumes of data, there are a number of retrospective analytical workloads that provide real value to the business. Where these workloads involve algorithms that are more sophisticated than simple aggregation, MapReduce has emerged as the first choice for Big Data analytics. Some NoSQL systems provide native MapReduce functionality that allows for analytics to be performed on operational data in place. Alternately, data can be copied from NoSQL systems into analytical systems such as Hadoop for MapReduce.
     OVERVIEW OF OPERATIONAL VS. ANALYTICAL SYSTEMS
                      Operational        Analytical
     Latency          1 ms - 100 ms      1 min - 100 min
     Concurrency      1000 - 100,000     1 - 10
     Access Pattern   Writes and Reads   Reads
  9. 9.
                      Operational        Analytical
     Queries          Selective          Unselective
     Data Scope       Operational        Retrospective
     End User         Customer           Data Scientist
     Technology       NoSQL              MapReduce, MPP Database
     Combining Operational and Analytical Technologies; Using Hadoop New technologies like NoSQL, MPP databases, and Hadoop have emerged to address Big Data challenges and to enable new types of products and services to be delivered by the business. One of the most common ways companies are leveraging the capabilities of both systems is by integrating a NoSQL database such as MongoDB with Hadoop. The connection is easily made through existing APIs and allows analysts and data scientists to perform complex, retrospective queries for Big Data analysis and insights while maintaining the efficiency and ease of use of a NoSQL database. NoSQL, MPP databases and Hadoop are complementary: NoSQL systems should be used to capture Big Data and provide operational intelligence to users, and MPP databases and Hadoop should be used to provide analytical insight for analysts and data scientists. Together, NoSQL, MPP databases and Hadoop enable businesses to capitalize on Big Data. 3.2 Considerations for Decision Makers While many Big Data technologies are mature enough to be used for mission-critical, production use cases, the field is still nascent in some regards. Accordingly, the way forward is not always clear. As organizations develop Big Data strategies, there are a number of dimensions to consider when
  10. 10. selecting technology partners, including: 1. Online vs. Offline Big Data 2. Software Licensing Models 3. Community 4. Developer Appeal 5. Agility 6. General Purpose vs. Niche Solutions 1. ONLINE VS. OFFLINE BIG DATA Big Data can take both online and offline forms. Online Big Data refers to data that is created, ingested, transformed, managed and/or analyzed in real time to support operational applications and their users. Big Data is born online. Latency for these applications must be very low and availability must be high in order to meet SLAs and user expectations for modern application performance. This includes a vast array of applications, from social networking news feeds, to analytics, to real-time ad servers, to complex CRM applications. Examples of online Big Data databases include MongoDB and other NoSQL databases. Offline Big Data encompasses applications that ingest, transform, manage and/or analyze Big Data in a batch context. They typically do not create new data. For these applications, response time can be slow (up to hours or days), which is often acceptable for this type of use case. Since they usually produce a static (vs. operational) output, such as a report or dashboard, they can even go offline temporarily without impacting the overall goal or end product. Examples of offline Big Data applications include Hadoop-based workloads; modern data warehouses; extract, transform, load (ETL) applications; and business intelligence tools. Organizations evaluating which Big Data technologies to adopt should consider how they intend to use their data. Those looking to build applications that support real-time, operational use cases will need an operational data store like MongoDB. For those that need a place to conduct long-running analysis offline, perhaps to inform decision-making processes, offline solutions like Hadoop can be an effective tool. 
Organizations pursuing both use cases can do so in tandem, and they will sometimes find integrations between online and offline Big Data technologies. For instance, MongoDB provides integration with Hadoop. 2. SOFTWARE LICENSE MODEL There are three general types of licenses for Big Data software technologies:
     • Proprietary. The software product is owned and controlled by a software company. The source code is not available to licensees. Customers typically license the product through
  11. 11. a perpetual license that entitles them to indefinite use, with annual maintenance fees for support and software upgrades. Examples of this model include databases from Oracle, IBM and Teradata.
     • Open-Source. The software product and source code are freely available to use. Companies monetize the software product by selling subscriptions and adjacent products with value-added components, such as management tools and support services. Examples of this model include MongoDB (by MongoDB, Inc.) and Hadoop (by Cloudera and others).
     • Cloud Service. The service is hosted in a cloud-based environment outside of customers’ data centers and delivered over the public Internet. The predominant business model is metered (i.e., pay-per-use) or subscription-based. Examples of this model include Google App Engine and Amazon Elastic MapReduce.
     For many Fortune 1000 companies, regulations and internal policies around data privacy limit their ability to leverage cloud-based solutions. As a result, most Big Data initiatives are driven with technologies deployed on-premises. Most of the Big Data pioneers are web companies that developed powerful software and hardware, which they open-sourced to the larger community. Accordingly, most of the software used for Big Data projects is open-source. 3. COMMUNITY In these early days of Big Data, there is an opportunity to learn from others. Organizations should consider how many other initiatives are being pursued using the same technologies and with similar objectives. To understand a given technology’s adoption, organizations should consider the following:
     • The number of users
     • The prevalence of local, community-organized events
     • The health and activity of online forums such as Google Groups and StackOverflow
     • The availability of conferences, how frequently they occur and whether they are well attended
  12. 12. 4. DEVELOPER APPEAL The market for Big Data talent is tight. The nation’s top engineers and data scientists often flock to companies like Google and Facebook, which are known havens for the brightest minds and places where one will be exposed to leading-edge technology. If enterprises want to compete for this talent, they have to offer more than money. By offering developers the opportunity to work on tough problems, and by using a technology that has strong developer interest, a vibrant community, and an auspicious long-term future, organizations can attract the brightest minds. They can also increase the pool of candidates by choosing technologies that are easy to learn and use, which are often the ones that appeal most to developers. Furthermore, technologies that have strong developer appeal tend to make for more productive teams who feel they are empowered by their tools rather than encumbered by poorly designed, legacy technology. Productive developer teams reduce time to market for new initiatives and reduce development costs as well. 5. AGILITY Organizations should use Big Data products that enable them to be agile. They will benefit from technologies that get out of the way and allow teams to focus on what they can do with their data, rather than how to deploy new applications and infrastructure. This will make it easy to explore a variety of paths and hypotheses for extracting value from the data and to iterate quickly in response to changing business needs. In this context, agility comprises three primary components:
     • Ease of Use. A technology that is easy for developers to learn and understand (because of the way it’s architected, the availability of tools and information, or both) will enable teams to get Big Data projects started and to realize value quickly. Technologies with steep learning curves and fewer resources to support education will make for a longer road to project execution.
     • Technological Flexibility. 
The product should make it relatively easy to change requirements on the fly, such as how data is modeled, which data is used, where data is pulled from and how it gets processed, as teams develop new findings and adapt to
  13. 13. internal and external needs. Dynamic data models (also known as schemas) and scalability are capabilities to seek out.
     • Licensing Freedom. Open-source products are typically easier to adopt, as teams can get started quickly with free community versions of the software. They are also usually easier to scale from a licensing standpoint, as teams can buy more licenses as requirements increase. By contrast, in many cases proprietary software vendors require large, upfront license purchases, which make it harder for teams to get moving quickly and to scale in the future.
     MongoDB’s ease of use, dynamic data model and open-source licensing model make it the most agile Big Data solution available. 6. GENERAL PURPOSE VS. NICHE SOLUTIONS Organizations are constantly trying to standardize on fewer technologies to reduce complexity, to improve their competency in the selected tools and to make their vendor relationships more productive. Organizations should consider whether adopting a Big Data technology helps them address a single initiative or many initiatives. If the technology is general purpose, the expertise, infrastructure, skills, integrations and other investments of the initial project can be amortized across many projects. Organizations may find that a niche technology may be a better fit for a single project, but that a more general purpose tool is the better option for the organization as a whole.
  14. 14. 4. ADVANTAGES OF BIGDATA The practical advantages of big data are as follows: Dialogue with consumers Today’s consumers are a tough nut to crack. They look around a lot before they buy, talk to their entire social network about their purchases, demand to be treated as unique and want to be sincerely thanked for buying your products. Big Data allows you to profile these increasingly vocal and fickle little ‘tyrants’ in a far-reaching manner so that you can engage in an almost one-on-one, real-time conversation with them. This is not actually a luxury. If you don’t treat them the way they want to be treated, they will leave you in the blink of an eye. Just a small example: when a customer enters a bank, Big Data tools allow the clerk to check his/her profile in real time and learn which relevant products or services (s)he might advise. Big Data will also have a key role to play in uniting the digital and physical shopping spheres: a retailer could suggest an offer on a mobile carrier, on the basis of a consumer indicating a certain need on social media. Re-develop your products Big Data can also help you understand how others perceive your products so that you can adapt them, or your marketing, if need be. Analysis of unstructured social media text allows you to uncover the sentiments of your customers and even segment those in different geographical locations or among different demographic groups. On top of that, Big Data lets you test thousands of different variations of computer-aided designs in the blink of an eye so that you can check how minor changes in, for instance, material affect costs, lead times and performance. You can then raise the efficiency of the production process accordingly. Perform risk analysis Success depends not only on how you run your company. Social and economic factors are crucial for your accomplishments as well. Predictive analytics, fueled by Big Data, allows you to
  15. 15. scan and analyze newspaper reports or social media feeds so that you stay up to speed on the latest developments in your industry and its environment. Detailed health tests on your suppliers and customers are another benefit that comes with Big Data. This will allow you to take action when one of them is at risk of defaulting. Keeping your data safe You can map the entire data landscape across your company with Big Data tools, thus allowing you to analyze the threats that you face internally. You will be able to detect potentially sensitive information that is not protected in an appropriate manner and make sure it is stored according to regulatory requirements. With real-time Big Data analytics you can, for example, flag any situation where 16-digit numbers (potentially credit card data) are stored or emailed out and investigate accordingly. Create new revenue streams The insights that you gain from analyzing your market and its consumers with Big Data are not just valuable to you. You could sell them as non-personalized trend data to large industry players operating in the same segment as you and create a whole new revenue stream. One of the more impressive examples comes from Shazam, the song identification application. It helps record labels find out where music sub-cultures are arising by monitoring the use of its service, including the location data that mobile devices so conveniently provide. The record labels can then find and sign up promising new artists or remarket their existing ones accordingly. Customize your website in real time Big Data analytics allows you to personalize the content or look and feel of your website in real time to suit each consumer entering your website, depending on, for instance, their sex, nationality or how they arrived at your site. 
The best-known example is probably offering tailored recommendations: Amazon’s use of real-time, item-based, collaborative filtering (IBCF) to fuel its ‘Frequently bought together’ and ‘Customers who bought this item also bought’ features or LinkedIn suggesting ‘People you may know’ or ‘Companies you may
  16. 16. want to follow’. And the approach works: Amazon generates about 20% more revenue via this method. Reducing maintenance costs Traditionally, factories estimate that a certain type of equipment is likely to wear out after so many years. Consequently, they replace every piece of that technology within that many years, even devices that have much more useful life left in them. Big Data tools do away with such impractical and costly averages. The massive amounts of data that they access and use and their unequalled speed can spot failing grid devices and predict when they will give out. The result: a much more cost-effective replacement strategy for the utility and less downtime, as faulty devices are tracked a lot faster. Offering tailored healthcare We are living in a hyper-personalized world, but healthcare seems to be one of the last sectors still using generalized approaches. When someone is diagnosed with cancer they usually undergo one therapy, and if that doesn’t work, the doctors try another, etc. But what if a cancer patient could receive medication that is tailored to their individual genes? This would result in a better outcome, less cost, less frustration and less fear. With human genome mapping and Big Data tools, it will soon be commonplace for everyone to have their genes mapped as part of their medical record. This brings medicine closer than ever to finding the genetic determinants that cause a disease and developing drugs expressly tailored to treat those causes; in other words, personalized medicine. Offering enterprise-wide insights Previously, if business users needed to analyze large amounts of varied data, they had to ask their IT colleagues for help as they themselves lacked the technical skills for doing so. Often, by the time they received the requested information, it was no longer useful or even correct. With Big Data tools, the technical teams can do the groundwork and then build repeatability into algorithms for faster searches. 
In other words, they can develop systems and install interactive and dynamic visualization tools that allow business users to analyze, view and benefit from the data.
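Returning to the "Keeping your data safe" example above, the idea of flagging 16-digit numbers can be sketched in Python. This is an illustrative sketch only, not a description of any particular Big Data product: the pattern `CARD_LIKE` and the function names are hypothetical, and the Luhn checksum is an extra step added here (it is not mentioned in the text) to help separate plausible card numbers from other 16-digit identifiers.

```python
import re

# 16 digits, optionally written in groups of four with one consistent
# separator (space or hyphen). Hypothetical pattern for this sketch.
CARD_LIKE = re.compile(r"(?<!\d)\d{4}([ -]?)\d{4}\1\d{4}\1\d{4}(?!\d)")


def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:       # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_like(text):
    """Return (matched_text, passes_luhn) for each 16-digit run in text."""
    hits = []
    for match in CARD_LIKE.finditer(text):
        digits = re.sub(r"\D", "", match.group(0))
        hits.append((match.group(0), luhn_valid(digits)))
    return hits


sample = "Card 4111 1111 1111 1111, order id 1234567812345678, tel 555-1234"
for found, plausible in find_card_like(sample):
    print(found, "-> plausible card number" if plausible else "-> other 16-digit number")
```

In practice a match would only be a trigger for the investigation the text describes, since a checksum cannot prove that a number is real card data.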
  17. 17. Making our cities smarter To help them deal with the consequences of their fast expansion, an increasing number of cities are leveraging Big Data tools for the benefit of their citizens and the environment. The city of Oslo in Norway, for instance, reduced street lighting energy consumption by 62% with a smart solution. Since the Memphis Police Department started using predictive software in 2006, it has been able to reduce serious crime by 30%. The city of Portland, Oregon, used technology to optimize the timing of its traffic signals and was able to eliminate more than 157,000 metric tons of CO2 emissions in just six years, the equivalent of taking 30,000 passenger vehicles off the roads for an entire year. These are a few practical examples of big data.
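The item-based collaborative filtering mentioned under "Customize your website in real time" can be illustrated with a very small sketch. This is a simplified toy version, assuming purchase baskets as input and cosine similarity over co-purchase counts; it is not Amazon's actual implementation, and the function names and sample baskets are invented for this example.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(baskets):
    """Cosine similarity between items, based on how often they share a basket."""
    item_count = defaultdict(int)
    pair_count = defaultdict(int)
    for basket in baskets:
        items = sorted(set(basket))
        for item in items:
            item_count[item] += 1
        for a, b in combinations(items, 2):
            pair_count[(a, b)] += 1
    sims = defaultdict(dict)
    for (a, b), together in pair_count.items():
        score = together / sqrt(item_count[a] * item_count[b])
        sims[a][b] = score
        sims[b][a] = score
    return sims


def also_bought(sims, item, top_n=2):
    """Items most similar to `item`, best first (ties broken by name)."""
    ranked = sorted(sims[item].items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


baskets = [
    ["camera", "memory card", "tripod"],
    ["camera", "memory card"],
    ["camera", "bag"],
    ["tripod", "bag"],
]
sims = item_similarities(baskets)
print(also_bought(sims, "camera"))
```

The similarities depend only on the items, not on the individual visitor, so they can be computed offline and looked up at page-view time, which is one reason an item-based approach suits real-time recommendations.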
  18. 18. 5. RISKS OF BIGDATA Big data has gotten a lot of press recently, and rightly so. With the vast amounts of data now available, we can do more than could have been imagined in previous decades. But there is another face to big data: companies now have to manage some very big risks. It’s hard to visualize the amount of data we’re talking about. But as one article put it, “In 2011 alone, 1.8 zettabytes (or 1.8 trillion gigabytes) of data will be created, the equivalent to every U.S. citizen writing 3 tweets per minute for 26,976 years.” And this number is anticipated to grow 50-fold by the year 2020. Risk #1: Loss of agility In a typical large-scale organization, data is housed on multiple platforms. There is transactional data, email data, analytics data, etc. Management wants people to be able to locate, analyze, and make decisions based on this data quickly. It is a necessity in today’s marketplace where conditions can change in an instant. But if the data isn’t evaluated, organized, and stored properly, critical information can be either difficult or impossible to find, slowing a business down at the exact moment when speed is essential. Risk #2: Loss of compliance Laws are getting more and more complex with regard to how long companies need to retain data, how they need to retain it, and where they need to retain it. There are both general regulations in place as well as state- or industry-specific regulations that may apply. It is not uncommon for regulators to perform random audits to examine a company’s policies regarding data and their actual management of that data. A compliance failure can result in significant fines or reputational damage. Risk #3: Loss of security With more data located in and moving between more places than ever before, there are also a vastly increased number of ways to hack into that data. A security breach can result in theft, fraud, fines and, of course, reputational loss. 
No company wants to be featured on the front page of the Wall Street Journal because they’ve been hacked.
  19. 19. Risk #4: Loss of money As the amount of data grows, it is all too tempting to simply throw more servers at the problem. After all, storage is cheap, isn’t it? But consider this: I once worked with a client who said they needed an entire new data center to house their data. SunGard Availability Services did studies and found that not only did they not need a new data center; they actually needed only half their current storage because they simply weren’t managing their data well. A server may seem inexpensive at first glance – but never assume that storage is cheap. Big data is a good thing. No question about it. But big risky data is a bad thing. Companies today need to manage their data to minimize their risk. This involves having policies that are in compliance with regulatory standards, processes that cover all contingencies, retention schedules that are up to date, and a consistent self-evaluation to determine what data is necessary for the proper functioning of the company. The more efficiently companies store, manage, and host their data, the more agile, compliant, secure, and cost-effective they will be. And that will take the big risk out of big data.
  20. 20. 6. CONCLUSION We have entered an era of Big Data. Through better analysis of the large volumes of data that are becoming available, there is the potential for making faster advances in many scientific disciplines and improving the profitability and success of many enterprises. However, many technical challenges described in this paper must be addressed before this potential can be realized fully. The challenges include not just the obvious issues of scale, but also heterogeneity, lack of structure, error-handling, privacy, timeliness, provenance, and visualization, at all stages of the analysis pipeline from data acquisition to result interpretation. These technical challenges are common across a large variety of application domains, and therefore not cost-effective to address in the context of one domain alone. Furthermore, these challenges will require transformative solutions, and will not be addressed naturally by the next generation of industrial products. We must support and encourage fundamental research towards addressing these technical challenges if we are to achieve the promised benefits of BigData.