
Big data - what, why, where, when and how


This paper discusses the growing prevalence of big data.

Published in: Technology


BIG DATA: What, Why, Where, When and How

Senthil Sundaresan, BI/SQL/Data Visualization Evangelist

Abstract: In this paper, we discuss what big data is, its growing prevalence, the opportunities and challenges it presents, and an architectural framework that facilitates delivering on those opportunities while addressing the challenges. The architecture for big data management is demonstrated through the Hadoop technology stack, with its MapReduce framework and open-source ecosystem.
Author's Page

I am Senthil, a BI/SQL/data visualization evangelist. I have donned many roles during my career of 13+ years: Analyst, Developer, Lead, Project Manager, Principal Data and Visualization Architect, Consultant, Database Administrator, and Unix Administrator, to name a few. My BI and visualization skills include SAP BO/BODS, Tableau, QlikView, MSBI, Essbase, R, Omniscope, SQL Server, Sybase, Sybase IQ, and Teradata, again to name a few. Having been in this industry, and especially in BI, for so many years, it was imperative for me to understand the nuances and intricacies of the big data tech stack. That was the trigger for writing this paper, and while doing so I started exploring big data further. I hope this paper serves as a stepping stone for those wondering whether big data is possible or plausible for them. Thanks for reading!

My blogs:
1. INTRODUCTION

"Big data" is a big, vibrant phrase in the IT and business world right now, and there is a dizzying array of opinions on just what these two simple words really mean. Technology vendors in the legacy database or data warehouse spaces say "big data" simply refers to a traditional data warehousing scenario involving data volumes in either the single or multi-terabyte range. Others disagree, saying that "big data" is not limited to traditional data warehouse situations, but includes real-time or operational data stores used as the primary data foundation for online applications that power key external or internal business systems.

In 2011 alone, people created 1.8 zettabytes of data, and this volume is increasing exponentially every year. This ever-increasing data contains information that could give rise to many business opportunities. A few of the business drivers of big data are:
 Finance: a better and deeper understanding of risk to avoid credit crises (Basel III)
 Telecommunications: more reliable networks, where failure can be predicted and prevented
 Media: more content that is aligned with your personal preferences
 Life sciences: better-targeted medicines with fewer complications and side effects
 Retail: a personal experience, with products and offers that are just what you need
 Government: government services that are based on hard data, not just gut feel

Big data is here. Analysts and research organizations have made it clear that mining machine-generated data is essential to future success. Embracing new technologies and techniques is always challenging, but as architects, you are expected to provide a fast, reliable path to business adoption. Big data characteristics, architecture capabilities, technologies, market vendors, and a sample implementation are explained in the subsequent sections.

2. WHAT IS BIG DATA?

Big data is data that exceeds the processing capacity of conventional database systems.
The data is too big, moves too fast, or doesn't fit the strictures of your database architecture. To gain value from this data, an alternative way of processing it has to be chosen.

2.1 Characteristics of Big Data

Big data typically consists of very large, distributed aggregations of loosely structured data that are often incomplete and inaccessible:
 Petabytes/exabytes of data
 Billions/trillions of records
 Loosely structured and often distributed data
 Flat schemas with few complex interrelationships
 Often involving time-stamped events
 Often made up of incomplete data
 Often including connections between data elements that must be probabilistically inferred

Applications that involve big data can be:
 Transactional (e.g., Facebook, Photobox)
 Analytic (e.g., ClickFox, Merced applications)
Fig 1: Big Data Evolution

According to a global report from IBM and the Saïd Business School at the University of Oxford, fewer than half of the organizations engaged in active big data initiatives are currently analyzing external sources of data, such as social media.

2.2 Key Metrics: The Three V's

As a catch-all term, "big data" can be pretty nebulous, in the same way that the term "cloud" covers diverse technologies. Input data to big data systems could be chatter from social networks, web server logs, traffic-flow sensors, satellite imagery, broadcast audio streams, banking transactions, MP3s of rock music, the content of web pages, scans of government documents, GPS trails, telemetry from automobiles, financial market data; the list goes on. Are these all really the same thing? To clarify matters, the three V's of volume, velocity, and variety are commonly used to categorize different aspects of big data. They are a helpful lens through which to view and understand the nature of the data and the software platforms that are available to exploit it.

2.2.1 Volume

Terabytes of records, transactions, tables, and files. A Boeing jet engine spews out 10 TB of operational data for every 30 minutes it runs; hence a four-engine jumbo jet can create 640 TB on one Atlantic crossing. Multiply that by the roughly 25,000 flights flown each day and you get the picture.

2.2.2 Velocity

Batch, near-time, real-time, streams. Today's online ad serving requires a decision within 40 ms. Financial services need around 1 ms to calculate customer scoring probabilities. Stream data, such as movies, needs to travel at high speed for proper rendering.

2.2.3 Variety

Structured, semi-structured, unstructured; text, image, audio, video, records, and all of the above in a mix. WalMart processes 1M customer transactions per hour and feeds the information into a database estimated at 2.5 PB (petabytes).
There are old and new data sources: RFID, sensors, mobile payments, in-vehicle tracking, etc.

[Figure: data variety and complexity growing with storage volume, from ERP (megabytes) through CRM and Web (gigabytes to terabytes) to big data (petabytes)]
Fig 2: Volume, Velocity and Variety

3. BIG DATA PROCESSING

Before big data, traditional analysis involved crunching data in a traditional database. This was based on the relational database model, where data and the relationships between the data were stored in tables, and the data was processed and stored in rows. Databases have progressed over the years, however, and now use massively parallel processing (MPP) to break data up into smaller lots and process it on multiple machines simultaneously, enabling faster processing. Instead of storing the data in rows, databases can also employ columnar architectures, which enable processing only the columns that hold the data needed to answer the query, and enable the storage of unstructured data.

Fig 3: Big Data Architecture

MapReduce

MapReduce is the combination of two functions to better process data. First, the map function separates data over multiple nodes, which then process it in parallel. The reduce function then combines the results of those calculations into a set of responses. Google used MapReduce to index the web and has been granted a patent for its MapReduce framework. The MapReduce method has since become commonly used, with the most famous implementation being the open-source project Hadoop.

Bridging the Gap: The Key-Value Pair
The key-value pair is the data model underlying MapReduce (and thus Hadoop), and it is actually the fundamental driver of performance. A file of key-value pairs has exactly two columns. One is structured: the key. The other, the value, is unstructured, at least as far as the system is concerned. The mapper allows you to move (or split) data between the structured and unstructured sections at will. The reducer then allows data to be collated and aggregated, provided it has an identical key.

Massively parallel processing (MPP)

Like MapReduce, MPP processes data by distributing it across a number of nodes, each of which processes an allocation of data in parallel. The output is then assembled to create a result. However, MPP products are queried with SQL, while MapReduce is natively controlled via Java code. MPP is also generally used on expensive specialized hardware (sometimes referred to as big-data appliances), while MapReduce is deployed on commodity hardware.

4. BIG DATA ARCHITECTURE

In this section, we take a closer look at the overall architecture for big data.

Traditional Information Architecture Capabilities

To understand the high-level architecture aspects of big data, let's first review a well-formed logical information architecture for structured data. In the illustration, you see two data sources that use integration (ELT/ETL/change data capture) techniques to transfer data into a DBMS data warehouse or operational data store, which then offers a wide variety of analytical capabilities to reveal the data. Some of these analytic capabilities include dashboards, reporting, EPM/BI applications, summary and statistical queries, semantic interpretation of textual data, and visualization tools for high-density data. In addition, some organizations have applied oversight and standardization across projects, and perhaps have matured the information architecture capability by managing it at the enterprise level.
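The map/shuffle/reduce flow and the key-value model described in Section 3 can be sketched in plain Python. This is a toy, single-process illustration of the idea only; a real Hadoop job distributes each phase across many nodes, and the function names here are our own, not a Hadoop API.

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: turn unstructured records into (key, value) pairs.
    # Here, one pair per word: (word, 1).
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: collate values under an identical key, as the
    # framework does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: aggregate each key's values into one response.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data is big", "data moves fast"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

Note how the key ("big", "data") is the only structured part the framework sees; everything else rides along as the value, which is exactly the "bridging" property described above.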
Fig 4: Traditional Capabilities (courtesy Oracle)

The key information architecture principles include treating data as an asset through a value, cost, and risk lens, and ensuring the timeliness, quality, and accuracy of data. The enterprise architecture oversight responsibility is to establish and maintain a balanced governance approach, including using a center of excellence for standards management and training.

Adding Big Data Capabilities

The defining processing capabilities of a big data architecture are to meet the volume, velocity, variety, and value requirements. Unique distributed (multi-node) parallel processing architectures have been created to parse these large data sets. There are differing technology strategies for real-time and batch processing requirements. For real time, key-value data stores such as NoSQL databases allow for high-performance, index-based retrieval. For batch processing, the technique known as MapReduce filters data according to a specific data discovery strategy. After the filtered data is discovered, it can be analyzed directly, loaded into other unstructured databases, sent to mobile devices, or merged into a traditional data warehousing environment and correlated to structured data.
Fig 5: Big Data Capabilities (courtesy Oracle)

In addition to new unstructured data realms, there are two key differences for big data. First, due to the size of the data sets, we don't move the raw data directly to a data warehouse. However, after MapReduce processing we may integrate the "reduction result" into the data warehouse environment so that we can leverage conventional BI reporting, statistical, semantic, and correlation capabilities. It is ideal to have analytic capabilities that combine a conventional BI platform with big data visualization and query capabilities. Second, to facilitate analysis in the Hadoop environment, sandbox environments can be created.

For many use cases, big data needs to capture data that is continuously changing and unpredictable, and to analyze that data a new architecture is needed. In retail, a good example is capturing real-time foot traffic with the intent of delivering in-store promotions. To track the effectiveness of floor displays and promotions, customer movement and behavior must be interactively explored with visualization or query tools. In other use cases, the analysis cannot be complete until you correlate it with other enterprise (structured) data. In the example of consumer sentiment analysis, capturing a positive or negative social media comment has some value, but associating it with your most or least profitable customer makes it far more valuable. So the needed capability with big data BI is context and understanding. Using powerful statistical and semantic tools allows you to find the needle in the haystack and helps you predict the future.

In summary, the big data architecture challenge is to meet the rapid-use and rapid-interpretation requirements while at the same time correlating the data with other data.

5. STEPS TO BIG DATA

Before you go down the path of big data, it's important to be prepared and to approach an implementation in an organized manner, following these steps.

What do you wish you knew?
 This is where you decide what you expect from big data that you can't get from your current systems.
 If the answer is nothing, then perhaps big data isn't the right thing to use.

What are the current data assets?
 Can the data be cross-referenced to produce insights?
 Is it possible to build new data products on top of the current assets?
 If not, what needs to be implemented to do so?

Once the above are understood, it's time to prioritize. Select the potentially most valuable opportunity for using big-data techniques and technology, and prepare a business case for a proof of concept, keeping in mind the skill sets you'll need to do it. You will need to talk to the owners of the data assets to get the full picture.

Another example of applying architecture principles differently is data governance. The quality and accuracy requirements of big data can vary tremendously. Using strict data-precision rules on user sentiment data might filter out too much useful information, whereas data standards and common definitions are still going to be critical for fraud-detection scenarios.

Start the proof of concept, and make sure that there's a clear end point so that you can evaluate what the proof of concept has achieved. This might be the time to ask the owner of the data assets to take responsibility for the project.
Once your proof of concept has been completed, evaluate whether it worked. Are you getting real insights delivered? Is the work that went into the concept bearing fruit? Could it be extended to other parts of the organization? Is there other data that could be included? This will help you decide whether to expand your implementation or revamp it. Once the evaluation is done and the need for big data is clear, it's imperative to choose the vendors and technologies.

5.1 Architecture Decisions

Information architecture is perhaps the most complex area of IT, and it is the ultimate investment payoff. Today's economic environment demands that business be driven by useful, accurate, and timely information, and the world of big data adds another dimension to the problem. There are always business and IT tradeoffs in getting to data and information in the most cost-effective way.

Key Drivers to Consider

Fig 6 summarizes the various business and IT drivers you need to consider when making these architecture choices.

Fig 6: Key Drivers

Planning a big data architecture is not just about understanding what is different. It's also about how to integrate what's new with what you already have, from database-and-BI infrastructure to IT tools and end-user applications.

5.2 Technologies

To derive real business value from big data, you need the right tools to capture and organize a wide variety of data types from different sources, and to be able to easily analyze the data within the context of all your enterprise data. Here is a brief outline of big data capabilities and their primary technologies.

5.2.1 Hadoop

Derived from MapReduce technology, Hadoop is an open-source framework for processing large amounts of data over multiple nodes in parallel, running on inexpensive hardware.
Data is split into sections and loaded into a file store (for example, the Hadoop Distributed File System, HDFS), which is made up of multiple redundant nodes on cheap storage. A name node keeps track of which data is on which nodes. The data is replicated over more than one node, so that even if a node fails, there is still a copy of the data. The data can then be analyzed using MapReduce, which discovers from the name node where the data needed for the calculations resides. Processing is then done at the nodes in parallel, and the results are aggregated to determine the answer to the query and loaded onto a node, where they can be analyzed further using other tools. Alternatively, the data can be loaded into traditional data warehouses for use with transactional processing. Apache's is considered the most noteworthy Hadoop distribution.
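The name-node bookkeeping and replication just described can be illustrated with a minimal Python sketch. This is a hypothetical in-memory model (the class and method names are ours, not the real HDFS API): blocks are placed on several nodes, and when one node fails the data is still locatable on a surviving replica.

```python
import random

class ToyNameNode:
    """Toy model of HDFS-style block placement (not the real HDFS API)."""

    def __init__(self, nodes, replication=3):
        self.nodes = set(nodes)
        self.replication = replication
        self.block_map = {}  # block id -> set of nodes holding a replica

    def store(self, block_id):
        # Place `replication` copies of the block on distinct nodes.
        replicas = random.sample(sorted(self.nodes), self.replication)
        self.block_map[block_id] = set(replicas)

    def fail_node(self, node):
        # A node dies: forget it and every replica it held.
        self.nodes.discard(node)
        for replicas in self.block_map.values():
            replicas.discard(node)

    def locate(self, block_id):
        # MapReduce asks the name node where a block's data lives.
        return self.block_map[block_id]

nn = ToyNameNode(["node1", "node2", "node3", "node4"], replication=3)
nn.store("block-0")
nn.fail_node("node1")
# With 3 replicas spread over 4 nodes, at least 2 copies survive one failure.
print(len(nn.locate("block-0")) >= 2)  # True
```

The real HDFS additionally re-replicates under-replicated blocks after a failure; this sketch only shows why a single node loss does not lose data.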
Fig 7: Hadoop in the Enterprise

RDBMS and Hadoop

Fig 8 compares the overall differences between RDBMSs and MapReduce-based systems such as Hadoop.

Fig 8: RDBMS vs. Hadoop

5.2.2 Hive

File stores like Hadoop's make ad hoc query and analysis difficult, as the map/reduce functions that programming them requires can be hard to write. Realizing this while working with Hadoop, Facebook created Hive, which converts SQL queries into map/reduce jobs to be executed on Hadoop.

5.2.3 Pig

Pig is a procedural data-processing language designed for Hadoop, in which you specify a series of steps to perform on the data. It is often described as "the duct tape of big data" for its usefulness there, and it is often combined with custom streaming code written in a scripting language for more general operations.

5.2.4 Social Networks and Hadoop

Twitter uses Hadoop to store and process tweets, log files, and many other types of data generated across Twitter. It uses Cloudera's CDH2 distribution of Hadoop and stores all data as compressed LZO files.
 It uses both Scala and Java to access Hadoop's MapReduce APIs.
 It uses Pig heavily for both scheduled and ad-hoc jobs, due to its ability to accomplish a lot with few statements.
 It employs committers on Pig, Avro, Hive, and Cassandra, and contributes much of its internal Hadoop work to open source.

Facebook uses Hadoop to store copies of internal log and dimension data sources, and uses it as a source for reporting/analytics and machine learning.
 Facebook currently has two major clusters:
 an 1100-machine cluster with 8800 cores and about 12 PB of raw storage, and
 a 300-machine cluster with 2400 cores and about 3 PB of raw storage.
 Each (commodity) node has 8 cores and 12 TB of storage.
Facebook is a heavy user of both streaming and the Java APIs. It has built a higher-level data warehousing framework using these features, called Hive, and has also developed a FUSE implementation over HDFS.

5.2.5 NoSQL

NoSQL database-management systems are unlike relational database-management systems in that they do not use SQL as their query language. The idea behind these systems is that they are better at handling data that doesn't fit easily into tables. They dispense with the overhead of indexing, schemas, and ACID transactional properties to create large, replicated data stores for running analytics on inexpensive hardware, which is useful for dealing with unstructured data.

Types of NoSQL Databases

The following are the main types of NoSQL databases:
 Key-value stores
 Document databases
 Column-oriented databases

Fig 9: NoSQL Types

Cassandra

Cassandra is a NoSQL database alternative to Hadoop's HDFS.

5.3 Sample Implementation

Big-data projects have a number of different layers of abstraction, from abstraction of the data through to running analytics against the abstracted data. Fig 10 shows the common components of analytical big data and their relationships to each other. The higher-level components help make big data projects easier and more productive. Hadoop (an Apache project, written in Java and built and used by a global community of contributors) is often at the center of big-data projects, but it is not a prerequisite.
 Packaging and support of Hadoop by organizations such as Cloudera, including MapReduce, essentially the compute layer of big data.
 File systems such as the Hadoop Distributed File System (HDFS), which manages the retrieval and storage of the data and metadata required for computation. Other file systems or databases, such as HBase (a NoSQL tabular store) or Cassandra (a NoSQL eventually consistent key-value store), can also be used.
 Instead of writing in Java, higher-level languages such as Pig (part of Hadoop) can be used, simplifying the writing of computations.
 Hive is a data warehouse layer built on top of Hadoop, developed by Facebook programmers.
 Cascading is a thin Java library that sits on top of Hadoop and allows suites of MapReduce jobs to be run and managed as a unit. It is widely used to develop special tools.
 Semi-automated modeling tools such as CR-X allow models to be developed interactively at great speed, and can help set up the database that will run the analytics.
 Specialized scale-out analytic databases such as Greenplum or Netezza, with very fast loading, load and reload the data for the analytic models.
 ISV big data analytical packages such as ClickFox and Merced run against the database to help address business issues (e.g., the customer satisfaction issues mentioned in the introduction).

Transactional big-data projects cannot use Hadoop, as it is not real-time. For transactional systems that do not need a database with ACID guarantees, NoSQL databases can be used, though there are constraints such as weak consistency guarantees (e.g., eventual consistency) or transactions restricted to a single data item. For big-data transactional SQL databases that do need ACID guarantees, the choices are limited. Traditional scale-up databases are usually too costly for very large-scale deployment and don't scale out very well; most social media companies have had to hand-craft solutions. Recently, a new breed of scale-out SQL databases has emerged with architectures that move the processing next to the data (in the same way as Hadoop), such as Clustrix. These allow greater scale-out ability.

Fig 10: Sample Implementation

This area is growing extremely fast, with many new entrants into the market expected over the next few years.
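As a toy illustration of the eventually consistent key-value model mentioned above for stores like Cassandra (a hypothetical in-memory sketch with invented names, not the real Cassandra or HBase API): each record is addressed by key alone, the value carries no schema, and replicas that receive different writes converge by keeping the newest timestamp.

```python
class ToyReplica:
    """Hypothetical key-value replica (not a real Cassandra/HBase API)."""

    def __init__(self):
        self.data = {}  # key -> (timestamp, value); the value is schemaless

    def put(self, key, value, ts):
        # Last-write-wins: keep the entry with the newest timestamp.
        if key not in self.data or ts > self.data[key][0]:
            self.data[key] = (ts, value)

    def get(self, key):
        return self.data[key][1]

def sync(a, b):
    # Anti-entropy: exchange entries so both replicas converge.
    for key, (ts, value) in list(a.data.items()):
        b.put(key, value, ts)
    for key, (ts, value) in list(b.data.items()):
        a.put(key, value, ts)

r1, r2 = ToyReplica(), ToyReplica()
r1.put("user:42", {"name": "Ada"}, ts=1)            # write lands on replica 1
r2.put("user:42", {"name": "Ada Lovelace"}, ts=2)   # newer write on replica 2
sync(r1, r2)                                        # replicas converge
print(r1.get("user:42"))  # {'name': 'Ada Lovelace'}
```

Between the writes and the sync, the two replicas disagree; that window is exactly the "weak consistency" constraint the text describes for transactional use.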
5.4 Vendors

There is scarcely a vendor that doesn't have a big-data plan in train, with many companies combining their proprietary database products with the open-source Hadoop technology as their strategy for tackling velocity, variety, and volume. Many of the early big-data technologies came out of open source, posing a threat to traditional IT vendors that have packaged their software and kept their intellectual property close to their chests. However, the open-source nature of the trend has also provided an opportunity for those traditional vendors, because enterprises and governments often find open-source tools off-putting. Traditional vendors have therefore welcomed Hadoop with open arms, packaging it into their own proprietary systems so they can sell the result to enterprises as more comfortable and familiar packaged solutions. Below are the plans of some of the larger vendors.

5.4.1 Cloudera

Cloudera was founded in 2008 by employees who had worked on Hadoop at Yahoo and Facebook. It contributes to the Hadoop open-source project, offering its own distribution of the software for free. It also sells a subscription-based, Hadoop-based distribution for the enterprise, which includes production support and tools that make it easier to run Hadoop.
5.4.2 Hortonworks

Cloudera rival Hortonworks was birthed by key architects from the Yahoo Hadoop software engineering team. In June 2012, the company launched a high-availability version of Apache Hadoop, the Hortonworks Data Platform, on which it collaborated with VMware, the goal being to target companies deploying Hadoop on VMware's vSphere. Teradata has also partnered with Hortonworks to create products that "help customers solve business problems in new and better ways".

5.4.3 Teradata

Teradata made its move out of the "old-world" data-warehouse space by buying Aster Data Systems and Aprimo in 2011. Teradata wanted Aster's ability to manage "a variety of diverse data that is not structured", such as web applications, sensor networks, social networks, genomics, video, and photographs. Teradata has now gone to market with the Aster Data nCluster, a database using MPP and MapReduce. Visualization and analysis are enabled through the Aster Data visual development environment and a suite of analytic modules. The Hadoop connector, enabled by Teradata's agreement with Cloudera, allows information to be transferred between nCluster and Hadoop.

5.4.4 Oracle

Oracle made its big-data appliance available earlier this year: a full rack of 18 Oracle Sun servers with 864 GB of main memory; 216 CPU cores; 648 TB of raw disk storage; 40 Gbps InfiniBand connectivity between nodes and engineered systems; and 10 Gbps Ethernet connectivity. The system includes Cloudera's Apache Hadoop distribution and manager software, as well as an Oracle NoSQL database and a distribution of R (an open-source statistical computing and graphics environment). It integrates with Oracle's 11g database, the idea being that customers can use Hadoop MapReduce to create optimized data sets to load and analyze in the database.
5.4.5 IBM

IBM combined Hadoop and its own patents to create IBM InfoSphere BigInsights and IBM InfoSphere Streams as the core technologies of its big-data push. The BigInsights product, which enables the analysis of large-scale structured and unstructured data, "enhances" Hadoop to "withstand the demands of your enterprise", according to IBM. It adds administrative, workflow, provisioning, and security features to the open-source distribution. Streams analysis, meanwhile, has a more complex event-processing focus, allowing the continuous analysis of streaming data so that companies can respond to events. IBM has partnered with Cloudera to integrate Cloudera's Hadoop distribution and Cloudera Manager with IBM BigInsights. Like Oracle's big-data product, IBM's BigInsights links to IBM DB2, its Netezza data warehouse, its InfoSphere Warehouse, and its Smart Analytics System.

5.4.6 SAP

At the core of SAP's big-data strategy sits its high-performance analytic appliance (HANA), a data-warehouse appliance unleashed in 2011. It exploits in-memory computing, processing large amounts of data in the main memory of a server to provide real-time results for analysis and transactions. Business applications, like SAP's BusinessObjects, can sit on the HANA platform to receive a real-time boost. SAP has integrated HANA with Hadoop, enabling customers to move data between Hive or Hadoop's Distributed File System and SAP HANA or the SAP Sybase IQ server. It has also set up a big-data partner council, which will work to provide products that make use of HANA and Hadoop; one of the key partners is Cloudera. SAP wants it to be easy to connect to data, whether it's in SAP software or software from another vendor.

5.4.7 Microsoft

Microsoft is integrating Hadoop into its current products. It has been working with Hortonworks to make Hadoop available on its cloud platform, Azure, and on Windows Server; the former is available in developer preview. It already has connectors
between Hadoop, SQL Server, and SQL Server Parallel Data Warehouse, as well as the ability for customers to move data from Hive into Excel and Microsoft BI tools such as PowerPivot.

5.4.8 EMC

EMC has centered its big-data offering on technology it acquired when it bought Greenplum in 2010. It offers a unified analytics platform that deals with web, social, document, mobile, machine, and multimedia data using Hadoop's MapReduce and HDFS, while ERP, CRM, and POS data is put into SQL stores. Data mining, neural nets, and statistical analysis are carried out using data from both sets, which is fed into dashboards.

6. VALUE TO AN ORGANIZATION

The value of big data falls into two categories:
1. Analytical use
2. Enabling new markets/products

Big data analytics can reveal insights previously hidden by data that was too costly to process, such as peer influence among customers, revealed by analyzing shoppers' transactional, social, and geographical data. The past decade's successful web startups are prime examples of big data used as an enabler of new products and services. For example, by combining a large number of signals from a user's actions and those of their friends, Facebook has been able to craft a highly personalized user experience and create a new kind of advertising business. It's no coincidence that the lion's share of the ideas and tools underpinning big data have emerged from Google, Yahoo, Amazon, and Facebook.

7. FIRMS AND BIG DATA

Now that there are products that make use of big data, what are companies' plans in the space? We've outlined some of them below.

7.1 Ford

Ford is experimenting with Hadoop to better understand how its cars operate and how consumers use the vehicles, to feed that information back into its design process, and to help optimize the user's future experience, as well as to gain value from the data generated by its business operations, vehicle research, and even its customers' cars.
7.2 HCF

HCF has adopted IBM's big-data analytics solution, including the Netezza big-data appliance, to better analyze claims in real time as they are made. This helps it detect fraud more easily and provide ailing members with information they might need to stay fit and healthy.

7.3 Klout

Klout's job is to create insights from the vast amounts of data coming in from the 100 million social-network users indexed by the company, and to provide those insights to customers. For example, Klout might provide information on how certain people's influence on social networks (their Klout score) might affect word-of-mouth advertising, or provide information on changes in demand. To deliver this analysis on a shoestring, Klout built custom infrastructure on Apache Hadoop, with a separate data silo for each social network.

7.4 Mitsui Knowledge Industry

Mitsui analyses genomes for cancer research. Using HANA, R, and Hadoop to pre-process DNA sequences, the company was able to shorten genome-analysis time from several days to 20 minutes.

7.5 Nokia

Nokia is using Apache Hadoop and Cloudera's CDH to pull unstructured data (generated by its phones around the world) into a structured environment, to create 3D maps that show traffic, including speed categories, elevation, current events, and video.
7.6 WalMart

WalMart uses Muppet, a product it bought, as well as Hadoop, to analyze social-media data from Twitter, Facebook, Foursquare, and other sources. Among other things, this allows WalMart to analyze in real time which stores will have the biggest crowds, based on Foursquare check-ins.

8. BIG DATA – CHANGING THE WORLD

Computers are leaner, meaner, and cheaper than ever before. With computing power no longer at a premium, we're swimming in numbers that describe everything from how a small town in Minnesota behaves during rush hour to the probability of a successful drone strike in Yemen. The advent of so-called "big data" means that companies, governments, and organizations can collect, interpret, and wield huge stores of data to an amazing breadth of ends. From shoe shopping to privacy concerns, here's a look at five ways big data is changing the world.

8.1 Data as a deadly weapon

The traditional battlefield has dissolved into thin air. In the big data era, information is the deadliest weapon, and leveraging massive amounts of it is this era's arms race. But current military technology is buckling under the sheer weight of data collected from satellites, unmanned aircraft, and more traditional means. As part of the Obama administration's Big Data Initiative, the Department of Defense launched XDATA, a program that intends to invest $25 million in systems that analyze massive data sets in record time. With more efficient number crunching, the U.S. military can funnel petabytes of data toward cutting-edge advances, like making unmanned drones smarter and deadlier than ever.

8.2 Saving the Earth

Beyond powering predator drones and increasing retail revenue, big data can do a literal world of good. Take Google Earth Engine, an open big-data platform that allowed researchers to produce the first high-resolution map of Mexico's forests.
The map would have taken a traditional computer over three years to construct, but using Google Earth Engine's massive data cloud it was completed in the course of a day. Massive data sets like this can help us understand environmental threats on a systemic level. The more data we have about the changing face of the earth's ecosystems and weather patterns, the better we can model future environmental shifts -- and figure out how to stop them while we still can.

8.3 Watching you shop
Big data can mean big profits. By understanding what you want to buy today, companies large and small can figure out what you'll want to buy tomorrow -- perhaps even before you do. Online retailers like Amazon scoop up information about our shopping and e-window-shopping habits on a huge scale, but even brick-and-mortar retailers are starting to catch on. A clever company called RetailNext helps companies like Brookstone and American Apparel record video of shoppers as they browse and buy. By transforming a single shopper's path into as many as 10,000 data points, companies can see how shoppers move through a store, where they pause, and how that tracks with sales.

8.4 Scientific research in overdrive
Data has long been the cornerstone of scientific discovery, and with big data -- and the big computing power necessary to process it -- research can move at an exponential clip. Take the Human Genome Project, widely considered one of the landmark scientific accomplishments in human history. Over the course of the $3 billion project, researchers spent 13 years analyzing and sequencing the roughly 25,000 genes that make up the human genome. With today's methods of data collection and analysis, the same process can be completed in hours -- by a device the size of a USB memory stick, for less than $1,000.
8.5 Big data, bigger privacy concerns
You might just be a number in the grand scheme of things, but that adage isn't as reassuring as it used to be. It's true that big data is about breadth, but it's about depth, too. Web mega-companies like Facebook and Google not only scoop up data on a huge number of users -- 955 million, in Facebook's case -- but collect an incredible depth of data as well. From what you search and where you click to who you know (and who they know, and who they know), the web's biggest players own data stockpiles so robust that they border on omniscient. Where technological power, cultural advancement and profit intersect, one thing is clear: with big data comes even bigger responsibility.

9. DEPLOYMENT CONSIDERATIONS
We have explored the nature of big data and surveyed its landscape from a high level. As usual, when it comes to deployment there are dimensions to consider over and above tool selection.

9.1 Cloud or In-house
The majority of big-data solutions are now provided in three forms: software-only, as an appliance, or cloud-based. The decision between these routes will depend, among other things, on issues of data locality, privacy and regulation, human resources, and project requirements. Many organizations opt for a hybrid solution, using on-demand cloud resources to supplement in-house deployments.

Cloud computing and big data
Experts in the IT industry, including those in cloud computing and big data, agree that a flexible and fast IT infrastructure is needed to support big data. The cloud removes the infrastructure challenges, provides the necessary speed and adds scalability. However, four areas must still be investigated more deeply: store and process, stewardship, sense-making, and security.

A significant change in cloud computing
Traditionally, cloud computing operates in three primary layers: Software as a Service, Platform as a Service and Infrastructure as a Service.
However, the architecture of big data adds another layer to the stack, one concerned with analyzing and managing big data. It includes binding concepts such as lineage, pedigree and provenance. Big data is complex and comes with daunting challenges, and considerable organizational balance is required for success. For organizations to harness big data effectively, they must change their business processes, implement multiple technologies and give their workforce relevant training.

9.2 Skills shortages
Even if a company decides to go down the big-data path, it may be difficult to hire the right people. The data scientist requires a unique blend of skills: a strong statistical and mathematical background, a good command of statistical tools such as SAS, SPSS or the open-source R, and an ability to detect patterns in data (like a data-mining specialist), all backed by the domain knowledge and communication skills to understand what to look for and how to deliver it.

9.3 Privacy
Tracking individuals' data in order to sell to them better is attractive to a company, but not necessarily to the consumer being sold the products. Not everyone wants an analysis carried out on their lives, and depending on how privacy regulations develop -- which is likely to vary from country to country -- companies will need to be careful about how invasive their big-data efforts are, including how they collect data. Regulations could lead to fines for invasive policies, but perhaps the greater risk is loss of trust.
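The pattern-detection skill described under "Skills shortages" often starts with something as simple as a correlation between two business metrics. As a toy, hedged illustration -- the ad-spend and revenue figures below are invented, and real analyses would use tools like R, SAS or SPSS rather than hand-rolled code -- a Pearson correlation can be computed from first principles in Python:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical weekly ad spend vs. revenue per customer.
ad_spend = [10.0, 12.0, 15.0, 18.0, 22.0, 25.0]
revenue = [101.0, 104.0, 109.0, 113.0, 120.0, 124.0]

r = pearson(ad_spend, revenue)
print(f"correlation: {r:.3f}")  # close to +1: a strong positive pattern
```

A value near +1 or -1 flags a candidate relationship; the data scientist's harder job, as the section notes, is supplying the domain knowledge to judge whether the pattern is meaningful and how to act on it.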
9.4 Security
Individuals trust companies to keep their data safe. However, because big data is such a new area, products haven't been built with security in mind, despite the fact that the large volumes of data stored mean there is more at stake than ever before if data goes missing.

9.5 Big Data is messy
It's not all about infrastructure. Big-data practitioners consistently report that 80% of the effort involved in dealing with data is cleaning it up in the first place.

9.6 Big Data is big
It is a fundamental fact that data too big to process conventionally is also too big to transport anywhere. Even when the data isn't too big to move, locality can still be an issue, especially with rapidly updating data.

9.7 Culture
The phenomenon of big data is closely tied to the emergence of data science, a discipline that combines math, programming and scientific instinct. Benefiting from big data means investing in teams with this skill set, and surrounding them with an organizational willingness to understand and use data for advantage.

9.8 Pitfalls
9.8.1 Do you know where your data is?
It's no use setting up a big-data product for analysis only to realize that critical data is spread across the organization in inaccessible and possibly unknown locations.

9.8.2 A lack of direction
"Collecting and analyzing the data is not enough; it must be presented in a timely fashion, so that decisions are made as a direct consequence that has a material impact on the productivity, profitability or efficiency of the organization. Most organizations are ill prepared to address both the technical and management challenges posed by big data; as a direct result, few will be able to effectively exploit this trend for competitive advantage." Unless firms know what questions they want to answer and what business objectives they hope to achieve, big-data projects won't bear fruit.

10.
CONCLUSION
Finally, remember that big data is no panacea. You can find patterns and clues in your data, but first decide what problem you want to solve. If you pick a real business problem, such as how to change your advertising strategy to increase spend per customer, it will guide your implementation. While big-data work benefits from an enterprising spirit, it also benefits strongly from a concrete goal.

As you explore what's new across the spectrum of big-data capabilities, we suggest that you think about their integration into your existing infrastructure and BI investments. For example: align new operational and management capabilities with standard IT, build for enterprise scale and resilience, unify your database and development paradigms as you embrace Open Source, and share metadata wherever possible for both integration and analytics. Last but not least, expand IT governance to include a big-data center of excellence to ensure business alignment, grow your skills, manage Open Source tools and technologies, share knowledge, establish standards and manage best practices.
Fig 11: McKinsey Survey -- Corporates vs. Big Data

"Experience Certainty" -- big data is imperative for corporates to face the future.

Scale-Out Storage Systems -- Hadoop Technology Stack and Services
Corporates need strong partnerships with storage vendors and involvement in the architecture of large data centers with big-data storage requirements. Most scale-out storage solutions today include Hadoop as part of the stack.

BI, Advanced and Predictive Analytics
Corporates need strong capability in Business Intelligence, Data Warehousing and Advanced Analytics. This experience is built around industry-leading products and advanced and predictive analytics solutions, as in the cases of the "Listening Platform for Social Media" and "Supply Chain Predictive Analytics".

Vertical Domain Experience
Corporates need deep knowledge of the business imperatives of semiconductor, computer platform, consumer electronics and software product companies. This knowledge in turn helps set the right patterns for advanced analytics and define the correct rules for big-data analytics.

What can be done?
The scarcity of Big Data and Hadoop knowledge creates a gap between requirements and resource availability. It can be addressed by choosing interested associates and training them properly, creating a larger pool of associates with big-data expertise available for the future.