How many in the room are executing on data analytics? How many of you are reaping benefits from data analytics to make intelligent decisions?
Classically, there are three major levels of management and decision making within an organization: operational, tactical and strategic (see Figure 1). While these levels feed one another, they are essentially distinct. Operational data deals with day-to-day operations. Tactical data deals with medium-term decisions. Strategic data deals with long-term decisions. Decision making changes as one goes from level to level. At the operational level, decisions are structured. This means they are based on rules. (A credit card charge may not exceed the customer's credit limit.) At the tactical level, decisions are semi-structured. (Did we meet our branch quota for new loans this week?) Strategic decisions are unstructured. (Should a bank lower its minimum balances to retain more customers and acquire more new customers?)
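The operational example in parentheses (a charge may not exceed the credit limit) is the kind of decision that reduces to a single rule. A minimal sketch, assuming hypothetical names (`approve_charge`, `balance`, `credit_limit`) that are illustrative and do not come from the slides:

```python
from decimal import Decimal


def approve_charge(balance: Decimal, credit_limit: Decimal, charge: Decimal) -> bool:
    """Structured (rule-based) decision: approve only if the limit is not exceeded."""
    if charge <= 0:
        raise ValueError("charge must be positive")
    # The whole decision is one comparison; no judgment is involved.
    return balance + charge <= credit_limit
```

Tactical and strategic questions do not reduce to a fixed rule like this, which is why they lean on analytics rather than on transaction logic.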
Big data analytics is an area of rapidly growing diversity. Big data analytics is more emergent and multifaceted, but less understood by the IT generalist. Development of Big data analytics processes has been driven historically by the web. However, the rapid growth of applications for Big data analytics is taking place in all major vertical industry segments and now represents a growth opportunity to vendors that's worth all the hype.

Therefore, trying to define it is probably not helpful. What is helpful, however, is identifying the characteristics that are common to the technologies now identified with Big data analytics. These include:
• The perception that traditional data warehousing processes are too slow and limited in scalability
• The ability to converge data from multiple data sources, both structured and unstructured
• The realization that time to information is critical to extract value from data sources that include mobile devices, RFID, the web and a growing list of automated sensory technologies
Animations to go away.
How many of you are reaping benefits to make intelligent decisions from Data Analytics?
How many are at this level (slide 9)? – Intermediary 1
So how many of you are at this level (slide 10)? – Advanced
Now how many are doing this (slide 10) with all available data (no sampling)?
We need something for people who are in advanced – They learn:
1. what are other tools available for advanced analytics…
Methodology and thought process
Credibility
HP capabilities, solution to meet their advanced analytics requirement
Analytics Big data: Business intelligence is scaling out beyond its traditional boundaries to "every corner of the enterprise," from point-of-sale terminals to HR to, of course, IT. The role of data warehousing for IT, or "big data," is emerging as a core focus for both vendors and IT adopters seeking more effective ways to apply mature data warehousing techniques to the business of IT. One of the more interesting emerging areas is social data analytics, both for IT and beyond IT, as businesses seek to apply techniques such as sentiment analysis, geo-location, behavioral, social graph, and rich media social data to better understand everything from customer likes and dislikes and more effective risk management, to leveraging social media within IT as a foundation for problem resolution and requirements definitions.

Advanced Threat Intelligence: As targeted threats continue to flourish and increase in sophistication, the requirements for better information gathering and data-driven security are self-evident. These requirements go far beyond looking at isolated denial of service or virus issues to broader situational analysis. This is another application of big data, but one which may also run into privacy issues as advanced threat intelligence expands its reach.

Advanced Performance-to-Business Management Analytics: Parallel but fundamentally distinct advances in analytics as applied to service, application and infrastructure performance management are also becoming significant game changers in 2012. With new solutions from both platforms and smaller suite providers automating insights into cross-domain performance interdependencies, across what sometimes become hundreds of different sources (or many hundreds of thousands depending on how it's measured), the chances for IT to break through the insoluble areas of triage are more promising than ever.
Given the many advances in this area (and multiple analyst predictions in this space), it's worth noting a few distinct areas within this broader direction:

User Experience Management (UEM) has come into its own, and cloud has helped it along as an ultimate point of IT governance. Along with application performance insights, UEM may also explore business process and business behavior impacts, as well as shed light on how customers actually use IT services, perhaps the biggest single gap in running IT as a business.

Executive Dashboards will thrive atop these advancing trends, and some will also have roots in data warehousing.

Application Discovery and Dependency Mapping and the modeling it can deliver in connection with Melds: Capacity planning, performance, and business impact are all beginning to intersect in analytic "melds" across domains with both real-time and historical/trending values.

Network: Applications and services all come together over the network, and network management will continue to drive forward with "application-aware" solutions with more powerful capabilities for leveraging application flows for performance, capacity, and even governance and compliance requirements. Along with this, EMA predicts the rise of next generation network management platforms, optimized to support virtualized infrastructures, more rapid deployment, and the consolidation of roles that EMA has documented with the advent of cloud computing.

Predictive Analytics in Support of Automation: While automation deserves its own heading, the relation between predictive analytics and automation technologies, from Workload Automation (WLA) to IT process automation (or run book), will continue to transform the automation landscape. Another, not unrelated, transformative factor will continue to be service modeling from the CMDB/CMS, as modeled interdependencies and the policies around them begin to advance in defining automation routines and associating them with larger processes.
Big data storage is related in that it also aims to address the vast amounts of unstructured data fueling data growth at the enterprise level. But the technologies underpinning Big data storage, such as scale-out NAS and object-based storage, have existed for a number of years and are relatively well understood.

At a very simplistic level, Big data storage is nothing more than storage that handles a lot of data for applications that generate huge volumes of unstructured data. This includes high-definition video streaming, oil and gas exploration, genomics -- the usual suspects. A marketing executive at a large storage vendor that has yet to make a statement and product introduction told me his company was considering “Huge Data” as a moniker for its Big data storage entry.
Scale horizontally (scale out)

To scale horizontally (or scale out) means to add more nodes to a system, such as adding a new computer to a distributed software application. An example might be scaling out from one Web server system to three.

As computer prices drop and performance continues to increase, low-cost "commodity" systems can be used for high-performance computing applications such as seismic analysis and biotechnology workloads that could in the past only be handled by supercomputers. Hundreds of small computers may be configured in a cluster to obtain aggregate computing power that often exceeds that of single traditional RISC-processor-based scientific computers. This model has further been fueled by the availability of high-performance interconnects such as Myrinet and InfiniBand technologies. It has also led to demand for features such as remote maintenance and batch processing management previously not available for "commodity" systems.

The scale-out model has created an increased demand for shared data storage with very high I/O performance, especially where processing of large amounts of data is required, such as in seismic analysis. This has fueled the development of new storage technologies such as object storage devices.

Scale-out solutions for database servers generally seek to move toward a shared-nothing architecture, following the sharding path blazed by Google.
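The idea behind sharding in a shared-nothing design can be sketched in a few lines: each key is routed to exactly one node, and no node shares storage with another. This is a minimal illustration, not any vendor's implementation; `shard_for` and `ShardedStore` are hypothetical names, and plain dictionaries stand in for independent database nodes.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Route a key to a shard using a stable hash.

    Python's built-in hash() is salted per process, so a digest is used
    to keep the routing identical across runs and machines.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy key-value store; each dict stands in for one independent node."""

    def __init__(self, num_shards: int) -> None:
        self.num_shards = num_shards
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, self.num_shards)][key] = value

    def get(self, key: str):
        # Only the owning shard is consulted; the others are never touched.
        return self.shards[shard_for(key, self.num_shards)].get(key)
```

With simple modulo routing, changing the number of shards remaps most keys, which is one reason production systems often prefer consistent hashing or range-based partitioning.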
Next Generation Data Warehousing

The three leading, until recently independent, Next Generation Data Warehouse vendors – Vertica, Greenplum, and Aster Data – are upending the traditional enterprise data warehouse market with massively parallel, columnar analytic databases that deliver lightning-fast data loading and near real-time query capabilities. The latest iteration of the Vertica Analytic Platform, Vertica 5.0, for example, includes new elasticity capabilities to easily expand or contract deployments and a slew of new in-database analytic functions. Aster Data has pioneered a novel SQL-MapReduce framework, combining the best of both data processing approaches, while Greenplum’s unique collaborative analytic platform, Chorus, provides a social environment for Data Scientists to experiment with Big data. All three vendors experienced significant revenue growth over the last two to three years, with Vertica leading the way with an estimated $84 million in revenue in 2011, followed by Aster Data with $52 million, and Greenplum with $40 million.
HP Converged Infrastructure uses a common modular architecture resulting in a simpler, more automated, and integrated infrastructure that truly accelerates the business. Other solutions in the market loosely integrate systems and solutions. This results in continued silos and wasted resources, and puts you at a competitive disadvantage.

Gen 8 sources: Press release and brochure
CI sources: http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA3-3333ENW

We have more than 180,000 channel partners worldwide, including major and emerging software and hardware vendors and system integrators. Through our AllianceONE program, we work closely with these partners to deliver integrated solutions based on open standards. Each offering is tightly integrated and pre-tested, and brings together all the key hardware, software, and services components.

HP provides infrastructure roadmap, infrastructure design & build as well as advisory consulting.
• Determining the proper workload sizing and Hadoop configuration requires experience
• Applying generic workload sizing ignores your anticipated workload growth requirements
• Most guidelines ignore platform features and scalability levers
• How do you balance and scale the resources as you grow and evolve?
• What are the planning requirements to host the cluster in your data center?
• What may appear affordable at tens of nodes may not be at hundreds of nodes

Best Practice: Collaborate with your chosen vendor to properly size and configure based upon your anticipated needs
• Leverage vendor reference architectures, appliances and experience
• Utilize vendor guidelines, your anticipated needs and current projections, and pilot application lessons learned
• Size and configure with scalability, cost-performance and evolution needs in mind
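A back-of-the-envelope calculation shows why a cluster that looks affordable today may not stay that way as data grows. The sketch below is only a starting point for the vendor conversation recommended above, not a sizing method from the slides. The replication factor of 3 matches the HDFS default; the 25% reserve for temporary data, the 24 TB of usable disk per node, and the growth rate are assumptions chosen for illustration, and `nodes_needed` and `projected_nodes` are hypothetical helper names.

```python
import math


def nodes_needed(data_tb: float,
                 replication: int = 3,            # HDFS default replication factor
                 temp_overhead: float = 0.25,     # assumed share reserved for intermediate data
                 disk_per_node_tb: float = 24.0   # assumed usable disk per node
                 ) -> int:
    """Estimate data nodes required to hold data_tb of source data."""
    raw_tb = data_tb * replication / (1.0 - temp_overhead)
    return math.ceil(raw_tb / disk_per_node_tb)


def projected_nodes(data_tb: float, annual_growth: float, years: int, **kwargs) -> list:
    """Node count per year (year 0 first) under compound data growth."""
    return [nodes_needed(data_tb * (1.0 + annual_growth) ** year, **kwargs)
            for year in range(years + 1)]
```

For example, `projected_nodes(100, 0.5, 2)` estimates the node count today and after one and two years of 50% annual growth. Storage is only one dimension: CPU, memory, network and data center power and space need the same treatment.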
1 - Hadoop is a framework, not a solution – For many reasons, people expect Hadoop to answer Big data analytics questions right out of the box. For simple queries, this works. For harder analytics problems, Hadoop quickly falls flat and requires you to develop Map/Reduce code directly. For that reason, Hadoop is more like a J2EE programming environment than a business analytics solution.

2 - Hive and Pig are good, but do not overcome architectural limitations – Both Hive and Pig are very well thought-out tools that enable the lay engineer to quickly become productive with Hadoop. After all, Hive and Pig are two tools that translate analytics queries written in common SQL or text into Java Map/Reduce jobs that can be deployed in a Hadoop environment. However, there are limitations in the Map/Reduce framework of Hadoop that prohibit efficient operation, especially when you require inter-node communications (as is the case with sorts and joins).

3 - Deployment is easy, fast and free, but very costly to maintain and develop – Hadoop is very popular because within an hour, an engineer can download, install, and issue a simple query. It’s also an open source project, so there are no software costs, which makes it a very attractive alternative to Oracle and Teradata. The true costs of Hadoop become obvious when you enter the maintenance and development phase. Since Hadoop is mostly a development framework, Hadoop-proficient engineers are required to develop an application as well as optimize it to execute efficiently in a Hadoop cluster. Again, it’s possible but very hard to do.

4 - Great for data pipelining and summarization, horrible for ad hoc analysis – Hadoop is great at analyzing large amounts of data and summarizing or “data pipelining” to transform the raw data into something more useful for another application (like search or text mining); that’s what it’s built for. However, if you don’t know the analytics question you want to ask or if you want to explore the data for patterns, Hadoop becomes unmanageable very quickly. Hadoop is very flexible at answering many types of questions, as long as you spend the cycles to program and execute MapReduce code.

5 - Performance is great, except when it’s not – By all measures, if you want speed and need to analyze large quantities of data, Hadoop allows you to parallelize your computation across thousands of nodes. The potential is definitely there. But not all analytics jobs can easily be parallelized, especially when user interaction drives the analytics. So, unless the Hadoop application is designed and optimized for the question that you want to ask, performance can quickly become very slow, as each map/reduce job has to wait until the previous jobs are completed. Hadoop is always as slow as the slowest MapReduce job.
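To make "developing Map/Reduce code directly" concrete, here is the programming model reduced to a single-process word count. This is a sketch of the model only; it does not use the Hadoop API, and the function names are illustrative. In a real cluster the map and reduce functions run on many nodes and the shuffle step moves data between them.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values by key (network traffic in a real cluster)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

The shuffle between the two phases is where the inter-node communication cost mentioned in point 2 shows up, and a question that needs several such passes has to chain jobs one after another, which is the waiting described in point 5.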
Characteristic: MapReduce

Data volumes
• Can handle petabytes (or possibly scale up to greater orders of magnitude)

Performance and scalability
• Automatic parallelization allows linear scaling, even with greater numbers of nodes
• Communication (phase switch from Map to Reduce) is potential performance bottleneck
• When application is not collocated with the data, the channel for loading data into the application becomes a potential bottleneck
• Incrementally adding nodes is easy

Data integration
• Supports structured, unstructured, and streaming data
• Potentially high communication cost at transition between Map and Reduce phases

Fault tolerance
• MapReduce model is designed to withstand failure without restarting the process, with exception of name node
• MapReduce often involves larger cluster of 50 or more

Characteristic: In-database analytics

Data volumes
• Can handle terabytes and can scale to petabytes

Performance and scalability
• Designed for rapid access for analytic purposes (queries, reports, OLAP)
• Shared-nothing approach provides eminent scalability
• Direct operation on compressed columnar data improves performance
• Compression decreases amount of data to be paged in and out of memory, and consequently, disk I/O

Data integration
• Supports structured data
• Supports real time analytics
• Less amenable to integration with unstructured data

Fault tolerance
• Generally assume infrequent failures. Small and medium size clusters are less likely to experience failures
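The point about operating directly on compressed columnar data can be illustrated with run-length encoding. The slides do not name a specific encoding, so RLE is an assumption here, and the function names are hypothetical. The sketch computes aggregates from the compressed form without ever expanding it.

```python
from itertools import groupby
from typing import List, Tuple


def rle_encode(column: List[int]) -> List[Tuple[int, int]]:
    """Compress a column into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_sum(encoded: List[Tuple[int, int]]) -> int:
    """SUM over the column, computed on the compressed runs."""
    return sum(value * count for value, count in encoded)


def rle_count_equal(encoded: List[Tuple[int, int]], target: int) -> int:
    """COUNT of rows equal to target, computed on the compressed runs."""
    return sum(count for value, count in encoded if value == target)
```

Each aggregate touches one pair per run rather than one value per row, so sorted or low-cardinality columns benefit most, both in the memory paged in and in the work done.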
The Big data market is on the verge of a rapid growth spurt that will see it top the $50 billion mark worldwide within the next five years. As of early 2012, the Big data market stands at just over $5 billion based on related software, hardware, and services revenue. Increased interest in and awareness of the power of Big data and related analytic capabilities to gain competitive advantage and to improve operational efficiencies, coupled with developments in the technologies and services that make Big data a practical reality, will result in a super-charged CAGR of 58% between now and 2017. As explained in our Big data Manifesto, Big data is the new definitive source of competitive advantage across all industries. For those organizations that understand and embrace the new reality of Big data, the possibilities for new innovation, improved agility, and increased profitability are nearly endless.

Check this web site: http://wikibon.org/wiki/v/Big_Data_Market_Size_and_Vendor_Revenues
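The quoted growth figures can be checked against each other with the standard compound annual growth rate formula. The inputs below are the rounded numbers from the text ("just over $5 billion" and "top the $50 billion mark"), so this is a consistency check rather than a reproduction of the analyst's model; the function names are illustrative.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


def project(start_value: float, rate: float, years: int) -> float:
    """Value after compounding at the given annual rate."""
    return start_value * (1.0 + rate) ** years


implied_rate = cagr(5.0, 50.0, 5)      # growth needed to go from $5B to $50B in 5 years
value_at_58 = project(5.0, 0.58, 5)    # where $5B lands after 5 years at 58%
```

Growing tenfold in five years implies a rate of a little over 58% per year, and compounding $5 billion at 58% lands just under $50 billion, so the three figures in the paragraph are mutually consistent once "just over $5 billion" is taken into account.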
• Time Series Analysis – For example, in the financial services industry, quantitative analysts can develop MapReduce applications that use the time-series data in the analytical DBMS to look for profitable trading patterns.
• Continuous Aggregation – The aggregations resulting from the MapReduce application are managed within a high-performance database for analysis or even operational purposes. This enables analysts to drill down at different levels of aggregation.
• ETL – It is often said that the bulk of the work of instituting a data warehouse involves data extraction, integration, and consolidation. A large part of that effort involves extraction, transformation, and loading (ETL) of data into the warehouse.
• Real-time embedded analytics – From enhancing operational activities to complex event processing, combining the results of analytics with continuous applications can add value to the bottom line.
• Large-scale graph and network analysis – Social network environments demonstrate the utility of managing connectivity.

• Data volumes – The analysis platform must be able to absorb and handle larger volumes.
• Performance – Scale-out infrastructure in proportion to the computational, network bandwidth, and storage resources.
• Data integration – Combining of both structured and unstructured data.
• Fault tolerance – It is desirable to enable recovery from a failure without having to restart the entire process.
• Heterogeneity – Resource allocation and usage by scaling out using homogeneous or heterogeneous systems.
• Knowledge delivery – Support the computational needs to deliver and present the actionable results.
• Latency – The time from when data is recorded to when questions are answered is critical.
Machine data or “data exhaust” analysis is one of the fastest growing segments of “big data”. It is generated by websites, applications, servers, networks, mobile devices and other sources. The goal is to aggregate, parse and visualize this data (log files, scripts, messages, alerts, changes, IT configurations, tickets, user profiles, etc.) to spot trends and act.

A new breed of startups is racing to convert “invisible” machine data into useful performance insights by monitoring and analyzing data ranging from customer clickstreams, transactions and log files to network activity and call records. The label for this type of analytics is operational or application performance intelligence.

• Web log file analysis (who is visiting my website?)
• Sentiment analysis (what are customers saying about me?)
• Recommendation engines (what are my customers/visitors likely to buy?)
• Ad targeting (which ads will appeal to a specific viewer?)
• Risk modeling (what is the default risk of my credit card holders?)
• Customer churn analysis (why are my customers leaving?)
• Web crawling (traditional web search)
• Predictive analytics (what predictions can I make based on my data?)
• Ad infinitum…
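The "aggregate, parse and spot trends" loop for web log file analysis can be sketched with the standard library alone. The example assumes log lines in the widely used Common Log Format; the `summarize` function and the choice of metrics (requests per visitor, server errors per path) are illustrative and not taken from the slides.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def summarize(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Return (requests per visitor host, 5xx errors per path)."""
    visitors: Counter = Counter()
    errors: Counter = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        visitors[match["host"]] += 1
        if match["status"].startswith("5"):
            errors[match["path"]] += 1
    return visitors, errors
```

Calling `visitors.most_common(10)` on the result answers "who is visiting my website?" for a single file. At machine-data volumes the same parse-and-count logic is what gets distributed across a cluster.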
What is the Big data Workshop?
• HP Big data Strategy Workshop: HP offers guidance right from the start. We work with you to address all your big data challenges: volume, variety, velocity, and value of data
• In this 3-day workshop, we work with you to discover your current data sources, including business and technical requirements
• We help you architect your business intelligence (BI) platform, beginning with guiding sound decision making around new technology
• As part of the HP Big data Strategy Workshop, our subject-matter experts take a holistic approach with key stakeholders involved in your BI and storage infrastructure initiative. During this three-day workshop, we can help you understand big data benefits and challenges, and how to address your challenges with available technologies and solutions

What problems does it solve?
• Sorting out how to harness Big data, which is a rich repository of information but comes with variety, velocity and volume challenges. Traditional tools won’t mine that information, leaving customers poor in information but awash in data.
• Guidance on addressing the problem of organizing and protecting data assets by efficiently storing huge amounts of data while also making that data secure and accessible.
• Grappling with understanding the impact of rapid growth in structured and unstructured data, and the evolution of big data analytics projects impacting other storage areas, such as data management, backup and recovery, data security and compliance

What are the benefits?
• Understand the big data landscape and its challenges, benefits and critical success factors
• Define or refine your big data strategy to include your unique requirements
• Discover and uncover the hidden potential of unstructured data
• Set your overall big data strategy to create a roadmap of recommendations and initiatives
• Integrate structured and unstructured data in enterprise search systems data collections
• Focus on how and when certain elements of Hadoop can be used to process data volumes
• Improve your ability to make intelligent decisions through advanced exploratory analytics
• Leverage use cases to determine when and how big data needs to be protected, archived, and secured
Transcript of "KB Ramesh - TB2957 - Real-time, big data analytics "
Products in this solution: IDOL + Vertica + Hadoop
The ideal platform for social graphing and analytics
[Slide diagram labels: Executive Dashboard, OEM, Explore, Mobile, Semi, Human, Structured, Structured, Extreme, Social, Connectors]