Big Data Analytics: Beyond Beer and Diapers
2012/2/22
Kai Zhao @Teradata, email@example.com
by Kai Zhao 2011.12
Disclaimer: Any views or opinions presented in this article are solely those of the author and do NOT necessarily represent those of Teradata or other companies.
Contents
Background: Traditional Business Intelligence (BI)
    What is Big Data
    What is Big Data Analytics
    Big Data Analytics: State of the Art
Big Data Analytics Technology Stack
    ETL/ELT/ETLT (Demo)
    MPP Data Warehouse
    Map Reduce
    NoSQL
    Web Service
    Data Analytics
    Data Visualization
    BI Tools (Demo)
Big Data Analytics Platform Architecture
Cloud computing is surging, business intelligence is on the rise, and big data analytics is imperative: it is time for BIG DATA.
Shared-nothing Massively Parallel Processing (MPP)
Petabyte Scaling
In-database Analytics
What is Big Data
Volume: The increase in data volumes within enterprise systems is caused by transaction volumes and other traditional data types, as well as by new types of data. Too much volume is a storage issue, but too much data is also a massive analysis issue.
Variety: IT leaders have always had an issue translating large volumes of transactional information into decisions. Now there are more types of information to analyze, mainly coming from social media and mobile (context-aware). Variety includes tabular data (databases), hierarchical data, documents, e-mail, metering data, video, still images, audio, stock ticker data, financial transactions and more.
Velocity: This involves streams of data, structured record creation, and availability for access and delivery. Velocity means both how fast data is being produced and how fast the data must be processed to meet demand.
What is Big Data (cont.)
Broadly speaking, Big Data is generated by a number of sources, including:
Social Networking and Media: There are currently over 700 million Facebook users, 250 million Twitter users and 156 million public blogs. Each Facebook update, Tweet, blog post and comment creates multiple new data points, whether structured, semi-structured or unstructured, sometimes called Data Exhaust.
Mobile Devices: There are over 5 billion mobile phones in use worldwide. Each call, text and instant message is logged as data. Mobile devices, particularly smart phones and tablets, also make it easier to use social media and other data-generating applications. Mobile devices also collect and transmit location data.
Internet Transactions: Billions of online purchases, stock trades and other transactions happen every day, including countless automated transactions. Each creates a number of data points collected by retailers, banks, credit cards, credit agencies and others.
Networked Devices and Sensors: Electronic devices of all sorts, including servers and other IT hardware, smart energy meters and temperature sensors, all create semi-structured log data that record every action.
What is Big Data Analytics
See video: Big Data Visualization
Big Data Analytics: State of the Art
Acquisitions and Investments
Big Data Vendors and Their Products
Forrester Report
Gartner Report
Acquisitions and Investments

Acquirer    Acquiree (Est. date)    Date of Acq.    Deal
Teradata    AsterData (2005)        2011.3.3        $0.263 billion
HP          Vertica (2005)          2011.2.14       $1.2 billion
IBM         Netezza (2000)          2010.11.11      $1.7 billion
EMC         Greenplum (2003)        2010.7.6        $0.1~0.15 billion
SAP         Sybase                  2010.5.12       $0.58 billion
Summary: Traditional data warehouse vendors need Big Data analytics technology.

Investee       Investment
Cloudera       $76 million
MapR           $29 million
Hortonworks    $50 million
Datameer       $10 million
Summary: New Big Data analytics startups.

Source: http://www.leiphone.com/why-2012-the-year-of-hadoop.html
Big Data Vendors and Their Products
Source: http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond
Big Data Analytics Technology Stack
Data Import
Data Storage
Data Computing
Data Analytics
XXX as a Service
ETL/ELT/ETLT
Extract: The process by which data is extracted from the data source.
Transform: The transformation of the source data into a format relevant to the solution.
Load: The loading of data into the warehouse.
This approach to data warehouse development is the traditional and widely accepted approach. The following diagram illustrates each of the individual stages in the process.
ETL
[Diagram: the individual stages of the ETL process]
Source: Robert J Davenport, ETL vs ELT: A Subjective View
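The Extract, Transform and Load stages can be sketched in a few lines of Python. This is a minimal illustration only, not the tooling used in the demo: the sample CSV data, the `orders` table and the function names are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real source would be an operational system.
SOURCE_CSV = """order_id,customer,amount,currency
1,alice,10.50,USD
2,bob,,USD
3,carol,7.25,USD
"""

def extract(text):
    """Extract: read raw rows from the data source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: keep only the fields the solution needs and drop incomplete rows."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip rows with a missing amount
        cleaned.append(
            (int(row["order_id"]), row["customer"].title(), float(row["amount"]))
        )
    return cleaned

def load(conn, records):
    """Load: write the transformed records into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

def run_etl(conn, text=SOURCE_CSV):
    """Run the three stages in order; transformation happens outside the database."""
    load(conn, transform(extract(text)))

warehouse = sqlite3.connect(":memory:")  # stand-in for the data warehouse
run_etl(warehouse)
print(warehouse.execute("SELECT customer, amount FROM orders ORDER BY order_id").fetchall())
```

Note that the `currency` column never reaches the warehouse: only data relevant to the target output is loaded, which is the "targeted data" property of ETL.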
ETL
Strengths
Development Time: Designing from the output backwards ensures that only data relevant to the solution is extracted and processed, potentially reducing development, extract, and processing overhead, and therefore time.
Targeted Data: Due to the targeted nature of the load process, the warehouse contains only data relevant to the presentation.
Administration Overhead: Reduced warehouse content simplifies the security regime implemented and hence the administration overhead.
Tools Availability: The large number of tools available that implement ETL provides flexibility of approach and the opportunity to identify the most appropriate tool. The proliferation of tools has led to a competitive functionality war, which often results in loss of maintainability.
Weaknesses
Flexibility: Targeting only relevant data for output means that any future requirements that need data not included in the original design will have to be added to the ETL routines. Due to the tight dependency between the routines developed, this often leads to a need for fundamental re-design and development. As a result, this increases the time and costs involved.
Hardware: Most third party tools utilize their own engine to implement the ETL process. Regardless of the size of the solution, this can necessitate investment in additional hardware to run the tool’s ETL engine.
Skills Investment: The use of third party tools to implement ETL processes compels the learning of new scripting languages.
Learning Curve: Implementing a third party tool that uses foreign processes and languages brings the learning curve that is implicit in all technologies new to an organization, and can often lead to following blind alleys in their use due to lack of experience.
ELT
Whilst this approach to the implementation of a warehouse appears on the surface to be similar to ETL, it differs in a number of significant ways. The following diagram illustrates the process.
ELT
Strengths
Project Management: Being able to split the warehouse process into specific and isolated tasks enables a project to be designed on a smaller task basis, therefore the project can be broken down into manageable chunks.
Flexible & Future Proof: In general, in an ELT implementation all data from the sources are loaded into the warehouse as part of the extract and load process. This, combined with the isolation of the transformation process, means that future requirements can easily be incorporated into the warehouse structure.
Risk Minimization: Removing the close interdependencies between each stage of the warehouse build process enables the development process to be isolated, and the individual process design can thus also be isolated. This provides an excellent platform for change, maintenance and management.
Utilize Existing Hardware: In implementing ELT as a warehouse build process, the inherent tools provided with the database engine can be used. Alternatively, the vast majority of the third party ELT tools available employ the database engine’s capability, and hence the ELT process is run on the same hardware as the database engine underpinning the data warehouse, using the existing hardware deployed.
Utilize Existing Skill Sets: By using the functionality provided by the database engine, the existing investments in database skills are re-used to develop the warehouse.
Weaknesses
Against the Norm: ELT is an emergent approach to data warehouse design and development. Whilst it has proven itself many times over through its abundant use in implementations throughout the world, it does require a change in mentality and design approach against traditional methods. To get the best from an ELT approach requires an open mind.
Tools Availability: Being an emergent technology approach, ELT suffers from a limited availability of tools.
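The ELT flow described above (load everything first, then transform inside the database engine) can be sketched as follows. This is a minimal illustration under stated assumptions: the raw rows, the `staging_orders` and `orders` tables and the function names are hypothetical, and in-memory SQLite stands in for the warehouse engine.

```python
import sqlite3

# Hypothetical raw source rows, loaded as-is: every column, every row, as text.
RAW_ROWS = [
    ("1", "alice", "10.50", "USD"),
    ("2", "bob", "", "USD"),
    ("3", "carol", "7.25", "USD"),
]

def extract_and_load(conn, rows):
    """Extract + Load: copy source data into a staging table without reshaping it."""
    conn.execute(
        "CREATE TABLE staging_orders "
        "(order_id TEXT, customer TEXT, amount TEXT, currency TEXT)"
    )
    conn.executemany("INSERT INTO staging_orders VALUES (?, ?, ?, ?)", rows)

def transform_in_database(conn):
    """Transform: an isolated step that runs inside the database engine as plain SQL."""
    conn.execute(
        """
        CREATE TABLE orders AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               customer,
               CAST(amount AS REAL) AS amount
        FROM staging_orders
        WHERE amount <> ''
        """
    )

warehouse = sqlite3.connect(":memory:")  # stand-in for the data warehouse
extract_and_load(warehouse, RAW_ROWS)
transform_in_database(warehouse)
```

Because the untouched rows stay in `staging_orders`, a later requirement (for example, reporting by `currency`) only needs a new SQL transformation, not a change to the extract and load step.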
Map Reduce: Hadoop
Comparison with MPP Data Warehouse.
Source: http://www.capgemini.com/technology-blog/2012/01/what-is-hadoop/
Map Reduce: Hadoop
[Diagram labels: Professional Service; Enterprise-grade Distribution; Hadoop Subscription Service; Database OLTP replacements: Teradata Aster/MongoDB; Hadoop Cluster Management; Data Integration with Hadoop; EDW; BI]
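For readers new to the Map Reduce programming model, the word-count sketch below shows its three phases in a single Python process. It is an illustration of the model only and does not use the Hadoop API; the function names and sample documents are hypothetical. A system such as Hadoop runs the map and reduce tasks in parallel across the nodes of a cluster.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["big data analytics", "Big Data is big"])
print(counts)
```

Each call to `map_phase` depends only on its own input record, and each call to `reduce_phase` only on one key, which is what allows the work to be spread over many machines.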
MPP Data Warehouse
Comparing MPP Data Warehouse with the Hadoop stack. (Draw a picture.)
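Data Visualization: It is VERY IMPORTANT to Attract Users
Source: 向怡宁, 打破陈规-数据及信息的可视化 (Breaking Convention: Visualizing Data and Information)
Data Visualization: It is VERY IMPORTANT to Compete
Source: 向怡宁, 打破陈规-数据及信息的可视化 (Breaking Convention: Visualizing Data and Information)
Data Visualization: It is VERY IMPORTANT for User Experience
Source: 向怡宁, 打破陈规-数据及信息的可视化 (Breaking Convention: Visualizing Data and Information)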
BI Tools
BI tools fall into three categories:
Query Tools: A query tool is software set up for users to ask questions about the data. The user can search for patterns or details.
Multidimensional Analysis Tools: A multidimensional analysis tool, also called Online Analytical Processing (OLAP), is software that allows the user to view the same data from different aspects. E.g. Business Objects, Hyperion, Cognos, MicroStrategy, Pentaho, Microsoft Analysis Services and Palo OLAP Server.
Data Mining Tools: A data mining tool is software that automatically searches data, seeking out ways that the data correlates to other data. E.g. SPSS Clementine, Weka3, R and Apache Mahout.
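The idea behind multidimensional analysis, viewing the same data from different aspects, can be sketched as a roll-up over chosen dimensions. This is a toy illustration rather than how any of the listed OLAP products work internally; the fact rows, dimension names and the `roll_up` function are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales).
FACTS = [
    ("North", "Beer", "Q1", 120),
    ("North", "Diapers", "Q1", 80),
    ("South", "Beer", "Q1", 60),
    ("South", "Diapers", "Q2", 90),
    ("North", "Beer", "Q2", 150),
]
DIMENSIONS = ("region", "product", "quarter")

def roll_up(facts, *dims):
    """Sum the sales measure, grouped by the chosen dimensions."""
    positions = [DIMENSIONS.index(dim) for dim in dims]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[pos] for pos in positions)
        totals[key] += row[-1]  # the last field is the sales measure
    return dict(totals)

# The same facts, viewed from two different aspects.
by_region = roll_up(FACTS, "region")
by_product_quarter = roll_up(FACTS, "product", "quarter")
print(by_region)
print(by_product_quarter)
```

An OLAP server typically precomputes or indexes such aggregates so that users can switch between views interactively.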
BI Tools List
Source: BI Tool Survey 2012, http://www.businessintelligencetoolbox.com/list-of-business-intelligence-bi-tools/
BI Tools: Gartner Evaluation
Business intelligence (BI) platforms enable all types of users, from IT staff to consultants to business users, to build applications that help organizations learn about and understand their business.
BI Demo: JasperSoft iReport
Demo Session: JasperSoft iReport