Big data analytics beyond beer and diapers

Some basic ideas about Big Data Analytics

Published in: Business, Technology

  1. 1. Big Data Analytics: Beyond Beer and Diapers
      2012/2/22 Kai Zhao @Teradata
      by Kai Zhao, 2011.12
      Disclaimer: Any views or opinions presented in this article are solely those of the author and do NOT necessarily represent those of Teradata or other companies.
  2. 2. ContentBackground: Traditional Business Intelligent(BI) What is Big Data What is Big Data Analytics Big Data Analytics: State of the ArtBig Data Analytics Technology Stack ETL/ELT/ETLT(Demo) MPP Data Warehouse Map Reduce NoSQL Web Service Data Analytics Data Visualization BI Tools(Demo)Big Data Analytics Platform Architecture
  3. 3. Cloud computing is surging, business intelligence is still on the rise, and big data analytics is now imperative: it is time for BIG DATA.
        • Shared-nothing Massively Parallel Processing (MPP)
        • Petabyte Scaling
        • In-database Analytics
  4. 4. Traditional Business Intelligence (BI)
  5. 5. What is Big Data
        • Volume: The increase in data volumes within enterprise systems is caused by transaction volumes and other traditional data types, as well as by new types of data. Too much volume is a storage issue, but too much data is also a massive analysis issue.
        • Variety: IT leaders have always had an issue translating large volumes of transactional information into decisions; now there are more types of information to analyze, mainly coming from social media and mobile (context-aware). Variety includes tabular data (databases), hierarchical data, documents, e-mail, metering data, video, still images, audio, stock ticker data, financial transactions and more.
        • Velocity: This involves streams of data, structured record creation, and availability for access and delivery. Velocity means both how fast data is being produced and how fast the data must be processed to meet demand.
  6. 6. What is Big Data (cont.)
      Broadly speaking, Big Data is generated by a number of sources, including:
        • Social Networking and Media: There are currently over 700 million Facebook users, 250 million Twitter users and 156 million public blogs. Each Facebook update, Tweet, blog post and comment creates multiple new data points, structured, semi-structured and unstructured, sometimes called Data Exhaust.
        • Mobile Devices: There are over 5 billion mobile phones in use worldwide. Each call, text and instant message is logged as data. Mobile devices, particularly smart phones and tablets, also make it easier to use social media and other data-generating applications. Mobile devices also collect and transmit location data.
        • Internet Transactions: Billions of online purchases, stock trades and other transactions happen every day, including countless automated transactions. Each creates a number of data points collected by retailers, banks, credit cards, credit agencies and others.
        • Networked Devices and Sensors: Electronic devices of all sorts, including servers and other IT hardware, smart energy meters and temperature sensors, create semi-structured log data that record every action.
  7. 7. What is Big Data Analytics
      See video: Big Data Visualization
  8. 8. Big Data Analytics: State of the Art
        • Acquisitions and Investments
        • Big Data Vendors and Their Products
        • Forrester Report
        • Gartner Report
  9. 9. Acquisitions and Investments

      | Acquirer | Acquiree (est. date) | Date of acq. | Deal |
      | --- | --- | --- | --- |
      | Teradata | AsterData (2005) | 2011.3.3 | $0.263 billion |
      | HP | Vertica (2005) | 2011.2.14 | $1.2 billion |
      | IBM | Netezza (2000) | 2010.11.11 | $1.7 billion |
      | EMC | Greenplum (2003) | 2010.7.6 | $0.1~0.15 billion |
      | SAP | Sybase | 2010.5.12 | $0.58 billion |

      Summary: Traditional data warehouse vendors need Big Data Analytics technology.

      | Investee | Investment |
      | --- | --- |
      | Cloudera | $76 million |
      | MapR | $29 million |
      | Hortonworks | $50 million |
      | Datameer | $10 million |

      Summary: New Big Data Analytics startups.
      Source:
  10. 10. Big Data Vendors and Their Products Source:,_Business_Analytics_and_Beyond
  11. 11. Forrester Report
  12. 12. Hype Cycle Source: Gartner
  13. 13. Gartner Report: Hype Cycle 2011 Source: Gartner
  14. 14. Big Data Analytics Technology Stack
        • Data Import
        • Data Storage
        • Data Computing
        • Data Analytics
        • XXX as a Service
  15. 15. ETL/ELT/ETLT
        • Extract: The process by which data is extracted from the data source
        • Transform: The transformation of the source data into a format relevant to the solution
        • Load: The loading of data into the warehouse
      This approach to data warehouse development is the traditional and widely accepted approach. The following diagram illustrates each of the individual stages in the process.
  16. 16. ETL
      This approach to data warehouse development is the traditional and widely accepted approach. The following diagram illustrates each of the individual stages in the process.
      Source: Robert J Davenport, ETL vs ELT: A Subjective View
  17. 17. ETL
      Strengths
        • Development Time: Designing from the output backwards ensures that only data relevant to the solution is extracted and processed, potentially reducing development, extract, and processing overhead, and therefore time.
        • Targeted Data: Due to the targeted nature of the load process, the warehouse contains only data relevant to the presentation.
        • Administration Overhead: Reduced warehouse content simplifies the security regime implemented and hence the administration overhead.
        • Tools Availability: The large number of tools available that implement ETL provides flexibility of approach and the opportunity to identify the most appropriate tool. The proliferation of tools has led to a competitive functionality war, which often results in loss of maintainability.
      Weaknesses
        • Flexibility: Targeting only relevant data for output means that any future requirements that need data not included in the original design will have to be added to the ETL routines. Due to the tight dependency between the routines developed, this often leads to a need for fundamental re-design and development, which increases the time and costs involved.
        • Hardware: Most third party tools utilize their own engine to implement the ETL process. Regardless of the size of the solution, this can necessitate investment in additional hardware to run the tool's ETL engine.
        • Skills Investment: The use of third party tools to implement ETL processes compels the learning of new scripting languages.
        • Learning Curve: Implementing a third party tool that uses foreign processes and languages brings the learning curve that is implicit in all technologies new to an organization, and can often lead to following blind alleys in their use due to lack of experience.
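The extract, transform, and load stages described above can be sketched in a few lines of Python. This is only a minimal illustration, not the approach of any particular ETL tool: the sample CSV, the `orders` table, and all function names are assumptions made for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; the column names are illustrative assumptions.
SOURCE_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,
3,carol,7.25
"""


def extract(text):
    """Extract: read raw rows from the data source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: keep only usable rows and convert them to the target format."""
    cleaned = []
    for row in rows:
        if not row["amount"]:  # drop rows with a missing amount
            continue
        cleaned.append(
            (int(row["order_id"]), row["customer"].title(), float(row["amount"]))
        )
    return cleaned


def load(conn, rows):
    """Load: write the already-transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(source_text):
    """Run the three stages in order; transformation happens outside the database."""
    conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
    load(conn, transform(extract(source_text)))
    return conn


warehouse = run_etl(SOURCE_CSV)
etl_result = warehouse.execute(
    "SELECT customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(etl_result)
```

Note that the warehouse only ever sees the cleaned rows: the incomplete record is discarded before loading, which is the "targeted data" strength and also the flexibility weakness listed above.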
  18. 18. ELT
      Whilst this approach to the implementation of a warehouse appears on the surface to be similar to ETL, it differs in a number of significant ways. The following diagram illustrates the process.
  19. 19. ELT
      Strengths
        • Project Management: Being able to split the warehouse process into specific and isolated tasks enables a project to be designed on a smaller task basis, therefore the project can be broken down into manageable chunks.
        • Flexible & Future Proof: In general, in an ELT implementation all data from the sources are loaded into the warehouse as part of the extract and load process. This, combined with the isolation of the transformation process, means that future requirements can easily be incorporated into the warehouse structure.
        • Risk Minimization: Removing the close interdependencies between each stage of the warehouse build process enables the development process to be isolated, and the individual process design can thus also be isolated. This provides an excellent platform for change, maintenance and management.
        • Utilize Existing Hardware: In implementing ELT as a warehouse build process, the inherent tools provided with the database engine can be used. Alternatively, the vast majority of the third party ELT tools available employ the database engine's capability, and hence the ELT process is run on the same hardware as the database engine underpinning the data warehouse, using the existing hardware deployed.
        • Utilize Existing Skill Sets: By using the functionality provided by the database engine, the existing investments in database skills are re-used to develop the warehouse.
      Weaknesses
        • Against the Norm: ELT is an emergent approach to data warehouse design and development. Whilst it has proven itself many times over through its abundant use in implementations throughout the world, it does require a change in mentality and design approach against traditional methods. To get the best from an ELT approach requires an open mind.
        • Tools Availability: Being an emergent technology approach, ELT suffers from a limited availability of tools.
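For contrast with ETL, here is a minimal ELT sketch in Python, again using an in-memory SQLite database as a stand-in for the warehouse. The table names `staging_orders` and `orders`, the sample rows, and the function names are assumptions for illustration. The raw data is loaded untouched, and the transformation is then expressed in SQL and executed by the database engine itself.

```python
import sqlite3

# Hypothetical raw source rows, loaded as-is (everything is still text).
RAW_ROWS = [
    ("1", "alice", "10.50"),
    ("2", "bob", ""),
    ("3", "carol", "7.25"),
]


def load_raw(conn, rows):
    """Extract + Load: copy all source data into a staging table unchanged."""
    conn.execute(
        "CREATE TABLE staging_orders (order_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", rows)


def transform_in_database(conn):
    """Transform: an isolated step that runs inside the database engine."""
    conn.execute(
        """
        CREATE TABLE orders AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               UPPER(customer)           AS customer,
               CAST(amount AS REAL)      AS amount
        FROM staging_orders
        WHERE amount <> ''
        """
    )


conn = sqlite3.connect(":memory:")
load_raw(conn, RAW_ROWS)
transform_in_database(conn)

staged_count = conn.execute("SELECT COUNT(*) FROM staging_orders").fetchone()[0]
elt_result = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(staged_count, elt_result)
```

Because all three raw rows remain in the staging table, a later requirement (for example, reporting on incomplete orders) can be met by adding another SQL transformation without touching the load step.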
  20. 20. ETL Demo - Kettle Demo of Pentaho Kettle.
  21. 21. Map Reduce: Hadoop
      Comparison with MPP Data Warehouse.
      Source:
  22. 22. Map Reduce: Hadoop
      [Diagram labels: Professional Service; Enterprise-grade Distribution; Hadoop Subscription Service; Database/OLTP replacements: Teradata Aster/MongoDB; Hadoop Cluster Management; Data Integration with Hadoop; EDW; BI]
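The slides do not spell out the Map Reduce programming model itself, so the following is a single-process Python sketch of the general idea (map, shuffle, reduce) using the classic word-count example. It does not use Hadoop or its API; every function name here is an illustrative assumption, and a real cluster would run the map and reduce calls in parallel across machines.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (key, value) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle, and reduce sequentially over the input records."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = map_reduce(["beer diapers", "beer chips", "diapers beer"])
print(counts)
```

Each map call depends only on its own record and each reduce call only on its own key, which is what lets a framework distribute the work across a shared-nothing cluster.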
  23. 23. MPP Data Warehouse Comparing MPP Data Warehouse with Hadoop stack. Draw a picture.
  24. 24. NoSQL
  25. 25. NoSQL/SQL/NewSQL
      [Diagram: database landscape, split into Non-Relational vs. Relational and Analytics (OLAP) vs. Operational (OLTP)]
        • Analytics (OLAP): SQL MPP (Teradata, IBM Netezza, EMC Greenplum, HP Vertica), Hadoop, Teradata Aster, VectorWise
        • Operational (OLTP): Oracle, IBM DB2, SQL Server, MySQL, PostgreSQL, Ingres, Sybase, EnterpriseDB
        • NoSQL, Key-Value: BDB, Voldemort, Tokyo Cabinet, Redis
        • NoSQL, Graph: Neo4j
        • NoSQL, Document: CouchDB, MongoDB
        • NoSQL, Columnar: HBase, Cassandra
        • Cloud Service: Amazon SimpleDB, Amazon RDS, SQL Azure
        • Data Grid/Cache: Memcached
  26. 26. Web Service There are a lot of Web Services.
  27. 27. Data Analytics A lot of…..
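  28. 28. Data Visualization: It is VERY IMPORTANT to Attract Users Source: 打破陈规-数据及信息的可视化 (Breaking the Rules: Visualization of Data and Information), 向怡宁 (Xiang Yining)
  29. 29. Data Visualization: It is VERY IMPORTANT to Compete Source: 打破陈规-数据及信息的可视化 (Breaking the Rules: Visualization of Data and Information), 向怡宁 (Xiang Yining)
  30. 30. Data Visualization: It is VERY IMPORTANT to User Experience Source: 打破陈规-数据及信息的可视化 (Breaking the Rules: Visualization of Data and Information), 向怡宁 (Xiang Yining)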
  31. 31. BI Tools
      BI Tools fall into three categories:
        • Query Tools: A query tool is software set up for users to ask questions about the data. The user can search for patterns or details.
        • Multidimensional Analysis Tools: A multidimensional analysis tool, also called Online Analytical Processing (OLAP), is software that allows the user to view the same data from different aspects. E.g. Business Objects, Hyperion, Cognos, MicroStrategy, Pentaho, Microsoft Analysis Services and Palo OLAP Server.
        • Data Mining Tools: A data mining tool is software that automatically searches data, seeking out ways that the data correlates with other data. E.g. SPSS Clementine, Weka3, R and Apache Mahout.
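To make "viewing the same data from different aspects" concrete, here is a small Python sketch of a roll-up over a fact table. It is not how any of the listed OLAP products work internally; the `SALES` records, the dimension names, and the `roll_up` function are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical fact table: each record has three dimensions and one measure.
SALES = [
    {"region": "East", "product": "beer", "quarter": "Q1", "amount": 100},
    {"region": "East", "product": "diapers", "quarter": "Q1", "amount": 40},
    {"region": "West", "product": "beer", "quarter": "Q1", "amount": 70},
    {"region": "West", "product": "beer", "quarter": "Q2", "amount": 30},
]


def roll_up(facts, dimensions, measure="amount"):
    """Aggregate the measure over the chosen dimensions (one 'view' of the data)."""
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[dimension] for dimension in dimensions)
        totals[key] += fact[measure]
    return dict(totals)


# The same facts viewed from two different aspects.
by_region = roll_up(SALES, ["region"])
by_product_quarter = roll_up(SALES, ["product", "quarter"])
print(by_region)
print(by_product_quarter)
```

Changing the list of dimensions is all it takes to switch views, which is the core interaction a multidimensional analysis tool offers, typically with precomputed aggregates for speed.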
  32. 32. BI Tools List Source: BI Tool Survey 2012
  33. 33. BI Tools: Gartner Evaluation
      Business intelligence (BI) platforms enable all types of users, from IT staff to consultants to business users, to build applications that help organizations learn about and understand their business.
  34. 34. BI Demo – JasperSoft iReport Demo Session: JasperSoft iReport
  35. 35. Big Data Analytics Platform Architecture
  36. 36. Any Questions?