
Digital Pragmatism with Business Intelligence, Big Data and Data Visualisation



Contact details:
In a world where the HiPPO (Highest Paid Person’s Opinion) is final, how can we use technology to drive the organisation towards data-driven decision making as part of its organisational DNA? R provides a range of machine learning functionality, but that richness needs to be exposed in a form that is accessible to decision makers. Using data storytelling with R, we can embed data in the culture of the organisation by making it easily accessible to everyone, including decision makers. Together, the insights and process of machine learning are combined with data visualisation to help organisations derive value and insights from big and little data.

Published in: Business


  1. 1. Big Data, Business Intelligence and Data Visualisation Contact Details: Jen Stirrup @Jenstirrup
  2. 2. Who Am I? • Postgraduate degrees in Artificial Intelligence and Cognitive Science • But you don’t need any of these to do Data Visualisation
  3. 3. Credit: Mico Yuk
  4. 4. Digital Pragmatism is about collecting, sharing, quality-checking, streamlining, improving, visualizing data.
  6. 6. $97B spend on Business Intelligence by 2017 (Forrester Research) • Average adoption rate…. 21%
  7. 7. Genius depends upon the data within its reach. Ernest Dimnet
  8. 8. You have to start with the truth. The truth is the only way that we can get anywhere. Because any decision-making that is based upon lies or ignorance can't lead to a good conclusion. Julian Assange, Wikileaks
  9. 9. You have to start with the truth. The truth is the only way that we can get anywhere. Because any decision-making that is based upon lies or ignorance can't lead to a good conclusion. Julian Assange, Wikileaks
  10. 10. Pie Charts
  11. 11. Pie Charts
  12. 12. Key Trends
  13. 13. Decision Making with Data
  14. 14. What Is Big Data? Volume, velocity, variety, variability. ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking. Web 2.0: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations. Internet of Things: Audio / Video, Log Files, Text / Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates. Scale: gigabytes (10^9), terabytes (10^12), petabytes (10^15), exabytes (10^18). Storage cost per GB: 1980 $190,000; 1990 $9,000; 2000 $15; 2010 $0.07.
  15. 15. DIGITAL ANALOG 1985 1990 1995 2000 2005 2010 2015 2020 The world’s data Credit: 17:15-19:04 of Joseph Sirosh’s PASS Keynote: euLaQ4&
  16. 16. The world’s data DIGITAL ANALOG 1985 1990 1995 2000 2005 2010 2015 2020 ANALOG DATACENTERS (CLOUD) PC / DEVICE DIGITAL TAPE DVD / BLU-RAY CD Credit: 17:15-19:04 of Joseph Sirosh’s PASS Keynote: euLaQ4&
  17. 17. Connected data CONNECTED DIGITAL ANALOG 1985 1990 1995 2000 2005 2010 2015 2020 DATACENTERS (CLOUD) PC / DEVICE DIGITAL TAPE DVD / BLU-RAY CD Credit: 17:15-19:04 of Joseph Sirosh’s PASS Keynote: euLaQ4&
  18. 18. Connected data CONNECTED DIGITAL ANALOG 1985 1990 1995 2000 2005 2010 2015 2020 CLOUD / IoT PC / MOBILE Credit: 17:15-19:04 of Joseph Sirosh’s PASS Keynote: euLaQ4&
  19. 19. Connected data CONNECTED DIGITAL ANALOG 1985 1990 1995 2000 2005 2010 2015 2020 CLOUD / IoT MOBILE
  20. 20. Embracing data transforms business. It is central to outperforming competitors. Agriculture, Education, Manufacturing, Aerospace, Financial, Automotive, Government, Retail, Healthcare. Credit: us/making_the_right_analytics_investments_whitepaper.pdf
  21. 21. Challenges of the modern data platform: inefficiencies from fragmented architecture • Disparate systems and processes • Multiple tools and skillsets • Siloed insights on disconnected data • High cost of ownership. Diagram axes: Relational / Beyond relational; On-premises / Cloud. Credit: us/making_the_right_analytics_investments_whitepaper.pdf
  22. 22. Business Analytics & Data Management Platform. Data Management: Azure SQL DB, Azure SQL DW, Analytics Platform System, Azure Data Lake, SQL Server 2016. Business Analytics: Power BI, Cortana Analytics, Azure IoT. Diagram axes: Relational / Beyond relational; On-premises / Cloud. Credit: ents/en-us/making_the_right_analytics_investments_whitepaper.pdf
  23. 23. 25
  24. 24. So what IS Big Data, then?
  25. 25. Hadoop vs RDBMS • Unstructured / Semi-structured • Structured • Works together with RDBMS
  26. 26. Hadoop vs RDBMS: Apache Hadoop isn’t a substitute for a database • It is not relational • Key-value pairs • Big Data
  27. 27. How can we make Big Data ‘Human Scale’ and comprehensible?
  28. 28. Microsoft Power • 1 Billion Office Users • Analyze • Visualize • Share • Find • Q&A • Mobile • Discover • Scalable | Manageable | Trusted
  29. 29. “Every American should have above average income, and my Administration is going to see they get it.” (Bill Clinton on campaign trail)
  30. 30. The ‘Golden Record’ problem
  31. 31. Bystander Effect
  32. 32. Effective visualizations help stakeholders use that information for decision making.
  33. 33. In “about five to eight seconds, someone’s going to make the decision of do they devote any more time to looking at what you’ve got in front of them or do they move on to the next thing.” Cole Nussbaumer From:
  34. 34. London Cholera Map – John Snow 1854. London. Cholera strikes. In just 10 days, over 500 people have been killed in one neighborhood. The mysterious cluster of deaths is especially terrifying because no one understands the source. No one besides John Snow, an epidemiologist who realized the water supply was spreading the disease.
  35. 35. London Cholera Map – John Snow He plotted every death on a map with ingenious mapped bar charts (see left) and was able to show that the closer to the Broad Street water pump he plotted, the greater the number of deaths. The information helped convince the public a true sewage system was needed and spurred the city to action.
  36. 36. Gapminder – Hans Rosling The Swedish scientist Hans Rosling had been working with developmental data for over 30 years – but it took a great visualization and a 2007 TED talk for him to share his passion with the world. His original viz (now one of many) shows the relationship between income and life expectancy. The data is simple but Rosling’s visual storytelling has allowed him to spread his passion for this fascinating, overlooked data to millions.
  37. 37. War Mortality – Florence Nightingale
  38. 38. War Mortality – Florence Nightingale 1855. The Crimea. Britain is fighting a battle with both Russia and disease. As a nurse, how do you convince an army to invest in hospitals and healthcare instead of guns and ammunition? Florence Nightingale told her story with data by showing the staggering number of deaths due to preventable disease (shown in blue/grey). After this viz, sanitation became a major priority for the British Army.
  39. 39. Designing visualizations that communicate clearly doesn’t have to be complicated.
  40. 40. Consider the kind of data story you have. Distribution Part to Whole Correlation Time Series Compare Categories Ranking Image credit: Column Five Media’s Visage Data Visualization 101
  41. 41. What’s next? More data! Data Visualisation User Centred
  42. 42. So, I know what a database is, but what’s Big Data?
  43. 43. Microsoft Hadoop Vision Insights to all users by activating new types of data
  44. 44. RDBMS vs. Hadoop
  45. 45. Why Big Data, now?
  46. 46. 1980s Architecture Database Application
  47. 47. 1990s – database as an integration hub: one Database shared by Application, Application, Application
  48. 48. 1990s – Decoupled Services: Database, Database, Database; Application, Application, Application
  49. 49. Key NOSQL Concepts and Architectures
  50. 50. Relational Analytical (OLAP) 6 Data Sources Prior to NoSQL
  51. 51. Tipping Point to NoSQL New Paradigm Large Data Sets Scalability Social Media Structured / Unstructured Data
  52. 52. What is NoSQL? • Any database that is not relational • Non-relational • Not ‘No SQL’ but ‘Not Only SQL’
  53. 53. Where is NOSQL used? Cassandra used on: Digg, Facebook, Twitter, Reddit, Rackspace, Cloudkick, Cisco Hadoop used on: Amazon Web Services, Pentaho, Yahoo!, The New York Times CouchDB used on: CERN, BBC, Interactive Mediums MongoDB used on: Foursquare,, SourceForge, Fotopedia, Joomla Ads Riak used on: Widescript, Western Communications, Ask Sponsored Listings
  54. 54. Different Types of Data need Different Solutions. Data: Structured, Documents, Transactional, Relational, DW, XML, JSON, Semi-structured, Unstructured, Video, Scientific Data
  55. 55. Relational OLAP 6 Data Sources Prior to NoSQL Data Sources including NoSQL Key Value Key Value Key-Value Column - Family Graph Document
  56. 56. Relational • Tabular format • SQL concepts • Tables • Joins • Rows / Columns • SQL Language • Rigid Data Modelling
  57. 57. Relational • Built for the business • Dimensions / Facts • Fast reads • Historical Data
  58. 58. Key-Value Stores • Keys are used to access blobs of data • Video • Images • A key uniquely identifies each record. • Dictionaries have records that are stored and retrieved using a key. • It is fast because the key uniquely identifies each record. • Data is a single opaque collection. Diagram: key → value pairs
  59. 59. Locker Analogy • Keys are used to access blobs of data • Video • Images.. • A key uniquely identifies each record. • Dictionaries have records that are stored and retrieved using a key. • The Value is simply an object.
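The locker analogy can be sketched in a few lines of Python. This is a minimal illustration, not a real NoSQL product: the `KeyValueStore` class, its method names and the sample keys are all assumptions made up for the example. It shows the two properties from the slides: a key uniquely identifies a record, and the value is an opaque blob that the store never inspects.

```python
from typing import Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store: each key maps to one opaque blob."""

    def __init__(self) -> None:
        self._lockers: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        # The store never inspects the value; it only files it under the key.
        self._lockers[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # Lookup goes straight to the key, with no scan over other records.
        return self._lockers.get(key)

    def delete(self, key: str) -> None:
        # Removing a key that does not exist is a no-op.
        self._lockers.pop(key, None)


# Hypothetical keys and payloads, for illustration only.
store = KeyValueStore()
store.put("video:42", b"raw video bytes")
store.put("image:7", b"raw image bytes")

print(store.get("image:7"))
print(store.get("missing"))
```

Because the value is opaque, there is no way to ask this store for "all images larger than 1 MB"; that kind of query needs a richer model such as a document or relational store.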
  60. 60. Graph Store • Data is stored in nodes, which have properties • They are connected by critical relationships
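As a rough sketch of the graph model, nodes with properties and named relationships can be held in plain Python structures. The node names, properties and relationship types below are invented for illustration, and a real graph database would add indexing and a query language on top.

```python
from collections import defaultdict

# Nodes carry properties; names and values here are made up for illustration.
nodes = {
    "alice": {"type": "person", "city": "London"},
    "bob": {"type": "person", "city": "Leeds"},
    "acme": {"type": "company", "sector": "retail"},
}

# Relationships are first-class: (source, relationship, target).
edges = [
    ("alice", "KNOWS", "bob"),
    ("alice", "WORKS_FOR", "acme"),
    ("bob", "WORKS_FOR", "acme"),
]

# Index outgoing relationships by source node for quick traversal.
outgoing = defaultdict(list)
for source, relationship, target in edges:
    outgoing[source].append((relationship, target))


def neighbours(node, relationship):
    """Return targets reachable from `node` over one `relationship` edge."""
    return [t for r, t in outgoing[node] if r == relationship]


def colleagues(node):
    """Return people who work for the same company as `node`."""
    found = set()
    for company in neighbours(node, "WORKS_FOR"):
        for source, relationship, target in edges:
            if relationship == "WORKS_FOR" and target == company and source != node:
                found.add(source)
    return sorted(found)


print(neighbours("alice", "KNOWS"))
print(colleagues("alice"))
```

The point of the model is that questions are answered by following relationships rather than by joining tables.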
  61. 61. Documents • Data stored in nested hierarchies • Logical data remains stored together as a unit • Any item in the document can be queried • Pros: No object-relational mapping layer, ideal for search • Cons: Complex to implement, incompatible with SQL
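To make "any item in the document can be queried" concrete, here is a small sketch using a nested Python dictionary as the document. The `order` document and the `get_path` helper are hypothetical and exist only for this example; document databases provide their own path or query syntax.

```python
# A hypothetical order document: related data stays together as one unit.
order = {
    "order_id": 1001,
    "customer": {"name": "A. Customer", "address": {"city": "Glasgow"}},
    "lines": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-9", "qty": 1},
    ],
}


def get_path(document, path):
    """Follow a dotted path such as 'customer.address.city' into a nested document."""
    current = document
    for part in path.split("."):
        if isinstance(current, list):
            # Numeric path parts index into lists.
            current = current[int(part)]
        else:
            current = current[part]
    return current


print(get_path(order, "customer.address.city"))
print(get_path(order, "lines.1.sku"))
```

No mapping layer sits between the application object and the stored form, which is the advantage the slide lists under "Pros".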
  62. 62. Database Availability Online Database Availability Means CAP Theorem (BASE vs ACID) Partitioning and Replication Replication Diagram “Ring” of Consistent Hashing Next …. → Database Integrity
  63. 63. What is Database Availability? ● High Availability: the database and application are available during scheduled periods; the system is temporarily down during maintenance periods. ● Continuous Operation: the system is available all the time, with no scheduled outages. ● Continuous Availability: a combination of HA and CO; data is always available, and maintenance is done without shutting down the system.
  64. 64. CAP Theorem Consistency, Availability and Partition Tolerance. A shared-data system can have at most two of those three.
  65. 65. ACID and BASE. ACID: Atomicity: all or nothing. Consistency: any transaction should result in valid tables. Isolation: transactions are kept separate. Durability: the database will survive a system failure.
  66. 66. BASE BASE Basically Available - system seems to work all the time Soft State - it doesn't have to be consistent all the time Eventually Consistent - becomes consistent at some later time
  67. 67. Scalability. Vertical scale: improving the server specification by adding more processors, RAM and storage devices. Limited and expensive. Horizontal scale: adding more cheap computers as server expansion. This needs sharding and partitioning, which are hard to implement and expensive using relational databases (RDBMS).
  68. 68. Partitioning Sharing the data between different nodes Each node placed on a ring Advantage : ability to scale incrementally Issues : non-uniform data distribution (data host)
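The ring idea from the partitioning slide can be sketched as a consistent hash ring. This is a minimal sketch under stated assumptions: the `HashRing` class, the node names and the choice of MD5 as the hash are illustrative, and each node gets a single position on the ring. It demonstrates incremental scaling: when a node is added, the only keys that change owner are the ones the new node takes over.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Map a string to a position on the ring using MD5 as a stable hash."""
    return int(hashlib.md5(label.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent hash ring with one position per node."""

    def __init__(self, nodes):
        self._positions = []   # sorted ring positions
        self._owners = {}      # position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        position = ring_position(node)
        bisect.insort(self._positions, position)
        self._owners[position] = node

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first node at or after the key's position,
        # wrapping round to the start of the ring if necessary.
        index = bisect.bisect_left(self._positions, ring_position(key))
        if index == len(self._positions):
            index = 0
        return self._owners[self._positions[index]]


# Hypothetical keys and node names, for illustration only.
keys = [f"customer:{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {key: ring.node_for(key) for key in keys}

ring.add_node("node-d")
after = {key: ring.node_for(key) for key in keys}

moved = [key for key in keys if before[key] != after[key]]
print(f"{len(moved)} of {len(keys)} keys moved after adding node-d")
```

With one position per node the arcs of the ring are usually uneven, which is the non-uniform distribution issue the slide mentions; a common remedy is to give each physical node several virtual positions on the ring.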
  69. 69. Replication Multiple nodes Multiple datacenters High availability and durability
  70. 70. •NoSQL solutions need to solve real-world business problems •Search •High Availability •Agility
  71. 71. • Big Data is not the same as NoSQL. • NoSQL is more than dealing with big datasets. • NoSQL includes concepts that can be managed by a single processor • However, big data problems are a primary use case for NoSQL.
  72. 72. One or many databases? One Database • Easy to understand • Easy to set up and configure • Easy to administer • Single source • Limited scalability
  73. 73. Linear Scaling Performance Number of Processors
  74. 74. Chart: expressivity against scalability (degree of distribution), positioning in-memory caches, key-value stores, column family stores, row stores, document stores (JSON, XML) and graph stores
  75. 75. Big Data Problems Big Data Read-mostly Documents Full Text Event Log Real Time Batch Graph Read-write Transactions Transactions
  76. 76. Why do databases fail? • Anything that can go wrong, will go wrong – Murphy’s law. • Human error • Network failure • Hardware failure • Security
  77. 77. What can we do to support Hadoop? • Hadoop helps manage and process large datasets • Hadoop provides linear scalability • Hadoop brings computing logic to the data rather than bringing the data to computing logic.
  78. 78. Hadoop Clustering basics •Hadoop uses a cluster for data storage and computation purposes. •It runs and writes distributed applications for huge amounts of data
  79. 79. What is the purpose of Hive? Hive is a data warehousing system for Hadoop, built to meet the needs of businesses, data scientists, analysts and BI professionals • Data, summarized: fit a structure onto data • Data, analyzed: analysis of large datasets stored in Hadoop file systems • SQL-like language called HiveQL • Custom mappers and reducers when HiveQL isn’t enough
  80. 80. Hive History
  81. 81. Hive History
  82. 82. What can Hive offer you? Hive can help with a range of business problems: • Log processing • Predictive modelling • Hypothesis testing • Business intelligence
  83. 83. Hive is not a replacement for SQL, so don’t throw out your SQL Server instances! • Hive is for processing large data sets that may span hundreds, or even thousands, of machines • Hive has a high overhead for starting a job: it translates queries to MapReduce, so it takes time • Hive does not cache data as SQL Server does • Hive performance tuning is mainly Hadoop performance tuning • The query engines are similar, but the architectures differ because they serve different purposes
  84. 84. HiveQL: HiveQL is a SQL-like language • It outputs naturally occurring groups for further analysis • Easy data summarization: large datasets, summarized; fit a structure onto data • Analysis of large datasets stored in Hadoop file systems • Custom mappers and reducers when HiveQL isn’t enough
  85. 85. HiveQL Queries like SQL Queries? Similarities in syntax and features • SELECT • FROM • WHERE • GROUP BY / HAVING • Table aliases • Computed columns
  86. 86. HiveQL Queries like SQL Queries? Similarities in syntax and features • Aggregate functions • Nested SELECT • CASE • LIKE / RLIKE • JOIN • ORDER BY / SORT BY
  87. 87. How does Hive work? Hive as a translation tool • Hive translates the SQL query into MapReduce jobs • These jobs are chained together • Queries are compiled and executed
  88. 88. How does Hive work? Hive as a structuring tool: creates a schema around the data • Tables are stored in directories • Hive tables: rows and columns, like SQL tables • Hive Metastore: a namespace with a set of tables; holds table definitions (physical layout, column types, partition information)
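The translation from a SQL-like query to MapReduce can be pictured with a small Python sketch. This is a conceptual illustration only, not Hive's actual execution plan: the `logs` rows and the three functions are assumptions made up for the example. It mimics how a grouped count splits into a map phase, a shuffle that groups by key, and a reduce phase.

```python
from collections import defaultdict

# Hypothetical log rows; the query being mimicked is:
#   SELECT page, COUNT(*) FROM logs GROUP BY page
logs = [
    {"page": "/home", "user": "u1"},
    {"page": "/pricing", "user": "u2"},
    {"page": "/home", "user": "u3"},
]


def map_phase(row):
    """Emit (key, value) pairs: the GROUP BY column and a count of 1."""
    yield row["page"], 1


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse each group to a single row: COUNT(*) is a sum of the 1s."""
    return key, sum(values)


pairs = [pair for row in logs for pair in map_phase(row)]
result = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(result)
```

A query with several stages, such as a join followed by an aggregation, would become several such jobs chained together, which is one reason Hive jobs carry a noticeable start-up cost.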
  89. 89. Hive and SQL Data Types (Hive → SQL): Tinyint → Tinyint; SmallInt → Smallint; Int → Int; BigInt → BigInt; Boolean → Bit (setting as NOT NULL); Float → Float; Double → Real; BigDecimal → Decimal
  90. 90. Hive and SQL Data Types (Hive → SQL): String → Char, varchar, nvarchar, ntext, text, image; Binary → binary; Timestamp → Timestamp (note that this is being deprecated), RowVersion
  91. 91. Hive Mathematical Operations • Plus • Negative • Addition • Subtraction • Multiplication • Division • Modulus • Primitive Types • Complex Types • Arrays • Maps • Structs • Union
  92. 92. Power View Power Map • Highly Visual Design Experience • Power View is an interactive, ad hoc, query and visualization experience. • It is for business question ‘mystery’ solving • Power Map is a new 3D visualization add-in for Excel helping you to analyse geographical and temporal data • Mapping • Exploring • Interacting Different Tools for Different Jobs
  93. 93. Hive and Pig: Similarities • Hive and Pig are great at crunching large amounts of data from HDFS to database • Both compile to MapReduce jobs • Pig is procedural, Hive is declarative • Hive is much closer to SQL in terms of querying – this can be a good or a bad thing!
  94. 94. Hive and Pig: Differences • Pig is procedural; Hive is declarative • Pig fits cleanly into the pipeline paradigm, with no need for temporary tables; in Hive, temporary tables are ubiquitous but can be disjointed and may involve clean-up • Pig gives greater control over dataflow (checkpoints; naturally handles splitting of data streams); SQL expects one result and works towards it, handling trees but not splits • In Pig, optimizing is done by the developer; Hive optimisation is passed to the Hive Query Optimizer
  95. 95. Hive and Pig: When are they best used? Different tools for different jobs • Pig is akin to SSIS: great for dataflows and automated batch jobs • Hive is akin to ad hoc, analytical SQL queries: results that make sense of the data
  96. 96. Why, Who & How of Power BI More Specialized BI Pros Power Users Decision Makers Business Analysts Information Workers Self-Service • Power Pivot • Power View • Power Query • Power Map Clients • Excel Services • Office Professional
  97. 97. Easy Access to Data, Big and Small
  98. 98. Easy Access to Data, Big and Small
  99. 99. Microsoft Power BI for Office 365 • 1 in 4 enterprise customers on Office 365 • 1 Billion Office Users • Analyze • Visualize • Share • Find • Q&A • Mobile • Discover • Scalable | Manageable | Trusted
  100. 100. Power Query: Enable self-service data discovery, query, transformation and mashup experiences for Information Workers, via Excel and PowerPivot • Discovery and connectivity to a wide range of data sources, spanning volume as well as variety of data. • Highly interactive and intuitive experience for rapidly and iteratively building queries over any data source, any size. • Consistency of experience, and parity of query capabilities over all data sources. • Joins across different data sources; ability to create custom views over data that can then be shared with team/department.
  101. 101. Power Query Discover, combine, and refine Big Data, small data, and any data with Data Explorer for Excel.
  102. 102. Power Query
  103. 103. Power Query Data Sources • Windows Azure Marketplace • Windows Active Directory • Azure SQL Database • Azure HDInsight
  104. 104. Analyse and Model with Excel 2013
  105. 105. Power View
  106. 106. Powerful Self-Service BI with Excel 2013
  107. 107. Power View – Business Mysteries, Solved Power View is an interactive data exploration, visualization, and presentation experience Highly visual design experience Rich meta-driven interactivity Presentation-ready at all times It delivers intuitive ad-hoc reporting for business users
  108. 108. Introducing Power View It is now also available in Excel 2013, and with new features: • Maps • Pie charts • Hierarchies • KPIs • Drill down/Drill up • Report styles, themes and text resizing • Backgrounds with images • Hyperlinks • Printing
  109. 109. Power View in Excel Excel Database server SQL AS (Tabular) Power View SQL RS ADOMD.NET SQL AS (PowerPivot)
  110. 110. Power View in SharePoint Browser SharePoint web server Database server SharePoint app server SQL AS (PowerPivot) SQL AS (Tabular) SQL RS Add-In SQL RS Power View
  111. 111. Powerful Self-Service BI with Excel 2013
  112. 112. Power Map for Microsoft Excel enables information workers to discover and share new insights from geographical and temporal data through three-dimensional storytelling. What Is Power Map?
  113. 113. Map Data • Data in Excel • Geo-Code • 3D and 3 Visuals Discover Insights • Play over Time • Annotate points • Capture scenes Share Stories • Cinematic Effects • Interactive Tours • Share Workbook Power Map: Steps to 3D insights
  114. 114. Map Data
  115. 115. Power Map Excel Add-in to Enhance Data Visualization Map data, discover insight, and share stories