Investigative Analytics - What's in a Data Scientist's Toolbox


Talk by Mike Ferguson, CEO of Intelligent Business Strategies, at Data Science London (@ds_ldn)

Published in: Technology, Business


  1. 26/04/2012
     Investigative Analytics - What's In The Data Scientists’ Toolkit
     Mike Ferguson, Managing Director, Intelligent Business Strategies
     Data Science London, April 2012

     About Mike Ferguson
     Mike Ferguson is Managing Director of Intelligent Business Strategies Limited. As an analyst and consultant he specializes in business intelligence, data management and enterprise business integration. With over 30 years of IT experience, Mike has consulted for dozens of companies, spoken at events all over the world and written numerous articles. He is an expert on the B-EYE-Network. Formerly he was a principal and co-founder of Codd and Date Europe Limited – the inventors of the Relational Model, a Chief Architect at Teradata on the Teradata DBMS and European Managing Director of DataBase Associates.
     Twitter: @mikeferguson1
     Tel/Fax: (+44) 1625 520700

     Copyright © Intelligent Business Strategies 2012 - All Rights Reserved
  2. Topics
     • Big Data workloads
     • Data science tools for near real-time analytics
     • Data science tools for investigative analytics of multi-structured data
     • Data science tools for investigative analytics of structured data
     • Trends in a fast moving Big Data marketplace
     • Governance of data science projects

     The Application Processing Spectrum - Big Data Is Pushing Storage Options Towards Optimized Systems
     [Figure. Source: BI-Research. Copyright © BI-Research, 2011]
  3. Big Data Processing – The Number of Data Stores Optimized for Operational or Analytical Workloads Is Growing
     • ACID support is missing in some NoSQL DBMSs
     • Can you live with losing a transaction?
     • OK for sensor data, for example
     [Diagram labels: Analytical RDBMS, OLTP RDBMS, NoSQL DBMS]

     Data Science Tools – Different Analytical Workloads Need Different Tools
     • Some tools work across multiple platforms
     [Diagram: analytical tools and data management tools shown for each of four platforms. Data source labels: streaming data; CRM, ERP, SCM; machine generated data, markets data, sensors; RDF/OWL]
  4. Data Science Tools – Near Real-Time Analytics On Data In Motion
     Stream analytics / CEP
     • Workload characteristics: near real-time automated analytics on text or semi-structured data
     • Data characteristics: highly volatile data-in-motion, large volumes
     • Product examples: IBM InfoSphere Streams, Informatica RulePoint
     • Trends: CEP vendors are moving to analyse text as well as structured data; some CEP vendors may get acquired
     [Diagram labels: streaming data, stream analytics / CEP; sources: machine generated data, markets data, sensors]

     Trends – Streaming Event Data Can Also Be Stored In Hadoop or DW Appliance
     [Diagram labels: analytical tools, streaming data, data management tools; sources: machine generated data, markets data, sensors]
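To make "near real-time automated analytics" on data in motion concrete, here is a minimal sketch of a sliding-window rule of the kind a stream analytics or CEP engine evaluates. It is not the API of IBM InfoSphere Streams or Informatica RulePoint; the function name `detect_spikes`, the window length and the spike factor are all assumptions chosen for illustration.

```python
from collections import deque


def detect_spikes(readings, window_seconds=10.0, factor=1.5):
    """Flag readings that exceed `factor` times the rolling average.

    `readings` is an iterable of (timestamp, value) pairs in time order.
    Each reading is compared with the average of the readings seen in the
    preceding `window_seconds`, then added to the window itself.
    """
    window = deque()  # (timestamp, value) pairs, oldest first
    total = 0.0
    alerts = []
    for ts, value in readings:
        # Evict readings that have aged out of the window.
        while window and window[0][0] <= ts - window_seconds:
            total -= window.popleft()[1]
        # Rule: alert when the new value is well above the recent average.
        if window and value > factor * (total / len(window)):
            alerts.append((ts, value))
        window.append((ts, value))
        total += value
    return alerts


# Hypothetical sensor feed: one obvious spike at t=3.
sensor_feed = [(0, 10), (1, 11), (2, 10), (3, 30), (4, 10), (20, 12)]
print(detect_spikes(sensor_feed))  # [(3, 30)]
```

The point of the design is that each event is handled once, as it arrives, with only a small window of state kept in memory, which is what distinguishes this workload from batch analysis of stored data.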
  5. Data Science Tools - Investigative Analytics on Multi-Structured Data In Hadoop (various distributions)
     • Workload characteristics: investigative analysis
     • Data characteristics: up to very large volumes of multi-structured data (Variety)
     • Data management tools: e.g. Informatica HParser, Pentaho ETL, Pervasive, Talend ETL Studio
     • Analysis: batch analytics with custom MapReduce apps, Mahout and R
     • BI tools (MapReduce): Karmasphere, Datameer, IBM Cognos Content Analytics
     • BI tools (search based): Connexica, Quid
     • BI tools (Hive interface): JasperSoft, MicroStrategy, Tableau…

     Data Deluge - Data Is Arriving Faster Than We Can Consume It – How Good Is Your Filter?
     [Diagram labels: data, filter, enterprise systems]
  6. Data Management Tools Are Being Extended To Embrace And Exploit Massively Parallel Hadoop Clusters
     Approaches:
     • Custom code
     • Data management tool suites: e.g. IBM InfoSphere DataStage and Smart Consolidation (uses InfoSphere Blueprint Director), Informatica, Pervasive, Pentaho, Talend
     What data management tools do with Hadoop:
     • Load data into Hadoop
     • Discover data in Hadoop
     • Parse & prepare data in Hadoop (MapReduce)
     • Transform & cleanse data in Hadoop (MapReduce)
     • Invoke custom analytics on Hadoop
     • Extract data from Hadoop
     Trends: expect MUCH more from data management tool vendors, including generation of MapReduce code to clean and transform data.

     Processing Text Is A Key Part of Hadoop Based Analysis
     What is text analytics? Deriving data from unstructured content.
     Popular data sources include:
     • Social media, email, news articles, on-line forums
     Requires pre-processing prior to analysis:
     • Parsing, correction, phrase extraction, semantic grouping
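The four pre-processing steps named on the text analytics slide (parsing, correction, phrase extraction, semantic grouping) can be sketched as a small pipeline. This is an illustrative sketch only: the correction table and the term-to-group mapping are hypothetical stand-ins for the dictionaries and models a real text analytics tool would supply, and two-word phrases stand in for proper phrase extraction.

```python
import re
from collections import Counter

# Hypothetical lookup tables, for illustration only.
CORRECTIONS = {"gr8": "great", "luv": "love"}
SEMANTIC_GROUPS = {
    "great": "positive",
    "love": "positive",
    "awful": "negative",
    "slow": "negative",
}


def parse(text):
    """Parsing: lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def correct(tokens):
    """Correction: replace known misspellings and slang."""
    return [CORRECTIONS.get(token, token) for token in tokens]


def extract_phrases(tokens, n=2):
    """Phrase extraction: here simply every run of n adjacent tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def group(tokens):
    """Semantic grouping: count tokens by the group they map to."""
    return Counter(SEMANTIC_GROUPS[t] for t in tokens if t in SEMANTIC_GROUPS)


def preprocess(text):
    tokens = correct(parse(text))
    return {
        "tokens": tokens,
        "phrases": extract_phrases(tokens),
        "groups": group(tokens),
    }


result = preprocess("Luv the new phone, gr8 screen but SLOW delivery")
print(result["groups"])
```

The output of a pipeline like this is structured data (tokens, phrases, group counts) derived from unstructured content, which is what makes the text usable by downstream analytical tools.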
  7. Tools Are Appearing To Make It Easier To Parse Data In Hadoop To Make It Easier To Analyse
     Product example: Informatica HParser
     [Screenshot. Source: Informatica]

     Big Data Integration - Talend Open Studio for Big Data
     • Enhancing a big data job with Data Quality
     • Several data quality components are included in the open source version
     [Screenshot. Source: Talend]
  8. Accelerating Custom Data Integration and Preparation
     Pervasive DataRush for Hadoop:
     • Call DataRush from MapReduce
     • MapReduce runs faster
     • Less code to write
     • Of interest to MapReduce developers
     Syncsort DMExpress Hadoop Edition:
     • Move data in and out of HDFS
     • Create jobs using the DMExpress GUI and run them within Hadoop
     • Shift transformations to the DMExpress engine
     • Invokes high performance compression
     [Diagram: Hadoop acceleration. Mappers and reducers on the Hadoop Distributed File System call DataRush or DMX]

     Leveraging Hadoop For Data Integration On Massive Volumes Of Data To Bring Additional Insights Into A DW
     • Hundreds of terabytes up to petabytes of cloud data
     • E.g. deriving insight from huge volumes of social web content on sites like Twitter, Facebook, Digg, MySpace, TripAdvisor, LinkedIn… for sentiment analytics
     [Diagram labels: operational systems, cloud data, extract, transform, Map/Reduce apps (e.g. Pig, IBM JAQL), HDFS, relevant insight, DI, DW]
  9. Product Example – Pentaho Enterprise Data Services Suite Support For Hadoop
     [Screenshot. Source: Pentaho]

     In-Hadoop Analytics – Example Technologies
     • Hadoop MapReduce programs with custom analytics
     • Hadoop MapReduce programs with Hadoop Mahout (several analytical algorithms for use in batch analysis)
     • Pervasive DataRush For Hadoop Analytics Engine
     • Radoop (UI on RapidMiner)
     • Revolution Analytics RevoScaleR
     • …
     [Diagram labels: analytical tools, data management tools]
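For readers unfamiliar with what a "MapReduce program with custom analytics" looks like, the following single-process sketch imitates the map, shuffle and reduce phases in plain Python. It does not use the Hadoop API; the three-field log layout (ip, path, status) and all function names are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit (status_code, 1) for each well-formed web log line.

    Assumes a hypothetical whitespace-separated layout: ip path status.
    """
    parts = line.split()
    if len(parts) >= 3:
        yield parts[2], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


log_lines = [
    "10.0.0.1 /home 200",
    "10.0.0.2 /cart 500",
    "malformed",
    "10.0.0.1 /home 200",
]
print(run_job(log_lines))  # {'200': 2, '500': 1}
```

On a real cluster the map calls run in parallel on the nodes that hold the data and the framework performs the shuffle; the analytical logic the developer writes is essentially the two small functions `map_phase` and `reduce_phase`.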
  10. New Big Data Analytics Technologies Are Emerging On Hadoop – E.g. Radoop
     Radoop interfaces RapidMiner (open source) with Hadoop and integrates with Hive and Mahout, providing a UI for Hadoop based analytics.
     Source: “Radoop – It’s Like Yahoo Pipes for Hadoop”

     Revolution Analytics RevoScaleR for Distributed Computing Clusters
     Scaling R for Big Data Analytics
     • Portions of the data source are made available to each compute node
     • RevoScaleR on the master node assigns a task to each compute node
     • Each compute node independently processes its data and returns its intermediate results to the master node
     • The master node aggregates all of the intermediate results from the compute nodes and produces the final result
     [Diagram: a master node (RevoScaleR) and four compute nodes (RevoScaleR), each with its own data partition. Source: Revolution Analytics]
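The four steps on the RevoScaleR slide describe a general partition-and-aggregate pattern. The sketch below shows that pattern in plain Python for a simple linear regression; it is not RevoScaleR code or its API. The names `node_task` and `master_aggregate` are hypothetical, and the partitions are processed in a loop here where a cluster would process them in parallel.

```python
def node_task(partition):
    """Compute node: reduce one data partition to intermediate sums.

    `partition` is a list of (x, y) pairs. Only five numbers leave the
    node, regardless of how many rows the partition holds.
    """
    n = len(partition)
    sum_x = sum(x for x, _ in partition)
    sum_y = sum(y for _, y in partition)
    sum_xx = sum(x * x for x, _ in partition)
    sum_xy = sum(x * y for x, y in partition)
    return n, sum_x, sum_y, sum_xx, sum_xy


def master_aggregate(intermediate_results):
    """Master node: add up the per-node sums and fit y = slope * x + intercept."""
    n, sum_x, sum_y, sum_xx, sum_xy = (sum(col) for col in zip(*intermediate_results))
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x * sum_x)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


# Hypothetical data following y = 2x + 1, split across three "nodes".
data = [(x, 2 * x + 1) for x in range(10)]
partitions = [data[0:3], data[3:7], data[7:10]]

results = [node_task(p) for p in partitions]  # would run in parallel on a cluster
print(master_aggregate(results))  # (2.0, 1.0)
```

This works because the regression can be computed from sums that are additive across partitions, so the master never needs to see the raw rows. Algorithms that lack such additive intermediate results are harder to scale this way.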
  11. Analysing Hadoop Data – Multiple Options
     Batch analytics:
     • Custom MapReduce applications
     • Analytical tools generating MapReduce: Karmasphere, Datameer, IBM Cognos Content Analytics
     Search based tools (built on Lucene):
     • Connexica, Quid
     BI tools using Hive QL:
     • JasperSoft, MicroStrategy, Tableau
     [Diagram labels: analytical tools, data management tools; data e.g. log files, social networks, clickstream. Source: Datameer]

     Big Data Analysis - Exploratory Analysis of Multi-Structured Data In Hadoop via Search
     E.g. IBM BigIndex (part of IBM BigInsights)
     • Use massively parallel MapReduce servers to build a partitioned search index
     • Useful for analysing un-modelled semi-structured Web content that is not well understood
     [Diagram labels: file, web sites, email, CMS, image, collab tools, feeds, server; load; index partitions, index server; BI tools, applications, mashups]
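To illustrate what "build a partitioned search index" means, here is a minimal in-memory sketch: a map step emits (term, document) pairs, each term is routed to one index partition, and a search consults only the partition that owns the term. It is not IBM BigIndex or Lucene code; the routing function and all names are assumptions for the example.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map: emit one (term, doc_id) pair per distinct term in a document."""
    return [(term, doc_id) for term in set(text.lower().split())]


def partition_for(term, num_partitions):
    """Route a term to a partition with a simple, repeatable hash."""
    return sum(term.encode("utf-8")) % num_partitions


def build_index(docs, num_partitions=2):
    """Reduce: each partition collects the posting sets for its own terms."""
    partitions = [defaultdict(set) for _ in range(num_partitions)]
    for doc_id, text in docs.items():
        for term, emitted_id in map_doc(doc_id, text):
            partitions[partition_for(term, num_partitions)][term].add(emitted_id)
    return partitions


def search(partitions, term):
    """Look up a term in the single partition responsible for it."""
    term = term.lower()
    partition = partitions[partition_for(term, len(partitions))]
    return sorted(partition.get(term, set()))


# Hypothetical un-modelled content.
docs = {
    "d1": "hadoop search index",
    "d2": "search web logs",
    "d3": "Hadoop logs",
}
index = build_index(docs)
print(search(index, "search"))  # ['d1', 'd2']
```

Because nothing about the content has to be modelled in advance, an index like this suits the exploratory use the slide describes: analysts can search first and decide afterwards which fields are worth structuring.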
  12. Search Based Analytical Tools For Big Data - E.g. Connexica (runs on top of Lucene indexes)
     [Screenshots: Connexica Venn Diagrams, Connexica Dashboard]

     Data Warehouse Appliances – Analytical Workloads on Structured Data using ADBMSs and BI Tools
     • Analytical tools: IBM Cognos, IBI WebFOCUS, MicroStrategy, Oracle BIEE, SAP BusinessObjects, SAS, Pentaho, Jaspersoft, QlikView, Tableau
     • Platform: MPP analytical DBMS, in-database analytics, columnar and row storage
     • Data management tools: IBM InfoSphere DataStage, Informatica PowerCenter, Microsoft SSIS, Oracle Data Integrator, Pervasive, Pentaho ETL, Talend ETL
     • Data sources: CRM, ERP, SCM
     • Workload characteristics: historical reporting and analysis, investigative analytics
     • Data characteristics: medium and large volumes, structured data
  13. In-Database Analytics – E.g. SAS Has Completely Re-Written Analytics to Exploit Parallelism
     E.g. SAS High Performance Analytics and Teradata
     • Runs ‘alongside’ the ADBMS as peers in the same MPP nodes
     • In-memory passing of data between DBMS and analytic models within every node without data movement
     • Highly parallel, in-memory execution of analytics delivered across a distributed computing environment
       – Linear regression and variable selection with classical and modern methods
       – Nonlinear regression and maximum likelihood
       – Correlation analysis
       – Logistic regression
       – Neural nets
       – Linear mixed models
       – Optimization
     • GA Q4 2011
     [Diagram: In-Database vs. Alongside-DBMS]

     Trends in Data Science Tooling – Tools Are Broadening Their Reach
     [Diagram labels: analytical tools, data management tools, streaming data; sources: CRM, ERP, SCM; machine generated data, markets data, sensors; RDF/OWL]
  14. Microsoft Big Data Solution – SQL Server 2012 Hive ODBC Driver & Hive Add-in For Excel and PowerPivot
     [Screenshot. Source: Microsoft]

     Front End Tools Interfacing With Hadoop And Analytical RDBMS
     • BI tools platform & data visualisation tools: SAP BO, IBM Cognos, Oracle BIEE, MicroStrategy, JasperSoft, Pentaho, MS Excel
     • Map Reduce BI tools: e.g. Karmasphere, Datameer, IBM Cognos Content Analytics
     • Search based BI tools: e.g. Connexica, Quid
     • Custom Map Reduce applications
     [Diagram labels: SQL, indexes, MPP RDBMS, polymorphic table function]
  15. Tools To Govern Data Science Projects – Data Sources, Sandboxes, People, Results
     [Diagram: governance applied to each of the following: sandbox, MPP analytical RDBMS, graph DBMS, DW, social graph data, unstructured / semi-structured content, clickstream, files, RDBMS, web logs]

     Governance: Big Data Projects Need To Be Managed – E.g. EMC Greenplum Chorus
     • Workspaces, sandboxes, people and data sources can all be governed
     [Screenshot. Source: EMC Greenplum]
  16. Architectures – Integrating Big Data Analytics Into The Enterprise
     [Diagram labels:
     • Users: business analysts, developers
     • Front end: BI tools platform & data visualisation tools, Map Reduce BI tools, search based BI tools, custom MR apps
     • Real-time actions, stream processing
     • Access: SQL, indexes, polymorphic table function(s)
     • Stores: MPP RDBMS, graph DBMS
     • Data: event streams, social graph data, OLTP data, unstructured / semi-structured content
     • Information Management and Services
     • Sources: XML, JSON, clickstream, cloud data, files, web services, RDBMS, cubes, web logs, office docs, web content]

     Thank You!
     Twitter: @mikeferguson1
     Tel/Fax: (+44) 1625 520700