Big Data and BI Tools - BI Reporting for Bay Area Startups User Group


This presentation was given at the July 8, 2014 user group meeting for BI Reporting for Bay Area Start-Ups.

Content creation: Infocepts/DWApplications
Presented by: Scott Mitchell - DWApplications



  1. BI Reporting SF Bay User Group, 08 July 2014. BI Reporting for Bay Area Start-ups. Presented by: Scott Mitchell, DWApplications
  2. Presenter – Scott Mitchell
     Background
     • Currently based in the San Francisco Bay Area
     • Consultant working for DWApplications
     • Partnering with Infocepts for their off-shore blended staffing capacity
     BI and DW experience
     • Started working with BI/DW tools in 1997 (17 years)
     • Worked on all sides of the fence: reporting, DBA, ETL, solution architect
     • Significant experience in Agile BI application integration
     Previous implementations
     • Start-ups: ePredix, Telephia/Nielsen Mobile, Quantros, mFoundry/FIS, TradePulse, iQ-ity
     • Enterprise: Victoria's Secret, eBay, Ross, Safeway, Bank of America, VISA
  3. BIG Data
  4. Big Data Agenda
     • BIG Data
       - Standard BIG Data reference architecture
       - 5/7 Vs of BIG Data
       - Hadoop ecosystem
       - Connecting Hadoop
       - Components of the Hadoop ecosystem
     • BIG Data questions
       - What Hadoop can do vs. what Hadoop can't do
       - When to use Hadoop / when not to use Hadoop
       - When to choose BIG Data over an RDBMS
       - Can Big Data and a traditional RDBMS co-exist?
       - RDBMS, BIG Data, or both?
       - Real-time analytics using Big Data
     • BIG Data platform comparison
     • BI tool comparison
  5. Standard BIG Data Reference Architecture (diagram)
  6. 5 Vs of BIG Data
     • Volume: This is the aspect that comes to most people's minds when they think of Big Data. Volumes of data have increased exponentially in recent times. It is not uncommon for businesses to deal with petabytes of data, and analysis is typically performed over the entire data set, not just a sample.
     • Velocity: Big Data is not just about volume, though. Just as important is the rate of change of the data. For a large volume of data that doesn't change very often, analysis that takes hours or days to complete may be acceptable; but if the dataset is growing by terabytes per day, or the data is changing rapidly, the processing time of the analysis becomes much more important.
     • Variety: Big Data is not always structured, and it is not always easy to put into a relational database. Big Data includes data types such as videos, music files, emails, unstructured Word documents and social media feeds. Dealing with a variety of structured and unstructured data greatly increases the complexity of both storing and analyzing Big Data.
  7. 5 Vs of BIG Data (continued)
     • Veracity: When we are dealing with a high volume, velocity and variety of data, it is inevitable that not all of the data is going to be 100% correct; there will be dirty data. The question is, how clean is good enough for the analysis to be performed? Often the data does not need to be perfect, but it does need to be close enough to yield relevant insight. Depending on the application, verification of the data may be essential, or simply "nice to have."
     • Value: This is the most important aspect of Big Data. It costs a lot of money to implement IT infrastructure to store Big Data, and businesses are going to require a return on investment. At the end of the day, if you can't extract value from your data, there is no point in building the capability to store and manage it.
  8. Additional Vs – Part of the 7 Vs of BIG Data
     Additionally, some experts add:
     • Validity: The interpreted data has a sound basis in logic or fact, i.e. it is the result of logical inferences from matching data. One of the most common errors is confusing correlation with causation. The context of the data becomes very important.
     • Visibility: The state of being able to see or be seen is implied. Data from disparate sources needs to be stitched together where it is visible to the technology stack making up Big Data. Critical data that is otherwise available, but not visible to Big Data processes, may be one of the Achilles' heels of the Big Data paradigm. Conversely, unauthorized visibility is a risk.
  9. Hadoop Ecosystem (diagram legend: components that can use YARN directly; components using the MapReduce framework; SQL-based database tools)
  10. Connecting Hadoop (diagram: BI tools and ETL tools connect over JDBC/ODBC; databases over JDBC/ODBC/native interfaces)
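The JDBC/ODBC connectivity in the diagram above is what lets BI and ETL tools treat Hive like an ordinary database. A minimal Python sketch, assuming a reachable HiveServer2 endpoint; the host name is invented for illustration, and the third-party `pyhive` driver is one common choice rather than something the deck prescribes:

```python
def build_hive_url(host, port=10000, database="default"):
    """Build a SQLAlchemy-style connection URL for HiveServer2.
    10000 is HiveServer2's usual Thrift port."""
    return "hive://%s:%d/%s" % (host, port, database)

def fetch_rows(host, sql):
    """Connect and run a query; requires a live HiveServer2 to work."""
    from pyhive import hive          # third-party driver, not stdlib
    conn = hive.Connection(host=host, port=10000)
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()

if __name__ == "__main__":
    # Hypothetical gateway host, for illustration only.
    print(build_hive_url("hadoop-gw.example.com"))
```

ODBC-based tools follow the same pattern, only with a DSN configured in the ODBC driver manager instead of a URL built in code.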
  11. Modules of the Hadoop Ecosystem
     • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data
     • Hadoop YARN: A framework for job scheduling and cluster resource management
     • Hadoop Common: The common utilities that support the other Hadoop modules
     • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets
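As a rough illustration of the MapReduce model named above, the classic word count can be sketched in plain Python; in real Hadoop the map, shuffle and reduce phases run distributed across the cluster, but the data flow is the same:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group values by key (the framework does this in Hadoop)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(vals) for word, vals in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big deal"])))
# counts == {"big": 2, "data": 1, "deal": 1}
```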
  12. Other Components
     • HBase: A scalable, distributed database that supports structured data storage for large tables
     • Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying
     • Pig: A high-level data-flow language and execution framework for parallel computation, used for constructing extract, transform, load (ETL) data flows
     • ZooKeeper: A high-performance coordination service for distributed applications
     • Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner
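To make Hive's "summarization and ad hoc querying" role concrete, here is a small Python helper that emits a typical HiveQL daily rollup; the table and column names are hypothetical, and Hive would compile such a statement into MapReduce jobs behind the scenes:

```python
def daily_summary_sql(table="web_logs"):
    """Return a HiveQL aggregation typical of Hive's summarization role.
    `web_logs` and `event_ts` are invented names; `to_date` is a
    built-in Hive date function."""
    return (
        "SELECT to_date(event_ts) AS day, COUNT(*) AS hits "
        "FROM {t} GROUP BY to_date(event_ts)".format(t=table)
    )

print(daily_summary_sql())
```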
  13. Gartner's 12 Dimensions of Big Data – Extreme Information. There are three tiers of information management in the model, with four dimensions in each tier.
  14. Quantification
     Physical data characteristics:
     • Volume: High volume of data generated during different timeframes
     • Velocity: Speed of data collection, processing and access – real-time, near real-time, historical and older. If the dataset is growing by terabytes per day, or the data is changing rapidly, the processing time of analysis becomes much more important
     • Variety: Various structures of data from different data sources: unstructured (from websites, sensors, social media, etc.), semi-structured (from XML, web services, etc.) and structured (from transactional systems)
     • Complexity: Individual data sets with different standards, business domain rules and storage formats for each asset type
  15. Access Enablement and Control
     Information control based on the nature of the data and the information it provides (e.g. confidential HR, finance and sales data, customer details, negative tweets)
     • Classification: Classification of data into various classes depending on the information hidden in it (e.g. sensitive, non-sensitive, private, public)
     • Contracts: Governance rules of the enterprise data governance framework that allow access to specific data (e.g. agreements on who will share what information, and how)
     • Pervasiveness: Spread and availability of data across various levels of the organization, depending on organizational requirements and the details of the information in the data (e.g. how long data remains active, how long an aggregation of data is valid for summary reports, when data refreshes)
     • Technology enablement (specifications for tools and technology): Controlling users' access to the various functionalities of the tools and technologies used to get information from data (e.g. security roles in MicroStrategy)
  16. Qualification and Assurance
     • Fidelity: Reliability of the data source and authenticity of the data
     • Linked data: Association of data with its context (affiliation)
     • Validation of data: Validity of data for its business use case and rules
     • Perishability: Longevity, i.e. how long data remains relevant to its context and analysis; aging of data while retaining its state and originality
  17. BIG Data Questions
  18. When to use Hadoop?
     • Your data sets are really big: If data is in GBs, use Excel, a SQL BI tool on Postgres, or some similar combination; but if data is in terabytes or petabytes, Hadoop's superior scalability will save you a considerable amount of time and money
     • You celebrate data diversity: It doesn't matter whether your raw data is structured (like out of an ERP system), semi-structured (like XML and log files), unstructured (like video files), or all three; Hadoop and its forgiving schema will gobble it up
     • You have strong programming skills: Hadoop is written in Java, and therefore requires Java programming skills to master. That is changing with new tools in the Hadoop ecosystem, but right now it largely remains a venue for excellent Java skills
     • You are building an 'enterprise data hub' for the future: If you work for a large enterprise, you might sign up for Hadoop even if your data isn't particularly massive, diverse or fast at this point in time. It might make sense to start experimenting with Hadoop now, to be ready to take advantage when the elephant really starts sizzling and goes mainstream in a few years
     • You find yourself throwing away perfectly good data: Hadoop can store petabytes of data. If you are throwing away potentially valuable data because it costs too much to archive, setting up a Hadoop cluster may let you retain that data and give you the time to figure out how best to make use of it
  19. When not to use Hadoop?
     • You want to store sensitive data: One thing Hadoop is not particularly good at today is storing sensitive data. Hadoop currently has only basic data and user access security. While these features are improving by the month, the chance of accidentally losing personally identifiable information due to Hadoop's less-than-stellar security capabilities is probably not worth the risk
     • You want to replace your data warehouse: A majority of data pros still say that Hadoop is complementary to a traditional data warehouse, not a replacement for it. The superior economics of Hadoop-based storage make it an excellent place to land raw data and pre-process it before siphoning it over to a traditional data warehouse to run analytic workloads
     • You want to delete or update data frequently: Hive does not support DELETE and UPDATE commands, so if there is a business need where frequent deletion or updating of data is paramount, Hadoop is not the way to go
  20. When to use BIG Data technologies over an RDBMS?
     When you can no longer achieve the desired results with your RDBMS:
     • When data is highly unstructured, e.g. scanner data, social media data, streaming data, videos, documents, tweets, photos
     • When data is huge in volume and complexity (greater than 1 TB, and complex)
     • Customers adopt BIG Data for specific roles, especially exploratory data-science sandboxes and unstructured data staging
     And for some very technical, issue-oriented reasoning:
     • Count distinct queries: A count distinct query by definition has to process every record, including sorting and counting, and this becomes a difficult problem when the volume of data is huge. Mixing one or more such distinct aggregates with non-distinct aggregates in the same select list, or mixing two or more distinct aggregates, causes further performance issues, as it leads to spooling and re-reading of intermediate results
     • Cursors: A cursor steps through a table row by row. If you are doing analysis with some kind of case statement using a cursor on each row, and the table is of any significant size, this is a very bad situation. Cursors are good for iterating through small metadata tables; RDBMS systems are not optimized for stepping through large datasets one entry at a time
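The count distinct problem above is exactly the shape of work that parallelizes well on Hadoop: each partition of the data computes a partial distinct set, and the sets are merged at the end. A toy Python sketch of that split (partition contents are made up for illustration):

```python
def distinct_count_partitioned(partitions):
    """Exact distinct count via per-partition partial sets.
    The map side builds a set per partition; the reduce side
    merges them -- the pattern a MapReduce job would follow."""
    partial_sets = [set(p) for p in partitions]   # "map" side
    merged = set().union(*partial_sets)           # "reduce" side
    return len(merged)

partitions = [[1, 2, 2, 3], [3, 4], [4, 5]]
# Five distinct values across the three partitions.
result = distinct_count_partitioned(partitions)
```

At real scale, systems often go one step further and use approximate sketches (e.g. HyperLogLog) so the merged state stays small, but the partition-then-merge structure is the same.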
  21. When to use BIG Data technologies over an RDBMS? (continued)
     • Alter table: Suppose you have a customer's big data warehouse with a table X that is so big, and so important, with so many columns, that altering it (adding a column, changing a column's data type, or running any DML operation) would take a long time to complete. Such operations need to be planned and performed very carefully, as they lock the table for the whole operation until the statement completes. In addition, if the column you are adding has a NOT NULL clause, it is very painful, as the DBMS has to insert default values into all existing rows, which may overburden your transaction logs
     • Data merge and mashup (structured meets unstructured): Most retailers today have both an online and an in-store presence. Consider a scenario where you have customers' online product search data (search logs) from the retailer's website for the last 15 days, their past in-store purchase history (RDBMS), their in-store charge card transaction data, and their daily commute pattern data from their cellphone provider. If you want to build an analytical model that combines these myriad sources to send custom discount offers valid in a specific store located along the customer's daily commute path, you need to combine all of these sources of data. It's difficult to deal with unstructured data in an RDBMS, let alone combine unstructured data with structured
  22. When using Big Data technologies like Hadoop and Hive, do we still need a standard RDBMS to perform analytics? No.
     Hive is essentially a data warehouse infrastructure that provides data summarization and ad hoc querying. It performs the role of a data warehouse platform for the organization's structured data in the Hadoop ecosystem. The long-term Hadoop vision is that an organization can rely entirely on the Hadoop ecosystem for analytics, even in the absence of an RDBMS. However, right now:
     - Hadoop is IT-heavy, and business users need IT hand-holding
     - It lacks highly accessible self-service tools for business users
     - Hadoop does not have extensive pre-existing adapters for ERP systems
     - It would require significant investment to rewrite the advanced ETL feeding the DW
     Do I need an RDBMS, a BIG Data database, or both? It varies from one organization to another. As organizations become aware of their data and their needs, they will be in a better position to decide which technology fits their requirements. As covered earlier, structured vs. unstructured data and the volume and complexity of the data are major attributes that can help in deciding.
  23. How close can we get to real-time analytics using BIG Data technologies (rather than having to move data through ETL processes)? Really real-time, or streaming, analytics is possible with BIG Data.
     The Hadoop ecosystem already has many customer examples where real-time analytics is truly real-time/streaming. A keynote from the recently concluded Hadoop Summit shows how a large trucking agency tracks events such as starting, stopping, and traffic violations like speeding, excessive braking and unsafe tail distance while trucks are on the road delivering goods. The system also provides interactive views of historical data, to see how other routes have performed with respect to violations.
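The trucking example above boils down to windowed aggregation over an event stream. A toy Python sketch of the idea; real deployments would use a streaming layer such as Storm, and the class and field names here are invented for illustration:

```python
from collections import deque

class SlidingWindowCounter:
    """Count events (e.g. speeding alerts from one truck) seen in the
    last `window` seconds. A toy stand-in for a streaming layer."""

    def __init__(self, window):
        self.window = window
        self.events = deque()   # timestamps, oldest first

    def record(self, ts):
        """Record an event at time `ts` and return the current count."""
        self.events.append(ts)
        # Evict anything that has aged out of the window.
        while self.events and self.events[0] <= ts - self.window:
            self.events.popleft()
        return len(self.events)

c = SlidingWindowCounter(window=60)
c.record(0)
c.record(30)
latest = c.record(70)   # the event at t=0 has aged out, so count is 2
```

A dashboard would read such per-vehicle counts continuously instead of waiting for a nightly ETL batch, which is the contrast the slide is drawing.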
  24. Can we replace RDBMS with BIG Data databases some day? Yes and no.
     Why yes?
     • BIG Data ecosystems like Hadoop already have components that can handle unstructured as well as traditional structured data.
     • An RDBMS is expensive, even with a terabyte or two of data. The license fees and hardware needed to run even a 2-3 TB DWH and BI solution on an RDBMS-based system are massive. BIG Data technologies are quickly filling this gap, offering stable ecosystems without hampering performance or budget.
     Why no?
     • RDBMS has been around for ages, is mature, and has a lot of helpful tools. 'Transactional applications' are still the one thing an RDBMS handles best, and we don't yet see anything from the BIG Data technologies that tackles them as well.
     • Hadoop's inventor Doug Cutting feels so. He recently opined that Hadoop is "augmenting and not replacing": the real nuts-and-bolts things like payroll, for which people have been using an RDBMS, will not be a good fit for Hadoop or other BIG Data platforms.
  25. Augment your EDW with Hadoop, adding new capabilities and insight
     - Continue to store summary structured data from your OLTP and back-office systems in the EDW.
     - Store unstructured data that does not fit nicely into "tables" in Hadoop. This means all the communication with your customers (phone logs, customer feedback, GPS locations, photos, tweets, emails, text messages, etc.) can be stored in Hadoop, far more cost-effectively.
     - Correlate data in your EDW with the data in your Hadoop cluster to get better insight into your customers, products, equipment, etc. You can now use this data for analytics that are computation-intensive, such as clustering and targeting. Run ad hoc analytics and models against your data in Hadoop while you are still transforming and loading your EDW.
     - Do not build Hadoop capabilities within your enterprise in a silo. Hadoop and other big data technologies should work in tandem with, and extend the value of, your existing data warehouse and analytics technologies.
     - Data warehouse vendors are adding Hadoop and MapReduce capabilities to their offerings, while Hadoop is trying to take on more traditional DW activities.
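The "correlate" step above is essentially a join between structured warehouse rows and aggregates computed from unstructured data in Hadoop. A tiny illustrative Python sketch; all record shapes and field names here are hypothetical:

```python
# Structured summary rows, as they might come from the EDW.
edw_customers = {
    101: {"name": "Acme", "ltv": 5000},
    102: {"name": "Globex", "ltv": 1200},
}

# Aggregates derived in Hadoop from unstructured sources,
# e.g. social media mentions counted per customer id.
hadoop_mentions = {101: 12, 202: 3}

# Join on customer id, defaulting to 0 when Hadoop saw no mentions.
enriched = {
    cid: dict(row, mentions=hadoop_mentions.get(cid, 0))
    for cid, row in edw_customers.items()
}
```

In practice this join would run in Hive or in the warehouse itself via a connector; the point is only that each side keeps doing what it is good at, and the insight comes from combining them.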
  26. Big Data Tool Comparison
  27. Big Data Technologies Comparison

     | Feature | Cassandra | HBase | Hive | MongoDB |
     |---|---|---|---|---|
     | Description | Wide-column store based on ideas of BigTable and DynamoDB | Wide-column store based on Apache Hadoop and on concepts of BigTable | Data warehouse software for querying and managing large distributed datasets, built on Hadoop | One of the most popular document stores |
     | Developer | Apache Software Foundation | Apache Software Foundation | Apache Software Foundation | MongoDB, Inc. |
     | Initial release | 2008 | 2008 | 2012 | 2009 |
     | License | Open source | Open source | Open source | Open source |
     | Implementation language | Java | Java | Java | C++ |
     | Server operating systems | BSD | Linux, Unix, Windows | All OS with a Java VM | Linux, OS X, Solaris, Windows |
     | Database model | Wide column store | Wide column store | Relational DBMS | Document store |
     | Data scheme | Schema-free | Schema-free | Yes | Schema-free |
     | Transaction concepts | No | No | No | No |
  28. Big Data Technologies Comparison (continued)

     | Feature | Cassandra | HBase | Hive | MongoDB |
     |---|---|---|---|---|
     | Typing | Yes | No | Yes | Yes |
     | Secondary indexes | Restricted | No | Yes | Yes |
     | SQL | No | No | No | No |
     | APIs and other access methods | Proprietary protocol | Java API, RESTful HTTP API, Thrift | JDBC, ODBC, Thrift | Proprietary protocol using JSON |
     | Partitioning methods | Sharding | Sharding | Sharding | Sharding |
     | Durability | Yes | Yes | Yes | Yes |
     | Server-side scripts | No | Yes | Yes | JavaScript |
     | Triggers | Yes | Yes | No | No |
     | Replication methods | Selectable replication factor | Selectable replication factor | Selectable replication factor | Master-slave replication |
     | MapReduce | Yes | Yes | Yes | Yes |
  29. Big Data Technologies Comparison (continued)

     | Feature | Cassandra | HBase | Hive | MongoDB |
     |---|---|---|---|---|
     | Supported programming languages | C#, C++, Clojure, Erlang, Go, Haskell, Java, JavaScript, Perl, PHP, Python, Ruby, Scala | C, C#, C++, Groovy, Java, PHP, Python, Scala | C++, Java, PHP, Python | ActionScript, C, C#, C++, Clojure, ColdFusion, D, Dart, Delphi, Erlang, Go, Groovy, Haskell, Java, JavaScript, Lisp, Lua, MatLab, Perl, PHP, PowerShell, Prolog, Python, R, Ruby, Scala, Smalltalk |
     | Consistency concepts | Eventual consistency, immediate consistency | Immediate consistency | Eventual consistency | Eventual consistency, immediate consistency |
     | Foreign keys | No | No | No | No |
     | Concurrency | Yes | Yes | Yes | Yes |
     | User concepts | Access rights for users can be defined per object | Access control lists (ACL) | Access rights for users, groups and roles | Users can be defined with full access or read-only access |
  30. BI Tool Comparison
  31. BI Landscape

     | Vendor category | Vendors |
     |---|---|
     | Megavendors | IBM, Microsoft, Oracle, SAP |
     | Large independent vendors | Information Builders, MicroStrategy, SAS |
     | Data discovery vendors | Qlik, Tableau, Tibco Spotfire |
     | Open source | Actuate, Jaspersoft, Pentaho |
     | SaaS | Birst |
     | Small independent vendors | Bitam, Salient, Panorama, Logi Analytics, Targit, GoodData, arcplan, Infor, Alteryx, Pyramid Analytics, Board International, Prognoz, Yellowfin |
  32. Gartner's 17 Categories
     Information delivery
     1. Reporting: Ability to create print-ready and interactive reports
     2. Dashboards: Multi-object, linked reports in an intuitive and interactive display
     3. Ad hoc report/query: Ability for end users to create their own reports
     4. Microsoft Office integration: How the tool integrates with the Office suite
     5. Mobile BI: Ability to deliver to mobile devices using the devices' native features
     Analysis
     6. Interactive visualization: Exploring the data beyond pie/bar charts; includes heat maps, geographic maps, scatter plots, etc.
     7. Search-based data discovery: Easily search structured and unstructured data sources
     8. Geospatial and location intelligence: Ability to show relationships on interactive maps using geographic, spatial and time information
     9. Embedded advanced analytics: Leverages statistical function libraries, Predictive Model Markup Language (PMML) and R-based models
     10. OLAP: Fast, multidimensional access and manipulation of the data
  33. Gartner's 17 Categories (continued)
     Integration
     11. BI infrastructure and administration: Shared security, metadata, administration, object model, query engine and scheduling/distribution
     12. Metadata management (MDM): Centralized and robust way to administer/manage dimensions, facts, performance, report layouts, etc.
     13. Business user data mashup and modeling: Code-free, drag-and-drop, user-driven ability to mix and match different data
     14. Development tools: Programmatic and visual tools for developing reports, dashboards and analyses
     15. Embeddable analytics: Includes a software development kit (SDK) for truly customizing, porting and embedding analysis both within and outside the platform
     16. Collaboration: Ability to share and discuss
     17. Support for big data: Ability to query hybrid, columnar and array-based data sources, including MapReduce and NoSQL databases
  34. BI Platforms Comparison – Gartner
     Actuate
     Strengths: Release of BIRT iHub 3 – consistent, streamlined interface with better integration across the product line • Expanded big data connectivity and mashup capabilities • Functionality and ease of use rated high
     Weaknesses: Deterioration of market understanding, user experience and contract experience • Overall product capability score below average • Not highly used for dashboarding, ad hoc analysis and interactive visualization/discovery
     Jaspersoft
     Strengths: End-to-end BI • First pay-as-you-go BI server on AWS • Low cost of ownership
     Weaknesses: Capabilities scored below average • Used narrowly in organizations • Below-average data volumes • Embeddable analytics and advanced analytics
     Pentaho
     Strengths: Low cost of ownership • Ranked high for development tools • Investing in and launching emerging analytic application capabilities – Big Data Layer, Instaview, Storm and Splunk
     Weaknesses: Customer experience, product quality and support below average • Difficult to use and implement
     Qlik
     Strengths: Launch of redesigned visualization experience – Natural Analytics (Q3/Q4 2014) • Ease of use for analysis and development • Associative search eliminates some complex SQL • Strong on dashboards, visualizations, mashups, collaboration, mobile and big data support
     Weaknesses: Not enterprise-ready – lacks MDM, infrastructure and embeddability • Limited compared to other stand-alone data discovery vendors in visual-based interactive exploration and analysis • Major rearchitecting poses risks to current customers – could lose market traction
     Tableau
     Strengths: Highly intuitive, visual-based data discovery, dashboarding and data mashup capabilities • High customer satisfaction and experience • Reusability, scalability and embeddability • Wide range of support for data access
     Weaknesses: Used as a complement, not the standard • Inflexible in negotiations / high maintenance fees • Ability to address governance and broader BI functionality a work in progress
  35. BI Platforms Comparison – Gartner (continued)
     MicroStrategy
     Strengths: Go-to platform to handle the most complex deployments • Organic integration and superior product quality • Choice where mobile is a strategic requirement • Big data integration • Visual data discovery and multi-TB in-memory engine (in development)
     Weaknesses: Steep initial learning curve (Mobile/VI combating that) • Cost of software • Longest to develop reports (along with SAP) • Blurred marketing message
     SAP BusinessObjects
     Strengths: Large deployments and enterprise BI standards – integration key • Heavy investment in visual data discovery/embeddable analytics • Expansion of BI Customer Success initiative
     Weaknesses: Hard to use and do complex analysis • Software quality / difficult to migrate • High cost and hard sale • Integration concerns / questions on BI commitment
     IBM Cognos
     Strengths: Handles some of the largest deployments • Watson Analytics (2014) – smart data discovery • Simplified licensing model
     Weaknesses: Unrecognizable differentiation in market • Cost, poor performance, lack of ease of use and support quality are all customer concerns • Scores low / not reaching business benefits
     Oracle BI EE
     Strengths: Leader in information management • Integration, pre-built solutions and large-scale deployments • Large network of partners
     Weaknesses: Unavailability of complex types/advanced analytics • Requires sophisticated BI-related competencies • Scores low in quality and late with mobile
     Tibco
     Strengths: Aims to stay ahead of the curve with aggressive development/acquisition • Quality, functionality and ease of use rated high • Used for complex analyses
     Weaknesses: Large, complex reports take a long time to develop • Dashboards rated average • Administration, development and MDM rated below average • Support staff coverage not always adequate
     Microsoft
     Strengths: Ubiquitous BI across products – it is already there and being used • Attractive packaging and pricing • Investing heavily in cloud • Excel widely used, with accelerated investments in feature releases
     Weaknesses: Mobile BI, interactive visualization and MDM are product weaknesses • Multiproduct complexity – on-premises or hybrid deployments • Do-it-yourself approach – onus is on the customer