Pentaho - Jake Cornelius - Hadoop World 2010


Putting Analytics in Big Data Analytics

Jake Cornelius
Director of Product Management, Pentaho Corporation

Learn more @

Published in: Technology

  • In a traditional BI system where we have not been able to store all of the raw data, we have solved the problem by being selective.
    First, we selected the attributes of the data that we knew we had questions about. Then we cleansed it, aggregated it to transaction level or higher, and packaged it in a form that is easy to consume. Then we put it into an expensive system that we could not scale, whether technically or financially. The rest of the data was thrown away or archived on tape, which, for the purposes of analysis, is the same as throwing it away.
    The problem is we don’t know what is in the data that we are throwing away or archiving. We can only answer the questions that we could predict ahead of time.
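To make the cost of being selective concrete, here is a minimal Python sketch. The event records, field names, and the `build_mart` helper are hypothetical and not part of the talk; they only illustrate that once data is aggregated by an attribute chosen ahead of time, a question about any other attribute can only be answered from the raw data.

```python
from collections import Counter

# Hypothetical raw events. In the traditional flow only `page` is kept.
raw_events = [
    {"page": "/home", "browser": "Firefox", "country": "US"},
    {"page": "/home", "browser": "Chrome", "country": "DE"},
    {"page": "/pricing", "browser": "Firefox", "country": "US"},
]


def build_mart(events, kept_attribute):
    """Aggregate events into counts by the one attribute chosen up front."""
    return Counter(event[kept_attribute] for event in events)


# The predicted question ("hits per page") is answered by the mart.
page_mart = build_mart(raw_events, "page")

# An unpredicted question ("hits per browser") cannot be derived from
# page_mart; it needs the raw events, which the traditional flow discarded.
hits_by_browser = build_mart(raw_events, "browser")
```

If `raw_events` had been thrown away after `page_mart` was built, `hits_by_browser` could not be computed at all.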
  • When we look at the Big Data architecture we described before we recall that
    * We want to store all of the data, so we can answer both known and unknown questions
    * We want to satisfy our standard reporting and analysis requirements
    * We want to satisfy ad-hoc needs by providing the ability to dip into the lake at any time to extract data
    * We want to balance performance and cost as we scale
    We need the ability to take the data in the Data Lake and easily convert it into data suitable for a data mart, data warehouse or ad-hoc data set, without requiring custom Java code.
  • Fortunately we have an embeddable data integration engine, written in Java.
    We have taken our data integration engine, PDI, and integrated it with Hadoop in a number of different areas:
    * We have the ability to move files between Hadoop and external locations
    * We have the ability to read and write to HDFS files during data transformations
    * We have the ability to execute data transformations within the MapReduce engine
    * We have the ability to extract information from Hadoop and load it into external databases and applications
    * And we have the ability to orchestrate all of this so you can integrate Hadoop into the rest of your data architecture with scheduling, monitoring, logging, etc.
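The orchestration idea in the last point can be sketched in a few lines of Python. This is not PDI's API; `run_job` and the step names are hypothetical, and the sketch only shows the general pattern of running ordered steps with logging and stopping at the first failure.

```python
import logging
from typing import Callable, List, Tuple

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("job")

# A step is a (name, action) pair; the action takes no arguments.
Step = Tuple[str, Callable[[], None]]


def run_job(steps: List[Step]) -> List[str]:
    """Run steps in order, log progress, and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, action in steps:
        log.info("starting step: %s", name)
        try:
            action()
        except Exception:
            log.exception("step failed, aborting job: %s", name)
            break
        completed.append(name)
        log.info("finished step: %s", name)
    return completed


def copy_to_hdfs() -> None:
    """Placeholder for a real file move; does nothing in this sketch."""


def load_database() -> None:
    """Placeholder that simulates a failing downstream load."""
    raise RuntimeError("database unavailable")


finished = run_job([
    ("copy file into Hadoop", copy_to_hdfs),
    ("load external database", load_database),
    ("refresh reports", lambda: None),
])
```

Here `finished` contains only the first step, because the simulated database failure aborts the job before the reports are refreshed.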
  • Put into diagram form, so that we can indicate the different layers in the architecture and also show the scale of the data, we get this Big Data pyramid.
    * At the bottom of the pyramid we have Hadoop, containing our complete set of data.
    * Higher up we have our data mart layer. This layer has less data in it, but has better performance.
    * At the top we have application-level data caches.
    * Looking down from the top, from the perspective of our users, they can see the whole pyramid - they have access to the whole structure. The only thing that varies is the query time, depending on what data they want.
    * Here we see that the RDBMS layer lets us optimize access to the data. We can decide how much data we want to stage in this layer. If we add more storage in this layer, we can increase performance for a larger subset of the data lake, but it costs more money.
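The pyramid described above can be read as a top-down lookup: the same question is asked of each layer in turn, and the layer that answers it determines the query time. The following Python sketch assumes hypothetical tier names, keys, and values; it is an illustration of the idea, not how Pentaho routes queries.

```python
from typing import Dict, List, Optional, Tuple


class Tier:
    """One layer of the pyramid, holding a subset of the data."""

    def __init__(self, name: str, data: Dict[str, int]) -> None:
        self.name = name
        self.data = data

    def get(self, key: str) -> Optional[int]:
        return self.data.get(key)


def query(tiers: List[Tier], key: str) -> Tuple[Optional[str], Optional[int]]:
    """Search from the top of the pyramid down; return (tier name, value)."""
    for tier in tiers:
        value = tier.get(key)
        if value is not None:
            return tier.name, value
    return None, None


# Hypothetical contents: each lower tier holds more data but is slower.
pyramid = [
    Tier("application cache", {"hits_today": 120}),
    Tier("data mart", {"hits_today": 120, "hits_by_country": 87}),
    Tier("hadoop", {"hits_today": 120, "hits_by_country": 87, "raw_sessions": 5000}),
]

fast_answer = query(pyramid, "hits_today")    # served by the cache
slow_answer = query(pyramid, "raw_sessions")  # only the bottom tier has it
```

Staging more data in the middle tier corresponds to adding keys to the "data mart" dictionary: more questions are answered before reaching Hadoop, at the cost of more storage in that layer.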
  • In this demo we will show how easy it is to execute a series of Hadoop and non-Hadoop tasks. We are going to:
    * Get a weblog file from an FTP server
    * Make sure the source file does not already exist in the Hadoop file system
    * Copy the weblog file into Hadoop
    * Read the weblog and process it: add metadata about the URLs, add geocoding, and enrich the operating system and browser attributes
    * Write the results of the data transformation to a new, improved data file
    * Load the data into Hive
    * Read an aggregated data set from Hadoop
    * Write it into a database
    * Slice and dice the data with the database
    * Execute an ad-hoc query against Hadoop
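The "read the weblog and process it" step of the demo can be sketched in Python as follows. In the demo this work is done by a PDI transformation, not hand-written code; the log format (Apache combined), the sample line, and the `enrich` helper are assumptions made for illustration, and URL metadata and geocoding are left out.

```python
import re
from typing import Dict, Optional

# Assumes the Apache "combined" log format; the demo's actual format is not stated.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def enrich(line: str) -> Optional[Dict[str, str]]:
    """Parse one weblog line and add coarse browser and OS attributes.

    Returns None when the line does not match the expected format.
    """
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    agent = record["agent"]
    # Simple substring checks; real user-agent parsing is far more involved.
    record["browser"] = next(
        (name for name in ("Firefox", "Chrome", "MSIE") if name in agent), "Other"
    )
    record["os"] = next(
        (name for name in ("Windows", "Mac OS X", "Linux") if name in agent), "Other"
    )
    return record


# Hypothetical sample line using a documentation-reserved IP address.
sample = (
    '192.0.2.10 - - [12/Oct/2010:09:15:02 -0400] '
    '"GET /products/index.html HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.10) '
    'Gecko/20100914 Firefox/3.6.10"'
)

record = enrich(sample)
```

The enriched `record` keeps every parsed field and adds `browser` and `os`, so a later aggregation step can group by either attribute.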
  • Pentaho - Jake Cornelius - Hadoop World 2010

    1. © 2010, Pentaho. All Rights Reserved. Putting Analytics in Big Data Analytics. Jake Cornelius, Dir. of Product Management, Pentaho Corporation. October 12, 2010
    2. Traditional BI (diagram labels: Data Source; Data Mart(s); Tape/Trash; ? ? ? ? ? ??). Footer: US and Worldwide: +1 (866) 660-7555
    3. Big Data Architecture (diagram labels: Data Source; Data Lake(s); Data Mart(s); Data Warehouse; Ad-Hoc)
    4. Diagram labels: Pentaho Data Integration (repeated); Hadoop; Data Marts, Data Warehouse, Analytical Applications; Design; Deploy; Orchestrate
    5. Diagram labels: Optimize; Visualize; Load; Files / HDFS; Hive; DM & DW; Applications & Systems; Web Tier; RDBMS; Hadoop; Reporting / Dashboards / Analysis
    6. Diagram labels: Web Tier; RDBMS; Hadoop; Reporting / Dashboards / Analysis; HDFS; Hive; DM
    7. Demo
    8. Pentaho for Hadoop Announcements
        • Pentaho for Hadoop Download Capability
            * Includes support for development; production support will follow with GA
            * Collaborative effort between Pentaho and the Pentaho Community
            * 60+ beta sites over three-month beta cycle
            * Pentaho contributed code for API integration with Hive to the open source Apache Foundation
        • Pentaho and Cloudera Partnership
            * Combines Pentaho's business intelligence and data integration capabilities with Cloudera's Distribution for Hadoop (CDH)
            * Enables business users to take advantage of Hadoop with ability to easily and cost-effectively mine, visualize and analyze their Hadoop data
    9. Pentaho for Hadoop Announcements (cont)
        • Pentaho and Impetus Technologies Partnership
            * Incorporates Pentaho Agile BI and Pentaho BI Suite for Hadoop into Impetus Large Data Analytics practice
            * First major SI to adopt Pentaho for Hadoop
            * Facilitates large data analytics projects including expert consulting services, best practices support in Hadoop implementations and nCluster including deployment on private and public clouds
    10. Pentaho for Hadoop Resources & Events
        • Resources
            * Download
            * Pentaho for Hadoop webpage - resources, press, events, partnerships and more:
            * Big Data Analytics: 5 part video series with James Dixon, Pentaho CTO
        • Events
            * Hadoop World: NYC - Oct 12, Gold Sponsor, Exhibitor, Richard Daley presenting, 'Putting Analytics in Big Data Analysis'
            * London Hadoop User Group - Oct 12, London
            * Agile BI Meets Big Data - Oct 13, New York City
    11. Thank You. Join the conversation. You can find us on:
        • Pentaho Facebook Group
        • @Pentaho
        • Pentaho - Open Source Business Intelligence Group