Putting Business Intelligence to Work on Hadoop Data Stores
 


An inexpensive way of storing large volumes of data, Hadoop is also scalable and redundant. But getting data out of Hadoop is tough due to a lack of a built-in query language. Also, because users experience high latency (up to several minutes per query), Hadoop is not appropriate for ad hoc query, reporting, and business analysis with traditional tools.

The first step in overcoming Hadoop's constraints is connecting to Hive, a data warehouse infrastructure built on top of Hadoop, which provides the relational structure necessary for scheduled reporting on large datasets stored in Hadoop files. Hive also provides a simple query language called HiveQL, which is based on SQL and enables users familiar with SQL to query this data.

But to really unlock the power of Hadoop, you must be able to efficiently extract data stored across multiple nodes (often tens or hundreds) with a user-friendly ETL (extract, transform and load) tool that then lets you move your Hadoop data into a relational data mart or warehouse, where you can use BI tools for analysis.
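
The extract, transform and load flow described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name `EtlSketch`, the `date,page,hits` record layout, and the in-memory lists standing in for files on Hadoop nodes and for a data mart table are all hypothetical, and none of it is a Pentaho or Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: in-memory lists stand in for per-node files and for
// the data mart table. The "date,page,hits" layout is an assumption.
final class EtlSketch {

    // Extract: gather the raw lines held by every node's partition.
    static List<String> extract(List<List<String>> partitions) {
        List<String> lines = new ArrayList<>();
        for (List<String> partition : partitions) {
            lines.addAll(partition);
        }
        return lines;
    }

    // Transform: parse each line, skip malformed ones, total hits per page.
    static Map<String, Long> transform(List<String> lines) {
        Map<String, Long> hitsPerPage = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            if (fields.length != 3) {
                continue; // wrong number of fields
            }
            try {
                long hits = Long.parseLong(fields[2].trim());
                hitsPerPage.merge(fields[1].trim(), hits, Long::sum);
            } catch (NumberFormatException e) {
                // non-numeric hit count: skip the record
            }
        }
        return hitsPerPage;
    }

    // Load: emit one "page,hits" row per page, sorted by page, as a stand-in
    // for inserting rows into a relational data mart table.
    static List<String> load(Map<String, Long> hitsPerPage) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Long> entry : hitsPerPage.entrySet()) {
            rows.add(entry.getKey() + "," + entry.getValue());
        }
        return rows;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("2010-09-01,/home,2", "bad line"),
            Arrays.asList("2010-09-01,/home,1", "2010-09-02,/cart,5"));
        for (String row : load(transform(extract(partitions)))) {
            System.out.println(row);
        }
    }
}
```

In a real deployment the extract step would read from HDFS or Hive and the load step would write to the warehouse through a bulk loader or prepared statements; the sketch only shows how the three stages hand data to each other.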


Usage Rights

CC Attribution-NonCommercial-NoDerivs License


Presentation Transcript

  • Putting Business Intelligence to Work on Hadoop Data Stores. Ian Fyfe, Chief Technology Evangelist, Pentaho. © 2010, Pentaho. All Rights Reserved. www.pentaho.com. Worldwide: +1 (866) 660-7555 | Slide 1
  • Session Abstract. This presentation will cover how to overcome Hadoop's constraints to get more out of your business data analysis. An inexpensive way of storing large volumes of data, Hadoop is also scalable and redundant. But getting data out of Hadoop is tough due to a lack of a built-in query language. Also, because users experience high latency (up to several minutes per query), Hadoop is not appropriate for ad hoc query, reporting, and business analysis with traditional tools. The first step in overcoming Hadoop's constraints is connecting to Hive, a data warehouse infrastructure built on top of Hadoop, which provides the relational structure necessary for scheduled reporting on large datasets stored in Hadoop files. Hive also provides a simple query language called HiveQL, which is based on SQL and enables users familiar with SQL to query this data. But to really unlock the power of Hadoop, you must be able to efficiently extract data stored across multiple nodes (often tens or hundreds) with a user-friendly ETL (extract, transform and load) tool that then lets you move your Hadoop data into a relational data mart or warehouse, where you can use BI tools for analysis. Attendees will learn how an IT person without Java programming skills can: integrate with Hadoop and Hive to bring ETL, data warehousing and BI applications to the tasks of analyzing Big Data; provide key data integration and transformation functionality to Hadoop data; manage and control Hadoop jobs using a graphical interface; integrate Hadoop data with data from other sources to drive compelling reporting and analytics for today's massive volumes of data. | Slide 2
  • THE CASE FOR BIG DATA | Slide 3
  • The Case for Big Data. Enterprises increasingly face needs to store, process and maintain larger and larger volumes of structured and unstructured data: compliance; competitive advantage. Challenges associated with big data: cost (storage and processing power); timeliness of data processing. Why Hadoop? Low-cost, reliable scale-out architecture for storing massive amounts of data. Parallel, distributed computing framework for processing data. Proven success in solving Big Data problems at Fortune 500 companies like Google, Yahoo!, IBM and GE. Vibrant community, exploding interest, strong commercial investments. [Figure: Google Trends for 'Hadoop'] | Slide 4
  • Hadoop for Data Integration and BI. Top use cases for Hadoop*: 1. “mine data for improved business intelligence”; 2. “reducing cost of data analysis”; 3. “log analysis”. Top challenges with Hadoop*: 1. Steep technical learning curve; 2. Hiring qualified people; 3. Availability of appropriate products and tools. Unfortunately, Hadoop was not designed specifically for ETL and BI use cases: it's not a database; high-latency queries and jobs are not ideal for all BI use cases; skill set mismatch for traditional ETL users and BI solution architects. *Based on a survey of 100+ Hadoop users conducted by Karmasphere, Sept. 2010. | Slide 5
  • ESTABLISHING AN ARCHITECTURE FOR BIG DATA | Slide 6
  • Example Use Cases Today. Transactional: fraud detection; financial services/stock markets. Sub-transactional: weblogs; social/online media; telecoms events. | Slide 7
  • Example Use Cases Today. Non-transactional: web pages, blogs, etc.; documents; physical events; application events; machine events. In most cases structured or semi-structured. | Slide 8
  • Traditional Business Intelligence (BI). [Diagram labels: Data Source; Data Mart(s); Tape/Trash; a series of question marks] | Slide 9
  • Data Lake. Single source; large volume; not distilled; typically no more than 0-2 lakes per company; known and unknown questions; multiple user communities; doesn't fit in a traditional RDBMS at a reasonable cost. | Slide 10
  • Data Lake Requirements. Store all the data; satisfy routine reporting and analysis; satisfy ad hoc query / analysis / reporting; balance performance and cost. | Slide 11
  • What if... [Diagram labels: Data Source; Data Lake(s); Data Warehouse; Data Mart(s); Ad-Hoc] | Slide 12
  • Big Data Does Not Replace Data Marts. It's not a database; high latency; optimized for massive data-crunching; Big Data databases are immature; databases are NoSQL. | Slide 13
  • What Hadoop Really Is... Core components. HDFS: a distributed file system allowing massive storage across a cluster of commodity servers. MapReduce: a framework for distributed computation; common use cases include aggregating, sorting, and filtering BIG data sets. The problem is broken up into small fragments of work that can be computed or recomputed in isolation on any node of the cluster. | Slide 14
  • What Hadoop Really Is... Related projects. Hive: a data warehouse infrastructure on top of Hadoop. Implements a SQL-like query language, including a JDBC driver. Allows MapReduce developers to plug in custom mappers and reducers. HBase: the Hadoop database (AH HA!). A variant of NoSQL databases, problematic for traditional BI. Best at storing large amounts of unstructured data. | Slide 15
  • Hadoop and BI? Distributed processing; distributed file system; commodity hardware; platform independent (in theory); scales out beyond the technology and/or economy of an RDBMS. In many cases it's the only viable solution. | Slide 16
  • Hadoop and BI? 90% of new Hadoop use cases are transformation of semi/structured data*. *Of those companies we've talked to... | Slide 17
  • Hadoop and BI? “The working conditions within Hadoop are shocking” (ETL Developer) | Slide 18
  • Hadoop and BI? Instead of this... | Slide 19
  • Hadoop and BI? You have to do this in Java... public void map(Text key, Text value, OutputCollector output, Reporter reporter) public void reduce(Text key, Iterator values, OutputCollector output, Reporter reporter) | Slide 20
  • People don't use Hadoop for BI because they want to... | Slide 21
  • ...they do it because they have to... | Slide 22
  • ... and unfortunately it wasn't designed for most BI requirements | Slide 23
  • Why not add to Hadoop the things it's missing... | Slide 24
  • ... until it can do what we need it to? | Slide 25
  • If only we had a Java, embeddable, data transformation engine... | Slide 26
  • A Data Integration Engine for Hadoop. [Diagram labels: Design; Deploy; Orchestrate; Data Integration Engine (shown three times); Hadoop; Data Marts, Data Warehouse, Analytical Applications] | Slide 27
  • [Diagram labels: Applications & Systems; Load; Hadoop Files / HDFS; Hive; Optimize; DM & DW; RDBMS; Web Tier; Reporting / Dashboards / Analysis; Visualize] | Slide 28
  • [Diagram labels: Applications & Systems; Hadoop Files / HDFS; Hive; Metadata; DM & DW; RDBMS; Web Tier; Reporting / Dashboards / Analysis] | Slide 29
  • [Diagram labels: Data Source; Data Lake(s); Data Warehouse; Data Mart(s); Ad-Hoc] | Slide 30
  • [Diagram labels: Applications & Systems; Hadoop; Data Lake; RDBMS; Web Tier; Reporting / Dashboards / Analysis] | Slide 31
  • Product Requirements for BI Against Hadoop. Lower technical barriers through a graphical ETL environment for creating and managing Hadoop MapReduce jobs. Extreme ETL scalability through deployment across the Hadoop cluster. Easily spin off high-performance data marts for interactive analysis. Easily integrate data from Hadoop with data from other sources. Provide end-to-end BI addressing common BI use cases with Hadoop, including reporting, ad hoc query and interactive analysis. Reduce costs through subscription-based pricing, reduced dependency on scarce technical resources, and easier maintainability. [Diagram labels: Interactive Analysis; Batch Reporting and Ad Hoc Query; Data Marts; Agile BI; Hive; Hadoop; Data Integration Jobs; Log Files; DBs and other sources] | Slide 32
  • THE ROAD AHEAD | Slide 33
  • The Road Ahead. Other NoSQL integration: facilitate BI use cases on top of HBase, possibly others like MongoDB, Cassandra. Streaming data source support: in support of near-real-time use cases; long/always-running data processing jobs. Contiguous metadata: data lineage and impact analysis covering the entire big data architecture. The end of MapReduce (... as a concept ETL users need to understand): push-down optimization of transformations that generate native MapReduce tasks in Hadoop. | Slide 34
  • Hadoop Distro Wars. The Apache Software Foundation | Slide 35
  • Tools That Make Hadoop Easier, e.g. Apache Pig. Pig is a platform for analyzing large data sets. Produces sequences of MapReduce programs. Integrate Pig scripts into enterprise data integration workflows, e.g.: 1. Submit and monitor a series of Pig and MapReduce jobs; 2. Process a database bulk load step to ready data for ad hoc analysis or report bursting. | Slide 36
  • Growth in Adoption of Other NoSQL Big Data Platforms. HBase: the Hadoop database. MongoDB: scalable, high-performance, document-oriented database. LexisNexis HPCC: a data-intensive computing system platform. Many others. | Slide 37
  • Summary. Hadoop and other Big Data NoSQL platforms: great at storing and processing large, diverse data volumes; not designed for Business Intelligence. Choosing the right BI technology can unlock your Big Data to drive actionable insights: graphical user interfaces; scalable; spin-off data marts; integrate data into data warehouses; integrated dashboards, reporting, data analysis, data integration. | Slide 38
  • Thank You! ifyfe@pentaho.com | Slide 39
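
Slides 14 and 20 describe MapReduce as work split into small fragments that can be computed in isolation, expressed in Java as a map method and a reduce method. The following is a minimal in-memory sketch of that idea, not Hadoop code: the class `MapReduceSketch` and its methods are hypothetical, plain Java collections replace Hadoop's Text, OutputCollector and Reporter types, and a single loop stands in for the cluster's shuffle step.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, single-process illustration of the map/shuffle/reduce shape.
// It does not use or imitate the real Hadoop API.
final class MapReduceSketch {

    // Map: one input line becomes zero or more (word, 1) pairs.
    // Each call depends only on its own line, so it could run on any node.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce: all values collected for one key collapse into a single total.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver: map every line, group pairs by key (the shuffle), reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            totals.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        // prints {big=3, data=2}
        System.out.println(run(Arrays.asList("big data", "Big Data big")));
    }
}
```

Because map sees one line at a time and reduce sees one key at a time, either step can be rerun on a different machine after a failure, which is the property slide 14 points to. It is also the kind of hand-written Java the deck argues ETL users should not have to produce.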