The document discusses big data and Hadoop. It notes that big data comes in terabytes and petabytes, sometimes generated daily. Hadoop is presented as a framework for distributed computing on large datasets using MapReduce. While Hadoop can store and process massive amounts of data across commodity servers, it was not designed for business intelligence requirements. The document proposes addressing this by adding data integration and transformation capabilities to Hadoop through tools like Pentaho Data Integration, to enable it to better meet the needs of big data analytics.
Webinar | Using Hadoop Analytics to Gain a Big Data Advantage - Cloudera, Inc.
Learn about:
Why big data matters to your business: realize revenue, increase customer loyalty, and pinpoint effective strategies
The business and technical challenges of big data solutions
How to leverage big data for competitive advantage
The “must haves” of an effective big data solution
Real-world examples of Cloudera, Pentaho and Dell big data solutions in action
Putting Business Intelligence to Work on Hadoop Data Stores - DATAVERSITY
An inexpensive way of storing large volumes of data, Hadoop is also scalable and redundant. But getting data out of Hadoop is tough due to a lack of a built-in query language. Also, because users experience high latency (up to several minutes per query), Hadoop is not appropriate for ad hoc query, reporting, and business analysis with traditional tools.
The first step in overcoming Hadoop's constraints is connecting to Hive, a data warehouse infrastructure built on top of Hadoop, which provides the relational structure necessary for scheduled reporting on large datasets stored in Hadoop files. Hive also provides a simple query language called HiveQL, which is based on SQL and enables users familiar with SQL to query this data.
But to really unlock the power of Hadoop, you must be able to efficiently extract data stored across multiple (often tens or hundreds) of nodes with a user-friendly ETL (extract, transform and load) tool that will then allow you to move your Hadoop data into a relational data mart or warehouse where you can use BI tools for analysis.
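As a rough illustration of that extract-transform-load step (the file contents and table schema here are hypothetical, with SQLite standing in for the target data mart or warehouse), a minimal sketch might look like:

```python
import csv
import io
import sqlite3

# Hypothetical extract: a tab-delimited part file as a MapReduce job might emit it.
raw_part_file = io.StringIO("page_a\t120\npage_b\t45\npage_a\t30\n")

# Load into a relational "data mart" table (SQLite stands in for the warehouse).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_hits (page TEXT, hits INTEGER)")
for page, hits in csv.reader(raw_part_file, delimiter="\t"):
    conn.execute("INSERT INTO page_hits VALUES (?, ?)", (page, int(hits)))

# Once loaded, ordinary BI-style SQL applies.
rows = conn.execute(
    "SELECT page, SUM(hits) FROM page_hits GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('page_a', 150), ('page_b', 45)]
```

A real ETL tool would handle the multi-node extraction, type mapping and scheduling that this sketch glosses over.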
Hadoop World 2011: The Blind Men and the Elephant - Matthew Aslett - The 451 ... - Cloudera, Inc.
Who is contributing to the Hadoop ecosystem, what are they contributing, and why? Who are the vendors that are supplying Hadoop-related products and services and what do they want from Hadoop? How is the expanding ecosystem benefiting or damaging the Apache Hadoop project? What are the emerging alternatives to Hadoop and what chance do they have? In this session, the 451 Group will seek to answer these questions based on their latest research and present their perspective of where Hadoop fits in the total data management landscape.
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014) - Eric Baldeschwieler
A summary of the History of Hadoop, some observations about the current state of Hadoop for new users and some predictions about its future (Hint, it's gonna be huge).
Presented at:
http://www.meetup.com/Pasadena-Big-Data-Users-Group/events/203961192/
Hadoop as Data Refinery - Steve Loughran - JAX London
Apache Hadoop is often described as a "Big Data Platform" but what does that mean? One way to better understand Hadoop is to talk about how Hadoop is used. This talk discusses using Hadoop as a "Data Refinery", which is a common use case. The concept is very much like a traditional oil refinery except with data, pulling in large quantities of "crude data" over pipelines, refining some into useful business intelligence; refining other pieces into slightly less crude data that stays in the cluster until needed later. This metaphor proves useful when considering how Hadoop could be adopted in an organisation that already has data warehousing and business intelligence systems -and when contemplating how to hook up a Hadoop cluster to the sources of data inside and outside that organisation. A key point to remember is that storing data in Hadoop is not a means to an end any more than storing data in a database is: it is extracting information from that data. Using Hadoop as a front end "data refinery" means that it can integrate with existing Business Intelligence systems, while providing the platform for new applications.
Explores the notion of "Hadoop as a Data Refinery" within an organisation, whether or not it has an existing Business Intelligence system, and looks at 'agile data' as a benefit of using Hadoop as the store for historical, unstructured and very-large-scale datasets.
The final slides look at the challenge of an organisation becoming "data driven"
Slides from a presentation I gave at the 5th SOA, Cloud + Service Technology Symposium (September 2012, Imperial College, London). The goal of this presentation was to explore with the audience use cases at the intersection of SOA, Big Data and Fast Data. If you are working with both SOA and Big Data I would be very interested to hear about your projects.
Leveraging Hadoop with OBIEE 11g and ODI 11g - UKOUG Tech'13 - Mark Rittman
The latest releases of OBIEE and ODI come with the ability to connect to Hadoop data sources, using MapReduce to integrate data from clusters of "big data" servers complementing traditional BI data sources. In this presentation, we will look at how these two tools connect to Apache Hadoop and access "big data" sources, and share tips and tricks on making it all work smoothly.
Human Information is made up of ideas, is diverse, and has context.
Ideas don’t exactly match like data does; they have distance.
Human Information is not static – it’s dynamic and lives everywhere.
Details on applications
HAVEn is integrated into customers' architectures through other n Apps
HP has started modifying its existing application portfolio to use HAVEn
And HP is building new applications that leverage the power of HAVEn
Many customers are already building applications that use multiple HAVEn
I was meaning to put this talk up for grabs for some time now, but kept forgetting. I was invited to give the keynote speech for the Microstrategy World 2008 conference. The talk was very well received, so here it is.
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12c - Mark Rittman
Delivered as a one-day seminar at the SIOUG and HROUG Oracle User Group Conferences, October 2014.
There are many ways to ingest (load) data into a Hadoop cluster, from file copying using the Hadoop Filesystem (FS) shell through to real-time streaming using technologies such as Flume and Hadoop streaming. In this session we’ll take a high-level look at the data ingestion options for Hadoop, and then show how Oracle Data Integrator and Oracle GoldenGate leverage these technologies to load and process data within your Hadoop cluster. We’ll also consider the updated Oracle Information Management Reference Architecture and look at the best places to land and process your enterprise data, using Hadoop’s schema-on-read approach to hold low-value, low-density raw data, and then use the concept of a “data factory” to load and process your data into more traditional Oracle relational storage, where we hold high-density, high-value data.
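The schema-on-read approach mentioned above can be sketched in a few lines (the log format and field names are illustrative assumptions, not any specific Oracle or Hadoop API): raw lines land unmodified, and a schema is imposed only when the data is read.

```python
# Raw, low-density data lands as plain text lines. Schema-on-write would
# have rejected or coerced these up front; schema-on-read keeps them as-is.
raw_lines = [
    "2014-10-01 GET /index.html 200",
    "2014-10-01 GET /missing 404",
    "garbage line that does not parse",
]

def read_with_schema(lines):
    """Apply a (date, method, path, status) schema at read time,
    silently skipping lines that do not fit it."""
    for line in lines:
        parts = line.split()
        if len(parts) == 4 and parts[3].isdigit():
            yield {"date": parts[0], "method": parts[1],
                   "path": parts[2], "status": int(parts[3])}

records = list(read_with_schema(raw_lines))
errors = [r for r in records if r["status"] >= 400]
print(len(records), len(errors))  # 2 1
```

The "data factory" step would then load only the records that survive this parse into high-value relational storage.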
It is almost impossible to escape the topic of Data Science. While the core of Data Science has remained the same over the last decade, its emergence to the forefront is spurred by both the availability of new data types and a true realization of the value it delivers. In this session, we will provide an overview of data science and the different classes of machine learning algorithms, and deliver an end-to-end demonstration of machine learning using Hadoop. Audience: Developers, Data Scientists, Architects and System Engineers.
Recording: https://hortonworks.webex.com/hortonworks/lsr.php?RCID=4175a7421d00257f33df146f50c41af8
Presentation regarding big data. The presentation also covers the basics of Hadoop and Hadoop components along with their architecture. Contents of the PPT are:
1. Understanding Big Data
2. Understanding Hadoop & It’s Components
3. Components of Hadoop Ecosystem
4. Data Storage Component of Hadoop
5. Data Processing Component of Hadoop
6. Data Access Component of Hadoop
7. Data Management Component of Hadoop
8. Hadoop Security Management Tools: Knox, Ranger
Extending the Data Warehouse with Hadoop - Hadoop World 2011 - Jonathan Seidman
Hadoop provides the ability to extract business intelligence from extremely large, heterogeneous data sets that were previously impractical to store and process in traditional data warehouses. The challenge now is in bridging the gap between the data warehouse and Hadoop. In this talk we’ll discuss some steps that Orbitz has taken to bridge this gap, including examples of how Hadoop and Hive are used to aggregate data from large data sets, and how that data can be combined with relational data to create new reports that provide actionable intelligence to business users.
Hadoop has shown itself to be a great tool for resolving problems with different data aspects, such as data velocity, variety and volume, that cause trouble for relational database storage. In this presentation you'll learn what problems with data occur nowadays and how Hadoop can solve them. You'll learn about Hadoop's basic components and the principles that make Hadoop such a great tool.
InfoAxon has 10+ years of rich open source solutions delivery experience. InfoAxon has a well-defined Open Source BI Practice, a collection of frameworks, templates, tools, methodologies, services and pre-integrated solutions resulting in customized BI Solutions for specific verticals or specific business problems.
At InfoAxon, we have adopted and developed a deep understanding of the Pentaho BI platform, one of the world's leading Open Source BI Platforms. We have rich experience in delivering customized BI Solutions using Pentaho as the core BI engine.
Pentaho - Jake Cornelius - Hadoop World 2010 - Cloudera, Inc.
Putting Analytics in Big Data Analytics
Jake Cornelius
Director of Product Management, Pentaho Corporation
Learn more @ http://www.cloudera.com/hadoop/
Pentaho Big Data Analytics with Vertica and Hadoop - Mark Kromer
Overview of the Pentaho Big Data Analytics Suite from the Pentaho + Vertica presentation at Big Data Techcon 2014 in Boston for the session called "The Ultimate Selfie | Picture Yourself with the Fastest Analytics on Hadoop with HP Vertica and Pentaho"
BI congres 2014-5: from BI to big data - Jan Aertsen - Pentaho - BICC Thomas More
7th BI congress of BICC-Thomas More: 3 April 2014
A travel report from Business Intelligence to Big Data
The travel industry is changing rapidly. This presentation is a journey through classic and modern BI destinations, showing a series of snapshots of different use cases in the travel industry. During the session we highlight the capacity and flexibility a BI tool needs to guide you on your journey from classic BI implementations to modern big data challenges.
Strata 2015 presentation from Oracle for Big Data - we are announcing several new big data products including GoldenGate for Big Data, Big Data Discovery, Oracle Big Data SQL and Oracle NoSQL
Blending Hadoop and MongoDB with Pentaho [11:10 am - 11:30 am]
For eCommerce companies, knowing how promoted wish-lists can spark consumer spending is an analytics goldmine. In this lightning talk, Bo Borland will demonstrate how Pentaho analytics can blend click-stream data about promoted wish-lists with sales transaction records using Hadoop, MongoDB and Pentaho to reveal patterns in online shopping behavior. Regardless of your industry or specific use model, come to this session to learn how to blend MongoDB data with any data source for greater business insight. Pentaho offers the first end-to-end analytic solution for MongoDB. From data ingestion to pixel perfect reporting and ad hoc “slice and dice” analysis, the solution meets today’s growing demand for a 360-degree view of your business.
MongoDB IoT City Tour EINDHOVEN: Analysing the Internet of Things: Davy Nys, ... - MongoDB
Drawing on Pentaho's wide experience in solving customers' big data issues, Davy Nys will position the importance of analytics in the IoT:
[-] Understanding the challenges behind data integration & analytics for IoT
[-] Future proofing your information architecture for IoT
[-] Delivering IoT analytics, now and tomorrow
[-] Real customer examples of where Pentaho can help
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake... - NoSQLmatters
Come to this deep dive on how Pivotal's Data Lake Vision is evolving by embracing next generation in-memory data exchange and compute technologies around Spark and Tachyon. Did we say Hadoop, SQL, and what's the shortest path to get from past to future state? The next generation of data lake technology will leverage the availability of in-memory processing, with an architecture that supports multiple data analytics workloads within a single environment: SQL, R, Spark, batch and transactional.
30 for 30: Quick Start Your Pentaho Evaluation - Pentaho
These slides are from our recent 30 for 30 webinar tailored towards people who have downloaded the Pentaho evaluation and want to know more about the data integration and business analytics components that are part of the trial, how to easily integrate data, and best practices for installing and developing content.
Apache Tajo: A Big Data Warehouse System on Hadoop
Presented by Jae-hwa Jeong, Apache Tajo committer and senior research engineer at Gruter, at Bigdata World Convention 2014, Oct. 23, Busan, Korea
5 things cucumber is bad at by Richard Lawrence - Skills Matter
This talk will look at 5 things Cucumber’s bad at, why that’s a good thing, and what it tells us about Cucumber’s sweet spot in a team’s toolkit.
Many times, when people complain about something Cucumber’s not good at, they’re unwittingly describing something Cucumber shouldn't be good at. They’re revealing that they don’t quite understand BDD and Cucumber’s role in it.
Cucumber is the world's most misunderstood collaboration tool and people need to hear this over and over again.
Patterns for slick database applications - Skills Matter
Slick is Typesafe's open source database access library for Scala. It features a collection-style API, compact syntax, type-safe, compositional queries and explicit execution control. Community feedback helped us to identify common problems developers are facing when writing Slick applications. This talk suggests particular solutions to these problems. We will be looking at reducing boiler-plate, re-using code between queries, efficiently modeling object references and more.
Scala eXchange 2013: Haoyi Li on Metascala, a tiny DIY JVM - Skills Matter
Metascala is a tiny metacircular Java Virtual Machine (JVM) written in the Scala programming language. Metascala is barely 3000 lines of Scala, and is complete enough that it is able to interpret itself metacircularly. Being written in Scala and compiled to Java bytecode, the Metascala JVM requires a host JVM in order to run.
The goal of Metascala is to create a platform to experiment with the JVM: a 3000 line JVM written in Scala is probably much more approachable than the 1,000,000 lines of C/C++ which make up HotSpot, the standard implementation, and more amenable to implementing fun features like continuations, isolates or value classes. The 3000 lines of code gives you:
The bytecode interpreter, together with all the run-time data structures
A stack-machine to SSA register-machine bytecode translator
A custom heap, complete with a stop-the-world, copying garbage collector
Implementations of parts of the JVM's native interface
Although it is far from a complete implementation, Metascala already provides the ability to run untrusted bytecode securely (albeit slowly), since every operation which could potentially cause harm (including memory allocations and CPU usage) is virtualized and can be controlled. Ongoing work includes tightening of the security guarantees, improving compatibility and increasing performance.
Progressive F# Tutorials NYC: Dmitry Mozorov & Jack Pappas on Code Quotations ... - Skills Matter
Code Quotations: Code-as-Data for F#
This tutorial will cover F# Code Quotations in-depth. You'll learn what Code Quotations are, how to use them, and where to apply them in your applications. We'll work through several real-world examples to highlight the important features -- and potential pitfalls -- of Code Quotations.
CukeUp NYC: Ian Dees on Elixir, Erlang, and Cucumberl - Skills Matter
Elixir, Erlang, and Cucumberl
Elixir is a new Ruby-inspired programming language that uses the powerful concurrent machinery of Erlang behind the scenes. Cucumberl is a port of Cucumber to Erlang. Let's see what happens when we put them together.
In this talk, we'll discuss:
How Erlang's concurrency makes it easier to write robust programs
Elixir's approachable syntax
How to test Erlang and Elixir programs using Cucumberl
Attendees will walk away with a solid introduction to the principles of Erlang, and an appreciation of the way Elixir brings the joy of Ruby to the solidity of the Erlang runtime.
CukeUp NYC: Peter Bell on Getting Started with cucumber.js - Skills Matter
Cukeup NYC. Peter Bell on Getting started with cucumber.js
Ever wished you could use cucumber in your javascript apps? In this talk we'll look at the current state of play of cucumber js, when you should and shouldn't use it, and how to get started writing your step definitions in javascript.
Agile Testing & BDD eXchange NYC 2013: Jeffrey Davidson & Lav Pathak & Sam Ho... - Skills Matter
In this engaging experience report, we will present 3 different views – Developer, Tester, Business Analyst – of implementing Acceptance Test Driven Development in a complex, data-driven domain. Hear how we used ATDD for building a ubiquitous language across the entire team, promoting faster feedback, and cultivating a culture where product owners were deeply invested in the quality of both every deliverable and the system as a whole.
Progressive F# Tutorials NYC: Rachel Reese & Phil Trelford on Try F# from Zero... - Skills Matter
In this tutorial, Phil and Rachel will introduce you to the Try F# samples giving you exposure to, and an understanding of, how F# tackles some real-world scenarios. We'll help you explore, generate, and just play around with code samples, as well as talk you through some of the key principles of F#. By the end of this session, you'll have gone from zero to data science in only a few hours!
Progressive F# Tutorials NYC: Don Syme keynote on F# in the Open Source World - Skills Matter
F# is a powerful open-source language which Microsoft, other companies and the F# community all contribute to. In this talk, Don will discuss how the “F# space” has recently opened up significantly in interesting ways. F# now includes contributions that range from Cloud IDE platforms, Cloud Compute frameworks, Data interoperability components, Cross-platform execution, Try F#, MonoDevelop, and even Emacs editor integration with surprising tooling support, as well as the Visual F# tools from Microsoft and the broader NuGet package ecosystem. Don will also talk about some of the latest contributions from Microsoft Research, including new type provider components for F#, and describe how his team work with the Visual F# team and other teams around Microsoft. There will also be demos of some fun new stuff that’s been going on with F# at MSR and the community.
Agile Testing & BDD eXchange NYC 2013: Gojko Adzic on Bond Villain Guide to S... - Skills Matter
Would you like to learn how to make your software testing practices more effective? And how to use your testing strategy to better capture and reflect customer requirements? Gojko Adzic takes a critical look at the effectiveness of current software testing practices and proposes strategies to make it much more effective.
Dmitry Mozorov on Code Quotations: Code-as-Data for F# - Skills Matter
Code Quotations: Code-as-Data for F#
This tutorial will cover F# Code Quotations in-depth. You'll learn what Code Quotations are, how to use them, and where to apply them in your applications. We'll work through several real-world examples to highlight the important features -- and potential pitfalls -- of Code Quotations.
Simon Peyton Jones: Managing parallelism - Skills Matter
If you want to program a parallel computer, it obviously makes sense to start with a computational paradigm in which parallelism is the default (ie functional programming), rather than one in which computation is based on sequential flow of control (the imperative paradigm). And yet, and yet ... functional programmers have been singing this tune since the 1980s, but do not yet rule the world. In this talk I’ll say why I think parallelism is too complex a beast to be slain at one blow, and how we are going to be driven, willy-nilly, towards a world in which side effects are much more tightly controlled than now. I’ll sketch a whole range of ways of writing parallel program in a functional paradigm (implicit parallelism, transactional memory, data parallelism, DSLs for GPUs, distributed processes, etc, etc), illustrating with examples from the rapidly moving Haskell community, and identifying some of the challenges we need to tackle.
2. Big Data
Terabytes and petabytes of data
Sometimes per day
2010, Pentaho. All Rights Reserved. www.pentaho.com. US and Worldwide: +1 (866) 660-7555
3. Example Use Cases Today
Transactional
•Fraud detection
•Financial services/stock markets
Sub-Transactional
•Weblogs
•Social/online media
•Telecoms events
4. Example Use Cases Today
Non-Transactional
•Web pages, blogs etc
•Documents
•Physical events
•Application events
•Machine events
In most cases structured or semi-structured
5. Data Lake
• Single source
• Large volume
• Not distilled
6. Data Lakes
• 0-2 lakes per company
• Known and unknown questions
• Multiple user communities
• $1-10k questions, not $1m ones
• Don’t fit in traditional RDBMS with a reasonable cost
7. Data Lake Requirements
• Store all the data
• Satisfy routine reporting and analysis
• Satisfy ad-hoc query / analysis / reporting
• Balance performance and cost
8. Traditional BI
[Diagram: data flows from source systems into data mart(s); the remainder goes to tape/trash]
9. What if...
[Diagram: data sources feed data lake(s), which feed data mart(s), ad-hoc analysis and a data warehouse]
10. Big Data Does Not Replace Data Marts
• It’s not a database
• High latency
• Optimized for massive data-crunching
• Its databases are immature
• Its databases are NoSQL, not relational
2010, Pentaho. All Rights Reserved. www.pentaho.com. US and Worldwide: +1 (866) 660-7555 | Slide
11. Big Data: Map/Reduce and Hadoop
12. What is Map/Reduce?
• Obligatory Wikipedia quote: “... is a patented software framework introduced by Google to support distributed computing on large data sets on clusters of computers”
• Invented by Google to index “The Internet”
• Apache Hadoop is an open-source implementation of the Map/Reduce algorithm
• Scalable & fault-tolerant, not efficient!
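The two phases can be sketched in plain Java. This is a toy word count, not Hadoop code: the class and method names are illustrative, and the framework's shuffle step is replaced by a simple in-memory grouping.

```java
import java.util.*;
import java.util.stream.*;

// Toy word count that mimics the two Map/Reduce phases in plain Java.
// No Hadoop APIs are used; this only illustrates the shape of the computation.
public class WordCountSketch {

    // Map phase: each input line is turned into (word, 1) pairs independently,
    // so lines can be processed on any node without coordination.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Reduce phase: all pairs sharing a key are summed. In Hadoop the framework
    // shuffles pairs to reducers by key; here we just group in memory.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("big data", "big deal")));
        // prints {big=2, data=1, deal=1}
    }
}
```

Because each map call is independent and reduce only needs pairs grouped by key, both phases can be spread over many nodes, which is exactly the property that makes the model scalable and fault-tolerant rather than efficient.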
13. What Hadoop Really Is
• Core components
• HDFS – a distributed file system allowing massive storage across a cluster of commodity servers
• Map/Reduce – a framework for distributed computation; common use cases include aggregating, sorting, and filtering BIG data sets
• The problem is broken up into small fragments of work that can be computed or recomputed in isolation on any node of the cluster
• Related projects
• Hive – a data warehouse infrastructure on top of Hadoop
• Implements a SQL-like query language, including a JDBC driver
• Allows MapReduce developers to plug in custom mappers and reducers
• HBase – the Hadoop database – AH HA!
• A variant of NoSQL database, problematic for traditional BI
• Best at storing large amounts of unstructured data
14. No seriously, what is Hadoop?
Java software framework that supports data-intensive distributed applications
• Apache project
• Created at Yahoo!, based on Google’s ideas
• Distributed filesystem + MapReduce engine
• Commodity hardware
• Scales out beyond the technology and/or economics of an RDBMS
15. Hadoop and BI?
• Distributed processing
• Distributed file system
• Commodity hardware
• Platform independent (in theory)
• Scales out beyond the technology and/or economics of an RDBMS
In many cases it’s the only viable solution
16. Hadoop and BI?
90% of new Hadoop use cases are transformations of semi-structured or structured data*
* of those companies we’ve talked to...
17. Hadoop and BI?
“The working conditions
within Hadoop are shocking”
ETL Developer
18. Hadoop and BI?
Instead of this...
19. Hadoop and BI?
You have to do this in Java...
public void map(
    Text key,
    Text value,
    OutputCollector output,
    Reporter reporter)

public void reduce(
    Text key,
    Iterator values,
    OutputCollector output,
    Reporter reporter)
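To make the ceremony concrete, here is roughly what filling in the map callback involves. The OutputCollector and Reporter interfaces below are simplified stand-ins (not the real Hadoop classic-API types) so the sketch compiles on its own; even this trivial word-count mapper needs the full plumbing.

```java
import java.util.*;

// Sketch of the boilerplate a Hadoop-style mapper demands. The nested
// interfaces are simplified stubs standing in for the classic Hadoop API,
// so this compiles without any Hadoop dependencies.
public class VerboseMapper {

    interface OutputCollector { void collect(String key, int value); }
    interface Reporter { /* progress callbacks omitted */ }

    // Emit a (word, 1) pair for every word in the input value.
    public void map(String key, String value,
                    OutputCollector output, Reporter reporter) {
        for (String word : value.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                output.collect(word, 1);
            }
        }
    }

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        new VerboseMapper().map("line-1", "Hello Hadoop hello",
                (k, v) -> emitted.add(k + "=" + v), new Reporter() {});
        System.out.println(emitted);
        // prints [hello=1, hadoop=1, hello=1]
    }
}
```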
20. People don’t use
Hadoop for BI because
they want to...
21. ...they do it because
they have to...
22. ... and unfortunately it
wasn’t designed
for most BI requirements
23. Why not add to Hadoop
the things it’s missing...
24. ... until it can do
what we need it to?
25. If only we had a
Java, embeddable,
data transformation engine...
26. Pentaho Data Integration
[Diagram: Pentaho Data Integration is used to design, deploy, and orchestrate Hadoop integration, feeding data marts, data warehouses, and analytical applications]
27. [Diagram: applications & systems load data into Hadoop (files / HDFS); Hive optimizes access; data flows into DM & DW RDBMSs; a web tier visualizes it through reporting, dashboards, and analysis]
28. [Diagram: the simplified stack – HDFS and Hive within Hadoop, a data mart RDBMS above, and a web tier with reporting, dashboards, and analysis on top]
29. 30,000ft View
[Diagram: a PDI client on the host machine sends tasks and jobs to the pentaho-hadoop-vm, which runs Hadoop with HDFS and Hive]
30. Inside the VM
[Diagram: within the pentaho-hadoop-vm, Hadoop hosts HDFS and Hive; a job consists of a mapper and a reducer]
31. Inside a job
[Diagram: a job runs a mapper and a reducer*, each implemented as a Java application or via scripting]
* A combiner can be used to pre-reduce in memory on the mappers before data is transmitted.
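The combiner mentioned in the footnote can be illustrated in plain Java. The names here are hypothetical and no Hadoop APIs are used; the point is simply that pre-summing pairs in memory on the mapper shrinks what must cross the network to the reducers.

```java
import java.util.*;

// Sketch of what a combiner buys you: instead of emitting one (word, 1) pair
// per occurrence, each mapper pre-sums counts locally, so far fewer pairs
// travel over the network to the reducers. Plain Java, no Hadoop APIs.
public class CombinerSketch {

    // Without a combiner: one pair per word occurrence.
    static List<Map.Entry<String, Integer>> mapOnly(String block) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : block.toLowerCase().split("\\s+")) {
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
        }
        return out;
    }

    // With a combiner: the mapper's pairs are summed in memory first.
    static Map<String, Integer> mapWithCombiner(String block) {
        Map<String, Integer> partial = new HashMap<>();
        for (Map.Entry<String, Integer> p : mapOnly(block)) {
            partial.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        String block = "to be or not to be";
        System.out.println("pairs without combiner: " + mapOnly(block).size());
        System.out.println("pairs with combiner:    " + mapWithCombiner(block).size());
        // prints 6 and 4 respectively
    }
}
```

This only works because word-count's reduce (integer addition) is associative and commutative; a combiner can safely apply it early to any subset of a key's pairs.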
32. Inside a job with PDI
[Diagram: with PDI, the mapper and the reducer each embed a PDI execution engine running a transformation composed of steps]
33. Demo
34. The Single-Threaded Transformation Engine
• Designed to use a single thread
• Processes rows per batch, because Hadoop delivers rows in batches
• Knows when the batch of rows is processed
• Is only initialized once and disposed of once
• Has reduced overhead for data passing between steps
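As a rough sketch of the idea (illustrative names only, not the actual PDI engine API): one thread pushes a whole batch of rows through every step in turn, so there are no inter-thread queues or context switches between steps.

```java
import java.util.*;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Minimal single-threaded step pipeline. Each "step" transforms an entire
// batch of rows; the one calling thread runs every step in sequence, so no
// hand-off buffers or context switches are needed between steps.
public class SingleThreadedPipeline {

    private final List<UnaryOperator<List<String>>> steps = new ArrayList<>();

    SingleThreadedPipeline addStep(UnaryOperator<List<String>> step) {
        steps.add(step);
        return this;
    }

    // Process one batch: the same thread applies every step in order.
    List<String> processBatch(List<String> rows) {
        List<String> batch = rows;
        for (UnaryOperator<List<String>> step : steps) {
            batch = step.apply(batch); // same thread, no inter-step queues
        }
        return batch;
    }

    public static void main(String[] args) {
        SingleThreadedPipeline p = new SingleThreadedPipeline()
                .addStep(rows -> rows.stream().map(String::trim)
                        .collect(Collectors.toList()))
                .addStep(rows -> rows.stream().filter(r -> !r.isEmpty())
                        .collect(Collectors.toList()));
        System.out.println(p.processBatch(List.of(" a ", "  ", "b")));
        // prints [a, b]
    }
}
```

The trade-off is throughput: a multi-threaded engine overlaps step work at the cost of per-row queueing, which is wasted effort when the caller (here, Hadoop) already delivers discrete batches.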
35. The Single-Threaded Transformation Engine
• Is no longer used inside of Hadoop thanks to new developments – “the multi-threaded engine is still faster,” they said
• Is being introduced into PDI 4.2.0 (CE)
• You will be able to specify that a mapping runs single-threaded
• Allows you to reduce context switching in large to huge transformations (lots of steps)
36. Pentaho for Hadoop Resources
Download: www.pentaho.com/download/hadoop
Pentaho for Hadoop webpage – resources, press, events, partnerships and more: www.pentaho.com/hadoop
Big Data Analytics: 5-part video series with James Dixon, Pentaho CTO
Or contact me: mcasters at pentaho dot org
37. Thank You.
Join the conversation. You can find us on:
http://blog.pentaho.com
@Pentaho
Pentaho Facebook Group
Pentaho - Open Source Business Intelligence Group