Putting Analytics in Big Data Analytics
Jake Cornelius
Director of Product Management, Pentaho Corporation
Learn more @ http://www.cloudera.com/hadoop/
1. The document discusses Pentaho's approach to big data analytics using a component-based data integration and visualization platform.
2. The platform allows business analysts and data scientists to prepare and analyze big data without advanced technical skills.
3. It provides a visual interface for building reusable data pipelines that can be run locally or deployed to Hadoop for analytics on large datasets.
The document discusses the importance of a hybrid data model for Hadoop-driven analytics. It notes that traditional data warehousing is not suitable for large, unstructured data in Hadoop environments due to limitations in handling data volume, variety, and velocity. The hybrid model combines a data lake in Hadoop for raw, large-scale data with data marts and warehouses. It argues that Pentaho's suite provides tools to lower technical barriers for extracting, transforming, and loading (ETL) data between the data lake and marts/warehouses, enabling analytics on Hadoop data.
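The lake-to-mart ETL flow described above can be sketched in miniature. This is an illustrative Python sketch, not Pentaho's actual tooling; the JSON events and field names are invented for the example:

```python
import json

# Raw "data lake" records: heterogeneous JSON events kept as-is.
# Some records carry extra fields; the mart only wants curated columns.
lake = [
    '{"user": "a", "event": "click", "ts": 1, "payload": {"x": 1}}',
    '{"user": "b", "event": "view", "ts": 2}',
    '{"user": "a", "event": "click", "ts": 3}',
]

def etl_to_mart(raw_lines):
    # Extract: parse each raw line; Transform: filter to the events the
    # mart cares about; Load: emit only the structured columns.
    rows = []
    for line in raw_lines:
        rec = json.loads(line)
        if rec.get("event") == "click":
            rows.append((rec["user"], rec["ts"]))
    return rows

print(etl_to_mart(lake))  # [('a', 1), ('a', 3)]
```

The raw lake keeps every field for future reprocessing, while the mart receives only the narrow, schema-stable slice that analysts query.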
Pentaho provides open source business analytics tools including Kettle for extraction, transformation and loading (ETL) of data, and Weka for machine learning and data mining. Kettle lets users run ETL jobs directly on Hadoop clusters, and its JDBC layer pushes SQL queries down to the source databases for better performance. While bringing Weka analytics to Hadoop data offers clear gains, challenges remain, including ensuring truly parallel machine learning algorithms and keeping clients notified of database updates.
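The JDBC pushdown idea can be illustrated with a tiny stand-alone example. This sketch uses Python's built-in sqlite3 as a stand-in database, not Kettle's actual JDBC layer; the point is that the GROUP BY aggregation runs inside the database, so only summary rows cross the wire:

```python
import sqlite3

# A small in-memory table standing in for a warehouse fact table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("east", 50.0), ("west", 75.0)])

def total_by_region(conn):
    # Pushdown: ship the aggregation to the database as SQL instead of
    # pulling every raw row back and summing client-side.
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region")
    return dict(cur.fetchall())

print(total_by_region(conn))  # {'east': 150.0, 'west': 75.0}
```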
Big Data Integration Webinar: Getting Started With Hadoop Big Data - Pentaho
This document discusses getting started with big data analytics using Hadoop and Pentaho. It provides an overview of installing and configuring Hadoop and Pentaho on a single machine or cluster. Dell's Crowbar tool is presented as a way to quickly deploy Hadoop clusters on Dell hardware in about two hours. The document also covers best practices like leveraging different technologies, starting with small datasets, and not overloading networks. A demo is given and contact information provided.
Why Your Product Needs an Analytic Strategy - Pentaho
The document discusses strategies for enhancing products with analytics capabilities. It outlines three strategic approaches: 1) enhance current software products with analytics, 2) target new opportunities using existing data through direct data monetization or new products/services, and 3) reinvent value propositions using new data technologies like big data. The document provides examples of implementing analytics capabilities for different user personas and considerations for analytics deployments. It argues that analytics can provide benefits like improved decisions, customer stickiness, and new revenue opportunities.
Exclusive Verizon Employee Webinar: Getting More From Your CDR Data - Pentaho
This document discusses a project between Pentaho and Verizon to leverage big data analytics. Verizon generates vast amounts of call detail record (CDR) data from mobile networks that is currently stored in a data warehouse for 2 years and then archived to tape. Pentaho's platform will help optimize the data warehouse by using Hadoop to store all CDR data history. This will free up data warehouse capacity for high value data and allow analysis of the full 10 years of CDR data. Pentaho tools will ingest raw CDR data into Hadoop, execute MapReduce jobs to enrich the data, load results into Hive, and enable analyzing the data to understand calling patterns by geography over time.
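The enrich-then-aggregate MapReduce pattern described in this CDR pipeline can be sketched in plain Python. The CDR fields and the tower-to-region lookup table here are invented for illustration; a real job would run over HDFS splits and load results into Hive rather than an in-memory list:

```python
from collections import defaultdict

# Raw CDRs: (caller, cell_tower_id, duration_seconds).
raw_cdrs = [("555-0001", "T1", 60), ("555-0002", "T1", 120),
            ("555-0003", "T2", 30)]
# Hypothetical enrichment table mapping towers to geographic regions.
tower_region = {"T1": "northeast", "T2": "southwest"}

def map_phase(record):
    # Map step: enrich each CDR with a region, emit (region, duration).
    caller, tower, duration = record
    yield tower_region[tower], duration

def reduce_phase(pairs):
    # Reduce step: sum call time per region.
    totals = defaultdict(int)
    for region, duration in pairs:
        totals[region] += duration
    return dict(totals)

pairs = [kv for rec in raw_cdrs for kv in map_phase(rec)]
print(reduce_phase(pairs))  # {'northeast': 180, 'southwest': 30}
```

Extending the key to (region, month) would yield the calling-patterns-over-time analysis the summary describes.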
The document outlines Pentaho's roadmap and focus areas for business analytics products. It discusses enhancements planned for Pentaho Business Analytics 5.1, including new features for analyzing MongoDB data and improved visualizations. It also summarizes R&D activities like integrating real-time data processing with Storm and Spark. The roadmap focuses on hardening the Pentaho platform for large enterprises, extending capabilities for big data engineering and analytics, and improving embedded analytics.
The document discusses Pentaho's business intelligence (BI) platform for big data analytics. It describes Pentaho as providing a modern, unified platform for data integration and analytics that allows for native integration into the big data ecosystem. It highlights Pentaho's open source development model and that it has over 1,000 commercial customers and 10,000 production deployments. Several use cases are presented that demonstrate how Pentaho helps customers unlock value from big data stores.
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ... - Pentaho
This document discusses approaches to implementing Hadoop, NoSQL, and analytical databases. It describes:
1) The current landscape of big data databases including Hadoop, NoSQL, and analytical databases that are often used together but come from different vendors with different interfaces.
2) Common uses of transactional databases, Hadoop, NoSQL databases, and analytical databases.
3) The complexity of current implementation approaches that involve multiple coding steps across various tools.
4) How Pentaho provides a unified platform and visual tools to reduce the time and effort needed for implementation by eliminating disjointed steps and enabling non-coders to develop workflows and analytics for big data.
30 for 30: Quick Start Your Pentaho Evaluation - Pentaho
These slides are from our recent 30 for 30 webinar, tailored toward people who have downloaded the Pentaho evaluation and want to know more about the data integration and business analytics components included in the trial, how to easily integrate data, and best practices for installing and developing content.
Check out this presentation from Pentaho and ESRG to learn why product managers should understand Big Data and hear about real-life products that have been elevated with these innovative technologies.
Learn more in the brief that inspired the presentation, Product Innovation with Big Data: http://www.pentaho.com/resources/whitepaper/product-innovation-big-data
Moving Health Care Analytics to Hadoop to Build a Better Predictive Model - DataWorks Summit
This document discusses Dignity Health's move to using Hadoop for healthcare analytics to build better predictive models. It outlines their goals of saving costs and lives by leveraging over 30 TB of clinical data using Hadoop and SAS technologies on their Dignity Health Insights platform. The presentation agenda covers Dignity Health, healthcare analytics challenges, their big data ecosystem architecture featuring Hadoop, and how they are using this infrastructure for applications like sepsis surveillance analytics.
Breakout: Operational Analytics with Hadoop - Cloudera, Inc.
Operationalizing models and responding quickly to large volumes of data requires bolt-on systems that can struggle with processing (transforming the data), consistency (always responding to data), and scalability (processing and responding to large volumes of data). If the data volume becomes too large, these traditional systems fail to deliver their responses, resulting in significant losses for organizations. Join this breakout to learn how to overcome the roadblocks.
Pentaho Analytics for MongoDB - presentation from MongoDB World 2014 - Pentaho
Bo Borland's presentation at MongoDB World in NYC, June 24, 2014. Data Integration and Advanced Analytics for MongoDB: Blend, Enrich and Analyze Disparate Data in a Single MongoDB View.
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight - Precisely
The document discusses moving legacy data and workloads from traditional data warehouses to Hadoop. It describes how ELT processes on dormant data waste resources and how offloading this data to Hadoop can optimize costs and performance. The presentation includes a demonstration of using Tableau for self-service analytics on data in Hadoop and a case study of a financial organization reducing ELT development time from weeks to hours by offloading mainframe data to Hadoop.
All data accessible to all my organization - Presentation at OW2con'19, June... - OW2
This document discusses how Dremio provides a unified access point for data across an entire organization. It summarizes how Dremio allows various users, including data engineers, scientists, analysts and business users, to access all kinds of data sources through SQL or REST APIs. Dremio also enables features like data catalogs, collaborative workspaces, and workload monitoring that help organizations better manage and govern their data.
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight - Steven Totman
Demand for quicker access to multiple integrated sources of data continues to rise. Immediate access to data stored in a variety of systems - such as mainframes, data warehouses, and data marts - for visual business intelligence mining is the competitive differentiator enterprises need to win in today's economy.
Stop playing the waiting game and learn about a new end-to-end solution for combining, analyzing, and visualizing data from practically any source in your enterprise environment.
Leading organizations are already taking advantage of this architectural innovation to gain modern insights while reducing costs and propelling their businesses ahead of the competition.
Are you tired of waiting? Don't let your architecture hold you back. Access this webinar and hear from a team of industry experts on how you can Break the Barriers to Big Data Insight.
Explore how data integration (or "mashups") can maximize analytic value and help business teams create streamlined data pipelines that enable ad-hoc analytic inquiries. You'll learn why businesses are increasingly focused on blending data on demand and at the source, the concrete analytic advantages this approach delivers, and the types of architectures required for delivering trusted, blended data. We provide a checklist to assess your data integration needs and capabilities, and review some real-world examples of how blending various data types has created significant analytic value and concrete business impact.
Rob Peglar: Introduction to Analytics and Big Data with Hadoop - Ghassan Al-Yafie
This document provides an introduction to analytics and big data using Hadoop. It discusses the growth of digital data and challenges of big data. Hadoop is presented as a solution for storing and processing large, unstructured datasets across commodity servers. The key components of Hadoop - HDFS for distributed storage and MapReduce for distributed processing - are described at a high level. Examples of industries using big data analytics are also listed.
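The HDFS-plus-MapReduce model described above is easiest to see in the classic word-count example, sketched here in plain Python. The "splits" list stands in for file blocks distributed across HDFS; on a real cluster the map and reduce phases would run on separate commodity nodes:

```python
from collections import Counter
from itertools import chain

# Each string stands in for one HDFS split processed by one mapper.
splits = ["big data big insight", "data at scale"]

def map_split(text):
    # Map step: emit a (word, 1) pair for every word in the split.
    return [(word, 1) for word in text.split()]

def reduce_pairs(pairs):
    # Reduce step (after the shuffle groups pairs by key): sum counts.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

word_counts = reduce_pairs(chain.from_iterable(map_split(s) for s in splits))
print(word_counts["data"])  # 2
```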
Evolution from Apache Hadoop to the Enterprise Data Hub by Cloudera - ArabNet... - ArabNet ME
A new foundation for the Modern Information Architecture.
Speaker: Amr Awadallah, CTO & Cofounder, Cloudera
Our legacy information architecture cannot cope with the realities of today's business: it cannot scale to meet our SLAs due to the separation of storage and compute, cannot economically store the volumes and types of data we currently confront, cannot provide the agility necessary for innovation, and, most importantly, cannot provide a full 360-degree view of our customers, products, and business. In this talk, Dr. Amr Awadallah will present the Enterprise Data Hub (EDH) as the new foundation for the modern information architecture. Built with Apache Hadoop at the core, the EDH is an extremely scalable, flexible, and fault-tolerant data processing system designed to put data at the center of your business.
This document summarizes Patrick de Vries' presentation on connecting everything at the Hadoop Summit 2016. The presentation discusses KPN's use of Hadoop to manage increasing data and network capacity needs. It outlines KPN's data flow process from source systems to Hadoop for processing and generating reports. The presentation also covers lessons learned in implementing Hadoop including having strong executive support, addressing cultural challenges around data ownership, and leveraging existing investments. Finally, it promotes joining a new TELCO Hadoop community for telecommunications providers to share use cases and lessons.
MongoDB IoT City Tour EINDHOVEN: Analysing the Internet of Things: Davy Nys, ... - MongoDB
Drawing on Pentaho's wide experience in solving customers' big data issues, Davy Nys will position the importance of analytics in the IoT:
[-] Understanding the challenges behind data integration & analytics for IoT
[-] Future proofing your information architecture for IoT
[-] Delivering IoT analytics, now and tomorrow
[-] Real customer examples of where Pentaho can help
Better Together: The New Data Management Orchestra - Cloudera, Inc.
To ingest, store, process and leverage big data for maximum business impact requires integrating systems, processing frameworks, and analytic deployment options. Learn how Cloudera’s enterprise data hub framework, MongoDB, and Teradata Data Warehouse working in concert can enable companies to explore data in new ways and solve problems that not long ago might have seemed impossible.
Gone are the days of NoSQL and SQL competing for center stage. Visionary companies are driving data subsystems to operate in harmony. So what’s changed?
In this webinar, you will hear from executives at Cloudera, Teradata and MongoDB about the following:
How to deploy the right mix of tools and technology to become a data-driven organization
Examples of three major data management systems working together
Real world examples of how business and IT are benefiting from the sum of the parts
Join industry leaders Charles Zedlewski, Chris Twogood and Kelly Stirman for this unique panel discussion, moderated by BI Research analyst, Colin White.
1. The document discusses a Gartner report that assesses 20 vendors of data science and machine learning platforms. It evaluates the platforms' abilities to support the full data science life cycle.
2. The report places vendors in four categories - Leaders, Challengers, Visionaries, and Niche Players. It outlines the strengths and cautions of platforms from vendors like Amazon Web Services, Alteryx, and Anaconda.
3. Key criteria for evaluating the platforms include ease of use, support for different personas, capabilities for tasks like modeling and deployment, and growth and innovation. The report aims to help users choose the right platform for their needs.
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight - Cloudera, Inc.
Rethink data management and learn how to break down barriers to Big Data insight with Cloudera's enterprise data hub (EDH), Syncsort offload solutions, and Tableau Software visualization and analytics.
This document discusses building an integrated data warehouse with Oracle Database and Hadoop. It describes why a data warehouse may need Hadoop to handle big data from sources like social media, sensors and logs. Examples are given of using Hadoop for ETL and analytics. The presentation provides an overview of Hadoop and how to connect it to the data warehouse using tools like Sqoop and external tables. It also offers tips on getting started and avoiding common pitfalls.
The document summarizes Pentaho's open source business intelligence and data integration products, including their new capabilities for Hadoop and big data analytics. It discusses Pentaho's partnerships with Amazon Web Services and Cloudera to more easily integrate Hadoop data. It also outlines how Pentaho helps users analyze and visualize both structured and unstructured data from Hadoop alongside traditional data sources.
The document discusses big data and Hadoop. It notes that big data comes in terabytes and petabytes, sometimes generated daily. Hadoop is presented as a framework for distributed computing on large datasets using MapReduce. While Hadoop can store and process massive amounts of data across commodity servers, it was not designed for business intelligence requirements. The document proposes addressing this by adding data integration and transformation capabilities to Hadoop through tools like Pentaho Data Integration, to enable it to better meet the needs of big data analytics.
BI congres 2014-5: from BI to big data - Jan Aertsen - Pentaho (BICC Thomas More)
7th BI congress of BICC-Thomas More: 3 April 2014
A travel report: from Business Intelligence to Big Data
The travel industry is changing rapidly. This presentation takes a journey through classic and modern BI destinations, showing a series of snapshots of use cases from the travel industry. During the session we highlight the capacity and flexibility a BI tool needs in order to guide you on your journey from classic BI implementations to modern big data challenges.
How advanced analytics is impacting the banking sector (Michael Haddad)
The document discusses how advanced analytics is impacting the banking sector. It covers topics like regulatory changes forcing banks to invest in compliance; new digital technologies changing how customers interact with banks; and data analytics helping banks reduce risk, deliver personalized services, and retain skills. It also discusses Hitachi Data Systems' acquisition of Pentaho and how their combined platform can provide unified data integration and business analytics across structured, unstructured, and streaming data sources.
Putting Business Intelligence to Work on Hadoop Data Stores (DATAVERSITY)
An inexpensive way of storing large volumes of data, Hadoop is also scalable and redundant. But getting data out of Hadoop is tough due to the lack of a built-in query language. Also, because users experience high latency (up to several minutes per query), Hadoop is not appropriate for ad hoc query, reporting, and business analysis with traditional tools.
The first step in overcoming Hadoop's constraints is connecting to Hive, a data warehouse infrastructure built on top of Hadoop, which provides the relational structure necessary for scheduled reporting on large datasets stored in Hadoop files. Hive also provides a simple query language called HiveQL, which is based on SQL and enables users familiar with SQL to query this data.
But to really unlock the power of Hadoop, you must be able to efficiently extract data stored across many (often tens or hundreds of) nodes with a user-friendly ETL (extract, transform, and load) tool that lets you move your Hadoop data into a relational data mart or warehouse, where you can use BI tools for analysis.
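At its core, the move from Hadoop files into a relational mart described above is an extract-transform-load pass. A minimal toy sketch of that pattern follows; this is not Pentaho's tooling, and the sample data, table, and event names are invented, with an in-memory SQLite table standing in for the data mart:

```python
import csv
import io
import sqlite3

# Toy stand-in: in practice the extract step would read files exported
# from HDFS, and the load target would be a real data mart.
raw_export = io.StringIO(
    "2010-06-01,page_view,42\n"
    "2010-06-01,purchase,7\n"
    "2010-06-02,page_view,55\n"
)

# Extract: parse the delimited export.
rows = list(csv.reader(raw_export))

# Transform: cast types and keep only the events of interest.
purchases = [(day, int(count)) for day, event, count in rows if event == "purchase"]

# Load: write into a relational table that BI tools can query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (day TEXT, n INTEGER)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", purchases)
total = conn.execute("SELECT SUM(n) FROM purchases").fetchone()[0]
print(total)  # 7
```

A real ETL tool adds what this sketch omits: parallel reads across nodes, schema management, error handling, and scheduling.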
Advanced Reporting and ETL for MongoDB: Easily Build a 360-Degree View of You... (MongoDB)
The document discusses Pentaho's analytics and ETL solutions for MongoDB. It provides an overview of Pentaho Company and its platform for unified business analytics and data integration. It then outlines how Pentaho can be used to build a 360-degree view of customers by extracting, transforming and loading data from source systems into MongoDB and performing analytics and reporting on the MongoDB data. It demonstrates these capabilities with examples and screenshots.
Big Data has been a "buzz word" for a few years now, and it's generated a fair amount of hype. But, while the technology landscape is still evolving, product companies in the software, web, and hardware areas have actually led the way in delivering real value from data sources like weblogs, sensors, and social media as well as systems like Hadoop, NoSQL, and Analytical Databases. These organizations have built "Big Data Apps" that leverage fast, flexible data frameworks to solve a wide array of user problems, scale to massive audiences, and deliver superior predictive intelligence.
Join this webinar to learn why product managers should understand Big Data and hear about real-life products that have been elevated with these innovative technologies. You will hear from:
- Ben Hopkins, Product Marketing Manager at Pentaho, who will discuss what Big Data means for product strategy and why it represents a new toolset for product teams to meet user needs and build competitive advantage
- Jim Stascavage, VP of Engineering at ESRG, who will discuss how his company has innovated with Big Data and predictive analytics to deliver technology products that optimize fuel consumption and maintenance cycles in the maritime and heavy industry sectors, leveraging trillions of sensor data points a year.
Who Should Attend
Product Managers, Product Marketing Managers, Project Managers, Development Managers, Product Executives, and anyone responsible for addressing customer needs & influencing product strategy.
Pentaho Big Data Analytics with Vertica and Hadoop (Mark Kromer)
Overview of the Pentaho Big Data Analytics Suite from the Pentaho + Vertica presentation at Big Data Techcon 2014 in Boston for the session called "The Ultimate Selfie | Picture Yourself with the Fastest Analytics on Hadoop with HP Vertica and Pentaho"
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake... (NoSQLmatters)
Come to this deep dive on how Pivotal's Data Lake Vision is evolving by embracing next generation in-memory data exchange and compute technologies around Spark and Tachyon. Did we say Hadoop, SQL, and what's the shortest path to get from past to future state? The next generation of data lake technology will leverage the availability of in-memory processing, with an architecture that supports multiple data analytics workloads within a single environment: SQL, R, Spark, batch and transactional.
MongoDB IoT City Tour LONDON: Analysing the Internet of Things: Davy Nys, Pen... (MongoDB)
1) The document discusses Pentaho's beliefs around Internet of Things (IoT) analytics, including applying the right data source and processing for different analytics needs, gaining insights by blending multiple data sources on demand, and planning for agility, flexibility and near real-time analytics.
2) It describes how emerging big data use cases demand blending different data sources and provides examples like improving operations and customer experience.
3) The document advocates an Extract-Transform-Report approach for IoT analytics that provides flexibility to integrate diverse data sources and enables real-time insights.
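The "blend multiple data sources on demand" idea above can be illustrated with a minimal sketch. The device metadata and sensor readings below are invented stand-ins for two separate sources (say, an RDBMS and a stream of IoT readings):

```python
# Hypothetical source 1: device metadata, e.g. from a relational system.
devices = {
    "d1": {"site": "Plant A"},
    "d2": {"site": "Plant B"},
}

# Hypothetical source 2: raw sensor readings, e.g. from a big data store.
readings = [
    {"device": "d1", "temp_c": 71.0},
    {"device": "d2", "temp_c": 64.5},
    {"device": "d1", "temp_c": 73.5},
]

# Blend on demand: enrich each reading with its device's site,
# then report the maximum temperature seen per site.
report = {}
for r in readings:
    site = devices[r["device"]]["site"]
    report[site] = max(report.get(site, float("-inf")), r["temp_c"])

print(report)  # {'Plant A': 73.5, 'Plant B': 64.5}
```

The point of blending at query time, rather than pre-loading everything into one warehouse schema, is that either source can change shape without a full reload.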
Blending Hadoop and MongoDB with Pentaho [11:10 am - 11:30 am]
For eCommerce companies, knowing how promoted wish-lists can spark consumer spending is an analytics goldmine. In this lightning talk, Bo Borland will demonstrate how Pentaho analytics can blend click-stream data about promoted wish-lists with sales transaction records, using Hadoop, MongoDB, and Pentaho to reveal patterns in online shopping behavior. Regardless of your industry or specific use model, come to this session to learn how to blend MongoDB data with any data source for greater business insight. Pentaho offers the first end-to-end analytic solution for MongoDB. From data ingestion to pixel-perfect reporting and ad hoc “slice and dice” analysis, the solution meets today’s growing demand for a 360-degree view of your business.
Open Analytics 2014 - Pedro Alves - Innovation through Open Source (OpenAnalytics Spain)
Delivering the Future of Analytics: Innovation through Open Source
Pentaho was born out of the desire to achieve positive, disruptive change in the business analytics market, dominated by bureaucratic megavendors offering expensive heavyweight products built on outdated technology platforms. Pentaho’s open, embeddable data integration and analytics platform was developed with a strong open source heritage. This gave Pentaho a first-mover advantage in engaging early with adopters of big data technologies and solving the difficult challenges of integrating both established and emerging data types to drive analytics. Continued technology innovations to support the big data ecosystem have kept customers ahead of the big data curve. With the ability to drastically reduce the time to design, develop, and deploy big data solutions, Pentaho counts numerous big data customers, both large and small, across the financial services, retail, travel, healthcare, and government industries around the world.
Evolution of Big Data at Intel - Crawl, Walk and Run Approach (DataWorks Summit)
Intel's big data journey began in 2011 with an evaluation of Hadoop. Since then, Intel has expanded its use of Hadoop and Cloudera across multiple environments. Intel's 3-year roadmap focuses on evolving its Hadoop platform to support more advanced analytics, real-time capabilities, and integrating with traditional BI tools. Key strategies include designing for scalability, following an iterative approach to understand data, and leveraging open source technologies.
This document discusses strategies for filling a data lake by improving the process of data onboarding. It advocates using a template-based approach to streamline data ingestion from various sources and reduce dependence on hardcoded procedures. The key aspects are managing ELT templates and metadata through automated metadata extraction. This allows generating integration jobs dynamically based on metadata passed at runtime, providing flexibility to handle different source data with one template. It emphasizes reducing the risks associated with large data onboarding projects by maintaining a standardized and organized data lake.
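The template-plus-metadata approach described above can be sketched in a few lines: one generic template, driven by per-source metadata passed at runtime, replaces a hardcoded job per source. The metadata fields and paths here are hypothetical, not any particular product's schema:

```python
# One generic onboarding template; per-source details come from metadata.
def generate_job(meta):
    """Build an integration job definition from source metadata."""
    cols = ", ".join(meta["columns"])
    return {
        "name": f"onboard_{meta['source']}",
        "extract": f"SELECT {cols} FROM {meta['table']}",
        "load_path": f"/data_lake/raw/{meta['source']}/",
    }

# Onboarding a new source is now just new metadata, not a new job.
crm_meta = {"source": "crm", "table": "customers", "columns": ["id", "name"]}
job = generate_job(crm_meta)
print(job["extract"])  # SELECT id, name FROM customers
```

In a real system the metadata would itself be extracted automatically from the source catalog, which is the "automated metadata extraction" the summary refers to.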
Web Briefing: Unlock the power of Hadoop to enable interactive analytics (Kognitio)
This document provides an agenda and summaries for a web briefing on unlocking the power of Hadoop to enable interactive analytics and real-time business intelligence. The agenda includes demonstrations on SQL and Hadoop with in-memory acceleration, interactive analytics with Hadoop, and modern data architectures. It also includes presentations on big data drivers and patterns, interoperating Hadoop with existing data tools, and using Hadoop to power new targeted applications.
Join Cloudian, Hortonworks and 451 Research for a panel-style Q&A discussion about the latest trends and technology innovations in Big Data and Analytics. Matt Aslett, Data Platforms and Analytics Research Director at 451 Research, John Kreisa, Vice President of Strategic Marketing at Hortonworks, and Paul Turner, Chief Marketing Officer at Cloudian, will answer your toughest questions about data storage, data analytics, log data, sensor data and the Internet of Things. Bring your questions or just come and listen!
Driving Real Insights Through Data Science (VMware Tanzu)
Major changes in industries have been brought about by the emergence of data-driven discoveries and applications. Many organizations are bringing together their data and looking to drive change. But the ability to generate new insights in real time from massive sets of data is still far from commonplace.
At this event, data technology experts and data scientists from Pivotal provided the latest business perspective on how data science and engineering can be used to accelerate the generation of new insights.
For information about upcoming Pivotal events, please visit: http://pivotal.io/news-events/#events
The document discusses using Cloudera DataFlow to address challenges with collecting, processing, and analyzing log data across many systems and devices. It provides an example use case of logging modernization to reduce costs and enable security solutions by filtering noise from logs. The presentation shows how DataFlow can extract relevant events from large volumes of raw log data and normalize the data to make security threats and anomalies easier to detect across many machines.
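The filter-then-normalize step described above can be sketched without any DataFlow machinery: drop the noise, and reshape the relevant events into one common record format so they are comparable across machines. The log lines and pattern below are invented examples:

```python
import re

# Hypothetical raw syslog-style lines; only auth failures are "relevant events".
raw_logs = [
    "Jun 01 10:02:11 host1 sshd[311]: Failed password for root from 10.0.0.5",
    "Jun 01 10:02:12 host1 cron[99]: job started",
    "Jun 01 10:05:40 host2 sshd[412]: Failed password for admin from 10.0.0.9",
]

pattern = re.compile(
    r"^(?P<ts>\w+ \d+ [\d:]+) (?P<host>\S+) sshd\[\d+\]: "
    r"Failed password for (?P<user>\S+) from (?P<ip>[\d.]+)"
)

# Filter out the noise and normalize each match to a common record shape.
events = [m.groupdict() for line in raw_logs if (m := pattern.match(line))]
print(len(events))      # 2
print(events[0]["ip"])  # 10.0.0.5
```

A flow tool applies the same idea continuously and at scale, but the essence is the same: most log volume is noise, and normalized events are far cheaper to store and easier to alert on.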
Cloudera Data Impact Awards 2021 - Finalists (Cloudera, Inc.)
The document outlines the 2021 finalists for the annual Data Impact Awards program, which recognizes organizations using Cloudera's platform and the impactful applications they have developed. It provides details on the challenges, solutions, and outcomes for each finalist project in the categories of Data Lifecycle Connection, Cloud Innovation, Data for Enterprise AI, Security & Governance Leadership, Industry Transformation, People First, and Data for Good. There are multiple finalists highlighted in each category demonstrating innovative uses of data and analytics.
2020 Cloudera Data Impact Awards Finalists (Cloudera, Inc.)
Cloudera is proud to present the 2020 Data Impact Awards Finalists. This annual program recognizes organizations running the Cloudera platform for the applications they've built and the impact their data projects have on their organizations, their industries, and the world. Nominations were evaluated by a panel of independent thought-leaders and expert industry analysts, who then selected the finalists and winners. Winners exemplify the most-cutting edge data projects and represent innovation and leadership in their respective industries.
The document outlines the agenda for Cloudera's Enterprise Data Cloud event in Vienna. It includes welcome remarks, keynotes on Cloudera's vision and customer success stories. There will be presentations on the new Cloudera Data Platform and customer case studies, followed by closing remarks. The schedule includes sessions on Cloudera's approach to data warehousing, machine learning, streaming and multi-cloud capabilities.
Machine Learning with Limited Labeled Data 4/3/19 (Cloudera, Inc.)
Cloudera Fast Forward Labs’ latest research report and prototype explore learning with limited labeled data. This capability relaxes the stringent labeled data requirement in supervised machine learning and opens up new product possibilities. It is industry invariant, addresses the labeling pain point and enables applications to be built faster and more efficiently.
Data Driven With the Cloudera Modern Data Warehouse 3.19.19 (Cloudera, Inc.)
In this session, we will cover how to move beyond structured, curated reports based on known questions on known data, to an ad-hoc exploration of all data to optimize business processes and into the unknown questions on unknown data, where machine learning and statistically motivated predictive analytics are shaping business strategy.
Introducing Cloudera DataFlow (CDF) 2.13.19 (Cloudera, Inc.)
Watch this webinar to understand how Hortonworks DataFlow (HDF) has evolved into the new Cloudera DataFlow (CDF). Learn about key capabilities that CDF delivers, such as:
- Powerful data ingestion powered by Apache NiFi
- Edge data collection by Apache MiNiFi
- IoT-scale streaming data processing with Apache Kafka
- Enterprise services to offer unified security and governance from edge to enterprise
Introducing Cloudera Data Science Workbench for HDP 2.12.19 (Cloudera, Inc.)
Cloudera’s Data Science Workbench (CDSW) is available for Hortonworks Data Platform (HDP) clusters for secure, collaborative data science at scale. During this webinar, we provide an introductory tour of CDSW and a demonstration of a machine learning workflow using CDSW on HDP.
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19 (Cloudera, Inc.)
Join Cloudera as we outline how we use Cloudera technology to strengthen sales engagement, minimize marketing waste, and empower line of business leaders to drive successful outcomes.
Leveraging the cloud for analytics and machine learning 1.29.19 (Cloudera, Inc.)
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on Azure. In this webinar, you'll see how fast and easy it is to deploy a modern data management platform: in your cloud, on your terms.
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19 (Cloudera, Inc.)
Join us to learn about the challenges of legacy data warehousing, the goals of modern data warehousing, and the design patterns and frameworks that help to accelerate modernization efforts.
Leveraging the Cloud for Big Data Analytics 12.11.18 (Cloudera, Inc.)
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on AWS. In this webinar, you'll see how fast and easy it is to deploy a modern data management platform: in your cloud, on your terms.
Explore new trends and use cases in data warehousing including exploration and discovery, self-service ad-hoc analysis, predictive analytics and more ways to get deeper business insight. Modern Data Warehousing Fundamentals will show how to modernize your data warehouse architecture and infrastructure for benefits to both traditional analytics practitioners and data scientists and engineers.
The document discusses the benefits and trends of modernizing a data warehouse. It outlines how a modern data warehouse can provide deeper business insights at extreme speed and scale while controlling resources and costs. Examples are provided of companies that have improved fraud detection, customer retention, and machine performance by implementing a modern data warehouse that can handle large volumes and varieties of data from many sources.
Extending Cloudera SDX beyond the Platform (Cloudera, Inc.)
Cloudera SDX is by no means restricted to just the platform; it extends well beyond it. In this webinar, we show you how Bardess Group’s Zero2Hero solution leverages the shared data experience to coordinate Cloudera, Trifacta, and Qlik to deliver complete customer insight.
Federated Learning: ML with Privacy on the Edge 11.15.18 (Cloudera, Inc.)
Join Cloudera Fast Forward Labs Research Engineer, Mike Lee Williams, to hear about their latest research report and prototype on Federated Learning. Learn more about what it is, when it’s applicable, how it works, and the current landscape of tools and libraries.
Analyst Webinar: Doing a 180 on Customer 360 (Cloudera, Inc.)
451 Research Analyst Sheryl Kingstone, and Cloudera’s Steve Totman recently discussed how a growing number of organizations are replacing legacy Customer 360 systems with Customer Insights Platforms.
Build a modern platform for anti-money laundering 9.19.18 (Cloudera, Inc.)
In this webinar, you will learn how Cloudera and BAH riskCanvas can help you build a modern AML platform that reduces false positive rates, investigation costs, technology sprawl, and regulatory risk.
Introducing the data science sandbox as a service 8.30.18 (Cloudera, Inc.)
How can companies integrate data science into their businesses more effectively? Watch this recorded webinar and demonstration to hear more about operationalizing data science with Cloudera Data Science Workbench on Cazena’s fully-managed cloud platform.
2. © 2010, Pentaho. All Rights Reserved. www.pentaho.com. US and Worldwide: +1 (866) 660-7555
Traditional BI
[Diagram: Data Source feeds selected data into Data Mart(s); the rest goes to Tape/Trash; unanswerable questions shown as "?"]
3.
Big Data Architecture
[Diagram: Data Source feeds Data Lake(s), which feed Data Mart(s), the Data Warehouse, and Ad-Hoc analysis]
4.
Pentaho Data Integration
[Diagram: Pentaho Data Integration sits between Hadoop and the Data Marts, Data Warehouse, and Analytical Applications, with roles to Design, Deploy, and Orchestrate]
5.
[Diagram: Big Data pyramid with actions Load, Optimize, Visualize: Hadoop (Files / HDFS, Hive) at the base, RDBMS (DM & DW) in the middle, Web Tier (Applications & Systems) at the top, all serving Reporting / Dashboards / Analysis]
6.
[Diagram: the same pyramid: Hadoop (HDFS, Hive) at the base, RDBMS (DM) in the middle, Web Tier at the top, serving Reporting / Dashboards / Analysis]
7.
Demo
8. Pentaho for Hadoop Announcements
• Pentaho for Hadoop Download Capability
  • Includes support for development; production support will follow with GA
  • Collaborative effort between Pentaho and the Pentaho Community
  • 60+ beta sites over a three-month beta cycle
  • Pentaho contributed code for API integration with Hive to the open source Apache Foundation
• Pentaho and Cloudera Partnership
  • Combines Pentaho's business intelligence and data integration capabilities with Cloudera's Distribution for Hadoop (CDH)
  • Enables business users to take advantage of Hadoop, with the ability to easily and cost-effectively mine, visualize and analyze their Hadoop data
9. Pentaho for Hadoop Announcements (cont.)
• Pentaho and Impetus Technologies Partnership
  • Incorporates Pentaho Agile BI and the Pentaho BI Suite for Hadoop into the Impetus Large Data Analytics practice
  • First major SI to adopt Pentaho for Hadoop
  • Facilitates large data analytics projects, including expert consulting services, best-practices support in Hadoop and nCluster implementations, and deployment on private and public clouds
10.
Pentaho for Hadoop Resources & Events
Resources
Download www.pentaho.com/download/hadoop
Pentaho for Hadoop webpage - resources, press, events, partnerships and
more: www.pentaho.com/hadoop
Big Data Analytics: 5 part video series with James Dixon, Pentaho CTO
Events
Hadoop World: NYC - Oct 12, Gold Sponsor, Exhibitor, Richard Daley
presenting, ‘Putting Analytics in Big Data Analysis’
London Hadoop User Group - Oct 12, London
Agile BI Meets Big Data - Oct 13, New York City
11.
Thank You.
Join the conversation. You can find us on:
Pentaho Facebook Group
@Pentaho
http://blog.pentaho.com
Pentaho - Open Source Business Intelligence Group
Editor's Notes
In a traditional BI system where we have not been able to store all of the raw data, we have solved the problem by being selective.
First, we selected the attributes of the data that we knew we had questions about. Then we cleansed it, aggregated it to transaction level or higher, and packaged it up in a form that is easy to consume. Then we put it into an expensive system that we could not scale, whether technically or financially. The rest of the data was thrown away or archived on tape, which, for the purposes of analysis, is the same as throwing it away.
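The selectivity described above can be made concrete with a small sketch (Python, with made-up event data, purely for illustration): once rows are selected and rolled up into the mart, any attribute that was not chosen up front is unrecoverable.

```python
from collections import defaultdict

# Hypothetical raw events. In a traditional BI flow, only a few selected
# attributes survive, and rows are rolled up before loading the mart.
raw_events = [
    {"user": "u1", "page": "/home", "browser": "Firefox", "amount": 0},
    {"user": "u1", "page": "/buy",  "browser": "Firefox", "amount": 20},
    {"user": "u2", "page": "/home", "browser": "Safari",  "amount": 0},
    {"user": "u2", "page": "/buy",  "browser": "Safari",  "amount": 35},
]

# Select and aggregate: keep only revenue per user.
mart = defaultdict(int)
for e in raw_events:
    mart[e["user"]] += e["amount"]

print(dict(mart))  # {'u1': 20, 'u2': 35}
# A later question such as "which browser converts best?" can no longer be
# answered from the mart: the 'browser' attribute was thrown away.
```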
TRANSITION
The problem is we don’t know what is in the data that we are throwing away or archiving. We can only answer the questions that we could predict ahead of time.
When we look at the Big Data architecture we described before, we recall that:
* We want to store all of the data, so we can answer both known and unknown questions
* We want to satisfy our standard reporting and analysis requirements
* We want to satisfy ad-hoc needs by providing the ability to dip into the lake at any time to extract data
* We want to balance performance and cost as we scale
We need the ability to take the data in the Data Lake and easily convert it into data suitable for a data mart, data warehouse or ad-hoc data set - without requiring custom Java code
Fortunately we have an embeddable data integration engine, written in Java.
We have taken our data integration engine, PDI, and integrated it with Hadoop in a number of different areas:
* We have the ability to move files between Hadoop and external locations
* We have the ability to read and write HDFS files during data transformations
* We have the ability to execute data transformations within the MapReduce engine
* We have the ability to extract information from Hadoop and load it into external databases and applications
* And we have the ability to orchestrate all of this, so you can integrate Hadoop into the rest of your data architecture with scheduling, monitoring, logging, etc.
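The orchestration idea behind those capabilities can be sketched in plain Python. This is not PDI's actual engine or API: the step names, the `run_job` helper, and the dict standing in for HDFS are all hypothetical, chosen only to mirror the capabilities listed above (move a file in, transform it, extract and load it out, with each step logged).

```python
def run_job(steps, context):
    """Run named steps in order; stop and report on the first failure."""
    log = []
    for name, step in steps:
        try:
            step(context)
            log.append((name, "ok"))
        except Exception as exc:
            log.append((name, f"failed: {exc}"))
            break
    return log

hdfs = {}  # stand-in for the Hadoop file system

steps = [
    ("copy file into HDFS",
     lambda ctx: hdfs.update({"/weblog.txt": ctx["local_file"]})),
    ("transform within the cluster",
     lambda ctx: hdfs.update({"/weblog_clean.txt": hdfs["/weblog.txt"].upper()})),
    ("extract and load into a database",
     lambda ctx: ctx["db"].append(hdfs["/weblog_clean.txt"])),
]

context = {"local_file": "get /home http/1.1", "db": []}
log = run_job(steps, context)
print(log)             # each step reports "ok"
print(context["db"])   # ['GET /HOME HTTP/1.1']
```

In PDI these steps are drawn on a canvas rather than coded, and the engine adds the scheduling, monitoring and logging around them.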
Put into diagram form, so we can indicate the different layers in the architecture and also show the scale of the data, we get this Big Data pyramid.
* At the bottom of the pyramid we have Hadoop, containing our complete set of data.
* Higher up we have our data mart layer. This layer has less data in it, but has better performance.
* At the top we have application-level data caches.
* Looking down from the top, from the perspective of our users, they can see the whole pyramid - they have access to the whole structure. The only thing that varies is the query time, depending on what data they want.
* Here we see that the RDBMS layer lets us optimize access to the data. We can decide how much data we want to stage in this layer. If we add more storage in this layer, we can increase the performance of a larger subset of the data lake, but it costs more money.
In this demo we will show how easy it is to execute a series of Hadoop and non-Hadoop tasks. We are going to
TRANSITION 1
Get a weblog file from an FTP server
TRANSITION 2
Make sure the source file does not exist with the Hadoop file system
TRANSITION 3
Copy the weblog file into Hadoop
TRANSITION 4
Read the weblog and process it - add metadata about the URLs, add geocoding, and enrich the operating system and browser attributes
TRANSITION 5
Write the results of the data transformation to a new, improved, data file
TRANSITION 6
Load the data into Hive
TRANSITION 7
Read an aggregated data set from Hadoop
TRANSITION 8
And write it into a database
TRANSITION 9
Slice and dice the data with the database
TRANSITION 10
And execute an ad-hoc query into Hadoop
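The enrichment step in the demo (transition 4) can be sketched as follows. This is an illustrative stand-in, not the demo's actual transformation: the sample log line, the regular expression, and the keyword-based OS/browser matching are all simplified assumptions.

```python
import re

# Naive combined-log-format parser: pull out the URL, status code and
# user agent, then derive coarse OS and browser attributes.
LOG_RE = re.compile(
    r'"(?P<method>\w+) (?P<url>\S+) \S+" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def enrich(line):
    m = LOG_RE.search(line)
    if not m:
        return None
    agent = m.group("agent")
    os_name = "Windows" if "Windows" in agent else "Mac" if "Mac" in agent else "Other"
    browser = "Firefox" if "Firefox" in agent else "MSIE" if "MSIE" in agent else "Other"
    return {"url": m.group("url"), "status": m.group("status"),
            "os": os_name, "browser": browser}

line = ('10.0.0.1 - - [12/Oct/2010:10:00:00 -0400] "GET /index.html HTTP/1.1" '
        '200 1043 "-" "Mozilla/5.0 (Windows; U) Firefox/3.6"')
print(enrich(line))
# {'url': '/index.html', 'status': '200', 'os': 'Windows', 'browser': 'Firefox'}
```

In the demo this logic runs as a PDI transformation over the whole weblog in Hadoop, with geocoding and URL metadata added in the same pass.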