Combine SAS High-Performance Capabilities with Hadoop YARN Hortonworks
What if you could assemble all your data in one system and run your critical analytic applications in parallel, regardless of the format, age or location of the data? Today, thanks to the economics of Apache Hadoop-based data platforms, in particular YARN, this is possible.
SAS applications bring highly advanced, in-memory analytic processing to the data in Hadoop and enable a rich set of additional use cases with high-performance analytic needs. Join this webinar and learn how, by combining their solutions, Hortonworks and SAS offer more flexibility to choose best-of-breed SAS HPA and LASR analytic applications in conjunction with trusted Hadoop workloads. Hear how it is possible to leverage Hadoop clusters to extend the power of SAS analytics.
You will hear directly from our experts how SAS HPA and LASR have been integrated with Hadoop YARN to:
Enable a modern data architecture without the need for fragmented processing clusters for each workload.
Ensure low-latency local data access directly from the data nodes.
Create Unified Resource Management window-panes for managing SAS HPA, LASR and HDP resources.
Speakers:
Arun Murthy, Founder and Architect at Hortonworks
Arun is an Apache Hadoop PMC member and has been a full-time contributor to the project since its inception in 2006. He is also the lead of the MapReduce project and has focused on building NextGen MapReduce (YARN). Prior to co-founding Hortonworks, Arun was responsible for all MapReduce code and configuration deployed across the 42,000+ servers at Yahoo! He jointly holds the current world sorting record using Apache Hadoop.
Paul Kent, Vice President Big Data at SAS
Paul Kent is Vice President of Big Data initiatives at SAS. He divides his time among customers, partners and the Research & Development teams, discussing, evangelizing and developing software at the confluence of big data and high-performance computing. A datacenter rack full of current-generation 64-bit x86 processors represents a very large aggregate memory space, thousands of threads and plentiful IO that can be harnessed to solve problems at a much larger scale than we have traditionally been accustomed to.
Introduction to the Hortonworks YARN Ready Program Hortonworks
The recently launched YARN Ready Program will accelerate multi-workload Hadoop in the Enterprise. The program enables developers to integrate new and existing applications with YARN-based Hadoop. We will cover:
--the program and its benefits
--why it is important to customers
--tools and guides to help you get started
--technical resources to support you
--marketing recognition you can leverage
Enrich a 360-degree Customer View with Splunk and Apache Hadoop Hortonworks
What if your organization could obtain a 360-degree view of the customer across offline, online, social and mobile channels? Attend this webinar with Splunk and Hortonworks and see examples of how marketing, business and operations analysts can reach across disparate data sets in Hadoop to spot new opportunities for up-sell and cross-sell. We'll also cover examples of how to measure buyer sentiment and changes in buyer behavior, along with best practices on how to use data in Hadoop with Splunk to assign customer influence scores that online, call-center, and retail branches can use to customize more compelling products and promotions.
Pig has added some exciting new features in 0.10, including a boolean type, UDFs in JRuby, load and store functions for JSON, bloom filters, and performance improvements. Join Alan Gates, Hortonworks co-founder and long-time contributor to the Apache Pig and HCatalog projects, to discuss these new features, as well as talk about work the project is planning to do in the near future. In particular, we will cover how Pig can take advantage of changes in Hadoop 0.23.
Hortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop Hortonworks
Real Time Monitoring requires a highly scalable infrastructure of message bus, database, distributed event processing and scalable analytics engine. By bringing together leading open source projects of Apache Kafka, Apache HBase, Apache Storm and Apache Hive, the Hortonworks Data Platform offers a comprehensive Real Time Analysis platform. In this session, we will provide an in-depth overview of all the key technology components and demonstrate a working solution for monitoring a fleet of trucks.
Audience: Developers, Architects and System Engineers from the Hortonworks Technology Partner community.
Recording: https://hortonworks.webex.com/hortonworks/lsr.php?RCID=0278dc8aa49a9991e1ce436c71f53d30
Delivering Apache Hadoop for the Modern Data Architecture Hortonworks
Join Hortonworks and Cisco as we discuss trends and drivers for a modern data architecture. Our experts will walk you through some key design considerations when deploying a Hadoop cluster in production. We'll also share practical best practices around Cisco-based big data architectures and Hortonworks Data Platform to get you started on building your modern data architecture.
This is the presentation from the "Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS" webinar on May 28, 2014. Rohit Bakhshi, a senior product manager at Hortonworks, and Vinod Vavilapalli, PMC member for Apache Hadoop, give an overview of YARN and HDFS and discuss new features in HDP 2.1. Those new features include: HDFS extended ACLs, HTTPS wire encryption, HDFS DataNode caching, resource manager high availability, application timeline server, and capacity scheduler pre-emption.
Hortonworks Data Platform 2.2 includes Apache HBase for fast NoSQL data access. In this 30-minute webinar, we discussed HBase innovations that are included in HDP 2.2, including: support for Apache Slider; Apache HBase high availability (HA); block cache compression; and wire-level encryption.
Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod... Hortonworks
Many enterprises are turning to Apache Hadoop to enable Big Data Analytics and reduce the costs of traditional data warehousing. Yet, it is hard to succeed when 80% of the time is spent on moving data and only 20% on using it. It’s time to swap the 80/20! The Big Data experts at Attunity and Hortonworks have a solution for accelerating data movement into and out of Hadoop that enables faster time-to-value for Big Data projects and a more complete and trusted view of your business. Join us to learn how this solution can work for you.
YARN Ready: Integrating to YARN with Tez Hortonworks
The YARN Ready webinar series helps developers integrate their applications with YARN. Tez is one vehicle to do that. We take a deep dive, including a code review, to help you get started.
Discover HDP 2.1: Interactive SQL Query in Hadoop with Apache Hive Hortonworks
In February 2013, the open source community launched the Stinger Initiative to improve speed, scale and SQL semantics in Apache Hive. After thirteen months of constant, concerted collaboration (and more than 390,000 new lines of Java code) Stinger is complete with Hive 0.13.
In this presentation, Carter Shanklin, Hortonworks director of product management, and Owen O'Malley, Hortonworks co-founder and committer to Apache Hive, discuss how Hive enables interactive query using familiar SQL semantics.
In 2012, we released Hortonworks Data Platform powered by Apache Hadoop and established partnerships with major enterprise software vendors, including Microsoft and Teradata, that are making enterprise-ready Hadoop easier and faster to consume. As we start 2013, we invite you to join us for this live webinar where Shaun Connolly, VP of Strategy at Hortonworks, will cover the highlights of 2012 and the road ahead in 2013 for Hortonworks and Apache Hadoop.
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop Hortonworks
Beginning with HDP 2.1, Hortonworks Data Platform ships with Apache Falcon for Hadoop data governance. Himanshu Bari, Hortonworks senior product manager, and Venkatesh Seetharam, Hortonworks co-founder and committer to Apache Falcon, lead this 30-minute webinar, including:
+ Why you need Apache Falcon
+ Key new Falcon features
+ Demo: Defining data pipelines with replication; policies for retention and late data arrival; managing Falcon server with Ambari
Discover HDP 2.1: Apache Storm for Stream Data Processing in Hadoop Hortonworks
For the first time, Hortonworks Data Platform ships with Apache Storm for processing stream data in Hadoop.
In this presentation, Himanshu Bari, Hortonworks senior product manager, and Taylor Goetz, Hortonworks engineer and committer to Apache Storm, cover Storm and stream processing in HDP 2.1:
+ Key requirements of a streaming solution and common use cases
+ An overview of Apache Storm
+ Q & A
Discover HDP 2.1: Apache Solr for Hadoop Search Hortonworks
Apache Solr is the open source platform for searching data stored in Hadoop. Solr powers search on many of the world's largest Internet sites, enabling powerful full-text search and near real-time indexing. Whether users search for tabular, text, geo-location or sensor data in Hadoop, they find it quickly with Apache Solr. Hortonworks Data Platform 2.1 includes Apache Solr.
In this deck from their 30-minute webinar, Rohit Bakhshi, Hortonworks product manager, and Paul Codding, Hortonworks solution engineer, describe how Solr works within HDP's YARN-based architecture.
Stinger.Next by Alan Gates of Hortonworks Data Con LA
Over the last 13 months the Apache Hive community, which included 145 developers and 44 companies working together through the Stinger initiative, delivered 390,000 lines of code and 1600 resolved JIRA tickets. This is only the beginning. The Hive community has already started the next phase of extending the Speed, Scale, and SQL compliance in Hive. As Hadoop 2.0 with YARN evolves to enable a dizzying array of powerful engines that allow us to interact with ever-growing data in new ways, well-known tools such as SQL need to scale with it. This session will provide a technical illustration of the challenges facing SQL on Hadoop today and what the road ahead looks like as the user community drives more innovation. Stinger.next is the next multi-phase initiative to evolve Hive as the de facto SQL engine for Hadoop designed to deliver Speed, Scale and better SQL.
Hortonworks Yarn Code Walk Through January 2014 Hortonworks
This slide deck accompanies the Webinar recording YARN Code Walk through on Jan. 22, 2014, on Hortonworks.com/webinars under Past Webinars, or
https://hortonworks.webex.com/hortonworks/lsr.php?AT=pb&SP=EC&rID=129468197&rKey=b645044305775657
Hadoop Operations, Innovations and Enterprise Readiness with Hortonworks Data... Hortonworks
Hortonworks continues to innovate throughout all Hadoop-related projects, packaging the most enterprise-ready components, such as Ambari, into the Hortonworks Data Platform (HDP). Please join us in this interactive webinar as we present real-world use cases of Enterprise customers that are finding success with HDP and their Big Data initiatives. We will also introduce new features from version 1.2 of the Hortonworks Data Platform and how it has become the leading 100% open source distribution choice for the Enterprise.
In this webinar we will outline how enterprise customers are successful with HDP and also review some of the newest features in version 1.2, including:
-How to provision a cluster
-How to manage and monitor a cluster using completely open source tools
-How to perform diagnostics to identify issues in a cluster
YARN webinar series: Using Scalding to write applications to Hadoop and YARN Hortonworks
This webinar focuses on introducing Scalding for developers and writing applications for Hadoop and YARN using Scalding. Guest speaker Jonathan Coveney from Twitter provides an overview, use cases, limitations, and core concepts.
Discover HDP 2.2: Data storage innovations in Hadoop Distributed Filesystem (... Hortonworks
Hortonworks Data Platform 2.2 includes HDFS for data storage. In this 30-minute webinar, we discussed data storage innovations, including heterogeneous storage, encryption, and operational security enhancements.
Discover HDP 2.2: Apache Falcon for Hadoop Data Governance Hortonworks
Hortonworks Data Platform 2.2 includes Apache Falcon for Hadoop data governance. In this 30-minute webinar, we discussed why the enterprise needs Falcon for governance, and demonstrated data pipeline construction, policies for data retention and management with Ambari. We also discussed new innovations including: integration of user authentication, data lineage, an improved interface for pipeline management, and the new Falcon capability to establish an automated policy for cloud backup to Microsoft Azure or Amazon S3.
Azure Cafe Marketplace with Hortonworks March 31 2016 Joan Novino
Azure Big Data: “Got Data? Go Modern and Monetize”.
In this session you will learn how Hortonworks Data Platform (HDP), architected, developed, and built completely in the open, provides an enterprise-ready data platform for adopting a Modern Data Architecture.
Apache Ambari is a single framework for IT administrators to provision, manage and monitor a Hadoop cluster. Apache Ambari 1.7.0 is included with Hortonworks Data Platform 2.2.
In this 30-minute webinar, Hortonworks Product Manager Jeff Sposetti and Apache Ambari committer Mahadev Konar discussed new capabilities including:
Improvements to Ambari core - such as support for ResourceManager HA
Extensions to Ambari platform - introducing Ambari Administration and Ambari Views
Enhancements to Ambari Stacks - dynamic configuration recommendations and validations via a "Stack Advisor"
Teradata - Presentation at Hortonworks Booth - Strata 2014 Hortonworks
Hortonworks and Teradata have partnered to provide a clear path to Big Analytics via stable and reliable Hadoop for the enterprise. The Teradata® Portfolio for Hadoop is a flexible offering of products and services for customers to integrate Hadoop into their data architecture while taking advantage of the world-class service and support Teradata provides.
Boost Performance with Scala – Learn From Those Who’ve Done It! Cécile Poyet
Scalding is a Scala DSL for Cascading. Run on Hadoop, it’s a concise, functional, and very efficient way to build big data applications. One significant benefit of Scalding is that it allows easy porting of Scalding apps from MapReduce to newer, faster execution fabrics.
In this webinar, Cyrille Chépélov, of Transparency Rights Management, will share how his organization boosted the performance of their Scalding apps by over 50% by moving away from MapReduce to Cascading 3.0 on Apache Tez. Dhruv Kumar, Hortonworks Partner Solution Engineer, will then explain how you can interact with data on HDP using Scala and leverage Scala as a programming language to develop Big Data applications.
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next Hortonworks
Earlier this year, the Apache open source community delivered the Stinger Initiative to improve speed, scale and SQL semantics in Apache Hive. Now Stinger.next is underway, to build on those initial successes.
In this presentation, from a webinar hosted by Hortonworks co-founder Alan Gates and Hortonworks Hive product manager Raj Baines, you can learn more about Stinger.next and innovation in Apache Hive.
Alan and Raj cover new Hive functionality for more speed, scale and SQL in HDP 2.2. Specific topics include transactions with ACID semantics, the cost based optimizer and dynamic query optimizations.
The presentation also shows future plans for the Stinger.next initiative.
In this webinar, we'll:
-Examine the key drivers and use cases for High Availability, performance and scalability for Apache Hadoop.
-Walk through an overview of reference architecture for a Non-Stop Hadoop implementation.
-Show how you can get started with Non-Stop Hadoop with the Hortonworks Data Platform.
These slides from the Discover HDP 2.2 Webinar Series: Data Storage Innovations in HDFS explore heterogeneous storage, data encryption and operational security.
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ... Hortonworks
Companies in every industry look for ways to explore new data types and large data sets that were previously too big to capture, store and process. They need to unlock insights from data such as clickstream, geo-location, sensor, server log, social, text and video data. However, becoming a data-first enterprise comes with many challenges.
Join this webinar organized by three leaders in their respective fields and learn from our experts how you can accelerate the implementation of a scalable, cost-efficient and robust Big Data solution. Cisco, Hortonworks and Red Hat will explore how new data sets can enrich existing analytic applications with new perspectives and insights and how they can help you drive the creation of innovative new apps that provide new value to your business.
Mr. Slim Baltagi is a Systems Architect at Hortonworks, with over 4 years of Hadoop experience working on 9 Big Data projects: Advanced Customer Analytics, Supply Chain Analytics, Medical Coverage Discovery, Payment Plan Recommender, Research Driven Call List for Sales, Prime Reporting Platform, Customer Hub, Telematics, Historical Data Platform; with Fortune 100 clients and global companies from Financial Services, Insurance, Healthcare and Retail.
Mr. Slim Baltagi has worked in various architecture, design, development and consulting roles at Accenture, CME Group, TransUnion, Syntel, Allstate, TransAmerica, Credit Suisse, Chicago Board Options Exchange, Federal Reserve Bank of Chicago, CNA, Sears, USG, ACNielsen and Deutsche Bahn.
Mr. Baltagi has also over 14 years of IT experience with an emphasis on full life cycle development of Enterprise Web applications using Java and Open-Source software. He holds a master’s degree in mathematics and is an ABD in computer science from Université Laval, Québec, Canada.
Languages: Java, Python, JRuby, JEE , PHP, SQL, HTML, XML, XSLT, XQuery, JavaScript, UML, JSON
Databases: Oracle, MS SQL Server, MySQL, PostgreSQL
Software: Eclipse, IBM RAD, JUnit, JMeter, YourKit, PVCS, CVS, UltraEdit, Toad, ClearCase, Maven, iText, Visio, JasperReports, Alfresco, YSlow, Terracotta, SoapUI, Dozer, Sonar, Git
Frameworks: Spring, Struts, AppFuse, SiteMesh, Tiles, Hibernate, Axis, Selenium RC, DWR Ajax , Xstream
Distributed Computing/Big Data: Hadoop, MapReduce, HDFS, Hive, Pig, Sqoop, HBase, R, RHadoop, Cloudera CDH4, MapR M7, Hortonworks HDP 2.1
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the CloudDataWorks Summit
The world’s largest enterprises run their infrastructure on Oracle, DB2 and SQL and their critical business operations on SAP applications. Organisations need this data to be available in real-time to conduct necessary analytics. However, delivering this heterogeneous data at the speed it’s required can be a huge challenge because of the complex underlying data models and structures and legacy manual processes which are prone to errors and delays.
Unlock these silos of data and enable the new advanced analytics platforms by attending this session.
Find out how:
• To overcome common challenges faced by enterprises trying to access their SAP data
• You can integrate SAP data in real time with change data capture (CDC) technology
• Organisations are using Attunity Replicate for SAP to stream SAP data into Kafka
Speakers:
John Hol, Regional Director, Attunity
Mike Hollobon, Director Business Development, IBT
Hitachi Data Systems Hadoop Solution
Customers are seeing exponential growth of unstructured data, from their social media websites to operational sources. Their enterprise data warehouses are not designed to handle such high volumes and varieties of data. Hadoop, the latest software platform that scales to process massive volumes of unstructured and semi-structured data by distributing the workload through clusters of servers, is giving customers a new option to tackle data growth and deploy big data analysis to help better understand their business. Hitachi Data Systems is launching its latest Hadoop reference architecture, which is pre-tested with the Cloudera Hadoop distribution to provide a faster time to market for customers deploying Hadoop applications. HDS, Cloudera and Hitachi Consulting will present together and explain how to get you there.
Attend this WebTech and learn how to:
• Solve big-data problems with Hadoop
• Deploy Hadoop in your data warehouse environment to better manage your unstructured and structured data
• Implement Hadoop using the HDS Hadoop reference architecture
For more information on the Hitachi Data Systems Hadoop Solution, please read our blog: http://blogs.hds.com/hdsblog/2012/07/a-series-on-hadoop-architecture.html
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...Amazon Web Services
Amazon Elastic MapReduce (Amazon EMR) makes it easy to provision and manage Hadoop in the AWS Cloud. Hadoop is available in multiple distributions and Amazon EMR gives you the option of using the Amazon Distribution or the MapR Distribution for Hadoop.
This webinar will show you examples of how to use Amazon EMR with the MapR Distribution for Hadoop. You will learn how you can free yourself from the heavy lifting required to run Hadoop on-premises, and gain the advantages of using the cloud to increase flexibility and accelerate projects while lowering costs.
In this webinar you will:
• See a live demonstration of how you can quickly and easily launch your first Hadoop cluster in a few steps.
• See examples of real-world applications and customer successes in production.
• Learn best practices for maximizing the benefits of using MapR with AWS.
Hortonworks.bdb
1. Hortonworks: We Do Hadoop.
Our mission is to enable your Modern Data Architecture
by Delivering Enterprise Apache Hadoop
March 2014
2. Our Mission:
Our Commitment
Open Leadership
Drive innovation in the open exclusively via the
Apache community-driven open source process
Enterprise Rigor
Engineer, test and certify Apache Hadoop with
the enterprise in mind
Ecosystem Endorsement
Focus on deep integration with existing data
center technologies and skills
Headquarters: Palo Alto, CA
Employees: 300+ and growing
Trusted Partners
Enable your Modern Data Architecture by
Delivering Enterprise Apache Hadoop
4. Requirements for Enterprise Hadoop
(1) Key Services: Platform, operational and data services essential for the enterprise
(2) Skills: Leverage your existing skills: development, analytics, operations
(3) Integration: Interoperable with existing data center investments
[Diagram: platform stack grouped into CORE SERVICES, DATA SERVICES and OPERATIONAL SERVICES. Components shown: HDFS, YARN, MapReduce, Tez, Hive & HCatalog, Pig, HBase, Sqoop, Flume, NFS, WebHDFS, Knox*, Oozie, Ambari, Falcon*. Functions shown: Storage, Resource Management, Process, Data Access, Data Movement, Data Security, Schedule, Cluster Mgmt, Dataset Mgmt.]
Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
5. HDP: A Complete Hadoop Distribution
(1) Key Services: Platform, operational and data services essential for the enterprise
(2) Skills: Leverage your existing skills: development, analytics, operations
(3) Integration: Interoperable with existing data center investments
[Diagram: HORTONWORKS DATA PLATFORM (HDP) stack grouped into CORE SERVICES, DATA SERVICES and OPERATIONAL SERVICES. Components shown: HDFS, YARN, MapReduce, Tez, Hive & HCatalog, Pig, HBase, Sqoop, Flume, NFS, WebHDFS, Knox*, Oozie, Ambari, Falcon*. Other labels shown: LOAD & EXTRACT; Storage, Resource Management, Process, Data Access, Data Movement, Schedule, Cluster Mgmt, Dataset Mgmt; OS/VM, Cloud, Appliance.]
Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
6. Store all data in a single place, interact in multiple ways
Hadoop 2: The Introduction of YARN
1st Gen of Hadoop: Single Use System (Batch Apps)
• HDFS (redundant, reliable storage)
• MapReduce (cluster resource management & data processing)
HADOOP 2: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)
• Redundant, Reliable Storage (HDFS)
• Efficient Cluster Resource Management & Shared Services (YARN)
• Standard Query Processing: Hive, Pig
• Batch: MapReduce
• Interactive: Tez
• Online Data Processing: HBase, Accumulo
• Real Time Stream Processing: Storm
• others …
7. Apache Hadoop YARN
The data operating system for Hadoop 2.0
Flexible: Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming
Efficient: Double processing IN Hadoop on the same hardware while providing predictable performance & quality of service
Shared: Provides a stable, reliable, secure foundation and shared operational services across multiple workloads
Data Processing Engines Run Natively IN Hadoop
• BATCH: MapReduce
• INTERACTIVE: Tez
• STREAMING: Storm
• IN-MEMORY: Spark
• GRAPH: Giraph
• SAS: LASR, HPA
• ONLINE: HBase, Accumulo
• OTHERS
HDFS: Redundant, Reliable Storage
YARN: Cluster Resource Management
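The idea of several processing engines sharing one pool of cluster resources can be illustrated with a toy model. This is only a sketch: `ToyResourceManager`, its methods and the application names are hypothetical and are not the YARN API. It shows nothing more than a single arbiter granting containers from a shared pool to different workloads.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyResourceManager:
    """Toy stand-in for a cluster resource manager (hypothetical, not the YARN API)."""
    total_containers: int
    allocations: Dict[str, int] = field(default_factory=dict)

    @property
    def free(self) -> int:
        # Containers not currently granted to any application.
        return self.total_containers - sum(self.allocations.values())

    def request(self, app: str, wanted: int) -> int:
        """Grant up to `wanted` containers from the shared pool."""
        granted = min(wanted, self.free)
        self.allocations[app] = self.allocations.get(app, 0) + granted
        return granted

    def release(self, app: str) -> None:
        """Return all of an application's containers to the pool."""
        self.allocations.pop(app, None)


rm = ToyResourceManager(total_containers=10)
batch = rm.request("mapreduce-batch", 6)        # batch workload
interactive = rm.request("tez-interactive", 3)  # interactive workload
streaming = rm.request("storm-streaming", 4)    # only one container is left
rm.release("mapreduce-batch")                   # batch job finishes
streaming += rm.request("storm-streaming", 3)   # streaming picks up freed capacity
```

In this model the streaming workload is not given its own cluster; it waits for capacity freed by the batch job, which is the "one cluster, many workloads" point the slide makes.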
8. Driving Our Innovation Through Apache
147,933 lines
614,041 lines
End Users
449,768 lines
Total Net Lines Contributed
to Apache Hadoop
Yahoo: 10
Cloudera: 7
IBM: 3
10 Others
21
Facebook: 5
LinkedIn: 3
Total Number of Committers
to Apache Hadoop
63
total
Hortonworks mission is
to power your modern data architecture by enabling
Hadoop to be an enterprise data platform that
deeply integrates with your data center technologies
Apache Project    Committers    PMC Members
Hadoop            21            13
Tez               10            4
Hive              11            3
HBase             8             3
Pig               6             5
Sqoop             1             0
Ambari            20            12
Knox              6             2
Falcon            2             2
Oozie             2             2
Zookeeper         2             1
Flume             1             0
Accumulo          2             2
Storm             1             0
Drill             1             0
TOTAL             95            48
9. Patterns for Hadoop Applications
(1) Key Services: Platform, operational and data services essential for the enterprise
(2) Skills: Leverage your existing skills: development, analytics, operations
(3) Integration: Interoperable with existing data center investments
DEVELOP: COLLECT, PROCESS, BUILD
ANALYZE: EXPLORE, QUERY, DELIVER
OPERATE: PROVISION, MANAGE, MONITOR
10. Familiar and Existing Tools
(1) Key Services: Platform, operational and data services essential for the enterprise
(2) Skills: Leverage your existing skills: development, analytics, operations
(3) Integration: Interoperable with existing data center investments
DEVELOP: COLLECT, PROCESS, BUILD
ANALYZE: EXPLORE, QUERY, DELIVER
OPERATE: PROVISION, MANAGE, MONITOR
BusinessObjects BI
11. SQL Interactive Query & Apache Hive
(1) Key Services: Platform, operational and data services essential for the enterprise
(2) Skills: Leverage your existing skills: development, analytics, operations
(3) Integration: Interoperable with existing data center investments
Stinger Initiative
Broad, community based effort to deliver the
next generation of Apache Hive
Scale
The only SQL interface
to Hadoop designed for
queries that scale from
TB to PB
SQL
Support broadest range
of SQL semantics for
analytic applications
against Hadoop
Speed
Improve Hive query
performance by 100X to
allow for interactive
query times (seconds)
SQL
Apache Hive
• The de facto standard for Hadoop SQL access
• Used by your current data center partners
• Built for batch AND interactive query
12. Requirements for Enterprise Hadoop
(3) Integration: Interoperable with existing data center investments
Integrate with:
• Applications: Business Intelligence, Developer IDEs, Data Integration
• Systems: Data Systems & Storage, Systems Management
• Platforms: Operating Systems, Virtualization, Cloud, Appliances
[Diagram layers]
• APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
• DEV & DATA TOOLS: BUILD & TEST
• OPERATIONAL TOOLS: MANAGE & MONITOR
• DATA SYSTEM: REPOSITORIES (RDBMS, EDW, MPP)
• SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
13. Broad Ecosystem Integration
[Diagram layers: APPLICATIONS, DEV & DATA TOOLS, OPERATIONAL TOOLS, DATA SYSTEM, SOURCES, INFRASTRUCTURE. Labels shown: RDBMS, EDW, MPP, HANA, BusinessObjects BI.]
Existing Sources (CRM, ERP, Clickstream, Logs)
Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
14. Apache Hive and Stinger:
SQL in Hadoop
Arun Murthy (@acmurthy)
Alan Gates (@alanfgates)
Owen O’Malley (@owen_omalley)
@hortonworks
15. Stinger Project
(announced February 2013)
Batch AND Interactive SQL-IN-Hadoop
Stinger Initiative
A broad, community-based effort to drive the next generation of HIVE
Goals:
• Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds)
• Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
• SQL: Support broadest range of SQL semantics for analytic applications running against Hadoop
…all IN Hadoop
Hive 0.11, May 2013:
• Base Optimizations
• SQL Analytic Functions
• ORCFile, Modern File Format
Hive 0.12, October 2013:
• VARCHAR, DATE Types
• ORCFile predicate pushdown
• Advanced Optimizations
• Performance Boosts via YARN
Coming Soon:
• Hive on Apache Tez
• Query Service
• Buffer Cache
• Cost Based Optimizer (Optiq)
• Vectorized Processing
16. Hive 0.12
Release Theme: Speed, Scale and SQL
Specific Features:
• 10x faster query launch when using large number (500+) of partitions
• ORCFile predicate pushdown speeds queries
• Evaluate LIMIT on the map side
• Parallel ORDER BY
• New query optimizer
• Introduces VARCHAR and DATE datatypes
• GROUP BY on structs or unions
Included Components: Apache Hive 0.12
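Predicate pushdown can speed up a scan because a columnar file keeps minimum and maximum statistics per block of rows, so blocks that cannot contain a match are never read. The sketch below illustrates that idea only; the `Stripe` class and `scan_greater_than` function are hypothetical and do not reproduce the ORC reader or its API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stripe:
    """Hypothetical stand-in for one block of a columnar file with min/max statistics."""
    min_value: int
    max_value: int
    values: List[int]


def make_stripe(values: List[int]) -> Stripe:
    # Statistics are computed once, when the block is written.
    return Stripe(min(values), max(values), values)


def scan_greater_than(stripes: List[Stripe], threshold: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of stripes actually read."""
    matches: List[int] = []
    stripes_read = 0
    for stripe in stripes:
        # Skip the stripe when its statistics prove that no row can match.
        if stripe.max_value <= threshold:
            continue
        stripes_read += 1
        matches.extend(v for v in stripe.values if v > threshold)
    return matches, stripes_read


stripes = [make_stripe([1, 5, 9]), make_stripe([10, 20, 30]), make_stripe([40, 50, 60])]
matches, stripes_read = scan_greater_than(stripes, 25)
```

With a threshold of 25, the first stripe is skipped from its statistics alone and only two of the three stripes are read.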
17. SPEED: Increasing Hive Performance
Performance Improvements included in Hive 0.12
– Base & advanced query optimization
– Startup time improvement
– Join optimizations
Interactive Query Times across ALL use cases
• Simple and advanced queries in seconds
• Integrates seamlessly with existing tools
• Currently a >100x improvement in just nine months
18. Stinger Phase 3: Unlocking Interactive Query
Stinger Phase 3: Features and Benefits
• Container Pre-Launch: Overcomes Java VM startup latency by pre-launching hot containers ready to serve queries
• Container Re-Use: Finished Maps and Reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split size tuning
• Tez Integration: Tez Broadcast Edge and Intermediate Reduce pattern improve query scale and throughput
• In-Memory Cache: Hot data kept in RAM for fast access
19. Stinger Phase 3: Speed, Scale, and SQL
Release Theme: Prove Hive for both large-scale and interactive SQL / analytics
Specific Features:
• < 10s SQL queries over 200GB datasets through Hive
• Tez container pre-launch
• Tez container re-use
• Use of Tez Intermediate Reduce pattern
• In-memory HDFS caching
Made available as part of the Tech Preview for Stinger Phase 3
20. Stinger Phase 3: Beyond Tech Preview
Release Theme: Speed, SQL,…and Security
Specific Features:
• Hive-on-Tez: Interactive query on Hive
• SQL Improvements:
• Sub-query for WHERE
• Standard JOIN semantics
• Support for Common Table Expressions (CTE)
• Phase 1 of ACID Semantics support
• Automatic JOIN order optimization
• CHAR datatype
• PAM authentication support
• SSL encryption
21. SQL: Enhancing SQL Semantics
Hive SQL Datatypes: INT; TINYINT/SMALLINT/BIGINT; BOOLEAN; FLOAT; DOUBLE; STRING; TIMESTAMP; BINARY; DECIMAL; ARRAY, MAP, STRUCT, UNION; DATE; VARCHAR; CHAR
Hive SQL Semantics:
• SELECT, INSERT
• GROUP BY, ORDER BY, SORT BY
• JOIN on explicit join key
• Inner, outer, cross and semi joins
• Sub-queries in FROM clause
• ROLLUP and CUBE
• UNION
• Windowing Functions (OVER, RANK, etc.)
• Custom Java UDFs
• Standard Aggregation (SUM, AVG, etc.)
• Advanced UDFs (ngram, Xpath, URL)
• Sub-queries in WHERE, HAVING
• Expanded JOIN Syntax
• SQL Compliant Security (GRANT, etc.)
• INSERT/UPDATE/DELETE (ACID)
Legend: Hive 0.12, Available, Roadmap
SQL Compliance
Hive 0.12 provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop
22.
23. Vectorized Query Execution
•Designed for Modern Processor Architectures
–Avoid branching in the inner loop.
–Make the most use of L1 and L2 cache.
•How It Works
–Process records in batches of 1,000 rows
–Generate code from templates to minimize branching.
•What It Gives
–30x improvement in rows processed per second.
–Initial prototype: 100M rows/sec on laptop
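The difference between row-at-a-time and batch processing can be sketched as follows. This is an illustration of the idea only, not Hive's implementation: the function names and the `price` column are hypothetical, and plain Python will not show the hardware-level gains the slide describes. The batch size of 1,000 is the figure quoted above.

```python
from typing import Dict, Iterator, List

BATCH_SIZE = 1000  # batch size quoted on the slide


def batches(column: List[int], size: int = BATCH_SIZE) -> Iterator[List[int]]:
    """Yield consecutive slices of a column, `size` values at a time."""
    for start in range(0, len(column), size):
        yield column[start:start + size]


def filter_sum_row_at_a_time(rows: List[Dict[str, int]], threshold: int) -> int:
    """Classic model: one row object per step, with a branch per row."""
    total = 0
    for row in rows:
        if row["price"] > threshold:
            total += row["price"]
    return total


def filter_sum_batched(price_column: List[int], threshold: int) -> int:
    """Batch model: a tight loop over one column, with the filter applied as a 0/1 mask."""
    total = 0
    for batch in batches(price_column):
        # (p > threshold) is False/True, i.e. 0/1, so no if-statement is needed here.
        total += sum(p * (p > threshold) for p in batch)
    return total


prices = list(range(2500))
rows = [{"price": p} for p in prices]
row_result = filter_sum_row_at_a_time(rows, 2000)
batch_result = filter_sum_batched(prices, 2000)
batch_count = len(list(batches(prices)))
```

Both functions compute the same sum; the batched version works on a bare column in fixed-size chunks, which is the layout that lets a real engine keep data in cache and avoid per-row branching.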
24.
25.
26. Hive-on-MR vs. Hive-on-Tez
SELECT a.x, AVG(b.y) AS avg
FROM a JOIN b ON (a.id = b.id) GROUP BY a.x
UNION SELECT x, AVG(y) AS avg
FROM c GROUP BY x
ORDER BY avg;
[Diagram: two execution plans, labeled "Hive – MR" and "Hive – Tez", each drawn as map (M) and reduce (R) tasks. Operator labels shown: SELECT a.state, c.itemId; SELECT b.id; SELECT c.price; JOIN (a, c); JOIN (a, b); GROUP BY a.state; COUNT(*); AVERAGE(c.price). The MR plan shows HDFS between its stages.]
Tez avoids unneeded writes to HDFS
27. Tez Delivers Interactive Query - Out of the Box!
Feature | Description | Benefit
Tez Session | Overcomes Map-Reduce job-launch latency by pre-launching Tez AppMaster | Latency
Tez Container Pre-Launch | Overcomes Map-Reduce latency by pre-launching hot containers ready to serve queries. | Latency
Tez Container Re-Use | Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out of box performance! | Latency
Runtime re-configuration of DAG | Runtime query tuning by picking aggregation parallelism using online query statistics | Throughput
Tez In-Memory Cache | Hot data kept in RAM for fast access. | Latency
Complex DAGs | Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput. | Throughput
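The latency benefit of container pre-launch and re-use can be estimated with a small scheduling model. The sketch below is a simplification under stated assumptions: `makespan`, the slot count, the startup cost and the task times are all hypothetical, and it does not model Tez itself. It only compares paying a startup cost once per container against paying it once per task.

```python
import heapq
from typing import List


def makespan(task_times: List[float], slots: int, startup_cost: float, reuse: bool) -> float:
    """Time to finish all tasks on `slots` parallel containers.

    With reuse, each container starts once and then keeps picking up tasks.
    Without reuse, every task launches a fresh container and pays the startup cost.
    """
    # Time at which each slot becomes free; reused containers pay startup up front.
    free_at = [startup_cost if reuse else 0.0] * slots
    heapq.heapify(free_at)
    for task in task_times:
        start = heapq.heappop(free_at)  # earliest available slot
        cost = task if reuse else startup_cost + task
        heapq.heappush(free_at, start + cost)
    return max(free_at)


tasks = [1.0] * 8  # eight short tasks of one time unit each
with_reuse = makespan(tasks, slots=2, startup_cost=2.0, reuse=True)
without_reuse = makespan(tasks, slots=2, startup_cost=2.0, reuse=False)
```

In this model the gap grows with the number of short tasks, which is why re-use matters most for interactive queries made of many small units of work.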
28.
29.
30.
31.
32.
33.
34. How Stinger Phase 3 Delivers Interactive Query
Feature | Description | Benefit
Tez Integration | Tez is a significantly better engine than MapReduce | Latency
Vectorized Query | Take advantage of modern hardware by processing thousand-row blocks rather than row-at-a-time. | Throughput
Query Planner | Using extensive statistics now available in Metastore to better plan and optimize query, including predicate pushdown during compilation to eliminate portions of input (beyond partition pruning) | Latency
Cost Based Optimizer (Optiq) | Join re-ordering and other optimizations based on column statistics including histograms etc. | Latency
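Join re-ordering based on statistics can be illustrated with a deliberately small cost model. This is a sketch under simplifying assumptions and is not the Optiq algorithm: the table names, row counts, the single uniform `selectivity` and the exhaustive search over left-deep orders are all hypothetical choices made for the example.

```python
from itertools import permutations
from typing import Dict, Sequence, Tuple


def plan_cost(order: Sequence[str], rows: Dict[str, int], selectivity: float) -> float:
    """Sum of estimated intermediate result sizes for a left-deep join order."""
    size = float(rows[order[0]])
    cost = 0.0
    for table in order[1:]:
        # Estimated size of joining the running result with the next table.
        size = size * rows[table] * selectivity
        cost += size
    return cost


def cheapest_order(rows: Dict[str, int], selectivity: float) -> Tuple[str, ...]:
    """Try every join order and keep the one with the lowest estimated cost."""
    return min(permutations(rows), key=lambda order: plan_cost(order, rows, selectivity))


rows = {"fact": 1_000_000, "dim_a": 100, "dim_b": 10}  # hypothetical row counts
best = cheapest_order(rows, selectivity=0.001)
naive = ("fact", "dim_a", "dim_b")  # order as written in a query
best_cost = plan_cost(best, rows, 0.001)
naive_cost = plan_cost(naive, rows, 0.001)
```

Under these estimates the cheapest plan joins the small tables first and the large table last, keeping intermediate results small. A real optimizer uses richer statistics, such as the column histograms mentioned above, and avoids exhaustive enumeration.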
36. Hortonworks: The Value of “Open” for You
Validate & Try
1. Download the
Hortonworks Sandbox
2. Learn Hadoop using the
technical tutorials
3. Investigate a business case using the step-by-step business case scenarios
4. Validate YOUR business
case using your data in
the sandbox
Connect With the Hadoop Community
We employ a large number of Apache project committers & innovators so
that you are represented in the open source community
Avoid Vendor Lock-In
Hortonworks Data Platform remains as close to the open source trunk as
possible and is developed 100% in the open so you are never locked in
The Partners you Rely On, Rely On Hortonworks
We work with partners to deeply integrate Hadoop with data center
technologies so you can leverage existing skills and investments
Certified for the Enterprise
We engineer, test and certify the Hortonworks Data Platform at scale to
ensure the reliability and stability you require for enterprise use
Support from the Experts
We provide the highest quality of support for deploying at scale. You are
supported by hundreds of years of Hadoop experience
Engage
1. Execute a Business Case
Discovery Workshop with
our architects
2. Build a business case for
Hadoop today
Editor's Notes
Hello. Today I’m going to talk to you about Hortonworks and how we deliver an Enterprise Ready Hadoop to enable your modern data architecture.
Founded just 2.5 years ago by members of the original Hadoop team at Yahoo, Hortonworks emerged as the leader in open source Hadoop. We are committed to ensuring Hadoop is an enterprise-viable data platform ready for your modern data architecture. Our team is probably the largest assembled team of Hadoop experts and active leaders in the community. We not only make sure Hadoop meets all your enterprise requirements, like operations, reliability and security; it also needs to be packaged and tested, and we do this. It has to work with what you have.
• Make Hadoop an enterprise data platform. Make the market function.
• Innovate core platform, data, & operational services
• Integrate deeply with enterprise ecosystem
• Provide world-class enterprise support
• Drive 100% open source software development and releases through the core Apache projects
• Address enterprise needs in community projects
• Establish Apache foundation projects as “the standard”
• Promote open community vs. vendor control / lock-in
• Enable the Hadoop market to function
• Make it easy for enterprises to deploy at scale
• Be the best at enabling deep ecosystem integration
• Create a pull market with key strategic partners
The first wave of Hadoop was about HDFS and MapReduce where MapReduce had a split brain, so to speak. It was a framework for massive distributed data processing, but it also had all of the Job Management capabilities built into it.
The second wave of Hadoop is upon us and a component called YARN has emerged that generalizes Hadoop’s Cluster Resource Management in a way where MapReduce is NOW just one of many frameworks or applications that can run atop YARN. Simply put, YARN is the distributed operating system for data processing applications. For those curious, YARN stands for “Yet Another Resource Negotiator”.
[CLICK] As I like to say, YARN enables applications to run natively IN Hadoop versus ON HDFS or next to Hadoop.
[CLICK] Why is that important? Businesses do NOT want to stovepipe clusters based on batch processing versus interactive SQL versus online data serving versus real-time streaming use cases. They're adopting a big data strategy so they can get ALL of their data in one place and access that data in a wide variety of ways, with predictable performance and quality of service.
[CLICK] This second wave of Hadoop represents a major rearchitecture that has been underway for 3 or 4 years. And this slide shows just a sampling of open source projects that are or will be leveraging YARN in the not so distant future.
For example, engineers at Yahoo have shared open source code that enables Twitter Storm to run on YARN. Apache Giraph is a graph processing system that is YARN enabled. Spark is an in-memory data processing system built at Berkeley that’s been recently contributed to the Apache Software Foundation. OpenMPI is an open source Message Passing Interface system for HPC that works on YARN. These are just a few examples.
With Hive and Stinger we are focused on enabling the SQL ecosystem and to do that we’ve put Hive on a clear roadmap to SQL compliance. That includes adding critical datatypes like character and date types as well as implementing common SQL semantics seen in most databases.
query 52: star join followed by group/order (different keys), selective filter
query 55: same
query 28: 4 subquery join
query 12: star join over range of dates
query 1: SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, X)
SELECT sourceIP, totalRevenue, avgPageRank FROM (SELECT sourceIP, AVG(pageRank) as avgPageRank, SUM(adRevenue) as totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('X') GROUP BY UV.sourceIP) ORDER BY totalRevenue DESC LIMIT 1
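To see what these three benchmark queries compute, they can be run against a tiny in-memory database. This is a sketch only: it uses SQLite through Python's `sqlite3` module rather than Hive, the sample rows and parameter values are made up, `Date(...)` is replaced by ISO date strings, and `ORDER BY` clauses are added to the first two queries so the results are deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER);
    CREATE TABLE uservisits (sourceIP TEXT, destURL TEXT, visitDate TEXT, adRevenue REAL);
    INSERT INTO rankings VALUES ('a.example', 10), ('b.example', 50), ('c.example', 90);
    INSERT INTO uservisits VALUES
        ('10.0.0.1', 'a.example', '1980-02-01', 1.0),
        ('10.0.0.2', 'b.example', '1980-03-01', 2.0),
        ('10.9.0.1', 'c.example', '1980-04-01', 4.0);
""")

# Scan query: X is the pageRank cutoff (40 is an arbitrary choice).
scan = conn.execute(
    "SELECT pageURL, pageRank FROM rankings WHERE pageRank > ? ORDER BY pageURL",
    (40,),
).fetchall()

# Aggregation query: X is the length of the sourceIP prefix (4 is arbitrary).
agg = conn.execute(
    "SELECT SUBSTR(sourceIP, 1, ?), SUM(adRevenue) FROM uservisits "
    "GROUP BY SUBSTR(sourceIP, 1, ?) ORDER BY 1",
    (4, 4),
).fetchall()

# Join query: X is the end of the visit date range.
join = conn.execute(
    "SELECT sourceIP, totalRevenue, avgPageRank FROM ("
    "  SELECT UV.sourceIP AS sourceIP, AVG(R.pageRank) AS avgPageRank,"
    "         SUM(UV.adRevenue) AS totalRevenue"
    "  FROM rankings AS R, uservisits AS UV"
    "  WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN '1980-01-01' AND ?"
    "  GROUP BY UV.sourceIP"
    ") ORDER BY totalRevenue DESC LIMIT 1",
    ("1980-12-31",),
).fetchone()
conn.close()
```

The three statements correspond to a selective scan, a grouped aggregation and a join followed by aggregation and a top-1 sort.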
• Make Hadoop an enterprise data platform
• Innovate core platform, data, & operational services
• Integrate deeply with enterprise ecosystem
• Provide world-class enterprise support
• Drive 100% open source software development and releases through the core Apache projects
• Address enterprise needs in community projects
• Establish Apache foundation projects as “the standard”
• Promote open community vs. vendor control / lock-in
• Enable the Hadoop market to function
• Make it easy for enterprises to deploy at scale
• Be the best at enabling deep ecosystem integration
• Create a pull market with key strategic partners