HDP-1 introduction for HUG France


Presentation on Hortonworks Data Platform for HUG France, June 28, 2012


  • How do you set up a cluster? Three ways. 1. The HMC installer uses Puppet to set up a set of machines, driven by files listing the hostnames of the machines you want in specific roles. It doesn't assume you have Java; it installs the tested Java versions (64-bit for masters, 32-bit for workers), brings up the entire cluster, runs smoke tests, and leaves you with a web management console driven by Ganglia and Nagios. This is the easy way to set up an entire cluster. 2. There is the option of just installing the RPMs using Yum, directly from the Hortonworks repository, using "yum upgrade" to upgrade; or even go to Kickstart and create your own OS images on demand. One thing to consider is that the platforms tested on look "dated": why not RHEL 6.3 + Java 7? The experience of stability problems on the Yahoo! clusters argues for sticking to a JVM version that is trusted to be stable, on a mature OS.
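As a sketch of the second route, installing the RPMs directly with Yum: the repository URL, repo-file name, and package names below are illustrative assumptions rather than copied from the HDP documentation, so check them against the repository you actually configure.

```shell
# Point yum at a Hortonworks RPM repository.
# NOTE: the baseurl and file name here are illustrative, not authoritative.
sudo tee /etc/yum.repos.d/hdp.repo <<'EOF'
[HDP-1]
name=Hortonworks Data Platform 1.x
baseurl=http://public-repo-1.hortonworks.com/HDP-1.x/repos/centos5
enabled=1
gpgcheck=0
EOF

# Install the pieces this node needs, e.g. core Hadoop plus Pig
# (package names assumed; configuration is left to you on this route):
sudo yum install -y hadoop pig

# Later, pick up a newer HDP release from the same repository:
sudo yum upgrade -y hadoop pig
```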
  • HCatalog is a table abstraction and a storage abstraction system that makes it easy for multiple tools to interact with the same underlying data. It is similar to a schema in the RDBMS world, except that it is more than just the SQL layer. A common buzzword in the NoSQL world today is polyglot persistence. Basically, what that comes down to is that you pick the right tool for the job. In the Hadoop ecosystem, you have many tools that might be used for data processing: you might use Pig or Hive, or your own custom MapReduce program, or that shiny new GUI-based tool that's just come out. Which one to use might depend on the user, on the type of query you're interested in, or on the type of job you want to run. From another perspective, you might want to store your data in columnar storage for efficient storage and retrieval for particular query types, or in text so that users can write data producers in scripting languages like Perl or Python, or you may want to hook up that HBase table as a data source. As an end-user, I want to use whatever data processing tool is available to me. As a data designer, I want to optimize how data is stored. As a cluster manager/data architect, I want the ability to share pieces of information across the board, and move data back and forth fluidly. HCatalog's hopes and promises are the realization of all of the above.
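A sketch of what that sharing looks like in practice: the table name and columns below are made up, and `org.apache.hcatalog.pig.HCatLoader` is assumed here to be the HCatalog 0.4-era Pig load function. Both tools read the same HCatalog-managed table without either side hard-coding paths or storage formats.

```
-- Hive: the table is already described in the HCatalog/Hive metastore
SELECT gate, amount FROM rawevents WHERE ds = '2007-01-14';

-- Pig: load the same table through HCatalog's loader
events = LOAD 'rawevents' USING org.apache.hcatalog.pig.HCatLoader();
jan14  = FILTER events BY ds == '2007-01-14';
```

If the table's storage later moves from text to RCFile, neither script changes: the format lives in the metadata, not in the consumers.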
  • To pick out one new feature in Hadoop 1.0.3: WebHDFS is something interesting. Set one config option and the DataNodes and NameNodes become web servers (using the chosen auth mechanism), offering read and write access to the data. This is integral to the cluster: you ask the NameNode for data, which triggers a 307 redirect to a DataNode holding the data, which serves it up. The redirect is handled transparently by any HTTP client set up to follow redirects.
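The URL scheme itself is simple enough to sketch. Here is a tiny helper (hypothetical; the host name `nnode` and the file path follow the example on the WebHDFS slide) that builds the `/webhdfs/v1` read URL, which you would then fetch with any redirect-following HTTP client:

```python
from urllib.parse import urlencode

def webhdfs_url(host, path, op, port=50070, **params):
    """Build a WebHDFS v1 REST URL for a file operation.

    The NameNode answers on `port`; for a read (op=OPEN) it replies
    with a 307 redirect to a DataNode that holds the blocks, so any
    HTTP client that follows redirects can stream the file.
    """
    query = urlencode(dict(params, op=op))
    return "http://%s:%d/webhdfs/v1%s?%s" % (host, port, path, query)

# The same request the slide shows for reading a result file:
print(webhdfs_url("nnode", "/results/part-r-00000.csv", "OPEN"))
# → http://nnode:50070/webhdfs/v1/results/part-r-00000.csv?op=OPEN
```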
  • Up until now, a change in the internal Hadoop versions caused protocol version mismatch problems with all remote clients. Those clients also needed the entire Hadoop JAR set on their classpath, and were Java-only. Now: stable REST APIs, usable from any language, with thin clients instead of the full JAR set.
  • This is something still coming together: HA clustering using VMware vSphere as the HA clustering system underneath the classic failure points, the NameNode of HDFS and the JobTracker of MapReduce. Monitoring agents report failures to vSphere, triggering failover on process crash or hang, VM crash or hang, and physical hardware failure. This lets you host a set of independent VMs, one per master server, each with an isolated lifecycle and management. Very good for ops tasks: snapshotting, updating software in an offline VM, etc. It does not require that the workers are virtual; they can be physical, virtual, or a mix of both.

    1. HDP-1. Steve Loughran, Hortonworks (stevel at hortonworks.com, @steveloughran). Paris, June 2012. © Hortonworks Inc. 2012
    2. Hortonworks Data Platform (architecture diagram):
       – Develop / Interact: Non-Relational Database (HBase), Scripting (Pig), Query (Hive)
       – Data Extraction & Loading: Talend Open Studio for Big Data, Sqoop; HCatalog APIs, WebHDFS
       – Workflow & Scheduling: Oozie
       – Management & Monitoring: Ambari, Zookeeper
       – Metadata Management: HCatalog
       – Operate / Integrate: Distributed Processing (MapReduce) over Distributed Storage (HDFS)
    3. Hortonworks Data Platform (HDP): Fully Integrated, Extensively Tested, Enterprise Supported
       Challenge:
       • Integrate, manage, and support changes across a wide range of open source projects that power the Hadoop platform, each with its own release schedule, versions, & dependencies
       • Time-intensive, complex, expensive
       Solution: Hortonworks Data Platform
       • Integrated certified platform distributions
       • Extensive Q/A process: many apps across small, medium, & large clusters
       • Industry-leading support with clear service levels for updates and patches
       Components: Hadoop Core, Pig, HCatalog, Hive, Ambari, Zookeeper
    4. HDP 1.0 Components:
       Apache Hadoop (HDFS & MapReduce)     1.0.3+
       Apache HCatalog                      0.4.0+
       Apache Pig                           0.9.2
       Apache Hive                          0.9.0+
       Apache HBase                         0.92.1+
       Talend Open Studio for Big Data      5.1.0
       Apache Sqoop                         1.4.1+
       Apache Oozie                         3.1.3+
       Apache Zookeeper                     3.3.4
       Apache Ambari (Technology Preview)   0.1
    5. Management & Monitoring: Ambari
       • 100% Open Source
       • Wizard-based install, provisioning & configuration management
       • Monitoring and alerting dashboards
       • Goals: ease of installation, scale to large clusters, effective monitoring of all services
    6. Cluster Provisioning through Web UI. Download and try from http://hortonworks.com
    7. Monitoring and alerting dashboards
    8. Installation and Provisioning
       HMC Installer: GUI, Puppet-driven
       – Installs Java and up
       – Configures entire cluster
       – Sets up HMC for cluster monitoring
       – Web UI + text files listing nodes
       gsInstall: command-line installer, file-driven
       RPM/YUM for custom installation processes
       – Configuration left as an exercise
       – Use if you have other cluster management tooling
       Qualified at scale on RHEL 5.8 & Java 6u26
    9. Enterprise Data Integration -> Talend
       • Talend Open Studio for Big Data
       – Feature-rich Job Designer
       – Rich palette of pre-built templates
       – Supports HDFS, Pig, Hive, HBase, HCatalog
       – Apache-licensed, bundled with HDP
       • Key benefits
       – Graphical development
       – Robust and scalable execution
       – Broadest connectivity to support all systems: 450+ components
       – Real-time debugging
    10. Metadata Management -> HCatalog
       • Simplifies data sharing between Hadoop and other data systems
       – Enables Hadoop data to be described in a schema & accessed as tables
       • Provides consistent data access for MapReduce, Hive and Pig
       – Minimizes hard coding of data structure, storage format, and location
       • Manages metadata for table storage
       – Based on Hive's metadata server
       – Uses Hive language for metadata manipulation operations
       • Tables may be stored in RCFile, text files, or SequenceFiles
    11. RESTful API Front-door for Hadoop
       • Opens the door to languages other than Java
       • Thin clients via web services vs. fat clients in a gateway
       • Insulation from interface changes release to release
       (Diagram: the HCatalog web interfaces and WebHDFS sit in front of MapReduce, Pig, Hive, HCatalog, HDFS, HBase, and an external store)
    12. WebHDFS: HDFS over HTTP
       ~:$ GET http://nnode:50070/webhdfs/v1/results/part-r-00000.csv?op=open
       GATE4,eb8bd736445f415e18886ba037f84829,55000,2007-01-14,14:01:54,
       GATE4,ec58edcce1049fa665446dc1fa690638,8030803000,2007-01-14,13:52:31,
       GATE4,b6f07ce00f09035a6683c5e93e3c04b8,30000,2007-01-28,12:41:11,
       GATE4,a1bc345b756090854e9dd0011087c6c0,30000,2007-01-28,12:59:33,
       ...
       Potential uses: out-of-cluster access to HDFS; cross-cluster, cross-version HDFS access; native filesystem clients.
       dfs.webhdfs.enabled=true
    13. The WebHDFS & service APIs isolate Hadoop internals from stable public interfaces. Long-haul, cross-language, stable, secure.
    14. My project: HA on vSphere
    15. Release Schedule
       HDP 1.x: quarterly releases
       – Large-scale QA process
       – Validate performance as well as functionality
       Technology Preview Program
       – Early access; help with testing
       – Access to new features such as HA and Windows integration
       A predictable timetable of stable releases
    16. Ready and free to use today: http://hortonworks.com/download/
    17. Thank you! Any questions?