How do you set up a cluster? Three ways.
1. The HMC installer uses Puppet to set up a set of machines, driven by files listing the hostnames of the machines you want in specific roles. It doesn't assume you have Java installed; it installs the tested Java versions (64-bit for masters, 32-bit for workers), brings up the entire cluster, smoke-tests it, and leaves you with a web management console backed by Ganglia and Nagios. This is the easy way to set up an entire cluster.
2. Install the RPMs directly from the HWX repository using yum, with "yum upgrade" for upgrades.
3. Go further with Kickstart and create your own OS images on demand.
One thing to consider is that the platforms tested on look "dated": why not RHEL 6.3 + Java 7? The answer is experience with stability problems on the Y! clusters: stick to a JVM version that is trusted to be stable, on a mature OS.
HCatalog is a table and storage abstraction system that makes it easy for multiple tools to interact with the same underlying data. It is similar to a schema in the RDBMS world, except that it is more than just the SQL layer. A common buzzword in the NoSQL world today is polyglot persistence; basically, that comes down to picking the right tool for the job. In the Hadoop ecosystem you have many tools that might be used for data processing: you might use Pig or Hive, or your own custom MapReduce program, or that shiny new GUI-based tool that's just come out. Which one to use might depend on the user, the type of query you're interested in, or the type of job you want to run. From another perspective, you might want to store your data in columnar form for efficient storage and retrieval under particular query types, or in text so that users can write data producers in scripting languages like Perl or Python, or you may want to hook up an HBase table as a data source. As an end-user, I want to use whatever data processing tool is available to me. As a data designer, I want to optimize how data is stored. As a cluster manager/data architect, I want the ability to share pieces of information across the board, and to move data back and forth fluidly. HCatalog's promise is to make all of the above possible.
Picking out one new feature in Hadoop 1.0.3: webhdfs is interesting. Set one config option and the DataNodes and NameNodes become web servers (using the chosen auth mechanism), offering read and write access to the data. This is integral to the cluster: you ask the NN for data, which triggers a 307 redirect to a DN holding that data, which serves it up. The redirect is handled transparently by any HTTP client set up to follow redirects.
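In Hadoop 1.x, the one config option in question is `dfs.webhdfs.enabled` in `hdfs-site.xml`:

```xml
<!-- hdfs-site.xml: turn the NN and DNs into WebHDFS endpoints -->
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>
```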
Up until now, a change in the internal Hadoop version caused protocol version mismatch problems with all remote clients. Those clients also needed the entire Hadoop JAR set on their classpath, and were Java-only. Now: stable REST APIs, usable from any language, with no client-side Hadoop JARs.
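The read path can be sketched with plain HTTP and no Hadoop JARs. The hostname, port, and file path below are illustrative placeholders; the `/webhdfs/v1` URL scheme and the `op=OPEN` operation are part of the WebHDFS REST API, while the extra query parameter in the helper is just an example.

```python
# Minimal sketch of the WebHDFS read path. Host, port, and path are
# placeholder values, not real cluster details.

def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS v1 REST URL for the given operation on a file path."""
    query = "&".join(["op=" + op] + [f"{k}={v}" for k, v in sorted(params.items())])
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

url = webhdfs_url("namenode.example.com", 50070, "/user/alice/data.txt", "OPEN")

# Any redirect-following HTTP client works; e.g. with the standard library:
#   import urllib.request
#   data = urllib.request.urlopen(url).read()  # follows the NN's 307 to a DN
print(url)
```

The point of the design shows up in the commented-out fetch: the client never needs to know which DN holds the blocks, because the NN's 307 redirect carries that information and every mainstream HTTP library follows it automatically.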
This is something still coming together: HA clustering, using VMware vSphere as the HA clustering system underneath the classic failure points: the NameNode of HDFS and the JobTracker of MapReduce. Monitoring agents report failures to vSphere and trigger failover on process crash or hang, VM crash or hang, and physical hardware failure. This lets you host a set of independent VMs, one per master server, each with an isolated lifecycle and management. Very good for ops tasks: snapshotting, updating software in an offline VM, etc. It does not require that the workers be virtual; they can be physical, virtual, or a mix of both.