This document discusses Spring for Apache Hadoop, a framework that simplifies using Apache Hadoop and related projects like HBase and Hive within the Spring programming model. It provides wrappers and configuration for common Hadoop tasks like MapReduce jobs, scripting, and accessing Hadoop databases and data processing engines. The goals are to provide a programmatic model for the Hadoop ecosystem, simplify client libraries, and leverage Spring features. It supports various Hadoop distributions and provides interfaces for MapReduce, HBase, Hive, Pig and other Hadoop technologies.
2. Agenda
• Goals of the project
• Hadoop Map/Reduce
• Scripting
• HBase
• Hive
• Pig
• Other
• Alternatives
3. Big Data – Why?
Because data volumes now reach terabytes and petabytes:
• Smart meter analysis
• Genome processing
• Sentiment & social media analysis
• Network capacity trending & management
• Ad targeting
• Fraud detection
4. Goals
• Provide a programmatic model for working with the Hadoop ecosystem
• Simplify client library usage
• Provide Spring-friendly wrappers
• Enable real-world usage as part of Spring Batch & Spring Integration
• Leverage Spring features
18. HBase basics
• Distributed, column-oriented store
• Independent of Hadoop
• No translation into Map/Reduce
• Stores data in MapFiles (indexed SequenceFiles)
create 'sometable', 'clmnfamily1'
put 'sometable', 'row_id1', 'clmnfamily1:c1', 'some values'
scan 'sometable'
19. Features
• Easy connection interface
• Thread safe
• DAO friendly support and wrappers:
• HbaseTemplate
• TableCallback
• RowMapper
• ResultsExtractor
• Binding table to current thread
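The template-plus-callback idea behind the wrappers listed above can be sketched in plain Java. This is only an illustration of the pattern under stated assumptions, not the Spring for Apache Hadoop API: `InMemoryTable`, `Template`, and this `RowMapper` signature are hypothetical stand-ins, and the in-memory map replaces a real HBase table. The row and column names are taken from the shell example on the previous slide.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TemplateSketch {

    /** Hypothetical callback: turns one stored row into a domain object. */
    interface RowMapper<T> {
        T mapRow(String rowId, Map<String, String> columns, int rowNum);
    }

    /** Hypothetical in-memory "table": row id -> (family:qualifier -> value). */
    static final class InMemoryTable {
        private final Map<String, Map<String, String>> rows = new LinkedHashMap<>();

        void put(String rowId, String column, String value) {
            rows.computeIfAbsent(rowId, k -> new LinkedHashMap<>()).put(column, value);
        }
    }

    /** Hypothetical template: owns the scan loop so callers only supply a mapper. */
    static final class Template {
        private final InMemoryTable table;

        Template(InMemoryTable table) {
            this.table = table;
        }

        <T> List<T> find(RowMapper<T> mapper) {
            List<T> out = new ArrayList<>();
            int rowNum = 0;
            for (Map.Entry<String, Map<String, String>> e : table.rows.entrySet()) {
                out.add(mapper.mapRow(e.getKey(), e.getValue(), rowNum++));
            }
            return out;
        }
    }

    /** Stores one row, then reads it back through a mapper. */
    static List<String> demo() {
        InMemoryTable table = new InMemoryTable();
        table.put("row_id1", "clmnfamily1:c1", "some values");
        Template template = new Template(table);
        return template.find((rowId, cols, n) -> rowId + "=" + cols.get("clmnfamily1:c1"));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the design is that resource handling and iteration live in one place (the template), while DAO code stays limited to the mapping logic.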
24. Hive basics
• SQL-like interface - HiveQL
• Has its own structure
• Not a pipeline like Pig
• Basically a distributed data warehouse
• Has execution optimization
25. Features
• Hive server
• DAO friendly Hive Thrift Client simplification
• Hive JDBC driver within Spring DAO ecosystem
• Hive scripting
• Thread safe
26. Example - beans
<hdp:hive-server host="hivehost" port="10001" />
<hdp:hive-template />
<hdp:hive-client-factory host="some-host" port="some-port" >
<hdp:script location="classpath:org/company/hive/script.q">
<arguments>ignore-case=true</arguments>
</hdp:script>
</hdp:hive-client-factory>
<hdp:hive-runner id="hiveRunner" run-at-startup="true">
<hdp:script>
DROP TABLE IF EXISTS testHiveBatchTable;
CREATE TABLE testHiveBatchTable (key int, value string);
</hdp:script>
<hdp:script location="hive-scripts/script.q"/>
</hdp:hive-runner>
30. Pig
• High level language for data analysis
• Uses PigLatin to describe data flows (translates into MapReduce)
• Filters, Joins, Projections, Groupings, Counts, etc.
• Example:
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
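For readers more familiar with Java than PigLatin, the data flow above (load records, project one field, dump the result) can be mirrored roughly with Java streams. This is a sketch for comparison only: it runs in a single JVM rather than as MapReduce jobs, the `Student` class is a hypothetical stand-in for the declared schema, and the sample rows are made up for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

public class PigFlowSketch {

    /** Hypothetical record type mirroring the schema (name:chararray, age:int, gpa:float). */
    static final class Student {
        final String name;
        final int age;
        final float gpa;

        Student(String name, int age, float gpa) {
            this.name = name;
            this.age = age;
            this.gpa = gpa;
        }
    }

    /** Equivalent of: B = FOREACH A GENERATE name; */
    static List<String> projectNames(List<Student> a) {
        return a.stream().map(s -> s.name).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for: A = LOAD 'student' ... (rows are invented sample data)
        List<Student> a = List.of(
                new Student("alice", 20, 3.9f),
                new Student("bob", 22, 3.1f));

        // Equivalent of: DUMP B;
        projectNames(a).forEach(System.out::println);
    }
}
```

Each PigLatin statement maps to one step here, which is the appeal of the language: the script reads as a pipeline of relational operations while the engine decides how to execute it.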