Spring for Apache Hadoop

This presentation covers the main features of the newly arrived Spring for Apache Hadoop project.

Transcript of "Spring for Apache Hadoop"

  1. Spring for Apache Hadoop. By Zenyk Matchyshyn
  2. Agenda
     • Goals of the project
     • Hadoop Map/Reduce
     • Scripting
     • HBase
     • Hive
     • Pig
     • Other
     • Alternatives
  3. Big Data – Why? Because of Terabytes and Petabytes:
     • Smart meter analysis
     • Genome processing
     • Sentiment & social media analysis
     • Network capacity trending & management
     • Ad targeting
     • Fraud detection
  4. Goals
     • Provide a programmatic model to work with the Hadoop ecosystem
     • Simplify client library usage
     • Provide Spring-friendly wrappers
     • Enable real-world usage as part of Spring Batch & Spring Integration
     • Leverage Spring features
  5. Supported distros
     • Apache Hadoop
     • Cloudera CDH
     • Greenplum HD
  6. HADOOP
  7. Hadoop (diagram of the Hadoop stack: Map/Reduce, HDFS, HBase, Pig, Hive)
  8. Hadoop basics: Split → Map → Shuffle → Reduce
     Split:   "Dog ate the bone" / "Cat ate the fish"
     Map:     (Dog, 1) (Ate, 1) (The, 1) (Bone, 1) / (Cat, 1) (Ate, 1) (The, 1) (Fish, 1)
     Shuffle: (Dog, 1) (Ate, {1, 1}) (The, {1, 1}) (Bone, 1) (Cat, 1) (Fish, 1)
     Reduce:  (Dog, 1) (Ate, 2) (The, 2) (Bone, 1) (Cat, 1) (Fish, 1)
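     For reference, a minimal sketch of the word-count Mapper and Reducer behind the diagram above, written against the standard org.apache.hadoop.mapreduce API; the class names are illustrative and both classes are shown in one listing for brevity:

       import java.io.IOException;
       import org.apache.hadoop.io.IntWritable;
       import org.apache.hadoop.io.LongWritable;
       import org.apache.hadoop.io.Text;
       import org.apache.hadoop.mapreduce.Mapper;
       import org.apache.hadoop.mapreduce.Reducer;

       // Map phase: emit (word, 1) for every token in the input line
       class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
           private static final IntWritable ONE = new IntWritable(1);
           private final Text word = new Text();

           @Override
           protected void map(LongWritable key, Text value, Context context)
                   throws IOException, InterruptedException {
               for (String token : value.toString().toLowerCase().split("\\s+")) {
                   word.set(token);
                   context.write(word, ONE);          // e.g. ("dog", 1)
               }
           }
       }

       // Reduce phase: sum the counts grouped per word by the shuffle
       class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
           @Override
           protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                   throws IOException, InterruptedException {
               int sum = 0;
               for (IntWritable v : values) {
                   sum += v.get();                    // ("ate", {1, 1}) -> ("ate", 2)
               }
               context.write(key, new IntWritable(sum));
           }
       }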
  9. Configuration

     <?xml version="1.0" encoding="UTF-8"?>
     <beans:beans xmlns="http://www.springframework.org/schema/hadoop"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns:beans="http://www.springframework.org/schema/beans"
         xmlns:context="http://www.springframework.org/schema/context"
         xsi:schemaLocation="http://www.springframework.org/schema/beans
             http://www.springframework.org/schema/beans/spring-beans.xsd
             http://www.springframework.org/schema/context
             http://www.springframework.org/schema/context/spring-context.xsd
             http://www.springframework.org/schema/hadoop
             http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

         <context:property-placeholder location="hadoop.properties"/>

         <configuration>
             fs.default.name=${hd.fs}
             mapred.job.tracker=${hd.jt}
         </configuration>
         ...
     </beans:beans>
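     The ${hd.fs} and ${hd.jt} placeholders above are resolved from the referenced hadoop.properties file; the values below are only illustrative ones for a local single-node setup:

       # hadoop.properties (illustrative values)
       hd.fs=hdfs://localhost:9000
       hd.jt=localhost:9001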
  10. Job definition

      <hdp:job id="hadoopJob"
          input-path="${wordcount.input.path}"
          output-path="${wordcount.output.path}"
          libs="file:${app.repo}/supporting-lib-*.jar"
          mapper="org.company.Mapper"
          reducer="org.company.Reducer"/>

      The equivalent plain Hadoop API code:

      Configuration conf = new Configuration();
      Job job = new Job(conf, "hadoopJob");
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      job.setMapperClass(Mapper.class);
      job.setReducerClass(Reducer.class);
      job.setInputFormatClass(TextInputFormat.class);
      job.setOutputFormatClass(TextOutputFormat.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      job.waitForCompletion(true);
  11. Job Execution

      <hdp:job-runner id="runner" run-at-startup="true"
          pre-action="someScript"
          post-action="someOtherScript"
          job-ref="hadoopJob"/>
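     With run-at-startup="false" the same job can also be launched programmatically; a minimal sketch, assuming the definitions above live in a context file named hadoop-context.xml (the <hdp:job> element exposes a plain org.apache.hadoop.mapreduce.Job bean):

       import org.apache.hadoop.mapreduce.Job;
       import org.springframework.context.support.ClassPathXmlApplicationContext;

       public class JobLauncher {
           public static void main(String[] args) throws Exception {
               ClassPathXmlApplicationContext ctx =
                       new ClassPathXmlApplicationContext("hadoop-context.xml");
               Job job = ctx.getBean("hadoopJob", Job.class); // declared via <hdp:job id="hadoopJob" .../>
               job.waitForCompletion(true);                   // block until the Map/Reduce job finishes
               ctx.close();
           }
       }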
  12. Other approaches

      • Hadoop Streaming:

        <hdp:streaming id="streaming"
            input-path="/input/" output-path="/output/"
            mapper="${path.cat}" reducer="${path.wc}"/>

      • Hadoop Tool Executor:

        <hdp:tool-runner id="someTool" tool-class="org.foo.SomeTool" run-at-startup="true">
            <hdp:arg value="data/in.txt"/>
            <hdp:arg value="data/out.txt"/>
            property=value
        </hdp:tool-runner>
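     The tool-class attribute points at a standard Hadoop Tool implementation; a hedged sketch of what org.foo.SomeTool (the placeholder name from the slide) might look like, with the mapper/reducer wiring omitted:

       package org.foo;

       import org.apache.hadoop.conf.Configured;
       import org.apache.hadoop.fs.Path;
       import org.apache.hadoop.mapreduce.Job;
       import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
       import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
       import org.apache.hadoop.util.Tool;

       public class SomeTool extends Configured implements Tool {
           @Override
           public int run(String[] args) throws Exception {
               // args[0] and args[1] come from the <hdp:arg> elements above
               Job job = new Job(getConf(), "some-tool");
               FileInputFormat.addInputPath(job, new Path(args[0]));
               FileOutputFormat.setOutputPath(job, new Path(args[1]));
               // job.setMapperClass(...), job.setReducerClass(...) etc. omitted
               return job.waitForCompletion(true) ? 0 : 1;
           }
       }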
  13. SCRIPTING
  14. Details
      • Supports JVM languages via JSR-223 (Groovy, JRuby, Jython, Rhino)
      • Exposes SimplerFileSystem
      • Provides implicit variables
      • Exposes FsShell to mimic the HDFS shell
      • Exposes DistCp to mimic distcp from Hadoop
  15. Example

      <hdp:script-tasklet id="script-tasklet">
          <hdp:script language="groovy">
              inputPath = "/user/gutenberg/input/word/"
              outputPath = "/user/gutenberg/output/word/"
              if (fsh.test(inputPath)) { fsh.rmr(inputPath) }
              if (fsh.test(outputPath)) { fsh.rmr(outputPath) }
              inputFile = "src/main/resources/data/nietzsche-chapter-1.txt"
              fsh.put(inputFile, inputPath)
          </hdp:script>
      </hdp:script-tasklet>
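     The same fsh operations are available from plain Java too; a minimal sketch, assuming FsShell (org.springframework.data.hadoop.fs.FsShell) can be constructed directly from a Hadoop Configuration:

       import org.apache.hadoop.conf.Configuration;
       import org.springframework.data.hadoop.fs.FsShell;

       public class HdfsSetup {
           public static void main(String[] args) throws Exception {
               Configuration conf = new Configuration();   // picks up fs.default.name etc.
               FsShell fsh = new FsShell(conf);

               String inputPath = "/user/gutenberg/input/word/";
               if (fsh.test(inputPath)) {
                   fsh.rmr(inputPath);                     // same calls as in the Groovy script above
               }
               fsh.put("src/main/resources/data/nietzsche-chapter-1.txt", inputPath);
           }
       }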
  16. HBASE
  17. HBase (same Hadoop stack diagram: Map/Reduce, HDFS, HBase, Pig, Hive)
  18. HBase basics
      • Distributed, column-oriented store
      • Independent of Hadoop
      • No translation into Map/Reduce
      • Stores data in MapFiles (indexed SequenceFiles)

      create 'sometable', 'clmnfamily1'
      put 'sometable', 'row_id1', 'clmnfamily1:c1', 'some values'
      scan 'sometable'
  19. Features
      • Easy connection interface
      • Thread safe
      • DAO friendly support and wrappers:
        • HbaseTemplate
        • TableCallback
        • RowMapper
        • ResultsExtractor
      • Binding a table to the current thread
  20. Example - beans

      <hdp:hbase-configuration/>

      <bean id="hbaseTemplate"
          class="org.springframework.data.hadoop.hbase.HbaseTemplate"
          p:configuration-ref="hbaseConfiguration"/>
  21. Example - code

      template.execute("MyTable", new TableCallback<Object>() {
          @Override
          public Object doInTable(HTable table) throws Throwable {
              Put p = new Put(Bytes.toBytes("SomeRow"));
              p.add(Bytes.toBytes("SomeColumn"), Bytes.toBytes("SomeQualifier"), Bytes.toBytes("AValue"));
              table.put(p);
              return null;
          }
      });

      List<String> rows = template.find("MyTable", "SomeColumn", new RowMapper<String>() {
          @Override
          public String mapRow(Result result, int rowNum) throws Exception {
              return result.toString();
          }
      });
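     The ResultsExtractor listed on the Features slide works on the whole ResultScanner in a single callback; a hedged fragment in the same style as the code above, assuming HbaseTemplate offers a find overload that accepts a ResultsExtractor:

       List<String> rowKeys = template.find("MyTable", "SomeColumn", new ResultsExtractor<List<String>>() {
           @Override
           public List<String> extractData(ResultScanner scanner) throws Exception {
               List<String> keys = new ArrayList<String>();
               for (Result result : scanner) {
                   keys.add(Bytes.toString(result.getRow())); // collect every row key in the column family
               }
               return keys;
           }
       });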
  22. HIVE
  23. Hive (same Hadoop stack diagram: Map/Reduce, HDFS, HBase, Pig, Hive)
  24. Hive basics
      • SQL-like interface - HiveQL
      • Has its own structure
      • Not a pipeline like Pig
      • Basically a distributed data warehouse
      • Has execution optimization
  25. Features
      • Hive server
      • DAO friendly Hive Thrift Client simplification
      • Hive JDBC driver within the Spring DAO ecosystem
      • Hive scripting
      • Thread safe
  26. Example - beans

      <hdp:hive-server host="hivehost" port="10001"/>

      <hdp:hive-template/>

      <hdp:hive-client-factory host="some-host" port="some-port">
          <hdp:script location="classpath:org/company/hive/script.q">
              <arguments>ignore-case=true</arguments>
          </hdp:script>
      </hdp:hive-client-factory>

      <hdp:hive-runner id="hiveRunner" run-at-startup="true">
          <hdp:script>
              DROP TABLE IF EXISTS testHiveBatchTable;
              CREATE TABLE testHiveBatchTable (key int, value string);
          </hdp:script>
          <hdp:script location="hive-scripts/script.q"/>
      </hdp:hive-runner>
  27. Example - template

      return hiveTemplate.execute(new HiveClientCallback<List<String>>() {
          @Override
          public List<String> doInHive(HiveClient hiveClient) throws Exception {
              return hiveClient.get_all_databases();
          }
      });
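     The Hive JDBC driver mentioned on the Features slide can also be used through Spring's plain JdbcTemplate; a hedged sketch, where the driver class, URL and query are illustrative for a Hive server on the host and port configured above:

       import java.util.List;
       import org.apache.hadoop.hive.jdbc.HiveDriver;
       import org.springframework.jdbc.core.JdbcTemplate;
       import org.springframework.jdbc.datasource.SimpleDriverDataSource;

       public class HiveJdbcExample {
           public static void main(String[] args) {
               SimpleDriverDataSource ds = new SimpleDriverDataSource(
                       new HiveDriver(), "jdbc:hive://hivehost:10001/default");
               JdbcTemplate jdbc = new JdbcTemplate(ds);
               // run a HiveQL statement and map the single result column to Strings
               List<String> tables = jdbc.queryForList("show tables", String.class);
               System.out.println(tables);
           }
       }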
  28. PIG
  29. Pig basics (same Hadoop stack diagram: Map/Reduce, HDFS, HBase, Pig, Hive)
  30. Pig
      • High level language for data analysis
      • Uses Pig Latin to describe data flows (translates into MapReduce)
      • Filters, Joins, Projections, Groupings, Counts, etc.
      • Example:

        A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
        B = FOREACH A GENERATE name;
        DUMP B;
  31. Features
      • Script execution
      • DAO friendly template
      • Thread safe
  32. Example - beans

      <hdp:pig-factory exec-type="LOCAL" job-name="pig-script"
          configuration-ref="hadoopConfiguration"
          properties-location="pig-dev.properties">
          source=${pig.script.src}
          <script location="org/company/pig/script.pig"/>
      </hdp:pig-factory>

      <hdp:pig-runner id="pigRunner" run-at-startup="true">
          <hdp:script>
              A = LOAD 'src/test/resources/logs/apache_access.log' USING PigStorage() AS (name:chararray, age:int);
              B = FOREACH A GENERATE name;
              DUMP B;
          </hdp:script>
          <hdp:script location="pig-scripts/script.pig">
              <arguments>electric=sea</arguments>
          </hdp:script>
      </hdp:pig-runner>

      <hdp:pig-template/>
  33. Example - template

      return pigTemplate.execute(new PigCallback<Set<String>>() {
          @Override
          public Set<String> doInPig(PigServer pig) throws ExecException, IOException {
              return pig.getAliasKeySet();
          }
      });
  34. Other features
      • Cascading support
      • Works well with Hadoop security
      • Spring Batch tasklets (a wiring sketch follows below)
      • Spring Integration support
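     For the Spring Batch integration mentioned above, the script-tasklet defined on slide 15 can be dropped into a batch step; a minimal sketch using the standard Spring Batch XML namespace, with illustrative job and step ids:

       <batch:job id="hdfsImportJob">
           <batch:step id="prepareHdfs">
               <batch:tasklet ref="script-tasklet"/>
           </batch:step>
       </batch:job>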
  35. Alternatives & related
      • Apache Flume – distributed data collection
      • Apache Oozie – workflow scheduler
      • Apache Sqoop – SQL bulk import/export
  36. Q/A?
