Spring for Apache Hadoop
Presentation covers main features of a newly arrived Spring for Apache Hadoop project.

Published in: Technology

  • 1. Spring for Apache Hadoop, by Zenyk Matchyshyn
  • 2. Agenda
      • Goals of the project
      • Hadoop Map/Reduce
      • Scripting
      • HBase
      • Hive
      • Pig
      • Other
      • Alternatives
  • 3. Big Data – Why? Because of terabytes and petabytes:
      • Smart meter analysis
      • Genome processing
      • Sentiment & social media analysis
      • Network capacity trending & management
      • Ad targeting
      • Fraud detection
  • 4. Goals
      • Provide a programmatic model for working with the Hadoop ecosystem
      • Simplify usage of client libraries
      • Provide Spring-friendly wrappers
      • Enable real-world usage as part of Spring Batch and Spring Integration
      • Leverage Spring features
  • 5. Supported distros
      • Apache Hadoop
      • Cloudera CDH
      • Greenplum HD
  • 6. HADOOP
  • 7. Hadoop [Diagram of the Hadoop ecosystem: Hadoop Map/Reduce, HDFS, HBase, Pig, Hive]
  • 8. Hadoop basics
      • Split: "Dog ate the bone" | "Cat ate the fish"
      • Map: (Dog, 1), (Ate, 1), (The, 1), (Bone, 1) | (Cat, 1), (Ate, 1), (The, 1), (Fish, 1)
      • Shuffle: (Dog, 1), (Ate, {1, 1}), (The, {1, 1}), (Bone, 1), (Cat, 1), (Fish, 1)
      • Reduce: (Dog, 1), (Ate, 2), (The, 2), (Bone, 1), (Cat, 1), (Fish, 1)
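The Map, Shuffle and Reduce stages from the word-count slide can be sketched in plain Java. This is only an illustration of the data flow under stated assumptions: it uses no Hadoop classes, the class and method names (`WordCountSketch`, `map`, `shuffle`, `reduce`, `wordCount`) are made up for this sketch, each input string stands in for one split, and words are lower-cased for simplicity (the slide keeps the original capitalization).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of the Map, Shuffle and Reduce stages; not Hadoop code.
class WordCountSketch {

    // Map stage: emit a (word, 1) pair for every word in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle stage: group all emitted values by key, e.g. ate -> [1, 1].
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce stage: sum the grouped values for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Runs the three stages in order over all splits.
    static Map<String, Integer> wordCount(List<String> splits) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String split : splits) {
            pairs.addAll(map(split));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        // prints {dog=1, ate=2, the=2, bone=1, cat=1, fish=1}
        System.out.println(wordCount(Arrays.asList("Dog ate the bone", "Cat ate the fish")));
    }
}
```

In a real cluster the map and reduce calls run on different nodes and the shuffle moves data between them; here everything runs in one thread purely to show which data each stage sees.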
  • 9. Configuration
      <?xml version="1.0" encoding="UTF-8"?>
      <beans:beans xmlns=""
          xmlns:xsi=""
          xmlns:beans=""
          xmlns:context=""
          xsi:schemaLocation="">

        <context:property-placeholder location=""/>

        <configuration>
          ${hd.fs}
          mapred.job.tracker=${hd.jt}
        </configuration>
        …
      </beans:beans>
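The `${hd.jt}` style placeholders in the configuration slide are filled in from a properties file by the property-placeholder mechanism. A minimal sketch of that substitution idea in plain Java follows. It is not Spring's implementation; `PlaceholderSketch`, `resolve`, and the sample value `localhost:9001` are assumptions made for illustration.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative ${key} substitution; a stand-in for the idea, not Spring's own resolver.
class PlaceholderSketch {

    // Matches ${key}, where key is any non-empty run of characters other than '}'.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces every ${key} in the text with the matching property value.
    static String resolve(String text, Properties props) {
        Matcher matcher = PLACEHOLDER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String key = matcher.group(1);
            String value = props.getProperty(key);
            if (value == null) {
                // This sketch chooses to fail fast on unknown keys.
                throw new IllegalArgumentException("Unresolved placeholder: " + key);
            }
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("hd.jt", "localhost:9001"); // hypothetical job tracker address
        // prints mapred.job.tracker=localhost:9001
        System.out.println(resolve("mapred.job.tracker=${hd.jt}", props));
    }
}
```

Keeping cluster addresses in a properties file this way lets the same bean definitions be reused across environments.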
  • 10. Job definition
      Spring XML:
      <hdp:job id="hadoopJob"
          input-path="${wordcount.input.path}"
          output-path="${wordcount.output.path}"
          libs="file:${app.repo}/supporting-lib-*.jar"
          mapper=""
          reducer=""/>

      Plain Hadoop API:
      // Create the job and declare its output types
      Configuration conf = new Configuration();
      Job job = new Job(conf, "hadoopJob");
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      // Mapper, reducer and input/output formats
      job.setMapperClass(Mapper.class);
      job.setReducerClass(Reducer.class);
      job.setInputFormatClass(TextInputFormat.class);
      job.setOutputFormatClass(TextOutputFormat.class);
      // Input and output paths come from the command line
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      // Submit the job and block until it finishes
      job.waitForCompletion(true);
  • 11. Job Execution
      <hdp:job-runner id="runner" run-at-startup="true"
          pre-action="someScript"
          post-action="someOtherScript"
          job-ref="hadoopJob" />
  • 12. Other approaches
      • Hadoop Streaming:
          <hdp:streaming id="streaming"
              input-path="/input/" output-path="/output/"
              mapper="${}" reducer="${path.wc}"/>
      • Hadoop Tool Executor:
          <hdp:tool-runner id="someTool" tool-class="" run-at-startup="true">
            <hdp:arg value="data/in.txt"/>
            <hdp:arg value="data/out.txt"/>
            property=value
          </hdp:tool-runner>
  • 13. SCRIPTING
  • 14. Details
      • Supports JVM languages from JSR-223 (Groovy, JRuby, Jython, Rhino)
      • Exposes SimplerFileSystem
      • Provides implicit variables
      • Exposes FsShell to mimic HDFS shell
      • Exposes DistCp to mimic distcp from Hadoop
  • 15. Example
      <hdp:script-tasklet id="script-tasklet">
        <hdp:script language="groovy">
          inputPath = "/user/gutenberg/input/word/"
          outputPath = "/user/gutenberg/output/word/"
          if (fsh.test(inputPath)) { fsh.rmr(inputPath) }
          if (fsh.test(outputPath)) { fsh.rmr(outputPath) }
          inputFile = "src/main/resources/data/nietzsche-chapter-1.txt"
          fsh.put(inputFile, inputPath)
        </hdp:script>
      </hdp:script-tasklet>
  • 16. HBASE
  • 17. HBase [Diagram of the Hadoop ecosystem: Hadoop Map/Reduce, HDFS, HBase, Pig, Hive]
  • 18. HBase basics
      • Distributed, column-oriented store
      • Independent of Hadoop
      • No translation into Map/Reduce
      • Stores data in MapFiles (indexed SequenceFiles)
      create 'sometable', 'clmnfamily1'
      put 'sometable', 'row_id1', 'clmnfamily1:c1', 'some values'
      scan 'sometable'
  • 19. Features
      • Easy connection interface
      • Thread safe
      • DAO friendly support and wrappers:
          • HbaseTemplate
          • TableCallback
          • RowMapper
          • ResultsExtractor
      • Binding table to current thread
  • 20. Example - beans
      <hdp:hbase-configuration/>
      <bean id="hbaseTemplate"
          class=""
          p:configuration-ref="hbaseConfiguration"/>
  • 21. Example - code
      // Write one cell; the template hands the table to the callback
      template.execute("MyTable", new TableCallback<Object>() {
        @Override
        public Object doInTable(HTable table) throws Throwable {
          Put p = new Put(Bytes.toBytes("SomeRow"));
          p.add(Bytes.toBytes("SomeColumn"), Bytes.toBytes("SomeQualifier"), Bytes.toBytes("AValue"));
          table.put(p);
          return null;
        }
      });

      // Read rows back, mapping each Result to a String
      List<String> rows = template.find("MyTable", "SomeColumn", new RowMapper<String>() {
        @Override
        public String mapRow(Result result, int rowNum) throws Exception {
          return result.toString();
        }
      });
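The callback style used with the template in the example above follows the general template-callback pattern: the template acquires the resource, passes it to the caller's callback, and cleans up afterwards. A minimal plain-Java sketch of that pattern is shown below. Every type in it (`FakeTable`, `FakeTableCallback`, `FakeTemplate`) is a hypothetical stand-in invented for this sketch, not a Spring for Apache Hadoop or HBase class, and the real template's internals may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Template-callback pattern with hypothetical stand-in types (no HBase involved).
class TemplateSketch {

    // Hypothetical stand-in for a table handle.
    static class FakeTable implements AutoCloseable {
        final String name;
        final List<String> rows = new ArrayList<>();
        boolean closed = false;

        FakeTable(String name) {
            this.name = name;
        }

        void put(String row) {
            rows.add(row);
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // Hypothetical callback, shaped like the TableCallback<T> on the slide.
    interface FakeTableCallback<T> {
        T doInTable(FakeTable table) throws Throwable;
    }

    // The template owns the resource life cycle; callers only supply the work.
    static class FakeTemplate {
        FakeTable lastTable; // kept only so the sketch can be inspected afterwards

        <T> T execute(String tableName, FakeTableCallback<T> action) {
            FakeTable table = new FakeTable(tableName);
            lastTable = table;
            try {
                return action.doInTable(table);
            } catch (Throwable t) {
                // Wrap anything thrown by the callback in an unchecked exception.
                throw new IllegalStateException("Table operation failed", t);
            } finally {
                table.close(); // always release the table, even on failure
            }
        }
    }

    public static void main(String[] args) {
        FakeTemplate template = new FakeTemplate();
        Integer rowCount = template.execute("MyTable", table -> {
            table.put("SomeRow");
            return table.rows.size();
        });
        System.out.println(rowCount + " row(s), closed=" + template.lastTable.closed);
    }
}
```

The benefit of this shape is that callers never open or close the table themselves, which removes a common source of resource leaks in DAO code.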
  • 22. HIVE
  • 23. Hive [Diagram of the Hadoop ecosystem: Hadoop Map/Reduce, HDFS, HBase, Pig, Hive]
  • 24. Hive basics
      • SQL-like interface - HiveQL
      • Has its own structure
      • Not a pipeline like Pig
      • Basically a distributed data warehouse
      • Has execution optimization
  • 25. Features
      • Hive server
      • DAO friendly Hive Thrift Client simplification
      • Hive JDBC driver within Spring DAO ecosystem
      • Hive scripting
      • Thread safe
  • 26. Example - beans
      <hdp:hive-server host="hivehost" port="10001" />
      <hdp:hive-template />

      <hdp:hive-client-factory host="some-host" port="some-port">
        <hdp:script location="classpath:org/company/hive/script.q">
          <arguments>ignore-case=true</arguments>
        </hdp:script>
      </hdp:hive-client-factory>

      <hdp:hive-runner id="hiveRunner" run-at-startup="true">
        <hdp:script>
          DROP TABLE IF EXISTS testHiveBatchTable;
          CREATE TABLE testHiveBatchTable (key int, value string);
        </hdp:script>
        <hdp:script location="hive-scripts/script.q"/>
      </hdp:hive-runner>
  • 27. Example - template
      return hiveTemplate.execute(new HiveClientCallback<List<String>>() {
        @Override
        public List<String> doInHive(HiveClient hiveClient) throws Exception {
          return hiveClient.get_all_databases();
        }
      });
  • 28. PIG
  • 29. Pig basics [Diagram of the Hadoop ecosystem: Hadoop Map/Reduce, HDFS, HBase, Pig, Hive]
  • 30. Pig
      • High-level language for data analysis
      • Uses Pig Latin to describe data flows (translates into MapReduce)
      • Filters, joins, projections, groupings, counts, etc.
      • Example:
          A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
          B = FOREACH A GENERATE name;
          DUMP B;
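For readers more at home in Java, the projection in the Pig Latin example above (load records with a schema, then keep only the name field) can be approximated in plain Java. This is a rough single-machine analogy, not how Pig executes the script: `PigProjectionSketch`, `Student`, `load` and `projectNames` are names invented for the sketch, and the sample records in `main` are made up. The tab delimiter mirrors the default field separator of `PigStorage()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Single-machine analogy of LOAD ... AS (name, age, gpa) followed by FOREACH ... GENERATE name.
class PigProjectionSketch {

    // One record with the schema (name:chararray, age:int, gpa:float).
    static class Student {
        final String name;
        final int age;
        final float gpa;

        Student(String name, int age, float gpa) {
            this.name = name;
            this.age = age;
            this.gpa = gpa;
        }
    }

    // Rough equivalent of: A = LOAD ... USING PigStorage() AS (...), with tab-separated fields.
    static List<Student> load(List<String> lines) {
        List<Student> students = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            students.add(new Student(fields[0], Integer.parseInt(fields[1]), Float.parseFloat(fields[2])));
        }
        return students;
    }

    // Rough equivalent of: B = FOREACH A GENERATE name
    static List<String> projectNames(List<Student> students) {
        List<String> names = new ArrayList<>();
        for (Student student : students) {
            names.add(student.name);
        }
        return names;
    }

    public static void main(String[] args) {
        // Made-up sample records; DUMP B would show one name per record.
        List<String> lines = Arrays.asList("John\t18\t4.0", "Mary\t19\t3.8");
        System.out.println(projectNames(load(lines)));
    }
}
```

The three-line Pig script expresses the same flow declaratively and lets Pig turn it into MapReduce jobs, which is the point of the language.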
  • 31. Features
      • Scripts execution
      • DAO friendly template
      • Thread safe
  • 32. Example - beans
      <hdp:pig-factory exec-type="LOCAL" job-name="pig-script" configuration-ref="hadoopConfiguration" properties-location="">
        source=${pig.script.src}
        <script location="org/company/pig/script.pig"/>
      </hdp:pig-factory>

      <hdp:pig-runner id="pigRunner" run-at-startup="true">
        <hdp:script>
          A = LOAD 'src/test/resources/logs/apache_access.log' USING PigStorage() AS (name:chararray, age:int);
          B = FOREACH A GENERATE name;
          DUMP B;
        </hdp:script>
        <hdp:script location="pig-scripts/script.pig">
          <arguments>electric=sea</arguments>
        </hdp:script>
      </hdp:pig-runner>

      <hdp:pig-template/>
  • 33. Example - template
      return pigTemplate.execute(new PigCallback<Set<String>>() {
        @Override
        public Set<String> doInPig(PigServer pig) throws ExecException, IOException {
          return pig.getAliasKeySet();
        }
      });
  • 34. Other features
      • Cascading support
      • Works well with Hadoop security
      • Spring Batch tasklets
      • Spring Integration support
  • 35. Alternatives & related
      • Apache Flume – distributed data collection
      • Apache Oozie – workflow scheduler
      • Apache Sqoop – SQL bulk import/export
  • 36. Q/A?