Spring for Apache Hadoop
Presentation covers main features of a newly arrived Spring for Apache Hadoop project.

    Spring for Apache Hadoop: Presentation Transcript

    • Spring for Apache Hadoop, by Zenyk Matchyshyn
    • Agenda
      • Goals of the project
      • Hadoop Map/Reduce
      • Scripting
      • HBase
      • Hive
      • Pig
      • Other
      • Alternatives
    • Big Data – Why? Because of terabytes and petabytes:
      • Smart meter analysis
      • Genome processing
      • Sentiment & social media analysis
      • Network capacity trending & management
      • Ad targeting
      • Fraud detection
    • Goals
      • Provide a programmatic model to work with the Hadoop ecosystem
      • Simplify client library usage
      • Provide Spring-friendly wrappers
      • Enable real-world usage as part of Spring Batch & Spring Integration
      • Leverage Spring features
    • Supported distros
      • Apache Hadoop
      • Cloudera CDH
      • Greenplum HD
    • HADOOP
    • Hadoop (stack diagram: Pig, Hive, HBase and Map/Reduce layered over HDFS)
    • Hadoop basics: Split → Map → Shuffle → Reduce
      Split:   "Dog ate the bone" / "Cat ate the fish"
      Map:     (dog,1) (ate,1) (the,1) (bone,1) / (cat,1) (ate,1) (the,1) (fish,1)
      Shuffle: dog:{1}, ate:{1,1}, the:{1,1}, bone:{1}, cat:{1}, fish:{1}
      Reduce:  dog:1, ate:2, the:2, bone:1, cat:1, fish:1
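    The Split → Map → Shuffle → Reduce flow above can be mimicked in plain Java, with no Hadoop dependency. This is only an illustration of the word-count logic; the class and method names here are made up for the sketch:

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java sketch of the Split -> Map -> Shuffle -> Reduce word count.
public class WordCountSketch {
    public static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
                // Map phase: emit one (word, 1) pair per word on each line
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                // Shuffle + Reduce: group identical words and sum their ones
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));
    }

    public static void main(String[] args) {
        // Same input as the slide: "ate" and "the" appear twice each
        System.out.println(count(List.of("Dog ate the bone", "Cat ate the fish")));
    }
}
```

    In real Hadoop the shuffle happens across machines between the map and reduce tasks, but the per-key grouping and summing is exactly this.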
    • Configuration:

      <?xml version="1.0" encoding="UTF-8"?>
      <beans:beans xmlns="http://www.springframework.org/schema/hadoop"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xmlns:beans="http://www.springframework.org/schema/beans"
          xmlns:context="http://www.springframework.org/schema/context"
          xsi:schemaLocation="http://www.springframework.org/schema/beans
              http://www.springframework.org/schema/beans/spring-beans.xsd
              http://www.springframework.org/schema/context
              http://www.springframework.org/schema/context/spring-context.xsd
              http://www.springframework.org/schema/hadoop
              http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

          <context:property-placeholder location="hadoop.properties"/>

          <configuration>
              fs.default.name=${hd.fs}
              mapred.job.tracker=${hd.jt}
          </configuration>
          ...
      </beans:beans>
    • Job definition (XML namespace vs. plain Java):

      <hdp:job id="hadoopJob"
          input-path="${wordcount.input.path}"
          output-path="${wordcount.output.path}"
          libs="file:${app.repo}/supporting-lib-*.jar"
          mapper="org.company.Mapper"
          reducer="org.company.Reducer"/>

      Configuration conf = new Configuration();
      Job job = new Job(conf, "hadoopJob");
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      job.setMapperClass(Mapper.class);
      job.setReducerClass(Reducer.class);
      job.setInputFormatClass(TextInputFormat.class);
      job.setOutputFormatClass(TextOutputFormat.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      job.waitForCompletion(true);
    • Job execution:

      <hdp:job-runner id="runner" run-at-startup="true"
          pre-action="someScript"
          post-action="someOtherScript"
          job-ref="hadoopJob"/>
    • Other approaches
      • Hadoop Streaming:

        <hdp:streaming id="streaming"
            input-path="/input/" output-path="/output/"
            mapper="${path.cat}" reducer="${path.wc}"/>

      • Hadoop Tool Executor:

        <hdp:tool-runner id="someTool" tool-class="org.foo.SomeTool" run-at-startup="true">
            <hdp:arg value="data/in.txt"/>
            <hdp:arg value="data/out.txt"/>
            property=value
        </hdp:tool-runner>
    • SCRIPTING
    • Details
      • Supports JVM languages from JSR-223 (Groovy, JRuby, Jython, Rhino)
      • Exposes SimplerFileSystem
      • Provides implicit variables
      • Exposes FsShell to mimic the HDFS shell
      • Exposes DistCp to mimic distcp from Hadoop
    • Example:

      <hdp:script-tasklet id="script-tasklet">
          <hdp:script language="groovy">
              inputPath = "/user/gutenberg/input/word/"
              outputPath = "/user/gutenberg/output/word/"
              if (fsh.test(inputPath)) { fsh.rmr(inputPath) }
              if (fsh.test(outputPath)) { fsh.rmr(outputPath) }
              inputFile = "src/main/resources/data/nietzsche-chapter-1.txt"
              fsh.put(inputFile, inputPath)
          </hdp:script>
      </hdp:script-tasklet>
    • HBASE
    • HBase (stack diagram: Pig, Hive, HBase and Map/Reduce layered over HDFS, with HBase highlighted)
    • HBase basics
      • Distributed, column-oriented store
      • Independent of Hadoop
      • No translation into Map/Reduce
      • Stores data in MapFiles (indexed SequenceFiles)

      create 'sometable', 'clmnfamily1'
      put 'sometable', 'row_id1', 'clmnfamily1:c1', 'some values'
      scan 'sometable'
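    A common way to picture the HBase data model used above is a sorted map of row keys to (column → value) maps. The toy class below is purely illustrative (`ColumnStoreSketch` is a made-up name, not part of HBase or Spring):

```java
import java.util.*;

// Toy in-memory picture of HBase's data model: a table maps sorted row keys
// to "family:qualifier" -> value maps, mirroring the shell commands above.
public class ColumnStoreSketch {
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    // Analogue of: put 'sometable', 'row_id1', 'clmnfamily1:c1', 'some values'
    public void put(String rowId, String column, String value) {
        rows.computeIfAbsent(rowId, k -> new TreeMap<>()).put(column, value);
    }

    // Analogue of: scan 'sometable' -- rows come back ordered by row key
    public NavigableMap<String, NavigableMap<String, String>> scan() {
        return rows;
    }
}
```

    The real store adds versions (timestamps), region splitting, and on-disk MapFiles, but the row-key-ordered nested-map shape is the core idea.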
    • Features
      • Easy connection interface
      • Thread safe
      • DAO-friendly support and wrappers:
        • HbaseTemplate
        • TableCallback
        • RowMapper
        • ResultsExtractor
      • Binding a table to the current thread
    • Example - beans:

      <hdp:hbase-configuration/>

      <bean id="hbaseTemplate"
          class="org.springframework.data.hadoop.hbase.HbaseTemplate"
          p:configuration-ref="hbaseConfiguration"/>
    • Example - code:

      template.execute("MyTable", new TableCallback<Object>() {
          @Override
          public Object doInTable(HTable table) throws Throwable {
              Put p = new Put(Bytes.toBytes("SomeRow"));
              p.add(Bytes.toBytes("SomeColumn"), Bytes.toBytes("SomeQualifier"), Bytes.toBytes("AValue"));
              table.put(p);
              return null;
          }
      });

      List<String> rows = template.find("MyTable", "SomeColumn", new RowMapper<String>() {
          @Override
          public String mapRow(Result result, int rowNum) throws Exception {
              return result.toString();
          }
      });
    • HIVE
    • Hive (stack diagram: Pig, Hive, HBase and Map/Reduce layered over HDFS, with Hive highlighted)
    • Hive basics
      • SQL-like interface - HiveQL
      • Has its own structure
      • Not a pipeline like Pig
      • Basically a distributed data warehouse
      • Has execution optimization
    • Features
      • Hive server
      • DAO-friendly Hive Thrift client simplification
      • Hive JDBC driver within the Spring DAO ecosystem
      • Hive scripting
      • Thread safe
    • Example - beans:

      <hdp:hive-server host="hivehost" port="10001"/>

      <hdp:hive-template/>

      <hdp:hive-client-factory host="some-host" port="some-port">
          <hdp:script location="classpath:org/company/hive/script.q">
              <arguments>ignore-case=true</arguments>
          </hdp:script>
      </hdp:hive-client-factory>

      <hdp:hive-runner id="hiveRunner" run-at-startup="true">
          <hdp:script>
              DROP TABLE IF EXISTS testHiveBatchTable;
              CREATE TABLE testHiveBatchTable (key int, value string);
          </hdp:script>
          <hdp:script location="hive-scripts/script.q"/>
      </hdp:hive-runner>
    • Example - template:

      return hiveTemplate.execute(new HiveClientCallback<List<String>>() {
          @Override
          public List<String> doInHive(HiveClient hiveClient) throws Exception {
              return hiveClient.get_all_databases();
          }
      });
    • PIG
    • Pig basics (stack diagram: Pig, Hive, HBase and Map/Reduce layered over HDFS, with Pig highlighted)
    • Pig
      • High-level language for data analysis
      • Uses Pig Latin to describe data flows (translates into MapReduce)
      • Filters, joins, projections, groupings, counts, etc.
      • Example:

        A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
        B = FOREACH A GENERATE name;
        DUMP B;
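    The `FOREACH A GENERATE name` projection in the Pig Latin example above has a direct analogue in plain Java. The sketch below is only to show what the dataflow computes; the `PigFlowSketch`/`Student` names are invented for illustration:

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java analogue of the Pig script: LOAD tuples, then project one field.
public class PigFlowSketch {
    // One tuple of the relation A: (name:chararray, age:int, gpa:float)
    public record Student(String name, int age, double gpa) {}

    // B = FOREACH A GENERATE name;
    public static List<String> names(List<Student> a) {
        return a.stream().map(Student::name).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Student> a = List.of(new Student("alice", 20, 3.5), new Student("bob", 22, 3.1));
        System.out.println(names(a)); // DUMP B;
    }
}
```

    The difference, of course, is that Pig compiles this projection into MapReduce jobs that stream over HDFS files rather than an in-memory list.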
    • Features
      • Script execution
      • DAO-friendly template
      • Thread safe
    • Example - beans:

      <hdp:pig-factory exec-type="LOCAL" job-name="pig-script"
          configuration-ref="hadoopConfiguration"
          properties-location="pig-dev.properties">
          source=${pig.script.src}
          <script location="org/company/pig/script.pig"/>
      </hdp:pig-factory>

      <hdp:pig-runner id="pigRunner" run-at-startup="true">
          <hdp:script>
              A = LOAD 'src/test/resources/logs/apache_access.log' USING PigStorage() AS (name:chararray, age:int);
              B = FOREACH A GENERATE name;
              DUMP B;
          </hdp:script>
          <hdp:script location="pig-scripts/script.pig">
              <arguments>electric=sea</arguments>
          </hdp:script>
      </hdp:pig-runner>

      <hdp:pig-template/>
    • Example - template:

      return pigTemplate.execute(new PigCallback<Set<String>>() {
          @Override
          public Set<String> doInPig(PigServer pig) throws ExecException, IOException {
              return pig.getAliasKeySet();
          }
      });
    • Other features
      • Cascading support
      • Works well with Hadoop security
      • Spring Batch tasklets
      • Spring Integration support
    • Alternatives & related
      • Apache Flume - distributed data collection
      • Apache Oozie - workflow scheduler
      • Apache Sqoop - SQL bulk import/export
    • Q/A?