Composing re-useable ETL on Hadoop

Slide notes
  • Slide 2: There are 3 stages to our data process.
  • Slide 4: semi-structured; postcode field example
  • Composing re-useable ETL on Hadoop

    1. Composing re-useable ETL on Hadoop. Paul Lam (@Quantisan), Big Data London, 1 October 2012
    2. Data science lifecycle: Acquire, Action, Analyse
    3. Data science lifecycle: Acquire (80% of work), Action, Analyse (80% of result)
    4. From these
       ✤ {:status 200,
          :scheme http,
          :pipe .,
          :request-uri /broadband/?gclid=CPnYgdqj0bECFa4mtAodVEsAYA,
          :http-x-forwarded-for 92.9.200.50,
          :msec 1344196910.137,
          :sent-http-set-cookie -,
          :body-bytes-sent 18836,
          :query-string gclid=CPnYgdj0bECa4mtAdVEsAYA,
          :request-content-type -,
          :cookie-urefs -,
          :request GET /broadband/?gclid=CPnYgdj0bECa4mtAdVEsAYA HTTP/1.1,
          :upstream-response-time 0.164,
          :sent-http-content-type text/html,
          :hostname nginx-lb-20120229-1942-24.uswitchinternal.com,
          :sent-http-location -,
          :time-local 05/Aug/2012:20:01:50 +0000,
          :http-referer http://www.google.co.uk/aclk?sa=l&ai=D1556&rct=j&q=best%20value%20internet%20uk,
          :http-user-agent Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.60 Safari/537.1,
          :request-time 0.164,
          :request-body -,
          :http-host www.uswitch.com,
          :upstream-addr 178.32.60.100:80,
          :sent-http-server -,
          :upstream-status 200,
          :uscc <ANON>}
    5. To these
       ✤ Removing bots and crawlers
       ✤ Picking out relevant events
       ✤ Grouping events by users
       ✤ Sequencing the events
       ✤ Structuring as a matrix
       ✤ Graphing
    6. Using this
       ✤ an abstraction framework for building MapReduce jobs
       ✤ linearly scalable data processing
       ✤ www.cascading.org
    7. Word Count - MapReduce
       ✤ public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
       ✤ public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
       ✤ “take this line and split it to word tokens”
       ✤ public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
       ✤ public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
       ✤ “take each word token and increment a counter”
    8. Word Count - Cascalog
    9. Benefits of Domain Specific Language
       ✤ At the same level of abstraction as the problem
          ✤ split words, then do a count on them
       ✤ Less custom code, less prone to implementation bugs
       ✤ More readable
       ✤ More productive
    10. TF-IDF
       ✤ Extended from word count example
       ✤ Single-purpose methods
       ✤ Composition of functions
       ✤ github.com/Quantisan/Impatient
       ✤ github.com/Cascading/Impatient
    11. Our data processing methodology
       ✤ Apply single-purpose functions to immutable data
       ✤ Only build what we need as we go
       ✤ Composability, extensibility, maintainability
       ✤ Use the right tool for the right task
    12. Contact
       ✤ Paul Lam, data scientist at uSwitch
       ✤ @Quantisan
       ✤ paul.lam@forward.co.uk
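
The cleaning steps listed on slide 5 (removing bots and crawlers, picking out relevant events, grouping events by users, sequencing the events) can be read as a chain of single-purpose functions applied to immutable data, as slide 11 puts it. The sketch below illustrates that shape in plain Python rather than Cascalog, and it is not the pipeline from the talk. The field names (`user`, `msec`, `uri`, `agent`), the bot markers, and the "relevant" path prefix are all hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical substrings that mark a crawler user agent.
BOT_MARKERS = ("bot", "crawler", "spider")


def remove_bots(events):
    """Return the events whose user agent does not look like a bot."""
    return [e for e in events
            if not any(m in e["agent"].lower() for m in BOT_MARKERS)]


def pick_relevant(events, prefix="/broadband/"):
    """Return the events whose request URI starts with the given prefix."""
    return [e for e in events if e["uri"].startswith(prefix)]


def group_by_user(events):
    """Map each user id to that user's events, ordered by timestamp."""
    ordered = sorted(events, key=itemgetter("user", "msec"))
    return {user: list(group)
            for user, group in groupby(ordered, key=itemgetter("user"))}


def sequence_events(events):
    """Compose the steps: raw events in, per-user URI sequences out."""
    grouped = group_by_user(pick_relevant(remove_bots(events)))
    return {user: [e["uri"] for e in evs] for user, evs in grouped.items()}


# Made-up sample records; every step returns a new collection,
# so raw_events itself is never modified.
raw_events = [
    {"user": "u1", "msec": 3.0, "uri": "/broadband/compare", "agent": "Mozilla/5.0"},
    {"user": "u2", "msec": 1.0, "uri": "/broadband/", "agent": "Googlebot/2.1"},
    {"user": "u1", "msec": 1.0, "uri": "/broadband/", "agent": "Mozilla/5.0"},
    {"user": "u1", "msec": 2.0, "uri": "/energy/", "agent": "Mozilla/5.0"},
]
sequences = sequence_events(raw_events)
```

Because each function takes a collection and returns a new one, any step can be tested on its own or swapped out without touching the others.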

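Slide 7 describes word count as two phases: the mapper takes a line and splits it into word tokens, and the reducer takes each word token and increments a counter. A minimal in-memory sketch of those two phases follows, in plain Python rather than the Hadoop Java API or Cascalog. The function names, the lower-casing, and the whitespace tokenizer are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: split a line into tokens and emit a (word, 1) pair for each."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reducer: sum the counts emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


counts = word_count(["the quick brown fox", "the lazy dog", "The end"])
```

Written this way, the job reads as "split words, then do a count on them", which is the level of abstraction slide 9 argues for.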
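
Slide 10 describes TF-IDF as an extension of the word count example, built from single-purpose methods that are then composed. The sketch below follows that shape in plain Python; it is not the code in the linked repositories. The function names are hypothetical, and the weighting (raw term count multiplied by the natural log of total documents over document frequency) is one common TF-IDF variant, chosen here as an assumption.

```python
import math
from collections import Counter


def tokenize(text):
    """Split a document into lower-cased whitespace-separated tokens."""
    return text.lower().split()


def term_frequency(tokens):
    """Count occurrences of each term in one document (word count)."""
    return Counter(tokens)


def document_frequency(docs_tokens):
    """Count how many documents contain each term at least once."""
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return df


def tf_idf(docs):
    """Compose the steps above. docs maps doc id to text;
    the result maps doc id to a {term: score} dict."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    df = document_frequency(tokens.values())
    n_docs = len(docs)
    return {
        doc_id: {term: tf * math.log(n_docs / df[term])
                 for term, tf in term_frequency(toks).items()}
        for doc_id, toks in tokens.items()
    }


# Made-up two-document corpus.
scores = tf_idf({"d1": "hadoop etl etl", "d2": "hadoop cascalog"})
```

A term that appears in every document, such as `hadoop` in this sample, scores zero under this weighting, while a term concentrated in one document scores highest there.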