SlideShare a Scribd company logo
Composing re-useable
ETL
on Hadoop
                       Paul Lam (@Quantisan)

                       Big Data London, 1 October 2012
Data science lifecycle


                Acquire




       Action             Analyse
Data science lifecycle


                   Acquire        80% of work




          Action             Analyse



80% of result
From these

✤   {:status 200, :scheme http, :pipe ., :request-uri /broadband/?
    gclid=CPnYgdqj0bECFa4mtAodVEsAYA, :http-x-forwarded-for 92.9.200.50, :msec
    1344196910.137, :sent-http-set-cookie -, :body-bytes-sent 18836, :query-string
    gclid=CPnYgdj0bECa4mtAdVEsAYA, :request-content-type -, :cookie-urefs -, :request
    GET /broadband/?gclid=CPnYgdj0bECa4mtAdVEsAYA HTTP/1.1, :upstream-
    response-time 0.164, :sent-http-content-type text/html, :hostname nginx-
    lb-20120229-1942-24.uswitchinternal.com, :sent-http-location -, :time-local 05/Aug/
    2012:20:01:50 +0000, :http-referer http://www.google.co.uk/aclk?
    sa=l&ai=D1556&rct=j&q=best%20value%20internet%20uk, :http-user-agent Mozilla/
    5.0 (Windows NT 6.0) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.60
    Safari/537.1, :request-time 0.164, :request-body -, :http-host
    www.uswitch.com, :upstream-addr 178.32.60.100:80, :sent-http-server -, :upstream-
    status 200, :uscc <ANON>}
To these

✤   Removing bots and crawlers

✤   Picking out relevant events

✤   Grouping events by users

✤   Sequencing the events

✤   Structuring as a matrix

✤   Graphing
Using this




✤   an abstraction framework for building MapReduce jobs

✤   linearly scalable data processing

✤   www.cascading.org
Word Count - MapReduce

✤   public static class Map extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, IntWritable>

    ✤   public void map(LongWritable key, Text value, OutputCollector<Text,
        IntWritable> output, Reporter reporter) throws IOException

    ✤   “take this line and split it to word tokens”

✤   public static class Reduce extends MapReduceBase implements Reducer<Text,
    IntWritable, Text, IntWritable>

    ✤   public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,
        IntWritable> output, Reporter reporter) throws IOException

    ✤   “take each word token and increment a counter”
Word Count - Cascalog
Benefits of
Domain Specific Language

✤   At same level of abstraction of the problem

    ✤   split words, then do a count on them

✤   Fewer custom code, less prone to implementation bugs

✤   More readable

✤   More productive
TF-IDF

✤   Extended from word
    count example

✤   Single-purpose
    methods

✤   Composition of
    functions



✤   github.com/Quantisan/Impatient


✤   github.com/Cascading/Impatient
Our data processing methodology

✤   Apply single-purpose
    functions to immutable data

✤   Only build what we need as
    we go

✤   Composability, extensibility,
    maintainability

✤   Use the right tool for the right
    task
Contact



✤   Paul Lam, data scientist at uSwitch

✤   @Quantisan

✤   paul.lam@forward.co.uk

More Related Content

What's hot

Android and REST
Android and RESTAndroid and REST
Android and REST
Roman Woźniak
 
Clug 2011 March web server optimisation
Clug 2011 March  web server optimisationClug 2011 March  web server optimisation
Clug 2011 March web server optimisation
grooverdan
 
Easy REST with OpenAPI
Easy REST with OpenAPIEasy REST with OpenAPI
Easy REST with OpenAPI
ZWEIDENKER GmbH
 
Nginx: Accelerate Rails, HTTP Tricks
Nginx: Accelerate Rails, HTTP TricksNginx: Accelerate Rails, HTTP Tricks
Nginx: Accelerate Rails, HTTP Tricks
Adam Wiggins
 
Care and feeding notes
Care and feeding notesCare and feeding notes
Care and feeding notes
Perrin Harkins
 
[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기
NAVER D2
 
Side by Side with Elasticsearch and Solr
Side by Side with Elasticsearch and SolrSide by Side with Elasticsearch and Solr
Side by Side with Elasticsearch and Solr
Sematext Group, Inc.
 
HBaseCon 2013: OpenTSDB at Box
HBaseCon 2013: OpenTSDB at BoxHBaseCon 2013: OpenTSDB at Box
HBaseCon 2013: OpenTSDB at Box
Cloudera, Inc.
 
WordPress Need For Speed
WordPress Need For SpeedWordPress Need For Speed
WordPress Need For Speed
pdeschen
 
«Scrapy internals» Александр Сибиряков, Scrapinghub
«Scrapy internals» Александр Сибиряков, Scrapinghub«Scrapy internals» Александр Сибиряков, Scrapinghub
«Scrapy internals» Александр Сибиряков, Scrapinghub
it-people
 
5 things MySql
5 things MySql5 things MySql
5 things MySql
sarahnovotny
 
Lightweight DAS components in Perl
Lightweight DAS components in PerlLightweight DAS components in Perl
Lightweight DAS components in Perl
guestbab097
 
Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)
Brian Brazil
 
Solving anything in VCL
Solving anything in VCLSolving anything in VCL
Solving anything in VCL
Fastly
 
Clug 2012 March web server optimisation
Clug 2012 March   web server optimisationClug 2012 March   web server optimisation
Clug 2012 March web server optimisation
grooverdan
 
Getting a Grip on CDN Performance - Why and How
Getting a Grip on CDN Performance - Why and HowGetting a Grip on CDN Performance - Why and How
Getting a Grip on CDN Performance - Why and How
Aaron Peters
 
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQRealtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Rick Copeland
 
Mongo db v3_deep_dive
Mongo db v3_deep_diveMongo db v3_deep_dive
Mongo db v3_deep_dive
Bryan Reinero
 
mongoDB Performance
mongoDB PerformancemongoDB Performance
mongoDB PerformanceMoshe Kaplan
 
Reverse proxy & web cache with NGINX, HAProxy and Varnish
Reverse proxy & web cache with NGINX, HAProxy and VarnishReverse proxy & web cache with NGINX, HAProxy and Varnish
Reverse proxy & web cache with NGINX, HAProxy and Varnish
El Mahdi Benzekri
 

What's hot (20)

Android and REST
Android and RESTAndroid and REST
Android and REST
 
Clug 2011 March web server optimisation
Clug 2011 March  web server optimisationClug 2011 March  web server optimisation
Clug 2011 March web server optimisation
 
Easy REST with OpenAPI
Easy REST with OpenAPIEasy REST with OpenAPI
Easy REST with OpenAPI
 
Nginx: Accelerate Rails, HTTP Tricks
Nginx: Accelerate Rails, HTTP TricksNginx: Accelerate Rails, HTTP Tricks
Nginx: Accelerate Rails, HTTP Tricks
 
Care and feeding notes
Care and feeding notesCare and feeding notes
Care and feeding notes
 
[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기[245] presto 내부구조 파헤치기
[245] presto 내부구조 파헤치기
 
Side by Side with Elasticsearch and Solr
Side by Side with Elasticsearch and SolrSide by Side with Elasticsearch and Solr
Side by Side with Elasticsearch and Solr
 
HBaseCon 2013: OpenTSDB at Box
HBaseCon 2013: OpenTSDB at BoxHBaseCon 2013: OpenTSDB at Box
HBaseCon 2013: OpenTSDB at Box
 
WordPress Need For Speed
WordPress Need For SpeedWordPress Need For Speed
WordPress Need For Speed
 
«Scrapy internals» Александр Сибиряков, Scrapinghub
«Scrapy internals» Александр Сибиряков, Scrapinghub«Scrapy internals» Александр Сибиряков, Scrapinghub
«Scrapy internals» Александр Сибиряков, Scrapinghub
 
5 things MySql
5 things MySql5 things MySql
5 things MySql
 
Lightweight DAS components in Perl
Lightweight DAS components in PerlLightweight DAS components in Perl
Lightweight DAS components in Perl
 
Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)
 
Solving anything in VCL
Solving anything in VCLSolving anything in VCL
Solving anything in VCL
 
Clug 2012 March web server optimisation
Clug 2012 March   web server optimisationClug 2012 March   web server optimisation
Clug 2012 March web server optimisation
 
Getting a Grip on CDN Performance - Why and How
Getting a Grip on CDN Performance - Why and HowGetting a Grip on CDN Performance - Why and How
Getting a Grip on CDN Performance - Why and How
 
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQRealtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
 
Mongo db v3_deep_dive
Mongo db v3_deep_diveMongo db v3_deep_dive
Mongo db v3_deep_dive
 
mongoDB Performance
mongoDB PerformancemongoDB Performance
mongoDB Performance
 
Reverse proxy & web cache with NGINX, HAProxy and Varnish
Reverse proxy & web cache with NGINX, HAProxy and VarnishReverse proxy & web cache with NGINX, HAProxy and Varnish
Reverse proxy & web cache with NGINX, HAProxy and Varnish
 

Similar to Composing re-useable ETL on Hadoop

DSLing your System For Scalability Testing Using Gatling - Dublin Scala User ...
DSLing your System For Scalability Testing Using Gatling - Dublin Scala User ...DSLing your System For Scalability Testing Using Gatling - Dublin Scala User ...
DSLing your System For Scalability Testing Using Gatling - Dublin Scala User ...
Aman Kohli
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomy
Dongmin Yu
 
20190516 web security-basic
20190516 web security-basic20190516 web security-basic
20190516 web security-basic
MksYi
 
Devoxx Maroc 2015 HTTP 1, HTTP 2 and folks
Devoxx Maroc  2015 HTTP 1, HTTP 2 and folksDevoxx Maroc  2015 HTTP 1, HTTP 2 and folks
Devoxx Maroc 2015 HTTP 1, HTTP 2 and folks
Nicolas Martignole
 
Being HAPI! Reverse Proxying on Purpose
Being HAPI! Reverse Proxying on PurposeBeing HAPI! Reverse Proxying on Purpose
Being HAPI! Reverse Proxying on Purpose
Aman Kohli
 
From zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and ElasticsearchFrom zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and Elasticsearch
Rafał Kuć
 
From Zero to Hero - Centralized Logging with Logstash & Elasticsearch
From Zero to Hero - Centralized Logging with Logstash & ElasticsearchFrom Zero to Hero - Centralized Logging with Logstash & Elasticsearch
From Zero to Hero - Centralized Logging with Logstash & Elasticsearch
Sematext Group, Inc.
 
Life on the Edge with ESI
Life on the Edge with ESILife on the Edge with ESI
Life on the Edge with ESI
Kit Chan
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
Corley S.r.l.
 
IBM dwLive, "Internet & HTTP - 잃어버린 패킷을 찾아서..."
IBM dwLive, "Internet & HTTP - 잃어버린 패킷을 찾아서..."IBM dwLive, "Internet & HTTP - 잃어버린 패킷을 찾아서..."
IBM dwLive, "Internet & HTTP - 잃어버린 패킷을 찾아서..."
Dongwook Lee
 
High-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinHigh-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig Latin
Pietro Michiardi
 
Using and scaling Rack and Rack-based middleware
Using and scaling Rack and Rack-based middlewareUsing and scaling Rack and Rack-based middleware
Using and scaling Rack and Rack-based middleware
Alona Mekhovova
 
Web Performance, Scalability, and Testing Techniques - Boston PHP Meetup
Web Performance, Scalability, and Testing Techniques - Boston PHP MeetupWeb Performance, Scalability, and Testing Techniques - Boston PHP Meetup
Web Performance, Scalability, and Testing Techniques - Boston PHP Meetup
Jonathan Klein
 
Cape Cod Web Technology Meetup - 2
Cape Cod Web Technology Meetup - 2Cape Cod Web Technology Meetup - 2
Cape Cod Web Technology Meetup - 2
Asher Martin
 
Web program-peformance-optimization
Web program-peformance-optimizationWeb program-peformance-optimization
Web program-peformance-optimization
xiaojueqq12345
 
The Big Picture and How to Get Started
The Big Picture and How to Get StartedThe Big Picture and How to Get Started
The Big Picture and How to Get Started
guest1af57e
 
Whatever it takes - Fixing SQLIA and XSS in the process
Whatever it takes - Fixing SQLIA and XSS in the processWhatever it takes - Fixing SQLIA and XSS in the process
Whatever it takes - Fixing SQLIA and XSS in the process
guest3379bd
 
12 core technologies you should learn, love, and hate to be a 'real' technocrat
12 core technologies you should learn, love, and hate to be a 'real' technocrat12 core technologies you should learn, love, and hate to be a 'real' technocrat
12 core technologies you should learn, love, and hate to be a 'real' technocrat
Jonathan Linowes
 
OSCON 2011 Learning CouchDB
OSCON 2011 Learning CouchDBOSCON 2011 Learning CouchDB
OSCON 2011 Learning CouchDB
Bradley Holt
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 

Similar to Composing re-useable ETL on Hadoop (20)

DSLing your System For Scalability Testing Using Gatling - Dublin Scala User ...
DSLing your System For Scalability Testing Using Gatling - Dublin Scala User ...DSLing your System For Scalability Testing Using Gatling - Dublin Scala User ...
DSLing your System For Scalability Testing Using Gatling - Dublin Scala User ...
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomy
 
20190516 web security-basic
20190516 web security-basic20190516 web security-basic
20190516 web security-basic
 
Devoxx Maroc 2015 HTTP 1, HTTP 2 and folks
Devoxx Maroc  2015 HTTP 1, HTTP 2 and folksDevoxx Maroc  2015 HTTP 1, HTTP 2 and folks
Devoxx Maroc 2015 HTTP 1, HTTP 2 and folks
 
Being HAPI! Reverse Proxying on Purpose
Being HAPI! Reverse Proxying on PurposeBeing HAPI! Reverse Proxying on Purpose
Being HAPI! Reverse Proxying on Purpose
 
From zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and ElasticsearchFrom zero to hero - Easy log centralization with Logstash and Elasticsearch
From zero to hero - Easy log centralization with Logstash and Elasticsearch
 
From Zero to Hero - Centralized Logging with Logstash & Elasticsearch
From Zero to Hero - Centralized Logging with Logstash & ElasticsearchFrom Zero to Hero - Centralized Logging with Logstash & Elasticsearch
From Zero to Hero - Centralized Logging with Logstash & Elasticsearch
 
Life on the Edge with ESI
Life on the Edge with ESILife on the Edge with ESI
Life on the Edge with ESI
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
IBM dwLive, "Internet & HTTP - 잃어버린 패킷을 찾아서..."
IBM dwLive, "Internet & HTTP - 잃어버린 패킷을 찾아서..."IBM dwLive, "Internet & HTTP - 잃어버린 패킷을 찾아서..."
IBM dwLive, "Internet & HTTP - 잃어버린 패킷을 찾아서..."
 
High-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinHigh-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig Latin
 
Using and scaling Rack and Rack-based middleware
Using and scaling Rack and Rack-based middlewareUsing and scaling Rack and Rack-based middleware
Using and scaling Rack and Rack-based middleware
 
Web Performance, Scalability, and Testing Techniques - Boston PHP Meetup
Web Performance, Scalability, and Testing Techniques - Boston PHP MeetupWeb Performance, Scalability, and Testing Techniques - Boston PHP Meetup
Web Performance, Scalability, and Testing Techniques - Boston PHP Meetup
 
Cape Cod Web Technology Meetup - 2
Cape Cod Web Technology Meetup - 2Cape Cod Web Technology Meetup - 2
Cape Cod Web Technology Meetup - 2
 
Web program-peformance-optimization
Web program-peformance-optimizationWeb program-peformance-optimization
Web program-peformance-optimization
 
The Big Picture and How to Get Started
The Big Picture and How to Get StartedThe Big Picture and How to Get Started
The Big Picture and How to Get Started
 
Whatever it takes - Fixing SQLIA and XSS in the process
Whatever it takes - Fixing SQLIA and XSS in the processWhatever it takes - Fixing SQLIA and XSS in the process
Whatever it takes - Fixing SQLIA and XSS in the process
 
12 core technologies you should learn, love, and hate to be a 'real' technocrat
12 core technologies you should learn, love, and hate to be a 'real' technocrat12 core technologies you should learn, love, and hate to be a 'real' technocrat
12 core technologies you should learn, love, and hate to be a 'real' technocrat
 
OSCON 2011 Learning CouchDB
OSCON 2011 Learning CouchDBOSCON 2011 Learning CouchDB
OSCON 2011 Learning CouchDB
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 

More from Paul Lam

Mozambique, Smallholder Farming, and Technology
Mozambique, Smallholder Farming, and TechnologyMozambique, Smallholder Farming, and Technology
Mozambique, Smallholder Farming, and Technology
Paul Lam
 
When a machine learning researcher and a software engineer walk into a bar
When a machine learning researcher and a software engineer walk into a barWhen a machine learning researcher and a software engineer walk into a bar
When a machine learning researcher and a software engineer walk into a bar
Paul Lam
 
Evolution of Our Software Architecture
Evolution of Our Software ArchitectureEvolution of Our Software Architecture
Evolution of Our Software Architecture
Paul Lam
 
A gentle introduction to functional programming through music and clojure
A gentle introduction to functional programming through music and clojureA gentle introduction to functional programming through music and clojure
A gentle introduction to functional programming through music and clojure
Paul Lam
 
Yet another startup built on Clojure(Script)
Yet another startup built on Clojure(Script)Yet another startup built on Clojure(Script)
Yet another startup built on Clojure(Script)
Paul Lam
 
Clojure in US vs Europe
Clojure in US vs EuropeClojure in US vs Europe
Clojure in US vs Europe
Paul Lam
 
2014 docker boston fig for developing microservices
2014 docker boston   fig for developing microservices2014 docker boston   fig for developing microservices
2014 docker boston fig for developing microservices
Paul Lam
 
Customer Behaviour Analytics: Billions of Events to one Customer-Product Prop...
Customer Behaviour Analytics: Billions of Events to one Customer-Product Prop...Customer Behaviour Analytics: Billions of Events to one Customer-Product Prop...
Customer Behaviour Analytics: Billions of Events to one Customer-Product Prop...
Paul Lam
 
An agile approach to knowledge discovery on web log data
An agile approach to knowledge discovery on web log dataAn agile approach to knowledge discovery on web log data
An agile approach to knowledge discovery on web log data
Paul Lam
 

More from Paul Lam (9)

Mozambique, Smallholder Farming, and Technology
Mozambique, Smallholder Farming, and TechnologyMozambique, Smallholder Farming, and Technology
Mozambique, Smallholder Farming, and Technology
 
When a machine learning researcher and a software engineer walk into a bar
When a machine learning researcher and a software engineer walk into a barWhen a machine learning researcher and a software engineer walk into a bar
When a machine learning researcher and a software engineer walk into a bar
 
Evolution of Our Software Architecture
Evolution of Our Software ArchitectureEvolution of Our Software Architecture
Evolution of Our Software Architecture
 
A gentle introduction to functional programming through music and clojure
A gentle introduction to functional programming through music and clojureA gentle introduction to functional programming through music and clojure
A gentle introduction to functional programming through music and clojure
 
Yet another startup built on Clojure(Script)
Yet another startup built on Clojure(Script)Yet another startup built on Clojure(Script)
Yet another startup built on Clojure(Script)
 
Clojure in US vs Europe
Clojure in US vs EuropeClojure in US vs Europe
Clojure in US vs Europe
 
2014 docker boston fig for developing microservices
2014 docker boston   fig for developing microservices2014 docker boston   fig for developing microservices
2014 docker boston fig for developing microservices
 
Customer Behaviour Analytics: Billions of Events to one Customer-Product Prop...
Customer Behaviour Analytics: Billions of Events to one Customer-Product Prop...Customer Behaviour Analytics: Billions of Events to one Customer-Product Prop...
Customer Behaviour Analytics: Billions of Events to one Customer-Product Prop...
 
An agile approach to knowledge discovery on web log data
An agile approach to knowledge discovery on web log dataAn agile approach to knowledge discovery on web log data
An agile approach to knowledge discovery on web log data
 

Recently uploaded

Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 

Recently uploaded (20)

Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 

Composing re-useable ETL on Hadoop

  • 1. Composing re-useable ETL on Hadoop Paul Lam (@Quantisan) Big Data London, 1 October 2012
  • 2. Data science lifecycle Acquire Action Analyse
  • 3. Data science lifecycle Acquire 80% of work Action Analyse 80% of result
  • 4. From these ✤ {:status 200, :scheme http, :pipe ., :request-uri /broadband/? gclid=CPnYgdqj0bECFa4mtAodVEsAYA, :http-x-forwarded-for 92.9.200.50, :msec 1344196910.137, :sent-http-set-cookie -, :body-bytes-sent 18836, :query-string gclid=CPnYgdj0bECa4mtAdVEsAYA, :request-content-type -, :cookie-urefs -, :request GET /broadband/?gclid=CPnYgdj0bECa4mtAdVEsAYA HTTP/1.1, :upstream- response-time 0.164, :sent-http-content-type text/html, :hostname nginx- lb-20120229-1942-24.uswitchinternal.com, :sent-http-location -, :time-local 05/Aug/ 2012:20:01:50 +0000, :http-referer http://www.google.co.uk/aclk? sa=l&ai=D1556&rct=j&q=best%20value%20internet%20uk, :http-user-agent Mozilla/ 5.0 (Windows NT 6.0) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.60 Safari/537.1, :request-time 0.164, :request-body -, :http-host www.uswitch.com, :upstream-addr 178.32.60.100:80, :sent-http-server -, :upstream- status 200, :uscc <ANON>}
  • 5. To these ✤ Removing bots and crawlers ✤ Picking out relevant events ✤ Grouping events by users ✤ Sequencing the events ✤ Structuring as a matrix ✤ Graphing
  • 6. Using this ✤ an abstraction framework for building MapReduce jobs ✤ linearly scalable data processing ✤ www.cascading.org
  • 7. Word Count - MapReduce ✤ public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> ✤ public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException ✤ “take this line and split it to word tokens” ✤ public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> ✤ public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException ✤ “take each word token and increment a counter”
  • 8. Word Count - Cascalog
  • 9. Benefits of Domain Specific Language ✤ At same level of abstraction of the problem ✤ split words, then do a count on them ✤ Fewer custom code, less prone to implementation bugs ✤ More readable ✤ More productive
  • 10. TF-IDF ✤ Extended from word count example ✤ Single-purpose methods ✤ Composition of functions ✤ github.com/Quantisan/Impatient ✤ github.com/Cascading/Impatient
  • 11. Our data processing methodology ✤ Apply single-purpose functions to immutable data ✤ Only build what we need as we go ✤ Composability, extensibility, maintainability ✤ Use the right tool for the right task
  • 12. Contact ✤ Paul Lam, data scientist at uSwitch ✤ @Quantisan ✤ paul.lam@forward.co.uk

Editor's Notes

  1. \n
  2. There are 3 stages to our data process.\n
  3. \n
  4. semi-structured\n\npostcode field example\n
  5. \n
  6. \n
  7. \n
  8. \n
  9. \n
  10. \n
  11. \n
  12. \n