SyncNorwich – Big Data on AWS

April 2013

Robin Meehan
CTO
http://flickr.com/photos/brunogirin/68341710/
What the hell is it?
http://commons.wikimedia.org/wiki/File:Loud_environment_headphones.jpg
http://commons.wikimedia.org/wiki/File:Ferrari_156_85_in_2011.j
http://commons.wikimedia.org/wiki/File:Hundreds_and_thousands.jpg
http://www.flickr.com/photos/krishaamer/2836262962/
Storm
   Spark
Dremel/Drill
   Impala
AWS Redshift
   etc etc




     http://flickr.com/photos/42033648@N00/
Big data exploitation – in practice




Introduction…

The example Aviva Use Case…

• Aviva have a number of brands/channels to market including
  insurance aggregators (e.g. Compare the Market,
  GoCompare…)
• The raw aggregator quote data is of a scale to present a ‘Big
  Data’ problem – there is great potential for gaining additional
  insights from this data

So…
• Define some candidate business questions
• Test them against significant volumes of data
• Measure cluster size/£/time performance




Driving AWS EMR…

AWS Elastic MapReduce… configuring a Hadoop cluster…
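The slide showed the AWS web console, but the same cluster configuration can be expressed in code. The following is a minimal sketch, not the setup used in the talk: it builds the keyword arguments that boto3's EMR client accepts in `run_job_flow`. The helper name `emr_request`, the release label, the instance type and the default role names are all assumptions chosen for illustration.

```python
def emr_request(name, worker_count, instance_type, use_spot=False):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    Hypothetical helper: every value below is an illustrative assumption,
    not the configuration used in the talk.
    """
    worker_market = "SPOT" if use_spot else "ON_DEMAND"
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release; pick one available to you
        "Applications": [{"Name": "Hadoop"}, {"Name": "Pig"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "master",
                    "InstanceRole": "MASTER",
                    "Market": "ON_DEMAND",
                    "InstanceType": instance_type,
                    "InstanceCount": 1,
                },
                {
                    "Name": "workers",
                    "InstanceRole": "CORE",
                    "Market": worker_market,
                    "InstanceType": instance_type,
                    "InstanceCount": worker_count,
                },
            ],
            # Shut the cluster down once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


request = emr_request("quote-analysis", worker_count=10, instance_type="m5.xlarge")
print(request["Instances"]["InstanceGroups"][1]["InstanceCount"])
```

With AWS credentials configured, the dictionary could be passed as `boto3.client("emr").run_job_flow(**request)`. Keeping the request as plain data makes it easy to vary cluster size between runs.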




Some pig…

Query B: ~10 million quotes (5m each channel). Joining quote data across different channels.


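  -- Register the custom UDF jar and the dom4j XML library it depends on.
  register 's3n://ashaw-1/jars/myudfs.jar';
  register 's3n://ashaw-1/jars/dom4j-1.6.1.jar';
  -- Load the quote data for each channel.
  A = load 's3n://ashaw-1/Intermediate/duplicated/lots' using PigStorage();
  Arac = load 's3n://ashaw-1/Intermediate/duplicated/lotsrac' using PigStorage();
  -- Cap each channel at 5 million quotes.
  A1 = limit A 5000000;
  Arac1 = limit Arac 5000000;
  -- Flatten field $5 of each record with the channel-specific UDF.
  B = foreach A1 generate myudfs.Flatten((chararray)$5);
  Brac = foreach Arac1 generate myudfs.Flatten2((chararray)$5);
  -- Join the two channels on field $21 of the flattened tuple.
  C = join B by (chararray)($0.$21), Brac by (chararray)($0.$21);
  -- Keep joined rows where the first field of either side equals 1.
  D = filter C by $1.$0 == 1 OR $0.$0 == 1;
  STORE D INTO 's3n://ashaw-1/myoutputfolder/';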
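To make the join-and-filter logic of the Pig script easier to follow, here is a small in-memory sketch in Python. It is an illustration only: the tuple layout, the key position `KEY`, the function names and the toy records are all assumptions, and the real script works on flattened XML quotes with the join key in field $21.

```python
from collections import defaultdict

KEY = 2  # hypothetical join-key position (the Pig script joins on field $21)


def join_quotes(channel_a, channel_b, key=KEY):
    """Inner-join two lists of flattened quote tuples on one field."""
    by_key = defaultdict(list)
    for quote in channel_b:
        by_key[quote[key]].append(quote)
    return [(a, b) for a in channel_a for b in by_key.get(a[key], [])]


def keep_flagged(pairs):
    """Keep joined pairs where the first field of either side equals 1."""
    return [(a, b) for a, b in pairs if a[0] == 1 or b[0] == 1]


# Toy records, layout assumed for illustration: (flag, quote_id, customer_key)
direct = [(1, "d1", "cust-a"), (0, "d2", "cust-b"), (0, "d3", "cust-c")]
aggregator = [(0, "g1", "cust-a"), (0, "g2", "cust-b"), (1, "g3", "cust-z")]

matched = keep_flagged(join_quotes(direct, aggregator))
print(matched)
```

On a cluster, Pig compiles the same two steps into MapReduce jobs, which is what lets the query scale to millions of quotes per channel.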
Costs per run…
XML Flattening results:

• 10 million quotes:

Cluster size:        Time to execute:   Approx. cost:
10 x Small nodes     64 minutes         11 compute hours - $1.155 per hour (approx. £0.72)
19 x Small nodes     31 minutes         20 compute hours - $2.10 per hour (approx. £1.30)
 8 x Large nodes     19 minutes         8 compute hours - $3.78 per hour (approx. £2.34)
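The figures in the table can be cross-checked with a little arithmetic. The sketch below rests on two assumptions that the slide does not state: that the dollar figure is the hourly rate for the whole cluster (so dividing by the compute hours gives a per-node rate), and that each started hour is billed in full. The function names are illustrative.

```python
import math

# (cluster, minutes to execute, compute hours, USD per hour) as listed in the table
runs = [
    ("10 x Small", 64, 11, 1.155),
    ("19 x Small", 31, 20, 2.10),
    ("8 x Large", 19, 8, 3.78),
]


def per_node_rate(compute_hours, usd_per_hour):
    """Implied hourly rate per node, assuming the USD figure covers the whole cluster."""
    return usd_per_hour / compute_hours


def run_cost(minutes, usd_per_hour):
    """Estimated cost of one run, assuming every started hour is billed in full."""
    return math.ceil(minutes / 60) * usd_per_hour


for name, minutes, compute_hours, rate in runs:
    print(
        f"{name}: ${per_node_rate(compute_hours, rate):.4f} per node-hour, "
        f"about ${run_cost(minutes, rate):.2f} per run"
    )
```

Under those assumptions the 64-minute run spills into a second billed hour, so a cluster that finishes just inside the hour can cost less per run despite a higher hourly rate.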




But we could have used spot instances…
Visualisation
Wrapping up…
• It will be a similar adoption pattern to cloud:
  − Those organisations that make it work and gain
    additional business insights will
       •   market more accurately
       •   sell more
       •   have less customer churn
       •   have better paying customers

• Market forces will eventually force adoption or
  failure of their competitors – all other things being
  equal. It’s Darwinian evolutionary forces at work
  in the marketplace.

• Interestingly, the costs to exploit big data (well – at
  least to find out if there is some value that you are
  missing out on) are now very low due to vendors
  such as AWS, so it’s a market advantage that is
  relatively cheap to attain
  − I.e. we’re talking about a few enabled savvy staff
    and some “pay as you go” compute resources

Smart421 SyncNorwich Big Data on AWS by Robin Meehan


Editor's Notes

  1. Oh dear – I’m between beer, you, and more beer!
  2. It’s on the TV now. So your bosses will be asking you about it.
  3. Some degree of cynicism forming!
  4. What is it? Well, let’s cover 3 or 4 Vs first…
  5. Volume
  6. Velocity - http://commons.wikimedia.org/wiki/File:Ferrari_156_85_in_2011.jpg
  7. Variety - http://commons.wikimedia.org/wiki/File:Hundreds_and_thousands.jpg
  8. What is it technically? Lots of things… Hadoop is probably the “daddy”, along with Pig, Hive etc.
  9. Newer kids on the block – Storm, Spark, Dremel, Impala etc.
     • Impala – Cloudera.
     • Apache Drill (incubating) is a distributed system for interactive analysis of large-scale datasets, based on Google's Dremel. Its goal is to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.
     • Dremel is a distributed system developed at Google for interactively querying large datasets. It is the inspiration for Apache Drill, and it powers Google's BigQuery service.
     • Spark is an open source cluster computing system that aims to make data analytics fast – both fast to run and fast to write. To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly much more quickly than with disk-based systems like Hadoop MapReduce. To make programming faster, Spark provides clean, concise APIs in Scala, Java and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.
  10. OK, so that’s great, I’m sold – how do I get started?
      Create the killer combo of skills:
      • Customers have the deep understanding of the structure of their data (even if they don’t have the deep insights into it)
      • Marry that up with the technical skills to load the data, transform it, process and analyse it, and then provide visualisations of it (e.g. load the results into your enterprise Business Intelligence tool)
      Typically consists of two phases:
      • Discovery phase – ad hoc data processing of multiple small data sets initially, then big data, searching for insight into the data – this is the “scientific method” in action
      • Production phase – once some valuable insight is found, automate the extraction of that insight, e.g. to feed “propensity to churn” for each customer into your CRM system every night
  11. Use Case – explanation around the selection of the use case based around the original brain-storming around raw aggregator test data and the issue of not being able to get business insights out of this data due to the volumes – e.g. cross-channel cannibalisation – who comes to QMH then other Aviva brands and where do they subsequently purchase
  12. AWS Elastic MapReduce. Cover:
      • Amazon self-serve web console – EMR job flow
      • Amazon pricing – standard EC2 + EMR service – explain that you pay slightly more for EMR on top of EC2, as AWS provides it as a managed service: they have installed all the elements and dependencies required for Hadoop (i.e. Java, Hadoop, Pig etc.)
      • Could roll your own – no reason why you couldn’t roll your own on top of EC2
  13. It’s nothing without visualisation… Pentaho – running on AWS of course
  14. Punchline - Picture is of Charles Darwin – it’s going to be “survival of the fittest”, or perhaps it would be more accurate to say “survival of the best informed”