SyncNorwich – Big Data on AWS

April 2013

Robin Meehan
CTO
http://flickr.com/photos/brunogirin/68341710/
What the hell is it?
http://commons.wikimedia.org/wiki/File:Loud_environment_headphones.jpg
http://commons.wikimedia.org/wiki/File:Ferrari_156_85_in_2011.j
http://commons.wikimedia.org/wiki/File:Hundreds_and_thousands.jpg
http://www.flickr.com/photos/krishaamer/2836262962/
Storm
Spark
Dremel/Drill
Impala
AWS Redshift
etc.

http://flickr.com/photos/42033648@N00/
Big data exploitation – in practice




Introduction…

The example Aviva Use Case…

• Aviva have a number of brands/channels to market, including
  insurance aggregators (e.g. Compare the Market,
  GoCompare…)
• The raw aggregator quote data is of a scale to present a “Big
  Data” problem – there is great potential for gaining additional
  insights from this data

So…
• Define some candidate business questions
• Test them against significant volumes of data
• Measure cluster size/£/time performance




Driving AWS EMR…

AWS Elastic MapReduce… configuring a Hadoop cluster…




Some pig…

Query B: ~10 million quotes (5m each channel). Joining quote data across different channels.


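  -- Register the custom UDF jar and the dom4j library
  register 's3n://ashaw-1/jars/myudfs.jar';
  register 's3n://ashaw-1/jars/dom4j-1.6.1.jar';
  -- Load the quote data for each of the two channels
  A = load 's3n://ashaw-1/Intermediate/duplicated/lots' using PigStorage();
  Arac = load 's3n://ashaw-1/Intermediate/duplicated/lotsrac' using PigStorage();
  -- Cap each channel at 5 million quotes
  A1 = limit A 5000000;
  Arac1 = limit Arac 5000000;
  -- Flatten field 5 of each record with the custom UDFs
  B = foreach A1 generate myudfs.Flatten((chararray)$5);
  Brac = foreach Arac1 generate myudfs.Flatten2((chararray)$5);
  -- Join the two channels on field 21 of the flattened tuple
  C = join B by (chararray)($0.$21), Brac by (chararray)($0.$21);
  -- Keep joined rows where the first flattened field on either side equals 1
  D = filter C by $1.$0 == 1 OR $0.$0 == 1;
  -- Write the result to S3
  STORE D INTO 's3n://ashaw-1/myoutputfolder/';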
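
As a rough illustration of what the join and filter steps in the script compute, here is a minimal Python sketch on a handful of in-memory records. The field names `quote_key` and `flag`, and the sample records, are hypothetical stand-ins for the positional fields `$21` and `$0`; the real quote schema is not shown on the slide.

```python
# Hypothetical sample records; the real flattened quote schema is not shown.
CHANNEL_A = [
    {"quote_key": "a", "flag": 1},
    {"quote_key": "b", "flag": 0},
    {"quote_key": "c", "flag": 0},
]
CHANNEL_B = [
    {"quote_key": "a", "flag": 0},
    {"quote_key": "b", "flag": 0},
    {"quote_key": "d", "flag": 1},
]


def join_and_filter(left, right, key, flag):
    """Inner-join two lists of records on `key`, then keep only the
    joined pairs where the flag on either side equals 1."""
    # Index the right-hand records by join key for constant-time lookup.
    index = {}
    for rec in right:
        index.setdefault(rec[key], []).append(rec)

    return [
        (l_rec, r_rec)
        for l_rec in left
        for r_rec in index.get(l_rec[key], [])
        if l_rec[flag] == 1 or r_rec[flag] == 1
    ]


if __name__ == "__main__":
    for pair in join_and_filter(CHANNEL_A, CHANNEL_B, "quote_key", "flag"):
        print(pair)
```

Only key "a" survives here: "b" joins but neither flag is set, while "c" and "d" have no partner in the other channel.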

Costs per run…

XML flattening results:

• 10 million quotes:

Cluster size        Time to execute   Approx. cost
10 x Small nodes    64 minutes        11 compute hours - $1.155 per hour (approx. £0.72)
19 x Small nodes    31 minutes        20 compute hours - $2.10 per hour (approx. £1.30)
 8 x Large nodes    19 minutes         8 compute hours - $3.78 per hour (approx. £2.34)
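
One way to read the cost column, as a sketch only: dividing each quoted hourly rate by its compute-hour count gives an implied rate per compute hour. Treating "compute hours" as the number of billed node-hours per hour of cluster time is an assumption; the figures themselves are copied from the table.

```python
from decimal import Decimal

# Rows copied from the table: (cluster, minutes, compute hours, USD per hour).
RUNS = [
    ("10 x Small", 64, 11, Decimal("1.155")),
    ("19 x Small", 31, 20, Decimal("2.10")),
    ("8 x Large", 19, 8, Decimal("3.78")),
]


def implied_rate_per_compute_hour(compute_hours, usd_per_hour):
    """USD per compute hour implied by one table row (assumed reading)."""
    return usd_per_hour / compute_hours


def implied_rates():
    """Map each cluster label to its implied rate per compute hour."""
    return {
        name: implied_rate_per_compute_hour(hours, usd)
        for name, _minutes, hours, usd in RUNS
    }


if __name__ == "__main__":
    for name, rate in implied_rates().items():
        print(f"{name}: ${rate} per compute hour")
```

Under that reading, both Small rows work out to the same implied rate of $0.105 per compute hour.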




But we could have used spot instances…

Visualisation

Wrapping up…
• It will be a similar adoption pattern to cloud:
  − Those organisations that make it work and gain
    additional business insights will
       •   market more accurately
       •   sell more
       •   have less customer churn
       •   have better-paying customers

• Market forces will eventually force adoption or
  failure of their competitors – all other things being
  equal. It’s Darwinian evolutionary forces at work
  in the marketplace.

• Interestingly, the costs to exploit big data (well – at
  least to find out if there is some value that you are
  missing out on) are now very low due to vendors
  such as AWS, so it’s a market advantage that is
  relatively cheap to attain
  − I.e. we’re talking about a few enabled savvy staff
    and some “pay as you go” compute resources

Smart421 SyncNorwich Big Data on AWS by Robin Meehan


Editor's Notes

  1. Oh dear – I’m between beer, you, and more beer!
  2. It’s on the TV now. So your bosses will be asking you about it
  3. Some degree of cynicism forming!
  4. What is it? Well, let’s cover 3 or 4 Vs first…
  5. Volume
  6. Velocity - http://commons.wikimedia.org/wiki/File:Ferrari_156_85_in_2011.jpg
  7. Variety - http://commons.wikimedia.org/wiki/File:Hundreds_and_thousands.jpg
  8. What is it technically? Lots of things… Hadoop is probably the “daddy”, along with Pig, Hive etc.
  9. Newer kids on the block – Storm, Spark, Dremel, Impala etc.
     • Impala – Cloudera
     • Apache Drill (incubating) is a distributed system for interactive analysis of large-scale datasets, based on Google's Dremel. Its goal is to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.
     • Dremel is a distributed system developed at Google for interactively querying large datasets. It is the inspiration for Apache Drill, and it powers Google's BigQuery service.
     • Spark is an open source cluster computing system that aims to make data analytics fast, both fast to run and fast to write. To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly much more quickly than with disk-based systems like Hadoop MapReduce. To make programming faster, Spark provides clean, concise APIs in Scala, Java and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.
  10. OK, so that’s great, I’m sold – how do I get started? Create the killer combo of skills:
      • Customers have the deep understanding of the structure of their data (even if they don’t have the deep insights into it)
      • Marry that up with the technical skills to load the data, transform it, process and analyse it, and then provide visualisations of it (e.g. load the results into your enterprise Business Intelligence tool)
      Typically consists of two phases:
      • Discovery phase – ad hoc data processing of multiple small data sets initially, then big data, searching for insight into the data – this is the “scientific method” in action
      • Production phase – once some valuable insight is found, automate the extraction of that insight, e.g. to feed “propensity to churn” for each customer into your CRM system every night
  11. Use case – explain how the use case was selected: it came out of the original brainstorming around raw aggregator test data and the issue of not being able to get business insights out of this data due to the volumes – e.g. cross-channel cannibalisation – who comes to QMH and then other Aviva brands, and where do they subsequently purchase?
  12. AWS Elastic MapReduce. Cover:
      • Amazon self-serve web console – EMR job flow
      • Amazon pricing – standard EC2 + EMR service – explain that you pay slightly more for EMR on top of EC2 because AWS provide it as a managed service: they have installed all the elements and dependencies required for Hadoop (i.e. Java, Hadoop, Pig etc.)
      • Could roll your own – no reason why you couldn’t roll your own on top of EC2
  13. It’s nothing without visualisation… Pentaho – running on AWS of course
  14. Punchline - Picture is of Charles Darwin – it’s going to be “survival of the fittest”, or perhaps it would be more accurate to say “survival of the best informed”