Smart421 SyncNorwich Big Data on AWS by Robin Meehan

SyncNorwich – Big Data on AWS

April 2013

Robin Meehan
CTO

http://flickr.com/photos/brunogirin/68341710/

http://commons.wikimedia.org/wiki/File:Loud_environment_headphones.jpg

http://commons.wikimedia.org/wiki/File:Ferrari_156_85_in_2011.j

http://commons.wikimedia.org/wiki/File:Hundreds_and_thousands.jpg

http://www.flickr.com/photos/krishaamer/2836262962/

Storm
Spark
Dremel/Drill
Impala
AWS Redshift
etc.

http://flickr.com/photos/42033648@N00/

Big data exploitation – in practice

Introduction…

The example Aviva Use Case…

• Aviva have a number of brands/channels to market, including insurance aggregators (e.g. Compare the Market, GoCompare…)
• The raw aggregator quote data is of a scale to present a 'Big Data' problem – there is great potential for gaining additional insights from this data

So…
• Define some candidate business questions
• Test them against significant volumes of data
• Measure cluster size/£/time performance

Driving AWS EMR…

AWS Elastic MapReduce… configuring a Hadoop cluster…

Some Pig…

Query B: ~10 million quotes (5m each channel). Joining quote data across different channels.

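-- Register the custom UDF jar and the dom4j XML library
register 's3n://ashaw-1/jars/myudfs.jar';
register 's3n://ashaw-1/jars/dom4j-1.6.1.jar';
-- Load the quote data for the two channels
A = load 's3n://ashaw-1/Intermediate/duplicated/lots' using PigStorage();
Arac = load 's3n://ashaw-1/Intermediate/duplicated/lotsrac' using PigStorage();
-- Take 5 million quotes from each channel
A1 = limit A 5000000;
Arac1 = limit Arac 5000000;
-- Flatten field $5 of each quote with the custom UDFs
B = foreach A1 generate myudfs.Flatten((chararray)$5);
Brac = foreach Arac1 generate myudfs.Flatten2((chararray)$5);
-- Join the two channels on field $21 of the flattened tuples
C = join B by (chararray)($0.$21), Brac by (chararray)($0.$21);
-- Keep joined rows where field $0 of either side equals 1
D = filter C by $1.$0 == 1 OR $0.$0 == 1;
-- Write the result back to S3
STORE D INTO 's3n://ashaw-1/myoutputfolder/';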
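
For readers who do not know Pig, the join and filter steps of Query B can be pictured with a small in-memory sketch in Python. This is only an illustration of the logic, not the code used in the talk: the function name `join_and_filter`, the toy rows and the key position (index 1 here, rather than field 21) are all made up for the example.

```python
from collections import defaultdict


def join_and_filter(channel_a, channel_b, key_index, flag_index=0):
    """Inner-join two lists of flattened quote tuples on one field, then
    keep only the joined pairs where the flag field of either side is 1.

    Hypothetical helper: it mirrors the shape of the Pig `join` and
    `filter` statements, not their distributed execution.
    """
    # Index the second channel by its join key (a simple hash join).
    index = defaultdict(list)
    for row in channel_b:
        index[row[key_index]].append(row)

    joined = []
    for row_a in channel_a:
        for row_b in index.get(row_a[key_index], []):
            # Same condition as: filter C by $1.$0 == 1 OR $0.$0 == 1
            if row_a[flag_index] == 1 or row_b[flag_index] == 1:
                joined.append((row_a, row_b))
    return joined


# Toy data (invented): (flag, quote key, payload)
channel_one = [(1, "Q1", "a"), (0, "Q2", "b"), (0, "Q3", "c")]
channel_two = [(0, "Q1", "x"), (0, "Q2", "y"), (1, "Q3", "z"), (1, "Q9", "w")]

matches = join_and_filter(channel_one, channel_two, key_index=1)
for left, right in matches:
    print(left, right)
```

On a cluster, Pig turns the same two statements into MapReduce jobs, so the join key decides which reducer sees which pair of quotes.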

Costs per run…
XML flattening results for 10 million quotes:

Cluster size        Time to execute   Approx. cost
10 x Small nodes    64 minutes        11 compute hours - $1.155 per hour (approx. £0.72)
19 x Small nodes    31 minutes        20 compute hours - $2.10 per hour (approx. £1.30)
8 x Large nodes     19 minutes        8 compute hours - $3.78 per hour (approx. £2.34)
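
The figures above can be compared with a little arithmetic. The Python sketch below uses only the numbers in the table; the helper names are made up for the example. The slide does not say how part-hours were billed, so the pro-rated cost is an assumption made purely to put the three runs on a like-for-like footing.

```python
# Figures copied from the table above:
# (cluster, minutes to execute, cluster cost in $ per hour, quoted £ equivalent)
RUNS = [
    ("10 x Small", 64, 1.155, 0.72),
    ("19 x Small", 31, 2.10, 1.30),
    ("8 x Large", 19, 3.78, 2.34),
]


def implied_gbp_per_usd(usd, gbp):
    """Exchange rate implied by a quoted $ / £ pair."""
    return gbp / usd


def prorated_usd(minutes, usd_per_hour):
    """Cost of one run IF billing were pro-rated by the minute.

    ASSUMPTION: per-minute billing is hypothetical; the slide does not
    state how part-hours were charged.
    """
    return usd_per_hour * minutes / 60


def speedup(baseline_minutes, minutes):
    """How many times faster a run is than the baseline run."""
    return baseline_minutes / minutes


baseline = RUNS[0][1]  # the 10 x Small run
for name, minutes, usd_per_hour, gbp in RUNS:
    print(
        f"{name}: {speedup(baseline, minutes):.2f}x faster, "
        f"~${prorated_usd(minutes, usd_per_hour):.2f} pro-rated, "
        f"implied rate {implied_gbp_per_usd(usd_per_hour, gbp):.3f} £/$"
    )
```

Under that assumption the three runs cost roughly the same per run, so the bigger clusters mainly buy a shorter wait.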

But we could have used spot instances…

Wrapping up…
• It will be a similar adoption pattern to cloud:
− Those organisations that make it work and gain additional business insights will
• market more accurately
• sell more
• have less customer churn
• have better-paying customers

• Market forces will eventually force adoption or failure of their competitors – all other things being equal. It's Darwinian evolutionary forces at work in the marketplace.

• Interestingly, the costs to exploit big data (well – at least to find out if there is some value that you are missing out on) are now very low due to vendors such as AWS, so it's a market advantage that is relatively cheap to attain
− I.e. we're talking about a few enabled, savvy staff and some "pay as you go" compute resources
