What is it technically? Lots of things… Hadoop is probably the "daddy", along with Pig, Hive etc.
Newer kids on the block – Storm, Spark, Dremel, Impala etc.

Impala – from Cloudera.

Apache Drill (incubating) is a distributed system for interactive analysis of large-scale datasets, based on Google's Dremel. Its goal is to efficiently process nested data. A design goal is to scale to 10,000 servers or more, and to process petabytes of data and trillions of records in seconds.

Dremel is a distributed system developed at Google for interactively querying large datasets. It is the inspiration for Apache Drill, and it powers Google's BigQuery service.

Spark is an open source cluster computing system that aims to make data analytics fast – both fast to run and fast to write. To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly, much more quickly than with disk-based systems like Hadoop MapReduce. To make programming faster, Spark provides clean, concise APIs in Scala, Java and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.
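Spark's core idea – load the data into memory once, then query it repeatedly without going back to disk – can be sketched in plain Python (no Spark required; the quote records and fields below are invented purely for illustration):

```python
# Plain-Python sketch of Spark's in-memory model: pay the load cost once,
# then run many queries against the cached data. Records are illustrative.

quotes = [  # imagine this loaded once from S3/HDFS into cluster memory
    {"channel": "aggregator", "premium": 320.0},
    {"channel": "direct",     "premium": 290.0},
    {"channel": "aggregator", "premium": 410.0},
]

# Repeated "queries" hit the cached in-memory data, not the disk:
aggregator_quotes = [q for q in quotes if q["channel"] == "aggregator"]
avg_premium = sum(q["premium"] for q in quotes) / len(quotes)

print(len(aggregator_quotes))  # 2
print(avg_premium)             # 340.0
```

With Hadoop MapReduce, each of those two queries would typically re-read the input from disk; the in-memory approach only pays that cost once.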
OK, so that's great – I'm sold. How do I get started?

Create the killer combo of skills:
• Customers have the deep understanding of the structure of their data (even if they don't have the deep insights into it)
• Marry that up with the technical skills to load the data, transform it, process and analyse it, and then provide visualisations of it (e.g. load the results into your enterprise Business Intelligence tool)

Typically this consists of two phases:
• Discovery phase – ad hoc processing of multiple small data sets initially, then big data, searching for insight into the data. This is the "scientific method" in action.
• Production phase – once some valuable insight is found, automate the extraction of that insight, e.g. to feed a "propensity to churn" score for each customer into your CRM system every night.
Use case – explain the selection of the use case, based on the original brainstorming around raw aggregator test data and the issue of not being able to get business insights out of this data due to the volumes. E.g. cross-channel cannibalisation – who comes to QMH and then to other Aviva brands, and where do they subsequently purchase?
AWS Elastic MapReduce

Cover:
• Amazon self-serve web console – EMR job flow
• Amazon pricing – standard EC2 + the EMR service. Explain that you pay slightly more for EMR on top of EC2, as AWS provides it as a managed service – they have installed all the elements and dependencies required for Hadoop (i.e. Java, Hadoop, Pig etc.)
• Could roll your own – no reason why you couldn't roll your own on top of EC2
It's nothing without visualisation… Pentaho – running on AWS, of course.
Punchline – the picture is of Charles Darwin. It's going to be "survival of the fittest", or perhaps it would be more accurate to say "survival of the best informed".
Smart421 SyncNorwich Big Data on AWS by Robin Meehan
SyncNorwich – Big Data on AWS
April 2013
Robin Meehan, CTO
Introduction… the example Aviva use case
• Aviva have a number of brands/channels to market, including insurance aggregators (e.g. CompareTheMarket, GoCompare…)
• The raw aggregator quote data is of a scale to present a "Big Data" problem – there is great potential for gaining additional insights from this data
So…
• Define some candidate business questions
• Test them against significant volumes of data
• Measure cluster size/£/time performance
Some Pig…
Query B: ~10 million quotes (5m each channel). Joining quote data across different channels.

register 's3n://ashaw-1/jars/myudfs.jar';
register 's3n://ashaw-1/jars/dom4j-1.6.1.jar';
A = load 's3n://ashaw-1/Intermediate/duplicated/lots' using PigStorage();
Arac = load 's3n://ashaw-1/Intermediate/duplicated/lotsrac' using PigStorage();
A1 = limit A 5000000;
Arac1 = limit Arac 5000000;
B = foreach A1 generate myudfs.Flatten((chararray)$5);
Brac = foreach Arac1 generate myudfs.Flatten2((chararray)$5);
C = join B by (chararray)($0.$21), Brac by (chararray)($0.$21);
D = filter C by $1.$0 == 1 OR $0.$0 == 1;
STORE D INTO 's3n://ashaw-1/myoutputfolder/';
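For readers less familiar with Pig, the join-and-filter at the heart of that script can be sketched in plain Python. The records, the join key and the flag field below are invented stand-ins for the flattened quote tuples – the real script joins on positional field $21 of the flattened XML:

```python
# Hypothetical plain-Python rendering of the Pig JOIN/FILTER above:
# join two channels' flattened quotes on a shared key field, then keep
# only pairs where either side's flag field equals 1.

channel_a = [("cust-1", 1), ("cust-2", 0)]   # (join_key, flag) per quote
channel_b = [("cust-1", 0), ("cust-3", 1)]

# Equivalent of: C = join B by key, Brac by key;
joined = [(a, b) for a in channel_a for b in channel_b if a[0] == b[0]]

# Equivalent of: D = filter C by flag_a == 1 OR flag_b == 1;
result = [(a, b) for a, b in joined if a[1] == 1 or b[1] == 1]

print(result)  # [(('cust-1', 1), ('cust-1', 0))]
```

The point of Pig/EMR is that this same logic runs as distributed MapReduce jobs over millions of quotes rather than a handful of tuples in one process.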
Costs per run…
XML flattening results – 10 million quotes:

Cluster size        Time to execute    Approx. cost
10 x Small nodes    64 minutes         11 compute hours – $1.155 per run (approx. £0.72)
19 x Small nodes    31 minutes         20 compute hours – $2.10 per run (approx. £1.30)
8 x Large nodes     19 minutes         8 compute hours – $3.78 per run (approx. £2.34)

But we could have used spot instances…
Wrapping up…
• It will be a similar adoption pattern to cloud:
− Those organisations that make it work and gain additional business insights will:
• market more accurately
• sell more
• have less customer churn
• have better-paying customers
• Market forces will eventually force adoption, or the failure of their competitors – all other things being equal. It's Darwinian evolutionary forces at work in the marketplace.
• Interestingly, the costs to exploit big data (well – at least to find out if there is some value you are missing out on) are now very low thanks to vendors such as AWS, so it's a market advantage that is relatively cheap to attain
− i.e. we're talking about a few enabled, savvy staff and some "pay as you go" compute resources