Smart421 Star Testing Event – Big Data, London, 26 Sept 2013 (R Meehan, v1.0)
Published on

Contains the slides presented by Robin Meehan, CTO at Smart421 Ltd, at the Star Testing Event – Big Data in London on Thursday 26 Sept 2013. The deck comprises 21 slides in total. More details on the case study for Aviva can be found on the webpage.
Published in: Technology, Business

Slide notes
  • I’m NOT going to start with the traditional definition of big data – I assume that has already been done to death today, and if it hasn’t, you have certainly been hyped to death by sales spam. Let’s cover it by example later on if we need to, by talking about real examples…
  • If you are a traditional vendor, then cloud is pretty disruptive. I really believe in it – hell, we’re an AWS “advanced consulting partner” for the managed service and big data competencies, the only partner in the UK to hold both. What’s the reason for adoption? Agility and cost, primarily. How much agility will I get? Lots – I might get my product to market months earlier due to fewer infrastructure delays, and my run cost might be as low as 50%. Sounds great!
  • So that makes us, as consumers of cloud services, happy, right? And it should! It’s a truly great thing.
  • Eyjafjallajökull in Iceland, 2010 – now that’s properly disruptive. The big data trend is potentially the same, when compared with cloud.
  • The consequences of losing competitive advantage in the “big data war” can be very serious – you can go out of business! This is especially true in low-margin businesses like telco and retail. For example, if a telco like O2 or Vodafone can understand its customers better, predict which customers will churn off the platform (in a saturated market, where there are no new customers to go after), and “save” the highest lifetime-value customers 1% more of the time than its competitors, that edge adds up quickly.
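As a back-of-the-envelope illustration of that 1% edge (all numbers below are hypothetical, not from the slides):

```python
# Hypothetical revenue impact of saving high-value churners 1% more often.
# None of these figures come from the deck - they are illustrative only.

def retained_revenue_gain(customers, churn_rate, save_rate_edge, avg_ltv):
    """Extra lifetime value retained per year thanks to a better save rate."""
    churners = customers * churn_rate        # customers at risk each year
    extra_saves = churners * save_rate_edge  # the 1% edge over competitors
    return extra_saves * avg_ltv             # value of those extra saves

# e.g. 10m subscribers, 20% annual churn, 1% better save rate, £500 LTV
gain = retained_revenue_gain(10_000_000, 0.20, 0.01, 500)
print(f"£{gain:,.0f}")  # £10,000,000
```

Even a tiny per-customer edge scales to a material number at telco volumes, which is the point of the slide.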
  • What’s changed? Well from my point of view, organisations have been processing large data sets for a long time, but it required deep pockets and upfront Capex, i.e. you would want to be pretty confident that the business value is there.Now, the happy marriage of cloud with big data brings a transformational price point and reduced cost of entry to big data analytics.
  • OK, so that’s great, I’m sold – how do I get started? Create the killer combo of skills: customers have the deep understanding of the structure of their data (even if they don’t have the deep insights into it); marry that up with the technical skills to load the data, transform it, process and analyse it, and then provide visualisations of it (e.g. load the results into your enterprise Business Intelligence tool). This typically consists of two phases: a discovery phase – ad-hoc data processing of multiple small data sets initially, then big data, searching for insight into the data (this is the “scientific method” in action) – and a production phase – once some valuable insight is found, automate the extraction of that insight, e.g. to feed a “propensity to churn” score for each customer into your CRM system every night.
  • Use case – explanation of the selection of the use case, based on the original brainstorming around raw aggregator test data and the issue of not being able to get business insights out of this data due to the volumes – e.g. cross-channel cannibalisation: who comes to QMH and then to other Aviva brands, and where do they subsequently purchase?
  • AWS Elastic MapReduce. Cover: the Amazon self-serve web console – the EMR job flow; Amazon pricing – standard EC2 plus the EMR service – explain that you pay slightly more for EMR on top of EC2 because AWS provides it as a managed service – they have installed all the elements and dependencies required for Hadoop (i.e. Java, Hadoop, Pig etc.). You could roll your own – there is no reason why you couldn’t build the same stack yourself on top of EC2.
  • We’re all coders these days aren’t we? It’s hip! So here’s some code…
  • It’s nothing without visualisation… Pentaho – running on AWS, of course.
  • What challenges do we see in the market? Well – it’s a bit of a “solution looking for a problem”… in some sectors adoption is pretty high; in others it’s like talking about time travel. Also, access to the underlying data is a challenge, even though it’s held in internal systems – as is the quality of that data.
  • And where is the technology going? Who are the new kids on the block? Well – it’s a HOT area and certainly a great time to be in IT – there’s a whole body of computer science research going on, with innovations all the time. A consistent trend is the movement from batch-oriented Hadoop-based processing to more real-time analytics. Real-time has two meanings really: real-time stream/event processing – stock market trades, the Twitter firehose etc. – and near-immediate access to perform complex analytics queries on data as soon as it has been ingested, i.e. real-time queries, not batch-based. A common theme is that a lot of these new technologies exploit aspects of Hadoop, e.g. HDFS. A second trend is that there is a plethora of vendors out there offering simplified means of running Hadoop-based analytics. That’s a powerful thing, but kinda dangerous in some ways, as you need some degree of understanding of the underlying computing technology to use it well. Newer kids on the block – Storm, Spark, Dremel, Impala etc. Impala comes from Cloudera. Apache Drill (incubating) is a distributed system for interactive analysis of large-scale datasets, based on Google’s Dremel; its goal is to efficiently process nested data, with design goals of scaling to 10,000 servers or more and processing petabytes of data and trillions of records in seconds. Dremel is a distributed system developed at Google for interactively querying large datasets; it is the inspiration for Apache Drill, and it powers Google’s BigQuery service. Spark is an open source cluster computing system that aims to make data analytics fast – both fast to run and fast to write. To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly, much more quickly than with disk-based systems like Hadoop MapReduce. To make programming faster, Spark provides clean, concise APIs in Scala, Java and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.
  • Survival of the best informed!
  • Volume
  • Velocity
  • Variety
  • What is it technically? Lots of things… Hadoop is probably the “daddy”, along with Pig, Hive etc.
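To make “Hadoop is the daddy” concrete, the MapReduce model it popularised can be sketched in plain Python (a toy word count – illustrative only, not code from the deck):

```python
from collections import defaultdict

# Toy MapReduce-style word count, mimicking Hadoop's three phases.

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group values by key, as Hadoop does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big insight", "big value"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])  # 3
```

Hadoop runs exactly this shape of computation, but with the map and reduce tasks distributed across a cluster and the shuffle done over the network.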
  • Transcript

    • 1. 28 September 2013 Star Testing Event – Big Data Robin Meehan, CTO, Smart421
    • 2.
    • 3. Cloud…
    • 4.
    • 5. …Big Data!
    • 6.
    • 7. So what’s changed?
    • 8. Big data exploitation – in practice
    • 9. Case study • Aviva have a number of brands/channels to market, including insurance aggregators (e.g. CompareTheMarket, GoCompare…) • The raw aggregator quote data is of a scale to present a ‘Big Data’ problem – there is great potential for gaining additional insights from this data • Define some candidate business questions • Test them against significant volumes of data • Measure cluster size/£/time performance
    • 10. Driving AWS EMR… AWS Elastic MapReduce… configuring a Hadoop cluster...
    • 11. Some Pig…
      register 's3n://ashaw-1/jars/myudfs.jar';
      register 's3n://ashaw-1/jars/dom4j-1.6.1.jar';
      A = load 's3n://ashaw-1/Intermediate/duplicated/lots' using PigStorage();
      Arac = load 's3n://ashaw-1/Intermediate/duplicated/lotsrac' using PigStorage();
      A1 = limit A 5000000;
      Arac1 = limit Arac 5000000;
      B = foreach A1 generate myudfs.Flatten((chararray)$5);
      Brac = foreach Arac1 generate myudfs.Flatten2((chararray)$5);
      C = join B by (chararray)($0.$21), Brac by (chararray)($0.$21);
      D = filter C by $1.$0 == 1 OR $0.$0 == 1;
      STORE D INTO 's3n://ashaw-1/myoutputfolder/';
      Query B: ~10 million quotes (5m each channel). Joining quote data across different channels.
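The Pig script above loads quote data from two channels, joins them on a shared field, and keeps rows where either side has a flag set. A rough Python equivalent (with hypothetical field names and toy data – the real Aviva schema is not shown in the deck) might be:

```python
# Toy re-implementation of the Pig join/filter pattern above.
# Field names ("customer_id", "flag") are hypothetical, not from the deck.

def join_channels(quotes_a, quotes_b, key):
    """Inner join two lists of quote dicts on a shared key field."""
    by_key = {}
    for row in quotes_b:
        by_key.setdefault(row[key], []).append(row)
    joined = []
    for row in quotes_a:
        for match in by_key.get(row[key], []):
            joined.append((row, match))
    return joined

channel_a = [{"customer_id": "c1", "flag": 1}, {"customer_id": "c2", "flag": 0}]
channel_b = [{"customer_id": "c1", "flag": 0}, {"customer_id": "c3", "flag": 1}]

pairs = join_channels(channel_a, channel_b, "customer_id")
# Keep pairs where either side has the flag set, like the Pig 'filter ... OR' step
kept = [(a, b) for a, b in pairs if a["flag"] == 1 or b["flag"] == 1]
print(len(kept))  # 1
```

At ~10 million quotes this in-memory approach stops being practical, which is exactly why the slide reaches for Pig on a Hadoop cluster instead.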
    • 12. Visualisation
    • 13. Costs per run…
      Cluster size      | Time to execute | Approx. cost
      10 x Small nodes  | 64 minutes      | 11 compute hours – $1.155 (approx. £0.72)
      19 x Small nodes  | 31 minutes      | 20 compute hours – $2.10 (approx. £1.30)
      8 x Large nodes   | 19 minutes      | 8 compute hours – $3.78 (approx. £2.34)
      But we could have used spot instances…
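The arithmetic behind that slide can be checked with a quick sketch. The per-compute-hour rates below (roughly $0.105 for Small and $0.4725 for Large) are back-derived from the slide's own totals, not quoted AWS price-list values:

```python
# Sanity-check the slide's cost figures: total cost = compute hours x rate.
# Rates are derived from the slide's totals, not official AWS prices.

def run_cost(compute_hours, rate_per_hour):
    """Total cost of an EMR run billed per compute hour."""
    return compute_hours * rate_per_hour

small_rate = 0.105    # implied by $1.155 / 11 hours and $2.10 / 20 hours
large_rate = 0.4725   # implied by $3.78 / 8 hours

print(round(run_cost(11, small_rate), 3))  # 1.155
print(round(run_cost(20, small_rate), 2))  # 2.1
print(round(run_cost(8, large_rate), 2))   # 3.78
```

Note the trade-off the slide is making: the larger cluster finishes in under a third of the time but costs roughly three times as much per run, and spot instances would cut either figure further.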
    • 14. Challenges…
    • 15. Storm Spark Dremel/Drill Impala AWS Redshift Accumulo etc etc Trends…
    • 16. Blog: Thank you! Robin Meehan CTO, Smart421
    • 17. Spare slides…
    • 18. 28 September 2013
    • 19. 28 September 2013
    • 20. 28 September 2013
    • 21.