(c) 2013 Ian Brown

WE’LL TALK ABOUT
• What is Big Data? What makes it “Big”?
• Who needs Big Data? Where does it come from?
• How does Big Data work? What are the tools and the issues?
• What does the future of Big Data look like?
WHAT IS BIG DATA?
• To some extent “Big” really means “difficult to handle”
• Something of a misnomer: it is not only about size, as three things distinguish big data:
  • Volume (how much capacity you need to process/store)
  • Velocity (how quickly you need to process updates)
  • Variety (how complicated/non-standard the data is)
[Image slide — source: datasciencecentral]
UNITS
[Image slide — source: www.wikipedia.com]
VOLUME
• From pre-history to 2004 the world generated around 5 exabytes of data - we now produce that amount every 2 days
• Data volumes are huge and growing: 1.8 zettabytes in 2011
  • = 1’800 exabytes
  • = 1.8 billion terabytes
• Data is predicted to grow x44 by 2020
  • >40% every year
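The unit arithmetic on this slide can be sanity-checked in a few lines. Decimal SI prefixes are assumed; the ~52% annual growth rate used to reproduce the “x44” figure is my assumption, not a number from the slide:

```python
# Sanity check of the volume figures, using decimal SI prefixes:
# 1 ZB = 10^21 bytes, 1 EB = 10^18 bytes, 1 TB = 10^12 bytes.
ZB, EB, TB = 10**21, 10**18, 10**12

volume_2011 = 1.8 * ZB                  # 1.8 zettabytes
print(round(volume_2011 / EB))          # 1800 exabytes
print(round(volume_2011 / TB / 1e9, 1)) # 1.8 billion terabytes

# ">40% every year" and "x44 by 2020" are roughly consistent:
# compounding an assumed ~52% per year over the 9 years 2011-2020
growth = 1.523 ** (2020 - 2011)
print(round(growth))                    # 44
```

This also shows why the 2011 figure is exabytes rather than petabytes: 1.8 ZB is 1.8 million petabytes.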
VOLUME
• Whilst data has sometimes been “big” for some people in the past, it’s definitely potentially big now (for everyone) and getting bigger every day
• Sources are networks (voice/data/video), social networks, sensors & transducers, GPS, banking, logistics, trade etc
• 90% of the World’s digital data was gathered in the last 2 years (source: IBM 2012)
VARIETY (Variability)
• Governments and corporates have always had big databases, but the data has always been structured - invoices, customers, inventory etc
• Of the huge increase in data we just mentioned, only 10-20% will be structured - the rest (80-90%) will be unstructured:
  • Video, email, social media, audio, images/scanned material
• Traditional SQL databases (the clue is in the S) don’t do well with this sort of mixed data
VELOCITY
• Data is now coming at users constantly from global sources, which creates a 24x7 problem.
  • Q. When do you stop to summarise/analyse? At what point do you cut off for the day/week/period to run a report or plan the next action?
  • A. Sometimes you can’t! Analysis/processing/action may have to happen on streaming data, with corrections or actions taken on-the-fly - sometimes without even storing the data!
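A minimal sketch of what “analysis on streaming data without storing it” can look like: keep a small running summary, update it as each event arrives, and discard the raw event. The function and values here are illustrative, not from the slides:

```python
# Maintain a running mean over a stream: each event is processed once,
# updates the summary, and is then dropped -- no raw data is stored.

def running_mean(stream):
    count, mean = 0, 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count   # incremental (Welford-style) update
        yield mean                        # the current answer, available 24x7

readings = iter([4.0, 8.0, 6.0, 2.0])     # e.g. sensor values arriving live
for m in running_mean(readings):
    print(m)
# 4.0, 6.0, 6.0, 5.0
```

The same pattern extends to variance, counts per category, or threshold alarms - whatever summary the on-the-fly action needs.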
[Image slide — source: datasciencecentral]
HASN’T DATA ALWAYS BEEN “BIG”?
• Maybe.
• Historically computing was done in “batches”, where stacks of punchcards or reels of tape (first paper, then magnetic) were processed one file at a time. This had to be done when the business was “closed”.
• If you closed at 18:00 and opened the next day at 09:00, you had a window of 15 hours to do all your calculations and reports before you had to stop and open for the next day’s business.
• If you couldn’t get it done in 15 hours, your data was “big”
HASN’T DATA ALWAYS BEEN “BIG”?
• Hence this is a relative question of how much data vs how much computing you can throw at it
• For more than three decades we have seen a constant increase in computing power, which made the data generated by most businesses through their local customers look “small”
• Then the Web happened ....
HASN’T DATA ALWAYS BEEN “BIG”?
• Initially Web 1.0 and eCommerce opened up servers to many millions of events in terms of “hits” on web sites, logs, emails, and a global multiplier of who could be a customer and access your system. Analysis of who was searching for what and who was buying what absorbed a lot of computing capacity.
• Web 2.0 has added hundreds of millions of social networking users, all broadcasting data in terms of photos, tweets, status updates, blog posts etc, which has created a truly vast ocean of data which can be trawled to learn about our behaviours, beliefs and likely future actions.
• If you want to process this data it certainly has volume; it doesn’t stop coming at you when you close for the night, so it has tremendous velocity; and if you are pulling it in from several sources it quickly starts to exhibit complexity and variety
• Traditional hardware/software has not kept pace with the growth of volume/velocity/variety
WHO NEEDS BIG DATA?
• Generally: anyone who can derive a “big picture” insight by adding up all the small data points and “zooming out”
• How much can you say about one tweet? A thousand tweets?
• Twitter generates >9’000 tweets/sec at peak, and adds another billion tweets roughly every 5 days. Source: www.statisticbrain.com (2012)
• What you “reckon” changes into sentiment analysis
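The two tweet figures describe different rates, which a quick back-of-envelope check makes visible. The assumption here (mine, not the slide’s) is that 9’000/sec is a peak rate while the “5 days per billion” reflects the lower average:

```python
# Back-of-envelope check of the 2012 tweet figures.
SECONDS_PER_DAY = 86_400
BILLION = 1_000_000_000

# At a sustained 9'000 tweets/sec you would add a billion in ~1.3 days...
days_at_peak = BILLION / (9_000 * SECONDS_PER_DAY)
print(round(days_at_peak, 1))      # 1.3

# ...so "around 5 days per billion" implies an average of ~2'300 tweets/sec
avg_rate = BILLION / (5 * SECONDS_PER_DAY)
print(round(avg_rate))             # 2315
```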
WHO NEEDS BIG DATA?
[Image slide — source: Flickr]
THE SCALE CHANGES THINGS
• Big Data may be analogous to the difference between the insight in a picture vs. a video
[Image — source: slowmotionrunninghorse.com]
WHY CARE?
• Governments - release of open data: McKinsey est. $300m per year savings in the US, $100m savings in Europe
• Banks - fraud detection, algo trading: losses/profits. 2/3rds of 7bn US shares a day ..
• Life Sciences - genomics, drug research. 10 yrs to sequence the human genome
• Retailers - buying patterns, CRM, “if you like this ...”: cross-selling
• Social - Google, Facebook, LinkedIn, Twitter, Amazon, eBay: Insight!
• Networks - load management/routing, protecting networks
• Probabilistic outcomes - Google Flu predictions (Nature, 2009)
WHAT’S THE DIFFERENCE?
• EXHAUSTIVE
• SCRUFFY
• PRAGMATIC
Anything missing ...?
[Image — source: damfoundation.org]
SO WHAT?
Data used to be small, exact and causal.
• Three key pieces have shifted:
  • A shift from sampling to populations
  • A shift from exactness to “gisting”
  • A move from causality to correlation
• Data is no longer tied to the purpose for which it was collected
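The first shift above (“sampling to populations”) can be pictured with a toy example: with small data you estimate a statistic from a sample; with big data you can often just compute it over every record you hold. The synthetic dataset below is purely illustrative:

```python
# "N = all": estimate a mean from a classical sample, then compute it
# exactly over the whole (synthetic) population.
import random

random.seed(0)
population = [random.gauss(100, 15) for _ in range(100_000)]  # made-up data

sample = random.sample(population, 100)            # classical approach
sample_mean = sum(sample) / len(sample)            # estimate + sampling error

population_mean = sum(population) / len(population)  # exact for the data held

print(round(sample_mean, 1), round(population_mean, 1))
```

The sample answer is close but carries sampling error; the population answer is exact for the data you actually hold - which is the trade the slide describes.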
ASPECTS
[Image slide — source: www.datasciencecentral.com]
NEW SOURCES OF DATA
• Information is now gathered on events and values that were not traditionally thought of as data:
  • Current location (vs. address)
  • Whether you “like” someone else’s post
  • Things you nearly bought but didn’t
  • How much energy your office needs now
• PLUS transactional systems, social media, sensors etc etc
HOW DOES IT WORK?
• Is this just a big database running on a powerful machine?
• Not usually. Traditional databases don’t scale to this
• Many hands make light work: remember S.E.T.I.?
  • Split it up and share it out between many nodes
• Key analysis perspectives:
  • Real-time streaming data analysis (detect events and act)
  • Business Intelligence (asking specific questions of the data)
  • Data Mining (asking: is there anything interesting here?)
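The “split it up and share it out” idea can be sketched with worker threads standing in for cluster nodes. In a real cluster each chunk would go to a separate machine; all names and data here are illustrative:

```python
# Split the work into chunks, share it out, reassemble the answer.
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    # each "node" handles its own chunk independently
    return len(chunk.split())

chunks = ["big data big", "data velocity", "big variety and volume"]

with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(count_words, chunks))   # work shared out

print(partials)       # [3, 2, 4] -- one partial answer per chunk
print(sum(partials))  # 9 -- the reassembled answer
```

This is exactly the S.E.T.I.@home shape: independent chunks, cheap workers, cheap recombination.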
PHYSICALLY
[Image slide — source: Leons Petražickis, IBM Canada]
WHAT ARE THE PIECES?
• HDFS - distributed file system (based on Google’s GFS design)
• MapReduce (Google)
  • Split the problem into chunks
  • Spread it out over lots of (cheap) computing nodes
  • Reassemble the answer from the parts
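The three MapReduce steps above can be sketched in a single process. Real Hadoop distributes the map, shuffle and reduce phases across nodes; this toy word count just shows the shape of the computation (all names are illustrative):

```python
# MapReduce in miniature: map each chunk to (key, value) pairs, shuffle
# the pairs by key, then reduce each key's values to a final answer.
from collections import defaultdict

def map_phase(chunk):
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data big", "data velocity", "big variety"]    # the "split" step
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
result = reduce_phase(shuffle(mapped))                       # the "reassemble" step
print(result)   # {'big': 3, 'data': 2, 'velocity': 1, 'variety': 1}
```

Because each map call touches only its own chunk and each reduce call only its own key, both phases parallelise across cheap nodes with no coordination beyond the shuffle.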
LOGICALLY
[Image slide — source: Leons Petražickis, IBM Canada]
WHAT IS THE APPROACH?
• Somewhere to store it across different systems
  • e.g. a distributed file system (HDFS) - batch mode
• Some way of specifying work in pieces/jobs
  • e.g. Hadoop (Yahoo) or MapReduce (for low-level jobs)
  • e.g. Pig, Hive or Oozie (for high-level apps/queries that translate to MapReduce)
• Some way of reading/processing in real time vs batch, e.g. HBase and Flume
• Some way of mining the data for trends/meaning (Data Mining/Machine Learning), e.g. Mahout
• Some way of getting data in/out of SQL databases, e.g. Sqoop
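One way to see the low-level vs high-level split above: a Hive-style query is compiled down to map and reduce steps. Here is a rough single-process Python picture of what a `GROUP BY` aggregate turns into; the table and names are made up for illustration:

```python
# A Hive/Pig-style query like
#   SELECT user, SUM(amount) FROM purchases GROUP BY user;
# translates to a map step (emit (user, amount)) and a reduce step
# (sum the amounts per user). Toy single-process picture:
from collections import defaultdict

purchases = [("ana", 10.0), ("bob", 5.0), ("ana", 7.5)]   # the "purchases" table

groups = defaultdict(list)
for user, amount in purchases:      # map + shuffle: group amounts by user
    groups[user].append(amount)

totals = {user: sum(amounts) for user, amounts in groups.items()}  # reduce
print(totals)   # {'ana': 17.5, 'bob': 5.0}
```

The point of Pig and Hive is that you write the one-line query and the planner generates jobs of this shape for you.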
HOW MANY “CHUNKS”?
• eBay had 530 cores in 2010. It’s now in excess of 2’500 cores
• Yahoo has >4’000 cores
• Facebook have 23’000 cores with 20 PB of storage - be careful what you “like” ...
• Google aren’t telling .... (24 PB of data / day)
• LinkedIn offer 100bn recommendations / week
WHERE CAN I GET SOME!!
• IBM
• ORACLE
• MICROSOFT
• EMC
• Informatica
• Apache - open source
• Amazon - elastic computing / cloud-based Hadoop
• Small installations are free
TRENDS
• More data - MUCH MUCH MORE data
• Internet of Things (IoT) - instrumentation/measurement
  • Smart energy meters (2005), RFID tags (1.3bn in 2011, >30bn in 2013)
  • Each A380 engine gives 10 TB every 30 minutes: 640 TB for JFK->London
• Big Science: genomics, pharmacology. The LHC experiment gives 40 TB/sec!!
• Much more video and unstructured stuff (~60% of Internet traffic video by 2015)
• The re-invention (or demise) of search/SEO
• The need to move from local big data to distributed big data and sense-making networks
• The rise of observation - the need to filter and gain more control
Where does that leave your company?
[Image slide — source: sap.com]
MAGIC BULLET?
• Hadoop probably won’t replace your existing database
• It is very good at large files/data sets, so you may not see so much benefit from large volumes of small files/datasets
• It is very good at dealing with unstructured data, so if your data is largely structured (or can be made to look structured) you may be better off sticking with traditional databases
• It doesn’t need to know in advance how you want to query the data, which makes it very flexible - but if your queries are always the same you may be able to stick with SQL databases and BI/DW systems
TWO THINGS WORTH REMEMBERING ..