Big Data as Opposed to Small Data Mark Whitehorn

  • Each scan in the data file requires a great deal of intensive processing to determine which proteins were present in the cell. Some examples… Currently a single-threaded, PC-based application is used.
  • 4 minutes

    1. 1. BIG DATA - AS OPPOSED TO SMALL DATA Mark Whitehorn
    2. 2. What is Big data? Is it really just a marketing campaign? http://www.perceptualedge.com/articles/visual_business_intelligence/big_data_big_ruse.pdf “If you’re like me, the mere mention of Big Data now turns your stomach…. Why all the fuss? Why, indeed. Essentially, Big Data is a marketing campaign, pure and simple.” Stephen Few
    3. 3. Big data: Clearly I am not like Stephen Few. I don’t believe I have a particular axe to grind; I simply find this interesting. This talk is designed to try to explain: • what Big Data is • what characteristics we have found useful • why it may be of interest to you • a paradox
    4. 4. Data: All computer applications manipulate data.
    5. 5. Data: So, in the ’60s and ’70s we rapidly learnt to separate the data, and its manipulation, from the application.
    6. 6. Data: So, in the ’60s and ’70s we rapidly learnt to separate the data, and its manipulation, from the application. This led directly to the development of database engines and, ultimately, relational ones (DB2, Oracle, SQL Server).
    7. 7. Data: Data has always existed in two very broad flavours: • Data that is treated as small, discrete packages and is a good fit with the relational way of storing and querying data • Data that is not as above
    8. 8. Data is stored in tables
          LicenseNo  Make     Model     Year  Colour
          CER 162 C  Triumph  Spitfire  1965  Green
          EF 8972    Bentley  Mk. VI    1946  Black
          YSK 114    Bentley  Mk. VI    1949  Red
    9. 9. Data is stored in tables (same table as slide 8). Each table has a name: Car
    10. 10. Data is stored in tables (same Car table). Data is atomic
    11. 11. Data is stored in tables (same Car table). Label on the headings: Columns
    12. 12. Data is stored in tables (same Car table). Labels: Columns, Rows
    13. 13. Data is stored in tables (same Car table). Each row represents a unique entity in the ‘real’ world……
    14. 14. (no text on this slide)
    15. 15. Data: The manipulation typically consists of sub-setting the data by rows and columns and then doing some sums.
    16. 16. Data: Note that this kind of manipulation treats the data as atomic, which is fine, because the relational model assumes atomicity of data. Note also that the rows are unordered.
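As a rough illustration of slides 15 and 16, the sketch below loads the Car table from slide 8 into an in-memory SQLite database, subsets it by rows and columns, and then does some sums. The choice of SQLite, the column types and the particular query are assumptions made for the example; the talk itself only names DB2, Oracle and SQL Server. Because rows are unordered, the query asks for an explicit `ORDER BY`.

```python
import sqlite3

# The Car table from the slides, held in an in-memory SQLite database.
# (SQLite and the column types are assumptions for this sketch.)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Car ("
    "LicenseNo TEXT PRIMARY KEY, Make TEXT, Model TEXT, Year INTEGER, Colour TEXT)"
)
conn.executemany(
    "INSERT INTO Car VALUES (?, ?, ?, ?, ?)",
    [
        ("CER 162 C", "Triumph", "Spitfire", 1965, "Green"),
        ("EF 8972", "Bentley", "Mk. VI", 1946, "Black"),
        ("YSK 114", "Bentley", "Mk. VI", 1949, "Red"),
    ],
)

# Subset by rows (WHERE) and by columns (the SELECT list).
# Rows are unordered, so any ordering has to be requested explicitly.
bentleys = conn.execute(
    "SELECT LicenseNo, Year FROM Car WHERE Make = 'Bentley' ORDER BY Year"
).fetchall()

# Then do some sums over the subset.
count, avg_year = conn.execute(
    "SELECT COUNT(*), AVG(Year) FROM Car WHERE Make = 'Bentley'"
).fetchone()

print(bentleys, count, avg_year)
```

Every value is treated as an indivisible unit here: the query never needs to look inside a cell, which is what makes this kind of data such a comfortable fit for tables.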
    17. 17. Data: Data has always existed in two very broad flavours: • Data that is inherently atomic and is a good fit with the relational way of storing and querying data • Data that is not as above
    18. 18. Examples of ‘other’ data: • Images • Music • Word docs • Sensor data • Web logs • Twitter • Machines • Point of Sale • Mass spectrometers
    19. 19. What’s in a name? So, what do we call the ‘rest’? • Un-structured? • Semi-structured? • Multi-structured? • Non-relational? • Non-tabular?
    20. 20. What’s in a name? What about: • Big data?
    21. 21. Other definitions? • VVVvvvv • Volume • Variety • Velocity • Value • Very interesting • Various other words beginning with V…..
    22. 22. Big Data – not new? So why have we focused, for the last 30 years, almost exclusively on the first flavour? Because it: • is easy (relatively easy – Jim Gray*) • represents a significant proportion of the available data. *Jim Gray and Andreas Reuter, Transaction Processing: Concepts and Techniques (1993); Turing Award 1998
    23. 23. Big Data has come of age: • Two factors have changed • Rise of the Machines • Increase in computational power • There is a great synergy here • We are acquiring far more big data and we have the computational power to extract the information it contains
    24. 24. Big Data is hard: • 3 Vs • It is highly variable • We often want to look inside the data • Frequently non-atomic • Need custom functions for virtually every operation • Find the rotating wing aircraft in the image • Identify the best customer • What does the blogosphere think of our company?
    25. 25. Big Data: Examples • Log file • Mass spec. • Images
    26. 26. Big Data: Examples • Log file • Mass spectrometer • Image
    27. 27. Big Data: Examples • Log file • Mass spec. • Images
    28. 28. What is Big Data? Examples • Log file • Mass spec. • Images BIG DATA
    29. 29. Summary so far: • Just as you can always fit an aircraft engine into a car chassis, you can always put Big Data in a table, but you probably don’t want to • The analysis is not sub-setting the data by rows and columns • So each class of big data usually requires a (lovingly hand-crafted) custom analysis
    30. 30. Case Study: Big Data in the Life Sciences World. The massed spectrometers. Why would anyone do that?
    31. 31. Human Genome Project: $3 billion – 13 years. Sequencing completed (2003).
    32. 32. Human Genome Project: $3 billion – 13 years. Our genes define us. Errr…. how does that work exactly?
    33. 33. What is a protein? DNA: blueprint. Protein: product.
    34. 34. Why study proteins? GENOME: Genes contain instructions for creating proteins. PROTEOME: Proteins carry out functions within a cell.
    35. 35. Example Proteins • Protein: Actin, Function: Contracts Muscles • Protein: Insulin, Function: Controls Blood Sugar • Protein: Keratin, Function: Forms Hair and Nails • Protein: Hemoglobin, Function: Carries Oxygen (O2) • Protein: Antibody, Function: Fights Viruses
    36. 36. biSCIENCE: 20-25,000 genes in the human genome. Every nucleated cell in the same human has the same genome. But not all genes are active at the same time. Perhaps 15-18,000 active proteins in any one cell at any one time.
    37. 37. Slowly changing: over millions of years. Rapidly changing: over a day.
    38. 38. Studying Proteins: Proteins are chopped up using an enzyme to make them easier to measure. A specialised instrument (Mass Spectrometer) is used to measure (‘weigh’) the small protein fragments. We can use the mass of the small fragments to carry out intelligent database searches to identify which protein was detected.
    39. 39. Protein, Peptides, Amino Acids: MKLNISFPATGCQKLIEVDDERKLRTFYEKRMATEVAADALGEEWKGYVVRISGGNDKQGFPMKQGVLTHGRVRLLLSKGHSCYRPRRTGERKRKSVRGCIVDANLSVLNLVIVKKGEKDIPGLTDTTVPRRLGPKRASRIRKLFNLSKEDDVRQYVVRKPLNKEGKKPRTKAPKIQRLVTPRVLQHKRRRIALKKQRTKKNKEEAAEYAKLLAKRMKEAKEKRQEQIAKRRRLSSLRASTSKSESSQK
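The idea on slides 38 and 39 (chop the protein, weigh the fragments, search by mass) can be sketched in a few lines. This is only a toy under stated assumptions: the slides say "an enzyme" without naming it, so the trypsin-like rule used here (cut after K or R unless the next residue is P) is an assumption, and `digest`, `peptide_mass` and `match` are hypothetical helper names, not part of any tool mentioned in the talk. The residue masses are the standard monoisotopic values in daltons.

```python
# Monoisotopic amino-acid residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water molecule is added per free peptide


def digest(protein):
    """Cut after K or R unless followed by P (assumed trypsin-like rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # trailing piece with no cut site
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of a peptide: sum of residues plus one water."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def match(observed_mass, protein_db, tolerance=0.01):
    """Names of proteins with a peptide within `tolerance` Da of the mass."""
    return [
        name
        for name, sequence in protein_db.items()
        if any(abs(peptide_mass(p) - observed_mass) <= tolerance
               for p in digest(sequence))
    ]


# The first 23 residues of the sequence shown on slide 39.
fragment = "MKLNISFPATGCQKLIEVDDERK"
peptides = digest(fragment)
print(peptides)
```

A real search engine has far more to deal with (missed cleavages, modifications, charge states, scoring), which is part of why each scan needs so much processing.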
    40. 40. Mass Spectrometry: An analytical technique for the determination of the elemental composition of a sample.
    41. 41. Spectra: P1, P2, P3
    42. 42. Mass Spectra: File Sizes: typically several gigabytes per MS run. Identifications: range from 500-8000 protein identifications.
    43. 43. pepTRACKER: TRACK. VISUALISE. DISCOVER.
    44. 44. (Chart; only the axis ticks survive: 80%, 60%, 40%, 20%)
    45. 45. Localisation • Protein Peptide Alignment Map • Normalised Profiles for Synthesis, Degradation and Turnover • Comparison Between Compartments
    46. 46. Custom analysis and custom visualisation – vital tools in understanding big data
    47. 47. Intensive Data Processing Required to derive Information from the raw data: • Base Line Correction • Peak Detection • BIOConductor PROcess R Package • Deisotoping. Proteomics Volume 3, Issue 8, Article first published online: 12 AUG 2003
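To give a feel for the first two steps named on slide 47, here is a deliberately crude sketch of baseline correction followed by peak detection on a single spectrum. It is not the algorithm used by the Bioconductor PROcess package: subtracting the median and picking local maxima are stand-ins chosen for brevity, the function names are hypothetical, and the numbers are made up. Deisotoping is not attempted.

```python
from statistics import median


def subtract_baseline(intensities):
    """Crude baseline correction: subtract the median and clip at zero."""
    base = median(intensities)
    return [max(0.0, x - base) for x in intensities]


def detect_peaks(mz, intensities, min_height):
    """Return (m/z, intensity) pairs for local maxima of at least min_height."""
    peaks = []
    for i in range(1, len(intensities) - 1):
        y = intensities[i]
        if y >= min_height and y > intensities[i - 1] and y >= intensities[i + 1]:
            peaks.append((mz[i], y))
    return peaks


# A made-up spectrum: a flat background of 2.0 with two bumps on top.
mz = [100.0, 100.5, 101.0, 101.5, 102.0, 102.5, 103.0]
raw = [2.0, 2.0, 9.0, 2.0, 2.0, 6.0, 2.0]

corrected = subtract_baseline(raw)
peaks = detect_peaks(mz, corrected, min_height=3.0)
print(peaks)
```

Even this toy version shows why the analysis has to look inside the data: the interesting information is in the shape of each scan, not in any single value.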
    48. 48. “proteomics is much more complicated than genomics . . . while an organism’s genome is more or less constant, the proteome differs from cell to cell and over time” Computationally, perhaps three orders of magnitude more complex than HGP
    49. 49. Why bother trying to quantify it? Because this is payback time. Documenting the proteome opens the door to a whole new world.
    50. 50. So, what is a data scientist? My favourite description comes from Twitter: “Yeah, so I’m actually a data scientist. I just do this barista thing in between gigs.” More cynically: “A data scientist is just an analyst who lives in California.”
    51. 51. Possibly more accurate is that a data scientist (DS) is “a better software engineer than any statistician and a better statistician than any software engineer”.
    52. 52. DSs are also part artist and part engineer. They need a toolbox of techniques, skills, processes and abilities from which to construct novel solutions. And they need the ability to create a UI that turns their abstract findings into something that the users of the system can understand, so DSs also need the skills to create elegant visualisations that turn raw data into information.
    53. 53. And (yes, there’s more) they need to be able to communicate well with people. There is little use in creating a superb analytical process if you can’t communicate how and why it works to the board members.
    54. 54. And then there is the curiosity. Duncan Ross (Director of Data Sciences at Teradata) characterised data scientists well: The first and most important trait is curiosity. Insane curiosity. In many walks of life evolution selects against the kind of person who decides to find out what happens “if I push that button”. Data Science selects for it.
    55. 55. So, what are the general characteristics of a DS? They include: • insatiable curiosity (see above) • interdisciplinary interests • excellent communication skills • excellent analytical capabilities
    56. 56. DSs also need a good working knowledge of: • machine learning techniques • data mining • statistics • maths • algorithm development • code development • data visualisation • multi-dimensional database design and implementation
    57. 57. Specific skills include the technologies to handle big data: • NoSQL databases • Hadoop and related technologies • MapReduce and its implementation on differing software platforms
    58. 58. DSs also have an intimate knowledge of languages such as: • SQL • MDX • R • Functional and OOP languages such as Erlang and Java
    59. 59. Most of all, no matter what they are called, all true data scientists have started playing with some data at 8:00PM and suddenly found it is 3:00AM.
    60. 60. Case Study: Twitter. Who loves you? Social/text/sentiment
    61. 61. Consider the humble tweet…
    62. 62. Consider the humble tweet… As, indeed, Sally Bercow should have done
    63. 63. Consider the humble tweet… As, indeed, Sally Bercow should have done *Innocent Face*
    64. 64. Consider the humble tweet… I’d just like to apologise for that last slide but I would point out that it “contained no accusation whatsoever … Mischievous but not libellous.”
    65. 65. Case Study: Oil Rig data. Gone fishing. Sensor data
    66. 66. Lessons learned: • Engagement • Choose your battles: look for an area where you can gain competitive advantage • Choose your platform carefully • Programming: algorithm development • Data scientists • Custom algorithms • Custom visualisations
    67. 67. Thank you very much for listening. Any Questions? Mark Whitehorn (MarkWhitehorn@computing.dundee.ac.uk)
    68. 68. BIG DATA - AS OPPOSED TO SMALL DATA. 60 minutes. Mark Whitehorn