Beautifying Data in the real world



Published in: Education, Technology

  1. 1. Instructor: Professor Lothar Piepmeyer. Beautifying Data in the Real World. Group 5: Toan Do, An Du, Vinh Nguyen, Tan Tran
  2. 2. How big is the data on the Internet? 2004: the Internet exceeds 1 EB for the first time. 2005: Eric Schmidt estimated it at 5 million terabytes (~5 EB). Cisco forecasts that in 2015 the size of the Internet will reach nearly 1,000 EB. How big is it? Source:
  3. 3. If 1 byte = 0.5 mm. Source:
  4. 4. Content: Introduction; the Open Notebook Science approach; Curating and presenting the data; Beautifying the data; Data Visualization & Building a portal from open data and free services; Demonstration
  5. 5. Data on the internet Source:
  6. 6. Problems of (scientific) data in the real world: noisy sources of data; the barrier of data presentation (OCR version, text version, human-readable, machine-readable, …); how to verify the data?
  7. 7. Open Notebook Science (ONS). Purpose: record the full raw data of scientific research and make it available online. Benefits: obtain detailed descriptions of procedures; improve the communication of science; speed up progress; reduce time lost to the repetition of failed experiments; …
  8. 8. Apply ONS on free services
  9. 9. Crowdsourcinga distributed problem-solving and production model
  10. 10. Crowdsourcing
  11. 11. Crowdsourcing
  12. 12. Crowdsourcing Source:
  13. 13. Validating crowdsourced data: following ONS, all detailed data are recorded; doubtful data are also kept and marked for
  14. 14. Unique Identifiers for Chemical Entities: standardize data; facilitate integration with other data sets. Consider 3 possibilities: CAS Registry Number, InChI, SMILES
  15. 15. CAS Registry Number: proprietary; cannot be converted to a chemical structure; depends on an external organization to issue it. For example, the CAS number of water is 7732-18-5: the checksum 5 is calculated as (8×1 + 1×2 + 2×3 + 3×4 + 7×5 + 7×6) = 105; 105 mod 10 = 5
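
The checksum arithmetic on the slide can be sketched in a few lines of Python. This is a minimal illustration, not an official validator: the function names `cas_check_digit` and `is_valid_cas` are made up for this example, and the rule implemented is only the one the slide shows (weight the digits from right to left by 1, 2, 3, …, sum, and take the result mod 10).

```python
def cas_check_digit(body: str) -> int:
    """Check digit for a CAS number body such as '7732-18' (final digit omitted)."""
    digits = [int(ch) for ch in body if ch.isdigit()]
    # The rightmost digit gets weight 1, the next one weight 2, and so on.
    total = sum(weight * digit
                for weight, digit in enumerate(reversed(digits), start=1))
    return total % 10


def is_valid_cas(cas_number: str) -> bool:
    """True if the last hyphen-separated digit matches the computed check digit."""
    body, _, check = cas_number.rpartition("-")
    if not body or len(check) != 1 or not check.isdigit():
        return False
    return cas_check_digit(body) == int(check)


if __name__ == "__main__":
    # Water, as on the slide: 8*1 + 1*2 + 2*3 + 3*4 + 7*5 + 7*6 = 105, 105 mod 10 = 5
    print(cas_check_digit("7732-18"), is_valid_cas("7732-18-5"))
```

A check like this only catches transcription errors; it says nothing about whether the number has actually been issued.
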
  16. 16. InChI: IUPAC International Chemical Identifier; freely usable and non-proprietary; does not have to be assigned by an organization; can be computed from structural information; human-readable (with practice)
  17. 17. SMILES: Simplified Molecular-Input Line-Entry System; more human-readable than InChI; can be converted to InChI
  19. 19. Analysis Options: access to live data; get a summary; complex statistical representations of models; mark doubtful data for later consideration
  21. 21. Google Docs API: allows developers to create, retrieve, update, and delete Google Docs files and collections. Also provides some advanced features such as resource archives, Optical Character Recognition, translation, and revision history. Useful for storing data in the cloud, performing resource management, and converting document formats.
  22. 22. Google Visualization API. Chart Library: JavaScript classes. Data Table: JavaScript DataTable class. Data Source: Chart Tools Datasource protocol
  25. 25. RESTful Web Service: Representational State Transfer, a simpler alternative to SOAP and Web Services Description Language (WSDL) based Web services. Principles: use HTTP methods explicitly; be stateless; expose directory structure-like URIs; transfer XML, JavaScript Object Notation (JSON), or both.
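
The four principles can be illustrated with a small Python sketch that only builds requests and never sends them. Everything service-specific here is an assumption for illustration: the base URL `http://api.example.org/compounds` and the helper `build_request` are hypothetical and do not come from the slides. Only `urllib.request.Request` from the standard library is used.

```python
import json
import urllib.request

# Hypothetical service root; the slides do not name a concrete endpoint.
BASE_URL = "http://api.example.org/compounds"

# Each CRUD action is mapped explicitly to one HTTP method.
METHODS = {"create": "POST", "read": "GET", "update": "PUT", "delete": "DELETE"}


def build_request(action, resource_id=None, payload=None):
    """Build (but do not send) a request for one resource.

    The URI is directory-like (collection/id), the body is JSON, and the
    request carries everything the server needs, so no session state is assumed.
    """
    url = BASE_URL if resource_id is None else f"{BASE_URL}/{resource_id}"
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {"Accept": "application/json"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers,
                                  method=METHODS[action])


if __name__ == "__main__":
    req = build_request("update", "7732-18-5", {"name": "water"})
    print(req.get_method(), req.full_url)
```

Passing such a request to `urllib.request.urlopen` would send it; that step is left out because the endpoint is made up.
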
  26. 26. Compare REST and SOAP. Who's using REST? All of Yahoo's web services use REST, including Flickr, API uses it, pubsub, Bloglines, Technorati, and both eBay and Amazon have web services for both REST and SOAP. Who's using SOAP? Google seems to be consistent in implementing its web services with SOAP, with the exception of Blogger, which uses XML-RPC. You will find SOAP web services in lots of enterprise software as well.
  27. 27. Compare REST and SOAP. REST: lightweight (not a lot of extra XML markup); human-readable results; easy to build (no toolkits required). SOAP: easy to consume (sometimes); rigid (type checking, adheres to a contract); development tools.
  29. 29. An Effort to Aggregate Data from Multiple Sources. Introducing ChemSpider: an online lookup engine for chemists; 40 million substances; multiple data sources; a "link farm" to other sources
  30. 30. What is "wrong" with
  31. 31. Wikipedia.com. Not “wrong”: very informative for human beings
  32. 32. Wikipedia.com. This little guy is left behind: not machine-readable
  33. 33. Semantic Web: describing things in a way that computer applications can understand, e.g. “The Beatles was a band from Liverpool”. It describes the relationships between things (like A is a part of B and Y is a member of Z) and the properties of things (like size, weight, age, and price). “...will make all the data in the world look like one huge database” (Tim Berners-Lee)
  34. 34. Resource Description Framework: a language to describe resources on the web; a component of the Semantic Web; data is self-describing. Triples: "subject", "predicate" and "value". URIs are used to denote resources
  35. 35. RDFGraph Database Nodes EdgesWell-suited for Knowledge Representation Beautified Data => Knowledge
  36. 36. RDF Example
      <?xml version="1.0"?>
      <rdf:RDF xmlns:rdf="" xmlns:cd="http://www.recshop.fake/cd#">
        <rdf:Description rdf:about="http://www.recshop.fake/cd/Empire Burlesque">
          <cd:artist>Bob Dylan</cd:artist>
          <cd:country>USA</cd:country>
          <cd:company>Columbia</cd:company>
          <cd:price>10.90</cd:price>
          <cd:year>1985</cd:year>
        </rdf:Description>
      </rdf:RDF>
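
To make the subject/predicate/value structure of the example concrete, here is a minimal Python sketch that reads the same record with the standard library's `xml.etree.ElementTree` and lists its triples. It rests on one assumption: the `rdf` namespace URI is blank in the transcript, so the code supplies the standard W3C RDF syntax namespace. The helper name `extract_triples` is made up, and this is not a general RDF parser (a dedicated RDF library would be the usual choice for real data).

```python
import xml.etree.ElementTree as ET

# Assumption: blank in the transcript; this is the standard W3C RDF namespace.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = f"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:cd="http://www.recshop.fake/cd#">
  <rdf:Description rdf:about="http://www.recshop.fake/cd/Empire Burlesque">
    <cd:artist>Bob Dylan</cd:artist>
    <cd:country>USA</cd:country>
    <cd:company>Columbia</cd:company>
    <cd:price>10.90</cd:price>
    <cd:year>1985</cd:year>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(xml_text):
    """Return (subject, predicate, value) tuples for each rdf:Description."""
    root = ET.fromstring(xml_text)
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree tags look like '{namespace}local'; join them into one URI.
            predicate = prop.tag[1:].replace("}", "")
            triples.append((subject, predicate, prop.text))
    return triples


if __name__ == "__main__":
    for triple in extract_triples(RDF_XML):
        print(triple)
```

Each child element of the description becomes one triple, so the record yields five triples that all share the same subject URI.
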
  37. 37. Semantic Web Example: DBpedia. “Old School” Wikipedia: DBpedia entries
  38. 38. Query Language: SPARQL (pronounced "sparkle"). Query language for RDF; graph traversal; matching the triples. Example: Data: <> <> "SPARQL Tutorial". Query: SELECT ?title WHERE { <> <> ?title . } Query result: title "SPARQL Tutorial"
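
"Matching the triples" can be sketched without any SPARQL engine. The Python below is a toy matcher for a single triple pattern, not an implementation of SPARQL. The identifiers `ex:book1` and `dc:title` are hypothetical stand-ins for the two URIs that are blank in the transcript, and `match_pattern` is a name invented for this example.

```python
def match_pattern(triples, pattern):
    """Match one (subject, predicate, value) pattern against a list of triples.

    Pattern terms starting with '?' are variables; all other terms must match
    exactly. Returns one dict of variable bindings per matching triple.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value both times.
                if binding.get(term, value) != value:
                    break
                binding[term] = value
            elif term != value:
                break
        else:
            results.append(binding)
    return results


if __name__ == "__main__":
    # Hypothetical identifiers standing in for the elided URIs on the slide.
    data = [("ex:book1", "dc:title", "SPARQL Tutorial")]
    print(match_pattern(data, ("ex:book1", "dc:title", "?title")))
```

A real query with several patterns would additionally join the bindings of the individual patterns on their shared variables.
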
  39. 39. To Infinity and Beyond: DB2 and Oracle are ready for this train; Object Database Versant OODBMS, anybody?; machine-readable data: will it become self-aware?
  40. 40. “Data Finds Data” and Semantic Data Model: A Hypothesis
  41. 41. Non-Obvious Relationship Awareness: LÂM; BẢO
  42. 42. Non-Obvious Relationship Awareness: LÂM’s iPhone; LÂM; BẢO
  43. 43. Non-Obvious Relationship Awareness: LÂM’s iPhone; BẢO’s SS Galaxy; LÂM; BẢO
  44. 44. TheGioiDi; LÂM’s iPhone; BẢO’s SS Galaxy; LÂM; BẢO
  45. 45. TheGioiDi; LÂM’s iPhone; BẢO’s SS Galaxy; LÂM; BẢO
  46. 46. TheGioiDi; LÂM’s iPhone; BẢO’s SS Galaxy; LÂM; BẢO. Connection detected! Could Bao have met Lam at Thegioididong? Could they have discussed their world domination scheme during the meeting there? ???
  47. 47. TheGioiDi; LÂM’s iPhone; BẢO’s SS Galaxy; LÂM; BẢO
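
The "connection detected" step in the slides above can be sketched as a join over two relations: who owns which device, and where each device was seen. This is a hypothetical Python illustration only. The slides show a picture, not an algorithm, so the data layout and the function name `shared_places` are assumptions, and the place name follows the spelling "Thegioididong" from slide 46.

```python
from itertools import combinations


def shared_places(owns, seen_at):
    """Find pairs of people whose devices were seen at a common place.

    owns:    list of (person, device) pairs
    seen_at: list of (device, place) pairs
    Returns {(person_a, person_b): set_of_common_places}.
    """
    places_by_person = {}
    for person, device in owns:
        for seen_device, place in seen_at:
            if seen_device == device:
                places_by_person.setdefault(person, set()).add(place)

    links = {}
    for a, b in combinations(sorted(places_by_person), 2):
        common = places_by_person[a] & places_by_person[b]
        if common:
            links[(a, b)] = common
    return links


if __name__ == "__main__":
    owns = [("Lam", "Lam's iPhone"), ("Bao", "Bao's SS Galaxy")]
    seen_at = [("Lam's iPhone", "Thegioididong"),
               ("Bao's SS Galaxy", "Thegioididong")]
    print(shared_places(owns, seen_at))
```

Neither relation mentions both people, which is the point of the slides: the link only appears once the two data sets are combined.
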
  48. 48. Data Visualization & Building a portal from open data and free services
  49. 49. Visualization of Data: top million web sites (per Alexa traffic data); performed in early 2010. Source
  50. 50. Visualization of Data
  51. 51. Second Life: Second Life is a 3D world where everyone you see is a real person and every place you visit is built by people just like you.
  52. 52. 3D Visualization in SL
  53. 53. SL: The Opportunity for "Edutainment". iSchool teaching: quizzes and lectures; classrooms with PowerPoint; research center; Drexel Island on Second Life
  54. 54. 3-D Environments
  55. 55. Visualization To Suggest New Experiments
  56. 56. Building A Portal From Open Data And Free Services: freely hosted wiki service; Google Spreadsheet; Google Docs API / JavaScript; visualization and analysis services (2D, 3D); RDF / Semantic Web / web services. Cost: free or fit to the purpose
  57. 57. Key To Success: Model + Transparency; Information; Data; Records
  58. 58. Demonstration: Google Docs; Second Life
  59. 59. References: O'Reilly, Beautiful Data, Chapter 16: Beautifying Data in the Real World; is-the-internet-spoiler-not-as-big-as-itll-be-in- 2015/; SMILE to 3D, Second Life, Cg&feature=player_embedded