Sailing on the ocean of 1s and 0s
Slide notes
  • Thanks to DevExpress for having me.
  • The world of science has changed, and there is no question about this. The new model is for the data to be captured by instruments or generated by simulations before being processed by software and for the resulting information or knowledge to be stored in computers. Scientists only get to look at their data fairly late in this pipeline. The techniques and technologies for such data-intensive science are so different that it is worth distinguishing data-intensive science from computational science as a new, fourth paradigm for scientific exploration.
  • Data is everything and everywhere.
  • Digital curation generally refers to the process of establishing and developing long-term repositories of digital assets for current and future reference[2] by researchers, scientists, historians, and scholars. Enterprises are starting to utilize digital curation to improve the quality of information and data within their operational and strategic processes.[4]
  • Identify what data you need to curate: Will you be curating newly created data and/or legacy data? How is new data created? Do users create the data, or is it imported from an external source? How frequently is new data created/updated? What quantity of data is created? How much legacy data exists? Where does the legacy data reside? Is it stored within a single source, or scattered across multiple sources?
    Identify who will curate the data: Curation activities can be carried out by individuals, departments or groups, institutions, communities, etc.
    Define the curation workflow: How will curation activities be carried out? The curation process will be heavily influenced by the previous two questions. The two main methods to curate data are a curation group/department or a sheer curation workflow that enlists the support of users.
    Identify the most appropriate data-in and data-out formats: What is the best format for the data to be expressed? Is there an agreed standard format within the industry or community? Choosing the right data format for receiving data and publishing curated data is critical; often a curation effort will need to support multiple formats to ensure maximum participation.
    Identify the artifacts, tools, and processes needed to support the curation process: A number of artifacts, tools, and processes can support data curation efforts, including workflow support and web-based community collaboration platforms. A number of algorithms exist to automate or semi-automate curation activities [6], such as data cleansing, record duplication, and classification algorithms [7] that can be used within sheer curation.
  • XML provides an elemental syntax for content structure within documents, yet associates no semantics with the meaning of the content contained within. XML is not at present a necessary component of Semantic Web technologies in most cases, as alternative syntaxes exist, such as Turtle. Turtle is a de facto standard, but has not been through a formal standardization process.
    XML Schema is a language for providing and restricting the structure and content of elements contained within XML documents.
    RDF is a simple language for expressing data models, which refer to objects ("resources") and their relationships. An RDF-based model can be represented in a variety of syntaxes, e.g., RDF/XML, N3, Turtle, and RDFa.[22] RDF is a fundamental standard of the Semantic Web.[23][24][25]
    RDF Schema extends RDF and is a vocabulary for describing properties and classes of RDF-based resources, with semantics for generalized hierarchies of such properties and classes.
    OWL adds more vocabulary for describing properties and classes: among others, relations between classes (e.g. disjointness), cardinality (e.g. "exactly one"), equality, richer typing of properties, characteristics of properties (e.g. symmetry), and enumerated classes.
    SPARQL is a protocol and query language for semantic web data sources.
  • The mandated miles per gallon increased each year, as shown by the numbers along the right side of the drawing. The problem with this picture is that those numbers are represented by horizontal lines, and those lines are not nearly proportional to the numbers. For example, the line representing 18 is 0.6 inches long, yet the line representing 27.5 is 5.3 inches long.
    Tufte created a formula to quantify this kind of misleading graphic. He called it the Lie Factor. The Lie Factor is equivalent to the size of the effect shown in the graphic, divided by the size of the effect in the data (Figure 3c). In the fuel economy example, the data increase is 53%, but the graphical increase is 783%, resulting in a Lie Factor of 14.8!
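As a quick check of the numbers in that last note, Tufte's Lie Factor for the fuel economy chart works out as:

$$ \text{Lie Factor} = \frac{\text{size of effect shown in graphic}}{\text{size of effect in data}} = \frac{783\%}{53\%} \approx 14.8 $$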
  • Transcript

    • 1. Sailing on the Ocean of 1's and 0's
    • 2. Chris Woodruff
      Chris Woodruff
      cwoodruff@live.com
      Blog – http://chriswoodruff.com
      Technical Architect -- Perficient
      Coordinator, Grand Rapids DevDay
      INETA Director
      Co-host of Deep Fried Bytes Tech Podcast
      http://deepfriedbytes.com
    • 3. Where are we sailing today?
      Let's look at Data
      Go on to making Data valuable
      Look at ways to share Data
      Finally, let's talk about making Data look good
    • 4. Science Paradigms
      1,000s of Years Ago
      Science was empirical
      Describing
      100s of Years Ago
      Theoretical
      Using Models
      Last Few Decades
      Computational
      Simulations
      Today (eScience)
      Data Exploration
      Unified Theory
      Data Generated by Instruments or Simulations
      Scientists analyze the data after it is curated
      from The Fourth Paradigm: Data-Intensive Scientific Discovery
    • 5. Before we get into the water, let's talk about the Digital Ocean
      The Internet
    • 6. Why the Internet Won
      Simple architecture - HTML, URI, HTTP
      Networked - value grows with data, services, users
      Extensible - from Web of documents to ...
      Tolerant - even w/ imperfect mark-up, data, links, software
      Universal - independent of systems and people
      Free / cheap - browsers, information, services
      Simple / powerful / productive for users - text, graphics, links
      Open standards
    • 7. What is Data?
      The term data refers to qualitative or quantitative attributes of a variable or set of variables. Data (plural of "datum") are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which information and then knowledge are derived. Raw data, i.e. unprocessed data, refers to a collection of numbers, characters, images or other outputs from devices that collect information to convert physical quantities into symbols.
    • 8. What really is Data?
      Information that has no meaning or understanding.
    • 9. What is Data Really?
    • 10. Where is Data produced?
    • 11. How much data is generated on the internet every year/month/day?
    • 12. How much data is moved on the internet every month/day?
      21 exabytes per month
      Around 675 petabytes per day
      The amount of data produced each year would fill 37,000 libraries the size of the Library of Congress. (2003)
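As a rough cross-check of the two figures above (assuming decimal prefixes and a 31-day month), the monthly and daily traffic numbers are consistent:

$$ \frac{21\ \text{EB/month}}{31\ \text{days/month}} \approx 0.68\ \text{EB/day} \approx 680\ \text{PB/day} $$

which is in line with the quoted figure of around 675 petabytes per day.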
    • 13. Exabyte == a quintillion (or a million trillion) bytes or units of computer data. One exabyte is equivalent to 50,000 years’ worth of DVD-quality data.
    • 14. How much data does twitter produce?
      Twitter users are averaging 27.3 million tweets per day with an annual run rate of 10 billion tweets
      According to data from Pingdom
    • 15. How much Data is Facebook generating?
      More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) shared each month.
      Average user creates 90 pieces of content each month
    • 16. Internet users are generating petabytes of data every day
    • 17. How much Data does your organization produce?
    • 18. Curating Data
    • 19. Definition
      “Data curation is the selection, preservation, maintenance, collection and archiving of digital assets.”
    • 20. What is involved in Data Curation?
      Collecting verifiable digital assets
      Providing digital asset search and retrieval
      Certification of the trustworthiness and integrity of the collection content
      Semantic and ontological continuity and comparability of the collection content
    • 21. Challenges of Data Curation
      Storage format evolution and obsolescence
      Rate of creation of new data and data sets
      Broad access and searching flexibility and variety
      Comparability of semantic and ontological definitions of data sets
    • 22. Setting up a Curation Process
      Identify what data you need to curate
      Identify who will curate the data
      Define the curation workflow
      Identify the most appropriate data-in and data-out formats
      Identify the artifacts, tools, and processes needed to support the curation process
    • 23. Tools to Curate Data
      Physical
      SQL Databases
      Wikis
      SharePoint
      Data Warehouses
      Collaborative
      DBPedia
      Azure Datamarket
      Semantics!!
    • 24. “Open” Data
    • 25. Semantic Web
      XML provides an elemental syntax for content structure within documents, yet associates no semantics with the meaning of the content contained within.
    • 26. XML Schema is a language for providing and restricting the structure and content of elements contained within XML documents.
    • 27. RDF is a simple language for expressing data models, which refer to objects ("resources") and their relationships.
    • 28. RDF Schema extends RDF and is a vocabulary for describing properties and classes of RDF-based resources, with semantics for generalized hierarchies of such properties and classes.
    • 29. OWL adds more vocabulary for describing properties and classes: among others, relations between classes (e.g. disjointness), cardinality (e.g. "exactly one"), equality, richer typing of properties, characteristics of properties (e.g. symmetry), and enumerated classes.
    • 30. SPARQL is a protocol and query language for semantic web data sources.
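To make the RDF and SPARQL slides above concrete, here is a minimal sketch in Python. It assumes the third-party rdflib package and a made-up http://example.org/ vocabulary; it illustrates the ideas, not code from the talk.

# Minimal RDF + SPARQL sketch using the rdflib package (pip install rdflib).
# The example.org namespace and the resources in it are made up for illustration.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/curation/")

g = Graph()
g.bind("ex", EX)

# Express a tiny data model as RDF triples: resources and their relationships.
g.add((EX.MinardDiagram, RDF.type, EX.Visualization))
g.add((EX.MinardDiagram, RDFS.label, Literal("Napoleon's March on Moscow")))
g.add((EX.MinardDiagram, EX.createdBy, EX.CharlesMinard))
g.add((EX.CharlesMinard, RDFS.label, Literal("Charles Joseph Minard")))

# SPARQL query over the graph: find every visualization and its label.
results = g.query("""
    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ex:   <http://example.org/curation/>

    SELECT ?viz ?label
    WHERE {
        ?viz rdf:type ex:Visualization .
        ?viz rdfs:label ?label .
    }
""")

for viz, label in results:
    print(viz, label)

# The same graph can be serialized in Turtle (or RDF/XML, N3, ...) for sharing.
print(g.serialize(format="turtle"))

Turtle is used for the serialization only because it is the most readable of the syntaxes mentioned above; any of the others would carry the same triples.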
    • Open Data Protocol (OData)
      The Open Data Protocol (OData) enables the creation of HTTP-based data services, which allow resources identified using Uniform Resource Identifiers (URIs) and defined in an abstract data model, to be published and edited by Web clients using simple HTTP messages.
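As a hedged illustration of what one of these HTTP-based data services looks like from the client side, the sketch below queries a hypothetical OData endpoint with Python's requests library. The base URL and the Products entity set are assumptions made for the example; $filter, $orderby, and $top are standard OData system query options.

# Sketch of consuming a hypothetical OData service over plain HTTP.
# The base URL and the "Products" entity set are made up for illustration;
# $filter, $orderby and $top are standard OData system query options.
import requests

BASE_URL = "https://example.org/odata/CatalogService.svc"  # hypothetical endpoint

# Resources are addressed by URI; the query itself lives in the URL.
url = BASE_URL + "/Products?$filter=Price%20lt%2010&$orderby=Name&$top=5"

response = requests.get(url, headers={"Accept": "application/json"})
response.raise_for_status()

payload = response.json()
# OData v4 returns results under "value"; older versions use {"d": {"results": [...]}}.
rows = payload.get("value") or payload.get("d", {}).get("results", [])
for row in rows:
    print(row)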
    • 31. The Key to “Open Data”?
      Shared, Agreed-upon Protocols
      Metadata
      Shared Vocabularies
    • 32. Visualization of Data
    • 33. Think about your Data
    • 34. Produce Great Graphical Information
    • 35. Minard's Diagram of Napoleon's March on Moscow
    • 36. Have Integrity in your Graphical Information
      Edward Tufte’s
      The Lie Factor
    • 37. Have Context with your Graphical Information
    • 38. Use less “Ink”
    • 39. Get Rid of the Junk
    • 40. Thanks Dave Giard!!!
    • 41. Examples of Great Visual Data
    • 42.
    • 43.
    • 44. Data Experience (DX)
    • 45. Wrap Up
      Think about your data
      Learn more about how your users work with the data you curate
      Learn about better ways to share your data
      Visualize and show the information in your data in the way that works best for your users
      Be a Data Experience Expert
    • 46. The Fourth Paradigm: Data-Intensive Scientific Discovery
      Required Reading
    • 47. The Visual Display of Quantitative Information
      Required Reading
    • 48. Beautiful Visualization: Looking at Data through the Eyes of Experts
      Required Reading
    • 49. Discussions
