US Office of Personnel Management: Notes on "Big Data"



A presentation given December 13, 2012 for the US Office of Personnel Management "Big Data" initiative

  • Without systematic management of data, knowledge is at risk…
  • All data go through processes of development. This 1986 NASA publication is still an excellent guide to the basics of scientific data management…
  • Writing over 100 years ago, T.C. Chamberlin suggested that his way of structuring hypotheses – the method of multiple working hypotheses – was superior…
  • The “definition” phase of the data management cycle is often neglected. In some instances of “big science,” definitions for certain types of data are standard and fully specified. In other cases, it is assumed that the reason for a data type is obvious… Nevertheless, in planning for full life-cycle management, the data type definition – the basis for selecting a given type of data – is essential, even if given only a cursory explanation. For purposes of policy formation, wise planners will always ask how the given data type(s) compare with all other possible – or conceivable – data types.
  • “Life-cycle” as a metaphor implies a dependent developmental sequence. In fact, effective data management is more complicated than a simple sequence. There are certainly dependent developmental sequences, but there are also sustained values that must be attended to throughout any cycle: for example, protection of the integrity of data so that no extraneous effects inadvertently deform data; rigorous record keeping; documenting provenance and the chain-of-custody/lineage of data; careful documentation of scientific workflow (expert competence of creators, apparatus, calibration, methodology); the sequences of transformation; migration/emulation decisions; retention/disposition decisions…
  • Data managers are usually not privy to the data definition process; hence there is a natural tendency to treat data definition as a given and to begin full life-cycle management with “acquisition.” Nevertheless, managers should include in their descriptive tasks some explanation of data type definition (an explanation which should account for the selection of a given data type over and against primary alternatives)
  • The US IWGDD model is less comprehensive and vaguer – it is only broadly indicative of necessary processes…
  • The accompanying text is more helpful but still not comprehensive…
  • Schema are very tempting – particularly given the “devil’s toolbox” provided by MS PowerPoint – but unless very rigorously employed, they can misrepresent or obfuscate…
  • The text accompanying the DCC model is very helpful in differentiating “full life cycle” actions / “sequential actions” and “occasional actions” -- the graphic is much less effective…
  • This Oracle “model” focuses on “databases” – not on “data” per se…
  • Michener’s chart from 2006 makes a better effort at suggesting constant elements and feedback loops…

    1. Some notes on “Big Data”: What does “full life-cycle” data management mean? Tom Moritz, OPM “Big Data” July, 2012
    2. Open Government and “Transparency”. Two dimensions: -- Data about government operations (all three branches!) -- Data that represent the products of government activity
    3. Case Studies
    4. “A representation of the cholera epidemic of the nineteenth century”
    7. The “Flash Crash”: “On the afternoon of May 6, 2010, the U.S. equity markets experienced an extraordinary upheaval. Over approximately 10 minutes, the Dow Jones Industrial Average dropped more than 600 points, representing the disappearance of approximately $800 billion of market value. The share price of several blue-chip multinational companies fluctuated dramatically; shares that had been at tens of dollars plummeted to a penny in some cases and rocketed to values over $100,000 per share in others. As suddenly as this market downturn occurred, it reversed, so over the next few minutes most of the loss was recovered and share prices returned to levels close to what they had been before the crash.” “Large-Scale Complex IT Systems.” By Ian Sommerville, et al. Communications of the ACM, Vol. 55 No. 7, Pages 71-77. Paul Strand: “Wall Street, New York City, 1915” Aerial view of pedestrians walking along Wall Street in strong sunlight, with a building with large recesses in the background (likely 23 Wall Street, the headquarters of J.P. Morgan & Co.). Photograph by Paul Strand (a student of Lewis Hine), 1915; published in Camera Work, v. 48, p. 25, October 1916.
    8. The “Flash Crash” (2): “…the trigger event was identified as a single block sale of $4.1 billion of futures contracts executed with uncommon urgency on behalf of a fund-management company. That sale began a complex pattern of interactions between the high-frequency algorithmic trading systems (algos) that buy and sell blocks of financial instruments on incredibly short timescales. “A software bug did not cause the Flash Crash; rather, the interactions of independently managed software systems created conditions unforeseen (probably unforeseeable) by the owners and developers of the trading systems. Within seconds, the result was a failure in the broader socio-technical markets that increasingly rely on the algos…” “Large-Scale Complex IT Systems.” By Ian Sommerville, et al. Communications of the ACM, Vol. 55 No. 7, Pages 71-77.
    9. The “Flash Crash” (3): Key Insights. “Coalitions of systems, in which the system elements are managed and owned independently, pose challenging new problems for systems engineering.” “When the fundamental basis of engineering – reductionism – breaks down, incremental improvements to current engineering techniques are unable to address the challenges of developing, integrating, and deploying large-scale complex IT systems.” “Developing complex systems requires a socio-technical perspective involving human, organizational, social and political factors, as well as technical factors.” “Large-Scale Complex IT Systems.” By Ian Sommerville, et al. Communications of the ACM, Vol. 55 No. 7, Pages 71-77.
    10. The Digital Environment…
    11. The “Ecology” of Digital Data [diagram spanning “Small Science” to “BIG Science”: personal archiving, local/individual efforts, libraries, cooperative projects, disciplinary initiatives, national and international data centers, collaborative research efforts, data grids]
    12. The Public Domain: “The institutional ecology of the digital environment” (Yochai Benkler). Sectors (public <-> private) and jurisdictional scale. The Role of Scientific and Technical Data and Information in the Public Domain: Proceedings of a Symposium. Julie M. Esanu and Paul F. Uhlir, Editors. Steering Committee on the Role of Scientific and Technical Data and Information in the Public Domain, Office of International Scientific and Technical Information Programs, Board on International Scientific Organizations, Policy and Global Affairs Division, National Research Council of the National Academies, p. 5
    13. The “small science,” independent investigator approach traditionally has characterized a large area of experimental laboratory sciences, such as chemistry or biomedical research, and field work and studies, such as biodiversity, ecology, microbiology, soil science, and anthropology. The data or samples are collected and analyzed independently, and the resulting data sets from such studies generally are heterogeneous and unstandardized, with few of the individual data holdings deposited in public data repositories or openly shared. The data exist in various twilight states of accessibility, depending on the extent to which they are published, discussed in papers but not revealed, or just known about because of reputation or ongoing work, but kept under absolute or relative secrecy. The data are thus disaggregated components of an incipient network that is only as effective as the individual transactions that put it together. Openness and sharing are not ignored, but they are not necessarily dominant either. These values must compete with strategic considerations of self-interest, secrecy, and the logic of mutually beneficial exchange, particularly in areas of research in which commercial applications are more readily identifiable. The Role of Scientific and Technical Data and Information in the Public Domain: Proceedings of a Symposium. Julie M. Esanu and Paul F. Uhlir, Eds. Steering Committee on the Role of Scientific and Technical Data and Information in the Public Domain, Office of International Scientific and Technical Information Programs, Board on International Scientific Organizations, Policy and Global Affairs Division, National Research Council of the National Academies, p. 8
    14. “Small” data collections may become “Big” (and more complex) by successive aggregation of sources…
    15. Linked Open Data [LOD cloud diagrams, 2009, 2011, 2012]. Courtesy of Tim Lebo, RPI, June 2012 (@timrdf)
    16. “Data”? [technical definition] “…’data’ are defined as any information that can be stored in digital form and accessed electronically, including, but not limited to, numeric data, text, publications, sensor streams, video, audio, algorithms, software, models and simulations, images, etc.” -- Program Solicitation NSF 07-601, “Sustainable Digital Data Preservation and Access Network Partners (DataNet)”. Taken in this broadest possible sense, “data” are thus simply electronically coded forms of information, and virtually anything can be represented as “data” so long as it is electronically machine-readable.
    17. “The digital universe in 2007 — at 2.25 × 10²¹ bits (281 exabytes or 281 billion gigabytes) — was 10% bigger than we thought. The resizing comes as a result of faster growth in cameras, digital TV shipments, and better understanding of information replication. “By 2011, the digital universe will be 10 times the size it was in 2006. “As forecast, the amount of information created, captured, or replicated exceeded available storage for the first time in 2007. Not all information created and transmitted gets stored, but by 2011, almost half of the digital universe will not have a permanent home. “Fast-growing corners of the digital universe include those related to digital TV, surveillance cameras, Internet access in emerging countries, sensor-based applications, datacenters supporting “cloud computing,” and social networks. The Diverse and Exploding Digital Universe: An Updated Forecast of Worldwide Information Growth through 2011 -- Executive Summary. IDC Information and Data, March, 2008
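    The headline figure converts straightforwardly; a quick sanity check of the arithmetic (using the decimal exabyte, 10^18 bytes):

    ```python
    # Sanity check of the IDC figure: 2.25e21 bits expressed in exabytes.
    bits = 2.25e21
    bytes_total = bits / 8            # 8 bits per byte
    exabytes = bytes_total / 1e18     # decimal exabyte = 10^18 bytes
    print(exabytes)                   # ~281.25, matching "281 exabytes"
    ```
    
    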
    18. “As you go down the Long Tail the signal-to-noise ratio gets worse. Thus the only way you can maintain a consistently good enough signal to find what you want is if your filters get increasingly powerful.” Chris Anderson, “Is the Long Tail full of crap?” May 22, 2005
    19. “Data” [epistemic definition]: “Measurements, observations or descriptions of a referent -- such as an individual, an event, a specimen in a collection or an excavated/surveyed object -- created or collected through human interpretation (whether directly “by hand” or through the use of technologies)” -- AnthroDPA Working Group on Metadata (May, 2009)
    20. Data Entropy: the risks of inaction and the urgency of action. “…data longevity is increased. Comprehensive metadata counteract the natural tendency for data to degrade in information content through time (i.e. information entropy sensu Michener et al., 1997; Fig. 1).” W. K. Michener, “Meta-information concepts for ecological data management.” Ecological Informatics 1 (2006) 3-7
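    The entropy claim can be pictured with a toy decay model. This sketch is purely illustrative (it is not from Michener's paper, and the decay rates are invented): the recoverable information content of a dataset falls over time, and good metadata slows the decline.

    ```python
    # Toy illustration of "data entropy": exponential decay of the fraction
    # of a dataset's information that remains interpretable, with metadata
    # quality modeled as a lower decay rate. Rates here are hypothetical.
    import math

    def recoverable_fraction(years, decay_rate):
        """Fraction of the original information still interpretable after `years`."""
        return math.exp(-decay_rate * years)

    # Hypothetical comparison after 20 years:
    undocumented = recoverable_fraction(20, decay_rate=0.15)  # sparse metadata
    documented = recoverable_fraction(20, decay_rate=0.02)    # rich metadata
    ```

    The point of the comparison is qualitative, not quantitative: whatever the true rates, comprehensive metadata shifts the curve, which is the argument the slide makes for acting early.
    
    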
    21. Data Development: “Data Reduction - Processing Level Definitions” (an example). Report of the EOS Data Panel, Vol. IIA, NASA, 1986 (Tech Memorandum 87777)
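    The processing levels referenced here became a widely used convention in Earth-observation data management. The following is a paraphrase, not the report's verbatim text, of how those levels are commonly summarized; it can serve as a checklist when deciding how far raw data should be reduced before archiving.

    ```python
    # Paraphrased summary (not verbatim from the 1986 report) of the EOS
    # data processing level conventions, as a simple lookup table.
    PROCESSING_LEVELS = {
        "Level 0": "raw instrument data at full resolution, communication artifacts removed",
        "Level 1A": "reconstructed, time-referenced data annotated with calibration information",
        "Level 1B": "Level 1A data processed to sensor units (not produced for all instruments)",
        "Level 2": "derived geophysical variables at the same resolution as the source data",
        "Level 3": "variables mapped onto uniform space-time grids",
        "Level 4": "model output or results from analyses of lower-level data",
    }

    def describe(level):
        """Look up the conventional definition of a processing level."""
        return PROCESSING_LEVELS.get(level, "unknown level")
    ```
    
    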
    22. T.C. Chamberlin
    23. Hypotheses and data as evidence: Inductive <--> Deductive feedback loops? “What science does is put forward hypotheses, and use them to make predictions, and test those predictions against empirical evidence. Then the scientists make judgments about which hypotheses are more likely, given the data. These judgments are notoriously hard to formalize, as Thomas Kuhn argued in great detail, and philosophers of science don’t have anything like a rigorous understanding of how such judgments are made. But that’s only a worry at the most severe levels of rigor; in rough outline, the procedure is pretty clear. Scientists like hypotheses that fit the data, of course, but they also like them to be consistent with other established ideas, to be unambiguous and well-defined, to be wide in scope, and most of all to be simple. The more things an hypothesis can explain on the basis of the fewer pieces of input, the happier scientists are.” -- Sean Carroll, “Science and Religion are not Compatible,” Discover Magazine, June 23, 2009
    24. Full Life Cycle Management?
    25. US NSF “DataNet” Program: “the full data preservation and access lifecycle” • “acquisition” • “documentation” • “protection” • “access” • “analysis and dissemination” • “migration” • “disposition”. “Sustainable Digital Data Preservation and Access Network Partners (DataNet) Program Solicitation,” NSF 07-601, US National Science Foundation, Office of Cyberinfrastructure, Directorate for Computer & Information Science & Engineering
    26. IWGDD = [US] “Interagency Working Group on Digital Data”
    27. IWGDD “DIGITAL DATA LIFE CYCLE” Exhibit B-2. Life Cycle Functions for Digital Data:
       • Plan: determine what data need to be created or collected to support a research agenda or a mission function; identify and evaluate existing sources of needed data; identify standards for data and metadata format and quality; specify actions and responsibilities for managing the data over their life cycle
       • Create: produce or acquire data for intended purposes; deposit data where they will be kept, managed and accessed for as long as needed to support their intended purpose; produce derived products in support of intended purposes, e.g., data summaries, data aggregations, reports, publications
       • Keep: organize and store data to support intended purposes; integrate updates and additions into existing collections; ensure the data survive intact for as long as needed
       • Acquire and implement technology: refresh technology to overcome obsolescence and to improve performance; expand storage and processing capacity as needed; implement new technologies to support evolving needs for ingesting, processing, analysis, searching and accessing data
       • Disposition: exit strategy, i.e. plan for transferring data to another entity should the current repository no longer be able to keep it; once intended purposes are satisfied, determine whether to destroy data or transfer them to another organization suited to addressing other needs or opportunities
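    The Exhibit B-2 functions lend themselves to a simple machine-readable form. This is an illustrative sketch, not an official IWGDD schema: the phases as an ordered mapping, with a small helper (`uncovered_phases` is a hypothetical name) for checking that a draft data-management plan addresses every phase.

    ```python
    # Illustrative sketch (not an official IWGDD schema): the Exhibit B-2
    # life-cycle functions as an ordered mapping of phase -> actions.
    IWGDD_LIFE_CYCLE = {
        "Plan": [
            "determine what data need to be created or collected",
            "identify and evaluate existing sources",
            "identify standards for data and metadata format and quality",
            "specify actions and responsibilities across the life cycle",
        ],
        "Create": [
            "produce or acquire data for intended purposes",
            "deposit data where they will be kept, managed and accessed",
            "produce derived products (summaries, aggregations, reports)",
        ],
        "Keep": [
            "organize and store data to support intended purposes",
            "integrate updates and additions into existing collections",
            "ensure the data survive intact for as long as needed",
        ],
        "Acquire and implement technology": [
            "refresh technology to overcome obsolescence",
            "expand storage and processing capacity",
            "implement new technologies for ingest, processing, analysis and access",
        ],
        "Disposition": [
            "plan an exit strategy for transferring data to another entity",
            "decide whether to destroy or transfer data once purposes are met",
        ],
    }

    def uncovered_phases(plan_sections):
        """Return the life-cycle phases a draft plan does not yet address."""
        covered = set(plan_sections)
        return [phase for phase in IWGDD_LIFE_CYCLE if phase not in covered]
    ```

    For example, a plan that only covers Plan, Create and Keep would be flagged as missing the technology-refresh and disposition phases, the two most commonly neglected in practice.
    
    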
    29. “JISC DCC Curation Lifecycle Model”
    30. Database Lifecycle Management: “Database Lifecycle Management covers the entire lifecycle of the databases, including: • Discovery and inventory tracking: the ability to discover your assets, and track them • Initial provisioning: the ability to roll out databases in minutes • Ongoing change management: end-to-end management of patches, upgrades, schema and data changes • Configuration management: track inventory, configuration drift and detailed configuration search • Compliance management: reporting and management of industry and regulatory compliance standards • Site-level disaster protection automation”
    31. W. K. Michener, “Meta-information concepts for ecological data management.” Ecological Informatics 1 (2006) 3-7
    32. “Sustainable data curation”: “There are several main elements necessary to sustain data curation: “Robust data storage facilities (hardware and software) that are capable of accurately handling data migration across generations of media. “Backup plans, that are tested, so irreplaceable data are not at risk. Unintended data loss can occur for many reasons; some major causes are: poor stewardship leading to the loss of metadata to understand where the data is located and documentation to understand the content, physical facility and equipment failure (fire, flood, irrecoverable hardware crashes), accidental data overwrite or deletion. “Science-educated staff with knowledge to match the data discipline is important for checking data integrity, choosing archive organization, creating adequate metadata, consulting with users, and designing access systems that meet user expectations. Staff responsible for stewardship and curation must understand the digital data content and potential scientific uses.” C.A. Jacobs, S. J. Worley, “Data Curation in Climate and Weather: Transforming our ability to improve predictions through global knowledge sharing,” 4th International Digital Curation Conference, December 2008, page 10. 2008/programme/papers/Data%20Curation%20in%20Climate%20and%20Weather.pdf [03 02 09]
    33. Sustainable data curation (cont.): “Non-proprietary data formats that will ensure data access capability for many decades and will help avoid data losses resulting from software incompatibilities… “Consistent staffing levels and people dedicated to best practices in archiving, access, and stewardship… “National and international partnerships and interactions greatly aid in shared achievements for broad-scale user benefits, e.g. reanalyses, TIGGE… “Stable funding not focused on specific projects, but data management in general…” C.A. Jacobs, S. J. Worley, “Data Curation in Climate and Weather: Transforming our ability to improve predictions through global knowledge sharing,” 4th International Digital Curation Conference, December 2008, pages 10-11. 2008/programme/papers/Data%20Curation%20in%20Climate%20and%20Weather.pdf [03 02 09]
    34. “Data Quality”? In general colloquial terms, “Data Quality” is the fundamental issue of concern to scientists, policy makers, managers/decision makers and the general public. “Quality” can be considered in terms of three primary values: • Validity: logical in terms of the intended hypothesis to be tested (all potential types of data that could be chosen should be weighed for probative value…) • Competence (Reliability): consideration of the proper choice of expert staff, methods, apparatus/gear, calibration, deployment and operation • Integrity: the maintenance of the original integrity of data, as well as tracking and documenting of all recording, migration, transformations and sequences of transformation of data
    35. “…the “validation” of any scientific hypotheses rests upon the sum integrity of all original data and of all sequences of data transformation to which original data have been subject.” – Tom Moritz, “The Burden of Proof”
    36. A Primary Goal of Open Government: Public Access to Data that is: • Of high quality (SEE previous discussion) • Free: no cost or minimal cost • Open: easily discoverable and accessible; “A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.” • Effective / Useful / Usable: both technically usable and descriptively identified in ways that support ready analysis, citation, use, reuse… T. Moritz, “The Burden of Proof: Data as Evidence in Science and Public Policy,” Microsoft Research, GRDI2020, Stellenbosch, South Africa, Sept., 2010
    37. Thanks for your attention… Tom Moritz, Tom Moritz Consultancy, Los Angeles, +1 310 963 0199, tommoritz (Skype)
    38. Saturn images courtesy of R J Robbins and The Research Coordinating Network for the Genomics Standards Consortium…
    40. Rosalind Franklin’s Image: “Franklin’s B-form data, in conjunction with cylindrical Patterson map calculations that she had applied to her A-form data, allowed her to determine DNA’s density, unit-cell size, and water content. With those data, Franklin proposed a double-helix structure with precise measurements for the diameter, the separation between each of the coaxial fibers along the fiber axis direction, and the pitch of the helix.” “The diffraction photograph of the B form of DNA taken by Rosalind Franklin in May 1952 was by far the best photograph of its kind. Data derived from this photograph were instrumental in allowing James Watson and Francis Crick to construct their Nobel Prize-winning model for DNA.” (Courtesy of the Norman Collection on the History of Molecular Biology in Novato, Calif.)
    41. “Notebook entries show that Rosalind Franklin (a) recognized that the B form of DNA was likely to have a two-chained helix; (b) was aware of the Chargaff ratios; (c) knew that most, if not all, of the nitrogenous bases in DNA were in the keto configuration…; and (d) determined that the backbone chains of A-form DNA are antiparallel.” (Courtesy of Anne Sayre and Jenifer Franklin Glynn.)
    42. “Transcript of letter from James Watson to Max Delbruck, March 12, 1953”: “The basic structure is helical – it consists of two intertwining helices…”