Big Data = Bigger Metadata


Ian White of Urban Mapping from the O'Reilly Strata conference in Feb 2012

Published in: Business, Education, Technology

  • Some background on Urban Mapping. It wasn’t a straightforward path, but it’s very relevant. We started close to 10 years ago with a printed map that reveals different layers of thematic imagery (streets, subways, neighborhoods) depending on the viewing angle. We all know what happened to print, so I shifted the business to a new medium: in 2006 or so we collected much of the same data, but now using a spatial database as opposed to regular old vector graphics in Adobe Illustrator. The writing was on the wall for licensing content to local web publishers, so we shifted again. This time we moved upstream: we continue to develop our own data, but have greatly expanded that effort to include commercial data and deliver it through our own mapping service. We do this for customers in various market segments, like Tableau Software, where we perform a few geo-services like hosting the base map and overlaying data.
  • I can be a bit of a curmudgeon and I hope a cautionary point of view has a place. Let’s talk about what Big Data is not. I’ll talk later about what it is. First thing to note is that Big Data isn’t really about data at all. But I am. It’s about tools and processes to manage and exploit info-nuggets. There’s nothing revolutionary about saying this, but I wanted to make it explicit. Second, Big Data isn’t especially new: Wall St and Walmart have been processing data and deriving value from it for decades, but they don’t talk about it. Why? Because they make money doing so and don’t need to alert the competition. Anybody hear of Teradata? Whenever companies want to talk about what they are doing, it’s usually a red flag for me, meaning the technology, industry or something else hasn’t sufficiently evolved. But I’m also not saying Big Data is a rehash of enterprise software. More on that later… Finally, Big Data has democratized access to powerful tools at little cost. This doesn’t necessarily mean everybody knows how to use these tools. There can be some blowback, such as high credit card bills, analysis without direction or objective, and a lack of knowledge about basic statistics.
  • There’s been exponential growth in data and it comes from any number of places. Some are shown here: mobile devices as probes, with vast capabilities to record all kinds of environmental variables; open government; social media; and a desire for analytics, which has been rebranded as business intelligence.
  • Processing and storage costs have dropped like rocks. Enterprise software has been offering big solutions to banking and other industries for decades, but with barriers to entry now incredibly low, virtually anybody can participate.
  • Callimachus (Kal-i-um-akus) was a noted poet at the Library of Alexandria in the 3rd century BC.
  • He created the Pinakes (pin-a-keez), or Lists, a way of organizing works in the library. He embarked on the effort to organize 120k scrolls by title, author, birthplace, father, education, summary of contents and other info. This was the first effort to systematically create a bibliographic system. A direct link to metadata two millennia later.
  • In 1595, Johan van der Does published the Nomenclator, the first instance of a printed catalog of library holdings. It represented a significant advancement over Callimachus’s lists, but it took close to two millennia to get here.
  • The modern cataloging system: the Dewey Decimal System, created in 1876. Its father was Melville Dewey. The Dewey Decimal System attempted to organize all knowledge into ten main classes. Each class is further subdivided into ten divisions, and each division into ten sections, giving ten main classes, 100 divisions and 1000 sections. It allows for infinite hierarchy, and is both numerical and faceted (linking content from different areas). Other systems followed: Universal Decimal Classification, Library of Congress, etc…
  • This photo is from the Card Division at the Library of Congress in the 1920s. The amount of physical metadata is astounding. Millions of library cards with metadata.
  • The next major advancement was in the late 1960s. Early attempts at electronic indexing focused on a taxonomy of keywords and related information. This was efficient for reporting on what the system contained, but it also kept up the long-running divorce between artifact and metadata. The Online Computer Library Center (OCLC) was created as a nonprofit to further access to library resources across institutions and decrease costs. The OCLC acquired the Dewey Decimal System and, as any standards body does, sought to perpetuate its existence over the decades. Then the internet happened.
  • That meant out with the old, in with the new. This photo is library cards going into storage. Not sure why they’d even be archived after the transition to databases was made, but that’s for another time.
  • So this is the situation. Beginning in the late 60s, electronically stored metadata began to grow. The library cards (at left) went away, but the bifurcation was complete. Total separation of the thing from the description of the thing. And it sort of made sense: IT was in its infancy, so storage and processing costs were high. Publishers also exerted a great deal of control over how they permitted libraries to index works and make them available.
  • To put the last 2000 years in perspective: Callimachus created the first crude schema, leaving a place for metadata to be stored. The Nomenclator gave us the first bibliographic catalog, printed and bound, produced annually. The Dewey Decimal System was born in 1876 and was the basis of an extensive metadata system for published works. Then… the internet happened. In the top right you see the corner of a cloud. That’s my way of representing what happens next. The volume of data produced grows exponentially, overtaking 2000-plus years of history in no time.
  • So how about the bifurcation/divorce I mentioned? The web brought the artifact and metadata together again.
  • Google Books. Sure, we have the Dewey Decimal type stuff along with ISBN, retail price, etc… but we also threw in the whole damn book: full text search. Amazon does it too.
  • In my industry, the state of metadata is horrendous. We’re stuck in the green screen days. Proprietary data formats and slow-moving vendors don’t help. While I’m the first person to admit GIS needs to get off its ass and change, radically, there’s also something the real-time streaming web can learn from us.
  • We hear about the rise of the curator: part social scientist, part librarian, part RDBMS wiz and part statistician. This is increasingly important across all industries. When dealing with a torrent of data, domain experts will be required to help make sense of it.
  • The Knowledge Hierarchy, as it is sometimes known, has been used to represent relationships between the stuff that turns into something meaningful. You could look at this as going from a letter to a sentence to a paragraph, or an ingredient to a recipe to a meal, or something else. The details don’t matter here, but I think about the fundamental building block of data. One geocoded tweet has little or no value on its own. Contrast that with per capita income for this ZIP code. By amassing enough geocoded tweets, it’s clear we can get to something meaningful, but I don’t know how many tweets that is. I do know that per capita income can directly inform my marketing plans for selling a new shampoo.
  • With that, here’s some more wet blanket for everybody. Using Google Trends, I looked at a number of terms that might indicate the old-fashioned RDBMS, SQL way of life, and most seem to follow the blue line, which represents the term ‘metadata.’ Big Data, coincidentally, first appears a few months before the first Strata conference in 2011. ‘Curation’ has a longer life but doesn’t show the surge of Big Data, and everybody’s favorite, ‘data scientist,’ doesn’t register as much more than a rounding error. I’m not using Google Trends to fully substantiate my argument, but I do hope you take a dose of skepticism before fully embracing ‘this.’
  • In closing, I’d like to leave you with an emergent cliché. It’s also my measure of how geeky an audience I have: one person’s metadata is another person’s data.
  • Big Data = Bigger Metadata
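
The Dewey arithmetic from the notes above (ten main classes, each split into ten divisions, each split into ten sections) can be sketched in a few lines of Python. This is only an illustration of the numbering scheme, not anything shown in the talk: the function name `dewey_levels` and the sample call number are made up for the example, and the subject labels attached to each number are deliberately left out.

```python
# Illustrative sketch of the Dewey Decimal hierarchy described in the notes.
# The names below (dewey_levels, MAIN_CLASSES, ...) are hypothetical.

MAIN_CLASSES = 10               # 000, 100, ..., 900
DIVISIONS = MAIN_CLASSES * 10   # 000, 010, ..., 990
SECTIONS = DIVISIONS * 10       # 000, 001, ..., 999


def dewey_levels(call_number: str) -> dict:
    """Return the main class, division and section for a Dewey number.

    Only the three digits before the decimal point define these levels;
    anything after the point is a finer subdivision and is ignored here.
    """
    whole = call_number.split(".")[0]
    if len(whole) != 3 or not whole.isdigit():
        raise ValueError(f"expected three digits before the point: {call_number!r}")
    n = int(whole)
    return {
        "main_class": n // 100 * 100,  # first digit, e.g. 500
        "division": n // 10 * 10,      # first two digits, e.g. 520
        "section": n,                  # all three digits, e.g. 526
    }


if __name__ == "__main__":
    print(MAIN_CLASSES, DIVISIONS, SECTIONS)
    print(dewey_levels("526.9"))
```

The point of the sketch is that the position of each digit is itself the metadata: the hierarchy is encoded in the number, so no separate lookup is needed to find a work's parent class.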

    1. Big Data = Bigger Meta, O’Reilly Strata Conference, February 29 2012
    2. Pivot/Skate, etc… Founded 2003 Poor man’s GIS Panamap Refounded 2006 Neighborhood boundaries Mass transit data Refocused 2009 SaaS for mapping + on-demand data
    3. Achtung! NoSQL is no panacea Big Data isn’t about data Big Data isn’t new Big Data doesn’t present a Boolean quandary With power comes responsibility AWS bills Lady Gaga tweets Innumeracy (correlation v causation)
    4. Big v Important Big Important Heterogeneous Well-defined schema Raw High value (not free) Distributed Test-driven Streaming/real time Relational Search for meaning Historical Time-sensitive Enterprise-focused Philosophical
    5. Data Exhaust Analytics Probes Social Media Gov 2.0
    6. Platforms Commoditization of compute and storage
    7. A Brief History of Metadata Callimachus Library of Alexandria, Egypt
    8. A Brief History of Metadata “Pinakes” (lists) Title Category Author Author birthplace Father Word count Callimachus
    9. A Brief History of Metadata
    10. A Brief History of Metadata
    11. A Brief History of Metadata: Card catalog room, Library of Congress c. 1920
    12. A Brief History of Metadata Dewey Decimal System goes electronic in 1967
    13. Out with the Old, in with the New: Archiving card catalogs after digitization
    14. Why Can’t We Be Together? Metadata Data
    15. Exponential Growth in Data: Unprecedented rate of data creation, 1995-today. Data over time: Pinakes (300 BC), Catalog (1595 AD), Taxonomy (1876), Database (1970)
    16. Oh, How I’ve Missed You: The reunification of metadata and the artifact
    17. Together At Last
    18. GIS Data is Unevolved + =
    19. Enter the Data Curator: Part social scientist, part librarian, part statistician, part RDBMS wiz
    20. DIKW Model Data Fact, Signal, Symbol Information Structural v Functional Symbolic v Subjective Knowledge Processed Procedural Propositional
    21. Popularity (Google Trends)
    22. Words to Live By dx / dt
    23. Thank you! R.I.P. Schema