Is this BIG DATA which I see before me?

Given for an OCLC member symposium on shared data, May 31 2013.

Published in: Education, Technology

Transcript of "Is this BIG DATA which I see before me?"

  1. Is this BIG DATA which I see before me?
     Photo: Stefan Insam, “Deadly carvings” CC-BY-SA http://www.flickr.com/photos/ramsesoriginal/6652582259/
     So, hi, I’m Dorothea Salo from the School of Library and Information Studies at the University of Wisconsin at Madison, and the first thing I’m going to do is apologize for the talk title in the day’s agenda, which is a horrific MISquotation of Shakespeare’s Macbeth. Totally my fault, not Eric’s or OCLC’s, sorry about that, it’s correct on the slide!
     So, Big Data.
  2. VOLUME VELOCITY VARIETY
     This is a well-known characterization of “big data”: volume, velocity, and variety. Big VOLUMES of data is probably the first thing to spring to mind when somebody says “big data” -- don’t think I need to explain it -- but size is not everything! (CLICK) VELOCITY matters too: how fast do these data pile up? how fast do they need to be cleaned up and used? how fast does interaction with the data need to be? how easy is it to get data where they’re going in the form they need to be in?
     (CLICK) And that gets to the third vee, VARIETY: From a computational perspective -- and computers are notoriously persnickety and dumb about this -- how clean are the data in the first place? how much effort does it take to clean them, and how much of that effort can be automated? Note that high variety is not a good thing! Ideal data for analysis is clean, CONSISTENT (this one’s important!); it’s easy to understand, and simple to use a computer to mess around with. In the real world, though, big data tends to mean more variety than is wanted. So a bit of hope here for libraries as we struggle with variety in our data: it’s not just us! we’re not alone!
     So keep these vees in mind as I go on talking. None of them is more important than any other; they all factor in to making the best use of Big Data.
  3. Where they most breed and haunt...
     Photo: CERN, “The Large Hadron Collider/ATLAS at CERN” CC-BY http://www.flickr.com/photos/11304375@N07/2046228644/
     http://gigaom.com/2013/04/04/why-facebook-home-bothers-me-it-destroys-any-notion-of-privacy/
     So where’s big data? It’s everywhere. It’s in science -- oops, the Large Hadron Collider twitched, that’s another petabyte. It’s on the web, of course, from Google to Facebook to Amazon.
  4. Why, I can buy me twenty at any market...
     http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
     What need we fear who knows it, when / none can call our power to account?
     Regalado and Leber, MIT Technology Review, http://www.technologyreview.com/news/514386/intel-fuels-a-rebellion-around-your-data/
     And even beyond the online giants, Big Data has hit business, where the hype cycle is highest, and where “big data” seems to mean something like “anything we can collect about our customers or users and their behavior to correlate with other companies’ data in flagrant violation of any notion of privacy.” And I think it’s important to watch how that debate evolves, as academe and its libraries keep getting told “behave like a business!” and businesses keep behaving so horrendously.
     The top quote, incidentally, is said by Lady Macbeth, and it’s about husband acquisition. That Lady Macbeth, business genius for our time!
  5. But in these cases / We still have judgment here; that we but teach / Bloody instructions, which, being taught, return / To plague the inventor:
     Examples via: Inside Higher Ed, http://insidehighered.com/
     Big data is in education, who knew? And we in academic libraries should be watching this, as well as folk who have served on IRBs, because it’s troubling from a student-privacy perspective and I don’t know who has more authority in academe to speak truth to power about privacy than academic librarians.
  6. Is this BIG DATA which I see before me?
     Photo: Stefan Insam, “Deadly carvings” CC-BY-SA http://www.flickr.com/photos/ramsesoriginal/6652582259/
     So, of course libraries have data, and we use data in decisionmaking, in asserting our value, in collection-development and service decisions, and so on. All I need to do is say “LibQual,” right? The question I was asked to address today, though, is whether libraries have, or will have, “Big Data.”
  7. Your face, my thane, is as a book where men / May read strange matters.
     BIG DATA IN LIBRARIES?
     Kind gentlemen, your pains / Are register’d where every day I turn / The leaf to read them.
     And I have several answers to that question.
     YES, libraries have big data. Of course we do.
     YES, libraries have or could have big data, BUT its collection or use is somehow problematic.
     NO, sometimes what libraries have there isn’t big data. It might be big, it might be important, but it’s not actually data, and that is often problematic.
     Some library data could be big data, but NOT YET it’s not.
     And finally... Big Data, SIGH. We could have big data and it’d be super-cool if we did, but something completely unnecessary is in the way.
  8. Is this BIG DATA which I see before me? YES. Do we really even need to ask?
     So some of you are looking at me right now all like “what a dumb question! Of course libraries have Big Data, where have you been for the last twenty years?!”
  9. LIBRARY OF CONGRESS: 5 billion web-archive files; 50 billion tweets; 5 million newspaper pages; (x)00,000 e-journal articles; digital audio, video, etc.
     Leslie Johnston, Library of Congress, as reported by Lorcan Dempsey in “Big data... big trend,” http://orweblog.oclc.org/archives/002196.html
     And you know, those people are quite right. National libraries and some major research libraries have been in the big-volume data game for some time because of digitization, and more recently, conscious collection of large volumes of born-digital materials. Here’s what Leslie Johnston claimed last year the Library of Congress is hanging onto digitally: five million newspaper pages, some hundreds of thousands of e-journal articles, five billion web-archive files, scads of digital audio and video, and what by now is probably close to if not more than a hundred billion tweets.
     Interestingly, I’ve seen news stories that hint that the Library of Congress’s Twitter database is running into a serious velocity problem! They have all the tweets, just not the computational power to let researchers or anybody else DO anything with them. And it’s too big a dataset to be downloadable, so the combination of high volume and a hoped-for high velocity is pretty deadly.
  10. HATHI TRUST
      So we’re all familiar with this page by now; in fact, a lot of the institutions represented in this room are Hathi Trust members. It’s worth remembering that Hathi Trust came about in order to solve a classic big data volume problem: where the heck to PUT all those page scans and OCRed texts from the Google Books project!
      And as Hathi grows and changes, we see people tackling more problems that would sound really familiar to a big-data analyst in business or a so-called data scientist: what can we find out from this gigantic pile of bits? How do we best clean up the OCR so that linguistic and literary analysis is reliable, and how do we deal with language variation over time?
      And I have to tell you, as a historical-linguist-in-a-past-life and a sometime computer programmer, a lot of the analyses I see Ph.Ds proudly trotting out these days are pretty weak. I don’t just mean “the digital humanities,” either, though there’s plenty of eye-rolly work there -- that “culturomics” stuff coming out of Google’s comp-sci people has some pretty obviously overbroad conclusions stemming from a failure to consider the limitations of their evidence base appropriately. But, you know, there’s a lesson in that: with big data, we’re all learning by doing. We’ll get better at it; just give us time, and room to monkey around.
  11. Threescore and ten I can remember well: / Within the volume of which time I have seen / Hours dreadful and things strange...
      doi:10.2218/ijdc.v7i1.219. The International Journal of Digital Curation, Volume 7, Issue 1 | 2012. Developments in Research Funder Data Policy. Sarah Jones, Digital Curation Centre, University of Glasgow.
      Abstract: This paper reviews developments in funders’ data management and sharing policies, and explores the extent to which they have affected practice. The Digital Curation Centre has been monitoring UK research funders’ data policies since 2008. There have been significant developments in subsequent years, most notably the joint Research Councils UK’s Common Principles on Data Policy and the Engineering and Physical Sciences Research Council’s Policy Framework on Research Data. This paper charts these changes and highlights shifting emphasises in the policies. Institutional data policies and infrastructure are increasingly being developed as a result of these changes. While action is clearly being taken, questions remain about whether the changes are affecting practice on the ground.
      So yeah, libraries COLLECTIVELY have big data and have had it for a long time! not new at all. What’s changing is that INDIVIDUAL libraries are starting to run into high-volume and high-variety data problems. In academic libraries, for example, faculty are starting to look to us to help with research-data management. Some digital libraries are seriously getting into targeted web archiving, too.
      And here’s where I go all finger-shaky at us: right now, in May twenty-thirteen, most of us are not investing NEARLY enough in computing infrastructure and development to be able to keep up well. We heard this morning from Sarah Pritchard that data management and curation is a thing in research institutions; I’m here to tell you that the opportunity for libraries to stake a claim to research-data management and archiving in particular is a TIME-LIMITED one. If academic libraries don’t prove we can help -- and that means a lot more than putting together a committee or hiring one person -- researchers SHOULD and WILL go elsewhere.
      So we can have Big Data... but only if we decide we want it badly enough.
  12. Is this BIG DATA which I see before me? YES... BUT.
  13. So if you weren’t watching, you missed this one: Harvard Library for a very brief time piloted a service called Library Hose that tweeted the titles of books that had been checked out of the library, shortly after that checkout. Eyes were rolled, fusses were fussed, and the Library Hose was shut down, because honestly, it’s kind of a bad idea. But that’s only a funny example of extremely serious questions about ethical uses of the data that libraries could and sometimes do collect about patrons, individually and in aggregate, on-purpose and inadvertently: search data, patron-computer-use data, patron-behavior data.
      And we discussed this earlier in the Q&A, but in my mind at least, this one’s easy. We want to differentiate ourselves from Google, our search competitor? We want to differentiate ourselves from Facebook, our social-activity competitor? We want to differentiate ourselves from Amazon, our content-purveying competitor? Easy. WE DO NOT SELL OUT OUR PATRONS THROUGH THEIR DATA. EVER. FOR ANY REASON. Even if they invite us to. No matter how tempting it is, how many nifty things we could build, or how hard our patrons push us to do things that we KNOW could turn around and bite them, in this age of increased surveillance from government and business and black-hat hackers everywhere. “Political problems,” rather than technical ones, yeah, sure, but you can’t just wish political problems away. I’m avoiding the obvious cheap shot here out of respect for the dead, but I’m sure all of you can fill it in for me. In lieu of that, I’ll just say that AOL and Netflix both learned really quickly that “sanitizing” data doesn’t, and “deidentified” data isn’t.
      We don’t sell out our patrons. We just don’t. That’s our first requirement whenever we talk about using or even KEEPING certain kinds of patron data, or patron-traceable data. And the only way to keep data safe is often to destroy it or refuse to keep it in the first place. Fact of the computing life. MOVING ON...
  14. Is this BIG DATA which I see before me? NO. This isn’t even data.
      This is a subtle point, but one that governments are particularly struggling with as OPEN data becomes a thing for them: it’s possible to turn data into something that looks like data but isn’t. Which often defeats the purpose of collecting or sharing the data in the first place. Does this happen in libraries? You betcha. And often, it happens with exactly the kind of data we’ve been discussing today.
  15. ...thereby shall we shadow / The numbers of our host and make discovery / Err in report of us.
      So, this web page I’ve taken a screenshot of here, a sort of library-activity infographic thinger, is brilliant and I love it. When it made the rounds of my online librarian friends, there was a chorus of I WANT MY LIBRARY TO DO THIS.
      But it’s not data. There’s data underneath it somewhere, but as presented, this is not data. It could be -- and if we could collect this information from libraries all over the place, it could even be BIG data -- but it’s not. The problem is that third vee, variety. If I wanted to compute on these numbers, I’d have to grab the HTML and laboriously write code to extract the numbers from it, and as soon as Traverse Area District Library changes their content-management system or does a redesign, my code breaks. Multiply this by all the libraries in all the cities and towns in all the states everywhere, and you see the problem.
      So, acknowledging that qualitative data is often-though-not-always an exception to this rule, take this rule away with you: *if it’s not computable, it’s not data.* Big or otherwise. Libraries have treated the computability of the data we create and collect as a low-priority consideration for far too long.
  16. up, up, and see / The great doom’s image!
      Photo: Luz, “amber” http://www.flickr.com/photos/nieve44/3800137286/ CC-BY
      Making an infographic or a pie chart or a data HTML table takes pieces of the data -- usually not even everything -- and reduces them to something that tells a story, because graphs and charts and tables almost always tell stories much better than the actual data do.
      So a graph or a table or a chart or an infographic is data trapped in amber. It’s very beautiful, and human beings appreciate that beauty, BUT... you can’t get those little particles of data back out, much less do anything useful with them if you did! They’re just not computable any more. You’ve doomed your data!
      Any data you’re putting out there in PDFs, incidentally? It’s not data any more! Stop that! We in libraries should be setting the example here! And we should lean on our vendors about this, too. There’s just no point in them providing data that we can’t use for our purposes.
  17. The sacred storehouse of his predecessors, / And guardian of their bones.
      Image: http://library.music.indiana.edu/tech_s/manuals/training/marc/record1.html
      Which brings me to the skeleton in the closet (speaking of bones): MARC. If I had a nickel for every cataloger who’s asked me what the problem is with MARC and AACR2 and ISBD, I would never need to work a day in my life again.
      Here’s the problem in a nutshell, and it’s not news, because Kim alluded to it earlier with respect to harmonizing serials holdings in the CIC. The records we put into our library catalogs are marginally computable at best. If you don’t believe me, ask any programmer anywhere who’s worked with MARC records. And you heard Kim talk about Google Books and library metadata -- look, Google has the smartest engineers anywhere; if THEY can’t compute on our data, it’s NOT computable. That uncomputability is costing us untold amounts of money in systems and cleanup programmers, not to mention mindshare on the larger information web that libraries are only a part of. We have GOT to do better.
      Another aspect of the MARC problem gets back to the third vee I talked about, “variety.” Local practice, rule interpretations and other changes over time that don’t get retroactively fixed in old records, places where AACR2 just throws up its hands and says “as long as it’s human-readable, do what you want,” -- all this INCREASES the variety in our catalog records, which DECREASES their computability and reuse value. Whatever happens with RDA and BIBFRAME and similar efforts, if we end up with yet another sloppy tower of Babel, it’s not solving the problems we have.
      Cataloging for your users -- COMPUTERS, THEIR PROGRAMMERS, AND THEIR USERS *ARE* YOUR USERS.
  18. Strange things I have in head, that will to hand; / Which must be acted ere they may be scann’d.
      Photo: Mike Linksvayer, “dsc02977.jpg” http://www.flickr.com/photos/mlinksva/2254052444/ CC-BY
      Digital librarians, among whom I include myself -- come on, we know we’re not off the hook here! I ran institutional repositories for six years, I got an entire ARTICLE out of one authority-control mishap where one author had eight different name variants in the IR. Our data isn’t clean and consistent. It isn’t computable, and it can’t be aggregated usefully or consistently. Let’s not pretend!
      What we can do, though, is watch the Big Data pioneers and the techniques they use to cut through the chaos. Natural-language processing. Fuzzy matching. If you haven’t played with Open Refine, which used to be Google Refine, you completely need to grab some random data from your catalog or digital library or wherever and do that, it’s actually really fun! If only so that you see what the possibilities are.
  19. Is this BIG DATA which I see before me? NOT YET.
      Libraries also have data that doesn’t look all that big -- or all that powerful -- when you only have it from a single library, but if you add together that same data from a whole BUNCH of libraries, suddenly you have something super-interesting.
  20. AGGREGATION
      Ay, in the catalogue ye go for men; / As hounds and greyhounds, mongrels, spaniels, curs, / Shoughs, water-rugs and demi-wolves, are clept / All by the name of dogs: the valued file / Distinguishes the swift, the slow, the subtle, / The housekeeper, the hunter, every one / According to the gift which bounteous nature / Hath in him closed; whereby he does receive / Particular addition.
      The term of art for this, of course, is “aggregation,” and it happens all over the place already, it’s nothing new.
      Any data, any data at ALL, can be aggregated... in theory. In practice, a successful aggregation depends a LOT on keeping a lid on that third Big Data vee, variety. It may also depend on velocity, keeping things current, fixing errors quickly, and similar speed-dependent concerns.
  21. We shall not spend a large expense of time / Before we reckon with your several loves...
      All the cataloguers in the room know this already, of course, because of WorldCat. I’m not a cataloguer and definitely no expert, but I do know that OCLC does its level best to enforce certain kinds of consistency in contributed MARC records, above and beyond what MARC and AACR2 and RDA insist on, because if they don’t, the search engine doesn’t work! And, you know, we all know they don’t do a perfect job of it... but to some extent that’s on us, because of the MARC closet skeletons I mentioned earlier.
  22. OAISTER
      I see thee compass’d with thy kingdom’s pearl...
      Any Michigan folks here? Here’s a blast from the past for you: OAIster, which now belongs to our good hosts at OCLC. See, we’ve tried large-scale aggregation with HIGHLY heterogeneous metadata -- far more variable than the MARC coming from skilled cataloguers -- before. With OAIster, it didn’t work out so well. Variety in our data bit us yet again, as did some really pretty stupid and evitable structural flaws in the harvesting protocol OAI-PMH, such as total lack of error reporting and no flag for metadata-only records so that searches could exclude them.
      So what have we learned from the wonderful, bizarre, epic mess that is OAIster? Let’s see.
  23. O proper stuff! / This is the very painting of your fear...
      DPLA
      We have another chance to try aggregation, in the guise of the Digital Public Library of America. It’s very early days yet, but I did want to call out one thing that I think DPLA is doing right: cutting the Gordian knot of intellectual-property rights in metadata. Long story short, some metadata is too factual to qualify for copyright protection in the US; other metadata such as abstracts clearly does qualify.
      But DPLA isn’t playing that game. They say very clearly, if you want to play with us, you do NOT play intellectual-property games with your metadata. You start up with that, we kick you out. They’re gambling, of course, that they become enough of a name to conjure with that they can make this stick. As I said, it’s early days, but I’m not betting against them -- and I appreciate this approach very, very much.
      Here’s what I want to know, though. Can DPLA get past the metadata-quality issues that made a mess out of the National Science Digital Library, never mind OAIster? They seem to be leaving training and quality control to their Service Hubs. Maybe that’ll work. But I don’t see any kind of feedback loop being built in here, and it worries me some.
  24. Is this BIG DATA which I see before me? SIGH. It could be, but...
      And that question leads me to what if I were the Porter in Macbeth, I’d call something simultaneously grandiloquent and obscene, but since I’m not the Porter in Macbeth, I’ll just call it “the graveyard of missed big-data opportunities.”
  25. The multiplying villanies of nature / Do swarm upon him
      There’s so much we should know about books that we don’t. And we don’t know it IN SPITE OF all the effort we spend cataloging books! The above is from a public-librarian friend of mine, Laura Crossett, and what she was trying to do was make sure they had all the books in any series where any book in the series was circulating well. And the series information stumped her. And this is just a STUPID problem to have, and honestly, I think we have it because our ideas and practices around cataloging are so fragmented and so calcified.
      And digital librarians, we only get to gloat about this because we’re often describing unique materials. Otherwise, we’re just as bad.
      We need to build Big Data together -- it’s not just the responsibility of the Library of Congress or the New York Public Library or Harvard or OCLC, it’s everybody’s responsibility. And one of the ways we do it is by eliminating redundant labor, well beyond copy cataloging even, so that we can actually do things like record series information and relationship information... so that we can further embiggen and enrich our data! I think the linked-data infrastructures that several national libraries are building can do that... if we let them!
  26. ...there cannot be / That vulture in you, to devour so many
      I can’t really add anything to the combination of Les Carr and Shakespeare. I’m just going to admire this for a second... and no, I don’t know what the fourth and fifth vees are either.
      But seriously, Les is right. We saw it with serials and their metadata, we’re seeing it now with e-textbooks, we’re even starting to see it with a few kinds of research data, and I don’t know who’s gonna stop it and build a real Big Data commons if it’s NOT academic libraries. So if you need a reason to get involved with open access and open data, this is it: it beats the heck out of the velociraptor alternatives.
      And because Deb Blecic mentioned it earlier: Non-disclosure agreements are a velociraptor indicator. I don’t like them, and I don’t think any of us should. Just sayin’.
  27. As two spent swimmers, that do cling together / And choke their art.
      And here’s where, as I generally do, I bite the hand that’s feeding me. OCLC, you are clearly of two minds on this Big Data thing, and I think you’re hurting yourself by it. On the one hand, there’s OCLC Research, which is making amazing Big Data things like the Virtual International Authority File and working hard -- and pretty successfully -- to embed them in the larger information world.
      And on the other hand, there’s the dog-in-manger intellectual-property shenanigans OCLC proper keeps trying to pull with the records contributed to WorldCat, which made the national library of Sweden pull out of WorldCat altogether, and is infuriating those of us who are paying attention to where the Big Data world is going.
      Please, OCLC, get your act together. If you’re going to insist on being a velociraptor, please spin off OCLC Research so that you don’t drown it when we drown you -- and we will. It will take time, just as the open-access movement took time, but we can destroy you and we will. Or take OCLC Research as your model and stop being a dang velociraptor.
  28. when worldcat.org do come to DPLAne
      And it’s not just libraries who want to treat WorldCat as a big juicy Big Data-store, either. This is a FriendFeed comment from a librarian quoting what OCLC actually lets affiliates do and why. You can’t read the first comment here, so I will: “I can, barely, stretch this definition to include the work that I’m doing on my research project... but the grad student who wants to use WorldCat data for a bibliographic study of the spread of publishing in New Spain is pretty much out of luck.”
      Stop it, OCLC. Just stop it. You are shutting yourself out of Big Data-land, and when you do that, you shut us libraries out too. Hathi Trust is willing to fight in federal court to allow researchers to do research on its corpus, and OCLC comes at researchers with legalese? Stop it. DPLA insists that all contributed metadata be available for any meditated reuse, within reasonable limits of bandwidth, and OCLC gives researchers static? Stop it. Bring worldcat-org to DPLAne, instead.
  29. when worldcat.org do come to DPLAne
      Think upon what hath chanced, and, at more time, / The interim having weigh’d it, let us speak / Our free hearts each to other.
      All right, I’m done lecturing OCLC, and now I’m going to lecture everyone else in this room, because I clearly haven’t made enough enemies, right?
      To some extent, OCLC is doing what it’s doing because it knows that a lot of academic libraries love to free-ride. We heard about this from Kim today briefly with regard to collective print management -- “can the CIC let California do it?” -- and, you know, I come out of open access and open source, so I’ve seen it firsthand. Contribute programmer time to an open-source project? Nope. Pay for a membership in an open-source foundation, or participate in a collective digital-preservation system? Fuhgeddaboudit; who has money for that? Put actual acquisitions money toward open access? Bah, we have an institutional repository... under somebody’s desk... in the third sub-basement... somewhere; that’s enough, right?
      But at some point, free-riding prevents useful collective action, and I think Big Data is one of those points. Big Data isn’t free. Open data, big or small, isn’t free. It’s really tempting to pretend it is and free-ride anyway, I get that. But free riding is slimy and lazy and unethical and we need to stop doing it. No one library can talk OCLC down off the ledge; heaven knows Sweden tried. But maybe we can talk OCLC down together, as a community. Shouldn’t we try?
  30. If this is BIG DATA which I see before me, NOW WHAT?
  31. SKILLS
      So I was asked to talk about what skills and scaffolding we need to make and use big data in our libraries. And I’m sorry, Eric, but I tried and tried to make a slide answering that question and I just couldn’t.
  32. SKILLS
      Every one that does so is a traitor, and must be hanged.
      And the reason I couldn’t is that I know what way too many academic libraries DO with lists of skills -- they think they can just hire some poor Macduff with a random grab-bag list of skills and call him a “Big Data Coordinator” or some such thing, and then they’ve solved the big data problem and they can go home and have a drink.
      I don’t work in libraries any more in part because my own career was badly hurt by that kind of “skills thinking” with respect to scholarly communication and open access. I don’t think thinking about library services in terms of laundry lists of skills works! And I KNOW it hurts people, because I’ve had former students come back to me for advice over it, and I’ve seen it hurt much better librarians than I ever had a hope of being.
      So now that my job is preparing people for librarianship, I explicitly warn my students about skills thinking and how it manifests in job descriptions, and I tell them not to apply for those laundry-list, unsupported-single-person-in-a-disregarded-corner jobs. The dice are just too loaded against them.
      So if you think you’re going to hire a Research Data Coordinator, or a Digital Humanities Librarian, or one bioinformaticist, or one statistician, somebody with serious skills, and you’re going to wind that person up and turn them loose and miracles will happen? (CLICK) Well, I’m with Lady Macduff on this one -- hang the traitors!
  33. SCAFFOLDING
      What bets should we make, now and future? What do we build? Fix? Who cares, right now? Who else should? What can we use, right now? How can we experiment?
      I hate the word “infrastructure,” because it’s impersonal and overused, so I’m going to suggest “scaffolding” instead, by way of a more holistic, less skills-focused mode of thinking about the opportunities Big Data might have for us, and what we’ll have to be and do to capitalize on those opportunities.
      I like “scaffolding” because -- look, you can hire a Michelangelo, but if you don’t put that scaffolding under him, he’s not painting you no Sistine Chapel. So here are some questions I think are worth asking.
      - note that “who cares” means you look at your existing staff as well as your environment, because ignoring your current people is stupid and counterproductive.
      - bets: there’s no such thing as a sure thing. you have to bet. betting means risk, risk means failure. fail fast and often.
  34. Is this BIG DATA which I see before me?
      This presentation is available under a Creative Commons Attribution 3.0 United States license.
      Dorothea Salo, University of Wisconsin–Madison
      And that’s where I’m at on this just now! Hope it helped, and I’m findable on Twitter and the web if you have questions.
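
As a footnote to slide 18's mention of fuzzy matching and of one author turning up under eight name variants: a minimal sketch, in Python with only the standard library, of how such variants might be grouped. Everything here is an assumption for illustration, including the sample names, the `normalize` and `cluster_variants` helpers, and the 0.8 threshold; it is not how Open Refine or any particular repository does it, and real authority control needs human review on top.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and sort the name parts
    so that "Doe, Jane" and "Jane Doe" normalize to the same string."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def similarity(a: str, b: str) -> float:
    """Similarity ratio between 0.0 and 1.0 on the normalized forms."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def cluster_variants(names, threshold=0.8):
    """Greedy grouping: each name joins the first cluster whose first member
    is similar enough, otherwise it starts a new cluster.
    The threshold is a hypothetical starting point, not a recommendation."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Hypothetical sample data: three variants of one author, plus a different author.
names = ["Doe, Jane", "Jane Doe", "doe, jane.", "Smith, John"]
clusters = cluster_variants(names)
for cluster in clusters:
    print(cluster)
```

The greedy comparison against each cluster's first member keeps the sketch short; it depends on input order and would need a sturdier clustering approach for data at any real scale.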
