Here’s a slide of a slide… Dan Ariely, a behavioral economist at Duke University, has been posting this analogy all over social media and at presentations. He also has a book called Predictably Irrational, which I have not read yet, but it talks about his work in behavioral predictions. He also has a number of TED talks that are very interesting. So what is Big Data, really, anyway?
We may be wondering just “how big is big data?” If you played 20 questions as a kid you might have asked “is it bigger than a breadbox?” While some are reporting datasets on such an unfathomable scale as petabytes, exabytes, and zettabytes, big data is really any data that is too big for traditional technology to handle. In other words, it is too big for our breadbox. A good working definition for most of us is a data file that is too big for Excel to load. In Excel 2013, the maximum worksheet size is 1,048,576 rows by 16,384 columns. But really, there are three features that make Big Data “big.”
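The Excel rule of thumb can be checked before you ever open the file. Here is a minimal sketch in Python, assuming the data is a plain CSV file; the function name `fits_in_excel` and the sample data are made up for illustration, and the two limits are the Excel 2013 figures quoted above.

```python
import csv
import io

# Worksheet limits for Excel 2013, as quoted in the talk.
EXCEL_MAX_ROWS = 1_048_576
EXCEL_MAX_COLS = 16_384


def fits_in_excel(csv_file):
    """Return (fits, rows_read, widest_row) for an open CSV file object.

    Reading stops as soon as either limit is exceeded, so rows_read is
    only a full count when the file fits.
    """
    rows_read = 0
    widest_row = 0
    for row in csv.reader(csv_file):
        rows_read += 1
        widest_row = max(widest_row, len(row))
        if rows_read > EXCEL_MAX_ROWS or widest_row > EXCEL_MAX_COLS:
            return False, rows_read, widest_row
    return True, rows_read, widest_row


# Tiny in-memory stand-in for a real file opened with open(path, newline="").
sample = io.StringIO("id,title\n1,Moneyball\n2,Predictably Irrational\n")
print(fits_in_excel(sample))
```

Because the file is read one row at a time, the check itself never needs to hold the whole dataset in memory, which matters when the file really is too big for the breadbox.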
A common definition of Big Data relies on what are known as the 3 V’s: Variety, Velocity, and Volume, a framework first described by Doug Laney, an analyst at a firm called Gartner. Variety means that we are not just collecting more of the same data we’ve always collected. Instead we are collecting different types of data. Variety also means that we do not have the type of structured datasets we used to have in relational databases, you know, the ones with nice tables with neat and tidy rows and columns. Now data often comes in forms that don’t fit in columns, like video and audio, sensor data, documents, Flash, and so forth. You may have heard of “NoSQL” databases, which are an alternative to the traditional relational database model and accommodate this type of dataset. Velocity has to do with the tremendous speed at which data is being generated and collected. You may have heard stats such as Facebook generating 500 terabytes of data per day. Many businesses use clickstream analysis on their websites, which generates a great deal of data in a hurry. The IDC Digital Universe study indicates that by the year 2020 society will be generating 50 times the amount of information currently being generated. In 2011 this number was 1.8 zettabytes. A zettabyte is a 1 with 21 zeroes after it, counted in bytes, so the growth by 2020 will truly be staggering. And this gets us to our final V, Volume. As is likely obvious from the first two V’s, the sheer amount of data that can be collected now is really kind of unfathomable. Google, for example, receives over 2 million search queries IN A SINGLE MINUTE. 72 hours of new video are uploaded to YouTube every minute. 47,000 apps are downloaded from iTunes every minute.
Let’s take a moment before we go any further and discuss the differences between big data and open data. You can see from this Venn diagram that there are big datasets that are not open. These are proprietary datasets in businesses and other settings where security is an issue, but there are also big datasets from scientific and government sources that ARE open. Open government data, conversely, is not all “big,” but there is a great deal of public access to it at the federal, state, and local levels. Furthermore, there are open data sources that are not government sources, such as business and scientific data that are not necessarily “big” but are publicly available. So this should give you an idea of how Big Data and Open Data are related.
Is big data a game changer? First and foremost, big data turns the scientific method on its head. Traditionally, any inquiry or decision starts with a hypothesis. We make an educated guess, and then look for the data to support or contradict this hypothesis. In Big Data analytics, we start with the data, and we look for patterns. This data is unstructured, it can be multidisciplinary, and it can be highly predictive. Also, traditionally an inquiry or decision seeks to find out WHY the hypothesis is confirmed or rejected. In big data analytics, we identify the patterns without necessarily learning why those patterns exist.
In his Data Science Central blog, Vincent Granville has identified 9 types of data science specializations, including:
• Statistics: this area deals with testing and modeling, theoretical approaches, and developing new techniques for approaching large datasets
• Mathematics: slightly different in that these people deal with operations research: optimization, quality control, etc.
• Data engineering: those strong in data engineering deal mainly with the structure and architecture of databases, filesystems, and storage
• Software engineering: know several programming languages and work on code development
• Machine learning: these experts are the ones who program the algorithms and complex computations
• Business: these are subject experts in terms of determining appropriate metrics, ROI, and what to include on a dashboard
• Visualization: charts and graphs, making data analysis understandable to the user or decision maker
• GIS: focuses more exclusively on the spatial representation of data
What big data allows us to do:
• “Human insight at machine scale”
• Identify patterns, but also outliers and unique instances
• Behavioral predictions
• Sentiment analysis
• Activity “hotspots,” geographic ones such as the Arab Spring or Google’s flu prediction
• For the social sciences, we can get empirical evidence, whereas surveys are subjective, observational studies are not in the “natural habitat,” etc.
Here are some examples of the amazing things that are being done with big data currently:
• Market-basket research: diapers and beer!
• Broccoli cam: sensors determine when the produce department is out of broccoli and send a worker out to refill it
• Nate Silver and Moneyball: turned the scouting profession on its head
• Netflix: highly specific classifications of movie genres to create recommendations
• Linguamatics: text mining of tweets predicted a prime minister election
• NYC fire inspectors: cataloged 60 pieces of metadata about all inspectable buildings, used to prioritize inspections
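For anyone curious about the arithmetic behind the diapers-and-beer story, here is a minimal sketch in Python of three measures commonly used in market-basket analysis: support, confidence, and lift. The baskets are made up for illustration and the function name `pair_stats` is hypothetical; real analyses run over millions of receipts with dedicated tools.

```python
# Made-up shopping baskets, for illustration only.
BASKETS = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def pair_stats(baskets, first, second):
    """Return (support, confidence, lift) for the rule first -> second."""
    total = len(baskets)
    with_first = sum(1 for basket in baskets if first in basket)
    with_second = sum(1 for basket in baskets if second in basket)
    with_both = sum(
        1 for basket in baskets if first in basket and second in basket
    )
    if with_first == 0 or with_second == 0:
        raise ValueError("both items must appear in at least one basket")
    support = with_both / total             # share of baskets holding both items
    confidence = with_both / with_first     # share of 'first' baskets that also hold 'second'
    lift = confidence / (with_second / total)  # above 1 means bought together more than chance
    return support, confidence, lift


support, confidence, lift = pair_stats(BASKETS, "diapers", "beer")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

Note that these numbers only say the two items show up together; they say nothing about why, which is the same caution that applies to big data patterns generally.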
Harper Reed, the Obama campaign techie, in an October 2013 article on the Chronicle of Higher Education’s Wired Campus blog, says Big Data is “bs.” It is used to generate fear in enterprises to spur equipment upgrades, in other words, to get them to spend money on technology. He says: “you can get a lot of this stuff done just in Excel.” So, just having the capacity for scalability in an enterprise does not mean that you are “doing big data.”
Big data requires more treatment and handling. This includes:
• Data cleansing: dirty data, missing data, more outliers, removing duplicates
• Parsing and treating: extracting data from its original source into something resembling a dataset
• Transformation into a usable format is key
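To make the cleansing steps a little more concrete, here is a minimal sketch in plain Python that trims and normalizes text, drops rows with missing or unparseable values, and removes duplicates. The records, field names, and the `clean` function are all made up for illustration; which rows count as “dirty” is a judgment call that depends on your data.

```python
# Made-up patron records, for illustration only.
RAW_ROWS = [
    {"name": " Pat Smith ", "city": "Albany", "visits": "12"},
    {"name": "pat smith", "city": "albany", "visits": "12"},    # duplicate once normalized
    {"name": "Lee Jones", "city": "", "visits": "7"},           # missing city
    {"name": "Sam Brown", "city": "Troy", "visits": "n/a"},     # unparseable number
    {"name": "Kim Green", "city": "Syracuse", "visits": "30"},
]


def clean(rows):
    """Normalize text, drop incomplete rows, and remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        name = row["name"].strip().title()
        city = row["city"].strip().title()
        visits_text = row["visits"].strip()
        # Skip rows with missing text or a visit count that is not a whole number.
        if not name or not city or not visits_text.isdigit():
            continue
        key = (name, city)
        if key in seen:  # same person and city already kept
            continue
        seen.add(key)
        cleaned.append({"name": name, "city": city, "visits": int(visits_text)})
    return cleaned


for record in clean(RAW_ROWS):
    print(record)
```

Dropping incomplete rows is only one option; depending on the question being asked, it may be better to flag them for a human to review.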
Another issue is false patterns and false correlations. For example, Gene Pease, in his Talent Management blog, notes that the height of an elementary school student is correlated with his or her reading level. In Jeffrey Stanton’s text Introduction to Data Science he says “bigger means weirder.” So we need to be careful with regard to the assumptions and conclusions we derive from the data. Again, big data is not concerned with the “why” of a pattern; it only identifies that the pattern exists. As one author noted, “when looking at the whole haystack, EVERYTHING looks like a needle.”
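The height and reading level example can be reproduced with a few lines of Python. This is a minimal sketch using made-up numbers, under the assumption that both height and reading level simply rise with grade; within any single grade the two are constructed to be unrelated. The `pearson` helper computes the ordinary Pearson correlation coefficient.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up students: four per grade, grades 1 to 5.
# Within a grade, the height and reading offsets are built to be unrelated.
HEIGHT_OFFSETS = [-2, -2, 2, 2]            # centimeters around the grade average
READING_OFFSETS = [-0.5, 0.5, -0.5, 0.5]   # reading level around the grade average

students = []
for grade in range(1, 6):
    for dh, dr in zip(HEIGHT_OFFSETS, READING_OFFSETS):
        students.append((grade, 110 + 6 * grade + dh, grade + dr))

overall = pearson([h for _, h, _ in students], [r for _, _, r in students])
within = {
    grade: pearson(
        [h for g, h, _ in students if g == grade],
        [r for g, _, r in students if g == grade],
    )
    for grade in range(1, 6)
}

print(f"all grades together: r = {overall:.2f}")
for grade, r in within.items():
    print(f"grade {grade} only: r = {r:.2f}")
```

Across the whole school the correlation is strong, yet inside each grade it vanishes. Grade (really, age) drives both numbers, which is the kind of “why” that a pattern alone will not tell you.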
Big data is first and foremost a decision-making tool. This means that for all the technology and fancy processing, storage, and tools available, without competent subject matter experts to identify data flow in an organization or enterprise, identify the areas where data is lacking, and determine how the data can be used, it’s all for naught. The human element is what turns data into information. So where do we, as information professionals, fit into the equation?
There are a number of directions we, as librarians and information professionals, can pursue as we move into more data-driven activities in our organizations, mainly as an outgrowth of existing skill sets we possess. For example:
• Metadata extraction, creation, classification
• Privacy experts/intellectual freedom
• Quality experts: identify reliable and authoritative data sources and analysis
• Policy advisors for our organization
• Curation/selection
• Storage/management
• Access/gatekeepers
• Assuring data can be turned into information
• Knowledge management
• Competitive intelligence
• “Be the link pulling biz and IT together”
• Michelle Hudson of Yale: “Some day we’re all going to be data librarians”
In its article “Big Data’s Impact in the World,” The New York Times cited a report by the McKinsey Global Institute, the research arm of a well-known consulting firm, which projected that the United States needs 140,000 to 190,000 more workers with “deep analytical” expertise and 1.5 million more data-literate managers, whether retrained or hired. All disciplines are becoming increasingly data intensive, whether political science, sociology, transportation, or the traditional sciences and medicine. As information professionals we have the opportunity to flex our Information Literacy muscles and extend them to Data Literacy. Those of us in higher education can add data literacy to our instructional and consultation activities, and librarians in other capacities can bring their own patrons and stakeholders up to speed on key data concepts: how to collect, store, gather, evaluate, and interpret data. As my colleague Kim Silk of the University of Toronto has said to me: “much as we teach people information and media literacy; data literacy – understanding what the data is telling us, understanding (significant or misleading) statistics, outliers, sample size, correlations – is critical for 21st century citizens.”
Our vendor partners are already getting in on the action. For example, Thomson Reuters’ Eikon desktop analysis software for financial offices has Twitter and news sentiment analysis tools. These are primarily aimed at the financial sector; they assess news events and predict their effect on the financial markets. Many other partners are using big data internally to assess the usability of their interfaces, the frequency of use of resources, and common search terms. As our vendor partners become more data driven, we will need to be data literate ourselves in order to understand the resources they make available to us, as well as how and why these resources work.
So here, in my opinion, are the major takeaways from the first part of this webinar: We know that velocity, variety, and volume are the hallmarks of big data. Big data isn’t just more of the same data, and it isn’t necessarily tons and tons of data (although often it is). A good rule of thumb is that any dataset that is too big to fit in Excel is “big data” for our purposes. Big data holds the promise of amazing capabilities, through identifying both patterns and outliers in the data we have collected. We can identify behavioral patterns in an empirical way, such as through market-basket analysis, or collect and use new types of metadata to improve safety practices. But this cannot be done without the human element. Technology upgrades are only part of the equation and may not even be necessary; it takes subject matter experts to ask the right questions and to collect, clean, and interpret the data. Finally, as information professionals we have the ability to be involved with data and data issues in a variety of capacities, but our main strength may be in Data Literacy initiatives for our patrons and stakeholders. We did not have time for: stats lessons, privacy issues, computer processes, data structure, etc.
At this point I’d like to move on to recommending some resources for learning more about the topic. I am sure you realize that this presentation has only touched the tip of the iceberg on the topic of Big Data. There are many paths to pursue to learn more, and many specializations to focus on.
• Big Data: A Revolution is a best seller. I am in the middle of it right now, and it gives a layperson’s understanding of the concepts and impact of big data.
• There are lots of “for Dummies” books on various aspects of Big Data, many free in PDF form from various web sources: Big Data for Dummies, etc.
• An Introduction to Data Science: open source (free!) textbook with lots of good information, an easy read, short chapters (available on iTunes)
• Frontiers in Massive Data Analysis: a report from the National Academies Press that discusses big data in mainly social science disciplines, free on the web
I will put the URLs in the SlideShare version of this presentation.
• SU guide: data sources, programming guide, news, associations, LinkedIn groups, many free sources
• ALA list of resources: academic focus, but there are many good articles and a good collection of information
• Data Information Literacy wiki at Purdue: documenting the development of a standardized curriculum for data literacy and data science; they are also doing research on the level of data literacy and critical instruction
• A number of schools are offering Massive Open Online Courses that can be done online. Syracuse University offers one periodically and the University of Washington has one; Caltech and MIT have the more technical, computing-focused programs
Another area where librarians might be called upon for their expertise is information policy and best practices as they relate to data issues: use, storage, sharing, privacy, and so forth. Many of these practices are still in the process of being developed. For example:
• Council for Big Data, Ethics and Society: it hasn’t launched yet, but is supposed to soon. It is a collaboration with the National Science Foundation. Their website says they intend to “address such issues as security, privacy, equality, and access” to “develop frameworks to help researchers, practitioners and the public understand the social, ethical, legal, and policy issues that underpin the big data phenomenon.” They have a newsletter sign-up but I have yet to receive anything from it.
• Research Data Management Services: primarily for academic libraries, this report deals with storage, access, repositories, and data management in an academic environment, but there may be lessons for other types of libraries as well.
Here are a couple of other resources on big data policy and best practices:
• Rebuilding the Mosaic: the National Science Foundation Directorate for Social, Behavioral and Economic Sciences report on data-driven research in the social sciences related to world development. They identify population change, disparities, communications, media, and social networking as areas of focus for the future.
• GovLab: a blog on governance policies of science and technology. Search “data” in the search box for some good articles related to big data governance and policy.
• Terminology: this is another area that is an issue with current Big Data projects. Computer scientists, social scientists, and statisticians all have different language for the same things: case vs. instance vs. observation, as an example, all mean the “rows” in a dataset. There is an argument that this ISO standard for statistical terminology should be amended to create a standardized language for data analytics.
These are some newsletters that can be delivered to your email inbox that I find useful. There are tons of these, though, and there may be others you will find on the web that are also useful.
• Data Science Weekly: free newsletter, variety of topics, and includes jobs
• Data Science Central: nice blog, newsletter with broad focus, professional development for the data scientist (or aspiring data scientist)
• R-Bloggers: tips and tricks for using the statistical software R
I forgot to mention the O’Reilly mailing lists. O’Reilly, as you may know, is a publisher of IT manuals and provides blogs and other resources related to technology.
Here are my favorite blogs on the topic, in no particular order.
• Hilary Mason: she’s a data scientist and she posts interesting articles about data analysis, lots of visualizations, but also professional development topics for data professionals. She was an innovator who had an extensive role in creating bit.ly, which, among other things, is well known for a tool that will convert a long URL into something shorter and more manageable. She speaks a lot and hosts a data-related conference in NYC called DataGotham.
• Mathbabe: Cathy O’Neil is a mathematician but not an academic. She has some nice introductory posts for those interested in data science, with less visualization than Hilary; she focuses more on opinion and technique.
• Bits Blog: technology and business news from the New York Times
• No Free Hunch: problem-solving bent, “the sport of data science,” from Kaggle, a consulting company. They identify fun problems and solve them using data science techniques, and they announce many competitions and challenges where data scientists can strut their stuff.
• What’s the Big Data: Gil Press, who has a column at Forbes, focuses on the impact of big data on society, business, government, and IT. Right now he’s done a lot about the market for big data and its influence on business.
Next I would like to show you some interesting tools that you can play with if you want to explore big data and its capabilities for yourself. There are a lot of open source resources that are available and user friendly.
The first thing we will cover is finding datasets. There are a surprising number of sources for datasets out there that are free and online. Some are easier to use than others. I am pointing out three well known or interesting resources, but there are many others I could have included. These three that I have chosen will give you an idea of some of the variety of data that is out there.
Google Public Data Explorer provides many datasets, and Google Trends, which we will talk about later, provides visual display of data. Most of the public datasets available on Google Public Data Explorer are governmental in nature, as you can see by the list of data providers on the left.
Amazon Web Services: a wide variety of datasets on many interesting topics. Many of these are also from government sources, but not all.
Scale Unlimited is a big data consulting firm that makes some big datasets freely available for testing and modeling purposes. They have a wide variety of datatypes including media, graphic, geographic. One of the datasets contains all of the Enron emails.
These are some tools for creating databases and analyzing or querying your dataset. I must confess I am just learning about how these work now, so I only have brief explanations of them.
• R is an open source, command-line tool for statistical analysis. I liked the old DIALOG, so I love R. It has many extension packages that allow a lot of flexibility and precision in data analysis.
• Hive/Hadoop: both of these are open source projects of the Apache Software Foundation (Hadoop’s design was inspired by papers published by Google). Hadoop allows for what is known as parallel processing, or distributed computing. Hive is the language and infrastructure that allows you to query the data in Hadoop and do analysis. Its query language is very similar to SQL.
• PostgreSQL: provides an object-relational database management system, which is used by Etsy and Creative Commons, two organizations I think are very popular with librarians! It uses SQL as its query language.
• Project Bamboo DiRT: open source “digital research tools for scholarly use,” a variety of tools for data management, analysis, and visualization, as well as other topics.
• MLcomp: compares and evaluates computer algorithms. Evaluate your algorithm on their existing datasets, or evaluate your dataset to see which algorithm is best to use for it.
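If you have never seen an SQL-style query, here is a minimal sketch of what one looks like. It uses SQLite through Python’s built-in `sqlite3` module purely as a stand-in, since it needs no installation; it is not Hive or PostgreSQL, though a basic query like this one would look much the same there. The `checkouts` table and its contents are made up for illustration.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE checkouts (branch TEXT, format TEXT, items INTEGER)")

# Made-up circulation figures, for illustration only.
conn.executemany(
    "INSERT INTO checkouts VALUES (?, ?, ?)",
    [
        ("Main", "book", 120),
        ("Main", "dvd", 45),
        ("East", "book", 60),
        ("East", "ebook", 15),
        ("West", "book", 30),
    ],
)

# Total items checked out per branch, busiest branch first.
rows = conn.execute(
    """
    SELECT branch, SUM(items) AS total
    FROM checkouts
    GROUP BY branch
    ORDER BY total DESC
    """
).fetchall()

for branch, total in rows:
    print(branch, total)

conn.close()
```

The SELECT / FROM / GROUP BY pattern is the part that carries over between tools; what differs is where the data lives and how much of it the system can handle.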
Once you have queried and analyzed your data, you will want to display it in a manner that your patrons or stakeholders will understand and be able to use for making decisions. This is known as data visualization. Here are some cool tools that are free on the web.
Piktochart: very user-friendly data visualization design and editing, as you can see, mainly “infographics”
Esri is a geospatial tool, which means it is good at visualizing data that is displayed using maps. For example, here is a map related to commuting times across the US.
BigML: fee for service, but for datasets under 16 MB you can play with their visualization tools
ManyEyes: from IBM – upload your dataset and create a wide variety of visualizations: maps, histograms, graphs, text based analysis
Google Fusion Tables: a way of providing visualization for big or multiple datasets in table format: charts, maps, network graphs, etc.
ChartsBin: with this tool you can create interactive (clickable) visualizations that can be embedded in web pages or exported. They also share their own visualizations from various authoritative sources (government, scholarly journals, technical reports)
iCharts – another nice one that allows for interactive widgets that can be embedded, published on the web, etc.
Maybe you don’t want to get into analysis and you just want to see what others are doing. Here are some cool sites that give you a glimpse of what various organizations are doing with big data and the results that they are making available to the public:
CSSeer: crosses data from CiteSeer, which is a free bibliometric (citation) analysis tool, with Wikipedia to recommend scholarly experts in a field.
Streetbump- crowd-sourced pothole locator
MyMagic+: coming from Disney. You get a wristband that tracks your every move around the park: what you spend, where you go, how long you wait, what you buy, everything
Information is Beautiful: independent “data journalist” David McCandless creates just gorgeous visual displays, and then the data is available in Google Docs for anyone to use
Facebook Blog: fascinating articles and visualizations of what is happening with Facebook data
Google Trends: what people are searching for, with visualizations, plus the “zeitgeist”: what did the world search for in 2013
Flowing Data: fun visualizations on a variety of topics
GapMinder: Educational bent, describes itself as a “museum” on the internet – focus is on world development: factfinding and needs assessment
Professional development opportunities abound for info pros who wish to get their feet wet in big data and data science. In fact, I am working with a group of SLA members to create a Data Caucus. We are currently working on amending our scope to be compatible with other SLA units, and hope to send out a revised petition soon, so be on the lookout for those emails!
• IASSIST: the International Association for Social Science Information Services and Technology is an organization for data users in the social sciences. It is a small group but international; the emphasis is on research and teaching, for library/information professionals and others.
• ASIS&T: the Association for Information Science and Technology, interdisciplinary, focused on technology
• LinkedIn: check the SU library guide for some LinkedIn groups that deal with data issues.
Thank you for your time and attention today. In a few days I will have these slides up on SlideShare and they will include hotlinks to the resources I’ve been describing. Don’t forget about the Data Caucus, and I hope you now have some starting points for learning more about Big Data. The term “big data” may be a buzzword, and the practices and principles involved with big data issues are still evolving, but our capacity for ever-increasing volume, velocity, and variety of data is not going to disappear any time soon. Do we have time for a few questions?
Data 101- Big Data: What is it and Why Do We Care?
Elaine M. Lasda Bergman
University at Albany
March 6, 2014
for the Special Libraries Association
What we’re going to cover today
• What is Big Data
• What is great about Big Data
• What is not so great
• The role of Librarians and Info Pros in the Big
• Tools and Resources
How Big is Big?
Add Data Literacy!
What We Just Talked About
• The Three V’s
• Amazing Capabilities
• The Human Element
• Our Roles as Information Professionals
Now the Fun Stuff!
• Big Data: A Revolution that Will Transform
How We Live, Work, and Think, by Viktor
• “For Dummies” Books
• An Introduction to Data Science, by Jeffrey
• Frontiers in Massive Data Analysis
General Resource Lists/Training
• Syracuse University Library Guide on Data Science
• ALA ACRL “Keeping Up With Big Data” page
• Data Information Literacy at Purdue wiki
• Council For Big Data, Ethics and Society
• Research Data Management Principles, Practices, and
Prospects – CLIR
• Data Science Weekly http://www.datascienceweekly.org/
• Data Science Central http://www.datasciencecentral.com/
• R-Bloggers http://www.r-bloggers.com/
• Hilary Mason http://www.hilarymason.com/
• Mathbabe http://mathbabe.org/
• Bits Blog in NY Times http://bits.blogs.nytimes.com/
• No Free Hunch http://blog.kaggle.com/
• What’s the Big Data http://whatsthebigdata.com/
One Final Note:
SLA Data Caucus initiative!
LinkedIN Groups see:
Elaine Lasda Bergman
@ElaineLibrarian on Twitter