The New Challenge of Data Inflation

By Jon Gosier

The online chatter of individuals, networks of friends or professionals, and the data being created by networked devices are growing exponentially, although our time to consume it all remains depressingly finite. There is a plethora of solutions that approach the challenge of data inflation in different ways. But what methodologies have worked at scale, and how are they being used in the field?

The Infinite Age

First, let's look at the landscape. The amount of content being created daily is incredible, and includes the growth of image data, videos, networked devices, and the contextual metadata created around these components.

The Information Age is behind us; we're now living in an Infinite Age, where the flow of data is ever-growing and ever-changing. The life of this content may be ephemeral (social media) or mostly constant (static documents and files). In either case, the challenge is dealing with it at human scale.

In the time it takes to finish reading a given blog post, chances are high that it's been retweeted, cited, aggregated, pushed, commented on, or 'liked' all over the web. An all-encompassing verb to describe this is sharing. All of this sharing confuses our ability to define the essence of that original content, because value has been added at different points by people who've consumed it.

Does that additional activity become a part of the original content, or does it exist as something else entirely? Is the content the original blog post, or is it that post coupled with the likes, retweets, comments, and so on?

Content is now born and almost instantaneously multiplies, not just through copying, but also through the interactions individuals have with it — making it their own and subsequently augmenting it.

In the Infinite Age, one person may have created the content, but because others consumed it, it's as much a representation of the reader as it is of the author.

The Inflation Epoch

In physics, the moment just after the Big Bang, but prior to the early formation of the universe as we understand it, is called the Planck Epoch or the GUT Era (for the Grand Unified Theory that explains how all four forces of nature were once unified). The Planck Epoch is defined as the moment of accelerated growth when the universe expanded from a singularity smaller than a proton.

We mirror this moment a trillion times each day when online content is created and augmented by the surrounding activity from others online. From the moment we publish, the dogpile of interaction that follows is a bit like the entropy that followed the Big Bang: exponential and rapid before gradually trailing off to a semi-static state where people may still interact, but usually with less frequency, before leaving to consume new content being published elsewhere.
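To make the article's central idea of content "inflating" concrete, here is a minimal sketch of a content item that accumulates interaction metadata as it is shared. All names and fields are illustrative, not drawn from any system described here:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Interaction:
    """One act of sharing: a like, retweet, comment, and so on."""
    kind: str    # e.g., "like", "retweet", "comment"
    actor: str   # who interacted with the content

@dataclass
class InflatedContent:
    """The original content plus the activity that accumulates around it."""
    author: str
    body: str
    published: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    interactions: list[Interaction] = field(default_factory=list)

    def record(self, kind: str, actor: str) -> None:
        """Each share augments the original, inflating it with new metadata."""
        self.interactions.append(Interaction(kind, actor))

post = InflatedContent(author="alice", body="A blog post about data inflation.")
post.record("retweet", "bob")
post.record("comment", "carol")
print(len(post.interactions))  # already two augmentations beyond the original
```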
When people talk about big data, this is one of the problems they are discussing. The amount of content that is created daily can be measured in the billions (articles, blog posts, tweets, text messages, photos, etc.), but that's only where the problem starts. There's also the activity that surrounds each of those items — comment threads, photo memes, likes on Facebook, and retweets on Twitter. Outside of this social communication, there are other problems that must be solved, but for the sake of this article we'll limit our focus to big data as it relates to making social media communication actionable.

How are people dealing with social data at this scale to make it actionable without becoming overwhelmed?

Methodology 1 – Index and Search

Search technology as a method for sorting through incredibly large data sets is the one that most people understand, because they use it regularly in the form of Google. Keeping track of all the URLs and links that make up the web is a seemingly Sisyphean task, but Google goes a step beyond that by crawling and indexing large portions of public content online. This helps Google perform searches much faster than having to crawl the entire web every time a search is executed. Instead, connections are formed at the database level, allowing queries to occur faster while using fewer server resources. Companies like Google have massive data centers that enable these indexes and queries to take seconds.

As big data becomes a growing problem inside of organizations as much as outside, index and search technology has become one way to deal with data. A swath of technologies allows organizations to accomplish this without the same level of infrastructure because, understandably, not every company can afford to spend like Google does on its data challenges. Companies like Vertica, Cloudera, 10gen, and others provide database technology that can be deployed internally or across cloud servers (think Amazon Web Services), which makes dealing with inflated content easier by structuring it at the database level so that retrieving information takes fewer computing resources.

This approach allows organizations to capture enormous quantities of data in a database so that it can be retrieved and made actionable later.
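The core mechanism behind this approach is the inverted index: rather than scanning every document for every query, the system precomputes a map from terms to the documents that contain them. Here is a minimal sketch for illustration only; the products named above add distribution, ranking, and compression on top of this idea:

```python
from collections import defaultdict

def tokenize(text: str) -> list[str]:
    return text.lower().split()

class InvertedIndex:
    """Maps each term to the set of document IDs that contain it, so a
    query consults the index instead of re-reading every document."""

    def __init__(self) -> None:
        self.postings: defaultdict[str, set[int]] = defaultdict(set)

    def add(self, doc_id: int, text: str) -> None:
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query: str) -> set[int]:
        # Intersect the postings of each query term: only documents
        # containing every term survive.
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result

index = InvertedIndex()
index.add(1, "hurricane irene reaches the coast")
index.add(2, "flooding reported after irene")
print(index.search("irene"))  # {1, 2}
```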
Methodology 2 – Contextualization and Feature Extraction

Through the development of search technologies, the phrase "feature extraction" became common terminology in information retrieval circles. Feature extraction uses algorithms to pull out the individual nuances of content. This is done at what I call an atomic level, meaning any characteristic of data that can be quantified. For instance, in an email, the TO: and FROM: addresses would be features of that email. The timestamp indicating when that email was sent is also a feature. The subject would be another. Within each of those there are more features as well, but this is the general high-level concept.

[Figure: Stacked waveform graph used to plot data with values in a positive and negative domain (like sentiment analysis). Waveform 1 = value for all categories over time; Waveform 2 = value for a sub-category over time; Waveform 3 = value for a disparate category over time. Produced by the author using metaLayer.]
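As a sketch of the email example, Python's standard library email module can pull those atomic features out of a raw message. The message text below is invented for illustration:

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

RAW_MESSAGE = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly report
Date: Mon, 09 Jan 2012 10:30:00 -0500

Please find the report attached.
"""

def extract_features(raw: str) -> dict:
    """Extract the atomic features described above: sender, recipient,
    subject, and the timestamp of when the message was sent."""
    msg = message_from_string(raw)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        "sent_at": parsedate_to_datetime(msg["Date"]),
    }

print(extract_features(RAW_MESSAGE))
```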
For the most part, search uses such features to help improve the efficiency of the index. In contextualization technologies, the practice is to use these features to modify the initial content, adding them as metadata and thereby creating a type of inflated content (as we've augmented the original).

When users drag photos onto a map on Flickr, they are contextualizing them by tagging the files with location data. This location data makes the inflated content actionable; we can view it on a map, or we can segment photos by where they were taken. This new location data is an augmentation of the original metadata, and creates something that previously did not exist.

When we're dealing in the hundreds of thousands, millions, and billions of content items, feature extraction is used to carry out the previous example at scale.

The following is a real-world use case, though I've been careful not to divulge any of the client's proprietary details.

Recently at my company metaLayer, a colleague came to us with one terabyte of messages pulled from Twitter. These messages had been posted during Hurricane Irene. The client needed to generate conclusions about these messages that were not going to be possible in their original, loosely structured form. He asked us to help him structure this data, find the items he was looking for (messages about Hurricane Irene or people affected by it), and extract the features that would be useful for him.

The client lacked sufficient context to identify what was most relevant in the data set. So we used our platform to algorithmically extract features like sentiment, location, and taxonomic values from each Twitter message using natural language processing. Because it was an algorithmic process, this only took a few hours, allowing the client to get a baseline that made the rest of his research possible. Now the data could be visualized or segmented in ways that weren't possible with the initial content. The inflated content — metadata generated algorithmically — included elements that made his research possible. We now had an individual profile of every message that gave us a clue about its tone, location of origin, and how to categorize the message. This allowed the client's team to look at the data with a new level of confidence.

In the context of our social data challenge, these extracted features might be used on their own by an application, or they might become part of an index or database like the one mentioned previously.

[Figure: Infographic showing network analysis of the SwiftRiver platform. Produced by the author using Circos.]
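The platform used in this case is proprietary, so as a stand-in, here is a toy version of that enrichment step: lexicon-based sentiment, place spotting, and taxonomy tagging. The word lists and messages are invented, and a real pipeline would use trained NLP models rather than lookup tables:

```python
# Toy lexicons standing in for trained NLP models.
POSITIVE = {"safe", "thanks", "relief", "helping"}
NEGATIVE = {"flood", "damage", "stranded", "outage"}
KNOWN_PLACES = {"brooklyn", "queens", "vermont"}
TAXONOMY = {
    "weather": {"hurricane", "storm", "rain", "wind"},
    "infrastructure": {"power", "outage", "bridge", "road"},
}

def enrich(message: str) -> dict:
    """Build the per-message profile described above: tone, location
    of origin, and how to categorize the message."""
    words = set(message.lower().replace(",", " ").split())
    return {
        "text": message,
        "sentiment": len(words & POSITIVE) - len(words & NEGATIVE),
        "locations": sorted(words & KNOWN_PLACES),
        "topics": sorted(t for t, vocab in TAXONOMY.items() if words & vocab),
    }

messages = [
    "Power outage in Brooklyn after the storm",
    "Thanks to everyone helping, we are safe in Vermont",
]
for profile in map(enrich, messages):
    print(profile)
```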
Methodology 3 – Visualization

Visualization is another way to deal with excessive data problems. Visualization may sound like an abstract term, but the visual domain is actually one of the most basic things humans can use for relating complex concepts to one another. We've used symbols for thousands of years to do this. Data visualizations and infographics simply use symbols to cross barriers of understanding about the complexities of research so that, regardless of expertise, everyone involved has a better understanding of a problem.

Content producers like the New York Times have found that visualizations are a great way to increase audience engagement with news content. The explosion of interest in infographics and data visualizations online echoes this.

To visualize excessive data sets, leveraging some of the previous methods makes discovering hidden patterns and trends a visual, and likely more intuitive, process.
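To ground this, here is a small matplotlib sketch in the spirit of the stacked waveform figure earlier: net sentiment over time, with values in both the positive and negative domain. The numbers are invented for illustration:

```python
import matplotlib.pyplot as plt

# Hypothetical hourly net-sentiment scores for a stream of storm-related
# messages, e.g., produced by an enrichment step like the one sketched above.
hours = range(12)
net_sentiment = [2, 1, -1, -4, -6, -8, -5, -3, -1, 0, 2, 3]

# Diverging bars make the positive/negative domain legible at a glance.
colors = ["tab:green" if s >= 0 else "tab:red" for s in net_sentiment]
plt.bar(hours, net_sentiment, color=colors)
plt.axhline(0, color="black", linewidth=0.8)
plt.xlabel("Hours since landfall")
plt.ylabel("Net sentiment of messages")
plt.title("Sentiment over time (illustrative data)")
plt.show()
```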
Methodology 4 – Crowd Filtering and Curation

The rise of crowdsourcing methodologies presents a new framework for dealing with certain types of information overload. Websites like Digg and Reddit are examples of using crowdsourcing to vet and prioritize data by the cumulative interests of a given community. On these websites, anyone can contribute information, but only the material deemed to be interesting by the highest number of people will rise to the top. By leveraging the community of users and consumers itself as a resource, the admin passes the responsibility of finding the most relevant content to the users.

This, of course, won't work in quite the same way for an organization's internal project or your data mining project, but it does work if you want to limit the information collected to those in the crowd who have some measure of authority or authenticity.

The news curation service Storyful.com is a great example of using crowd-sourced information to report on breaking events around the globe. Its system works not because the masses are telling the stories (that would lead to unmanageable chaos), but because the staff behind Storyful has pre-vetted contributors. This is known as bounded crowdsourcing, which simply means to extend some measure of authority or preference to a subset of a larger group. In this case, the larger group is anyone using social media around the world, whereas the bounded group is only those around the world whom Storyful's staff has deemed to be consistent in their reliability, authority, and authenticity. This is commonly referred to as curating the curators.

Curation has risen as a term over the past decade to refer to the collection, intermixing, and re-syndication of inflated content. It is used to refer to the construction of narratives without actually producing any original content, instead taking relevant bits of content created by others and using them as the building blocks for something new. This presentation of public data as edited by others represents a new work, though this "new work" may be made up of nothing original at all. By carefully selecting curators who you know will be selective in what they curate, the aggregate of information produced should be of a higher quality.

Conclusion

Big data, like most tech catchphrases, means different things to different people. It can refer to managing an excess of data, to an overwhelming feed of data, or to the rapid proliferation of inflated content due to the meta-values added through sharing. Pulling actionable information from streams of social communication represents a unique challenge in that it embodies all aspects of the phrase and the accompanying challenges. Ultimately, if content is growing exponentially, the methods to manage it have to be capable of equal speed and scale.

Jon Gosier is the co-founder of metaLayer Inc., which offers products for visualization, analytics, and the structuring of indiscriminate data sets. metaLayer gives companies access to complex visualization and algorithmic tools, making intelligence solutions intuitive and more affordable. Jon is a former staff member at Ushahidi, an organization that provides open source data products for global disaster response, and has worked on signal-to-noise problems with international journalists and defense organizations as Director of its SwiftRiver team.