Prescribed analytics are dowsing rods: they often fail to deliver valuable insights from data because they rely on predefined questions rather than letting discoveries emerge from exploration. The argument here is that data analysis works best as a general process, one that moves from freely creating and curating unstructured data online to analyzing and interpreting it in a way that surfaces unexpected findings, represented as X in the diagram.
Devin Gaffney, introduction
Janet Brokerage cloud scholarship recipient
I want to share with you the thinking behind what I’m working on with these resources.
This is my friend Hugh. He put out this tweet, got this Klout notification, and posted it to Facebook.
It’s wrong. Why?
This comes from a very clear source - in the 2000s, the technological, economic, and infrastructural conditions were finally right for unfettered user-based data creation on a massive scale. It didn’t matter what people created, so long as they created it on some website... somewhere. This is the core definition of the Web 2.0 model: drive people to a website, keep them there creating whatever they want, and display that content across many different pages with many different advertisements once the company reaches scale.
The current model is a direct derivative of the previous one: people created all this data, and online advertising is very bad at what it does - so can what people do online be used to make better advertising? It’s a sad motivation, but that’s where the drive comes from. Ultimately, what we could start to define as the next iteration of the web is being shaped by the process of creating websites that extract meaning for both casual and high-end user bases, in an effort to understand why, how, what, when, and where people produce content.
But it’s being done totally backwards - what we have now, across all these different firms, is an environment where we create “secret sauce” algorithms that give us very top-down analytics to understand the fundamentals of people and their use of the internet. This isn’t methodologically safe.

The problem is that we put the specific repertoire of available analytics before the question at hand. If you look at academic papers that quantitatively analyze aspects of the Internet (a field that goes by a host of names, popularly web science and webometrics), the vast majority are homebrewed approaches based on methodological practices drawn from literature ranging from graph theory to crowd psychology, shoe-horned into the available data as best they can be. A combination of this shoe-horning, the fitness of the data for the question at hand, and luck are the big variables in writing a good article that pushes our understanding of the internet further.

What I’ve learned over the past three years of studying Twitter’s data is that the largest area for improvement is making both this shoe-horning process and data availability easier - and, if possible, eliminating the shoe-horning altogether.
That’s been the goal with 140kit, a Twitter analytics platform I work on. Twitter is clearly a potentially useful data source for understanding people and their relationship with the internet, but Twitter research faces two major constraints. Due to Twitter’s terms of service, we can’t provide simple raw dumps of Twitter data to the research community. Due to Twitter API limitations, we can only collect data live - no retrospective dataset can be collected in a representative way. These constraints, however, have pushed us toward a liberating approach. Since we can’t easily access old data, we need many researchers to serve as the eyes and ears for all researchers when something interesting happens on Twitter. Since we can’t provide raw data, we needed a flexible way to answer as many questions as possible for researchers. What we built is a platform not for analyzing data, but for facilitating analysis. Here is the page on our system where we define an analytic for users to employ - we simply fill in a couple of fields, tell the system what language the analytic is written in and what function to call, and it’s added.
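To make that concrete, here is a minimal sketch of the kind of information such a definition might capture. The field names and the example analytic are illustrative assumptions, not 140kit’s actual schema:

```python
# Hypothetical sketch of registering an analytic on a 140kit-style platform.
# Field names are assumptions for illustration, not the real data model.
from dataclasses import dataclass

@dataclass
class AnalyticDefinition:
    name: str         # label shown to users, e.g. "Hashtag Frequency"
    language: str     # language the analytic is written in
    entry_point: str  # function the system calls to run it
    description: str  # plain-language explanation for researchers

top_hashtags = AnalyticDefinition(
    name="Hashtag Frequency",
    language="python",
    entry_point="hashtag_counts.run",
    description="Counts hashtag occurrences across a collected dataset.",
)
```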
From there, we can add descriptions and rules for user-inputted variables, and define rules for dependent analytics, allowing us to chain processes together.
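One plausible way to model that chaining, assuming each analytic declares the upstream analytics it depends on (again, names and structure are hypothetical):

```python
# Illustrative dependency chaining: a downstream analytic names its upstream
# requirements, and a depth-first walk produces a valid run order.
from dataclasses import dataclass, field

@dataclass
class ChainedAnalytic:
    name: str
    depends_on: list = field(default_factory=list)   # analytics that must run first
    user_params: dict = field(default_factory=dict)  # rules for user-supplied inputs

ANALYTICS = {
    "Tweet Parser": ChainedAnalytic(name="Tweet Parser"),
    "Retweet Graph": ChainedAnalytic(
        name="Retweet Graph",
        depends_on=["Tweet Parser"],
        user_params={"min_retweets": {"type": "integer", "default": 1}},
    ),
}

def run_order(requested, analytics=ANALYTICS):
    """Resolve dependencies so upstream analytics run before their dependents."""
    order, seen = [], set()
    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in analytics[name].depends_on:
            visit(dep)
        order.append(name)
    visit(requested)
    return order

print(run_order("Retweet Graph"))  # ['Tweet Parser', 'Retweet Graph']
```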
Once an analytic is written, it is easy to apply to any arbitrary set of collected data - any user can request an analysis of any dataset on the system. When a dataset goes more than a day without anyone visiting it, it’s archived to keep computational costs to a minimum, and brought back live upon request.
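The archive-on-inactivity policy described above could look something like the following sketch, assuming each dataset tracks its last visit time; the storage calls are placeholders, not 140kit’s actual internals:

```python
# Sketch of the one-day inactivity archival policy. The archive()/restore()
# calls stand in for whatever cold-storage mechanism the system uses.
from datetime import datetime, timedelta

ARCHIVE_AFTER = timedelta(days=1)

def sweep(datasets, now=None):
    """Archive any live dataset that nobody has visited in the last day."""
    now = now or datetime.utcnow()
    for ds in datasets:
        if ds.live and now - ds.last_visited_at > ARCHIVE_AFTER:
            ds.archive()  # move to cold storage to keep compute costs down

def request_dataset(ds):
    """Bring an archived dataset back online when a user asks for it."""
    if not ds.live:
        ds.restore()
    ds.last_visited_at = datetime.utcnow()
```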
When it’s live, the developer can define their own page for how the resulting data will look, what it will say, and what actions the user can take - it can be as simple as a table, or an embedded visualization of any stripe - and that page is similarly abstracted from the site’s code.
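A minimal sketch of that abstraction, assuming the site simply dispatches to whatever renderer an analytic’s developer registered (the interface is an assumption):

```python
# Hypothetical per-analytic result page hook: the site stays agnostic about
# whether the output is a plain table or an embedded visualization.
def render_table(results):
    """Default presentation: a plain HTML table of the analytic's results."""
    rows = "".join(f"<tr><td>{k}</td><td>{v}</td></tr>" for k, v in results.items())
    return f"<table>{rows}</table>"

RESULT_RENDERERS = {
    "Hashtag Frequency": render_table,  # could just as easily return a chart embed
}

def result_page(analytic_name, results):
    """Call whatever renderer the analytic's developer registered."""
    return RESULT_RENDERERS[analytic_name](results)
```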
And the code is all available on GitHub, written as straightforwardly as possible - when we open this up to real widespread development, we certainly anticipate some sort of vetting process. Someday, we hope to have an online component that allows users to visually stitch together their own basic analytics.
And since this is hosted on scalable infrastructure, everything we run for the system is set up across multiple machines - we’re still a little way from being able to ramp up an instance at a push-button level, but we currently have an administrative system on the site that lets us control, monitor, and scale all the running processes.
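The kind of control such an admin panel might expose can be sketched as scaling worker counts per role; the process management here is purely illustrative, since the talk doesn’t specify the real tooling:

```python
# Illustrative worker scaling: start or stop processes for one role until the
# target count is met. "worker.py" and the role names are assumptions.
import subprocess

def scale(role, target, running):
    """Adjust the number of worker processes for a given role."""
    procs = running.setdefault(role, [])
    while len(procs) < target:
        procs.append(subprocess.Popen(["python", "worker.py", role]))
    while len(procs) > target:
        procs.pop().terminate()

# e.g. an admin action bumping analysis capacity:
# scale("analyzer", 8, running_processes)
```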
This is where people like you all can come in - the way we’re doing this for Twitter is a general process, not a case-specific approach. If we want to really study the internet, we should prioritize the potential diversity of questions over the single-track analytics approach we currently see. It’s easy to imagine more generalized versions that let researchers crawl the web at large, or focus in on specific domains of research - either way, this approach will leave us all better off. While the prevailing model will probably still lead to some commercial success, it isn’t making us better off in the long run. Allowing flexibility in analytics shouldn’t be an advanced feature or a far-out addition - putting the focus on this flexibility will ultimately let people ask the questions they actually want to ask, and make us all better off, regardless of the sector we work in.