Let p_ij be the rate at which word i occurs in document j, and p_i be its average across documents (sum_j p_ij / n_docs). The size of each word is mapped to its maximum deviation, max_j(p_ij - p_i), and its angular position is determined by the document where that maximum occurs.
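A minimal sketch of that layout rule, assuming rates are already computed per word and per document (the angular mapping of a document index to a slot on the circle is an assumption; the original only says the position is "determined by the document"):

```python
def word_layout(rates):
    """rates: dict word -> list of per-document rates p_ij.
    Returns dict word -> (size, angle_degrees)."""
    n_docs = len(next(iter(rates.values())))
    layout = {}
    for word, p in rates.items():
        p_bar = sum(p) / n_docs                       # average rate across docs
        devs = [pij - p_bar for pij in p]
        j_max = max(range(n_docs), key=lambda j: devs[j])
        size = devs[j_max]                            # max deviation -> font size
        angle = 360.0 * j_max / n_docs                # doc index -> angular slot
        layout[word] = (size, angle)
    return layout
```

A word used uniformly everywhere gets size near zero; a word concentrated in one document gets a large size and sits at that document's angle.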
“Using Google's enormous bigram dataset, I produced a series of visualizations that explore word associations. Each visualization pits two primary terms against each other. Then, the use frequency of words that follow these two terms is analyzed. For example, "war memorial" occurs 531,205 times, while "peace memorial" occurs only 25,699. A position for each word is generated by looking at the ratio of the two frequencies. If they are equal, the word is placed in the middle of the scale. However, if there is an imbalance in the uses, the word is drawn towards the more frequently related term. This process is repeated for thousands of other word combinations, creating a spectrum of word associations. Font size is based on an inverse power function (uniquely set for each visualization, so you can't compare across pieces). Vertical positioning is random. To better achieve an even distribution, I normalized the frequencies of bigrams based on total primary term frequency. So, for example, in the case of war vs. peace, there are 81,839,381 bigrams starting with war and 31,263,375 bigrams starting with peace. If I render the spectrum without normalization, it ends up lopsided toward war (since the usage totals are so much higher). To compensate, I scale down all of war's bigrams so that the overall frequencies are even.”
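The position rule plus the normalization step can be sketched as follows (the [-1, 1] scale and the exact position formula are assumptions; the quote only says words are drawn toward the more frequent term):

```python
def spectrum_position(count_a, count_b, total_a, total_b):
    """Place a follower word between two primary terms A and B.

    count_a / count_b: bigram counts for "A word" / "B word".
    total_a / total_b: total bigram counts starting with A / B,
    used to normalise away the overall frequency gap between terms.
    Returns a position in [-1, 1]: -1 = entirely A's word, +1 = entirely B's.
    """
    norm_a = count_a / total_a
    norm_b = count_b / total_b
    return (norm_b - norm_a) / (norm_b + norm_a)
```

With the counts from the quote, "memorial" lands well on the "war" side even after normalization, since war's share of its own bigrams is still much larger than peace's.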
“it shows the relative weight of the phrases starting with 'business' that get used in books, in a way so that all the lines add up to 100% for any given year. This gives a good picture of how the use of 'business' changes relative to itself, without the overall trends for the individual words and with stopwords filtered out:”
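The per-year normalization described above is simple to state in code; a sketch (data layout assumed, not from the original):

```python
def share_per_year(counts):
    """counts: dict year -> dict phrase -> raw count.
    Returns dict year -> dict phrase -> percentage of that year's total,
    so all lines add up to 100% for any given year."""
    shares = {}
    for year, by_phrase in counts.items():
        total = sum(by_phrase.values())
        shares[year] = {p: 100.0 * c / total for p, c in by_phrase.items()}
    return shares
```

This removes the overall growth of 'business' itself, leaving only the relative mix of phrases within each year.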
“Bookworm demonstrates a new way of interacting with the millions of recently digitized library books. The Harvard Cultural Observatory already collaborated with Google Books on the Google Ngram Viewer, which has data for years. Bookworm doesn't work so closely with Google Books: instead, it uses texts in the public domain, in this case, books from the Open Library and Internet Archive. They have gathered millions of digital texts, and the descriptions of them that librarians have made over the last two centuries. Bookworm uses that information to let you search for trends in any corpus you can create out of the library metadata, and to link to the underlying books so you can read them.”
Also how you position marks on a canvas in relation to each other
See also:
http://ieg.ifs.tuwien.ac.at/~aigner/teaching/ws06/infovis_ue/papers/spiralgraph_weber01visualizing.pdf
http://www-users.cs.umn.edu/~carlis/spiral.pdf
Emergent Social Positioning: origins: 1.5 degree egonet (how followers follow each other, how hashtaggers follow each other)
- projection maps from followers to folk they commonly follow
- projection maps from hashtaggers to folk they commonly follow
- projection maps from friends to folk who commonly follow them
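The common thread in those projections is a bipartite projection: start from a mapping of one group (followers, hashtaggers, friends) to the accounts they follow, and connect pairs of followed accounts by how many people they share. A minimal sketch under that assumption:

```python
from itertools import combinations
from collections import Counter

def project_commonly_followed(follows):
    """follows: dict follower -> set of accounts they follow.
    Returns Counter mapping (account_a, account_b) -> number of
    shared followers, i.e. the edge weights of the projection."""
    weights = Counter()
    for followed in follows.values():
        # every pair of accounts this person follows gains one shared follower
        for a, b in combinations(sorted(followed), 2):
            weights[(a, b)] += 1
    return weights
```

Thresholding the resulting weights gives the "commonly followed" map; swapping the direction of the follow relation gives the friends-to-common-followers variant.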
AP Wikileaks Iraq war: http://www.guardian.co.uk/news/datablog/2010/dec/16/wikileaks-iraq-visualisation “Each report is a dot, labeled by its key words. Reports with similar key words have edges drawn between them. The location of the dot has nothing to do with geography. Instead, we ran an algorithm that pulls dots with edges between them closer together. Then we labeled each cluster by the key words that are common to the reports in that cluster, and colored each report/dot by the "incident type," as entered by military personnel. The result is an abstract map of the bloodiest month of the war.”
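The edge-building step described in that quote can be sketched as follows (key-word sets per report are assumed as input; the layout itself would then be a force-directed algorithm that pulls connected dots together, which is not reproduced here):

```python
def keyword_edges(reports, min_shared=1):
    """reports: dict report_id -> set of key words.
    Returns list of (id_a, id_b, shared_keywords) for report pairs
    that share at least min_shared key words."""
    ids = sorted(reports)
    edges = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            shared = reports[a] & reports[b]
            if len(shared) >= min_shared:
                edges.append((a, b, shared))
    return edges
```

Clusters then emerge wherever many reports share vocabulary, and the shared key words double as cluster labels.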
Issuecrawler http://www.govcom.org/Issuecrawler_instructions.htm#4 “I wanted to see if, weeks later, I could identify #ididnotreport as a visible issue in the web sphere. More specifically, I wanted to see if I could use network analysis techniques to see if this sobering meme was still a cross-platform issue, prevalent across Twitter, Tumblr, Pinterest and other websites. One of the key tools to use for defining issues in the web sphere is the Digital Methods Initiative's “Issue Crawler”, a web network location and visualization software that consists of crawlers, analysis engines and visualisation modules that crawl specified sites and capture the outlinks from those specified sites. I used Issue Crawler's Co-link analysis module to crawl the seed URLs by page from the query term “#ididnotreport” through 3 iterations – this then retained the pages that received at least two links (at a crawl depth of 2) from the seeds. I then used a Cluster Map to plot my issuecrawl result as a spring map.”
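The co-link criterion at the heart of one iteration — keep only pages that receive links from at least two of the seed sites — can be sketched like this (a simplification of Issue Crawler's actual module, which also handles crawl depth and repeated iterations):

```python
from collections import defaultdict

def colink_filter(outlinks, seeds, min_inlinks=2):
    """outlinks: dict site -> set of URLs it links to.
    Returns the set of pages linked to by at least min_inlinks seeds."""
    linkers = defaultdict(set)
    for seed in seeds:
        for target in outlinks.get(seed, set()):
            linkers[target].add(seed)
    return {url for url, srcs in linkers.items() if len(srcs) >= min_inlinks}
```

Iterating this — the surviving pages become the next round's seeds — is what narrows the crawl down to a coherent issue network.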
“Google Scraper allows you to harvest the top 100 Google search returns for your chosen issue, and then input these into its processing tool, which then outputs a measured result in the form of “issue clouds” that you can use to analyse the prevalence of the perceived issue/key words you have scraped. For this study of Tumblr, I scraped and harvested the top 100 Google search returns for “ididnotreport site:tumblr.com”.”
“I call my diagrams Directed Sentence Drawings because the direction of the line segments is a function of their topic. As before, each sentence is assigned a topic or remains neutral based on the vocabulary it contains. I place a neutral point in the middle of the diagram and four other topic points form a diamond shape around it (see below). For the State of the Union diagrams produced below I used the four topics Government, Domestic, Economy, and Security. The algorithm is as follows:
- start at the neutral point
- find the topic for the sentence and use it to set the color for the line
- draw the line from the current position towards the topic that it is about
- the length of the line is proportional to the length of the sentence
- if the line is continuing in the same direction as the last segment, draw a small circle at the starting point
- if the line is reversing direction, use a small arc to shift it over so it doesn't overlay the previous segment”
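The core path-building steps can be sketched as follows (a simplification: fixed unit directions for the four diamond points instead of steering toward the topic point from the current position, and no circle/arc decorations):

```python
# Assumed coordinates: neutral at the origin, topics at the diamond points.
TOPIC_DIRECTIONS = {
    "Government": (0, 1), "Domestic": (1, 0),
    "Economy": (0, -1), "Security": (-1, 0),
    "Neutral": (0, 0),
}

def sentence_path(sentences, scale=0.1):
    """sentences: list of (topic, word_count) pairs, in document order.
    Returns the list of points the pen visits, starting at neutral."""
    x, y = 0.0, 0.0                      # start at the neutral point
    path = [(x, y)]
    for topic, n_words in sentences:
        dx, dy = TOPIC_DIRECTIONS[topic]
        x += dx * n_words * scale        # step toward the topic,
        y += dy * n_words * scale        # proportional to sentence length
        path.append((x, y))
    return path
```

Runs of same-topic sentences extend the line in one direction; a topic change turns the path, which is what makes the drawings legible at a glance.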
“Rather than using arcs to connect identical patterns within a document I'm connecting instead segments that contain similar words. Here is the algorithm:
- break the document up into a stream of words
- throw away any 'stop words' (a, at, of, the ...)
- divide the remaining stream of more interesting words into 50 equal segments based on linear position
- calculate a similarity metric between each pair of segments based on the amount of overlapping words
- draw a diagram where the document segments are connected by arcs with the transparency determined by the similarity between the segments. Use a threshold so that weakly connected arcs don't get drawn at all.
- show the top two words for each arc drawn at both segment endpoints”
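Those steps can be sketched as below. Jaccard similarity and the tiny stop-word list are assumptions; the original only says the metric is "based on the amount of overlapping words":

```python
STOP_WORDS = {"a", "at", "of", "the", "and", "to", "in"}  # illustrative subset

def segment_similarities(words, n_segments=50, threshold=0.1):
    """Returns (i, j, similarity) for segment pairs above the threshold;
    similarity would drive the arc's transparency."""
    content = [w.lower() for w in words if w.lower() not in STOP_WORDS]
    size = max(1, len(content) // n_segments)
    segments = [set(content[i:i + size])
                for i in range(0, len(content), size)][:n_segments]
    arcs = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            sim = (len(segments[i] & segments[j])
                   / len(segments[i] | segments[j]))  # Jaccard overlap
            if sim >= threshold:
                arcs.append((i, j, sim))
    return arcs
```

The threshold keeps the diagram readable by dropping arcs between weakly related segments entirely rather than drawing them faintly.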
Unlike most Many Eyes visualizations, the word tree starts with a blank slate instead of a full visualization of the data. You must choose a search term to display a word tree. After a word or phrase is typed, the computer finds all the occurrences of that term, along with the phrases that appear after it. For instance, the word "word" occurs a number of times in the previous paragraphs. You will notice that in the words following "word" there are many repeated phrases. For instance, "tree" follows "word" five times, and "or phrase" follows three times. To create a word tree, the computer merges all the matching phrases. You can manipulate the tree in several ways. To zoom into a particular branch, click on a word in the tree. If you control-click on a word, the diagram will use that new word as the main search term. And if you wish to see the context occurring before rather than after a phrase, select End. As you navigate the word tree, you can use the Back and Forward buttons just as you would in a browser to quickly step through your history of views.
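The merge step — fold every phrase following the search term into a trie so shared prefixes become a single branch — can be sketched like this (not the Many Eyes implementation, just the same idea):

```python
def word_tree(tokens, term, depth=3):
    """tokens: the text as a list of words. Builds a nested dict where
    each key is the next word along a branch following `term`."""
    tree = {}
    for i, tok in enumerate(tokens):
        if tok == term:
            node = tree
            # walk/extend the branch for the phrase after this occurrence
            for nxt in tokens[i + 1:i + 1 + depth]:
                node = node.setdefault(nxt, {})
    return tree
```

Repeated continuations share a node, so a word that follows the term five times appears once, as a branch five occurrences flow through.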
Many Eyes phrase net: “Phrase net analyzes a text by looking for pairs of words that fit particular patterns. You can specify this pattern by using asterisks as wildcard characters. For instance, the pattern "* and *" will match phrases like "play and sing" or "vexation and regret." Punctuation matters, so it will not match "left, and then". You can choose from some useful defaults or you can type your own patterns in the field below the list. After you specify a pattern, the program creates a network diagram of the words it finds as matches. Two words are connected if they occur in the same phrase. The size of a word is proportional to the number of times it occurs in a match; the thickness of an arrow between words tells you how many times those two words occur in the same phrase. The color of a word indicates whether it is more likely to be found in the first or second slot of a pattern. The darker the word, the more often it appears in the first position. Defining patterns: Matching different patterns gives different views of the text. Each text is unique, so it is worth experimenting. For instance, looking for the pattern "* and *" will often highlight key related concepts. In contrast, the pattern "* 's *" will often result in a diagram of the main people and the things they possess. The simplest pattern is "* *", which links words if they come in immediate succession; this often provides a surprisingly clear view, especially for short documents. Sometimes there is a special pattern that will provide information on a particular document. For example, applying "* begat *" to the King James Bible yields a rough family tree. There are three ways to specify a pattern. The easiest is to choose one of the defaults from the list on the left. A second way is to type a pattern with two asterisks for the "slots" of the pattern. Note that you need exactly two asterisks for the pattern to work.
Finally, there's an advanced programmers-only option, which is to use a "regular expression" with two capturing groups. For an introduction to regular expressions, read this tutorial (java.sun.com/docs/books/tutorial/essential/regex/).”
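A sketch of how a wildcard pattern like "* and *" maps onto the "regular expression with two capturing groups" the advanced option describes (the `\w+` word definition is an assumption; note also that matches are non-overlapping, so "a and b and c" yields one pair):

```python
import re

def pattern_matches(pattern, text):
    """Translate a two-asterisk wildcard pattern into a regex with two
    capturing groups, then return all (first_slot, second_slot) pairs."""
    regex = re.escape(pattern).replace(r"\*", r"(\w+)", 2)
    return re.findall(regex, text)
```

The resulting pairs are exactly what the network diagram needs: each pair becomes an arrow from the first-slot word to the second-slot word.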
1. Qualitatively Visual. Tony Hirst, Dept of Communication & Systems, The Open University, blog.ouseful.info, @psychemedia
2. STOP Words
3. Stop wordlists to tf-idf… term frequency-inverse document frequency Information retrieval research
4. wordlists to tf-idf… term frequency-inverse document frequency: Term Frequency × Inverse Document Frequency (information retrieval research)
5. tf-idf: LARGE when a term is common in a small number of docs; SMALL when a term is common across many docs
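A minimal tf-idf sketch matching the slide (the unsmoothed log form is one common variant; libraries like scikit-learn use slightly different smoothing):

```python
import math

def tf_idf(term, doc, docs):
    """doc: list of tokens; docs: list of such token lists."""
    tf = doc.count(term) / len(doc)            # term frequency in this doc
    df = sum(1 for d in docs if term in d)     # docs containing the term
    idf = math.log(len(docs) / df)             # inverse document frequency
    return tf * idf
```

A term appearing in every document gets idf = log(1) = 0, so its tf-idf vanishes no matter how frequent it is; a term concentrated in one document scores high, which is exactly the LARGE/SMALL behaviour the slide describes.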
10. Explanatory visualizationData visualizations that are used totransmit information or a point ofview from the designer to thereader. Explanatory visualizationstypically have a specific “story” orinformation that they are intendedto transmit.Exploratory visualizationData visualizations that are used bythe designer for self-informativepurposes to discoverpatterns, trends, or sub-problemsin a dataset. Exploratoryvisualizations typically don’t havean already-known story.
28. - grab a list of companies that may be associated with “Tesco” by querying the OpenCorporates reconciliation API for tesco
- grab the filings for each of those companies
- trawl through the filings looking for director appointments or terminations
- store a row for each directorial appointment or termination including the company name and the director.
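The filing-trawl step can be sketched as below, assuming the reconciliation query and filing download have already happened. The `title` and `director` field names are illustrative placeholders, not the actual OpenCorporates schema:

```python
def director_events(company, filings):
    """filings: list of dicts describing a company's filings.
    Returns one row per directorial appointment or termination."""
    rows = []
    for filing in filings:
        text = filing.get("title", "").lower()
        if "appointment" in text or "termination" in text:
            rows.append({"company": company,
                         "event": text,
                         "director": filing.get("director", "")})
    return rows
```

Running this over every reconciled company and concatenating the rows gives the appointment/termination table the notes describe.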