Blogviz Thesis by Manuel Lima

Blogviz is a Flash-driven visualization model for mapping the transmission and internal structure of top links across the blogosphere. It explores the idea of meme propagation by drawing a parallel with the spread of the most cited URLs in daily weblog entries.

Blogviz is currently a portrait of the blogosphere’s topic activity during the first 64 days of 2005. Nevertheless, the model was developed to easily incorporate different timeframes. Blogviz will continue to expand in the future, possibly to the point of including real-time data.

found here:

Published in Technology , Education
  • 1. blogviz Mapping the dynamics of Information Diffusion in Blogspace by Manuel Lima A thesis document submitted in partial fulfillment of the requirements for the degree of Master of Fine Arts in Design and Technology. Parsons School of Design May 2005 Thesis Instructor: Christopher Kirwan Writing Instructor: Mark Stafford Manuel Lima
  • 2. blogviz Mapping the dynamics of Information Diffusion in Blogspace by Manuel Lima Abstract Blogviz is a visualization model for mapping the transmission and internal structure of top links across the blogosphere. It explores the idea of meme propagation by assuming a parallel with the spreading of most cited URLs in daily weblog entries. The main goal of Blogviz is to unravel hidden patterns in the topics diffusion process. What’s the life cycle of a topic? How does it start and how does it evolve through time? Are topics constrained to a specific community of users? Who are the most influential and innovative blogs in any topic? Are there any relationships amongst topic proliferators? Keywords Information Diffusion, Memetics, Weblogs, Online Social Communities, Complex Networks, Information Architecture, Information Visualization, Diffusion of Innovations, Epidemiology, Small Worlds
  • 3. Acknowledgements − Scott Patterson Jared Schiffman David Kearford Fura Johannesdottir Thank you for your feedback − Christopher Kirwan Mark Stafford Thank you for your guidance, openness and continuous motivation − My dearest Parents Thank you for your eternal support and dedication
  • 4. Table of Contents
      1 Introduction
        1.1 Concept
        1.2 Memetics
        1.3 Diffusion of Innovations
        1.4 Epidemiology
      2 Impetus
        2.1 Subject of Analysis
      3 Context
        3.1 Online Social Communities
        3.2 Weblogs
        3.3 Blogosphere
      4 Audience
      5 Precedents
      6 Methodology
        6.1 Summer Research
        6.2 Visual Explorations
        6.3 Prototype #1
        6.4 Prototype #2
        6.5 Prototype #3
        6.6 Prototype #4
        6.7 Final Application
      7 Technical Sources
        7.1 Blog Engines
        7.2 Blogviz Data
      8 Conclusion
      9 Bibliography
      Appendix A Summer Research Presentation
      Appendix B Complex Networks: Visual Explorations
  • 5. 1 Introduction Blogging presents one of the most interesting social phenomena of our time. This change in the flow of online information might radically change the way we look at news providers and large media conglomerates. It also provides an extraordinary online laboratory to analyze how trends, ideas and information travel through social communities. 1.1 Concept Blogviz is a non-commercial research project developed with the intent of disentangling this highly complex network for further study, research and analysis. The main goal of Blogviz is to improve our understanding of the dynamics of information propagation among weblogs. An underlying question for Blogviz is: “How can we measure a meme as a unit of cultural evolution?” The answer is not easy. Memes are widespread and their evolutionary tracks are frequently untraceable, which makes them extremely hard to measure accurately. In contrast to this commonly undetectable meme pool, the blogosphere offers a discernible and documented map of thousands of memes, with clear trails of progression, structured by date and time. There are many possible ways of looking at information diffusion in blogspace. It can be based on conversation threads, comment threads, key sentences, themes, tags, or top links. Blogviz analyzes top links, occasionally called topics, which represent the most cited URLs appearing in blog entries on any given day. These popular links represent particular memes that provide an idea of the sources, stories and themes that have occupied the attention of bloggers over a certain period of time. By exploring the evolution of these topics through time, Blogviz will not only be able to track their popular dispatchers and key innovators, but also follow their dissemination pattern from the beginning to an eventual tipping point, where a topic might leap the blog community and reach the mainstream.
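The notion of a "top link" can be made concrete with a small sketch. It assumes, purely for illustration, that blog entries are available as `(day, blog, url)` records; this record format and the function name `top_links_per_day` are hypothetical and are not taken from the Blogviz database. A URL is counted once per blog per day, so one blog citing a link repeatedly does not inflate its rank.

```python
from collections import defaultdict


def top_links_per_day(entries, n=3):
    """Return {day: [(url, number_of_citing_blogs), ...]} with the n most cited URLs.

    entries: iterable of (day, blog, url) tuples (a hypothetical record format).
    """
    # Collect the distinct blogs citing each URL on each day.
    blogs_by_day_url = defaultdict(set)
    for day, blog, url in entries:
        blogs_by_day_url[(day, url)].add(blog)

    # Group the per-URL counts by day.
    per_day = defaultdict(list)
    for (day, url), blogs in blogs_by_day_url.items():
        per_day[day].append((url, len(blogs)))

    # Rank by citation count (descending); ties are broken alphabetically by URL.
    return {
        day: sorted(counts, key=lambda pair: (-pair[1], pair[0]))[:n]
        for day, counts in per_day.items()
    }


sample_entries = [
    ("2005-01-01", "blog_a", "http://example.org/story"),
    ("2005-01-01", "blog_b", "http://example.org/story"),
    ("2005-01-01", "blog_b", "http://example.org/story"),  # repeat citation, counted once
    ("2005-01-01", "blog_c", "http://example.org/other"),
    ("2005-01-02", "blog_a", "http://example.org/other"),
]
ranking = top_links_per_day(sample_entries, n=2)
```

Counting distinct blogs rather than raw citations is a design choice of this sketch; a different definition of "most cited" would change the ranking.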
  • 6. Blogviz embodies a Flash-driven interactive visualization model with extensive use of information visualization and information architecture. Why is Information Visualization central to Blogviz? Information Visualization can be defined as "the use of computer-supported, interactive, visual representations of abstract data to amplify cognition" (Card, Mackinlay & Shneiderman, 1999). Information Visualization not only makes data easier for humans to interpret, it also discovers and highlights relationships among data elements, usually reducing the effort of searching by gathering information in a small, rich space. Blogviz therefore employs Information Visualization with the key intent of uncovering hidden patterns in the data and deriving plausible conclusions that advance our knowledge of information dynamics in blogspace. By unraveling the modus operandi behind the blogosphere we might be able to improve our knowledge of the mechanics of online social communities and, to some extent, the mechanics of complex social networks. Blogviz is currently a portrait of the blogosphere’s topic activity during the months of January and February 2005. The selection of this time period was purely arbitrary: to make the project feasible within the time limits of thesis development, it was constrained to a specific time span. Nevertheless, the model was developed to easily incorporate different timeframes. Blogviz will continue to expand in the future, possibly to the point of including real-time data. Blogviz uses existing data from three different blog search engines, organized in a database that will soon be available for public access (see Technical Sources for additional information).
  • 7. 1.2 Memetics From a conversation with my Thesis Writing instructor, Mark Stafford, I came to understand how closely my thesis had become related to the concepts of memetics, or meme behavior. We came to the conclusion that I was developing a “topological model of meme activity”, even if until then I had been somewhat oblivious to it. That title actually remained for a while as a characterization of Blogviz. Later on I decided to change it, since the word meme was slightly audience-limiting and the expression topological could lead to inadequate interpretations. I still question why the notion of Memetics didn’t come up earlier in my research, but what is particularly interesting is that it was there from the beginning, immersed in every iteration of my work. I think I was too concentrated on the idea of word-of-mouth behavior, an expression used by Malcolm Gladwell in “The Tipping Point” and by Duncan Watts in “Six Degrees: The Science of a Connected Age”. The vital point is that Memetics is the principal theory when contextualizing Blogviz, and because of that, understanding the theory of Memetics is crucial to comprehending the underlying concept of Blogviz. 1.2.1 What’s a Meme? The term was coined by Richard Dawkins in 1976, in his well-known book “The Selfish Gene”. In the words of Dawkins, the word "meme" refers to "a unit of cultural transmission, or a unit of imitation". More specifically, a meme can be defined as a self-propagating unit of cultural evolution, a unit of information, held in an individual's memory or in an outside artifact (e.g. book, record or tool), which is likely to be communicated or copied to another individual's memory or retention system. Examples of memes are ideas, catch-phrases, melodies, technologies, icons, theories, inventions, languages, designs, fashions, and traditions. This covers all forms of beliefs, values and behaviors that are normally taken over from others rather than discovered independently. A meme is basically a pattern of information that induces people to repeat it. People try to “infect” each other with the memes they find most appealing, regardless of the memes' objective value or truth.
  • 8. 1.2.2 What is Memetics? Memetics is the study of evolutionary models of information transmission based on the concept of the meme. In spite of its roots in evolutionary biology and computer simulation, memetics has become more of a social science, focusing primarily on the spread of information within human society. Rather than debate the inherent "truth" or lack of "truth" of an idea, memetics is largely concerned with how that idea itself gets replicated. Another definition describes Memetics as the theoretical and empirical science that studies the replication, spread and evolution of memes. As portrayed in the Journal of Memetics*: “Its core idea is that memes differ in their degree of ‘fitness’, i.e. adaptation to the socio-cultural environment in which they propagate. Because of natural selection, fitter memes will be more successful in being communicated, ‘infecting’ a larger number of individuals and/or surviving for a longer time within the population. Memetics tries to understand what characterizes fit memes, and how they affect individuals, organizations, cultures and society at large”. Since the premise of Memetics is to investigate the evolutionary mechanisms that determine the propagation of information within a population of human, animal or artificial agents, we can easily perceive why this science is vital to the understanding of cults, ideologies, or marketing campaigns of all kinds. A meme is acknowledged as a self-propagating unit of cultural evolution, analogous to the gene (the unit of genetics). And because memes behave similarly to life forms, Memetics embraces the analytical techniques of diverse sciences, such as epidemiology, evolutionary science, immunology, diffusion of innovations, linguistics, and semiotics. * Journal of Memetics (
  • 9. 1.3 Diffusion of Innovations I believe any type of Information Diffusion Model (IDM) in Social Networks must derive extensive practical knowledge from the sciences of epidemiology and diffusion of innovations. These two domains help us understand many of the factors that characterize the spread of information and the adoption process in social communities. Epidemiology and Diffusion of Innovations also share many similarities and are surprisingly closely linked. For these reasons I decided to include in this thesis a short description of these areas, since in addition to the concept of Memetics, they create an extraordinary context for the understanding of Blogviz. I do not explain each domain at length, but rather compare them in terms of how they relate to this thesis’s assertion. In order to delineate a common ground for the following definitions, this paper assumes that an innovation can be characterized as a new meme, given that it is also described as a new idea. In the context of information diffusion in the blogosphere, it assumes the process of adoption to be the process by which a blogger, aware of the existence of a new meme (or innovation), decides to mention it on his/her own personal blog, in the form of a post or part of a post. This action can be understood as an “adoption” by the blogger of this particular unit of information, which contributes to its replication. The study of innovation adoption and diffusion has its origins in the Midwestern United States. In an Iowa State University study, Ryan and Gross (1943) showed that the pattern of adoption and diffusion of a maize hybrid was systematic, hence opening the door for further research. Diffusion is the process by which an innovation is communicated through certain channels over time among the members of a social system (Everett M. Rogers, 1995). The innovation includes "any thought, behavior, or thing that is new because it is qualitatively different from existing forms" (Jones, 1967). The characteristics of an innovation, as perceived by members of a social system, determine its rate of adoption. Just by analyzing these last statements one can easily grasp a series of similarities with the notion of Memetics, even to the point that the theory of Diffusion of Innovations also considers the unit of adoption not exclusive to an individual person, but extending to other types of retention systems.
  • 10. The four main elements in the diffusion of new ideas are: (1) The innovation (2) Communication channels (3) Time (4) The social system (context) 1.3.1 The Innovation These are the characteristics, as perceived by the people within the social system, that determine an innovation’s rate of adoption: – Relative advantage – Compatibility – Complexity – Trialability – Observability 1.3.2 Communication Channels A communication channel is the means by which messages get from one individual to another. Mass media channels are more effective in creating knowledge of innovations, whereas interpersonal channels are more effective in forming and changing attitudes toward a new idea, and thus in influencing the decision to adopt or reject a new idea. Most individuals evaluate an innovation, not on the basis of scientific research by experts, but through the subjective evaluations of near-peers who have adopted the innovation. (Everett M. Rogers) In a broad sense, the communication channel in the context of Blogviz is indubitably the Internet. Without it there wouldn’t be any kind of communication between bloggers. However, without blogrolls and posting citations within each blog, the narrower channels among bloggers would be very difficult to perceive. Blogrolls are the backbone of blog communities, the edges that keep all the nodes interconnected, and are therefore key factors in understanding how information develops across the blogosphere. In fact, a major characteristic of online social communities is that they are based on communication channels, not on physical co-location. A blogroll is a listing of websites that often appears as a set of links on a weblog, usually in a left or right frame of the page. This list of links is used to express the site owner's interest in or affiliation with other webloggers.
  • 11. 1.3.3 Time The Diffusion of Innovations theory divides the element of Time into three main dimensions, of which only two can be fully applied to the context of information diffusion in the blogosphere. > Innovation-decision – The innovation-decision process is the mental course of action in which an individual passes from first knowledge of an innovation to forming an attitude toward the innovation, to a decision to adopt or reject it, and if adopting it, to implementing this new idea and confirming the decision. In the case of a blogger deciding whether or not to post a specific meme in his/her weblog, this decision process is so fast that it’s almost impossible to measure. It applies to other memes, and definitely to other innovations, but it’s not relevant as a measurement in top links replication. > Innovativeness – Innovativeness is the degree to which an individual is relatively earlier in adopting new ideas than other members of a social system. Innovativeness, in contrast to the innovation-decision process, is an extremely significant measurement in top links replication, as in most information diffusion models. There are five adopter categories, or member classifications of a social system, based on their level of innovativeness: – Innovators – Early adopters – Early majority – Late majority – Laggards [Figure: Bell-shaped curve showing categories of individual innovativeness and percentages within each category]
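The five adopter categories can be sketched in code. In Rogers' standard formulation the categories are cut at one and two standard deviations around the mean adoption time, which under a normal distribution gives roughly 2.5% innovators, 13.5% early adopters, 34% early majority, 34% late majority and 16% laggards. The sketch below applies that rule to blogs citing a top link; the input format (a mapping from blog name to the time of its first citation) and the function names are assumptions made for illustration, not part of Blogviz.

```python
from statistics import mean, pstdev


def adopter_category(t, mu, sigma):
    """Classify one adoption time t against the mean mu and standard deviation sigma."""
    if t < mu - 2 * sigma:
        return "innovator"
    if t < mu - sigma:
        return "early adopter"
    if t < mu:
        return "early majority"
    if t < mu + sigma:
        return "late majority"
    return "laggard"


def classify_adopters(adoption_times):
    """adoption_times: {blog: first_citation_time} (hypothetical format, e.g. hours)."""
    times = list(adoption_times.values())
    mu, sigma = mean(times), pstdev(times)
    return {blog: adopter_category(t, mu, sigma) for blog, t in adoption_times.items()}


categories = classify_adopters({"blog_a": 1, "blog_b": 2, "blog_c": 3})
```

With very few adopters, or with adoption times far from bell-shaped, these cut points become arbitrary; the sketch only illustrates the classification rule.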
  • 12. Innovativeness among social systems is characterized by a bell-shaped curve where time and incidence of adoption are the two main axes. This concept, in the context of Blogviz, is further explored in the Methodology chapter of this thesis. Many search engines and community tools analyzing the blogosphere assume a direct correlation between a blog's popularity and its innovativeness. I believe this assumption is incorrect. Their reasoning is very simple: if a specific blog has a high number of inbound links and therefore a sizeable readership, it must be in the front line of finding and publishing original information. The HP Information Dynamics Lab study on the “Implicit Structure and the Dynamics of Blogspace” (Eytan Adar et al) showed exactly the opposite. The study demonstrated that popular blogs are rarely among the first to start a specific trend. Many popular blogs claim most of their “discoveries” by not citing their original sources, which are usually smaller, unfamiliar blogs. The level of popularity of each blog might be directly related to its scale of influence, but not necessarily to its level of innovativeness. So who are these unknown bloggers who bring fresh ideas to the blogspace? Who are these innovators or trendsetters? Blogviz will expose these anonymous sources, which are crucial to the dynamics of topic diffusion. > Rate of adoption – The rate of adoption describes how fast an innovation is adopted by members of a social system in a given time period. When mapping the cumulative adoption time path or temporal pattern of a diffusion process, the resulting distribution can generally be described as taking the form of an S-shaped (sigmoid) curve. Time and cumulative adoption (or infected population) are the plot's two main axes.
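The relationship between the two curves can be sketched as follows: the per-day counts of new adopters give the bell-shaped incidence curve, and their running total gives the S-shaped cumulative curve. The input (a list of integer day indices, one per adopting blog) and the function name `cumulative_adoption` are hypothetical choices for this illustration.

```python
from collections import Counter


def cumulative_adoption(adoption_days):
    """Return [(day, cumulative_adopters), ...] for every day from first to last adoption.

    adoption_days: iterable of integer day indices, one per adopting blog
    (a hypothetical input format).
    """
    days = list(adoption_days)
    if not days:
        return []
    new_per_day = Counter(days)  # incidence: new adopters on each day
    total = 0
    curve = []
    for day in range(min(days), max(days) + 1):
        total += new_per_day[day]  # days with no new adopters contribute 0
        curve.append((day, total))
    return curve


curve = cumulative_adoption([0, 1, 1, 2, 2, 2, 2, 3, 3, 4])
```

In this toy input the daily incidence rises and falls (1, 2, 4, 2, 1) while the cumulative total grows slowly, then quickly, then slowly again, which is the sigmoid shape described above.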
  • 13. 1.3.4 The Social System The fourth main element in the diffusion of new ideas is the social system, which basically creates a boundary within which the diffusion and adoption of an innovation occur. A social system is defined as a set of interrelated units that are engaged in joint problem-solving to accomplish a common goal (Everett M. Rogers). The members or units of a social system may be individuals, informal groups, organizations, and/or subsystems. With regard to the replication of top links among weblogs, the social system is undoubtedly the blogosphere, depicted as a fertile network of endless social communities. This vast communication network consists of interconnected individuals (bloggers) who are linked by shared interests and patterned flows of information. At first glance, considering the highly interconnected web of links, connections and shared interests among bloggers, it might seem easy to understand the adoption process of a particular unit of information or innovation. However, another crucial conclusion of the HP Information Dynamics Lab study mentioned before was that “for URLs appearing on at least 2 blogs, 77% of blogs do not have a direct link to another blog mentioning the URL earlier. For those URL’s present on at least 10 blogs, 70% are not attributable to direct links”. There have been several studies on how a system’s social structure, and its norms or established behavior patterns, affect the diffusion of innovations within a particular social system. But another area of research that is closely linked to Blogviz relates to opinion leadership. It can be described as the degree to which an individual is able to informally influence other individuals' attitudes or overt behavior in a desired way with relative frequency. Blogviz allows a broad understanding of opinion leadership in blogspace by tracking and exposing the most influential and innovative topic proliferators.
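The kind of measurement quoted from the HP study can be illustrated with a short sketch. For one URL, it computes the share of citing blogs that have no direct link to any blog that mentioned the URL earlier. The data structures and the function name are assumptions for illustration only, and the choice to count the very first blog as unattributed is this sketch's reading of the quoted sentence, not a description of the study's actual procedure.

```python
def unattributed_fraction(mentions, links):
    """Fraction of citing blogs with no direct link to an earlier citing blog.

    mentions: {blog: time_of_first_mention} for a single URL (hypothetical format).
    links: {blog: set_of_blogs_it_links_to}, e.g. from blogrolls or post citations.
    """
    if not mentions:
        return 0.0
    unattributed = 0
    for blog, t in mentions.items():
        # Blogs that mentioned the URL strictly before this one.
        earlier = {other for other, t_other in mentions.items() if t_other < t}
        # Attributable only if this blog links directly to one of them.
        if not (links.get(blog, set()) & earlier):
            unattributed += 1
    return unattributed / len(mentions)


mentions = {"blog_a": 1, "blog_b": 2, "blog_c": 3, "blog_d": 4}
links = {"blog_b": {"blog_a"}, "blog_c": {"blog_z"}, "blog_d": {"blog_c", "blog_a"}}
fraction = unattributed_fraction(mentions, links)
```

Here `blog_a` (the originator) and `blog_c` (which links only to a blog that never cited the URL) are unattributed, so half of the citing blogs cannot be explained by direct links.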
  • 14. 1.4 Epidemiology Throughout this thesis I repeatedly use the terms contamination and infection when describing the adoption process of memes. Even though this practice might lead to unwanted interpretations, its use is not arbitrary, and it actually facilitates the comprehension of information diffusion dynamics. Epidemiology in its broadest sense is the study of disease patterns in human populations (Wikipedia). Epidemiology can also be described as the study of the determinants, occurrence, and distribution of health and disease in a defined population. Infection is the replication of organisms in host tissue, which may cause disease. A carrier is an individual with no overt disease who harbors infectious organisms. And the notion of dissemination is understood as the spread of the organism in the environment. In the above description, regardless of the different terms, we start noticing several similarities with the domain of diffusion of innovations. This analogy is even more explicit when characterizing the three major elements in disease occurrence, the so-called chain of infection: (1) The etiologic agent (parallel to the innovation) (2) The method of transmission (parallel to the communication channel) (3) The host (parallel to a unit of a social system) Further along in characterizing the evolution of a disease, the epidemiologic descriptive study organizes data by time, place and person. It is unquestionably the closest approach to the concept of Information Diffusion. It divides the element of Time into four main trends: secular trends, periodic trends, seasonal trends and epidemics. What’s interesting in this typology of Time is that it applies equally well to the evolution of top links across the blogosphere. Because of that I assume a series of parallels between them. The secular trend describes the occurrence of disease over a prolonged period. This continual development is less common than the seasonal trend in the context of blogspace. This trend usually describes commercial or very popular websites that never entirely lose the bloggers’ interest and as a result have a continuous existence among them.
  • 15. The periodic trend basically expresses a temporary modification in the overall secular trend. It conveys a sudden new interest in a specific meme that is part of a continual trend. The seasonal trend reflects seasonal changes in disease occurrence following changes in environmental conditions that enhance the ability of the agent to replicate or be transmitted. This short transitory trend is the most common in blogspace: a new meme spreads quickly and rapidly loses interest, dying in a short period of time. The epidemic incidence of a disease generally happens when it surpasses a threshold of 7% of the target population. An epidemic is a sudden boost in occurrence due to prevalent factors that support transmission. An information epidemic in blogspace might give rise to a tipping point, where a specific meme escalates and leaps the blogspace, reaching the mainstream.
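As a rough worked example of the epidemic threshold, the sketch below checks whether the number of blogs citing a meme exceeds a given share of the target population. It takes the 7% figure from the text above at face value, and the function name `is_epidemic` and the population size used in the example are hypothetical.

```python
def is_epidemic(adopters, population, threshold=0.07):
    """Return True if adopters exceed the threshold share of the target population.

    threshold defaults to the 7% figure quoted in the text; adopters and
    population are counts of blogs (a hypothetical framing for illustration).
    """
    if population <= 0:
        raise ValueError("population must be positive")
    return adopters / population > threshold


# With an assumed population of 1,000 blogs, the threshold is crossed above 70 adopters.
small_spread = is_epidemic(12, 1000)
large_spread = is_epidemic(150, 1000)
```

How the target population is delimited (the whole blogosphere or a single community cluster) changes the result entirely, so the sketch illustrates only the arithmetic of the threshold.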
  • 16. 2 Impetus The main source of motivation for my thesis development is a solid cooperation between Information Diffusion, Information Architecture, Data Visualization, and the Science of Complex Networks. My curiosity about Information Architecture was initially fostered in Christopher Kirwan’s MFADT class in the Spring of 2004, and since then it has become a major subject of interest and awareness. I remember observing for the first time a diagram with four interconnected circles representing the continuous Understanding Spectrum. Data originates information, which leads to knowledge and ultimately to wisdom. This concept influenced my vision and made me reflect on the responsibility I had, as a designer, to contribute to this spectrum. [Figure: The Understanding Spectrum, Nathan Shedroff] We may have access to an abundance of information but I strongly believe we lack the ability to process it effectively. In the face of contemporary technological accomplishments, our ability to generate and acquire data has by far outpaced our ability to make sense of it. Neither raw data nor scattered information offers any level of meaningful understanding. This is where Information Architecture and Information Visualization undertake an important mission. If we are truly entering a fourth phase of humankind, a theory defended by a large number of anthropologists and sociologists, then Information
  • 17. Architecture is going to be a golden key in the process. In a world increasingly driven by information, information rapidly assumes the form of power, and divides society into those who own it and those who don’t. Meaningful information is not a given, and particularly now, when our cultural artifacts are being measured in gigabytes and terabytes, organizing, sorting and displaying information in an efficient way is crucial for intelligence, knowledge and wisdom. In the Spring 2004 semester I was involved in two projects that were decisive in delineating my thesis domain of interest and increasing my alertness towards Information Architecture and Information Visualization. The first one was a group project developed in the Information Architecture class taught by Christopher Kirwan. Self-Replicating Cloners was a project aimed at producing visualizations of viruses, their progression through time and their world-scale dissemination. Two viruses were analyzed by comparison, SARS and MyDoom, each one representing its underlying field, human biology and computer technology. [Figure: Self-Replicating Cloners. Visualizations of viruses (biological/computer generated), their progression through time and world-scale dissemination]
  • 18. The second point of awareness was a group project developed in a collaboration studio with Siemens Corporate Research Center. Aimed at Siemens Medical, DSS – Disease Surveillance System was a visualization and communication tool that shared symptomatological data between hospitals and health care professionals for detecting possible disease outbreaks and recognizing development patterns nationwide. [Figure: DSS – Disease Surveillance System] After these two particular experiences, I started my summer research with some clear interests in mind, but still scattered through distinct areas such as artificial life, virology, cognitive science, genetics, cyber biology, epidemiology, and pattern recognition. Emergence, by Steven Johnson, was the first book I read in my research and it was a surprising start. The paradigm of Emergence, which can be described as a “higher-level pattern arising out of parallel complex interactions between local agents”, was slowly filling my mind with bright new discoveries. With increased motivation, I started gradually abandoning some initial ideas and, in other cases, finding common links between them, under the sciences of complexity and self-organization. The search for answers on how order can emerge from disorder, and organization from chaos, guided me to initiate a study of the individual parameters of emergent systems, such as collective/macro behavior, self-organizing communities and bottom-up hierarchy. This research led me inevitably to complex systems. Delving into this new area was even more thrilling. Finding, each day, a common structure in apparently distinct fields, or similarities between natural systems and human designs, was beyond doubt overwhelming. From that point on, I became extremely fascinated with the omnipresent
  • 19. web of signals and interactions, nodes and links that shape modern complex networks, from social networks to corporations, cities, living organisms and the Internet. Complexity is a challenge by itself. Complex networks are everywhere. The network is a structural and organizational principle that reaches almost every field we can think of, from genes to power systems, from food webs to market shares. Paraphrasing Albert Barabasi, one of the leading researchers in this area, “the mystery of life begins with the intricate web of interactions, integrating the millions of molecules within each organism”. Humans, from birth, experience the effect of networks every day, from large complex systems like transportation routes and communication networks, to less conscious interactions, common in social networks. The scale-free network, the most common topology in both natural and human systems, is, curiously enough, a very recent breakthrough. Since its discovery, 6 years ago, dozens of researchers worldwide have been disentangling the networks around us at an amazing rate. This awareness is helping us understand not only the world around us but also the most intricate web of interactions that shape the human body. The global effort of constructing a general theory of complexity is tremendous and may lead us not only to a structural understanding of networks, but to major improvements in the stability, robustness and security of most complex systems around the globe. As Barabasi writes in Linked, “Once we stumble across the right vision of complexity, it will take little to bring it to fruition. When that will happen is one of the mysteries that keeps many of us going”. The feature that has always fascinated me the most in complex networks is the dynamics of dissemination patterns. The visualization of the path, and inherent duration, of a certain fad, idea, or virus in a social/biological or computer network has been, since the beginning, a critical point of awareness. How does a particular contagion travel from point A to B, which nodes does it affect in its course, and how fast does it contaminate a large cluster or the entire network?
  • 20. 2.1 Subject of Analysis After my summer research presentation, in the beginning of the Fall 2004 semester, where I showed all the collected knowledge in the domain of complex networks, I went even further on observing and collecting dozens of network visualization examples and trying several open-source applications. This investigation resulted on my second official presentation. Part of this research also coincided with the work I was developing as a design researcher at Parsons Institute of Information Mapping (PIIM). For additional information on this study please consult section 6.2 of chapter 6 – Methodology. After the second official presentation I was sure of two things: 1 – I wanted to continue my visual explorations exercise, by gathering problems and inconsistencies in complex network diagrams and proposing plausible solutions. 2 – I wanted to map a dissemination pattern in a specific network. By doing that, I intended, not only to be innovative and bring something new to the field, but also display a ‘showcase’ of my visual thinking in terms of complex networks visualization. The first objective was well defined, and best of all, already under development. The major problem was finding a solution for the second point. I had to hit upon a subject that represented all the research and knowledge I had gathered through the summer and the beginning of the Fall 2004 semester. Finding an answer to this quest seemed an impossible task, due to the vagueness of possible directions. At a certain point it was as if I had came back to the start, with the fearful blankness of June assaulting my mind once again. Time was urging and I knew whatever subject I chose, I was still facing an enormous workload ahead of me. The first thing I decided was to go back to my initial interest, the main cause that led me in this escalating exploration of complex networks. 
I quickly rediscovered my early motivations: virus dissemination and the relationships between social/biological and computer/technological systems. One thing I discovered during my summer research is that ideas, fads, trends and innovations show dissemination patterns in social networks similar to those of viruses. Word-of-mouth is a fascinating diffusion behavior that has always intrigued psychologists, sociologists, anthropologists, and lately marketers. Being able to map a word-of-mouth epidemic in a specific social network seems a blue-sky scenario. And that might be true for physical interactions in a physical world between physical
  • 21. individuals. However, a flourishing movement on the Internet presents an interesting experimental laboratory to explore this behavior. Blogging embodies an incredible case of word-of-mouth, where news, ideas and fads travel through community clusters with high adoption rates. Because of their inherent nature, blogs became my ultimate fixation and the main framework for my Thesis. Their high interconnectivity and shared flow of information represent not only an obvious case study of meme propagation, but also an outstanding example of a dissemination pattern in an increasingly complex network, estimated at over 8 million nodes. As an example, I’ll mention a topic that emerged from the blog community at the beginning of October 2004. During the first presidential debate of the 2004 US elections, held on September 30, 2004, between President George W. Bush and Senator John Kerry, there was an episode that caught the attention of a particular viewer. “You forgot Poland” was the abrupt statement made by George W. Bush while John Kerry was enumerating the allied forces present in the Iraq War. The presidential debate took place on the evening of September 30, and by the following Monday night there was already a topic sharing 12 links among bloggers. This topic pointed to a specific URL – By that time, less than 72 hours after the debate, someone had already created a domain ( and was selling online t-shirts and stickers with the same sentence. A new meme had been born and in a short period of time had “infected” several people. This intriguing example reveals the accelerating rate of information flow among bloggers and how fast it spreads through, or “contaminates”, online blog communities. Another point this example raises is the possibility of tracking an outbreak. Imagine this topic reaching the mainstream a week later, possibly in a major newspaper or on a particular TV show.
How interesting would it be to actually go back in time and discover where this outbreak first originated, the way it was adopted and how fast it grew? These last two queries have undoubtedly become a crucial motivation for the development of my thesis. Quoting Duncan Watts, in regard to the mechanics of social networks: “To understand the pattern, we need to delve further into the rules by which individuals make decisions, and how, in the process, our apparently independent choices become inextricably bound together.”
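The questions raised above (where did a topic first appear, and how fast was it adopted?) can be illustrated with a minimal sketch. It assumes that each citation of a URL is available as a (blog, date) pair; the blog names, the dates and both function names are hypothetical and are not taken from the Blogviz data or implementation.

```python
from collections import Counter
from datetime import date

# Hypothetical citations of one URL: (blog name, day the link was posted).
citations = [
    ("blog_c", date(2004, 10, 2)),
    ("blog_a", date(2004, 10, 1)),
    ("blog_b", date(2004, 10, 2)),
    ("blog_d", date(2004, 10, 4)),
]

def first_citation(citations):
    """Return the earliest (blog, day) pair, i.e. the apparent origin."""
    return min(citations, key=lambda c: c[1])

def cumulative_adoption(citations):
    """Return [(day, total blogs citing the URL up to that day), ...]."""
    per_day = Counter(day for _, day in citations)
    running = 0
    curve = []
    for day in sorted(per_day):
        running += per_day[day]
        curve.append((day, running))
    return curve

origin = first_citation(citations)
curve = cumulative_adoption(citations)
print("origin:", origin)
for day, total in curve:
    print(day.isoformat(), total)
```

The earliest citation is only the apparent origin: a blog may have picked up the link from a source that was never crawled.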
  • 22. 3 Context The contextual narrowing of my thesis proposal starts in the broad area of Complex Networks, tightens its limits to Social Networks and ends at its ultimate contextual boundary, Online Social Communities. Even though this Thesis proposition places itself at the center of a broad group of domains, I decided to explore in depth its closest and most direct domain – Online Social Communities, and the main subject of analysis – Blogs. Nevertheless, besides the omnipresent field of complex networks, the context of this thesis incorporates the domains of Information Diffusion, Memetics, Information Architecture, Data Visualization, Information Theory, Diffusion of Innovations, Epidemiology and Small Worlds. 3.1 Online Social Communities Online Social Communities, although much narrower than the Science of Complex Networks, is still a wide-ranging field that can include almost every type of online interpersonal communication medium, from e-mail lists and threads to Usenet groups, MUDs, chat environments, instant messaging, community forums, weblogs, online gaming, and interest groups, among others. Online Communities offer an interesting change in the parameters that until now have defined social interaction. Several years after Milgram’s famous small-world test, Russell Bernard and Peter Killworth did what they called a “reverse small-world experiment”. They interviewed hundreds of individuals, explaining Milgram’s experiment and asking them what personal criteria they would use to get a specific package to someone they didn’t know. Bernard and Killworth’s study found that most of the subjects used only a couple of dimensions to get their message sent to the next recipient. The most predominant dimensions were geography and occupation. Jon Kleinberg, a computer scientist who attended Cornell and MIT, was also motivated by Milgram’s small-world study, and asked how individuals actually found the paths within the network.
Kleinberg concluded that people generally have a strong sense of distance, which they use to distinguish themselves from others. A notion of
  • 23. distance can have several factors, of which geographical distance is just one. Profession, race, religion, income, class and education are other elements added to the equation that describe how distant a specific person is from us. From the beginning of human existence, communities were created for the benefit of their own members. Usually driven by expediency, either in relation to the exchange of goods or improved security against enemies, these groups of people arose as emergent systems of social convenience. Geography always played an essential role, and without a common shared space most of these communities wouldn’t even exist. With the later development of mail and, more recently, telephone, telex, and fax, human communication became highly enhanced and geography started to lose its major influence. However, these new “technologies” only improved the way people communicated with each other, by giving them more tools and decreasing the time span and subsequently the distance; other than that, there were no major changes in the way social communities were formed. No matter how fast and easy it became for someone in Europe to talk with someone in America or China, there were never communities created on the basis of telephone calls. If we explore the word structure of most communication tools prior to the Internet, such as telegraph, telex, telegram, and telephone, we encounter the constant presence of the prefix tele-. Tele is a Greek word that means “at a distance”, usually implying “to be distant” or “over a distance”. The first use of the prefix tele- was in the word telescope, which was adapted from Galileo’s Italian word telescopio, followed by the word telegraph, meaning “writing at a distance”. Therefore, Telecommunications is the field that embodies all the systems that intend to communicate “at a distance” or “over a distance”.
Once again we see the importance of geography as a crucial domain for human communication, where the advancement of technology has, since the beginning, been trying to diminish its constraints by allowing people to communicate over an ever-present and disturbing distance. I find this analysis particularly interesting because the Internet, and all features associated with it, has completely abandoned the prefix tele-, drastically assuming the medium, and replaced it with the prefix e-. From e-mail to e-commerce and e-business, the prefix e- is usually associated with the latest wave of technological revolution, an abbreviation of the word electronic and an obvious association with the word cyber.
  • 24. The advent of the Internet and the World Wide Web changed these centuries-old communal constraints, possibly forever. The Internet became not just a medium for social gathering and communication; it absorbed them, and the medium truly became the message. The transmission of information on the Internet is regularly measured in milliseconds, and the time it usually takes for a message to leave a computer in Tokyo and arrive at a computer in New York is more or less the same as that of a message sent to you from your next-door neighbor. The difference is merely a few milliseconds, which is by itself a measurement difficult to perceive. Geography, as a crucial criterion for the birth of social communities, has been utterly disregarded by online social communities. Without the limitations of geography and physical interaction and identification, online communities had to rely on a more abstract, but equally distinguishing criterion: interests. By analyzing most current online communities, from online players to chat rooms, blogs and newsgroups, we find that in the absence of physical recognition, social values like trust, confidence, respect and even friendship are ultimately based on a set of shared interests. And of course, this “virtual” interaction would not be possible without specific communication channels, portrayed as technological sub-systems of the larger medium, the Internet. Personal interests are a central element of our social identity, and subsequently, a highly considered factor in relationships. Paraphrasing Duncan Watts in regard to peer-to-peer networks, “social identity is what leads networks to be searchable”. The fabulous aspect of online communities is the possibility of not only searching these clusters of shared interests, but also tracking the exchange of conversations, ideas and messages between them. By analyzing this data, it’s possible to understand, to some extent, how information travels through these virtual environments.
Weblogs, in this context, represent units of a remarkable social laboratory. Not only is it relatively easy to track their connectivity, but, due to their highly clustered nature, it’s also possible to examine how news and trends travel through individual bloggers in specific communities.
  • 25. 3.2 Weblogs Weblogs (alternate: blogs) are not just a new fad among Internet users, and they are much more than a collection of online digital diaries of scattered interest groups. Blogs represent a change in online information flow and they are becoming a rising news source for many people. We might not even be aware of how influential blogs will be in the future, but one thing is sure: there are currently blogs with close to half a million visitors a day, more than many large newspapers, magazines and news broadcasters. Jorn Barger coined the term in 1997 and in 1999 Peter Merholz coined its alternative abbreviation “blog”. As Jorn Barger stated: “Weblogs are often-updated sites that point to articles elsewhere on the web, often with comments, and to on-site articles. A weblog is kind of a continual tour, with a human guide [whom] you get to know. There are many guides to choose from and each develops an audience. There's camaraderie and politics between the people who run weblogs. They point to each other in all kinds of structures, graphs, loops, etc.” The most common definition of a blog is that of an online diary of thoughts, links, events, or actions posted on a web page in a dated log format. These posts are often, but not necessarily, in reverse chronological order, and are updated on a daily or very frequent basis with new information about a particular subject or range of subjects. Despite this dry classification, the usefulness of a weblog is incredibly rich. Blogs are the vital elements of the personal publishing revolution. If we go back a few years, before the rise of online publishing, the only way someone could write something for the general public would be through a letter to the editor, hoping for the message to be published in the magazine’s next issue.
For the first time in the history of human communication, any single person has the opportunity to reach millions with their message, as the cliché proclaims, with “the touch of a button”. Instead of being passive consumers of information, Internet users are becoming active participants. Whether this power to the people is a positive trend is debatable, since many people consider that it adds to the existing “junk” flowing on the Web. Since most blogs aren’t subject to any kind of editorial process or peer review and sometimes “play” with anonymity, their public posts also raise legal concerns about intellectual property, defamation, and the like.
  • 26. Controversies apart, blogs, like the World Wide Web, are free democratic resources that embody the concept of free speech, which is unquestionably a right for all. Blogs also exemplify the true concept of diversity. Besides being oblivious to who might use this personal tool, blog content is as varied as the Web itself. The authors of Essential Blogging explain this diversity by pointing out that “creating a taxonomy of the blogiverse is a fruitless task”, since “there’s no good, central directory of blogs that puts each one in its own pigeonhole, because even the most topical blogger will stray from the subject from time to time to celebrate some personal victory or warn his readers off a terrible movie”. One might also argue that this personal publishing revolution in fact started with the first website, and consequently with the birth of the Web. This is obviously true; however, until the first blog publishing tools became available, anyone who wanted to circulate their own ideas online had to be fluent in HTML and web hosting, and aware of most of the web design applications available. Even after the launch of GeoCities in 1996, offering free web hosting to non-commercial personal pages, web pioneers had to be HTML-savvy people who would spend their evenings working on their websites. Also, the few personal webpages that started populating the Web in the mid-90s were just a scattered collection of isolated opinions, with no regular updates and unconnected from each other. The big blog phenomenon started escalating in the summer of 1999, when a small web company called Pyra Labs released a product called Blogger. From that point on the blog community exploded, and the more bloggers came onto the scene, the more online blog tools became available. This was the beginning of the personal publishing revolution. The inclination towards personalization is reaching every industry, from clothing to cars, from software to medicine.
News and Information are just new elements added to the equation. In my opinion, many blogs are so successful because of two major factors: personalization and comforting lassitude. Blogs are usually maintained by a single person who filters the huge amount of available information according to his or her own preferences. For people who share common interests with the blogger, it’s not only exciting to get information from that source, since it’s going to match their inclination to some degree, but it also saves them a lot of time by letting them avoid the large, more abstract, and sometimes incongruent news sources. In countries such as the US, where large media sources are becoming increasingly dry and biased, blogs might also represent an oasis of independent information.
  • 27. 3.3 Blogosphere Blogosphere (alternate: blogsphere), or blogspace, is the collective term encompassing all weblogs (alternate: blogs). It’s almost impossible to determine with precision the existing number of weblogs, or even the ones currently active. Technorati is a leading search engine for the blogosphere, similar to Google or Yahoo, but exclusive to blogs. Technorati, as of February 2005, was tracking 7,245,866 blogs, and this number is far from stagnating. Out of curiosity, when reviewing this paper on April 6, 2005, I checked Technorati to see how the latest number had changed. To my not-so-surprised amazement, Technorati declared that it was tracking 8,469,023 weblogs. This translates into an increase of more than 1 million blogs in less than two months. The latest Pew Internet study estimates that about 27%, or about 32 million, of American Internet users are regular blog readers. They say a new weblog is created every 2.2 seconds, which means there are about 38,000 new weblogs a day. Bloggers update their blogs regularly; there are about 500,000 posts daily, or about 5.8 posts per second. When we’re faced with a number of blogs higher than eight million (at least), it becomes hard to consider the whole as a single community. The blogosphere, in analogy to its medium, the Internet, does not represent a single community but a vast collection of endless communities. These communities shape a complex web of more than 8 million nodes and are key factors in the outburst and further development of trends, fads and innovations. Also, due to its inherent diversity, any kind of classification regarding the blogosphere is a mere exercise in oversimplification.
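The figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below only restates the numbers given in the text; the variable and function names are illustrative. Note that one weblog every 2.2 seconds works out to roughly 39,000 a day, slightly above the quoted 38,000, a gap that presumably comes from rounding in the source figures.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def events_per_day(seconds_between_events):
    """Convert 'one event every N seconds' into events per day."""
    return SECONDS_PER_DAY / seconds_between_events

def events_per_second(events_in_a_day):
    """Convert a daily count into an average per-second rate."""
    return events_in_a_day / SECONDS_PER_DAY

# Technorati counts quoted in the text.
blogs_february_2005 = 7_245_866
blogs_april_2005 = 8_469_023
growth = blogs_april_2005 - blogs_february_2005

new_blogs_per_day = events_per_day(2.2)
posts_per_second = events_per_second(500_000)

print(f"growth between the two counts: {growth:,} blogs")
print(f"new weblogs per day at one every 2.2 s: {new_blogs_per_day:,.0f}")
print(f"posts per second at 500,000 per day: {posts_per_second:.1f}")
```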
  • 28. 4 Audience Scientists/Researchers on Complex Networks Hopefully, Blogviz will offer a significant step in this long scientific journey towards the understanding of the dynamics of complex networks. To all researchers, academics, and scientists who have been persistently and bravely disentangling the networks around us, I truly hope this model can leave one important footprint in this expedition. It doesn’t have to be gigantic, just one step forward. By bringing my visual expertise and interest in Information Architecture, Data Visualization and Interface Design, I expect to make a small corner of the vast Science of Complex Networks clearer and more understandable. This corner embodies the domain of Online Social Communities and the phenomenon of blogging. Sociologists Professionals, Researchers, Faculty and Students. Blogviz will offer an interesting case study for analyzing a dynamic, ever-changing and complex online social network – the Blogosphere. Mapping the spread of word-of-mouth in social communication has been, until now, an almost fruitless task. Blogs, on the other hand, offer an engaging experimental laboratory to better study and understand this occurrence. Memetics is an expanding field of study in the social sciences, which is being explored by a significant number of researchers. Blogviz, by drawing a parallel between meme propagation and topic diffusion in blogspace, makes an important contribution to the understanding of Memetics. Information Architects and Data Visualization enthusiasts Professionals, Researchers, Faculty and Students. I hope that my passion and fascination for the field of Information Architecture and Data Visualization are reflected in my thesis project. I truly hope that Blogviz can be a relevant precedent in some of your projects, deserve a mention in your research, or inspire or influence you at some level.
  • 29. Cultural Critics Blogging presents one of the most intriguing and captivating phenomena of our time. We might be in for a long ride in the transformation of most publishing media conglomerates. We cannot really predict the ultimate result of this major shift in the flow of online information, but one thing is sure: it has already started. Blogviz will offer an enhanced insight into the mechanics of this contemporary revolution. Marketers Possibly the only open door to an eventual commercial viability for the application is its relevance for the Marketing industry. Even if Blogviz is a non-commercial research project, it is reassuring to know that it’s potentially useful outside the research and academic realms. Like sociologists, marketers have become more and more interested in word-of-mouth behavior, even though the more traditional marketing strategists have barely explored this concept. In the blog community, most bloggers are incorporating the idea of syndication in their blogs, in the form of an XML data file called RSS, which is basically a list of post summaries and links to them. These files can then be interpreted by a desktop application called an RSS aggregator, and read by the user without the need to access the specific website. Some consider RSS to be the future of news distribution, and that might well be the case, which explains why, as in any communication medium, advertising is now starting to infiltrate RSS feeds. The potential use of Blogviz in this context is huge. Marketers interested in investing in the best RSS blog sources for advertising could easily track the most-viewed blogs, locate the innovators, the followers, and the major dispatchers of information, and then act on the conclusions accordingly. Bloggers Blogviz is a visualization model built to better understand the information dynamics within the blog community.
Accordingly, any interested blogger who feels the need to comprehend the underlying network that they are part of is a potential user of my research project.
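The description of RSS given in the Marketers paragraph (an XML file listing post summaries and links that an aggregator can read) can be illustrated with a minimal sketch using Python's standard library. The feed below is invented for illustration and follows the usual RSS 2.0 layout of `channel` and `item` elements; the function name `read_items` is hypothetical.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed with two posts (illustrative only).
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <description>An invented feed</description>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>Summary of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <description>Summary of the second post.</description>
    </item>
  </channel>
</rss>"""

def read_items(feed_xml):
    """Return a list of (title, link, summary) tuples, one per <item>."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append((
            item.findtext("title", default=""),
            item.findtext("link", default=""),
            item.findtext("description", default=""),
        ))
    return items

posts = read_items(SAMPLE_FEED)
for title, link, summary in posts:
    print(f"{title} <{link}>: {summary}")
```

An aggregator does essentially this for many feeds at once, which is why the links inside feeds are a convenient raw material for tracking which blogs cite which URLs.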
  • 30. 5 Precedents The chain of influences and inspiration for my thesis project is, as expected, extremely widespread and ranges from new media art, information architecture, data visualization, complex networks, and interface design to many other fields, and life in general. Even if I started enumerating the key thinkers whose work I admire and respect, and subsequently absorbed for myself, I expect many names would still be missing from the extensive list of people. In enumerating the key precedents for my thesis, I concentrated exclusively on projects developed in the area of Online Social Communities, my closest encircling thesis domain. Since the major goal of my thesis is to visually map a specific diffusion pattern and the connectivity among blog communities, I decided to establish as precedents projects that make extensive use of a visual structure to portray their field of research. 5.1 Blog Epidemic Analyzer Authors: Eytan Adar, Li Zhang, Lada Adamic, Rajan Lukose Institution: HP Information Dynamics Lab URL: Description: HP Information Dynamics Lab created the Blog Epidemic Analyzer as part of their research on information propagation. They released their paper “Implicit Structure and the Dynamics of Blogspace” as a result of this research. Eytan Adar, Li Zhang, Lada Adamic, and Rajan Lukose used the search engine BlogPulse to map the behavior of the blog community from May 11 to May 21, 2003. Relevance: This project is the closest to my thesis ambition and it obtained exciting results that became pertinent in selecting specific parameters for my work. Although the project is highly useful as research, its few attempts at visualization were extremely poor. Its major finding was that the most popular blogs are not the most innovative, since they commonly “steal” news and information from smaller, less-known blog sources.
I believe this is a very significant claim that decisively influences the way we understand the mechanics of blog communities.
  • 31. 5.2 Loom2 Authors: Danah Boyd, Hyun-Yeul Lee, Ethan Perry Institution: Sociable Media Group - MIT Media Lab URL: Author’s Description: “The goal of our research is to use the salient features of social interaction to build a ‘legible’ interactive visual representation of Usenet. We started by exploring the Usenet environment, constructing a series of relevant questions. From the questions, we have started to explore how this information can be derived from the textual data available online. Simultaneously, we have started designing segments of visualization, under the assumption that the desired characteristics were ascertainable.” Relevance: This project is a major aesthetic inspiration. I believe its use of a radial structure, where specific degrees relate to a time dimension and node colors to specific theme categories, fits the purpose of the project quite well. Usenet represents a subject of analysis closely related to blogging, since message/post threads in newsgroups have a pattern of contamination similar to that of topics across the blogosphere. Given their appealing visual models, the amount of work they had to undertake is not surprising: “To build our designs, we drew on a wide variety of theoretical and practical concepts from a range of fields, including graphic and interactive design, architecture, sociology, and computer animation.”
  • 32. 5.3 Social Network Fragments Authors: Danah Boyd, Jeff Potter Institution: Sociable Media Group - MIT Media Lab URL: Description: “Social Network Fragments was developed as a self-awareness tool for individuals to explore the social networks that they create without structural consideration”. Its goal was to “help users examine their structure so as to unveil the structural holes that are built in such complex networks. These structural holes exist when users choose to fragment portions of their network, often revealing facets of their own identity. As an individual interacts with a diverse range of people, they are motivated to reveal different aspects of their identity, thereby creating a multi-faceted social identity, whereby different people know different things about the individual. In engaging in this behavior, individuals start to segment their social network into a variety of different clusters, or types of people.” Relevance: The visualization of social networks takes a major leap in many of the projects produced by the Sociable Media Group (SMG) at MIT Media Lab. With some amazing visual displays the SMG “investigates issues concerning society and identity in the networked world”, addressing questions such as “How do we perceive other people on-line? What does a virtual crowd look like? How do social conventions develop in the networked world?”. Social Network Fragments aims at something as extraordinary as mapping someone’s unnoticed social network. Although it may seem simple and intuitive to track any individual’s connections to others, this project tries to reach further than the immediate first-degree acquaintances, by reaching a friend-of-a-friend network.
  • 33. This approach to small world theory has been pursued by some companies, which sell products focusing on social networking management. The idea is simple: don’t just get to the people you know, get to the people they know. Manage your friend-of-a-friend network in order to find the shortest path to whatever you’re looking for. Among the leading companies incorporating this concept are: Spoke Software, Visible Path, SRD and In-Q-Tel. Social Network Fragments offers a reasonable visual solution, although I believe some improvements could be implemented. By basing its visual criteria solely on text, color and depth (a simulated 3rd dimension), the interface becomes somewhat limited in fully exploring its content. 5.4 PostHistory Author: Fernanda Viégas Institution: Sociable Media Group - MIT Media Lab URL: Author’s Description: “Most of us deal with email on an everyday basis and some of us have been doing so for several years. Nevertheless, it is hard to perceive the accumulation of this frantic activity, it is hard to get a sense of the number of messages sent and received, not to mention how difficult it is keeping track of how many people have written to you or received messages from you. The aim is to provide users with a novel and hopefully richer experience of their email activities. PostHistory represents an opportunity for reflection and insightful monitoring of fundamental patterns of interactivity. The visualization aims at impressing on the user a sense of daily accumulation, of growth and scale – dimensions not normally conveyed on current email applications.”
  • 34. Relevance: Fernanda Viégas, a Brazilian graduate student at MIT Media Lab, is a prolific new media designer who has been involved in many relevant projects. PostHistory is one of her best. What I find most interesting in this project is the series of new structures and features she proposes in order to better understand the pattern created by e-mail activity. This project is visually innovative and it is quite an impressive contribution to the field of Information Visualization. Another project conceptually related to PostHistory is Thread Arcs, a fresh interactive visualization technique designed to help people use threads found in email. Thread Arcs, which resulted in a published paper, is a truly interesting visual approach to e-mail threads and even to small graphs. This concept is part of a major e-mail application, ReMail, developed by the Collaborative User Experience team at IBM Research. ReMail has been in development for almost a decade and it aims at improving the knowledge of how people use e-mail, and also at making that experience more functional and straightforward. Some of its features are very encouraging. [Figures: Thread Arcs; ReMail (IBM Research)]
  • 35. 5.5 Social Circles Author: Marcos Weskamp URL: Author’s Description: “Social Circles intends to partially reveal the social networks that emerge in mailing lists. The idea was to visualize in near real-time the social hierarchies and the main subjects they address. When subscribing to a mailing you never know who the principals are, how many people are listening or what subjects they are talking about. It's like entering a meeting room with plenty of people in the darkness and then having to learn who is who by just listening to their voices. Social Circles does not pretend to be a statistical application, but rather aims to raise the lights in that room just enough to let you enhance your perception of what’s happening.” Relevance: Marcos Weskamp is a key thinker in digital information design and a major personal influence. Newsmap, Weskamp’s most famous project, and one of the best online examples of data visualization, gathers Google News headlines and displays them in an innovative treemap in several languages ( In Social Circles, even though Marcos Weskamp doesn’t push the project far from the most common network visualization schemas, the concept is very strong, particularly in a recent version of it, where users can map their own inbox of e-mail messages.
  • 36. 5.6 WebFan Author: Rebecca Xiong Institution: Sociable Media Group - MIT Media Lab URL: Author’s Description: “WebFan visualizes user activities at WebBoards, or Web-based message boards, which contain messages posted by users. It uses the reply structure of the messages to lay them out using a fan-like hierarchical structure. This abstract structure allows a large set of Web pages with multiple levels to be represented at the same time for overview and comparison. Users can also interactively explore the fan structure to find out more about individual pages. Dynamic user activity is overlaid on top of this display.” Relevance: “Currently, Web users have little knowledge about the activities of fellow users. They cannot see the flow of on-line crowds or identify centers of on-line activity.” WebFan seeks to enrich this experience by visualizing the activity of other people in the message boards. I believe this is a very relevant project, particularly for the unconventional medium, WebBoards, that Rebecca Xiong chose to map. WebFan relates to my thesis project by visualizing overall patterns of usage and answering questions such as: What are people looking at? What is hot? Where do clusters of similar interests form?
  • 37. 5.7 Visual Who Author: Judith S. Donath Institution: Sociable Media Group - MIT Media Lab URL: Author’s Description: “The population of a real-world community creates many visual patterns. Some are patterns of activity: the ebb and flow of rush hour traffic or the swift appearance of umbrellas at the onset of a rain-shower. Others are patterns of affiliation, such as the sea of business suits streaming from a commuter train, or the bright t-shirts and sunglasses of tourists circling a historic site. Visual Who makes these patterns visible. It creates an interactive visualization of the members’ affiliations and animates their arrivals and departures. The visualization uses a spring model. The user chooses groups (for example, subscribers to a mailing-list) to place on the screen as anchor points. The names of the community members are pulled to each anchor by a spring, the strength of which is determined by the individual’s degree of affiliation with the group represented by the anchor”. Relevance: Visual Who, besides offering a motivating contextual precedent in relation to social networks, portrays an appealing method of mapping social connectivity among a set of individuals. It offers an interesting approach to pattern recognition and visualization, although I think it suffers from the same inconsistencies pointed out in the Social Network Fragments project.
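The spring model described in the quotation can be sketched in a few lines. The description does not specify the physics used in Visual Who, so this is only an illustration under a stated assumption: each spring is linear with zero rest length, in which case a member's resting position is the affiliation-weighted average of the anchor positions. The group names, coordinates and function names are hypothetical.

```python
def equilibrium_position(anchors, strengths):
    """Resting point of a name pulled by one spring per anchor.

    anchors:   {group: (x, y)} screen position of each anchor
    strengths: {group: k}      spring strength (degree of affiliation)
    Assumes linear, zero-rest-length springs, so the forces balance at
    the strength-weighted average of the anchor positions.
    """
    total = sum(strengths.values())
    if total == 0:
        raise ValueError("member has no affiliation with any anchor")
    x = sum(k * anchors[g][0] for g, k in strengths.items()) / total
    y = sum(k * anchors[g][1] for g, k in strengths.items()) / total
    return (x, y)

def net_force(position, anchors, strengths):
    """Sum of the spring forces acting on a name at the given position."""
    fx = sum(k * (anchors[g][0] - position[0]) for g, k in strengths.items())
    fy = sum(k * (anchors[g][1] - position[1]) for g, k in strengths.items())
    return (fx, fy)

# Two anchors; the member is three times more affiliated with list_b.
anchors = {"list_a": (0.0, 0.0), "list_b": (10.0, 0.0)}
strengths = {"list_a": 1.0, "list_b": 3.0}

position = equilibrium_position(anchors, strengths)
print(position, net_force(position, anchors, strengths))
```

Under this assumption, members with similar affiliation profiles settle close to one another, which is what makes clusters of affiliation visible on screen.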
  • 38. 5.8 Avatars 2002 Authors: Katy Börner, William Hazlewood, Sy-Miaw Lin Institution: School of Library and Information Science, Indiana University URL: Description: This project gave rise to a research paper: “Visualizing the Spatial and Temporal Distribution of User Interaction Data Collected in Three-Dimensional Virtual Worlds”. The project is a visualization of the social patterns in the Culture virtual environment, part of the Quest Atlantis universe. The map shows user trails over time. It was produced using a visualization tool developed by Katy Börner and colleagues at the School of Library and Information Science, Indiana University. Relevance: The particular relevance of this project lies in its visual pattern analysis. I think the underlying concept of being able to visually recognize different user trails in a 3D online game is extremely captivating. In a virtual game, often played with strangers, the notions of time and space alter considerably, which makes this project particularly challenging, since it tries to recreate a defined user trail pattern throughout a physically undefined space.
  • 39. 5.9 PeopleGarden Author: Rebecca Xiong Institution: Sociable Media Group - MIT Media Lab URL: Description: PeopleGarden: Creating Data Portraits for Users proposes the “Data Portrait” as a graphical medium for the visualization of information related to individual users of interactive media. The visual metaphor that PeopleGarden uses is that of flowers in a garden. Each data portrait is the trace of the user’s activities and takes the shape of a flower. Relevance: “On-line interaction environments such as Web-based message boards, chat rooms, and Usenet newsgroups have become widely popular. As the number of participants rises, it is increasingly difficult to distinguish individual users and to comprehend the overall interaction context.” In PeopleGarden the representation of a vague virtual space reaches its extreme by allowing it to be portrayed as a digital garden. The concept is that flowers represent individuals in a chat room, and the longer a user stays active in a conversation, the more that user’s flower can grow and expand. I think this project is conceptually very strong, as it presents an innovative visual method for representing a vague, unspecified space.
  • 40. 5.10 History Flow Authors: Martin Wattenberg, Fernanda Viégas Institution: IBM Watson Research Center URL: Author’s Description: “The history flow application charts the evolution of a document as it is edited by many people using a very simple visualization technique. History flow provides answers at a glance to questions like, Has a community contributed to the text or has it been mostly written by a single author? How much has a particular contributor influenced the current version of the document? Is the text's evolution marked by spurts of intense revision activity or does it reflect a smooth transition from its beginning to the present? The current version of history flow visualizes the evolution of pages from Wikipedia.” Relevance: History Flow is truly one of the most significant projects in revealing hidden patterns in a set of data that would otherwise go unnoticed by the user. This ability is undoubtedly one of the key strengths of Information Visualization. Using available data from the Wikipedia website, the authors built an inventive visualization model for analyzing the evolutionary pattern of individual contributions to Wikipedia articles through time. This visualization method bears some resemblance to ThemeRiver™, developed by the Pacific Northwest National Laboratory (PNNL), but the number of conclusions History Flow was able to facilitate is quite impressive. In a lecture given at the Parsons D+T Lab on February 23, 2005, Martin Wattenberg, speaking about this project, mentioned that it takes an average of 2 minutes for any kind of article vandalism to be noticed and repaired.
  • 41. 5.11 Listening Post Authors: Mark Hansen, Ben Rubin URL: Author’s Description: “Listening Post is an art installation that culls text fragments in real time from thousands of unrestricted Internet chat rooms, bulletin boards and other public forums. The texts are read (or sung) by a voice synthesizer, and simultaneously displayed across a suspended grid of more than two hundred small electronic screens.” Relevance: Although the toolset and the medium of this project are quite different from the screen-based interactive application intended for my thesis, I believe this project is an amazing precedent and one of the best installations I have ever seen. Exhibited at the List Visual Arts Center, Cambridge, Mass., and the Whitney Museum of American Art, New York, Listening Post was recently awarded a prize at the Ars Electronica 2004 Festival. Co-author Ben Rubin emphasizes the motivation for the project: “My starting place was simple curiosity: What do 100,000 people chatting on the Internet sound like?”. The significance of Listening Post is remarkable. It displays short messages, randomly picked from chat rooms according to a specific set of keywords, and it not only gives life to them by placing the messages in a specific spatial configuration, a “suspended grid of more than two hundred small electronic screens”, but also gives them a sound dimension, which makes the experience truly memorable. This large display of small screens resembles a “window” overlooking the activity in cyberspace.
  • 42. 6 Methodology 6.1 Summer Research My first presentation at the beginning of the Fall 2004 semester covered some of the wide-ranging research done over the summer. It was entitled “Discovering Complex Networks”. My approach to this first assignment was to treat the presentation as a lecture, educating my audience about the engaging science of complex networks and narrating all the discoveries and knowledge gathered in this initial phase. The presentation contained explanations and diagrams about the specific properties of scale-free networks and took a holistic view by showing examples of complex networks in domains as diverse as Gene Networks and Airline Routes. All the images shown at this presentation can be seen in Appendix A – Summer Research Presentation, at the end of this paper. To better understand the successive steps that led me to the study of complex networks, one should consult the Impetus chapter of this thesis. There I describe in detail the evolution of my research interests and motivation. I ended my Summer Research Presentation with a slide where I stated that my main interest was to “Visually map a dissemination/propagation pattern in a scale-free network”. I also made a short list of additional enquiries, where one could read: > How does an idea, innovation, fad, trend, disease or virus travel from A to B in a specific scale-free network? > How long does it take? > How many nodes are affected? > How do the hubs react?
  • 43. I concluded the presentation by stating my future goals: “To choose an area and subject to analyze, where I can bring something new to the field and contribute to its development.” 6.2 Visual Explorations After extensive research on Complex Networks I started to delve into different ways of visualizing them. The main premise was that complex networks are difficult to visualize, but we don't need to make them more complex in the process of trying. On September 27, 2004, I wrote the following in my thesis diary blog: “My thesis assertion has always been the visualization of dissemination patterns in a particular scale-free network. (…) However, I quickly found out that this premise is based on the assumption that the target network displays a visual structure suitable for analysis. Naturally, most of the time, this assumption is incorrect. Since a visual representation of a dissemination pattern cannot exist without a functional visual representation of the underlying network, I decided to dedicate my time, for now, to the visualization of complex networks. I've been delving into a set of visual explorations, collecting problems and proposing solutions.” “Functional visualizations are more than innovative statistical analyses and computational algorithms. They must make sense to the user and require a visual language system that uses colour, shape, line, hierarchy and composition to communicate clearly and appropriately, much like the alphabetic and character-based languages used worldwide between humans.” Matt Woolman, Digital Information Graphics
  • 44. As acknowledged in another blog entry, also in September 2004: “I've tried several open-source network visualization tools and seen hundreds of visualization examples. I think I found a critical problem. In most tools I've seen, the user starts building its network from an initial node. The user places the first node in the center of the drawing board and then, node after node, link after link, the network starts expanding. Since there's no preceding method of organizing the nodes and links in the designated area, new nodes start naturally occupying any free space available. Unsurprisingly, after a certain threshold, the lattice of lines and nodes becomes unbearable. This problem happens so many times.” The difference between this method and Mark Lombardi's drawings, for example, is a question of organization. Instead of the bottom-up approach described above, Lombardi used to plan his overall design with a holistic view of the entire network, knowing beforehand the amount of space he had and the exact number of nodes and links he needed to draw. Because of this, his drawings are remarkably clean, with edges that rarely overlap, and are an excellent example of network visualization. What I cannot understand is why Lombardi's method, and others like it, aren't taken into consideration whenever someone decides to build a visual representation of a network. A macro approach to the problem is definitely more appropriate: a top-down hierarchy instead of a bottom-up one. And to say Lombardi's networks were not complex enough is merely to oversimplify his work. The beautiful and eloquent global networks of Mark Lombardi
  • 45. Besides the problem mentioned above, I encountered two others in my research, which contribute drastically to the huge number of bad visualization examples of complex networks. First, most visual applications are based on constructive algorithms that obey one rule: display the input data. How the data is displayed is rarely considered. For that reason, often stunning visual forms demonstrate a low level of clarity and function. Second, the programmers who build open-source applications and the scientists and researchers who use them usually have no visual sensibility or graph drawing knowledge. Many researchers produce a visual model of the analyzed network as a mere additional element for showing their research. Sometimes it adds nothing to it. In my second thesis presentation of the Fall 2004 semester, I applied many of my reflections and sketches to practical examples, proposing possible solutions to improve the visualization of complex networks. I divided my solutions into five major steps. The main slides of this presentation can be seen in Appendix B – Complex Networks: Visual Explorations, at the end of this paper.
  • 46. 6.3 Prototype #1 This was my first visual prototype, shown at the Fall 2004 mid-term review. This review also marked the birth of the thesis title: Blogviz. The mid-term presentation was entitled Blogviz: An experimental social laboratory. The underlying concept was based on a major aspiration: local stability for the nodes and global connectivity for the links. The goal was to map the connectivity among blogs. I tried to position the nodes in a structured way, so they would remain fixed and, to some degree, under control. The links, however, would be in constant change and the outcome would be highly random and unpredictable. I chose to sort all the nodes in a precise manner in order to isolate the major hubs and have some control over the lattice resulting from the agglomeration of links. Looking at it now, the result seems too rigid and strict. The radial diagram, with its implosive structure, reinforces that rigidity by resembling a closed system, which probably doesn’t describe the blogs’ fundamental openness very well. Blogviz Visual Studies – Prototype #1 I realized I had to take a different path. I was trying too hard to control the outcome and I believe the result showed exactly that. I had to lose some of my constant need for control and let the system be more self-sufficient, self-organizing and adaptive. As stated in my thesis blog on October 24, 2004: “Another criticism I received during the presentation was that I was being too concerned with the visual aspect of it, and that I was thinking too much as a visual designer. Well, although I agree in part with the critic,
  • 47. my thesis assertion has always been the visualization of a specific dissemination pattern, and from my extensive research in complex networks, I truly believe that the only way I can positively contribute to this field is by employing my visual and interface design knowledge. In my first prototype presentation I dissected several problems on the visualization of complex networks and proposed distinct solutions that might solve some of its inconsistencies. I believe there has to be a balance between highly complex network visualizations that offer a poor functionality and highly aesthetic/innovative visual representations that might suffer from the same dilemma. I just have to pursue that balance.” In this same presentation I also illustrated some of my initial studies regarding the linkage among blogs. Connectivity in the blogosphere is a very binary process; we only need to ask two questions. Is blog A connected to blog B? If so, who is linking whom? If neither of them links to the other, they become momentarily isolated islands. For that presentation I showed a few visual studies where I mainly explored the concept of directional linkage, by visualizing inbound or outbound links, or, put simply, who is linking whom. The images below portray some of these explorations.
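The two questions above (is blog A connected to blog B, and who is linking whom) can be sketched as a lookup over directed links. This is only an illustrative sketch: the blog names are invented and the function `link_relation` is a hypothetical helper, not part of Blogviz.

```python
from typing import Set, Tuple

Link = Tuple[str, str]  # (linking blog, linked blog): an outbound link

def link_relation(links: Set[Link], a: str, b: str) -> str:
    """Answer the two questions: are a and b connected, and in which direction?"""
    a_to_b = (a, b) in links
    b_to_a = (b, a) in links
    if a_to_b and b_to_a:
        return "mutual"
    if a_to_b:
        return f"{a} links to {b}"
    if b_to_a:
        return f"{b} links to {a}"
    return "isolated"  # neither blog links to the other

# Hypothetical sample of directed links between three blogs.
links = {("blog_a", "blog_b"), ("blog_a", "blog_c"), ("blog_c", "blog_a")}
```

Storing each link as an ordered pair keeps inbound and outbound links distinct, which is what the directional studies explore.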
  • 48. 6.4 Prototype #2 While in my first prototype I was trying to find a structured way to map connectivity among blogs, by isolating the hubs and sorting the nodes according to popularity, in my second prototype I explored possible ways of visualizing diffusion patterns over time. I tried several models based on a radial structure where time became the major imposing element. In most of these experiments I faced a common problem in representing a continuous flow of infected blogs. The underlying radial structure seemed to impose its rigidity by forcing fractures in the pattern, particularly whenever there was a day transition. Blogviz Visual Studies – Prototype #2
  • 49. Blogviz Visual Studies – Prototype #2 Blogviz Visual Studies – Prototype #2
  • 50. I quickly found out I had to change my visualization thinking, since a radial structure didn’t quite apply to my subject of analysis. Perhaps I was too influenced or distracted by the Radial Form of Organization Chart from the Alexander Hamilton Institute or by Loom 2, by Danah Boyd (et al). Radial Form of Organization Chart (1924), Alexander Hamilton Institute. Loom2 - Danah Boyd, Hyun-Yeul Lee, Ethan Perry, Sociable Media Group - MIT Media Lab. As I wrote in my thesis blog on November 16, 2004: “At the moment I’m becoming convinced that a horizontal array is truly the best way of representing the quantitative and temporal qualities of a pattern. Time is a crucial domain in a dissemination pattern, particularly in a word-of-mouth social behavior. The amazing potentialities of a horizontal assortment is the uninterrupted continuous flow of data and the possibility of collapsing time frames and still maintain a sense of scale and understanding of the pattern dynamics.” Blogviz Visual Studies – Horizontal array of adopting units
  • 51. Blogviz Visual Studies Different tryouts where adopting units (blogs) are structured in a vertical and horizontal array After this critical change in my visualization studies I started doing a lot of sketching and writing. I built a few diagrams to get a full understanding of my system, built several taxonomies and dissected the mechanics of blogging. This examination helped me get my ideas straight and get a sense of what I was dealing with. 6.5 Prototype #3 In my third prototype I introduced Blogviz as a “topological model of meme behavior”. Based on the conclusions of my previous tryouts, I decided to explore in depth the notion of a horizontal array of adopting units (weblogs) to portray the propagation pattern of a specific topic. By doing that I would be constraining the Time element to the X axis. The following images represent a series of tryouts in this context.
  • 52.
  • 53. In this phase of the project I also introduced the first visual taxonomy of Blogviz, by dissecting the system and its intrinsic elements. The following image portrays a critical understanding of the inherent structure of Blogviz at that stage. At the same time, a list of goals was created (left image) in order to better understand the intent of Blogviz.
  • 54. 6.6 Prototype #4 After the series of independent and scattered visual studies that characterized the initial trials, this fourth prototype was the first solid attempt at establishing Blogviz as an interactive visualization model. At the time I was pushing the concept of an application or tool of analysis, which according to some critics implied a need for commercial viability. Even though I’m convinced this thesis has several elements that could be successfully applied in commercial applications, my goal with this project is to elevate the understanding of Memetics in a specific social network and conduct a serious research experiment, which I believe fits more adequately within the academic realm. Another point worth considering is that, when I developed this prototype, Blogviz was intended to work with real-time data, in the form of hourly updated XML RSS feeds. This idea changed afterwards; however, it was a crucial consideration in the development of this prototype. Prototype #4 – Default First Page
  • 55. A quick explanation of the previous image’s visual schema: circles represent topics; the diameter corresponds to the total number of adopting blogs; and the colors, pink and green, denote respectively a decreasing or increasing course. Time is again incorporated in the X-axis, where the closer a circle is to the right edge of the window, the more recent its last dispatch. The Y-axis position of each circle helps reinforce its level of adoption. The main interaction in this fourth prototype was based on a simple flow. The default first page would allow a swift view of the general pattern by showing the overall state of the current topics’ popularity. If one decided to investigate the structure and evolution of a particular topic more deeply, one would be taken to a sequence of examination methods. The following images illustrate some of the techniques proposed. Prototype #4 – Blogs’ evolutionary paths through time Prototype #4 – Plotting blogs according to time/popularity
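The visual schema of the default first page can be sketched as a mapping from topic data to circle attributes. The text gives only the qualitative rules, so everything numeric here is an assumption: the window size, the 24-hour window, the scaling constants and the field names are hypothetical, and an unchanged count is arbitrarily treated as increasing.

```python
from dataclasses import dataclass

@dataclass
class Topic:
    title: str
    adopting_blogs: int            # total number of adopting blogs
    previous_adopting_blogs: int   # hypothetical: total at the previous update
    hours_since_last_post: float   # hypothetical: age of the last dispatch

@dataclass
class Circle:
    x: float
    y: float
    diameter: float
    color: str

def encode(topic: Topic, width: float = 800.0, height: float = 400.0,
           window_hours: float = 24.0, px_per_blog: float = 0.5,
           max_blogs: int = 500) -> Circle:
    """Map a topic to a circle following the schema described above."""
    # X-axis: the more recent the last dispatch, the closer to the right edge.
    age = min(topic.hours_since_last_post, window_hours)
    x = width * (1 - age / window_hours)
    # Y-axis: reinforces the level of adoption (higher adoption sits higher up).
    level = min(topic.adopting_blogs / max_blogs, 1.0)
    y = height * (1 - level)
    # Diameter: proportional to the total number of adopting blogs.
    diameter = topic.adopting_blogs * px_per_blog
    # Color: green for an increasing course, pink for a decreasing one.
    color = "green" if topic.adopting_blogs >= topic.previous_adopting_blogs else "pink"
    return Circle(x, y, diameter, color)

circle = encode(Topic("sample topic", adopting_blogs=100,
                      previous_adopting_blogs=80, hours_since_last_post=6.0))
```

With these assumed constants, a topic last cited six hours ago sits three quarters of the way toward the right edge.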
  • 56. Prototype #4 – Detailed View Prototype #4 – Detailed View Prototype #4 – Blogs’ adoption represented by a Tree Map Prototype #4 – Blogs’ analysis by Theme and Generator Prototype #4 – Blogs’ relationship analysis
  • 57. 6.7 Final Application A major shift in the development of Blogviz was the decision not to incorporate real-time data in the backend of the application. As previously stated, in my fourth prototype I was mostly concentrated on developing a visualization schema that would expose current trends in the topics diffusion process, by reading data from hourly updated XML feeds. It would basically display the most adopted topics spreading in the blogosphere at any given time. Even if the application allowed an extended breakdown of each topic, rather than just a quick view of present information tendencies, it only considered a restricted number of topics. I believe Blogviz’s concept, at that phase, was trying to incorporate too many features, or levels of analysis, without being able to develop any one of them efficiently. It was also becoming a trend analysis tool rather than a comprehensive model of topics distribution. I wanted Blogviz to become a serious visualization study on information diffusion in blogspace, and not so much a marketing application. I still believe there’s enormous potential in visualizing popular topics with real-time data integration, and that might be something Blogviz will incorporate in the future. However, I first wanted to better understand the topics’ inner structure and evolution through time. This change in Blogviz’s progress also coincided with a parallel immersion in the domains of Epidemiology and Diffusion of Innovations Theory. I never imagined that an apparently minor adjustment would require such a drastic turnaround in the project’s conceptualization. Until then, Blogviz had been dealing with a very restricted and manageable time span. Real-time data visualization was constrained to one day or, at most, one week. By contrast, in aiming at an adaptive model, the critical goal was to come up with a visualization method that could easily include time variations and still be consistent.
Another crucial problem was to visualize a high number of topics in a very tight space. I had to come up with a visualization model that would address both of these problems. First, it had to be flexible enough to embrace distinct time spans, but at the same time maintain uniformity throughout the process. Second, it had to be able to include a high number of topics and also allow an immediate understanding of the overall pattern and the individual life cycle of each topic.
  • 58. While looking for inspiration in diverse sources, I came across an illuminating diagram by E. J. Marey, in Edward Tufte’s The Visual Display of Quantitative Information, that resolved particularly well many of the challenges I was facing. Original Image: E. J. Marey, La Méthode Graphique (Paris, 1885), p.20. Source: Tufte, Edward R., The Visual Display of Quantitative Information The preceding image illustrates Marey’s graphical train schedule for Paris and Lyon in the 1880s. The X-axis incorporates Time, measured in hours, and maintains the same scale on both the top edge (corresponding to departures and arrivals from Paris) and the bottom edge (for departures and arrivals from Lyon). The remaining horizontal lines represent other train stations between Paris and Lyon. The diagonal lines represent different trains, leaving from and arriving at the two main stations, and the horizontal line breaks represent waiting time in secondary stations. This chart influenced me greatly in the following steps of my project. I believe it is an extraordinary example of information visualization, where time and pattern become one intrinsic entity, allowing a substantial understanding of the data dynamics in one brief look. I applied a modified version of this concept to Blogviz, where the lines became representative of topics, and the time scale was measured in days. Blogviz’s model doesn’t incorporate any type of constraint on the Y-axis, as Marey’s graph does, therefore the overall height of the main window is rather arbitrary. The following image represents the main visualization window for topics’ evolution within the Blogviz environment.
  • 59. Blogviz’s topics visualization – Topic Lines and Time Scale The interesting characteristic of this model is that, as in the Paris/Lyon train schedule example, the angle of each line has a specific meaning. This happens because both the top and bottom edges of the window maintain the same time scale. Therefore, the wider the angle (that is, the closer the line is to vertical), the shorter the duration, in this case the topic’s duration. In the image above, for example, one may see a line, close to the center of the window, which seems to be almost vertical; this means that the life cycle of that particular topic was very short. This feature is even more relevant for topic lines that have either the starting or ending point outside the present timeframe. I conducted a small experiment within the same model, where the lines, instead of being placed diagonally, were drawn horizontally. This method was probably even more successful when the lines had both the starting and ending point inside the selected time span. However, when topic lines had a first or last day of spreading outside this frame, it was impossible to tell how many days lay beyond it. What the diagonal alignment facilitates is a full understanding of the topic’s life cycle, even when it spreads outside the present time span. To better understand the intricacies of this visualization model, the following images illustrate the four possible life cycles for every topic line, within each timeframe, and the way they are represented.
  • 60. Topic with first and last day of spreading within the current time span Topic with first day of spreading outside the current time span Topic with last day of spreading outside the current time span
  • 61. Topic with first and last day of spreading outside the current time span The angle of the prediction line, for dates outside the timeframe, is obtained through an equation that multiplies the number of days (topic duration) by the number of pixels in each day parcel. So if a specific topic line has its starting point (first day of spreading) within the present timeframe and its last day outside of it, and the topic lasts a total of 64 days, the system multiplies 64 by 12 (the number of pixels in a day parcel) from the starting point, and a line is drawn dynamically to the resulting end point. Another feature of this visualization method, further explained in the following Blogviz Interface section, refers to the brightness or color saturation of each line. In Blogviz, the default setting for the lines’ brightness is a depiction of the total number of adopting blogs. This allows for a comprehensible insight when evaluating the overall pattern. At a glance, one is able to identify the life cycle of each topic and also the number of blogs that adopted it. I like to consider the visual representation of this model as a metaphor for a window overlooking cyberspace, with lines of information flow continuously crossing it.
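The end-point calculation described above can be written out as a short sketch. The 12 pixels per day parcel and the 64-day example come from the text; the function name, the window height of 300 pixels and the choice of drawing from the top edge to the bottom edge are assumptions made only for illustration.

```python
import math
from typing import Tuple

Point = Tuple[float, float]

def topic_line(start_day: int, duration_days: int, px_per_day: int = 12,
               window_height: float = 300.0) -> Tuple[Point, Point, float]:
    """Return the start point, end point and angle (degrees from horizontal)
    of a topic line.

    Assumptions: the line starts on the top edge at its first day of
    spreading and ends on the bottom edge; window_height is arbitrary.
    """
    x_start = start_day * px_per_day
    # Horizontal run: number of days multiplied by the pixels of a day parcel.
    x_end = x_start + duration_days * px_per_day
    angle = math.degrees(math.atan2(window_height, x_end - x_start))
    return (x_start, 0.0), (x_end, window_height), angle

# The example from the text: 64 days * 12 pixels = 768 pixels of horizontal run.
start, end, long_angle = topic_line(start_day=0, duration_days=64)
_, _, short_angle = topic_line(start_day=0, duration_days=1)
```

Because both edges share the same time scale, a one-day topic yields a nearly vertical line and a 64-day topic a much flatter one, and the end point may fall outside the visible window when the last day lies beyond the timeframe.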
  • 62. Blogviz Interface (1) General information about the currently analyzed data. It displays the total number of topics and the total number of blogs presently in Blogviz’s database. (2) Timeframe navigation. This feature is still under development. Presently Blogviz maps only the evolution of topics within the first 64 days of 2005, that is, from January 1st to March 5th, 2005. However, the time span analyzed by Blogviz is intended to grow in the future, and this area is reserved for its control and navigation. (3) Main visualization window. It displays all the topics, represented by lines, along with their titles, total number of adopting blogs, peak day of diffusion, and first and last day of spreading. The lines’ angles are representative of the topic’s life cycle or duration.
  • 63. The wider the angle, the shorter the duration. As an example, when a topic line assumes a close to vertical alignment, it indicates that the represented topic lasted for a short period of time. Conversely, when lines take a near horizontal position, this translates into an extended life cycle. Also, the default setting for the lines’ brightness is a reflection of the total number of adopting blogs. Brighter lines denote highly adopted topics. This last feature can be changed in the Visualization Settings (11), where one can also choose to have the lines’ brightness determined by the average number of inbound links per blog. (4) Bottom visualization window. This environment reflects whatever choice is made in the upper window, whenever a particular topic is selected by clicking its respective line. It can display a topic’s evolution either by daily incidence or by rate of adoption. These options can be better understood by reading point (11) - Visualization Settings. (5) Topic Title. This area has a button that, when pressed, opens the topic URL in a new browser window. (6) Screenshot of the topic URL. (7) First and Last blog. This panel displays information regarding the first (innovator) and last (laggard) “known” blogs to adopt a topic. It specifies the blog title, date and time of adoption, type of generator, and number of inbound links (indicating its level of popularity). (8) Overall information about each topic. This area shows the total number of adopting blogs, the total number of days (duration), the peak day of diffusion (date format) and the average number of links per blog (which reflects the topic’s popularity or level of authority), for every selected topic.
  • 64. (9) List of all generators in descending order. It displays the generator name (Blogger, MovableType, etc.) and the corresponding number of blogs that use it, within the specified topic. (10) Top Innovators/Adopters. This panel allows a quick view of the most popular topic adopters, popular generators, and popular innovators (not yet implemented). The overall numbers shown in this section relate to the current data in Blogviz’s database. The Popular topic adopters option displays the name of the blog and the number of topics it has adopted. The Popular generators option shows a basic ranking of generators used by all blogs in the database. Popular innovators will be organized by generator, and will display the number of topics in which a generator was used by its innovator (first adopting blog). This last option will allow interesting results, such as understanding whether Blogger users, for example, are more or less innovative than LiveJournal users, and vice-versa. The options available in this area will continue to expand in the future. (11) Visualization Settings. This panel is intended to control the visualization settings for Blogviz’s two main windows. The features included in this panel will expand in the future, but currently there are two possible options for each window. Visualization settings – Top Window Visualization settings – Bottom Window
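The per-topic figures listed in points (7) to (9) can all be derived from a list of adoption records. The sketch below is an assumption about how such a summary might be computed, not a description of Blogviz's ColdFusion/MySQL backend; the record fields, the function `summarize_topic`, the sample blogs and the inclusive day count are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, datetime
from typing import List

@dataclass
class Adoption:
    blog_title: str
    generator: str        # e.g. Blogger, MovableType
    inbound_links: int    # indicates the blog's level of popularity
    adopted_at: datetime  # date and time of adoption

def summarize_topic(adoptions: List[Adoption]) -> dict:
    """Derive the panel figures for one topic from its adoption records."""
    ordered = sorted(adoptions, key=lambda a: a.adopted_at)
    per_day = Counter(a.adopted_at.date() for a in ordered)
    first_day = ordered[0].adopted_at.date()
    last_day = ordered[-1].adopted_at.date()
    return {
        "innovator": ordered[0].blog_title,   # first known blog to adopt
        "laggard": ordered[-1].blog_title,    # last known blog to adopt
        "total_blogs": len(ordered),
        # Assumption: duration counts both the first and the last day.
        "duration_days": (last_day - first_day).days + 1,
        "peak_day": max(per_day, key=per_day.get),
        "avg_inbound_links": sum(a.inbound_links for a in ordered) / len(ordered),
        # Generators in descending order of use within the topic.
        "generators": Counter(a.generator for a in ordered).most_common(),
    }

# Hypothetical records for a single topic.
summary = summarize_topic([
    Adoption("Blog A", "Blogger", 10, datetime(2005, 1, 1, 9, 0)),
    Adoption("Blog B", "MovableType", 30, datetime(2005, 1, 2, 10, 0)),
    Adoption("Blog C", "Blogger", 20, datetime(2005, 1, 2, 12, 0)),
    Adoption("Blog D", "LiveJournal", 0, datetime(2005, 1, 4, 8, 0)),
])
```

Keeping one record per adopting blog means every panel value is a simple aggregate, so new panels such as Popular innovators could be added without changing the stored data.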
  • 65. In the top window, one may opt for the topic lines’ brightness to correspond either to the total number of adopting blogs (reflecting the highly adopted topics) or to the average number of inbound links per blog (which depicts the topic’s level of popularity or authority). As an example, there might be a topic that has been adopted by a large number of blogs but whose average number of links per blog is very low, which means that the blogs that adopted this topic are not highly popular, therefore decreasing the topic’s level of influence. In the bottom window one can choose to analyze the topic evolution either by daily incidence (default preference) or by cumulative rate of adoption. The following images demonstrate the different visualization settings for the same topic. Top Window – Visualization settings – Lines’ brightness by Total number of adopting blogs Top Window – Visualization settings – Lines’ brightness by Average number of links per blog
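The two brightness settings for the top window can be contrasted with a small sketch. The text does not say how brightness is scaled, so the linear normalization against the highest value is an assumption, as are the function name, the mode names and the sample data.

```python
from typing import Dict, List

def line_brightness(topics: Dict[str, List[int]],
                    mode: str = "total_blogs") -> Dict[str, float]:
    """Brightness in [0, 1] for each topic line.

    topics maps a topic title to the inbound-link counts of its adopting
    blogs. Assumption: brightness scales linearly with the chosen metric,
    relative to the highest value among all topics.
    """
    if mode == "total_blogs":
        values = {t: float(len(links)) for t, links in topics.items()}
    elif mode == "avg_inbound_links":
        values = {t: (sum(links) / len(links) if links else 0.0)
                  for t, links in topics.items()}
    else:
        raise ValueError(f"unknown mode: {mode}")
    peak = max(values.values(), default=0.0)
    return {t: (v / peak if peak else 0.0) for t, v in values.items()}

# Hypothetical data: topic A has many unpopular adopters, topic B few popular ones.
topics = {"A": [1, 1, 1, 1], "B": [10, 30]}
by_total = line_brightness(topics, "total_blogs")
by_links = line_brightness(topics, "avg_inbound_links")
```

In this invented sample, topic A is the brightest line under the default setting but nearly disappears when brightness follows the average number of inbound links, which mirrors the example given in the text.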
  • 66. Bottom Window – Visualization settings – Topic Evolution by Daily Incidence Bottom Window – Visualization settings – Topic Evolution by Cumulative Rate of Adoption
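The two bottom-window views are related by a running sum: the cumulative view adds up the daily incidence day by day. The sketch below illustrates this with invented daily counts; the function names are hypothetical, and expressing the cumulative curve as a share of all adopters is an assumption about what "rate" means here.

```python
from itertools import accumulate
from typing import List

def cumulative_adoption(daily_incidence: List[int]) -> List[int]:
    """Running total of adopting blogs, day by day."""
    return list(accumulate(daily_incidence))

def adoption_share(daily_incidence: List[int]) -> List[float]:
    """Cumulative adoption as a fraction of all adopters (assumed reading of 'rate')."""
    total = sum(daily_incidence)
    if total == 0:
        return [0.0] * len(daily_incidence)
    return [count / total for count in cumulative_adoption(daily_incidence)]

# Hypothetical number of new adopting blogs on each day of a topic's life cycle.
daily = [2, 5, 3, 0, 1]
cumulative = cumulative_adoption(daily)
share = adoption_share(daily)
```

The daily view highlights the peak day of diffusion, while the cumulative view shows how quickly the topic approaches its final number of adopters.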
  • 67. 7 Technical Sources Although most of the following sources could be considered precedents, I decided to make a distinction according to visual depiction. Even though the listed websites represent strong conceptual precedents to my thesis, their underlying structure makes them a stronger and more tangible technical source for its future development. Also, in contrast to visual analysis and representation, most of these resources display only text-based information regarding blogs’ connectivity and topics’ popularity. 7.1 Blog Engines BlogPulse, Bloogz, Technorati, Popdex and Bloglines, among others, are Blog Search Engines, similar to Google or Yahoo, but restricted to the blog community. Besides offering a free blogosphere search, they list the most popular blogs (by daily number of visitors or inbound links), the most common search inputs, and the trendiest topics (by recently added links/quotes in daily blog entries). Some of these websites offer additional information, such as: the evolution of a key sentence/word in static diagrams, shared links between blogs, news/words popularity, key people, discussion threads, neighborhoods, and blogrolling. Most of these features are text based, listed according to date, popularity, or a specific ranking. Many of these resources also provide RSS feeds of their content in XML format, which can be read by browsers or newsreaders, applications similar to email programs that interpret the feeds and list them by title, with a small descriptive paragraph and a link to the source. Regardless of the interesting features these services offer in capturing the trendiest bustle in the blogosphere, most of them present solely textual rankings, which make it difficult to compare against additional factors or to draw further conclusions from overlapping information.
Besides lacking visualization, which would help in understanding this complex network of dependencies, most of these services are concerned only with capturing the momentary fad. As one can read on the Blogdex homepage, regarding the constantly updated list of websites displayed there, they represent “the most contagious information currently spreading in the weblog community”. Most of these services have disregarded the evolutionary process of information contagion and the dynamics of these diffusion patterns.
  • 68. 7.2 Blogviz Data All the data used in Blogviz was obtained from three main blog services. The daily topics were collected through Blogpulse. The corresponding blogs for each topic were taken from Blogdex. Finally, the number of inbound links for each topic was obtained from Technorati. The main reason this procedure was spread over three different sources is that none of them alone provided all the elements Blogviz needed. In the beginning I was expecting to work with a single data source, but further along in the process I concluded that each source had a specific core asset that was particularly useful for the development of the project. I believe this process was also positive, since it took a more pluralistic approach, not constrained by the limitations of a single source. The data was collected during March and April 2005. I started by building a personal database, using ColdFusion, MySQL and Microsoft Access. It took some time to have the database up and running, but once the structure was created and operational, all the effort was concentrated on entering the data manually into the new data source.
  • 69. 7.2.1 Blogpulse “BlogPulse is an automated trend discovery system for blogs. (…) BlogPulse applies machine-learning and natural-language processing techniques to discover trends in the highly dynamic world of blogs.” Blogpulse is basically a portal into the world of blogs. Here are some of its features: – A search engine for blogs – A daily list for blog content (top links) – A look at real-world trends as reflected through blogs (static diagrams) – A showcase, seen as a virtual sandbox where researchers bring ideas, tools and gadgets for blogging The key advantage of Blogpulse is that it saves its lists of daily top links in a large archive, easily accessible by the general public. Each HTML file can be searched by date of occurrence. For example, if on November 24, 2005 I wanted to know which links were most popular among weblogs on February 8, 2004, Blogpulse would give immediate access to this information. This feature is of extreme relevance to the development of Blogviz. It not only allowed this thesis to expand its scope, but will also facilitate its continued development. In spite of the utility of this feature, Blogpulse is slightly limited when it comes to describing the adopting blogs (or citations) for each topic. The list does not contain a time of “adoption” for each blog, and even the overall number of days a topic persists is not easy to perceive. This is why Blogdex comes in as a second data source for Blogviz.
  • 70. 7.2.2 Blogdex Blogdex is a research project developed at the MIT Media Lab, intended to track the diffusion of information through the weblog community. It lists a sequence of links considered “the most contagious information currently spreading in the weblog community”. Cameron Marlow, a PhD candidate at the MIT Media Lab, is the main figure behind this project. “Blogdex crawls all of the weblogs in its database every time they are updated and collects the links that have been made since the last time it was updated. The system then looks across all weblogs and generates a list of fastest spreading ideas. This is the list shown on the front page. For each of these links, further detail is provided as to where the link was found, and at what time.” The core usefulness of Blogdex in the development of Blogviz is that it has one of the most organized sorting systems of adopting weblogs for every topic. For whatever topic a user selects, in addition to the front-page list, Blogdex displays a list of adopting blogs, organized by date and specific time of adoption. This allows a significant understanding of a topic's evolution by tracking the exact time of “contagion” for every blog.
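A time-ordered list of adopting blogs, as described above, is the raw material for the two topic-evolution views named in the visualization settings: daily incidence and cumulative rate of adoption. The following is only a minimal Python sketch of that calculation, not the ColdFusion/MySQL code used in Blogviz. The function name and the sample timestamps are hypothetical, and "cumulative rate" is interpreted here as the running share of all observed adopters, which is an assumption.

```python
from collections import Counter
from datetime import datetime, timedelta
from itertools import accumulate


def topic_evolution(adoption_times):
    """Return (days, daily_incidence, cumulative_rate) for one topic.

    adoption_times: one datetime per adopting blog (hypothetical input,
    standing in for a time-ordered list of adopting blogs).
    """
    if not adoption_times:
        return [], [], []
    per_day = Counter(t.date() for t in adoption_times)
    first, last = min(per_day), max(per_day)
    # Include days with zero adoptions so the curves have no gaps.
    days = [first + timedelta(days=i) for i in range((last - first).days + 1)]
    daily = [per_day.get(d, 0) for d in days]
    total = len(adoption_times)
    # Running share of adopters (assumed meaning of "cumulative rate").
    cumulative = [running / total for running in accumulate(daily)]
    return days, daily, cumulative


# Hypothetical adoption times for one topic (not real Blogdex data).
sample = [
    datetime(2005, 1, 3, 9, 0),
    datetime(2005, 1, 3, 17, 30),
    datetime(2005, 1, 4, 8, 0),
    datetime(2005, 1, 6, 12, 0),
]
days, daily, cumulative = topic_evolution(sample)
print(daily)
print(cumulative)
```

Filling in the days with no adoptions keeps both series aligned on the same time axis, which matters when several topics are compared side by side.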
  • 71. 7.2.3 Technorati “Technorati is a real-time search engine for the blogosphere. Technorati tracks the number of links, and the perceived relevance of blogs, as well as the real-time nature of blogging. Because Technorati automatically receives notification from weblogs as soon as they are updated, it can track the thousands of updates per hour that occur in the blogosphere, and monitor the communities (who's linking to whom) underlying these conversations.” Because Technorati is probably the largest blogosphere search engine at present, it was used to measure the broad scale of popularity of each weblog collected in Blogviz’s database. The popularity of a blog is the number of inbound links it has, or in other words, the number of blogs that link to it.
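The popularity measure defined above (the number of blogs that link to a given blog) can be illustrated with a short Python sketch. It assumes the links are available as (source, target) pairs; the function name and the sample data are hypothetical, and this is not a description of how Technorati itself computes its figures.

```python
from collections import defaultdict


def inbound_link_counts(links):
    """Count the distinct blogs linking to each blog.

    links: iterable of (source_blog, target_blog) pairs (hypothetical data).
    """
    linkers = defaultdict(set)
    for source, target in links:
        if source != target:  # a blog linking to itself is ignored
            linkers[target].add(source)
    return {blog: len(sources) for blog, sources in linkers.items()}


# Hypothetical link data: "a" links to "b" twice, which counts once.
links = [("a", "b"), ("c", "b"), ("a", "b"), ("b", "a"), ("a", "a")]
popularity = inbound_link_counts(links)
print(popularity)
```

Storing the linking blogs in a set means repeated links from the same blog count once, which matches the definition "number of blogs that link to it" rather than the raw number of links.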
  • 72. 8 Conclusion Blogviz is not the most accurate or the most reliable method for visualizing topic diffusion in the blogosphere. However, it is a well-thought-out attempt. It is also a trial in advancing the understanding of memetics, as it pertains to the information dynamics that surround us. Blogviz is a visualization model, and the notions of trial and prototype are implicit in its design. I like to consider this project an experiment in the field of Information Diffusion. Blogviz still has a long way to go, but the first step has been made. I believe the MFA program was a crucial context for its embryonic development, but there are still so many other features I want to implement in the model that it’s hard to list them individually. I also think there’s an increased motivation when developing a one-year thesis project, and I expect to continue expanding it beyond its immediate delivery deadline. Information Diffusion Models (IDM) are not abundant in the research arena compared, for example, with Diffusion of Innovations Models (DIM), which attract strong commercial interest. Because of DIM’s applicability in management and economics theory, they have become an important measurement tool for new product releases in most industries. Nonetheless, even if IDM’s commercial significance is still not so obvious, Information Dynamics is becoming an area of great interest among researchers. In a world increasingly flooded with petabytes of information, it’s critical to understand how this flow of information behaves. I believe the fields of Information Architecture and Information Visualization, among others, have the responsibility not only to make information more useful and understandable, but also to investigate how the information itself propagates.
This is an area of study where other sciences such as sociology, semiotics, linguistics, human-computer interaction, cognitive psychology, memetics, epidemiology, and diffusion of innovations, play a major role in deciphering the intricacies of Information dynamics. And this is in fact the main reason why Blogviz came to life.
  • 73. I have this higher goal, or perhaps delusion, of building a perfect visualization system that doesn’t need any extra panels or descriptive features to aid its understanding. I’m conscious, though, that such a system is quite utopian. From the moment one decides to let other people interact with the application, there is a vast array of learning curves to be considered. Different people have different levels of expectation, knowledge and patience. One cannot expect a system, however simple and intuitive it may seem to its creator, to be as straightforward for other users. There was no user testing on Blogviz, mostly due to lack of time. Even if the model is not intended to be used and understood by every single Internet user (the Internet being where it resides), it has to be comprehensible and insightful to its immediate audience, described in this paper. Therefore, Blogviz will evolve from the feedback and judgments of its users. I’m still not sure the right visualization choices were made, and that’s something I will probably never feel confident about. I had extremely high ambitions for this project, which relate closely to my own nature and which sometimes prove counterproductive. I tried hard to come up with a visual schema that would be somehow revolutionary and could even be adapted to different models in other fields. I’m sure I wasn’t able to do that. But one thing that comforts me is knowing that I put all my effort and dedication into this project, and even if the result doesn’t prove to be as groundbreaking as initially planned, its development was an extremely enriching experience. The thinking process, as illustrated in the Methodology chapter, was a long and laborious course with several important iterations. I have to admit, at this stage, that I’m not certain the last iteration was the best resolved one in terms of the visual depiction of my subject of analysis.
I guess this state of uncertainty is softened by the thought that, under the natural progression of the project, this resolution was the final result of improvements on the previous ones. Sometimes I also feel I’m overloading Blogviz with too much data, and that in doing so the visualization features are being diminished by this assortment of info-bars and text. But at the same time there’s so much information I want to include, so many interesting
  • 74. comparisons to be made, that I feel it would be nonsense to discard them. It’s quite a challenge to find a balance between these two forces. On one side I’m trying to keep the visualization simple and intuitive, and on the other I want to extend the data analysis for an even richer, more informative experience. I guess it comes down to the eternal balance between form and function. This project was undoubtedly a technical achievement for me. This was the first time I deeply explored technologies such as ColdFusion, MySQL, XML, Flash Remoting and Flash Communication Server, among others. It was my first solid attempt to deal with dynamic data exchange, middleware and database communication technologies. In the end I was able to overcome a series of obstacles that had made me very self-doubting from the beginning. Throughout the thesis development process I thought to myself, many times, that I might not be able to overcome many of the technological and programming hurdles needed to make this project a reality. Fortunately, I did. The outcome, within this framework, was exactly as I expected, and I must admit that the achievements made in this process were the ones that excited me the most. In this context I think this project was very successful and rewarding for me. I cannot wait to apply the technical knowledge I acquired to other endeavors. It’s rather curious that my main fixation, from the beginning of this thesis process, is not currently represented in Blogviz. This obsession is associated with the qualitative dimension of the analyzed pattern. I believe that in terms of pattern depiction, as it relates to the core concept of Blogviz, both its quantitative and temporal dimensions were satisfactorily resolved. The missing link, which would provide an even more critical insight into the pattern evolution, concerns the existing relationships between weblogs.
From my first prototypes, my main obsession has always been to visualize the inbound and outbound links among blogs. It was naïve on my part to believe that by understanding the linkage I would be able to explain how information spreads. Linking practice, as said before in this paper, is not enough to explain the adoption process of information in blogspace. This was clearly shown in the HP Information Dynamics Lab study, also mentioned before, which reported that roughly 75% of blogs for each topic don’t have any direct link between them. However, there are still levels of visualization I plan to implement in order to portray the relationships between topics’ adopting units (weblogs). Some of these levels will be constrained to specific timeframes and generator groups.
  • 75. 8.1 Initial Results However small a sample of 330 blogs might be, I believe a few patterns are starting to emerge from Blogviz, some of which I was not expecting in the beginning. I’m aware that these initial results are still too immature to support any reliable conclusion, but I thought I should mention them nonetheless. – The first blogs to start a topic are usually not highly popular. It seems the most popular blogs only start appearing a certain number of days after the initial date of spreading. It will be interesting to analyze whether this behavior follows some sort of ratio or consistency through the analysis of a larger portion of topics. However, this is an outcome I was somewhat expecting, based on previous studies in this area. – Some topics are clearly dominated by a particular type of generator. Interestingly enough, these appear to be more common among LiveJournal users, who count among the less popular weblogs in blogspace. – 8 of the 9 topics analyzed so far have their peak day within 1 or 2 days of the first day of spreading, even when they extend through long periods of time. 8.2 Next Steps – Create an automatic script to input the data into the existing database. This method will allow a much larger number of blogs to be seamlessly incorporated into Blogviz. The more data Blogviz gathers, the more plausible its subsequent results will be in sustaining the key intent of the project: to improve our understanding of the dynamics of information propagation in blogspace. – Improve the code. This is probably a never-ending task, but nonetheless it’s important to acknowledge that there’s always room for improvement in the application code. – Implement node visualization (blogs) and subsequent layers of analysis. So far, Blogviz is employing most of its visual depiction in portraying patterns’ evolution, or in other words, topics’ inner structure and transmission through
  • 76. time. The next step will be to include a series of visual techniques to represent single adopting units’ (blogs) behaviors, relationships, and levels of popularity, adoption and innovation. – Extend the current time span of the application, to the possible point of including real-time data. Presently Blogviz maps the evolution of topics within the first 64 days of 2005, that is, from January 1st to March 5th, 2005. This decision was made because of the time limitations of this thesis; nonetheless, the model was built to accommodate other timeframes. – Feedback. As mentioned before, there was no time to conduct any kind of user testing, so I plan to build a bug reporting and comment form to help solve existing problems and hopefully receive interesting suggestions.
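The peak-day observation in the initial results (most topics peak within 1 or 2 days of the first day of spreading) is the kind of pattern that could be re-checked automatically as more topics enter the database. The sketch below is a hypothetical Python illustration, not part of Blogviz: the function names and the daily counts are invented for the example, and "within 1 or 2 days" is read as an offset of at most 2 days from the first day.

```python
def peak_day_offset(daily_incidence):
    """Days between the first day of spreading and the busiest day.

    daily_incidence: citations per day, starting on the first day of
    spreading. If several days tie for the maximum, the earliest wins.
    """
    if not daily_incidence:
        raise ValueError("need at least one day of data")
    return daily_incidence.index(max(daily_incidence))


def share_peaking_early(topics, max_offset=2):
    """Fraction of topics whose peak day is at most max_offset days in."""
    early = sum(
        1 for counts in topics.values() if peak_day_offset(counts) <= max_offset
    )
    return early / len(topics)


# Hypothetical daily citation counts per topic (not Blogviz data).
topics = {
    "topic_a": [4, 9, 6, 2, 1],
    "topic_b": [1, 2, 3, 7, 4, 2],
    "topic_c": [5, 3, 1],
}
print({name: peak_day_offset(counts) for name, counts in topics.items()})
print(share_peaking_early(topics))
```

With a larger sample, the same two functions would show whether the 8-out-of-9 proportion holds or was an artifact of the small number of topics.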
  • 77. 9 Bibliography Emergence/Self-Organization Holland, John H. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. Reprint edition. Cambridge, Massachusetts: The MIT Press, April 29, 1992. Johnson, Steven. Emergence: The Connected Lives of Ants, Brains, Cities, and Software. New York: Scribner, September 10, 2002. Marcus, Gary. The Birth of the Mind: How a Tiny Number of Genes Creates the Complexities of Human Thought. New York: Basic Books, December 16, 2003. Strogatz, Steven H. SYNC: The Emerging Science of Spontaneous Order. 1st edition. Hyperion, March 5, 2003. Complex Networks Barabasi, Albert-Laszlo. Linked: How Everything Is Connected to Everything Else and What It Means for Business, Science, and Everyday Life. Plume Books, April 29, 2003. Buchanan, Mark. Nexus: Small Worlds and the Groundbreaking Science of Networks. 1st edition. New York: W. W. Norton & Company, May 1, 2002. Huberman, Bernardo. The Laws of the Web: Patterns in the Ecology of Information. Cambridge, Massachusetts: The MIT Press, 2001. Watts, Duncan J. Six Degrees: The Science of a Connected Age. 1st edition. New York: W. W. Norton & Company, February, 2003. Watts, Duncan J. Small Worlds: The Dynamics of Networks between Order and Randomness (Princeton Studies in Complexity). Princeton University Press, November, 2003. Social Networks Gladwell, Malcolm. The Tipping Point: How Little Things Can Make a Big Difference. New York: Back Bay Books, January 7, 2002. Keller, Edward B., et al. The Influentials: One American in Ten Tells the Other Nine How to Vote, Where to Eat, and What to Buy. Free Press, January, 2003.
  • 78. Data Visualization Fawcett-Tang, Robert, and William Owen. Mapping: An Illustrated Guide to Graphic Navigational Systems. Rockport Publishers, 2002. Haggett, Peter, and Richard J. Chorley. Network Analysis in Geography. London: Edward Arnold Publishers, 1969. Herdeg, Walter. Graphis Diagrams. 4th Expanded ed. Zurich, Switzerland: Graphis Press Corp., 1981. Jacobson, Robert, ed. Information Design. 1st ed. Cambridge, Massachusetts: The MIT Press, August 28, 2000. Kitchin, Rob, and Martin Dodge. Atlas of Cyberspace. 1st edition. Pearson Education, January 15, 2002. Lombardi, Mark. Mark Lombardi: Global Networks. New York: Independent Curators Inc., August 1, 2003. Tufte, Edward R. Envisioning Information. Graphics Press, May 1, 1990. Tufte, Edward R. The Visual Display of Quantitative Information. 2nd edition. Graphics Press, May 1, 2001. Woolman, Mark. Digital Information Graphics. Watson-Guptill Publications, 2002. Graph Theory/Representation Arkin, Herbert. Graphs: How to Make and Use Them. Revised edition. New York: Harper & Brothers Publishers, 1940. Bertin, Jacques. Semiology of Graphics: Diagrams, Networks, Maps. Trans. William J. Berg. Madison, Wisconsin: The University of Wisconsin Press, 1983. Chartrand, Gary. Introductory Graph Theory. Unabridged ed. Dover Publications, 1985. Copes, Wayne, et al. Graph Theory: Euler's Rich Legacy. Providence, Rhode Island: Janson Publications, 1987. Ore, Oystein. Graphs and Their Uses. Washington, D.C.: The Mathematical Association of America, 1963. Smith, William Henry. Graphic Statistics in Management. 1st edition. New York: McGraw-Hill Book Company, 1924. Trudeau, Richard J. Introduction to Graph Theory. Dover Publications, 1994.
  • 79. Blogging Powers, Shelley, et al. Essential Blogging. 1st edition. O'Reilly, 2002. Stone, Biz. Blogging: Genius Strategies for Instant Web Content. 1st edition. Pearson Education, 2002. Pattern Analysis Alexander, Christopher. A Pattern Language: Towns, Buildings, Construction. New York: Oxford University Press, 1977. Thesis Chang, Chun Wei. "Saveafriend.Com: A Viral Consumption Network." Master of Fine Arts - Design and Technology, Parsons School of Design, May, 2004. Fry, Benjamin Jotham. "Organic Information Design." Master of Science in Media Arts and Sciences, Massachusetts Institute of Technology, May, 2002. Park, Sunha. "iDwheel: A Cross-Platform Customized Virtual Identity for Individuals and Business People." Master of Fine Arts - Design and Technology, Parsons School of Design, April, 2004. Torres, Paul A. "Visualizing Social Networks: A Social Network Visualization of Groups in the Online Chat Community of Habbo Hotel." Master of Fine Arts - Design and Technology, Parsons School of Design, May, 2004. Scientific Papers Barabasi, Albert-Laszlo, Bonabeau, Eric. Scale-Free Networks. Scientific American, May 2003, 50-59. Barabasi, Albert-Laszlo, Oltvai, Zoltan N. et al. Aggregation of topological motifs in the Escherichia coli transcriptional regulatory network. BMC Bioinformatics 2004, 5, 1-7. Barabasi, Albert-Laszlo, Albert, Reka. Statistical mechanics of complex networks. Reviews of Modern Physics, Volume 74, January 2002. Viégas, Fernanda B., Smith, Marc. Newsgroup Crowds and AuthorLines: Visualizing the Activity of Individuals in Conversational Cyberspaces. Adar, Eytan, Zhang, Li, Adamic, Lada A., Lukose, Rajan M. Implicit Structure and the Dynamics of Blogspace. HP Information Dynamics Lab.
  • 80. Krempel, Lothar. Simple Representations of Complex Networks: Strategies for Visualizing Network Structure. Max-Planck-Institut fuer Gesellschaftsforschung. Dennis, Brian M., Jarrett, Azzari C. NusEye: Visualizing Network Structure to Support Navigation of Aggregated Content. Computer Science Department, Northwestern University. Heer, Jeffrey, Card, Stuart K., Landay, James A. prefuse: a toolkit for interactive information visualization. Borner, Katy, Penumarthy, Shashikant. Social diffusion patterns in three-dimensional virtual worlds. Indiana University, SLIS. Kerr, Bernard J. THREAD ARCS: An Email Thread Visualization. IBM Research. Xiong, Rebecca, Donath, Judith. PeopleGarden: Creating Data Portraits for Users. MIT Media Laboratory. Newman, M. E. J., Girvan, M. Finding and evaluating community structure in networks. Freeman, Linton C. Visualizing Social Networks. University of California, Irvine. Kashtan, N., Itzkovitz, S., Milo, R., Alon, U. Topological Generalizations of network motifs. Departments of Molecular Cell Biology and Physics of Complex Systems, Weizmann Institute of Science, Rehovot, Israel 76100. Itzkovitz, Shalev, Alon, Uri. Subgraphs and network motifs in geometric networks. Departments of Molecular Cell Biology and Physics of Complex Systems, Weizmann Institute of Science, Rehovot, Israel 76100. Newman, M. E. J., Girvan, Michelle. Community structure in social and biological networks. Santa Fe Institute, Santa Fe, NM. Watts, Duncan J., Dodds, Peter S., Newman, M. E. J. Identity and search in social networks. Pastor-Satorras, Romualdo, Vespignani, Alessandro. Epidemic spreading in scale-free networks. Dodge, Martin. Mapping the World-Wide Web. Centre for Advanced Spatial Analysis (CASA), University College London. Others Rogers, Everett M. Diffusion of Innovations. 5th edition. Free Press, 2003. Dawkins, Richard. The Selfish Gene. 2nd edition. Oxford University Press, 1990.
  • 81. Appendix A Summer Research Presentation
  • 82. Discovering Complex Networks. Network Topologies (figure labels: Scale-Free Network, Distributed Network): Internet pioneer Paul Baran's suggestion of three possible network structures for the Internet. He favored the mesh-like structure of what he called a distributed network, since it was less vulnerable to potential attacks. Air / Road network structure: By comparing India's road map and air route map we can easily grasp the structural differences between the two networks. These topologies relate to Paul Baran's schematic of three possible network structures for the Internet. While India's road map, like most road networks, exemplifies a Distributed network model, India's air route map exemplifies Baran's Decentralized model or Barabasi's Scale-free network model.
  • 83. Continental Airlines Air Route: Most destination maps of airline companies offer an interesting case of a scale-free network, where major airports play the role of network hubs due to their large number of links (air connections). These hubs are the elements the network relies on most for its robustness and sustainability. Considering an airline route map as a network, where the nodes are the airports and the links are the air connections between them, we can easily understand how it fits the scale-free model by satisfying all its characteristics, such as growth, preferential attachment (rich get richer), power law distribution and modularity. Gene Disruption Network: Biological meaning of neighborhoods. Copyright European Bioinformatics Institute. Yeast protein interaction network: A map of protein-protein interactions in Saccharomyces cerevisiae. Copyright Macmillan Magazines Ltd. (from the Barabasi article in Nature).
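The growth and preferential-attachment ("rich get richer") characteristics mentioned above for airline route maps can be illustrated with a small simulation. The Python sketch below is a simplified toy model written for this purpose, in which each new node attaches to a single existing node; the function names and parameters are hypothetical and it is not taken from the thesis or from any of the networks pictured.

```python
import random
from collections import Counter


def preferential_attachment(n_nodes, seed=0):
    """Grow a network one node at a time.

    Each new node links to one existing node, chosen with probability
    proportional to that node's current number of links.
    """
    if n_nodes < 2:
        raise ValueError("need at least two nodes")
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Every node appears in this list once per link it has, so a uniform
    # draw from the list is a degree-proportional draw over nodes.
    endpoints = [0, 1]
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend((new, target))
    return edges


def degrees(edges):
    """Number of links per node."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


edges = preferential_attachment(1000, seed=42)
deg = degrees(edges)
# The best-connected nodes play the role of hubs (the "major airports").
print(deg.most_common(5))
```

In runs of this kind a few early nodes typically accumulate far more links than the rest, which is the hub-dominated pattern the airline example describes; the exact numbers depend on the random seed.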
  • 84. The Worm Brain: Eckmann and Moses used a curvature analysis as a test on known biological systems to explore whether it gave meaningful results. In the C. elegans worm they plotted the reciprocal connections between neurons. The height is proportional to curvature. The red nodes are amphid cells, the yellow nodes are other sensory neurons of the head, and the blue nodes are motor neurons of the nerve ring. Only co-links are shown, and triangles are enhanced. Copyright Eckmann/Moses. Spreading Virus in Scale-Free Networks: This image demonstrates how hubs aid the spread of viruses in a scale-free network. Once a hub gets a virus it can pass it on to a very large number of nodes. This particular image is a case of airborne contagion, such as SARS or TB. Copyright OrgNet. Citations Network: This idea was first explored by Sid Redner, from Boston University, who showed that the network of scientific papers, connected by citations, has a power law degree distribution.
  • 85. Biotech Industry Network: This model shows the emergence of the industry network of contractual collaborations from 1988-99 in relation to both firm-level organizational and financial changes. Social Network Analysis: 9/11 Terrorist Network. Copyright OrgNet. Gnutella Network: Snapshot of a local Gnutella peer network in a particular neighbourhood. Copyright Martin Dodge.
  • 86. High-School Dating. Network of Sexual Contacts: Note how the nodes with a high number of sexual partners become bigger and brighter. Here, as in the High School Dating image, we can once again notice the importance of the main hubs in securing a scale-free network topology. If one could hypothetically remove the hubs, all that would remain would be a scattered set of independent clusters with no connections or even weak ties between them. Book Network: During my research I found this interesting example of Information Visualization. What better way to research complex networks than to find the inherent similarities between the books I've been reading, displayed as a network structure? The web shows several titles I've already read and others currently on my wish list. This could be an interesting alternative to Amazon's people-who-bought-this-book-also-bought-these approach. This solution can obviously be improved, but it demonstrates how easy it is to visually grasp the shared links between similar books.
  • 87. Predators and Trophic Species: Most food webs in nature have proved to have a scale-free network topology, where some species have a much larger number of dependencies and interactions than all the others. These screenshots represent food web models from Dr. Joseph Luczkovich's Java application, developed at the Biology Department at East Carolina University. These models are extremely interesting, appealing and functional. One can select the different species and elements to analyze, compare their interactions and zoom extensively in and out of the digitally produced food web. Copyright Dr. Joseph Luczkovich. Cod Food Web: The importance of fully understanding the dynamics of scale-free networks has been recognized by the cod fishery industry in the worst way. The collapse of the Northwest Atlantic cod fishery has become a metaphor for ecological catastrophe and is universally cited as an example of failed management of a natural resource (MacKenzie 1995). Prof. David Lavigne, a zoologist and researcher sponsored by the Natural Sciences and Engineering Research Council and the International Marine Management Association, is a leading force in combating this misunderstanding of food webs. Regarding the cod stock decrease, he claims that seals are being used as scapegoats because government scientists are failing to look at the problem at a macro level, the way any network should be analyzed. The image below is Lavigne's effort to understand the complex map of interactions in a food web. This astonishing work shows the Cod food web, displaying some trophic interactions for part of the Northwest Atlantic. Copyright David Lavigne.
  • 88. Computational Biology - Soft Clustering: A soft clustering of genes in a subset of the compendium data set for S. cerevisiae of Hughes 1999. The lines connect genes or experiments that exhibit strong correlations (red more so than black lines). The placement of the points in the plane is chosen to put correlated points close to each other. The coloring of the points expresses their correlation to the selected point (red in the large cluster). Internet Mapping Project (figure label: Major ISPs): The Internet Mapping Project was started at Bell Labs in the summer of 1998. Its long-term goal is to acquire and save Internet topological data over a long period of time. This data has been used in the study of routing problems and changes, DDoS attacks, and graph theory.
  • 89. The Opte Project: The goal of the Opte Project, started by Barrett Lyon, is to use a single computer and a single Internet connection to map the location of every single class C network on the Internet. CAIDA Internet Graph: CAIDA (Cooperative Association for Internet Data Analysis) provides tools and analyses promoting the engineering and maintenance of a robust, scalable global Internet infrastructure.
  • 90. Appendix B Complex Networks: Visual Explorations
  • 96. © 2005 Manuel Sousa Lima All Rights Reserved