I survey three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package. I also discuss some methods for visualizing large data sets.
AdRoll Tech Talk Presentation:
In this talk I present three approaches to understanding monads:
- How Monads Arise Naturally
- Monads as Implemented in Haskell
- Monads in Category Theory
ggTimeSeries – ggplot2 extensions
This R package offers novel time series visualisations. It is based on ggplot2 and provides geoms and pre-packaged functions for easily creating any of the offered charts. Some examples are listed below.
The package can be installed from GitHub by installing the devtools library and then running: devtools::install_github('Ather-Energy/ggTimeSeries').
reference: https://github.com/Ather-Energy/ggTimeSeries
Symmetry in the interrelation of flatMap/foldMap/traverse and flatten/fold/se... – Philip Schwarz
(download for perfect quality) A simple but nice example of the symmetries that help us reason about functional programs.
Errata: in "foldMapping is mapping and then folding – folding is just flatMapping identity", flatMapping should of course be foldMapping - spotted by Vasco Figueira
download for better quality - Learn about the sequence and traverse functions
through the work of Runar Bjarnason and Paul Chiusano, authors of Functional Programming in Scala https://www.manning.com/books/functional-programming-in-scala
These slides give a very basic introduction to the matplotlib library. Since matplotlib is a widely used and well-known library in machine learning, the deck is well suited to teaching students with no coding background; by the end of the slides they can start plotting on their own.
This presentation takes you on a functional programming journey: it starts from basic Scala language design concepts and leads to the concept of monads, how some of them are designed in Scala, and what purpose they serve.
A, B, C. 1, 2, 3. Iterables you and me - Willian Martins (eBay) – Shift Conference
The Iterable protocol was introduced in 2015, but it never really caught on, and people still have doubts about how it works and how we can leverage it to write better, more expressive code. This talk breaks this fantastic ECMAScript feature down step by step, showing little by little its use cases, its properties, and the new async Iterator protocol – quickly and smoothly, like learning to dance to a fun Jackson 5 soul record. If you are a beginner in JS, you will learn how to build custom iterable objects in a bunch of different ways; and if you already know them, I will challenge you to go the extra mile and experiment with neat tricks like composing iterables or creating a PoC of state/side-effect management based on iterables.
Functional Programming in Scala.
A lot of my examples here come from the book
Functional Programming in Scala by Paul Chiusano and Rúnar Bjarnason. It is a good book; buy it.
This set of slides is based on the presentation I gave at ACM DataScience camp 2014. It is suitable for those who are still new to R. It covers a few basic data manipulation techniques, and then goes into the basics of using the dplyr package (Hadley Wickham). #rstats #dplyr
A survey of data visualization functions and packages in R. In particular, I discuss three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package. I also discuss some methods for visualizing large data sets.
Some R Examples [R table and Graphics] - Advanced Data Visualization in R (Some... – Dr. Volkan OBAN
Some R Examples[R table and Graphics]
Advanced Data Visualization in R (Some Examples)
References:
http://zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/
http://www.cookbook-r.com/
http://moderndata.plot.ly/trisurf-plots-in-r-using-plotly/
I hope that it will be useful for useRs.
I hope it will be useful for everyone interested in the R language.
Volkan OBAN
Implementing virtual machines in Go & C, 2018 redux – Eleanor McHugh
An updated version of my talk on virtual machine cores comparing techniques in C and Go for implementing dispatch loops, stacks & hash maps.
Lots of tested and debugged code is provided as well as references to some useful/interesting books.
An Intro To ES6
with Grant Skinner
OVERVIEW
ECMAScript 6 is the approved and published standard for the next version of JavaScript. It offers new syntax and language features that provide new ways of tackling coding problems, and increase your productivity.
This session will introduce ES6 and delve into many of the new features of the language. It will also cover real-world use, including transpilers, runtimes, and browser support.
OBJECTIVE
Create confidence in evaluating and getting started using ES6.
TARGET AUDIENCE
JavaScript developers.
ASSUMED AUDIENCE KNOWLEDGE
JavaScript.
FOUR THINGS AUDIENCE MEMBERS WILL LEARN
Status of ES6
How to get started with ES6
ES6 feature overview
Practical considerations for adopting ES6
PHP has its own treasure chest of classic mistakes that surprise even the most seasoned experts: code that dies just by changing its namespace, strpos() that fails to find strings, or arrays that change without being touched. Does that get on your nerves too? Let's make a list of them, so we can always teach them to the new guys, spot them during code reviews, and kick them out of our code once and for all. Come on, they don't frighten us, do they?
Connector Corner: Automate dynamic content and events by pushing a button – DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
JMeter webinar - integration with InfluxDB and Grafana – RTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... – Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Epistemic Interaction - tuning interfaces to provide information for AI support – Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... – UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Accelerate your Kubernetes clusters with Varnish Caching – Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... – James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... – DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Securing your Kubernetes cluster: a step-by-step guide to success! – KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
UiPath Test Automation using UiPath Test Suite series, part 4 – DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
45. Linux Apache MySQL R – http://labs.dataspora.com/gameday
50. Contact Us – Michael E. Driscoll, Ph.D., Principal, [email_address], www.dataspora.com
Editor's Notes
“ A Survey of R Graphics” – presented to the LA R Users Group, June 18, 2009. Today I’m going to go through a survey of data visualization functions and packages in R. In particular, I’ll discuss three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package. I’ll also discuss some methods for visualizing large data sets. I’ll end with an overview of Rapache, a tool for embedding R in web applications. For questions beyond this talk, I can be contacted at: Michael E Driscoll http://www.dataspora.com [email_address]
Hal Varian said that "The sexy job in the next ten years will be statisticians…" (in a 2009 interview with McKinsey Quarterly). Data visualization is the fastest means to feed our brains data, because it leverages our highest-bandwidth sensory organ: our eyes. Statistical visualization is sexy both because high-density information plots tickle our brains – we crave information – and because it is hard to do well.
A data visualization is often the final step in a three-step data sense-making process, whereby data is (i) "munged" (e.g. collected, cleansed, and structured), (ii) modeled – relationships in the data are explored and hypotheses tested – and finally (iii) visualized – a particular model of the data is represented graphically. At Facebook, their data engineers are called "data scientists." I like this term because it conveys that working with data involves the scientific method, predicated on making hypotheses and testing them. Ultimately, we are interested in using data to make hypotheses about the world.
Like this one, from Jessica Hagy’s witty blog – this is indexed.com She visualizes a hypothesis that free time and money are related – e.g. that you have the most free time when you’re broke and when you’re rich. I decided to test this hypothesis with data on working hours (its complement = free time) and GDP from 29 OECD countries.
Using R, I decided to test this hypothesis with data for 29 OECD countries: 2006 figures on annual hours worked and GDP per capita. I modeled the relationship with both linear and polynomial regression models – just a few lines of code.
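The modeling step can be sketched in a few lines of base R. (The data frame and its values below are invented for illustration; the talk's actual OECD data is not reproduced in these notes.)

```r
# Hypothetical stand-in for the OECD data:
# annual hours worked vs. GDP per capita (invented values)
oecd <- data.frame(
  hours = c(1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100),
  gdp   = c(70, 52, 45, 40, 36, 33, 30, 28)
)

linear <- lm(gdp ~ hours, data = oecd)           # linear fit
poly2  <- lm(gdp ~ poly(hours, 2), data = oecd)  # quadratic (polynomial) fit
summary(linear)
```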
And using R, I visualized it. Her wealth-free time hypothesis was half-right. The richer you are, the more free time you have (the extreme rightmost point is Luxembourg). But at least for this subset of countries that we examined, the relationship is strictly linear – the poorest OECD countries have the least free time. (In the code shown on the right, I’m using ggplot2 here, not the base graphics plot function in the previous slide. But ggplot2 will automatically do a loess fit for us).
In this section, I describe built-in graphics functions in R, that require no external packages.
First, let's peek under the covers of the R graphics stack. At the top-most level are packages, like "maps", "lattice", and "ggplot2". These packages make calls to a lower-level graphics system, of which R has two – called "graphics" and "grid". According to Nicholas Lewin-Koh, the goal of these graphics systems is to "create coordinates for each graphical object and render them to a device or canvas. In addition the system may manage (i) a stack of graphics objects, (ii) local state information, (iii) redrawing and resizing." Finally, these graphics systems are capable of rendering output to a variety of devices – which, for our purposes, can be considered image formats such as PNG, JPG, and PDF. Devices most commonly include interactive displays – such as those on Windows or Mac OS X – which R sends its output to by default during an interactive session. Grid is the newer system, and both "lattice" and "ggplot2", which I'll discuss later, use grid.
plot() is a "do the right thing" graphics command. plot() is the simplest R command for generating a visualization of an R object. It's an overloaded function that just "does the right thing", yielding a quick view of many R objects that are passed to it. These built-in basic plotting commands are useful if you're just doing quick, exploratory analysis and publication-quality graphs are not what you're looking for.
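A quick sketch of this dispatch behaviour, using the built-in cars data:

```r
# plot() dispatches on the class of its argument
plot(cars)                        # two-column data frame -> scatter plot
plot(factor(c("a", "a", "b")))    # factor -> bar plot of counts
fit <- lm(dist ~ speed, data = cars)
plot(fit)                         # fitted model -> regression diagnostic plots
```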
We can interactively add layers – lines, points, and text – to plots using basic graphics functions. One such example is abline – so named for the a (intercept) and b (slope) parameters it uses to draw the line y = a + bx.
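A minimal sketch of this interactive layering, again on the built-in cars data:

```r
# Build a plot up in layers with base graphics
x <- cars$speed
y <- cars$dist
plot(x, y)                         # base scatter plot
fit <- lm(y ~ x)
abline(fit, col = "red")           # abline can also take a fitted lm object
abline(a = -17, b = 4, lty = 2)    # or an explicit intercept a and slope b
text(10, 80, "stopping distance")  # layer a text label on top
```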
par is a function for setting graphical parameters for base graphics – and, nota bene, these parameters are often shared by the higher level packages I discuss later. Once parameters are defined via par , graphics functions like plot will use these new parameters in subsequent plots. The example above shows the setting of three parameters: pch to set a p lotting ch aracter (21 denotes a filled circle), cex to set size or c haracter ex pansion (1 is default, 5 is bigger) col to set color, which is definable as a name (“blue”), an integer (1-7 for primaries), or an RGB value (as above).
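In code, the three parameters described above look like this:

```r
# Set shared graphical parameters once; subsequent plots pick them up
par(pch = 21,         # plotting character: 21 is a filled circle
    cex = 1.5,        # character expansion: 1.5x the default size
    col = "#0000FF")  # color, here given as an RGB hex value
plot(1:10, (1:10)^2)  # drawn with the parameters set above
```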
Graphics parameters can be set via par(), or passed directly to graphics functions. Above are some more parameters that you can set using par(). For a full list, type help(par) at the R prompt. You can also pass these parameters directly to graphics functions, for example, points(5, 3, pch=19, col="blue"). The chart on the right is an example of a plot painstakingly created with the low-level plotting parameters and functions above. This was done by interactively layering additional text labels and legends on after the initial points were plotted.
Edward Tufte has lauded the value of “small multiples” in information graphics: namely, the incorporation of many small plots in a single graphic. R provides a basic facility for the subdivision of a display device (or ultimately its printed representation) into several panels. This can be achieved by setting the graphics parameter mfrow , which stands for m ultiple f igures plotted row -wise.
With the mfrow parameter, a 2 x 2 matrix of sub-panels – as in the example above – can be set up, and plots will be interactively drawn in these sub-panels. The code above illustrates the creation of four figures in a single graphic, and the result is shown in the next slide. (There is also an mfcol parameter for plotting multiple figures in a column-wise manner.)
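A sketch of such a four-figure layout, using the built-in cars data rather than the slide's original code:

```r
# Four figures on one device, filled row-wise
par(mfrow = c(2, 2))
plot(cars)                   # scatter plot
hist(cars$speed)             # histogram
boxplot(cars$dist)           # boxplot
plot(density(cars$dist))     # density curve
par(mfrow = c(1, 1))         # restore the single-panel layout
```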
Unless a data visualization is of unusually high density, most modern display devices allow for upwards of 16 figures to be suitably resolved on a single device. See the splom() function for automatic creation of such dense graphics.
R graphics devices can present some "gotchas". Normally one need not have any knowledge of the graphics devices that underlie the R graphics system. But in a few cases, it's worth knowing something about them: while typical users can save R graphics on Windows or Mac OS X (via a "Save As" dialog in the graphics window), if one is not using a GUI, exporting graphics requires manually opening a device – with one of several device commands (such as pdf() or png()) – and closing it properly (using dev.off()). Also, when exporting graphics in a non-interactive environment (via a script, for instance), it's critical to invoke the print() function – which will properly write a graphic to the open device. This "print" issue can be a real gotcha for scripts.
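A minimal script-safe export looks like the following sketch. It uses lattice (which ships with R) because lattice and ggplot2 objects are where the print() gotcha bites; the file name is arbitrary.

```r
library(lattice)

png("speed_vs_dist.png")               # manually open a PNG device
p <- xyplot(dist ~ speed, data = cars)
print(p)  # required in a script: without print(), nothing is drawn
dev.off()                              # close the device, flushing output to disk
```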
Okay, now I want you to try and forget everything you just heard about base graphics. ggplot2 is a newer visualization package, formally released in 2009 and developed by Professor Hadley Wickham. It is based on a different perspective on developing graphics, and has its own set of functions and parameters.
the 'gg' in ggplot2 is a reference to a book called The Grammar of Graphics, written by Leland Wilkinson. The book conceives of graphics as compositional – made up of colors, visual shapes, and coordinates, much as sentences are made up of parts of speech.
I've illustrated an incomplete version of Wilkinson's grammar in this slide, to convey how graphics are built up – and out of – their component parts. As such, Wilkinson advocates that graphical tools should leave behind what he deems "chart typologies" – rigid casts of pie charts, bar graphs, or scatter plots into which data is poured. (Excel's chart wizard might be thought of as the Mad Libs of graphics – pre-defined structure, limited degrees of freedom.) Conceived as compositional, a graphical grammar allows for an infinite variety of graphical constructions.
In the upcoming examples, drawn directly from Hadley Wickham's book on ggplot2, we'll visualize data concerning ~50,000 diamonds. We'll start simple and build to more complex graphs by specifying additional elements of the graphical grammar. This data is in the ggplot2 package; more information is available with help(diamonds) (after loading ggplot2). For our purposes, we're concerned with examining relationships between just a few dimensions of this data, namely: carat, cut, clarity, and price.
In ggplot2 , the command to build this plot is qplot() , which stands for “ q uick plot”. We pass qplot() two dimensions of our data (carat and price), and it defaults to a scatter plot representation. Also worth noting is ggplot2’s other visual defaults are quite easy on the eyes – in contrast to most of R’s base graphics. We begin with a basic scatter plot of these 50,000 diamonds. This plot reveals that, not surprisingly, the price of diamonds increases as they get bigger (in terms of carats). Somewhat more interesting is how: we perceive that price seems to increase exponentially (and we test this hypothesis in the next slide).
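The call from the slide, assuming the ggplot2 package is installed (it bundles the diamonds data):

```r
library(ggplot2)
# Quick scatter plot of the diamonds data: price against carat
qplot(carat, price, data = diamonds)
```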
Next, we log-normalize our data, and reveal that, as we suspected, the relationship between a diamond's price and its carat is exponential. It should be noted that we can achieve this transformation in two equivalent ways: (i) we can directly transform our data with the log function, or (ii) we can transform the coordinate scales on which our data is plotted. In ggplot2, this latter approach is achieved by passing the parameter log="xy" to qplot. Because the two normalization approaches rely on different parts of graphical speech – data and scale – this nicely illustrates that, as in language, there is more than one way to express data visually using this grammar of graphics and ggplot2.
Another element of the graphical grammar is the aesthetic appearance of plotting points. Here, we pass a parameter, alpha, which controls the transparency of the points plotted. The parameter's value, I(1/20), indicates that each point should have 1/20th of full intensity: thus 20 overplotted points are required at any given location to achieve full saturation (in this case, to black). (Note: the I function in R inhibits further interpretation of its argument, so here it can be thought of simply as the fraction 1/20.) This method uncovers some interesting distributions in the data that were previously obscured by overplotting. For example, we can detect that points are highly concentrated around specific carat sizes. Contrast this method with our earlier approach to alpha blending with base graphics, which required manually specifying the RGB hex code.
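The alpha-blended version of the plot (again assuming ggplot2 is installed):

```r
library(ggplot2)
# Each point is drawn at 1/20th intensity, so 20 overplotted
# points are needed at one location to reach full black
qplot(carat, price, data = diamonds, alpha = I(1/20))
```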
Here we layer on yet another element of the grammar, color, to show how clearer stones are more expensive. ggplot2 automatically creates a legend for the variable mapped onto color. (Note, Wickham's choice of a default color palette is not accidental – the colors are of equal luminance, so no one dominates the others. For more than you ever want to know about color choice, see http://www.stat.auckland.ac.nz/~ihaka/120/Lectures/lecture13.pdf ).
Now we use another element of the grammar – what is termed 'facets' – to splinter our graphic into a number of subplots along a given dimension. Here we achieve the small multiples that we previously produced using the par function and mfrow parameter. These sorts of sub-divided plots are what the lattice system excels at, as we'll see later. What can we say from this plot? Well, if anything, clear diamonds ("D") seem to get expensive more quickly (a slightly steeper slope as a function of their size) than yellower diamonds.
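The faceted version, one sub-plot per diamond color (assuming ggplot2 is installed; D is the clearest color grade, J the yellowest):

```r
library(ggplot2)
# One facet per diamond color, laid out side by side
qplot(carat, price, data = diamonds, facets = . ~ color)
```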
Let’s take another view of the data. Here we’re interested in how color influences the per-carat cost of a diamond. The boxplot on the left shows that nearly colorless diamonds (color grades ‘D’ and ‘E’) have a greater number of high-priced outliers, but their medians (the center line of each box) are nearly identical to the others. The so-called jitter plot on the right shows the same view of the data, but with all of the points drawn: the points are binned by a categorical variable, diamond color, and “jittered” within each bin to prevent overplotting and to give a sense of the local density at different values along the common y-dimension of price per carat.
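Both views can be produced with one-line qplot calls (a sketch assuming ggplot2 and diamonds):

```r
library(ggplot2)

# Left: distribution of price per carat within each color grade
qplot(color, price / carat, data = diamonds, geom = "boxplot")

# Right: every point drawn, jittered within its color bin;
# alpha blending keeps the dense regions readable
qplot(color, price / carat, data = diamonds, geom = "jitter", alpha = I(1/20))
```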
A display of 50,000 data points. Why not? Our eyes can handle, and I submit, crave these kinds of rich visualizations. They also allow us to detect features of the data (for example, several thin white bands across the bottom of the bars – perhaps preferred price/carat combinations?) that may be missing from more simplified views.
lattice is an alternative high-level graphics package for R. Like ggplot2 it is built on the grid graphics system.
lattice is named in honor of its predecessor, trellis , which was a visualization library developed for the S language by William Cleveland. trellis was so named because of how it visualizes higher dimensions of data: it splinters these dimensions across space, producing a grid of small multiples that resemble a trellis. In the next series of slides I show how we can use lattice to visualize up to six dimensions of data in a single plot.
To demonstrate lattice’s multivariate visualizing abilities, we’ll use a fascinating data set called MLB Gameday. Since 2007, Major League Baseball has tracked the path and velocity of > 1 million pitches thrown. Sample data is here: http://gd2.mlb.com/components/game/mlb/year_2008/month_03/day_30/gid_2008_03_30_atlmlb_wasmlb_1/pbp/pitchers/400010.xml
With just two dimensions of data to describe – the x and y location in the strike zone – we can use lattice’s xyplot function. Unlike ggplot2, the first argument we pass to lattice’s plotting functions (of which xyplot is just one) is a formula that describes the relationship in the data to be plotted. In this case, “x ~ y” can be read as “x depends on y”. Note the visual defaults: not as easy on the eyes as ggplot2 (which has a lower-contrast gray background), but an improvement on R’s base graphics plots.
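A minimal xyplot() call on synthetic stand-in data (the column names px and pz are hypothetical stand-ins for the Gameday location fields; lattice ships with R as a recommended package):

```r
library(lattice)

# Synthetic stand-in for pitch locations
set.seed(42)
pitches <- data.frame(px = rnorm(200),               # horizontal location
                      pz = rnorm(200, mean = 2.5))   # height above the plate

# Formula interface: the left-hand side goes on the vertical axis,
# the right-hand side on the horizontal axis
xyplot(pz ~ px, data = pitches)
```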
In this plot, I’ve layered a third dimension, pitch type, into our plot by using lattice’s “groups” parameter, which uses a different plotting symbol for each type, and includes a legend across the top. Alas, this is not a particularly informative chart. The symbols are overplotted on top of each other: trends among the pitch types are hard to discern. With lattice, we can use yet another approach.
Now we’re doing what lattice does best – splintering a dimension, in this case pitch type, into space. We do this by using R’s conditioning operator, |, in the formula we pass to lattice (the formula “x ~ y | type” can be read as “x depends on y, conditioned on type”).
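The two approaches – groups = versus conditioning with | – side by side, on hypothetical pitch data (column names and pitch-type codes are made up for illustration):

```r
library(lattice)

set.seed(1)
pitches <- data.frame(
  px   = rnorm(300),
  pz   = rnorm(300, mean = 2.5),
  type = sample(c("FF", "CU", "CH"), 300, replace = TRUE)  # pitch type
)

# groups = : one panel, a different symbol per type, legend on top
xyplot(pz ~ px, data = pitches, groups = type, auto.key = TRUE)

# | (conditioning): one panel per type -- lattice's specialty
xyplot(pz ~ px | type, data = pitches)
```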
Now we include a fourth dimension in our plot – pitch speed – by using color. The speed to color mapping is relatively intuitive (seen in upper right), red is fast, blue is slow. How we achieve this is not particularly simple: we must use what lattice deems “panel functions”, which allow us to extend the default appearance of the chart.
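A sketch of the panel-function mechanism, mapping a hypothetical speed column onto a simple blue-to-red ramp (the plots in the talk used hand-designed HCL palettes, not this ramp):

```r
library(lattice)

set.seed(1)
pitches <- data.frame(px    = rnorm(300),
                      pz    = rnorm(300, mean = 2.5),
                      speed = runif(300, 70, 100))  # mph

# A custom panel function replaces the default point-drawing code.
# (With a single panel, point order matches the full data frame.)
xyplot(pz ~ px, data = pitches,
       panel = function(x, y, ...) {
         pal <- colorRampPalette(c("blue", "red"))(100)   # slow -> fast
         idx <- cut(pitches$speed, 100, labels = FALSE)   # speed -> bin 1..100
         panel.xyplot(x, y, col = pal[idx], pch = 16)
       })
```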
Finally we add a fifth dimension, local density, to our plots using a two-dimensional color palette, where speed is related to chroma, and local density to luminance. This is an attempt to control for some overplotting that might otherwise occur when we shrink these pitch plots down in size.
Now we can compare two different pitchers – the sixth dimension – in a single graphic. The six dimensions of data we visualized with lattice are thus:
1. & 2. x and y location of the pitch
3. pitch type
4. pitch speed
5. pitch density (more pitches yield darker luminosity without changing hue)
6. pitcher (Cole or Hamels)
As mentioned, the lattice package provides several other graphics functions besides xyplot. Some are listed here, with the densityplot() function highlighted at the bottom. It is a particularly useful alternative to the standard histogram, which can suffer from binning artifacts.
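A quick densityplot() example on a bimodal sample, where a coarse histogram could hide the two modes:

```r
library(lattice)

set.seed(7)
x <- c(rnorm(500), rnorm(500, mean = 4))  # two overlapping populations

# A kernel density estimate: smooth, with no bin-boundary artifacts
densityplot(~ x, plot.points = FALSE)
```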
In this section I mention a couple of techniques for handling large data sets.
This is bad for two reasons: (1) overplotting obscures the data, even when alpha blending is used; (2) it’s highly inefficient, both on screen and especially if saved as a vector graphic (huge PDFs). Two solutions: (i) resort to sampling, or (ii) map the density of points onto some other attribute, such as color – hexbinplot and geneplotter do just this.
hexbinplot() is a graphics function (in the package of the same name, hexbin) that divides a scatter plot area into hexagons, counts the occurrences within each hexagonal area, and maps these counts to a color scale. The result is a plot, as shown, where the graphics device need only draw as many points as there are hexagons. In the case of the diamonds data, rather than 50,000 points being graphed, just ~2,000 hexagons are. This also reveals some of the clumpiness in the data, though not as well as ggplot2’s alpha-blended scatterplots.
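The count-then-color idea behind hexbinplot() can be sketched in base R with rectangular bins (hexbin does the same with hexagons, which tile the plane without the visual bias of a square grid; the x/y data below are synthetic stand-ins for 50,000 price/carat pairs):

```r
set.seed(1)
x <- rexp(50000)
y <- x * exp(rnorm(50000, sd = 0.2))

# Bin each axis into 50 intervals and count the points per cell
counts <- table(cut(x, breaks = 50), cut(y, breaks = 50))

# Draw one colored cell per bin (at most 2,500) instead of 50,000 points
image(log1p(unclass(counts)))
```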
This is an Affymetrix gene chip, with 100,000 data points. On the right we have the output of a typical microarray assay: the colors correspond to RNA expression levels. With R, I can distill these 100,000 data points down to a simple model – and visualize it.
The data visualization on the right, called an M-A plot, is a variation of an XY scatter plot in which we compare the observed signals for a particular microarray to a composite background distribution, with both ordered by signal intensity. Deviations from the straight line show differences between our array and the background (in this case, our array tends to have higher signals across the board). Typically we generate an M-A plot for every array in our compendium to yield a big-picture view of the consistency of our arrays across experiments – the flatter the red lines, the better (remember that in most models of cellular behavior we expect only a small fraction of genes to change in expression).
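For concreteness, the M and A coordinates of such a plot are just a rotation of two log-intensity vectors (the probe values below are made up for illustration):

```r
# Hypothetical probe intensities: R = our array, G = composite background
R <- c(120, 450, 80, 2000)
G <- c(100, 400, 100, 1800)

A <- (log2(R) + log2(G)) / 2  # average log2 intensity (x-axis)
M <- log2(R / G)              # log2 ratio (y-axis)

# M near 0 means the array agrees with the background at that intensity;
# here most probes sit slightly above 0, i.e., higher signals overall.
```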
Ross Ihaka’s Colorspace package provides access to useful colorspaces beyond RGB, like LAB and HSV. These colorspaces are preferred by artists and designers for their more intuitive properties. This is the package I used to design the palettes in the pitching plots shown earlier. For my opinionated comments on using color in data visualizations, visit: http://dataspora.com/blog/how-to-color-multivariate-data/
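Base R’s hcl() function exposes the same perceptually based HCL space; varying hue while holding chroma and luminance fixed yields the kind of equal-luminance palette discussed earlier:

```r
# Six hues, equally spaced, at fixed chroma (50) and luminance (70):
# no color in the resulting palette is visually "louder" than the others.
pal <- hcl(h = seq(0, 300, by = 60), c = 50, l = 70)
pal  # six "#RRGGBB" hex strings
```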
Before we end, some thoughts on how R can be used as a visualization engine on the web.
So I’ve turned this pitch visualization into a web application, using rapache. I can do this because R is open source, without licensing restrictions. The data and the processing can both live on the server – important when your data set is huge (this one is around 20 gigabytes). And when the data changes, the dashboard updates. No local software installation is needed, and updates are instantly available to all web users. It can be part of an open source web-analytics stack, with a catchy name – LAMR. If you can think of something less lame, let me know.
Why Embed R into a Web-based Architecture? Immediately access the many benefits of a web architecture that is: * Stateless/Scalable – URL requests can be distributed across one or many servers * Cacheable - common requests made to the R server can be cached by Apache * Secure - we can piggyback on existing HTTPS architecture for analysis of sensitive data
rapache: Embedding R within the Apache Server Our tool of choice is rapache, developed by Jeff Horner at Vanderbilt University. http://biostat.mc.vanderbilt.edu/rapache/
Naturally this just scratches the surface of what rapache can do. An alternative to printing HTML directly is to use a templating system, similar to PHP. This is available via the R package brew (also developed by Jeffrey Horner), downloadable from CRAN and at: http://www.rforge.net/brew/
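A brew template interleaves R code with markup using PHP-style delimiters – <% %> for code blocks and <%= %> for inlined values – and is rendered with brew::brew(). The file below is a hypothetical sketch (the file name and pitcher labels are made up):

```
<!-- pitch-report.html.brew -->
<html><body>
  <h1>Pitch Report</h1>
  <p>Generated: <%= format(Sys.time()) %></p>
  <% for (p in c("Pitcher A", "Pitcher B")) { %>
    <h2><%= p %></h2>
  <% } %>
</body></html>
```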
The ggplot2 and lattice books are both published by Springer (ggplot2 as of July 2009), available via Amazon. example code and figures from ggplot2 book http://had.co.nz/ggplot2 example code and figures from lattice book http://lmdvr.r-forge.r-project.org/
Michael E. Driscoll is Principal and Founder of Dataspora LLC. He has a decade of experience developing large-scale databases and data mining algorithms within industry, government, and academic institutions. He founded and until 2008 served on the board of CustomInk.com, an Inc. 500 online retailer. Michael has a Ph.D. in Bioinformatics from Boston University and an A.B. from Harvard College.