A survey of data visualization functions and packages in R, covering three approaches: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package. I also discuss some methods for visualizing large data sets.
12. par() sets graphical parameters. Parameters for par(): pch, col, adj, srt, pt.cex. Graphing functions: points(), text(), xlab(), legend().
13. Paneling Graphics. By setting one parameter in particular, mfrow, we can partition the graphics display to give us a multiple framework in which to panel our plots, row-wise: par(mfrow = c(nrow, ncol)), where nrow is the number of rows and ncol is the number of columns.
16. Working with Graphics Devices. Starting up a new graphics X11 window: x11(). To write graphics to a file, open a device, write to it, and close it:
pdf("mygraphic.pdf", width = 7, height = 7)
plot(x)
dev.off()
On Linux, the package "Cairo" is recommended for a device that renders high-quality vector and raster images (alpha blending!); the command would read Cairo("mygraphic.pdf", ...). Common gotcha: in non-interactive sessions, you should explicitly invoke a print command to send a plot object to an open device, for example print(plot(x)).
49. Data Visualization References.
ggplot2: Elegant Graphics for Data Analysis, by Hadley Wickham. http://had.co.nz/ggplot2
Lattice: Multivariate Data Visualization with R, by Deepayan Sarkar. http://lmdvr.r-forge.r-project.org/
Editor's Notes
"A Survey of R Graphics" – presented to the LA R Users Group, June 18, 2009. Today I'm going to go through a survey of data visualization functions and packages in R. In particular, I'll discuss three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package. I'll also discuss some methods for visualizing large data sets. I'll end with an overview of Rapache, a tool for embedding R in web applications. For questions beyond this talk, I can be contacted at: Michael E. Driscoll, http://www.dataspora.com, mike@dataspora.com.
Hal Varian said that "The sexy job in the next ten years will be statisticians…" (in a 2009 interview with McKinsey Quarterly). Data visualization is the fastest means of feeding our brains data, because it leverages our highest-bandwidth sensory organ: our eyes. Statistical visualization is sexy both because high-density information plots tickle our brains – we crave information – and because it is hard to do well.
A data visualization is often the final step in a three-step data sense-making process, whereby data is (i) "munged" (collected, cleansed, and structured), (ii) modeled (relationships in the data are explored and hypotheses tested), and finally (iii) visualized (a particular model of the data is represented graphically). At Facebook, their data engineers are called "data scientists." I like this term because it conveys that working with data involves the scientific method, predicated on making hypotheses and testing them. Ultimately, we are interested in using data to make hypotheses about the world.
Like this one, from Jessica Hagy's witty blog, Indexed (thisisindexed.com). She visualizes a hypothesis that free time and money are related – e.g. that you have the most free time when you're broke and when you're rich. I decided to test this hypothesis with data on working hours (its complement = free time) and GDP from 29 OECD countries.
Using R, I decided to test this hypothesis, with data for 29 countries in the OECD: 2006 figures on annual hours worked and GDP per capita. I modeled the relationship with both linear and polynomial regression models – just a few lines of code.
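To make "just a few lines of code" concrete, here is a minimal sketch of the two fits. The data frame name oecd and its columns gdp_per_capita and hours_worked are hypothetical stand-ins for the actual data used in the talk:

# Fit a straight line and a quadratic to the same (hypothetical) data:
fit.linear <- lm(hours_worked ~ gdp_per_capita, data = oecd)
fit.poly   <- lm(hours_worked ~ poly(gdp_per_capita, 2), data = oecd)

summary(fit.linear)          # coefficients and R-squared for the linear fit
anova(fit.linear, fit.poly)  # does the quadratic term improve the fit?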
And using R, I visualized it. The hypothesis about wealth and free time was half-right. Here's the result – for OECD countries, Jessica was partially right: the richer you are, the more free time you have (the extreme rightmost point is Luxembourg). But at least for the subset of countries that we examined, the relationship is strictly linear – the poorest OECD countries have the least free time. (In the code shown on the right, I'm using ggplot2, not the base graphics plot function from the previous slide; ggplot2 will automatically do a loess fit for us.)
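The plotting call itself is short. A sketch of the kind of ggplot2 call involved, again using the hypothetical oecd data frame from above (geom_smooth() adds a loess fit and its confidence band by default for small data sets):

library(ggplot2)

# Scatter plot of free time vs. wealth, with an automatic smooth fit:
qplot(gdp_per_capita, hours_worked, data = oecd) +
  geom_smooth()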
In this section, I describe the built-in graphics functions in R that require no external packages.
First, a peek under the covers of the R graphics stack. At the top-most level are packages, like "maps", "lattice", and "ggplot2". These packages make calls to a lower-level graphics system, of which R has two – called "graphics" and "grid". According to Nicholas Lewin-Koh, the goal of these graphics systems is to "create coordinates for each graphical object and render them to a device or canvas. In addition the system may manage (i) a stack of graphics objects, (ii) local state information, (iii) redrawing and resizing." Finally, these graphics systems render output to a variety of devices – which, for our purposes, can be considered image formats such as PNG, JPG, and PDF. Devices most commonly include interactive displays – such as those of Windows or Mac OS X – to which R sends its output by default during an interactive session. Grid is the newer system, and both "lattice" and "ggplot2", which I'll discuss later, use Grid.
plot() is a "do the right thing" graphics command. plot() is the simplest R command for generating a visualization of an R object. It's an overloaded function that just "does the right thing", yielding a quick view of many of the R objects passed to it. These built-in basic plotting commands are useful if you're just doing quick, exploratory analysis and publication-quality graphs are not what you're looking for.
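To make the overloading concrete, here is a small sketch of plot() dispatching on different classes of object, using only built-in data sets; each call produces a sensible default graphic with no further instruction:

# plot() chooses a representation based on the class of its argument:
plot(faithful)                        # two-column data frame -> scatter plot
plot(density(faithful$eruptions))     # density object -> the estimated curve
plot(lm(dist ~ speed, data = cars))   # lm fit -> regression diagnostic plots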
We can interactively add layers – lines, points, and text – to plots using basic graphics functions. One such example is abline, so named for the a (intercept) and b (slope) parameters it uses to draw the line y = a + bx.
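A minimal sketch of this layering, using the built-in cars data set (the intercept and slope passed to abline are illustrative round numbers, close to the least-squares fit):

plot(cars$speed, cars$dist)                      # base scatter plot
abline(a = -17.6, b = 3.9, col = "blue")         # line with intercept a, slope b
abline(lm(dist ~ speed, data = cars), lty = 2)   # abline also accepts an lm fit
text(10, 80, "stopping distance vs. speed")      # layer on a text label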
par is a function for setting graphical parameters for base graphics – and, nota bene, these parameters are often shared by the higher-level packages I discuss later. Once parameters are defined via par, graphics functions like plot will use these new parameters in subsequent plots. The example above shows the setting of three parameters: pch, to set a plotting character (21 denotes a filled circle); cex, to set size or character expansion (1 is the default, 5 is bigger); and col, to set color, which is definable as a name ("blue"), an integer (1-7 for primaries), or an RGB value (as above).
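A brief sketch of the global-setting pattern – par() returns the previous values, which makes it easy to restore them afterwards:

op <- par(pch = 21, cex = 1.5, col = "blue")  # set new defaults, save old ones
plot(rnorm(20))                               # drawn with the new parameters
par(op)                                       # restore the previous settings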
Graphics parameters can be set via par(), or passed directly to graphics functions. Above are some more parameters that you can set using par(). For a full list, type help(par) at the R prompt. You can also pass these parameters directly to graphics functions, for example points(5, 3, pch = 19, col = "blue"). The chart on the right is an example of a plot painstakingly created with the low-level plotting parameters and functions above. This was done by interactively layering additional text labels and legends on after the initial points were plotted.
Edward Tufte has lauded the value of “small multiples” in information graphics: namely, the incorporation of many small plots in a single graphic.R provides a basic facility for the subdivision of a display device (or ultimately its printed representation) into several panels. This can be achieved by setting the graphics parameter mfrow, which stands for multiple figures plotted row-wise.
With the mfrow parameter, a 2 x 2 matrix of sub-panels – as in the example above – can be set up, and plots will be interactively drawn in these sub-panels. The code above illustrates the creation of four figures in a single graphic, and the result is shown in the next slide. (There is also an mfcol parameter for plotting multiple figures in a column-wise manner.)
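A sketch of a 2 x 2 panel along these lines, using the built-in faithful data set; the four plots fill the grid row by row:

par(mfrow = c(2, 2))                        # 2 rows x 2 columns of panels
hist(faithful$eruptions)                    # panel 1
hist(faithful$waiting)                      # panel 2
plot(faithful$waiting, faithful$eruptions)  # panel 3
boxplot(faithful$eruptions)                 # panel 4
par(mfrow = c(1, 1))                        # reset to a single panel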
Unless a data visualization is of unusually high density, most modern display devices allow upwards of 16 figures to be suitably resolved on a single device. See the splom() function (from the lattice package) for automatic creation of such dense graphics.
R graphics devices can present some “gotchas”. Normally one need not have any knowledge of the graphics devices that underlie the R graphics system. But in a few cases it’s worth knowing something about them: while typical users can save R graphics in the Windows or Mac OS X GUIs (via a “Save As” dialog in the graphics window), if one is not using a GUI, exporting graphics requires manually opening a device – with one of several device commands (such as pdf() or png()) – and closing it properly (using dev.off()). Also, when exporting graphics in a non-interactive environment (via a script, for instance), it’s critical to invoke the print() function, which will properly write a graphic to the open device. This “print” issue can be a real gotcha for scripts, as sketched below.
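A minimal sketch of the non-interactive workflow (the file name is arbitrary; lattice is used here because grid-based plots are where the print() gotcha bites):

    pdf("myplot.pdf", width = 6, height = 4)   # open a file-based device (or png(), etc.)
    library(lattice)
    p <- xyplot(dist ~ speed, data = cars)
    print(p)                                   # in a script, the plot must be print()ed explicitly
    dev.off()                                  # close the device to flush output to the file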
Okay, now I want you to try and forget everything you just heard about base graphics. ggplot2 is a newer visualization package, formally released in 2009 and developed by Professor Hadley Wickham. It is based on a different perspective on developing graphics, and has its own set of functions and parameters.
The ‘gg’ in ggplot2 is a reference to a book called The Grammar of Graphics, written by Leland Wilkinson. The book conceives of graphics as compositional – made up of colors, visual shapes, and coordinates, much as sentences are made up of parts of speech.
I’ve illustrated an incomplete version of Wilkinson’s grammar in this slide, to convey how graphics are built up – and out of – their component parts. As such, Wilkinson advocates that graphical tools should leave behind what he deems “chart typologies” – rigid casts of pie charts, bar graphs, or scatter plots into which data is poured. (These programs might be thought of as the Mad Libs analogs of graphics – with pre-defined structure and limited degrees of freedom.) Conceived as compositional, a graphical grammar allows for an infinite variety of graphical constructions.
In the upcoming examples, drawn directly from Hadley Wickham’s book on ggplot2, we’ll visualize data concerning ~50,000 diamonds. We’ll start simple and build to more complex graphs by specifying additional elements of the graphical grammar. This data ships with the ggplot2 package; more information is available with help(diamonds) (after loading ggplot2). For our purposes, we’re concerned with examining relationships among just four dimensions of this data, namely: carat, cut, clarity, and price.
We begin with a basic scatter plot of these 50,000 diamonds. In ggplot2, the command to build this plot is qplot(), which stands for “quick plot”. We pass qplot() two dimensions of our data (carat and price), and it defaults to a scatter plot representation. Also worth noting: ggplot2’s visual defaults are quite easy on the eyes, in contrast to most of R’s base graphics. This plot reveals that, not surprisingly, the price of diamonds increases as they get bigger (in terms of carats). Somewhat more interesting is how: price seems to increase exponentially, a hypothesis we test in the next slide.
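The call itself is a one-liner (the diamonds data ships with ggplot2):

    library(ggplot2)
    qplot(carat, price, data = diamonds)   # defaults to a scatter plot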
Next, we log-transform our data, revealing that, as we suspected, the relationship between a diamond’s price and its carat is exponential. It should be noted that we can achieve this transformation in two equivalent ways: (i) we can directly transform our data with the log function, or (ii) we can transform the coordinate scales on which our data is plotted. In ggplot2, this latter approach is achieved by passing the parameter log = “xy” to qplot. Because these two approaches rely on different parts of graphical speech – data and scale – this nicely illustrates that, as in language, there is more than one way to express data visually using this grammar of graphics and ggplot2.
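Both routes, side by side:

    library(ggplot2)
    qplot(log10(carat), log10(price), data = diamonds)  # (i) transform the data itself
    qplot(carat, price, data = diamonds, log = "xy")    # (ii) transform the scales instead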
Another element of the graphical grammar is the aesthetic appearance of plotted points. Here, we pass a parameter, alpha, which controls the transparency of the points plotted. The parameter’s value, I(1/20), indicates that each point should have 1/20th of full intensity: thus 20 overplotted points are required at any given location to achieve full saturation (in this case, to black). (Note: the I() function in R inhibits further interpretation of its argument, so I(1/20) can be thought of as simply the fraction 1/20.) This method uncovers some interesting distributions in the data that were previously obscured by overplotting. For example, we can detect that points are highly concentrated around specific carat sizes. Contrast this method with our earlier approach to alpha blending with base graphics, which required manually specifying an RGB hex code.
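In code:

    library(ggplot2)
    qplot(carat, price, data = diamonds, alpha = I(1/20))  # 20 overlaps reach full saturation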
Here we layer on yet another element of the grammar, color, to show how clearer stones are more expensive. ggplot2 automatically creates a legend for the mapping of a variable onto color. (Note: Wickham’s choice of a default color palette is not accidental – the colors are of equal luminance, so no one color dominates the others. For more than you ever wanted to know about color choice, see http://www.stat.auckland.ac.nz/~ihaka/120/Lectures/lecture13.pdf).
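A sketch of the call; mapping clarity onto color is my assumption, consistent with the slide’s point about clearer stones:

    library(ggplot2)
    qplot(carat, price, data = diamonds, colour = clarity)  # legend is generated automatically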
Now we use another element of the grammar – what are termed ‘facets’ – to splinter our graphic into a number of subplots along a given dimension. Here we achieve the small multiples that we previously produced using the par function and mfrow parameter. These sorts of sub-divided plots are what the lattice system excels at, as we’ll see later. What can we say from this plot? Well, if anything, clear colored diamonds (“D”) seem to get expensive more quickly (a slightly steeper slope as a function of their size) than yellower diamonds.
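A sketch; faceting on the color variable is my assumption, given the D-versus-yellow comparison:

    library(ggplot2)
    qplot(carat, price, data = diamonds, facets = . ~ color)  # one panel per color grade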
Let’s take another view of the data. Here we’re interested in seeing how color influences the per-carat cost of a diamond. The boxplot on the left shows that nearly clear diamonds (color categories ‘D’ and ‘E’) have a greater number of high-priced outliers, but their median (the center line of each box) is nearly identical to the others’. The so-called jitter plot on the right shows this same view of the data, but all of the points are shown: the points are plotted into bins according to a categorical variable, diamond color, and “jittered” within each bin to prevent overplotting, allowing a sense of the local density at different values along the common y-dimension of price/carat.
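Sketches of both views:

    library(ggplot2)
    qplot(color, price / carat, data = diamonds, geom = "boxplot")                 # boxplot view
    qplot(color, price / carat, data = diamonds, geom = "jitter", alpha = I(1/20)) # jitter view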
A display of 50,000 data points. Why not? Our eyes can handle – and, I submit, crave – these kinds of rich visualizations. This also allows us to detect features of the data (for example, several thin white bands across the bottom of the bars – perhaps preferred price/carat combinations?) that may be missing from more simplified views of the data.
lattice is an alternative high-level graphics package for R. Like ggplot2, it is built on the grid graphics system.
lattice is named in honor of its predecessor, trellis, which was a visualization library developed for the S language by William Cleveland. trellis was so named because of how it visualizes higher dimensions of data: it splinters these dimensions across space, producing a grid of small multiples that resemble a trellis. In the next series of slides I show how we can use lattice to visualize up to six dimensions of data in a single plot.
To demonstrate lattice’s multivariate visualizing abilities, we’ll use a fascinating data set called MLB Gameday. Since 2007, Major League Baseball has tracked the path and velocity of > 1 million pitches thrown. Sample data is here: http://gd2.mlb.com/components/game/mlb/year_2008/month_03/day_30/gid_2008_03_30_atlmlb_wasmlb_1/pbp/pitchers/400010.xml
With just two dimensions of data to describe – the x and y location in the strike zone – we can use lattice’s xyplot function. Unlike ggplot2, the first argument we pass to lattice’s plotting functions (of which xyplot is just one) is a formula that describes a relationship in the data to be plotted. In this case, “x ~ y” can be read as “x depends on y”. Note the visual defaults: not as easy on the eyes as ggplot2 (which has a lower-contrast gray background), but an improvement on R’s base graphics plots.
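A minimal sketch with hypothetical Gameday-style data (column names and values are invented for illustration):

    library(lattice)
    pitches <- data.frame(x = rnorm(500), y = rnorm(500, mean = 2.5))  # hypothetical locations
    xyplot(x ~ y, data = pitches)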
In this plot, I’ve layered a third dimension, pitch type, into our plot using lattice’s “groups” parameter, which uses a different plotting symbol for each type and includes a legend across the top. Alas, this is not a particularly informative chart. The symbols are overplotted on top of each other, and trends among the pitch types are hard to discern. With lattice, we can use yet another approach.
Now we’re doing what lattice does best – splintering a dimension, in this case pitch type, into space. We do this by using R’s conditioning operator in the formula we pass to lattice (the formula “x ~ y | type” can be read as “x depends on y, conditioned on type”).
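Continuing the hypothetical pitch data from the earlier sketch, now with an invented type column to condition on:

    library(lattice)
    pitches <- data.frame(x = rnorm(500), y = rnorm(500, 2.5),
                          type = sample(c("FF", "CU", "CH"), 500, replace = TRUE))
    xyplot(x ~ y | type, data = pitches)   # one panel per pitch type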
Now we include a fourth dimension in our plot – pitch speed – by using color. The speed-to-color mapping is relatively intuitive (seen in the upper right): red is fast, blue is slow. How we achieve this is not particularly simple: we must use what lattice calls “panel functions”, which allow us to extend the default appearance of the chart.
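A sketch of the panel-function approach, again on hypothetical data; the blue-to-red ramp is my stand-in for the slide’s palette:

    library(lattice)
    pitches <- data.frame(x = rnorm(500), y = rnorm(500, 2.5),
                          type = sample(c("FF", "CU", "CH"), 500, replace = TRUE),
                          speed = runif(500, 70, 100))        # hypothetical speeds (mph)
    ramp <- colorRampPalette(c("blue", "red"))(101)           # slow -> fast
    rng <- range(pitches$speed)                               # global range, so colors compare across panels
    xyplot(x ~ y | type, data = pitches,
           panel = function(x, y, subscripts, ...) {
             s <- pitches$speed[subscripts]                   # speeds for this panel
             idx <- round(100 * (s - rng[1]) / diff(rng)) + 1
             panel.xyplot(x, y, col = ramp[idx], pch = 16)    # color encodes speed
           })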
Finally we add a fifth dimension, local density, to our plots using a two-dimensional color palette, where speed is related to chroma, and local density to luminance. This is an attempt to control for some overplotting that might otherwise occur when we shrink these pitch plots down in size.
Now we can compare two different pitchers – the sixth dimension – in a single graphic. The six dimensions of data we visualized with lattice are thus:
1. and 2. the x and y location of the pitch
3. pitch type
4. pitch speed
5. pitch density (many pitches darken luminosity without changing hue)
6. pitcher (Cole or Hamels)
As mentioned, the lattice package provides several other graphics functions besides xyplot. Some are listed above, and the densityplot() function is highlighted at the bottom. This is a particularly useful alternative to standard histograms, which can suffer from binning artifacts.
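For instance, a one-liner on a built-in data set:

    library(lattice)
    densityplot(~ Sepal.Length, data = iris, groups = Species,
                plot.points = FALSE, auto.key = TRUE)   # smooth curves, no binning artifacts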
In this section I mention a couple of techniques for handling large data sets.
Plotting every point of a large data set directly is bad for two reasons: (1) overplotting obscures the data, even when alpha blending is used; and (2) it’s highly inefficient, both on screen and especially if saved as a vector graphic (huge PDFs). There are two solutions: resort to sampling, or map the density of points onto some other attribute, such as color. hexbinplot and geneplotter do just this.
hexbinplot() is a graphics function (in the hexbin package) that divides a scatter plot area into hexagons, counts occurrences within each of these hexagonal areas, and maps these counts to a color scale. The result is a plot, as shown, where the graphics device need only draw as many points as there are hexagons. In the case of the diamonds data, rather than 50,000 points being graphed, just ~2,000 hexagons are. This also reveals some of the clumpiness in the data, though not as well as ggplot2’s alpha-blended scatter plots.
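The call, sketched against the diamonds data (xbins controls the hexagon resolution):

    library(hexbin)    # provides hexbinplot()
    library(ggplot2)   # loaded only for the diamonds data
    hexbinplot(price ~ carat, data = diamonds, xbins = 50)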
This is an Affymetrix gene chip, with 100,000 data points.On the right we have the output of a typical microarray assay: the colors correspond to RNA expression levels.With R, I can distill these 100,000 data points down to a simple model – and visualize it.
The data visualization on the right, called an M-A plot, is a variation of an XY scatter plot in which we compare the observed signals for a particular microarray to a composite background distribution, with both ordered by signal intensity. Deviations from the straight line show differences between our array and the background (in this case, our array tends to have higher signals across the board). Typically we generate an M-A plot for every array in our compendium to yield a big-picture view of the consistency of our arrays across experiments – the flatter the red lines, the better (remember that in most models of cellular behavior we expect only a small fraction of genes to change in expression).
Ross Ihaka’s colorspace package provides access to useful color spaces beyond RGB, like LAB and HSV. These color spaces are preferred by artists and designers for their more intuitive properties. This is the package I used to design the palettes in the pitching plots shown earlier. For my opinionated comments on using color in data visualizations, visit: http://dataspora.com/blog/how-to-color-multivariate-data/
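A small sketch of generating an equal-luminance qualitative palette with colorspace (rainbow_hcl is one of its palette functions):

    library(colorspace)
    pal <- rainbow_hcl(4)          # four hues, equal chroma and luminance
    barplot(rep(1, 4), col = pal)  # eyeball the palette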
Before we end, some thoughts on how R can be used as a visualization engine on the web.
So I’ve pushed this pitch visualization application into a web app, using rapache. I can do this because R is open source, without licensing restrictions. Data and processing can both live on the server – important when your data set is huge (this one is around 20 gigabytes) – and when the data changes, the dashboard updates. No local software installation is needed, and updates are instantly available to all web users. It can be part of an open source web-analytics stack with a catchy name: LAMR. If you can think of something less lame, let me know.
Why Embed R into a Web-based Architecture? Immediately access the many benefits of a web architecture that is:
* Stateless/Scalable – URL requests can be distributed across one or many servers
* Cacheable – common requests made to the R server can be cached by Apache
* Secure – we can piggyback on existing HTTPS infrastructure for analysis of sensitive data
rapache: Embedding R within the Apache Server. Our tool of choice is rapache, developed by Jeff Horner at Vanderbilt University. http://biostat.mc.vanderbilt.edu/rapache/
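A minimal handler sketch, assuming rapache’s setContentType()/sendBin() request API (the plotted data is a placeholder):

    ## rapache sources a script like this when the mapped URL is requested
    setContentType("image/png")                  # tell the browser a PNG is coming
    tmp <- tempfile(fileext = ".png")
    png(tmp, width = 600, height = 400)
    plot(cars)                                   # any R graphic could be generated here
    dev.off()
    sendBin(readBin(tmp, "raw", n = file.info(tmp)$size))  # stream the bytes back
    unlink(tmp)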
Naturally, this is just scratching the surface of what rapache can do. An alternative approach to printing HTML directly is to use a templating system, similar to PHP. This is available via the R package brew (also developed by Jeffrey Horner), downloadable from CRAN and at: http://www.rforge.net/brew/
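A minimal brew sketch (the template file name is hypothetical; the <%= ... %> delimiters embed R expressions, PHP-style):

    ## report.html might contain:
    ##   <html><body>
    ##   <p>Mean of 100 random normals: <%= mean(rnorm(100)) %></p>
    ##   </body></html>
    library(brew)
    brew("report.html")   # evaluates the template, writing the result to stdout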
The ggplot2 and lattice books are both published by Springer (ggplot2 as of July 2009) and available via Amazon.
Example code and figures from the ggplot2 book: http://had.co.nz/ggplot2
Example code and figures from the lattice book: http://lmdvr.r-forge.r-project.org/